Shrinking 'ridiculous' data sets to manageable size

May 14, 2009 By Bill Steele

Two decades ago a renowned statistician described a computer data set of 1 billion bytes as "huge" and 10 trillion bytes as "ridiculous."

Today, thanks to the use of computers to collect and generate data, such ridiculously large data sets are common, from genome databases to search engine logs to Wal-Mart sales data. But the ability to monitor and process the data has not kept up with the ability to create it.

With a new three-year, $551,508 Young Investigator Award from the U.S. Office of Naval Research (ONR), Ping Li, Cornell assistant professor of statistical science, is taking a new mathematical approach. His goal: to "shrink" massive data sets into manageable approximations that can be processed in a reasonable length of time to detect such anomalies as denial-of-service attacks on the Internet or to enable computers to learn from experience for such applications as natural language processing, Web searching and computer vision.

"Instead of storing the whole data, we compute and store a sketch of the data, which is small enough to fit in the memory and still contains enough information to recover crucial relationships of the data," Li explained.
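To make the idea of a sketch concrete, here is a minimal illustration using a generic random projection. It is a stand-in chosen for simplicity, not necessarily the method Li is developing. The function names, the vector length of 1,000 and the sketch size of 200 are all assumptions made for this example. Each long vector is compressed to 200 numbers, and the distance between two vectors is then estimated from the sketches alone.

```python
import math
import random

def make_projection(dim, sketch_size, seed=0):
    """Return a sketch_size x dim matrix of independent N(0, 1) entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(sketch_size)]

def sketch(vector, projection):
    """Compress a long vector into len(projection) numbers."""
    scale = 1.0 / math.sqrt(len(projection))
    return [scale * sum(r * x for r, x in zip(row, vector)) for row in projection]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Illustrative sizes only (assumptions, not figures from the article).
DIM, SKETCH_SIZE = 1000, 200
projection = make_projection(DIM, SKETCH_SIZE, seed=42)

# Two made-up data vectors standing in for rows of a massive data set.
data_rng = random.Random(7)
u = [data_rng.random() for _ in range(DIM)]
v = [data_rng.random() for _ in range(DIM)]

exact = squared_distance(u, v)
approx = squared_distance(sketch(u, projection), sketch(v, projection))
print(f"exact={exact:.1f}  approx={approx:.1f}  "
      f"using {SKETCH_SIZE} numbers per vector instead of {DIM}")
```

The estimate is unbiased, and its typical relative error shrinks roughly as the square root of 2 divided by the sketch size, so a larger sketch trades memory for accuracy.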

Li says the resulting sketch makes it possible, for example, to compute a quantity known as the Shannon entropy, which is, roughly, a measure of the degree of uncertainty in a body of information. A change in this quantity could warn engineers of an anomaly such as a network failure, a large transfer of money or perhaps terrorist chatter. Li also plans to develop and publicly distribute software that can be used as part of machine-learning applications on massive and high-dimensional data sets.
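As a rough illustration of why entropy is a useful alarm signal, the sketch below computes the Shannon entropy exactly from full counts, H = -sum(p * log2(p)), rather than estimating it from a compressed sketch as Li proposes. The host names, traffic volumes and the one-bit alarm threshold are invented for the example. When one destination is flooded with requests, the traffic distribution becomes lopsided and the entropy falls.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_alarm(reference, observed, threshold_bits=1.0):
    """Flag a shift in entropy larger than threshold_bits (a hypothetical cutoff)."""
    return abs(reference - observed) > threshold_bits

# Made-up traffic: 1,600 requests spread evenly over 16 destinations.
normal_traffic = [f"host{i % 16}" for i in range(1600)]

# The same traffic plus a flood of 6,400 requests aimed at a single host.
flooded_traffic = normal_traffic + ["host0"] * 6400

baseline = shannon_entropy(normal_traffic)
flooded = shannon_entropy(flooded_traffic)

print(f"baseline entropy: {baseline:.2f} bits")
print(f"flooded entropy:  {flooded:.2f} bits")
print("anomaly flagged:", entropy_alarm(baseline, flooded))
```

Evenly spread traffic over 16 destinations has the maximum possible entropy of 4 bits, so any concentration of requests on one host shows up as a drop from that baseline.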

The ONR Young Investigator Program identifies and supports academic scientists and engineers who have received a doctorate or equivalent degree within the past five years and who show exceptional promise for doing cutting-edge research.

Provided by Cornell University

