Shrinking 'ridiculous' data sets to manageable size

May 14, 2009 By Bill Steele

Two decades ago a renowned statistician described a computer data set of 1 billion bytes as "huge" and 10 trillion bytes as "ridiculous."

Today, thanks to the use of computers to collect and generate data, such ridiculously large data sets are common, from genome databases to search engine logs to Wal-Mart sales data. But the ability to monitor and process the data has not kept up with the ability to create it.

With a new three-year, $551,508 Young Investigator Award from the U.S. Office of Naval Research (ONR), Ping Li, Cornell assistant professor of statistical science, is taking a new mathematical approach. His goal is to "shrink" massive data sets into manageable approximations that can be processed in a reasonable length of time, either to detect anomalies such as denial-of-service attacks on the Internet or to enable computers to learn from experience in applications such as natural language processing, Web searching and computer vision.

"Instead of storing the whole data, we compute and store a sketch of the data, which is small enough to fit in the memory and still contains enough information to recover crucial relationships of the data," Li explained.
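To make the idea of a sketch concrete, here is a minimal illustration in Python of one well-known kind of sketch, the Count-Min sketch. The article does not say which sketching technique Li uses, so this is a stand-in chosen for simplicity, not his method. The class name, the `width` and `depth` parameters and the sample addresses are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters that approximates item frequencies.

    Illustrative only: a generic sketch, not the method described in the article.
    """

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Each item increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the smallest one
        # across rows is the tightest available estimate.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


# Hypothetical stream: one busy address plus 200 addresses seen once each.
sketch = CountMinSketch(width=64, depth=4)
stream = ["10.0.0.1"] * 500 + [f"10.0.1.{i}" for i in range(200)]
for address in stream:
    sketch.add(address)

print(sketch.estimate("10.0.0.1"))  # at least 500; never an underestimate
```

The table stays the same size however long the stream runs, which is the property the quote describes: the summary fits in memory while still answering approximate questions about the full data.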

Li says that from the resulting sketch it is possible, for example, to compute a quantity known as the Shannon entropy, which is, roughly, a measure of the degree of uncertainty in a body of information. A change in this quantity would warn engineers of an anomaly such as a network failure, a large transfer of money or perhaps terrorist chatter. Li also plans to develop and publicly distribute software that can be used as part of machine-learning applications on massive and high-dimensional data sets.
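For readers who want to see the quantity itself, the Shannon entropy of a collection is H = -sum(p_i * log2(p_i)), where p_i is the fraction of items that fall in category i. The short Python sketch below computes it exactly from full counts rather than from a compressed sketch, so it shows only what is being measured, not how Li approximates it. The function name and the sample traffic are assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Return the Shannon entropy, in bits, of the categories in `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    # Sum -p * log2(p) over every category that actually occurs.
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


# Hypothetical traffic: requests spread evenly over eight sources.
normal = [f"host{i}" for i in range(8)] * 100

# The same traffic with one source suddenly sending 5,000 extra requests.
attack = normal + ["host0"] * 5000

print(shannon_entropy(normal))  # 3.0 bits: eight equally likely sources
print(shannon_entropy(attack) < shannon_entropy(normal))  # True
```

When one source dominates, the traffic becomes more predictable and the entropy falls, which is the kind of shift that could flag a denial-of-service attack.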

The ONR Young Investigator Program identifies and supports academic scientists and engineers who have received a doctorate or equivalent degree within the past five years and who show exceptional promise for doing cutting-edge research.

Provided by Cornell University

