Researchers 'text mine' The New York Times, demonstrating ease of new technology
July 26, 2006
Performing what a team of dedicated and bleary-eyed newspaper librarians would need months to do, scientists at UC Irvine have used an up-and-coming technology to complete in hours a complex topic analysis of 330,000 stories published primarily by The New York Times.
The demonstration is significant because it is one of the earliest showing that an extremely efficient, yet very complicated, technology called text mining is on the brink of becoming a tool useful to more than highly trained computer programmers and homeland security experts.
"We have shown in a very practical way how a new text mining technique makes understanding huge volumes of text quicker and easier," said David Newman, a computer scientist in the Donald Bren School of Information and Computer Sciences at UCI. "To put it simply, text mining has made an evolutionary jump. In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers; publishers to politicians."
Text mining allows a computer to extract useful information from unstructured text. Until recently, text mining required a great deal of preparation before documents could be analyzed in a meaningful way. A new text-mining technique called "topic modeling" -- which UCI scientists used in their New York Times experiment -- looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics -- all with minimal human effort.
UCI researchers didn't invent topic modeling, but they developed a technique that allows the technology to be used on huge document collections. They also are among the first to demonstrate its ease and effectiveness by applying it to a newspaper archive. The results reveal few surprises, but the application demonstrates the ability of topic modeling to spot trends and make connections in a way that could be applied to more complicated and cumbersome documents such as those used by medical researchers and lawyers.
Newman and UCI researchers Padhraic Smyth, Mark Steyvers and Chaitanya Chemudugunta presented their research at the recent Intelligence and Security Informatics conference in San Diego.
The topic model, applied to the collection of news articles published from 2000 to 2002, identified patterns of words that occurred together in the stories. From those words, researchers were able to identify topics. Information associated with those topics was charted over time, allowing the scientists to pinpoint which months of the year certain topics were most in the news and how much ink they received from year to year.
For example, the model generated a list of words that included "rider," "bike," "race," "Lance Armstrong" and "Jan Ullrich." From this, researchers were easily able to identify that topic as the Tour de France. By examining the probability of words appearing in stories about the Tour de France, researchers learned that Armstrong was written about seven times as much as Ullrich. Charting information over time, researchers discovered that discussion of the Tour de France peaked in the summer months but decreased slightly from year to year.
"If I were interested in advertising a product related to the Tour de France, I might want to know whether interest in the Tour de France is increasing or decreasing," Newman said. "This might be very important knowledge."
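The two measurements in the Tour de France example, comparing word probabilities within a topic and charting a topic's share of coverage by month, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the counts are invented toy data rather than the researchers' figures, and it presumes each word has already been assigned to a topic by the model.

```python
from collections import Counter


def word_probabilities(topic_tokens):
    """Estimate P(word | topic) from the tokens assigned to one topic."""
    counts = Counter(topic_tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def probability_ratio(probs, word_a, word_b):
    """How many times more probable word_a is than word_b within the topic."""
    return probs[word_a] / probs[word_b]


def monthly_topic_share(topic_tokens_per_month, all_tokens_per_month):
    """Fraction of each month's tokens that were assigned to the topic."""
    return {
        month: topic_tokens_per_month.get(month, 0) / total
        for month, total in all_tokens_per_month.items()
        if total > 0
    }


# Invented toy data (assumption): tokens the model assigned to one topic.
tour_tokens = ["armstrong"] * 7 + ["ullrich"] + ["race"] * 4 + ["bike"] * 3
probs = word_probabilities(tour_tokens)
ratio = probability_ratio(probs, "armstrong", "ullrich")

# Invented toy data (assumption): topic tokens and total tokens per month.
topic_counts = {"2001-06": 20, "2001-07": 90, "2001-08": 15}
all_counts = {"2001-06": 1000, "2001-07": 1000, "2001-08": 1000}
share = monthly_topic_share(topic_counts, all_counts)
peak_month = max(share, key=share.get)

print(f"armstrong vs. ullrich ratio: {ratio:.1f}")
print(f"peak month: {peak_month}")
```

Comparing the monthly shares across years in the same way would show whether coverage of a topic is rising or falling, which is the kind of trend Newman describes.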
Including the Tour de France, the model automatically identified a total of 400 topics ranging from renting apartments in Brooklyn and diving in Hawaii to voting irregularities and dinosaur bones. As for newsmakers, topics included Tiger Woods, Elian Gonzalez, Denzel Washington and Barbie.
"Text mining is an incredible tool," Newman said. "It already allows a doctor to identify the common thread in old and new medical research. With topic modeling, connections can be drawn faster and more efficiently in large volumes of text."
About topic modeling: UCI researchers performed their experiment using a statistical topic model based on a text model developed at UC Berkeley in 2003. Thanks to an improved solution technique proposed by Mark Steyvers and a research partner, this model has advanced from academic use to something that is now widely used in the research community. Topic modeling looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics. Older text-mining techniques require the user to come up with an appropriate set of topic categories and manually find hundreds to thousands of example documents for each category. This human-intensive process is called supervised learning. In contrast, topic modeling, a type of unsupervised learning, doesn't need suggestions for an appropriate set of topic categories or human-found example documents. This makes retrieving information easier and quicker.
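To make the unsupervised idea concrete, here is a minimal sketch of a topic model of this general kind, fitted with collapsed Gibbs sampling. It rests on assumptions: the source does not name the exact model or solution technique, so the choice of a latent-Dirichlet-style model, the function name `fit_topics`, the parameter values and the four toy documents are all illustrative and are not the researchers' implementation. Note that no topic labels or example documents are supplied; only the number of topics is chosen in advance.

```python
import random


def fit_topics(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for an LDA-style topic model."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                           # tokens per topic
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Weight each topic by how well it fits the document and the word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Report the most frequent words in each topic.
    top_words = []
    for t in range(num_topics):
        ranked = sorted(range(vocab_size), key=lambda j: -topic_word[t][j])
        top_words.append([vocab[j] for j in ranked[:4]])
    return top_words, doc_topic


# Invented toy documents (assumption), already split into words.
docs = [
    "rider bike race rider stage".split(),
    "bike race rider mountain stage".split(),
    "vote ballot election recount vote".split(),
    "election ballot vote county recount".split(),
]
topics, doc_topic = fit_topics(docs, num_topics=2)
for number, words in enumerate(topics):
    print(f"topic {number}: {', '.join(words)}")
```

A person would still read each word list and give the topic a name, as the researchers did when they recognized the Tour de France from its words.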
Source: University of California - Irvine