Scientists devise means to test for phony technical papers

April 24, 2006

Authors of bogus technical articles beware. A team of researchers at the Indiana University School of Informatics has designed a tool that distinguishes between real and fake papers. It's called the Inauthentic Paper Detector -- one of the first of its kind anywhere -- and it uses compression to determine whether technical texts are generated by man or machine.

"This is a potential problem since no existing systems, the Web for example, can or do discriminate between content that is meaningful or bogus," says assistant professor Mehmet Dalkilic, a data mining expert. "We believe that there are subtle, short- and long-range word or even word string repetitions that exist in human texts, but not in many classes of computer-generated texts that can be used to discriminate based on meaning."

Joining Dalkilic on the IPD project are Assistant Professor Predrag Radivojac, informatics doctoral student James Costello, and Wyatt T. Clark, who will graduate in May with a bachelor's degree in informatics.

The IPD system is based on a combination of compression algorithms, the same kind of algorithms used to shrink data so that it takes up less space and can be transmitted faster.
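As a rough illustration of what a compression-based measurement looks like, the sketch below computes how much a text shrinks under three general-purpose compressors from the Python standard library. The article does not say which algorithms the IPD combines, so the choice of zlib, bz2, and lzma and the function name `compression_ratios` are assumptions made for this example, not the team's method.

```python
import bz2
import lzma
import zlib


def compression_ratios(text):
    """Return compressed size / original size for three stdlib compressors.

    A lower ratio means the text contains more repetition that the
    compressor was able to exploit. Hypothetical helper, not the IPD.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return {
        "zlib": len(zlib.compress(raw, 9)) / len(raw),
        "bz2": len(bz2.compress(raw, 9)) / len(raw),
        "lzma": len(lzma.compress(raw)) / len(raw),
    }


if __name__ == "__main__":
    # A deliberately repetitive toy string; real input would be a full paper.
    sample = "the cat sat on the mat. " * 200
    for name, ratio in compression_ratios(sample).items():
        print(f"{name}: {ratio:.3f}")
```

Ratios like these could serve as numeric features describing a document, which is the general idea behind using compression to characterize text.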

To begin their study, the team defined the two kinds of texts they would analyze. An "authentic text" (or document) is a collection of several hundred or several thousand syntactically correct sentences that are wholly meaningful. An "inauthentic text" (or document) is a collection of several hundred or several thousand syntactically correct sentences that, taken all together, have no meaning.

The researchers' work is documented in the very authentic paper, "Using Compression to Identify Classes of Inauthentic Texts," which they presented at the Society for Industrial and Applied Mathematics Conference on Data Mining in Bethesda, Md., this weekend.

The informatics study was largely inspired by a prank pulled by three Massachusetts Institute of Technology students, who in 2004 developed a computer program that churned out randomly generated fake computer science language, essentially a four-page compilation of gibberish. They submitted it as a research paper to an international conference on computer science and informatics – and it was accepted without review.

Radivojac, whose research expertise is machine learning, says the IPD easily detected numerous inauthentic technical papers in testing, including the MIT students' spurious submission.

"We hypothesized we could build a reliable and fast model that recognizes fake papers automatically," says Radivojac. "We combined these with machine-learning methods to build a predictor of these kinds of papers."
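To make the idea of pairing compression with machine learning concrete, here is a minimal sketch that turns each text into two compression ratios and classifies it by the nearest class centroid. Everything in it is an assumption for illustration: the article does not name the learning method the team used, the nearest-centroid rule is simply one of the simplest options, and the training texts are synthetic stand-ins (repeated phrases versus random letters) labeled "repetitive" and "random" rather than real authentic and inauthentic papers.

```python
import bz2
import random
import zlib


def features(text):
    """Map a text to (zlib ratio, bz2 ratio); hypothetical feature set."""
    raw = text.encode("utf-8")
    return (
        len(zlib.compress(raw, 9)) / len(raw),
        len(bz2.compress(raw, 9)) / len(raw),
    )


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def train(labeled_texts):
    """Build {label: centroid} from an iterable of (text, label) pairs."""
    by_label = {}
    for text, label in labeled_texts:
        by_label.setdefault(label, []).append(features(text))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(model, text):
    """Return the label whose centroid is closest (squared Euclidean)."""
    f = features(text)
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(f, model[label])),
    )


def toy_text(rng, repetitive):
    """Generate a synthetic text; NOT a real paper, just a stand-in."""
    if repetitive:
        words = ["data", "model", "signal", "graph", "kernel", "cache"]
        phrase = " ".join(rng.choice(words) for _ in range(6))
        return (phrase + ". ") * 60
    letters = "abcdefghijklmnopqrstuvwxyz "
    return "".join(rng.choice(letters) for _ in range(2000))


# Train on ten synthetic texts, five per class, with a fixed seed.
rng = random.Random(0)
training = [(toy_text(rng, True), "repetitive") for _ in range(5)]
training += [(toy_text(rng, False), "random") for _ in range(5)]
model = train(training)
```

With a real labeled collection of papers in place of `toy_text`, the same `train` and `predict` calls would apply unchanged, though how well such a simple rule separates real classes of documents is an open question this sketch does not answer.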

In general, identifying meaning in a technical document is difficult, Dalkilic says. "We don't claim we have found a way to distinguish between meaning and nonsense, but we do emphasize that there are many nontrivial classes of inauthentic documents that can be easily distinguished based on compression algorithms."

Source: Indiana University School of Informatics

