Computer users are digitizing books quickly and accurately with Carnegie Mellon method

August 14th, 2008

Millions of computer users collectively transcribe the equivalent of 160 books each day with better than 99 percent accuracy, despite the fact that few spend more than a few seconds on the task and that most do not realize they are doing valuable work, Carnegie Mellon University researchers reported today in Science Express.

They can work so prodigiously because Carnegie Mellon computer scientists led by Luis von Ahn have taken a widely used Web site security measure, called a CAPTCHA, and given it a second purpose — digitizing books produced prior to the computer age. When Web visitors solve one of the distorted-letter puzzles so they can register for email or post a comment on a blog, they simultaneously help turn the printed word into machine-readable text.

More than a year after implementing their version, called reCAPTCHA (http://recaptcha.net/), on thousands of Web sites worldwide, the researchers conclude that their word-deciphering process achieves the industry standard for human transcription services: better than 99 percent accuracy. Their report, published online today, will appear in an upcoming issue of the journal Science.

Furthermore, the amount of work that can be accomplished is herculean. More than 100 million CAPTCHAs are solved every day and, though each puzzle takes only a few seconds to solve, the aggregate amount of time translates into hundreds of thousands of hours of human effort that can potentially be tapped. In the reCAPTCHA system's first year of operation, more than 1.2 billion reCAPTCHAs were solved and more than 440 million words were deciphered. That's the equivalent of manually transcribing more than 17,600 books.

"More Web sites are adopting reCAPTCHAs each day, so the rate of transcription keeps growing," said von Ahn, an assistant professor in the School of Computer Science's Computer Science Department. "More than 4 million words are being transcribed every day. It would take more than 1,500 people working 40 hours a week at a rate of 60 words a minute to match our weekly output."
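The figures quoted so far can be cross-checked with a little arithmetic. The short Python sketch below uses only numbers from the article; the book length it derives (words per book) is an inference from those numbers, not a figure the researchers state, and the function names are made up for illustration.

```python
# Figures taken from the article.
WORDS_DECIPHERED_FIRST_YEAR = 440_000_000  # "more than 440 million words"
BOOKS_EQUIVALENT_FIRST_YEAR = 17_600       # "more than 17,600 books"
BOOKS_PER_DAY = 160                        # "the equivalent of 160 books each day"


def implied_words_per_book(words: int, books: int) -> float:
    """Average book length implied by the article's own totals (an inference)."""
    return words / books


def implied_words_per_day(books_per_day: int, words_per_book: float) -> float:
    """Daily word count implied by the books-per-day figure."""
    return books_per_day * words_per_book


words_per_book = implied_words_per_book(
    WORDS_DECIPHERED_FIRST_YEAR, BOOKS_EQUIVALENT_FIRST_YEAR
)
daily_words = implied_words_per_day(BOOKS_PER_DAY, words_per_book)

print(f"Implied words per book: {words_per_book:,.0f}")
print(f"Implied words per day:  {daily_words:,.0f}")
```

Under these assumptions the implied book length comes to 25,000 words, and 160 such books a day comes to 4 million words, in line with von Ahn's statement that more than 4 million words are being transcribed every day.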

Von Ahn said reCAPTCHAs are being used to digitize books for the Internet Archive and to digitize newspapers for The New York Times. Digitization allows older works to be indexed, searched, reformatted and stored in the same way as today's online texts.

Old texts are typically digitized by photographically scanning pages and then transforming the text using optical character recognition (OCR) software. But when ink has faded and paper has yellowed, OCR can fail to recognize as many as one out of every five words, according to the Carnegie Mellon team's tests. Without reCAPTCHA, these words must be deciphered manually at great expense.

Conventional CAPTCHAs, which were developed at Carnegie Mellon, involve letters and numbers whose shapes have been distorted or backgrounds altered so that computers can't recognize them, but humans can. To create reCAPTCHAs, the researchers use images of words from old texts that OCR systems have had trouble reading.

Helping to make old books and newspapers more accessible to a computerized world is something that the researchers find rewarding, but is only part of a larger goal. "We are demonstrating that we can take human effort — human processing power — that would otherwise be wasted and redirect it to accomplish tasks that computers cannot yet solve," von Ahn said.

For instance, he and his students have developed online games, available at http://www.gwap.com, that analyze photos and audio recordings, tasks beyond the capability of computers. Similarly, University of Washington biologists recently built Fold It (http://fold.it/), a game in which people compete to determine the ideal structure of a given protein.

In addition to von Ahn, authors of the new report include computer science undergraduate Benjamin Maurer, graduate students Colin McMillen and David Abraham, and Manuel Blum, professor of computer science.

Source: Carnegie Mellon University



  • jburchel - Aug 14, 2008
    • Rank: 4.5 / 5 (2)
    That is very clever.
  • HarryStottle - Aug 15, 2008
    • Rank: 4.3 / 5 (3)
    Just one thing confuses me and doesn't appear to be addressed by the above.

    Given that the point of presenting the "reCaptcha" is that we don't know its content, how, when the human is filling in the web form, do we know that they've entered the correct data? If we don't, then it isn't much of a validation process (how would they differentiate between humans and 'bots?) and if we do, then that implies someone else has already spent their time decoding the knackered text so that we can do a fair comparison.

    Anyone fill in the gaps?
  • MVV - Aug 15, 2008
    • Rank: 5 / 5 (4)
    From the reCAPTCHA website:
    "But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct."
  • googleplex - Aug 15, 2008
    • Rank: not rated yet
    Truly brilliant. An elegantly simple idea to use an otherwise wasted resource. What is astonishing is that the two uses (anti-bot and human OCR) are completely unrelated.
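
The pairing scheme quoted in MVV's comment above can be sketched in a few lines of Python. This is only an illustration of the idea, not reCAPTCHA's actual implementation: the class name `PairedWordVerifier`, the minimum number of votes and the agreement ratio are all assumptions made for the example, since the quoted text says only that the new image is shown to "a number of other people".

```python
from collections import Counter, defaultdict
from typing import Optional


def normalize(text: str) -> str:
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


class PairedWordVerifier:
    """Hypothetical sketch: a known control word vouches for an unknown word."""

    def __init__(self, min_votes: int = 3, min_agreement: float = 0.75):
        # Both thresholds are assumed values chosen for illustration.
        self.min_votes = min_votes
        self.min_agreement = min_agreement
        self.votes = defaultdict(Counter)  # unknown word id -> answer tallies

    def submit(self, control_answer: str, control_truth: str,
               unknown_id: str, unknown_answer: str) -> bool:
        """Return True if the user passes the CAPTCHA.

        Only the control word decides pass or fail. The answer for the
        unknown word is recorded as a vote when the control word is right.
        """
        passed = normalize(control_answer) == normalize(control_truth)
        if passed:
            self.votes[unknown_id][normalize(unknown_answer)] += 1
        return passed

    def resolved(self, unknown_id: str) -> Optional[str]:
        """Return the agreed transcription, or None if confidence is too low."""
        tally = self.votes.get(unknown_id, Counter())
        total = sum(tally.values())
        if total < self.min_votes:
            return None
        word, count = tally.most_common(1)[0]
        return word if count / total >= self.min_agreement else None
```

In this sketch a user who mistypes the control word is rejected and their guess for the unknown word is discarded, which addresses HarryStottle's question: the system never needs to know the unknown word in advance to tell humans from bots.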
