
reCaptcha: How to turn blather into books

Ten seconds of work has digitized libraries, whether the amateur translators know it or not.

(Page 2 of 2)



Mr. Juszel presides over 23 scanning machines, each shrouded in a black tent to keep out light. Working in two shifts from 8:30 a.m. to 11:30 p.m., human operators manually turn pages as two cameras click every seven seconds.


The group copies up to 2,000 books a week, targeting volumes with expired copyright.

The scanned texts are sent to a server in California, where they’re run through optical character recognition software.
But computer programs are only 80 percent accurate on older books. They stumble over blurry lines, places where the ink has bled together over time, and less uniform fonts.

Carnegie Mellon computers send the indecipherable words to more than 100,000 websites that use them in the reCaptcha security checks. Any website or blogger can sign up for the free service.

The Internet user sees two distorted words. One is a control word that the computer already knows. The other is a word that computers failed to read.

Once multiple people have identified that word the same way, it’s accepted as correct. The system’s accuracy rate of 99.1 percent is about the same as that of professional human transcribers.
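The two-word check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual reCaptcha code: the class name `Challenge`, the three-vote threshold, and the case-insensitive matching are all assumptions, since the article says only that "multiple people" must agree. It also simplifies by pairing the unread word with a single control word.

```python
from collections import Counter
from typing import Optional

# Assumption: the article says only "multiple people"; 3 is an illustrative value.
AGREEMENT_THRESHOLD = 3


class Challenge:
    """A known control word paired with a word the OCR software could not read."""

    def __init__(self, control_word: str) -> None:
        self.control_word = control_word
        self.votes: Counter = Counter()       # tallies guesses for the unread word
        self.accepted: Optional[str] = None   # set once enough people agree

    def submit(self, control_guess: str, unknown_guess: str) -> bool:
        """Return True if the user typed the control word correctly."""
        if control_guess.strip().lower() != self.control_word.lower():
            return False  # failed the check, so the other guess is not counted
        guess = unknown_guess.strip().lower()
        if self.accepted is None and guess:
            self.votes[guess] += 1
            if self.votes[guess] >= AGREEMENT_THRESHOLD:
                self.accepted = guess
        return True


challenge = Challenge(control_word="morning")
challenge.submit("morning", "overlooks")
challenge.submit("mornnig", "overbooks")  # control word wrong: guess discarded
challenge.submit("morning", "overlooks")
challenge.submit("morning", "overlooks")
print(challenge.accepted)  # prints "overlooks"
```

The control word is what makes the vote trustworthy: a user who gets the known word right is probably a human reading carefully, so that user's answer for the unread word is worth counting.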

Web users now provide about 3,000 man-hours a day of free labor in 10-second bursts of human computation, correcting more than 10 million words every day. ReCaptchas have solved 5 billion words in less than two years. Most people aren’t even aware that their brain power is being harnessed, although every reCaptcha includes a button that users can click to explain the program.

The Internet Archive now hosts 1.2 million books. The online library includes 100-year-old barber manuals, 19th-century Henry James novels, and Beatrix Potter’s “The Tale of Benjamin Bunny.” Users have downloaded the most popular tome, “Amusements in Mathematics” from 1917, more than 2.5 million times.

“We get scholars saying, ‘I don’t have to travel 50 miles to the local rare-books library to sit with a book for an hour with white gloves on. Now I can just sit in the comfort of my own home with a digital copy,’ ” says Juszel.

If a certain text isn’t online, readers around the world can request the specific volume for ten cents a page and find it online by the next day. “They’re speechless,” says Juszel.

The reCaptcha program is also helping to digitize The New York Times newspaper. It’s about halfway through archiving every edition printed from 1851 to 1980, when the paper went digital.

“The New York Times will have been transcribed word by word by people around the world in less than a year,” says von Ahn.

“The total number of people who have helped to do this is about 400 million,” he adds. “In other words, about 6 percent of the world’s population has helped digitize the New York Times. They’re not really wasting their time typing reCaptchas.”

For more on von Ahn's work, check out the sidebar, "A better world through games."
