reCaptcha: How to turn blather into books
Ten seconds of work has digitized libraries, whether the amateur translators know it or not.
Toronto — When you buy a concert ticket on Ticketmaster, post something for sale on Craigslist, or poke an old friend on Facebook, you may not know it, but you’re helping to put millions of books online in a vast free library.
To access these websites, you must decipher two squiggly words to prove that you’re not a computer program designed to spam the site. Once it knows you’re human, the website lets you continue.
Those two decoded words don’t disappear, however. In fact, your brain has deciphered words that had baffled the scanning software used for an enormous project to digitize every public domain book in the world.
“We can coordinate literally millions of people on the Internet to work together to do something that computers cannot do,” says Luis von Ahn, an assistant professor of computer science at Carnegie Mellon University in Pittsburgh.
Mr. von Ahn helped develop the first version of these security puzzles in 2000, stringing together random combinations of words and numbers, then distorting the text to make it impossible for automated spammers to decode.
Some 200 million of these words, dubbed “Captchas” for Completely Automated Public Turing test to tell Computers and Humans Apart, are typed every day by people around the world.
“At first, it made me feel good to look at the impact my research has had,” says von Ahn, who grew up in Guatemala.
Then he did the math: “It takes about 10 seconds to type each Captcha. I realized that humanity as a whole is wasting 500,000 hours every day typing Captchas.”
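Von Ahn's estimate can be checked with a quick back-of-the-envelope calculation. The sketch below uses only the two figures quoted above (200 million Captchas a day, about 10 seconds each); the variable and function names are illustrative, not part of any reCaptcha software.

```python
# Figures quoted in the article; the names are illustrative.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10


def hours_spent_per_day(captchas_per_day, seconds_each):
    """Total human time spent typing Captchas per day, in hours."""
    return captchas_per_day * seconds_each / 3600


hours = hours_spent_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} hours per day")
```

The result is roughly 556,000 hours a day, in line with the rounded figure of 500,000 that von Ahn cites.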
When von Ahn compared that to the 7 million hours it took to build the Empire State Building or the 20 million hours spent constructing the Panama Canal, he wondered, “Is there a way we can make good use of this time?”
In 2007, he came up with reCaptchas. Now, instead of frittering away their time typing random characters, Internet users spell actual words plucked from old books that computers have trouble reading.
The Open Content Alliance, a nonprofit group based in San Francisco, has enlisted about 150 libraries and research centers to digitize as many printed works as it legally can and post them online for anyone in the world to read.
“Everything on the Internet Archive [archive.org] is free to use and free to download,” says Gabe Juszel, coordinator for the project’s largest scanning center, which occupies a dim office at the University of Toronto. “We want to make sure a person in China has the same resources as a grad student here at U of T. After all, there are more Internet cafes than there are libraries in the world.”
Mr. Juszel presides over 23 scanning machines, each shrouded in a black tent to keep out light. Working in two shifts from 8:30 a.m. to 11:30 p.m., human operators manually turn pages as two cameras click every seven seconds.
The group copies up to 2,000 books a week, targeting volumes with expired copyright.
The scanned texts are sent to a server in California, where they’re run through optical character recognition software.
But computer programs are only 80 percent accurate on older books. They stumble over blurry lines, places where the ink has bled together over time, and less uniform fonts.
Carnegie Mellon computers send the indecipherable words to more than 100,000 websites that use them in the reCaptcha security checks. Any website or blogger can sign up for the free service.
The Internet user sees two distorted words. One is a control word that the computer already knows. The other is a word that computers failed to read.
Once that word has been identified by multiple people, it’s accepted as correct. The system’s accuracy rate of 99.1 percent is about the same as that of professional human transcribers.
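The two-word check described above can be sketched in a few lines of code. This is a minimal illustration, not the actual reCaptcha implementation: the class name, the word identifiers, the case-insensitive matching, and the threshold of three matching answers are all assumptions, since the article says only that "multiple people" must agree.

```python
from collections import Counter, defaultdict

AGREEMENT_THRESHOLD = 3  # assumption: the article says only "multiple people"


def normalize(text):
    """Compare answers ignoring case and surrounding spaces (an assumption)."""
    return text.strip().lower()


class WordPool:
    """Hypothetical store of guesses for words the scanner could not read."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.votes = defaultdict(Counter)  # unknown word id -> guess -> count
        self.accepted = {}                 # unknown word id -> agreed spelling

    def submit(self, control_answer, control_guess, unknown_id, unknown_guess):
        """Return True if the user passes the human check."""
        if normalize(control_guess) != normalize(control_answer):
            return False  # control word failed, so the other guess is ignored
        if unknown_id not in self.accepted:
            guess = normalize(unknown_guess)
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.threshold:
                self.accepted[unknown_id] = guess
        return True


# Four users all type the control word "upon" correctly; three of them
# agree on the spelling of the unreadable word.
pool = WordPool()
for guess in ["morning", "Morning", "moming", "morning "]:
    pool.submit("upon", "upon", "scan-17", guess)
print(pool.accepted)
```

Only users who type the known control word correctly get a vote on the unknown word, which is what lets the system trust answers it has no way of checking directly.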
Web users now provide about 3,000 man-hours a day of free labor in 10-second bursts of human computation, correcting more than 10 million words every day. ReCaptchas have solved 5 billion words in less than two years. Most people aren’t even aware that their brain power is being harnessed, although every reCaptcha includes a button that users can click to explain the program.
The Internet Archive now hosts 1.2 million books. The online library includes 100-year-old barber manuals, 19th-century Henry James novels, and Beatrix Potter’s “The Tale of Benjamin Bunny.” Users have downloaded the most popular tome, “Amusements in Mathematics” from 1917, more than 2.5 million times.
“We get scholars saying, ‘I don’t have to travel 50 miles to the local rare-books library to sit with a book for an hour with white gloves on. Now I can just sit in the comfort of my own home with a digital copy,’ ” says Juszel.
If a certain text isn’t online, readers around the world can request the specific volume for ten cents a page and find it online by the next day. “They’re speechless,” says Juszel.
The reCaptcha program is also helping to digitize The New York Times newspaper. It’s about halfway through archiving every edition printed from 1851 to 1980, when the paper went digital.
“The New York Times will have been transcribed word by word by people around the world in less than a year,” says von Ahn.
“The total number of people who have helped to do this is about 400 million,” he adds. “In other words, about 6 percent of the world’s population has helped digitize the New York Times. They’re not really wasting their time typing reCaptchas.”
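The 6 percent figure depends on the world population at the time, which the article does not state. The sketch below assumes roughly 6.7 billion people, an approximate figure for the late 2000s; that number and the variable names are assumptions added for illustration.

```python
HELPERS = 400_000_000             # figure quoted by von Ahn
WORLD_POPULATION = 6_700_000_000  # assumption: approximate late-2000s figure

# Share of the world's population that has typed at least one reCaptcha.
share = HELPERS / WORLD_POPULATION * 100
print(f"{share:.1f} percent of the world's population")
```

Under that assumption the share comes out at about 6 percent, consistent with the quote.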
For more on von Ahn's work, check out the sidebar, "A better world through games."