Should a single company be left in charge of putting all of the world’s books online?
An impressive list of world-class libraries and book publishers doesn’t seem to mind. In 2004, they signed on as partners with Google, the Internet search and advertising colossus based in Mountain View, Calif.
Yet some observers have strong concerns about Google Book Search and how the collected thinking of human history will be accessed in the future.
Those anxieties rose late last month when Microsoft announced that it was withdrawing from a rival book-scanning project headed by the nonprofit Internet Archive (archive.org).
About 750,000 books and 80 million journal articles scanned by Microsoft were removed from its servers, but many remain accessible elsewhere, including on servers maintained by the Internet Archive, which has about 440,000 books online.
Microsoft, which said it still intends to give publishers digital copies of their scanned books, may have made a rational business decision from its perspective. But the sudden shift also showed how vulnerable a digitizing project is when it relies on a for-profit company, says Brewster Kahle, executive director of the Internet Archive. Nothing would stop Google from also suddenly shutting down its online book effort or limiting access to it, he says. If money gets tight, “there’s a meeting behind closed doors, and there’s a notice put on the website that it’s shut down,” he says. “That’s what happens.”
Internet access to books is becoming more important, some observers say, as portable book readers, such as Amazon’s Kindle, become more common and as more people expect to find all their reading needs online.
“I wouldn’t say Google is 100 percent of the digital book world, but it’s getting near 90 percent,” says Siva Vaidhyanathan, a cultural historian and media scholar at the University of Virginia, who writes a blog called “The Googlization of Everything.”
The Internet Archive has funds to scan 1,000 books per day through the end of the year, Mr. Kahle says, including those at the Library of Congress. He’s exploring new partnerships that would allow the project to continue into 2009 and beyond.
“It’s not the end,” he says, but he concedes that now would be a great time for the next Andrew Carnegie – the 19th-century industrialist turned library-building philanthropist – to step forward and leave his or her own legacy by financing an open, nonprofit, worldwide digital library. “The best works of humankind are not on the Net yet,” he says.
Google has partnered with more than two dozen libraries, including those at Harvard, Stanford, Oxford, and Princeton universities and the New York Public Library. The company uses what amounts to a VIP library card – taking books on loan, scanning them, and then returning them to the library unharmed, says Jon Orwant, engineering manager of Google Book Search. The digitization costs the libraries nothing.
In a separate deal with book publishers, Google scans new books with a less gentle approach. The spines are chopped off and the pages fed through an optical scanner.
Google won’t say how many books it has scanned so far, but it’s certainly in the millions. The company estimates there may be more than 100 million book titles in the world today.
So far, Google isn’t aggressively trying to make money off its book pages, though a few ads and links to buy hard copies from the publisher do appear. Keeping users inside Google’s online “universe” seems to be the company’s long-term motive.
Books published before 1923 have gone out of copyright and can be freely scanned, downloaded, or printed. For newer books, Google obtains permission from publishers regarding how much of the text it can display. Though only short “snippets” of these books usually can be viewed, the whole text is still searchable, helping readers decide whether a book contains information that is useful to them.
Another controversial aspect of Google’s stewardship involves the quality of the digitization. After books are scanned, a process called optical character recognition (OCR) converts each page into a digital file whose words can be read by a computer, making the text searchable.
Computer programs do a good job with OCR on new titles, but older books with yellowed pages, faded print, or graffiti can prove to be a problem. Google’s final product is “less than 100 percent” accurate, Mr. Orwant concedes.
“Google is doing a very, very poor job.... Their OCR is very inaccurate, the image quality is very poor,” says Lotfi Belkhir, CEO of Kirtas Technologies. The company, in Victor, N.Y., bills itself as the world’s leader in converting books into digital form. “You find cutoff text.... You find dirty text. You find incomplete pages.”
He predicts that much of what Google has digitized so far will need to be rescanned someday to bring it up to acceptable quality.
Mr. Belkhir is contacting libraries that had been working with Microsoft and says they are receptive to letting Kirtas pick up where Microsoft left off.
Google’s Orwant defends his project. “We certainly believe we’re doing the world a very good service,” he says. “We’re digitizing all this content. We’re making it as open as the laws allow.”
Google always gives a digital copy back to its partners, Orwant says. “We’re never the only people with a copy.” And because Google’s contracts with the libraries are nonexclusive, the libraries are free to work with others to scan their collections as well.
But that’s not enough for critics. “I don’t blame the company, but the question is ‘What do we as citizens want out of our information system?’ ” says Mr. Vaidhyanathan at the University of Virginia.
“If we assume that a healthy, diverse, and accessible body of information is essential to science, politics, creativity, literature,” he says, “then we really have to step back and say, ‘Do we really want to put this one company in the position of being the filter for the world’s information?’ ”