They work a few hundred yards from one of the Library of Congress's most prized possessions: a vellum copy of the Bible printed in 1455 by Johann Gutenberg, inventor of movable type. But almost six centuries later, Jane Mandelbaum and Thomas Youkel have a task that would confound Gutenberg.
The researchers are leading a team that is archiving almost every tweet sent out since Twitter began in 2006. A half-billion tweets stream into library computers each day.
Their question: How can they store the tweets so they become a meaningful tool for researchers – a sort of digital transcript providing insights into the daily flow of history?
Thousands of miles away, Arnold Lund has a different task. Mr. Lund manages a lab for General Electric, a company that still displays the desk of its founder, Thomas Edison, at its research headquarters in Niskayuna, N.Y. But even Edison might need training before he'd grasp all the dimensions of one of Lund's projects. Lund's question:
How can power companies harness the power of data to predict which trees will fall on power lines during a storm – thus allowing them to prevent blackouts before they happen?
The work of Richard Rothman, a professor at Johns Hopkins University in Baltimore, is more fundamental: to save lives. The Centers for Disease Control and Prevention (CDC) in Atlanta predicts flu outbreaks, but only after it examines reports from hospitals. That takes weeks. In 2009, a study seemed to suggest researchers could predict outbreaks much faster by analyzing millions of Google searches.
Spikes in queries like "My kid is sick" signaled a flu outbreak before the CDC knew there would be one. That posed a new question for Dr. Rothman and his colleague Andrea Dugas:
Could Google help predict influenza outbreaks in time to allow hospitals like the one at Johns Hopkins to get ready?
They ask different questions. But all five of these researchers form part of the new world of Big Data – a phenomenon that may, for better or worse, revolutionize every facet of life, culture, and, well, even the planet. From curbing urban crime to calculating the effectiveness of a tennis player's backhand, people are now gathering and analyzing vast amounts of data to predict human behaviors, solve problems, identify shopping habits, thwart terrorists – everything but foretell which Hollywood scripts might make blockbusters. Actually, there's a company poring through numbers to do that, too.
Just four years ago, someone wanted to do a Wikipedia entry on Big Data. Wikipedia said no; there was nothing special about the term – it just combined two common words. Today, Big Data seems everywhere, ushering in what advocates consider some of the biggest changes since Euclid.
Want to get elected to public office? Put a bunch of computer geeks in a room and have them comb through databases to glean who might vote for you – then target them with micro-tailored messages, as President Obama famously did in 2012.
Want to solve poverty in Africa? Analyze text messages and social media networks to detect early signs of joblessness, epidemics, and other problems, as the United Nations is trying to do.
Eager to find the right mate? Use algorithms to analyze an infinite number of personality traits to determine who's the best match for you, as many online dating sites now do.
What exactly is Big Data? What makes it new? Different? What's the downside?
Such questions have evoked intense interest, especially since June 5. On that day, former National Security Agency analyst Edward Snowden revealed that, like Ms. Mandelbaum or Rothman, the NSA had also asked a question:
Can we find terrorists using Big Data – like the phone records of hundreds of millions of ordinary Americans? Could we get those records from, say, Verizon?
The dark side of Big Data involves much more than Snowden's disclosure, or what the US does. And what made Big Data possible did not happen overnight. The term has been around for at least 15 years, though it's only recently become popular.
"It will be quite transformational," says Thomas Davenport, an information technology expert at Babson College in Wellesley, Mass., who co-wrote the widely used book "Competing on Analytics: The New Science of Winning."
What exactly will it transform? To find out, let's go back to the beginning.
* * *
Big Data starts with ... a lot of data. Google executive chairman Eric Schmidt has said that we now uncover as much data in 48 hours – 1.8 zettabytes (that's 1,800,000,000,000,000,000,000 bytes) – as humans gathered from "the dawn of civilization to the year 2003."
You read that right. The head of a company receiving 50 billion search requests a day believes people now gather more data in a few days than humans did throughout almost all of history.
Mr. Schmidt's claim has doubters. But similar assertions crop up from people not prone to exaggeration, such as Massachusetts Institute of Technology researcher Andrew McAfee and MIT professor Erik Brynjolfsson, authors of the new book "Race Against the Machine."
"More data crosses the Internet every second," they write, "than were stored in the entire Internet 20 years ago."
A key driver of the growth of data is the way we've digitized many of our everyday activities, such as shopping (increasingly done online) or downloading music. Another factor: our dependence on electronic devices, all of which leave digital footprints every time we send an e-mail, search online, post a message, text, or tweet.
Virtually every institution in society, from government to the local utility, is churning out its own torrent of electronic digits – about our billing records, our employment, our electricity use. Add in the huge array of sensors that now exist, measuring everything from traffic flow to the spoilage of fruit during shipment, and the world is awash in information that we had no way to uncover before – all aggregated and analyzed by increasingly powerful computers.
Most of this data doesn't affect us. Amassing information alone doesn't mean it's valuable. Yet the new ability to mine the right information, discover patterns and relationships, already affects our everyday lives.
Anyone, for instance, who has a navigation screen on a car dashboard uses data streaming from 24 satellites 11,000 miles above Earth to pinpoint his or her exact location. People living in Los Angeles and dozens of other cities now participate, knowingly or not, in the growing phenomenon of "predictive policing" – authorities' use of algorithms to identify crime trends. Tennis fans use IBM SlamTracker, an online analytic tool, to find out exactly how many returns of serve Andy Murray needed to win Wimbledon.
When we use sites like SlamTracker, companies take note of our browsing habits and, through either the miracle or the meddling of Big Data, use that information to send us personal pitches. That's what happens when AOL greets you with a pop-up ad (Slazenger tennis balls – 70 percent off!).
In their book, "Big Data: A Revolution That Will Transform How We Live, Work, and Think," Kenneth Cukier and Viktor Mayer-Schönberger mention Wal-Mart's discovery, gleaned by mining sales data, that people preparing for a hurricane bought lots of Pop-Tarts. Now, when a storm is on the way, Wal-Mart puts Pop-Tarts on the shelves next to the flashlights.
But what excites and concerns people about Big Data is more far-reaching than that. One way of seeing the bigger picture: taking a closer look at some of the people in the digital trenches.
* * *
I follow Mandelbaum and Mr. Youkel down a corridor of the Library of Congress, past exhibits redolent of history and what you might expect from what we call "America's library," with its 38 million books on 838 miles of shelving.
They open a door. We pass behind people staring at huge computer screens and enter a room that doesn't look as if it belongs in a library at all. It's the size of a gym, with fluorescent lights overhead and tall metal boxes rising from the floor.
"The tweets come here," Mandelbaum says.
It's been three years since Twitter approached the library with a question. What the online networking service started in 2006 had become a new way of communicating. Would there, Twitter asked, be historical value in archiving tweets?
"We saw the value right away," says Robert Dizard, deputy director of the library. "[Our] mission is, preserve the record of America."
Certainly the record of what millions of Americans say, think, and feel each day would be a treasure-trove for historians. But was it technologically feasible and – important for a federal agency – cost-effective to handle the three V's that form the fingerprint of a Big Data project: volume, velocity, and variety?
The library said yes. But the task is daunting.
Volume? It will archive 172 billion tweets in 2013 alone, about 300 each from the world's 500 million-plus tweeters.
Velocity? That means absorbing more than 20 million tweets an hour, 24 hours a day, seven days a week, each stored in a way that can last.
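The volume and velocity figures are rounded, and a few lines of arithmetic show how they fit together. The sketch below is only a back-of-the-envelope check: it uses the numbers quoted in this article and assumes, purely for illustration, that tweets arrive evenly across a 365-day year, which real traffic does not.

```python
# Back-of-the-envelope check of figures quoted in the article.
# Assumption (not from the library): tweets arrive evenly over a 365-day year.
TWEETS_IN_2013 = 172_000_000_000       # "172 billion tweets in 2013"
TWEETERS = 500_000_000                 # "500 million-plus tweeters"
QUOTED_TWEETS_PER_DAY = 500_000_000    # "a half-billion tweets ... each day"

DAYS_PER_YEAR = 365
HOURS_PER_DAY = 24

# Volume: how many tweets per tweeter does the yearly total imply?
tweets_per_tweeter = TWEETS_IN_2013 / TWEETERS

# Velocity: the same yearly total expressed per day and per hour.
tweets_per_day = TWEETS_IN_2013 / DAYS_PER_YEAR
tweets_per_hour = tweets_per_day / HOURS_PER_DAY

# Velocity again, starting from the half-billion-a-day figure instead.
quoted_tweets_per_hour = QUOTED_TWEETS_PER_DAY / HOURS_PER_DAY

print(f"Per tweeter in 2013:          {tweets_per_tweeter:,.0f}")
print(f"Per day (from yearly total):  {tweets_per_day:,.0f}")
print(f"Per hour (from yearly total): {tweets_per_hour:,.0f}")
print(f"Per hour (from daily figure): {quoted_tweets_per_hour:,.0f}")
```

On these assumptions, the yearly total works out to roughly 344 tweets per tweeter and a little under 20 million an hour, while the half-billion-a-day figure implies closer to 21 million an hour. The "about 300" and "more than 20 million" in the text are rounded versions of the same picture.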
Variety? There are tweets from a woman who may run for president in 2016 – and from Lady Gaga. And they're different in other ways.
"Sure, a tweet is 140 characters," says Jim Gallagher, the library's director of strategic initiatives. "But there are 50 fields. We need to record who wrote it. Where. When."
Because many tweets seem banal, the project has inspired ridicule. When the library posted its announcement of the project, one reader wrote in the comments box: "I'm guessing a good chunk ... came from the Kardashians."
But isn't banality the point? Historians want to know not just what happened in the past but how people lived. It is why they rejoice in finding a semiliterate diary kept by a Confederate soldier, or pottery fragments in a colonial town.
It's as if a historian today writing about Lincoln could listen in on what millions of Americans were saying on the day he was shot.
Youkel and Mandelbaum might seem like an odd couple to carry out a Big Data project: One is a career Library of Congress researcher with an undergraduate degree in history, the other a geologist who worked for years with oil companies. But they demonstrate something Babson's Mr. Davenport has written about the emerging field of analytics: "hybrid specialization."
For organizations to use the new technology well, traditional skills, like computer science, aren't enough. Davenport points out that just as Big Data combines many innovations, finding meaning in the world's welter of statistics means combining many different disciplines.
Mandelbaum and Youkel pool their knowledge to figure out how to archive the tweets, how researchers can find what they want, and how to train librarians to guide them. Even before opening tweets to the public, the library has gotten more than 400 requests from doctoral candidates, professors, and journalists.
"This is a pioneering project," Mr. Dizard says. "It's helping us begin to handle large digital data."
For "America's library," at this moment, that means housing a Gutenberg Bible and Lady Gaga tweets. What will it mean in 50 years? I ask Dizard.
He laughs – and demurs. "I wouldn't look that far ahead."
* * *
Arnold Lund is looking ahead. Lund has a Ph.D. in experimental psychology. He holds 20 patents, has written a book on managing technology design, and directs a variety of projects for General Electric.
Last year, a tree fell on power lines behind my house. As the local utility repaired things, an electrical surge crashed my computer, destroying all the contents. Lund's power line project has my attention.
"For power companies, one of the largest expenses is managing foliage," he says. "We lay out the entire geography of a state – and the overlay of the power grid. We use satellite data to look at tree growth and cut back where there's most growth. Then [we] predict where the most likely [problem] is. We have 50 different variabilities to see the probability of outage."
In that one compressed paragraph, I see three big changes Mr. Cukier and Mr. Mayer-Schönberger say Big Data brings to research. They're what we might call the three "nots."
Size, not sample. For more than a century, they point out, statisticians have relied on small samples of data from which to generalize. They had to. They lacked the ability to collect more. The new technology means we can "collect a lot of data rather than settle for ... samples."
Messy, not meticulous. Traditionally, researchers have insisted on "clean, curated data. When there was not that much data around, researchers [had to be as] exact as possible." Now, that's no longer necessary. "Accept messiness," they write, arguing that the benefits of more data outweigh our "obsession with precision."
Correlation, not cause. While knowing the causes behind things is desirable, we don't always need to understand how the world works "to get things done," they note.
Lund's lab exemplifies all three. First, his "entire geography" and 50 variables involve massive sets of data – information streaming in from sensors, satellites, and other sources about everything from forest density to prevailing wind direction to grid loads. Second, he looks for "probability," not an "obsession with precision."
Correlation? Lund values cause, but the reason behind, say, tree growth interests him less than spotting correlations that might spur action. "Ah – that tree," he exclaims, as if he is an engineer in the field. "Better get the trucks out ahead of the storm!"
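What "probability of outage" might look like in practice can be sketched in a few lines. This is not GE's model. The three variables, their weights, and the line segments below are all invented for illustration (Lund mentions about 50 variables and names none of them in detail), and the logistic formula is simply one common way to turn a weighted score into a probability between 0 and 1.

```python
import math

# Hypothetical weights for three made-up variables. A real model would
# learn its weights from past outages rather than have them typed in.
WEIGHTS = {
    "tree_growth_m": 0.8,     # new tree growth near the line, in meters
    "wind_speed_mps": 0.1,    # forecast wind speed, in meters per second
    "trees_near_line": 0.05,  # number of trees within reach of the line
}
BIAS = -4.0  # baseline: most segments, most days, have no outage

def outage_probability(features):
    """Combine the weighted variables and squash the result into 0..1."""
    score = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))

# Invented power-line segments.
segments = {
    "A": {"tree_growth_m": 2.0, "wind_speed_mps": 20, "trees_near_line": 30},
    "B": {"tree_growth_m": 0.5, "wind_speed_mps": 10, "trees_near_line": 5},
    "C": {"tree_growth_m": 1.0, "wind_speed_mps": 25, "trees_near_line": 10},
}

# Rank segments from most to least likely to lose power.
ranked = sorted(
    ((name, outage_probability(f)) for name, f in segments.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, probability in ranked:
    print(f"segment {name}: {probability:.2f}")
```

The ranking, not the exact number, is what would send the trucks out: the segment at the top of the list gets trimmed first. Nothing in the score says why a tree is likely to fall, which is the point about correlation.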
Cukier and Mayer-Schönberger cite the United Parcel Service to bolster their argument about correlation. UPS equips its trucks with sensors that identify vibrations and other things associated with breakdowns. "The data do not tell UPS why the part is in trouble. They reveal enough for the company to know what to do."
Lund's boss, GE chief executive officer Jeff Immelt, also talks about sensor data. The company is now investing $1 billion in software and analytics, which includes putting sensors on its jet engines to help enhance fuel efficiency. Mr. Immelt has said that just a 1 percent change in "fuel burn" can be worth hundreds of millions of dollars to an airline.
"You save an oil guy 1 percent," Immelt said at a conference this spring, "you're his friend for life."
While Lund has talked glowingly about how much data his projects can collect, he wants to make sure I know data isn't everything. "As a scientist," he says, "I know the biggest challenge is finding the right questions. How do you find the questions important to business, society, and culture?"
* * *
Rothman has questions, too. "We work in emergency rooms," he says about himself and Dr. Dugas. "We're the boots on the ground."
Rothman's work has involved emergency medicine and the nexus between public health and epidemics, including influenza, which kills as many as 500,000 people a year around the world and about 45,000 in the US.
The two researchers wanted to find out if the Google national study held lessons for Baltimore and their emergency room (ER). They studied Google queries for the Baltimore area – queries about flu symptoms, or chest congestion, or where to buy a thermometer. If they could spot spikes, that might help solve one crucial problem.
"Crowding," Dugas says. "Huge issue."
When epidemics start, people rush to hospitals. Waiting rooms fill up.
If Google trends showed a spike just as epidemics started, ERs could staff up and reserve more space for the surge of patients. The link between Google spikes and hospital visits in Baltimore turned out to be strong, especially for children. As soon as the first news reports surfaced about the 2009 H1N1 virus, pediatric ER visits at Hopkins increased – at the peak by as much as 75 percent.
But when the two researchers looked closer, they found something unexpected. No flu. It turned out that news reports about H1N1 elsewhere fueled a rush to ERs in Baltimore – what one researcher called "fear week."
"If you just looked at correlation for flu, you'd say it was a false trend," says Dugas.
Even so, she and Rothman found the data important for ERs: No matter why people are coming in the door, they need to staff up. The Baltimore study also showed the importance of finding out what was behind all those medically related Google searches – in other words, not just correlation but cause.
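One simple way to flag the kind of search spike the two researchers looked for is to compare each week's query count with the weeks just before it. The sketch below is not their method. The weekly counts are invented, and the function name `find_spikes`, the four-week window, and the 1.5 threshold are assumptions chosen only to show the idea.

```python
from statistics import mean

def find_spikes(counts, window=4, ratio=1.5):
    """Return the indexes of weeks whose count exceeds `ratio` times
    the average of the preceding `window` weeks."""
    spikes = []
    for i in range(window, len(counts)):
        baseline = mean(counts[i - window:i])
        if baseline > 0 and counts[i] > ratio * baseline:
            spikes.append(i)
    return spikes

# Invented weekly counts of flu-related searches in one city.
weekly_queries = [100, 110, 95, 105, 102, 240, 260, 120, 108]

spike_weeks = find_spikes(weekly_queries)
print(spike_weeks)
```

With these made-up numbers, the weeks at positions 5 and 6 are flagged. As "fear week" showed, a flag like this says only that searches jumped, not whether flu or a news report was behind the jump.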
Like GE's Lund, Rothman emphasizes the value of "the questions you're asking."
* * *
Evidence that Big Data promises enormous benefits is more than anecdotal. MIT's Mr. Brynjolfsson did a study in 2012 examining 179 companies. He found those whose decisions were "data-driven" had become 5 to 6 percent more productive in ways only the use of data could explain.
On the other hand, consider just this one data point: If you type "Big Data Dark Side" into Google, you'll get 40 million results. Despite the potential, there's also peril.
The dark side of Big Data concerns Laura DeNardis, Internet scholar, author of three books, and professor at American University's school of communication in Washington. She and others worry – not exclusively – about three questions. Does the new technology (1) erode privacy, (2) promote inequality, and (3) turn government into Big Brother?
She points to public health data as one potential source of abuse. Her concern echoes that of critics who fear that supposedly anonymous patient records are not anonymous at all. As far back as the 1990s, a Massachusetts state commission gave researchers health data about state workers, believing this would help officials make better health-care decisions. William Weld, then governor of Massachusetts, assured workers their files had been scrubbed of the data that could identify them.
One Harvard University computer science graduate student took this promise of privacy as a challenge. Using just three pieces of data, Latanya Sweeney showed how to identify everyone – including Weld, whose diagnoses, medications, and entire medical history Ms. Sweeney, now a professor at Harvard, gleefully sent to his office.
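The linking trick behind this kind of re-identification is easy to sketch. Sweeney's published work used ZIP code, birth date, and sex as the three fields. Everything else below is invented for illustration and is not her code or data: the records, the names, and the function `reidentify` are hypothetical. The idea is that a "scrubbed" medical file and a public list that includes names (a voter roll, say) can be joined on the fields they share.

```python
from collections import defaultdict

# Hypothetical "anonymized" health records: names removed, but
# ZIP code, birth date, and sex left in.
health_records = [
    {"zip": "02138", "birth": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "birth": "1962-07-04", "sex": "F", "diagnosis": "condition B"},
    {"zip": "02139", "birth": "1980-03-09", "sex": "F", "diagnosis": "condition C"},
]

# Hypothetical public list that includes names.
public_list = [
    {"name": "Person One", "zip": "02138", "birth": "1950-01-15", "sex": "M"},
    {"name": "Person Two", "zip": "02139", "birth": "1962-07-04", "sex": "F"},
    {"name": "Person Three", "zip": "02139", "birth": "1980-03-09", "sex": "F"},
    {"name": "Person Four", "zip": "02139", "birth": "1980-03-09", "sex": "F"},
]

KEY_FIELDS = ("zip", "birth", "sex")

def reidentify(records, named_list):
    """Attach a name to each record whose three key fields match
    exactly one person on the named list."""
    names_by_key = defaultdict(list)
    for person in named_list:
        key = tuple(person[field] for field in KEY_FIELDS)
        names_by_key[key].append(person["name"])

    matches = {}
    for record in records:
        key = tuple(record[field] for field in KEY_FIELDS)
        candidates = names_by_key.get(key, [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches[candidates[0]] = record["diagnosis"]
    return matches

identified = reidentify(health_records, public_list)
print(identified)
```

In this toy example, two of the three records get a name back. The third stays anonymous only because two people happen to share the same ZIP code, birth date, and sex; the rarer a combination is, the less protection scrubbing the name provides.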
Today there are far more powerful ways to identify people from records supposed to keep things private. And there are concerns other than our health records. Dr. DeNardis worries about how much companies know about our social media habits.
"Take a look at the published privacy policies of Apple, Facebook, or Google," she says. "They know what you view, when you make a call, where you are. People consent to that by selecting 'I agree' to privacy terms. But how carefully are they read?"
She's not alone. Jay Stanley of the American Civil Liberties Union describes one example of what companies can do with what they know about us: "credit-scoring."
"Credit card companies," he wrote in a blog, "sometimes lower a customer's credit card limit based on the repayment history of other customers at stores where a person shops."
Do we want MasterCard to lower our credit-card limits, thinking we're a risk, just because people who frequent the stores we do don't pay their bills?
In addition to individual privacy, critics worry about Big Data's impact in more expansive ways, such as the growing gap between rich and poor nations. Large American companies can hire hundreds of data analysts. How can Bangladesh compete? Will this aggravate the global digital divide?
Perhaps most worrisome to people at the moment is the government's use of Big Data to monitor its own citizens, or others, in the name of national security. "The American people," President Obama said a few days after the NSA story broke, "don't have a Big Brother who is snooping into their business."
Did Obama mean George Orwell's term doesn't include governments secretly monitoring calls, e-mails, audio, and video of citizens suspected of nothing? Commandeering information from firms like Yahoo and Google?
The questions that arose from Snowden's revelations in June encompass issues of privacy, confidentiality, freedom, and, of course, security. The Obama administration argues that monitoring personal information keeps the country safe, asserting that PRISM has helped foil 54 separate terrorist plots against the US.
Some lawmakers on Capitol Hill dispute that number, though, and in recent weeks momentum has been building in Washington to rein in the NSA. Not only has support increased on the left and right for more oversight of its surveillance program, but polls also show a hardening of public opinion about snooping.
Meanwhile, there is no doubt about the fury in other countries when the news broke – especially in Germany, where critics have compared American monitoring of foreigners' phone calls and e-mails with that of the Stasi, the hated former East German secret police.
In fact, some of those most upset about the NSA revelations include Americans alarmed about what the new technology means outside US borders. Suzanne Nossel, head of the PEN American Center, which works to free writers and artists around the world imprisoned for free speech, worries about the government use of data from private companies to stifle dissent.
"It's not new," she says, citing the Chinese dissident Shi Tao, imprisoned by China in 2004 for posting political commentary on foreign websites, and still locked up. "Yahoo China had assisted the Chinese government. They used [Yahoo data] to convict him."
But then Ms. Nossel talks about the recent unrest in Turkey, where the Turkish military shot and arrested dozens of protesters in Istanbul's Taksim Square. To find more of what they called "looters," the Turkish government went to Twitter and Facebook for help – and announced that Facebook was "responding positively," something Facebook has denied.
And Nossel sees a difference between 2004 and now. Talking about the most repressive governments in the world, she argues that "the government ability to sweep and search is [now] so great, it tips the scale. No technology on the side of human rights advocates can confront it. That's new – and chilling."
* * *
What have we learned? There's a notable "Sesame Street" episode from years back in which Cookie Monster wanders into a library and drives the librarian crazy by asking over and over for a cookie. "This is a LIBRARY!" the librarian finally screams, forgetting to whisper. "We have books! Just books!"
That's certainly been our image of what libraries do. "You can still find books here," Mandelbaum reminds me, standing in a room full of processors.
But figures over the past decade seem to show that books – those rectangular things with pages we turn – are slowly on the way out in the Digital Age. That's less significant than it might seem, though. After all, we value books because of the knowledge they hold. We've changed the way we convey knowledge many times. Big Data is another source of knowledge. Will it become a more integral part of tomorrow's libraries?
It is perhaps fitting that one of the "Sesame Street" characters most in tune with the future is ... the Count. He counts everything. His role is to teach kids the importance of counting. Big Data allows us to count everything – and analyze what we find. But are numbers enough?
Brynjolfsson and Mr. McAfee compare Big Data to Leeuwenhoek's development of the microscope in the 1670s. They are, after all, both tools. They let people see lots of things that have always been around. Of course, the microscope also prompted us to ask questions we could never ask before. Big Data does that, too.
Still, while Big Data can predict a flu outbreak or where trees will fall, it can't, by itself, resolve the economic and moral dilemmas we have. Whether to keep power running, help patients faster, or preserve the record of America, Big Data teaches us what's out there, not what's right.
There's nothing inherently wrong with Big Data. What matters, as it does for Arnold Lund in California or Richard Rothman in Baltimore, are the questions – old and new, good and bad – this newest tool lets us ask.
• Robert A. Lehrman is a novelist and former White House chief speechwriter for Vice President Al Gore. Author of 'The Political Speechwriter's Companion,' he teaches at American University and co-runs a blog, PunditWire.