The following is an excerpt from Schneier's latest book, "Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World," published earlier this month by W. W. Norton & Company.
When a powerful organization is eavesdropping on significant portions of our electronic infrastructure and can correlate the various surveillance streams, it can often identify people who are trying to hide. Here are four stories to illustrate that.
- Chinese military hackers who were implicated in a broad set of attacks against the US government and corporations were identified because they accessed Facebook from the same network infrastructure they used to carry out their attacks.
- Hector Monsegur, one of the leaders of the LulzSec hacker movement under investigation for breaking into numerous commercial networks, was identified and arrested in 2011 by the FBI. Although he usually practiced good computer security and used an anonymous relay service to protect his identity, he slipped up once. An inadvertent disclosure during a chat allowed an investigator to track down a video on YouTube of his car, then to find his Facebook page.
- Paula Broadwell, who had an affair with CIA director David Petraeus, similarly took extensive precautions to hide her identity. She never logged in to her anonymous e-mail service from her home network. Instead, she used hotel and other public networks when she e-mailed him. The FBI correlated registration data from several different hotels—and hers was the common name.
- A member of the hacker group Anonymous called “w0rmer,” wanted for hacking US law enforcement websites, used an anonymous Twitter account, but linked to a photo of a woman’s breasts taken with an iPhone. The photo’s embedded GPS coordinates pointed to a house in Australia. Another website that referenced w0rmer also mentioned the name Higinio Ochoa. The police got hold of Ochoa’s Facebook page, which included the information that he had an Australian girlfriend. Photos of the girlfriend matched the original photo that started all this, and police arrested w0rmer aka Ochoa.
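The hotel-registration correlation in the Broadwell story is, at its core, a set intersection. A minimal sketch follows, assuming each hotel's register for the relevant dates is available as a set of names; the names and data are invented for illustration, and this is not a description of the FBI's actual tooling.

```python
# Hypothetical guest registers: one set of names per hotel stay of interest.
registers = [
    {"A. Jones", "P. Smith", "R. Lee"},
    {"P. Smith", "K. Patel", "M. Chen"},
    {"T. Brown", "P. Smith", "R. Lee"},
]


def common_guests(registers):
    """Return the names that appear in every register."""
    if not registers:
        return set()
    return set.intersection(*registers)


suspects = common_guests(registers)
print(suspects)
```

Each added register can only shrink the candidate set, which is why a handful of independent data points is often enough to leave a single name.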
Maintaining Internet anonymity against a ubiquitous surveillor is nearly impossible. If you forget even once to enable your protections, or click on the wrong link, or type the wrong thing, you’ve permanently attached your name to whatever anonymous provider you’re using. The level of operational security required to maintain privacy and anonymity in the face of a focused and determined investigation is beyond the resources of even trained government agents. Even a team of highly trained Israeli assassins was quickly identified in Dubai, based on surveillance camera footage around the city.
The same is true for large sets of anonymous data. We might naïvely think that there are so many of us that it’s easy to hide in the sea of data. Or that most of our data is anonymous. That’s not true. Most techniques for anonymizing data don’t work, and the data can be de-anonymized with surprisingly little information.
In 2006, AOL released three months of search data for 657,000 users: 20 million searches in all. The idea was that it would be useful for researchers; to protect people’s identity, they replaced names with numbers. So, for example, Bruce Schneier might be 608429. They were surprised when researchers were able to attach names to numbers by correlating different items in individuals’ search history.
In 2006, Netflix published 100 million movie ratings by about 480,000 anonymized customers, as part of a challenge for people to come up with better recommendation systems than the one the company was using at that time. Researchers were able to de-anonymize people by comparing ratings and time stamps with public ratings and time stamps in the Internet Movie Database.
These might seem like special cases, but correlation opportunities pop up more frequently than you might think. Someone with access to an anonymous data set of telephone records, for example, might partially de-anonymize it by correlating it with a catalog merchant’s telephone order database. Or Amazon’s online book reviews could be the key to partially de-anonymizing a database of credit card purchase details.
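This kind of correlation can be sketched as a simple linkage attack: score each named public profile by how many entries it shares with an anonymous record, and accept the best match if the overlap is large enough. Everything below is hypothetical (the IDs, names, data, and the `min_overlap` threshold), and it uses exact matching for brevity, whereas a real attack would likely tolerate small differences in values and dates.

```python
# Hypothetical "anonymized" records: numeric ID -> set of (title, rating, date).
anonymized = {
    1001: {("Film A", 5, "2005-03-01"), ("Film B", 2, "2005-03-04"),
           ("Film C", 4, "2005-04-10")},
    1002: {("Film A", 5, "2005-06-20"), ("Film D", 3, "2005-07-02")},
}

# Hypothetical public profiles: name -> set of (title, rating, date).
public_profiles = {
    "alice": {("Film A", 5, "2005-03-01"), ("Film C", 4, "2005-04-10")},
    "bob": {("Film D", 1, "2005-01-15")},
}


def best_match(anon_record, profiles, min_overlap=2):
    """Return the profile name sharing the most entries, or None if too few."""
    best_name, best_score = None, 0
    for name, profile in profiles.items():
        score = len(anon_record & profile)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_overlap else None


def link(anonymized, profiles, min_overlap=2):
    """Map each anonymous ID to its best-matching name (or None)."""
    return {uid: best_match(rec, profiles, min_overlap)
            for uid, rec in anonymized.items()}


print(link(anonymized, public_profiles))
```

Here only two shared entries are enough to tie ID 1001 to a name, while ID 1002 stays unlinked because nothing in the public data overlaps with it.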
Using public anonymous data from the 1990 census, computer scientist Latanya Sweeney found that 87 percent of the population in the United States, 216 million of 248 million people, could likely be uniquely identified by their five-digit ZIP code combined with their gender and date of birth. For about half, just a city, town, or municipality name was sufficient. Other researchers reported similar results using 2000 census data.
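The measurement behind a figure like this is easy to state: count how many records are the only ones with their particular combination of attributes. The sketch below is an illustration on made-up records, not Sweeney's actual method or data; it also checks that 216 million out of 248 million rounds to the 87 percent quoted above.

```python
from collections import Counter

# Hypothetical records: (ZIP code, gender, date of birth).
records = [
    ("02139", "F", "1965-07-31"),
    ("02139", "F", "1965-07-31"),  # shares all three attributes with the row above
    ("02139", "M", "1970-01-12"),
    ("60614", "F", "1982-11-03"),
    ("60614", "M", "1982-11-03"),
]


def fraction_unique(records):
    """Fraction of records whose attribute combination occurs exactly once."""
    if not records:
        return 0.0
    counts = Counter(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / len(records)


print(fraction_unique(records))   # share of uniquely identifiable toy records
print(round(216 / 248 * 100))     # the census figure as a whole percentage
```

In the toy data, three of the five records are unique on these three attributes alone, even though no name appears anywhere.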
Google, with its database of users’ Internet searches, could de-anonymize a public database of Internet purchases, or zero in on searches of medical terms to de-anonymize a public health database. Merchants who maintain detailed customer and purchase information could use their data to partially de-anonymize any large search engine’s search data. A data broker holding databases of several companies might be able to de-anonymize most of the records in those databases.
Researchers have been able to identify people from their anonymous DNA by comparing the data with information from genealogy sites and other sources. Even something like Alfred Kinsey’s sex research data from the 1930s and 1940s isn’t safe. Kinsey took great pains to preserve the anonymity of his subjects, but in 2013, researcher Raquel Hill was able to identify 97 percent of them.
It’s counterintuitive, but it takes less data to uniquely identify us than we think. Even though we’re all pretty typical, we’re nonetheless distinctive. It turns out that if you eliminate the top 100 movies everyone watches, our movie-watching habits are all pretty individual. This is also true for our book-reading habits, our Internet-shopping habits, our telephone habits, and our Web-searching habits. We can be uniquely identified by our relationships. It’s quite obvious that you can be uniquely identified by your location data. With 24/7 location data from your cellphone, your name can be uncovered without too much trouble. You don’t even need all that data; 95 percent of Americans can be identified by name from just four time/date/location points.
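To see why a few location points go so far, consider filtering a set of traces by the points an observer already knows. The following is a toy sketch with invented users, times, and cell towers; it only illustrates how quickly the candidate set shrinks, and says nothing about how any particular study was carried out.

```python
# Hypothetical location traces: user ID -> set of (time, place) points.
traces = {
    "u1": {("Mon 08:00", "tower 12"), ("Mon 12:30", "tower 40"),
           ("Mon 18:00", "tower 7"), ("Tue 08:00", "tower 12")},
    "u2": {("Mon 08:00", "tower 12"), ("Mon 12:30", "tower 40"),
           ("Mon 18:00", "tower 9"), ("Tue 08:00", "tower 12")},
    "u3": {("Mon 08:00", "tower 12"), ("Mon 12:30", "tower 33"),
           ("Mon 18:00", "tower 7"), ("Tue 08:00", "tower 3")},
}


def matching_users(traces, known_points):
    """Return the users whose trace contains every known point."""
    known = set(known_points)
    return [user for user, trace in traces.items() if known <= trace]


known = [("Mon 08:00", "tower 12"), ("Mon 12:30", "tower 40"),
         ("Mon 18:00", "tower 7"), ("Tue 08:00", "tower 12")]

# Reveal the known points one at a time and watch the candidates narrow.
for n in range(1, len(known) + 1):
    print(n, matching_users(traces, known[:n]))
```

With one known point all three users match; with two, only two remain; by the third, a single trace is left.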
The obvious countermeasures for this are, sadly, inadequate. Companies have anonymized data sets by removing some of the data, changing the time stamps, or inserting deliberate errors into the unique ID numbers they replaced names with. It turns out, though, that these sorts of tweaks only make de-anonymization slightly harder.
This is why regulation based on the concept of “personally identifying information” doesn’t work. PII is usually defined as a name, unique account number, and so on, and special rules apply to it. But PII is also about the amount of data; the more information someone has about you, even anonymous information, the easier it is for her to identify you.
For the most part, our protections come from the privacy policies of the companies we use, not from any technology or mathematics. And being identified by a unique number often doesn’t provide much protection. The data can still be collected and correlated and used, and eventually we do something to attach our name to that “anonymous” data record.
In the age of ubiquitous surveillance, where everyone collects data on us all the time, anonymity is fragile. We either need to develop more robust techniques for preserving anonymity, or give up on the idea entirely.
Bruce Schneier is a security technologist, author, and chief technology officer of Resilient Systems. His latest book, "Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World," is copyright © 2015 by Bruce Schneier. This excerpt was published with permission of the publisher, W. W. Norton & Company, Inc. All rights reserved. Bruce blogs at schneier.com and tweets at @schneierblog.