Frank Pasquale unravels the new machine age of algorithms and bots
In his book "The Black Box Society," Pasquale exposes secret algorithms behind the scenes of corporate America.
Slate recently said Frank Pasquale's new book, "The Black Box Society: The Secret Algorithms That Control Money and Information," attempts to "come to grips with the dangers of 'runaway data' and 'black box algorithms' more comprehensively than any other book to date."
I recently spoke with Pasquale about his new book and about how algorithms play a major role in our everyday lives — from what we see and don't see on the Web, to how companies and banks classify consumers, to influencing the risky deals made by investors. Edited excerpts follow.
Selinger: What's a black box society?
Pasquale: The term ‘black box’ can refer to a recording device, like the data-monitoring systems in planes, trains, and cars. Or it can mean a system whose workings are mysterious. We can observe its inputs and outputs, but can’t tell how one becomes the other. Every day, we confront these two meanings. We’re tracked ever more closely by firms and the government. We often don’t have a clear idea of just how far this information can travel, how it’s used, or its consequences.
Selinger: Why are secret algorithms so important to the story you’re telling?
Pasquale: Sometimes there are runaway algorithms, which, by themselves, take on very important decisions. They may become even more important in the future. For example, autonomous weapon systems could accidentally trigger skirmishes or even wars, based on misinterpreted signals. Presently, algorithms themselves cause problems or snafus that are not nearly as serious but still foreshadow much more troubling developments. Think of the uncontrolled algorithmic trading that led to a major stock market disruption in 2010, or nearly destroyed the firm Knight Capital. Similar technology is now used by small businesses, with occasionally devastating results, as the program "The Spark" at the CBC recently pointed out. Credit scores can also have a direct, negative impact on individuals, without them knowing the basis for sharp changes in their scores.
But one thing I emphasize in the book is that it’s not only – and often not primarily – the algorithms, or even the programmers of algorithms, who are to blame. The algos also serve as a way of hiding or rationalizing what top management is doing. That’s what worries me most – when “data-driven” algorithms that are supposedly objective and serving customers and users are in fact biased and working only to boost the fortunes of an elite.
Selinger: Are you talking about people diffusing power and masking shady agendas through algorithms and hoping they won’t get caught? Or are you suggesting that delegating decisions to algorithms is a strategy that actually immunizes folks from blame?
Pasquale: I think both things are happening, actually. There are people at the top of organizations who want to take risks without taking responsibility. CEOs, managers, and others can give winks and nudges that suggest the results they want risk analysts and data scientists to create. Algorithmic methods of scoring and predictive analytics are flexible, and can accommodate many goals. Let’s talk about an example that’s currently being litigated. It concerns a ratings agency that exaggerated the creditworthiness of mortgage-backed securities.
One key part of the case comes down to whether 600,000 mortgages should have been added to a sample of 150,000 that clearly served the interests of the firm’s main clients, but was increasingly less representative of the housing market. Turns out that once the sample increases, the loans at the heart of the housing crisis start to look riskier. And so here’s the problem. By avoiding the data, you can give AAA ratings to many more clients who pay top dollar for the ratings than you’d be able to if you used more accurate information.
It’s not as if there’s something wrong with math or models. And there’s nothing wrong with computational algorithms in themselves. The problem is that, at present, algorithms are ripe for manipulation and corruption. New disclosure requirements could help. But complexity can defeat those, too. For example, the rating agencies now, after Dodd-Frank, must disclose certain aspects of their models and use of data. But firms that want to model the value of mortgage-backed securities can deploy proprietary software, sometimes costing millions of dollars, to do so. That once again obfuscates matters for people who can’t access the software. There is a persistent worry that outsourcing regulation to risk modeling creates more financial uncertainty and instability.
Selinger: There have always been strategies for getting nefarious things done while hiding the dirt on your hands. Managers can create perverse incentive structures and then blame employees for the inevitable malfeasance. Just demonize “bad apples” who broke the rules, and deny that the tilted system had anything to do with their behavior. So, what’s new here?
Pasquale: Purportedly scientific and data-driven business practices are now being billed as ways of making our world more secure and predictable. For good faith actors, those aspirations are laudable. But less scrupulous managers have found ways of taking risks on the basis of contestable models containing massaged data. And this, in turn, creates more instability.
There’s a paradox at the heart of the black box society. We’re constantly driven to justify things scientifically, even when that’s not possible. So in far too many contexts, there’s pressure to find some, any kind of data, in order to meet an arbitrary or self-serving “quality” standard. All too often, in finance, dubious data can enter manipulable models created for opportunistic traders answering to clueless CEOs.
Selinger: Any other examples spring to mind that illustrate problems with lack of algorithmic accuracy or transparency?
Pasquale: It’s also a major issue in credit scoring for individuals. Exposés have shown how careless the big credit bureaus are with requests for correction. Even more frighteningly, in a report called "The Scoring of America," Pam Dixon and Bob Gellman have shown that there are hundreds of credit scores that people don’t even know about, which can affect the opportunities they get.
In terms of processing the data, there are some worries about major Internet firms. But because of the black box nature of the algorithms, it’s often hard to definitively prove untoward behavior. For example, various companies have complained that Google suddenly, and unfairly, dropped them in search engine results. Foundem, a British firm, was one of the most noted examples. It has argued that Google reduced its ranking (and those of other shopping sites) in order to promote its own shopping alternatives. Yelp has also complained about unfair practices.
Critical questions came up. Can we trust that Google is putting user interests first in its rankings? Or are its commercial interests distorting what comes up in results? Foundem said they were being disappeared to help make room for Google Shopping. And when you consider how vital Google now considers competing with Amazon to be, it makes some sense that the firm would do more to shade its results to favor its own properties in subtle ways that are barely detectable to the average consumer.
Then, there’s the use of data. Who knows exactly how Google is using all the data they collect on us? There are documented examples of secondary uses of data, in other sectors, that are very troubling. Many people were surprised to learn, back in 2008, that their prescription records were being used by insurers to determine whether they should get coverage. Basically, to apply for insurance, they had to waive their HIPAA protections, and allow the insurer to consult data brokers who produced health scores predicting how sick they were likely to get. And the insurers had their own “red flags” – for example, anyone who’d been on Prozac was assumed to be too risky.
So we’ve covered bad data, bad processing of data, and bad uses of data. Sometimes, all three concerns come together. Think of the controversy about Twitter during the time of Occupy Wall Street. In 2011 tons of people were tweeting #OccupyWallStreet, but it never came up as a trending topic – that is, one of the topics that appears on all users’ home pages. In response to accusations of censorship, Twitter stated that its algorithms don’t focus on popularity, but rather the velocity of accelerating trends. According to Twitter, the Occupy trend may have been popular, but it was too slow to gain popularity to be recognized as a trending topic.
That may be right. And the folks at Twitter are perfectly entitled to make that decision. But I have some nagging concerns. First, is bot behavior counted? There are so many bots on Twitter. And it’s easy to imagine some savvy group of programmers manipulating algorithms to get their favored topics pride of place. That’s a data integrity problem. The data processing problem is related. Clearly popularity has to be some part of the “trending” algo. No hashtag is getting to the top just by suddenly having 10 mentions instead of one. But how much does it matter? No one outside the company knows. Or if they do, they may well be jealously guarding that secret, to gain some commercial advantage.
This brings us to use. Presently, lots of people consider being on top of Twitter’s trending topics, or Google or Amazon search results, an important bragging right. But if these results are relatively easy to manipulate, or are really dictated by the corporate interests of the big Internet firms, they should be seen less as the “voice of the people” than as a new form of marketing. Or, to use Rob Walker’s term, “murketing.” For example, has anyone audited the data behind Twitter’s trending topics? I don’t think so. There’s no reliable way to access the data and algos you’d need to do a scientifically valid, convincing job there. It may seem trivial now, but the more these kinds of methods are used in the mass media, the more important they’ll be.
Selinger: Was the public response to how Twitter characterized Occupy Wall Street an indication of how little most of us understand about how algorithms are constructed? Or was it a justified indictment of Twitter for not making relevant information publicly available?
Pasquale: Media literacy is important, but in an era of score-driven education, it’s exactly the type of humanities education that’s on the chopping block. So new media should shoulder an obligation here. They could have something relatively unobtrusive but standard, like an asterisk, that links to a page explaining how information is being presented and what decisions go into it. Even right now, if I log onto Twitter, I won’t see anything that explains what the trends are.
Of course, total transparency provides an incentive to game the system. But at least some broad outline of the standards used for, and the purposes of, categories like “Trends” would offer a baseline for understanding what’s going on.
Let’s also recall that when the activists heard about the trending problem they developed a new API called Thunderclap that lets a whole bunch of tweets come to users at once. Twitter responded by suspending Thunderclap’s access to the platform. Maybe that decision reflects users’ interests. It might be annoying to get 100 tweets on the same topic at once in one’s timeline. But there are other ways of managing such issues. It just might take more time to implement them. But when a firm needs to demonstrate a potential for rapidly scaling revenue growth to Wall Street, it can’t afford to experiment. It’s left making rapid, blunt decisions.
Lilly Irani recently explained why that might be the case, referencing the heuristics and perhaps algorithms of venture capitalists and other investors. If a firm is classified as a tech firm, as opposed to a more traditional one, it’s often just assumed that it can scale to serve much larger numbers of people without proportionally increasing its own labor costs. So the investors’ algorithms favor firms that use algorithms to deal with controversies, problems, and information overload.
That might be fine if major Internet firms were only market actors seeking to maximize profits. But they also have major social, cultural, and political impact. And they brag about that impact. Think, for instance, about US tech firms’ self-proclaimed role in the Arab Spring protests.
My big point, here, is that people need to have a better understanding of when algorithmic standards are actually promoting users’ interests, as opposed to their monetization.
Selinger: Can you draw any general lessons from reflecting on what went wrong in some of the cases where missteps occurred?
Pasquale: With respect to the dominant media platforms discussed in my book, ranging from Twitter to Apple, Facebook, and Google, I think they’re obligated to enhance new media literacy. To go further, I would say that Facebook – which is much worse than Twitter, because it’s so algorithmically filtered, and so often misleading – has a duty to allow its users to understand how its filtering works. For example, it should permit them to see everything their friends post, if they want to. An API to do just that has been released to researchers. Unsurprisingly, they found that even educated, power users don’t have a good idea of how Facebook’s algorithmic curation works.
Selinger: Let’s get into the “Right to be Forgotten.” Do you disagree with the typical American response to the issue? Are we missing out on anything by insisting that First Amendment protection absolves Google of needing to remove links to information people deem irrelevant or outdated? Bottom line: Can we do better?
Pasquale: We can. There are two really clear examples of the U.S. embracing the erasure of information, or, at the very least, the non-actionability of information. These are the Fair Credit Reporting Act and expungement. Thanks to the Fair Credit Reporting Act, bankruptcies need to be removed after a certain amount of time elapses, so that the information doesn’t dog people forever. Human Resources is going to look for the worst thing on reports that come their way, and in many cases, not hire someone who declared bankruptcy. Expunging various types of criminal records is a very important right, especially since right now many employment practices are driven algorithmically and will knock out a candidate just for having been arrested, without human review on a case-by-case basis.
Over time we’ve realized that it’s important to give people second chances. Now that we have so many algorithmically driven decisions, and now that one piece of information can wreck someone’s life, it’s incredibly important for people to be able to get some information out of the system altogether. But even if one believes that no information should be “deleted” – that every slip and mistake anyone makes should be on their permanent record forever – we could still try to influence the processing of the data to reduce the importance of misdeeds from long ago.
Selinger: What, if any, chance do we see of having such ideals implemented here when the current American response to the Right to be Forgotten seems to be … that’s the “European way” of looking at things?
Pasquale: Well, if there are First Amendment absolutists in the Senate willing to filibuster, all bets are off. But there’s still hope. Think about the medical records that were part of the Sony hacks, and the iCloud nude photo hacks. It’s very troubling to think that, even if the original hackers were punished, later “re-publishers” of the same material could just keep it available perpetually in thousands of “mirror sites.”
Of course, the First Amendment does permit republication of material that’s a matter of public concern. But not everything is really of “public concern.” I’ve talked to Congressional staffers about this. People are starting to realize that you can’t have a regime where everything is deemed a matter of public concern and you can have persistent, uncontrolled, runaway publications of images and sensitive data that serve no public purpose. That epiphany will provide a foothold for something like the Right to be Forgotten to emerge in the United States.
Selinger: Does this way of looking at things cast aspersions on the freedom available to data brokers? They’ve got access to lots of sensitive information, including medical data, and little stands in their way of profiting from selling it over and over again.
Pasquale: Data brokers have sprung up out of relatively humble soil, like direct marketing and shoppers’ lists, and become a multibillion-dollar industry that (in the aggregate) has the goal of psychologically, medically, and politically profiling everyone in the world. People don’t fully appreciate the extent to which data brokers can trade information amongst themselves to create ever more detailed profiles of individuals.
In a 2011 case, Sorrell v. IMS Health, the Supreme Court did recognize a First Amendment right for a data broker to gather information in order to help its clients target marketing. But that case made clear that it did not affect HIPAA. So to come back to the point about public concern, let’s say that, for example, a law eventually passed that subjected data brokers to HIPAA-like regulations and limited the information they could use about people’s health status. Right now they literally have lists of millions of people said to be diabetics, have cancer, have AIDS, be depressed, be impotent, et cetera.
I believe that individuals have the right to stand in relation to the list-creators, pursuant to a future data protection law, as they now do in relation to their health care providers, pursuant to HIPAA. Thanks to HIPAA, I can review my medical records at a HIPAA-covered entity, and object to many uses of them, and even see who looked at them. In other words, I can demand an “accounting of disclosures.” If data brokers really want to extend IMS Health to stop similar rules applying to them … well, then, they might just create precedent to destroy HIPAA itself. And in that case, your doctor could claim a “First Amendment right” to tell everyone he knows about your medical conditions. Or nurses with access to the records could do the same. It’s an absurd result.
Selinger: I’d like to end our conversation by switching gears slightly so that we can talk about what philosophers call the epistemology of big data. Boosters see big data as capable of revealing everything, from who we really are to how law enforcement can do predictive policing. Is this an exaggeration? If so, is this an issue for the black box society?
Pasquale: I’d like to write a piece called “The impoverished epistemology of big data” because it’s driving a lot of decisions in crudely behavioristic ways. For example, we constantly hear the story of the credit card company that found out that people who buy felt pads [to put on furniture legs so that chairs and couches and desks won't scratch the floor] are more reliable — for paying back their debts — than their credit scores would lead you to believe. This becomes a story about big data unlocking secret signals that can help us arrange the world in a more just and equitable manner because now the people who buy felt pads finally will be rewarded. I’m skeptical, and for several reasons.
First, when this gets out there you’ll have a certain class of people who will buy the felt pads just to get better credit scores. If I predict the weather better, clouds aren’t going to conspire to outwit me. When a retailer predicts consumer behavior, he may well create a gameable classifier. This means a once-reliable indicator becomes less reliable as it becomes better known. Now maybe that is a good thing. Maybe the world would be better and more sustainable if people who bought felt furniture pads were systematically rewarded, and floors were systematically saved thanks to their efforts.
But, second, is it fair that the credit card company gets to analyze us this way, all too often opaquely? In the same article where the felt pads were examined, a credit card company also admitted to using evidence of marriage counseling as a determinant of rates and credit limits. I don’t think what’s written in the terms of service should give a company the right to sort me with such a category. We’re talking about health privacy norms, here. And, quite possibly, if this factor is widely known, it will discourage people who need marriage counseling from getting it.
Third, I want to know why felt pads indicate reliability. Big data mavens could come up with some just-so story. For example, maybe the pad users are more likely to get their full security deposit back, or will have a higher resale value for their house, with its lovely, pristine, unscratched floors. But who knows if that’s the case? Maybe they’re just more uptight. Maybe they’re richer on average, and that is the key variable driving all the rest. In that case, what began as a moralistic way of rewarding the virtuous floor-protectors among us ends up just being one more way to reward the rich for already being rich.
There is no guarantee that big data will be used to help us better understand society. It could just as easily be a shortcut for reproducing old prejudices that uses the veneer of science to unscientifically reinforce them. That was a worry in the White House Report on big data, and it’s a leitmotif of my book.
I see lots of cases like this where people want to tell moralistic stories about big data. But I doubt that there’s enough social science research to support the claims being made. The Wall Street Journal recently published a piece about determining who is obese by seeing who has a minivan, no children, and a premium cable subscription. Some said, in response, “Oh, that shows once again the damaging effects of television viewing.” But again, who knows what’s really causal? Maybe the really key variable is lack of opportunity, which could be driving overconsumption of food and TV. It’s just too stereotypically convenient to assume that the TV viewing drives the obesity, or that the obesity drives the TV viewing. We need much richer, narrative interpretations of society to capture the complexity of the problem here.