How a typo made the Amazon cloud go dark for scores of internet users
The outage was an abrupt reminder that the internet is not as invincible as its near-seamless fusion with our lives suggests.
How big is Amazon’s cloud? Big. So big, in fact, that its cloud computing arm, Amazon Web Services (AWS), is larger than the equivalent services offered by the next three players – Microsoft, Google, and IBM – combined.
That is why it was such a big deal when an Amazon team member mistyped a command during routine maintenance on Tuesday and knocked out large portions of the internet for around four hours.
AWS hosts a number of high-profile, heavily trafficked websites and services, including Airbnb, Netflix, Reddit, and Quora, many of whose pages were not loading during the outage. And although the internet giant moved quickly to fix the problem, the mishap was one of the periodic reminders we get that the internet is not as invincible as its near-seamless fusion with our lives suggests.
In a public apology, Amazon explained that the fat-finger incident occurred while an employee on the Amazon Simple Storage Service (S3) team was working to speed up the S3 billing process. The worker, “using an established playbook,” as Amazon put it, executed a command intended to take a small number of servers in the S3 subsystems temporarily offline, but the error took down a lot more.
“In this instance, the tool used allowed too much capacity to be removed too quickly,” Amazon said. “We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.”
Or, as The Washington Post’s Brian Fung put it: “Translation: Employees will no longer be able to unplug whole parts of the Internet by mistake.”
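For readers curious what such a safeguard might look like in practice, here is a minimal sketch in Python. It is not Amazon's tool: the class, the function, and the numbers are all hypothetical, invented only to illustrate the two checks Amazon described, namely removing capacity in small steps and refusing any removal that would take a subsystem below its minimum required capacity.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would violate the capacity floor."""


@dataclass
class Subsystem:
    # All fields are hypothetical; the article does not describe Amazon's data model.
    name: str
    active_servers: int
    min_required: int  # floor below which the subsystem cannot operate


def remove_capacity(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> int:
    """Remove servers from a subsystem, applying both safeguards.

    Returns the number of servers actually removed in this step.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Safeguard 1: reject the whole request if it would drop the
    # subsystem below its minimum required capacity.
    if subsystem.active_servers - requested < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly by capping each step.
    step = min(requested, max_per_step)
    subsystem.active_servers -= step
    return step
```

With a subsystem of 100 servers and a floor of 90, a request to remove 5 servers would take only 2 offline in the first step, while a mistyped request for 50 would be rejected outright and leave every server running.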
Amazon’s rise to the top of the so-called Infrastructure as a Service (IaaS) tree began in 2006, when the famously frugal company started buying up or leasing existing data centers dotted across northern Virginia, “a central region for internet backbone,” according to The Atlantic.
However, because Amazon moved into existing data centers rather than building new ones from scratch, some of that infrastructure is old, potentially making it more susceptible to crashing.
The timing of the crash couldn’t have been worse. It came on the same day that Amazon was holding one of its AWSome Days, where it promotes the advantages of AWS and teaches people how to use it. BGR.com’s Mike Wehner wrote about the unfortunate timing from Edinburgh, Scotland:
Amazon loves to talk about how great its products and services are – just like any other massive company — so the fact that it holds frequent conferences celebrating and educating people about Amazon Web Services (AWS) isn’t particularly odd. But for one of those events to land on the exact same day that AWS’s storage services bites the dust and takes a huge chunk of the internet down with it? Now that’s some serious bad luck.