Yesterday, the internet broke…

Yesterday, the internet broke due to an outage in Amazon’s widely used S3 service. S3 is Amazon’s storage service, used by companies like Netflix, Pinterest, Expedia, Nest and even the U.S. Securities and Exchange Commission to store items like images, files, and videos. In effect, Amazon S3 is a giant hard drive for a lot of companies. The service is a pretty big deal: it is estimated to carry around 40% of the content available online, and it is not limited to web browsing. It also holds content for apps, security cameras and a host of other services.

What exactly happened?

Yesterday, around 12:30 pm, we started receiving reports from our customers that websites and applications weren’t working as expected. After launching our investigation, we noticed that items stored on S3 were not loading. When we checked in with Amazon, we found an alert on their dashboard indicating that they were experiencing “high failure” rates with the S3 service. Adding insult to injury, shortly thereafter we noticed that other Amazon services were affected too. S3 is one of the keystones of Amazon’s cloud services, and many of their other services rely on it for storage, so when S3 is down, the whole platform degrades. Luckily, by 5:00 pm yesterday all services were restored. Amazon has indicated that they understand the root of the problem but has not yet released any additional details. Given the depth of the issue and the rarity of S3 failures, we can speculate that this must have been a pretty serious malfunction.

I thought the internet was redundant. What gives, and how do I prevent this from affecting me?

For starters, the internet is very redundant, and we don’t experience outages like this very often. The last time something like this happened was Christmas several years ago. Given the huge scale and massive number of sites that AWS supports, this is pretty impressive. As a result of this stability, many companies, even huge ones, forgo the costs of setting up services across multiple data centers or cloud providers (which is what provides redundancy). The service outage yesterday was limited to a single region (US EAST) at AWS, and its impact can be prevented by setting up services with High Availability in mind.
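To make the idea of redundancy across regions concrete, here is a minimal sketch of a read path that falls back to a second region when the first one fails. It assumes your data is already copied to more than one region. The names `fetch_with_failover`, `AllRegionsFailed` and `simulated_fetch` are hypothetical, and the region identifiers are only illustrative labels. In a real setup the `fetch` argument would wrap your storage client of choice.

```python
from typing import Callable, List, Sequence


class AllRegionsFailed(Exception):
    """Raised when no region could serve the requested object."""


def fetch_with_failover(
    key: str,
    regions: Sequence[str],
    fetch: Callable[[str, str], bytes],
) -> bytes:
    """Try each region in order and return the first successful result."""
    errors: List[str] = []
    for region in regions:
        try:
            return fetch(region, key)
        except Exception as exc:  # any failure means "try the next region"
            errors.append(f"{region}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


def simulated_fetch(region: str, key: str) -> bytes:
    """Stand-in for a real storage client; pretends one region is down."""
    if region == "us-east-1":
        raise ConnectionError("high failure rate")
    return f"{key} from {region}".encode()


if __name__ == "__main__":
    data = fetch_with_failover("logo.png", ["us-east-1", "us-west-2"], simulated_fetch)
    print(data.decode())
```

The trade-off is the one discussed in this post: the fallback only works if a second copy of the data exists somewhere else, and keeping that copy is where the extra cost comes from.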

Do I need a High Availability (HA) setup?

To be clear, for our purposes, a high availability setup is one in which multiple failures can occur across data centers or components and your service remains stable and available.

For most of our customers, this decision comes down to cost and service necessity:

(1) Are the costs of aggregating data and having it constantly available ‘worth it’ in comparison to the cost of what should be a short-lived outage?

(2) Is this service mission critical?

To come to a conclusion, we help our customers look at the costs of an outage, both real costs and brand costs, and weigh them against the implementation and ongoing costs of running a high availability solution. If it turns out that the software is mission critical, or the cost of an outage lasting a few hours is unacceptable, we recommend going with the high availability setup.
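The weighing described above can be sketched as a back-of-the-envelope calculation. This is only an illustration under stated assumptions: the function names, the three-year planning horizon, and every dollar figure are hypothetical, and brand cost in particular is hard to put a number on. The 4.5 hours of downtime roughly matches yesterday's window from 12:30 pm to 5:00 pm.

```python
def yearly_outage_cost(
    hours_down_per_year: float,
    revenue_per_hour: float,
    brand_cost_per_hour: float,
) -> float:
    """Estimate what downtime costs per year (real cost plus brand cost)."""
    return hours_down_per_year * (revenue_per_hour + brand_cost_per_hour)


def recommend_ha(
    mission_critical: bool,
    outage_cost_per_year: float,
    ha_setup_cost: float,
    ha_cost_per_year: float,
    years: int = 3,  # assumed planning horizon
) -> bool:
    """Return True when a high availability setup looks justified."""
    if mission_critical:
        # Question (2): a mission critical service gets HA regardless of cost.
        return True
    # Question (1): compare total outage cost with total HA cost over the horizon.
    total_outage_cost = outage_cost_per_year * years
    total_ha_cost = ha_setup_cost + ha_cost_per_year * years
    return total_outage_cost > total_ha_cost


if __name__ == "__main__":
    # Hypothetical figures: 4.5 hours down per year, $2,000/hour in lost
    # revenue, $500/hour in brand damage.
    outage = yearly_outage_cost(4.5, 2000.0, 500.0)
    print(outage)
    print(recommend_ha(False, outage, ha_setup_cost=20000.0, ha_cost_per_year=12000.0))
```

With these made-up numbers the outage is cheaper than the HA setup, so the answer hinges on whether the service is mission critical.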

My application and data are in the cloud, should I worry?

Probably not. Cloud providers like AWS, Azure, Google Cloud, etc. constantly monitor their services for hiccups and have a stability record that is simply amazing. As shown yesterday, identifying, isolating, and rectifying issues is an extremely high priority for Amazon. AWS is one of Amazon’s most profitable products, and much of that is because of its record for high-quality service and outstanding reliability.