What we learned from AWS's us-east-1 outage

Max Rozen

Max Rozen / December 08, 2021

In case you missed it, for several hours on December 7, 2021, AWS's us-east-1 region had an outage impacting multiple AWS APIs, taking out various websites across the internet.

According to our own monitoring at OnlineOrNot, the outage started at 2021-12-07 15:32 UTC and began to recover well at 2021-12-07 22:48 UTC (with minor signs of life for a few minutes around 2021-12-07 20:08 UTC).

Had we relied solely on AWS to update their status page before reacting, we would have been waiting a while. In fact, AWS took approximately an hour to update their status page to indicate anything was happening at all.

After things got serious, they added a banner to the top of their status page, explaining what was happening:

AWS Status Page

The retrospective

Some context

OnlineOrNot is an uptime checker service provides uptime and correctness checks for websites, web apps, and APIs. While our infrastructure is hosted across 20 AWS regions, a single point of failure was hosted in us-east-1.

What went wrong

OnlineOrNot's uptime check service stopped running regularly when us-east-1 started having issues.

Every 60 seconds, our service is kicked off via an AWS Cloudwatch event rule hosted in us-east-1.

When that event fires, a single AWS Lambda in us-east-1 queries our database to determine which URLs need checking, and sends off messages via SQS queues to trigger our checkers in 20 AWS regions.

What went well

Mostly by luck, the database was unaffected and our web app remained online. This let us quickly deploy a banner letting our users know something was wrong:

OnlineOrNot displaying its warning banner

What we learned

Since the outage we've implemented redundancy and the ability to failover to other AWS regions in our uptime check service.

On top of this, we'll be looking at duplicating more of our stack in other cloud providers to ensure this doesn't happen to OnlineOrNot again on AWS, nor any other cloud service.

Do you and your colleagues dread going on-call?

I send one email every month with an article like this one, to help you improve the time you spend on-call, and keeping your service online in general.

Lots of SREs and folks in operations like them, and I'd love to hear what you think as well. You can always unsubscribe.