What we learned from AWS's us-east-1 outage
Max Rozen / December 08, 2021
In case you missed it, for several hours on December 7, 2021, AWS's us-east-1 region had an outage impacting multiple AWS APIs, taking out various websites across the internet.
According to our own monitoring at OnlineOrNot, the outage started at 2021-12-07 15:32 UTC and began to recover well at 2021-12-07 22:48 UTC (with minor signs of life for a few minutes around 2021-12-07 20:08 UTC).
Had we relied solely on AWS to update their status page before reacting, we would have been waiting a while. In fact, AWS took approximately an hour to update their status page to indicate anything was happening at all.
After things got serious, they added a banner to the top of their status page, explaining what was happening:
The retrospective
Some context
OnlineOrNot is an uptime checker service provides uptime and correctness checks for websites, web apps, and APIs. While our infrastructure is hosted across 20 AWS regions, a single point of failure was hosted in us-east-1.
What went wrong
OnlineOrNot's uptime check service stopped running regularly when us-east-1 started having issues.
Every 60 seconds, our service is kicked off via an AWS Cloudwatch event rule hosted in us-east-1.
When that event fires, a single AWS Lambda in us-east-1 queries our database to determine which URLs need checking, and sends off messages via SQS queues to trigger our checkers in 20 AWS regions.
What went well
Mostly by luck, the database was unaffected and our web app remained online. This let us quickly deploy a banner letting our users know something was wrong:
What we learned
Since the outage we've implemented redundancy and the ability to failover to other AWS regions in our uptime check service.
On top of this, we'll be looking at duplicating more of our stack in other cloud providers to ensure this doesn't happen to OnlineOrNot again on AWS, nor any other cloud service.