How I accidentally told 19k people Hacker News was down
Max Rozen / August 12, 2022
Hacker News is a forum run by Y Combinator that tends to focus on topics relating to computer science and entrepreneurship. During an average month, Hacker News serves almost 10 million visitors.
Hacker News had a series of extremely rare outages in early July 2022:
On July 8 at 06:20 UTC, the hard drives in Hacker News's primary server stopped working, and the team switched over to the failover server in just over an hour.
The failover server chugged along for almost six and a half hours before its hard drives also failed at 12:47 UTC.
At this point, it looks like all four hard disks reached 40,000 hours of uptime, and a bug in their firmware caused them to fail.
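To put that number in perspective, here is a quick back-of-the-envelope conversion. The helper names are made up for illustration, and the 365.25-day year is just a convenient average.

```typescript
// Hypothetical helpers: convert a drive's power-on hours into days and years.
const HOURS_PER_DAY = 24;
const DAYS_PER_YEAR = 365.25; // average year length, an approximation

function hoursToDays(hours: number): number {
  return hours / HOURS_PER_DAY;
}

function hoursToYears(hours: number): number {
  return hoursToDays(hours) / DAYS_PER_YEAR;
}

// The firmware threshold mentioned above.
const thresholdHours = 40_000;
const thresholdDays = hoursToDays(thresholdHours);
const thresholdYears = hoursToYears(thresholdHours);
```

That works out to roughly 1,667 days, or a little over four and a half years of continuous uptime. Drives that were powered on within hours of each other would be expected to hit the threshold within hours of each other too.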
OnlineOrNot is a new kind of Incident Management software that runs on Cloudflare Pages. OnlineOrNot provides uptime monitoring, with integrated status pages that automatically update.
The architecture for our Status Page service is relatively simple: the frontend is a Remix app. Remix is a React framework that lets us write React components and output HTML/CSS with minimal JS. We do not currently use a client-side framework. As a result of this approach, our average page weight is around 13KB (instead of 200KB).
Our Remix app runs on Cloudflare Pages, which is served from Cloudflare's global network and has no direct access to my database.
To work around this limitation, we built an Express API server running on fly.io. While I usually opt for a serverless solution for early versions of features, I wanted to avoid cold starts. Another advantage of using fly.io is that we can deploy replicas of the API server globally. However, thanks to Cloudflare Pages' ability to access Cloudflare's cache, this won't be necessary, as most requests under high load won't hit my origin server.
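To illustrate why a cache in front of the API keeps load off the origin, here is a minimal in-memory sketch. It is not how Cloudflare's cache works internally (that is an HTTP edge cache), and none of these names come from the OnlineOrNot codebase: `withTtlCache`, `Fetcher`, and the TTL value are all assumptions made for the example.

```typescript
// Hypothetical sketch: a small TTL cache wrapped around an origin fetcher.
type Fetcher<T> = (key: string) => Promise<T>;

interface CacheEntry<T> {
  value: Promise<T>; // the in-flight or settled origin response
  expiresAt: number; // epoch milliseconds
}

function withTtlCache<T>(
  origin: Fetcher<T>,
  ttlMs: number,
  now: () => number = Date.now,
): Fetcher<T> {
  const entries = new Map<string, CacheEntry<T>>();

  return (key: string): Promise<T> => {
    const hit = entries.get(key);
    if (hit !== undefined && hit.expiresAt > now()) {
      // Fresh entry: answer from the cache, the origin is not contacted.
      return hit.value;
    }

    // Miss or stale entry: ask the origin once and share the promise,
    // so concurrent requests for the same key do not pile up on it.
    const pending = origin(key);
    entries.set(key, { value: pending, expiresAt: now() + ttlMs });

    // Do not keep serving a failed response until the TTL runs out.
    pending.catch(() => {
      if (entries.get(key)?.value === pending) {
        entries.delete(key);
      }
    });

    return pending;
  };
}
```

With a wrapper like this, a burst of visitors requesting the same status page within the TTL window results in a single origin request, which is the effect the edge cache has during a traffic spike.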
While testing the status page feature described above, I decided to monitor Hacker News. To my surprise, the first result came back saying Hacker News was down, and my system automatically created an incident on the status page.
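The "automatically created an incident" behaviour can be sketched as a small state transition: a failed check opens an incident if none is open, and a successful check resolves the open one. This is a simplified illustration rather than OnlineOrNot's actual implementation; the `Incident` and `CheckResult` shapes and the `applyCheck` function are hypothetical, and the timestamps in the usage example are placeholders.

```typescript
// Hypothetical data shapes for an uptime check and a status page incident.
interface CheckResult {
  ok: boolean;
  checkedAt: string; // ISO 8601 timestamp
}

interface Incident {
  startedAt: string;
  resolvedAt: string | null; // null while the incident is still open
}

// Returns a new incident list reflecting the latest check result.
function applyCheck(
  incidents: readonly Incident[],
  check: CheckResult,
): Incident[] {
  const open = incidents.find((incident) => incident.resolvedAt === null);

  if (!check.ok && open === undefined) {
    // Site is down and nothing is open yet: create a new incident.
    return [...incidents, { startedAt: check.checkedAt, resolvedAt: null }];
  }

  if (check.ok && open !== undefined) {
    // Site is back up: resolve the open incident.
    return incidents.map((incident) =>
      incident === open ? { ...incident, resolvedAt: check.checkedAt } : incident,
    );
  }

  // No state change: still down with an open incident, or still up.
  return [...incidents];
}

// Usage with placeholder timestamps: down, still down, up, down again.
const checks: CheckResult[] = [
  { ok: false, checkedAt: "2022-07-08T07:00:00Z" },
  { ok: false, checkedAt: "2022-07-08T07:01:00Z" },
  { ok: true, checkedAt: "2022-07-08T07:30:00Z" },
  { ok: false, checkedAt: "2022-07-08T12:47:00Z" },
];
const history = checks.reduce<Incident[]>(applyCheck, []);
```

After those four checks, `history` holds two incidents: the first resolved, the second still open. Repeated failures while an incident is open do not create duplicates.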
When Hacker News came back up (running on the failover server), naturally, I created a thread on Hacker News, thinking no one would notice or care that Hacker News went down.
That thread quickly reached the front page, was in the number one spot for about half an hour, and thousands of users were sent to my Hacker News Status Page.
Eventually Hacker News went down again as the disks in their failover server also failed, and my automated monitoring kicked in again, creating a second incident page.
This second outage lasted roughly seven hours, with no updates from the Hacker News team for several hours (understandably so, as the failover server failed in the middle of the night). During this time, Google indexed the status page I built, and temporarily gave me the top search result on Google for "hacker news status".
Now that I know the feature works (and at scale - I'm grateful for the load test!), I'll be releasing an embarrassingly under-featured MVP this week.