On moving over a million uptime checks per week onto fly.io
Max Rozen / June 29, 2022
I went from "I'll just play around with it, maybe give it a toy workload" to "holy shit, what if I quickly rewrite my business's AWS Lambda + SQS stack to fit entirely within their free tier" in about 90 minutes.
It wasn't that simple in the end, but I did manage to migrate most of my active workload from AWS Lambda to fly.io.
Table of contents
- The project
- What that looks like on AWS Lambda
- The naive approach to migrating off AWS Lambda
- Replacing SQS with Redis
- Messing around with RAM and finding out
- The cost-benefit calculation
- What I kept on AWS Lambda
I built OnlineOrNot as the 200th uptime monitor for the Internet. The idea was: I hate all the other solutions, and I'm a frontend developer who can handle infra and backend, so let's solve the problem the way I would solve it, while listening to any customers I might get along the way.
On the backend, an uptime monitor is simple(ish):
- A process to query the database, find the checks that are ready to run, and queue them up as a job
- A queue to hold jobs while waiting to be run
- A process to pick up jobs from the queue, and run the check (and do something with the result)
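Those three pieces can be modelled in a few lines. This is a deliberately simplified in-memory sketch (the names, intervals, and data shapes are made up for illustration; the real system uses a database and a proper queue):

```javascript
// 1. The "database" of checks: each has a URL, an interval, and a last-run time.
const checks = [
  { url: "https://example.com", intervalMs: 60_000, lastRunAt: 0 },
  { url: "https://example.org", intervalMs: 300_000, lastRunAt: 0 },
];

// 2. The queue: jobs waiting to be run.
const queue = [];

// The scheduler: find checks that are due, and queue each one as a job.
function enqueueDueChecks(now) {
  for (const check of checks) {
    if (now - check.lastRunAt >= check.intervalMs) {
      queue.push({ url: check.url, queuedAt: now });
      check.lastRunAt = now;
    }
  }
}

// 3. The worker: pick up a job, run the check, and report the result.
async function runNextJob(fetchImpl) {
  const job = queue.shift();
  if (!job) return null;
  try {
    const res = await fetchImpl(job.url);
    return { url: job.url, up: res.ok };
  } catch {
    return { url: job.url, up: false };
  }
}
```

The interesting design questions are all in what backs `queue` and who calls `enqueueDueChecks` every minute, which is exactly where Lambda and fly.io differ below.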
To save money on the off-chance no one other than me ever used OnlineOrNot, I built it on AWS Lambda. It costs nothing if no one is using it (apart from the database, which I kept within the free tier while building anyway).
On AWS Lambda, the system I built looks like this:
- One AWS Lambda function that runs every minute to query the database
- A Simple Queue Service (SQS) queue to hold jobs
- One AWS Lambda function to pick up jobs, invoke (yet another) AWS Lambda function that performs the check, get the result, and do something with it
The vast majority of my AWS bill came from that last function: every time an uptime check hits its timeout, I get billed for the full 10 seconds it spends waiting, twice (once for the function running the check, once for the function waiting on it), even though neither function is doing anything useful in that time.
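To make the double-billing concrete, here's a back-of-the-envelope sketch. The per-GB-second rate is roughly AWS's published on-demand Lambda price, but the memory size and timeout rate are illustrative assumptions, not my real numbers:

```javascript
// What one timed-out check costs on Lambda (illustrative numbers).
const GB_SECOND_RATE = 0.0000166667; // approx. on-demand Lambda price per GB-second
const memoryGb = 0.128;              // assume 128MB functions
const timeoutSeconds = 10;

// A timeout bills the full 10 seconds twice: once in the function running
// the check, once in the function waiting on it.
const billedSeconds = timeoutSeconds * 2;
const costPerTimeout = billedSeconds * memoryGb * GB_SECOND_RATE;

// At a million checks a week, assume even a modest 1% timeout rate.
const timeoutsPerWeek = 1_000_000 * 0.01;
const costPerWeek = timeoutsPerWeek * costPerTimeout;
console.log(costPerWeek.toFixed(2)); // ~0.43 per week, for doing nothing at all
```

The dollar amounts are small per timeout; the point is that this is pure waiting, billed twice, scaling linearly with check volume.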
My initial idea was to take the code from that last AWS Lambda function (the one that has to wait out every timeout), use the SQS Consumer library to pick jobs off the SQS queue, build a Dockerfile to run it, and boom: my biggest Lambda cost moved into a free VM.
It turned out that over a million uptime checks per week is a lot, and the SQS Consumer library couldn't keep up. It also didn't process jobs in order, which meant some jobs never ran their uptime check at all!
With the main solution for processing SQS queues not fit for my use case, I spent a day thinking about how to re-architect the entire system to run in VMs.
After a bit of research, I found BullMQ. BullMQ replaces my SQS queue with a Redis instance, and handles job creation and processing for me.
After re-architecting, my fly.io-based solution looks like this:
- A permanently running VM running Redis with persistent storage (in case it goes down)
- A permanently running VM that uses node-cron to run every minute, and queues up uptime checks to run
- A permanently running VM that picks up jobs from Redis, runs the checks
Initially I gave each VM 256MB of RAM, flipped a feature flag to stop processing checks on AWS Lambda, and let fly.io start picking up checks to queue instead.
It turns out that if you overload a VM with too many jobs, some jobs will "stall": they end up neither in the queue ready for work, nor actively being worked on by the job runner.
Thankfully, I had already built around this possibility in my AWS Lambda stack (I requeue any job that hasn't run when we expected it to), and that code held up when running permanently in VMs.
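The requeue safety net boils down to comparing each check's expected next run against the clock. A minimal sketch, assuming a grace period and field names that are hypothetical rather than OnlineOrNot's actual schema:

```javascript
// A check has "stalled" if it hasn't run by its expected time, plus some
// grace period. The grace period and field names are illustrative.
function findStalledChecks(checks, now, graceMs = 60_000) {
  return checks.filter((check) => {
    const expectedRunAt = check.lastRunAt + check.intervalMs;
    return now > expectedRunAt + graceMs;
  });
}

// A check that should have run at t=300s has stalled by t=400s;
// one not due until t=550s has not.
const stalled = findStalledChecks(
  [
    { id: "a", lastRunAt: 0, intervalMs: 300_000 },
    { id: "b", lastRunAt: 250_000, intervalMs: 300_000 },
  ],
  400_000
);
console.log(stalled.map((c) => c.id)); // [ 'a' ]
```

Anything this sweep finds just gets pushed back onto the queue, which is why a stalled job shows up as a late check rather than a missing one.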
From my customers' perspective, some of their uptime checks would simply run every 10 minutes instead of every 5. While better than not running at all, it wasn't a great outcome.
It took me about 30 minutes each morning for a week to fine-tune the number of worker VMs, the RAM allocation, and the concurrency, before I worked out that I just needed a single VM with 512MB of RAM.
I'd known this day was coming for about a year, ever since my first customer with over a thousand websites showed up, smashed my AWS bill, and made me rethink my unlimited pricing.
Don't make paid resources unlimited, kids. Not even once.
What happened was: my AWS Lambda bill jumped to over $100 USD/month, and I started thinking "well, surely a continuously running VM would be cheaper", and it is! I roughly worked out how long it would take me to rewrite a serverless app to run continuously in a VM (about a week, at two hours per day before work), and then sat on the idea for a year, since building features for my customers was a better use of my time.
What I got wrong in my initial calculation was not factoring in the freedom of no longer having to worry about a surge in usage (whether from free trial accounts or from broken code). With a continuously running VM, your price is capped:
amount of RAM provisioned * number of instances * amount of CPU provisioned
whereas with AWS Lambda, your pricing is uncapped:
amount of RAM provisioned * number of milliseconds your function takes * number of invocations
With AWS Lambda, if you get a sudden rush of curious trial users from a viral blog post, or you've deployed a function that infinitely calls itself, suddenly you're on the hook for a huge bill (entirely bound by how lucky/unlucky you are).
When the same thing happens to a continuously running VM processing jobs from a queue, wait times increase. If you've screwed up the code, maybe the VM runs out of memory and reboots. The UX degrades, your monitoring picks it up and lets you know, and you scale up the VM to handle the load (for a couple of bucks more a month).
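Plugging illustrative numbers into the two pricing formulas above makes the capped-vs-uncapped difference obvious. Every price here is an assumption for the sake of the comparison, not my real bill:

```javascript
// Compare the two pricing models under a 10x usage surge (illustrative prices).
const vmMonthlyPrice = 5;               // flat price for a small always-on VM
const lambdaGbSecond = 0.0000166667;    // approx. on-demand Lambda rate
const memoryGb = 0.5;                   // assume 512MB functions
const avgSeconds = 1;                   // assume 1s average billed duration

// Lambda cost scales with invocations: RAM * duration * invocation count.
function lambdaMonthlyCost(invocations) {
  return invocations * avgSeconds * memoryGb * lambdaGbSecond;
}

const normalMonth = 4_000_000;          // roughly a million checks a week
console.log(lambdaMonthlyCost(normalMonth).toFixed(2));      // ~33.33
console.log(lambdaMonthlyCost(normalMonth * 10).toFixed(2)); // surge: ~333.33
console.log(vmMonthlyPrice.toFixed(2));                      // still 5.00
```

The VM's price doesn't move during the surge; the queue just gets longer until you notice and scale.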
I'm still keeping several full replicas of my stack on AWS Lambda (across a few AWS regions), ready to turn back on if a certain number of jobs fail to run within a given timeframe.
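That failover trigger is conceptually just a counter over a sliding window. A hedged sketch, where the window and threshold are made-up numbers rather than my real configuration:

```javascript
// Failover trigger: if too many jobs missed their run window recently,
// switch the Lambda replicas back on. Window/threshold are invented.
function shouldFailover(missedRunTimestamps, now, windowMs = 300_000, threshold = 100) {
  const recentMisses = missedRunTimestamps.filter((t) => now - t <= windowMs);
  return recentMisses.length >= threshold;
}

const now = 1_000_000;
const misses = Array.from({ length: 120 }, () => now - 10_000); // 120 recent misses
console.log(shouldFailover(misses, now));             // true: fail over to Lambda
console.log(shouldFailover(misses.slice(0, 5), now)); // false: within tolerance
```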
I also still run Google Chrome/Puppeteer for Browser Checks on AWS Lambda, but now I invoke them from my fly.io VM.