Scaling AWS Lambda and Postgres to thousands of uptime checks
Max Rozen / November 18, 2021
When you're building a serverless web app, it can be pretty easy to forget about the database. You build a backend, send some data to a frontend, write some tests, and it'll scale to infinity with no effort, right?
Not quite - especially not with Postgres. As the number of users of your frontend increases, your app will open more and more database connections until the database is unable to accept any more.
It gets worse.
If you're building with AWS Lambda functions and also calling them as asynchronous worker functions that increase in usage as your web app becomes more popular, you're going to run into scaling issues fast.
With a bit of help from your architecture, you can remove the need for each worker function to talk to your database, and greatly improve your scalability. Curious? Read on.
OnlineOrNot (where this article is hosted) started its life as an extremely simple uptime monitoring service.
As a user you could sign in, add a URL you wanted to monitor, get alerts (only via email), and pay for a subscription.
Architecture-wise, it was pretty simple too.
It consisted of:
- a single AWS RDS database instance running Postgres
- a Lambda function that fired every minute, querying the database for URLs to check (sketched below)
- an SQS queue
- a Lambda function that got triggered via SQS, which would check the URLs, and write the results to the database
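Here's roughly what that per-minute Lambda looked like - a minimal sketch, not OnlineOrNot's actual code. I'm assuming Node.js with the `pg` client and AWS SDK v3, and the table name and environment variables are made up for illustration:

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import { Client } from "pg";

const sqs = new SQSClient({});

export const handler = async (): Promise<void> => {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  try {
    // Every URL is due for a check every minute.
    const { rows } = await db.query("SELECT id, url FROM uptime_checks");
    // One SQS message per URL - each message becomes one Checker invocation.
    await Promise.all(
      rows.map((row) =>
        sqs.send(
          new SendMessageCommand({
            QueueUrl: process.env.QUEUE_URL,
            MessageBody: JSON.stringify(row),
          })
        )
      )
    );
  } finally {
    await db.end();
  }
};
```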
If you have a bit of experience with building on serverless, you can probably immediately see where my mistake was.
As the web app grew in popularity, each URL that got added for monitoring would need its own database connection when writing results.
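In other words, every Checker invocation looked something like this sketch (again, not the real code - I'm assuming a runtime with a global fetch, and the results table is made up):

```typescript
import type { SQSEvent } from "aws-lambda";
import { Client } from "pg";

export const handler = async (event: SQSEvent): Promise<void> => {
  // One brand-new connection per invocation: check N URLs in a minute,
  // and N connections hit Postgres at roughly the same time.
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  try {
    for (const record of event.Records) {
      const { id, url } = JSON.parse(record.body);
      const started = Date.now();
      let status = 0;
      try {
        const response = await fetch(url);
        status = response.status;
      } catch {
        // Unreachable hosts count as down; status stays 0.
      }
      await db.query(
        "INSERT INTO check_results (check_id, status, latency_ms) VALUES ($1, $2, $3)",
        [id, status, Date.now() - started]
      );
    }
  } finally {
    await db.end();
  }
};
```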
That approach might have been fine if I were building an internal tool for a small company (build for the problems you know you have now, rather than problems you might have), but I'm running OnlineOrNot as a bootstrapped business that keeps growing.
Shortly after thinking to myself "this is fine", I needed to check up to 100 URLs at a time, every minute. Each time the jobs finished, all 100 Checker Lambda functions would try to write to the database at (more or less) the same time, and my frontend would be left unable to open new database connections.
To buy myself time, I scaled up the database a few tiers (the more memory your RDS Postgres instance has, the more simultaneous connections it can receive), and went looking for ways to fix the issue.
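If you want to see how much headroom you have, Postgres will tell you. A quick sketch of the kind of check I mean (you could equally run these two queries in psql):

```typescript
import { Client } from "pg";

// On RDS Postgres, max_connections is derived from the instance's memory
// by default, which is why scaling up a tier buys you more connections.
async function connectionHeadroom(): Promise<void> {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  try {
    const limit = await db.query("SHOW max_connections");
    const open = await db.query("SELECT count(*) AS open FROM pg_stat_activity");
    console.log(
      `${open.rows[0].open} of ${limit.rows[0].max_connections} connections in use`
    );
  } finally {
    await db.end();
  }
}

connectionHeadroom().catch(console.error);
```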
While implementing a fix for the issue, I managed to land my largest customer to that point. They wanted to monitor over a thousand URLs at a time, every minute. I ran the numbers - if I kept upgrading database tiers every time I needed to scale, I would eventually run out of database tiers to upgrade to.
The fix in the end was to:
- Make a new AWS Lambda function to batch URLs. Its job is to be triggered with as many URLs as possible in a batch, and then directly call a Checker Lambda function for each one
- Update the Checker Lambda function to return its results, rather than making it write to the database (both changes are sketched below)
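The updated Checker becomes almost embarrassingly simple - a sketch, with the event and result shapes made up for illustration:

```typescript
interface CheckEvent {
  id: string;
  url: string;
}

interface CheckResult {
  id: string;
  status: number;
  latencyMs: number;
}

// Invoked directly by the batcher; returns its result instead of
// touching the database. No pg client, no connection.
export const handler = async (event: CheckEvent): Promise<CheckResult> => {
  const started = Date.now();
  let status = 0;
  try {
    const response = await fetch(event.url);
    status = response.status;
  } catch {
    // Unreachable hosts count as down; status stays 0.
  }
  return { id: event.id, status, latencyMs: Date.now() - started };
};
```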
The architecture ended up looking much like before, with one key difference: the batcher Lambda now sits between the SQS queue and the Checker functions, and it alone talks to the database - invoking a Checker per URL, collecting the results, and writing them all over a single connection.
As a result, database connections now scale with the number of batcher Lambda functions running, rather than with the number of individual Checker functions. This pattern is also known as fanout architecture.
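Here's a minimal sketch of that fanout, assuming the batcher still receives its batches of URLs via SQS and the Checker returns the result shape from the sketch above (function and table names are hypothetical):

```typescript
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";
import type { SQSEvent } from "aws-lambda";
import { Client } from "pg";

const lambda = new LambdaClient({});

export const handler = async (event: SQSEvent): Promise<void> => {
  const checks: { id: string; url: string }[] = event.Records.map((r) =>
    JSON.parse(r.body)
  );

  // Fan out: one synchronous (RequestResponse) invocation per URL,
  // all in flight at once.
  const results = await Promise.all(
    checks.map(async (check) => {
      const response = await lambda.send(
        new InvokeCommand({
          FunctionName: process.env.CHECKER_FUNCTION_NAME,
          InvocationType: "RequestResponse",
          Payload: Buffer.from(JSON.stringify(check)),
        })
      );
      return JSON.parse(Buffer.from(response.Payload!).toString());
    })
  );

  // One connection for the whole batch, instead of one per URL.
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  try {
    for (const result of results) {
      await db.query(
        "INSERT INTO check_results (check_id, status, latency_ms) VALUES ($1, $2, $3)",
        [result.id, result.status, result.latencyMs]
      );
    }
  } finally {
    await db.end();
  }
};
```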
With this fix in place, I was able to downgrade my database instance back to where I started, with only a minor percentage increase in CPU usage.