Monitoring your web application as a small team
When you're part of a small team running a system with thousands of users or more, it can be pretty daunting to think about going on holiday, or even relaxing for a weekend.
"What if it goes down, and I'm not there to fix it?!" you ask yourself.
While you can never guarantee that nothing will go wrong, you can take some steps to minimise the risk.
Before you start
Choose boring technology
This mainly comes to mind for databases, but can also apply to the frameworks you choose to use: pick boring technology that has been around for ages, with a community behind it. That way when things go wrong, you won't be the only person in the world trying to solve your problem.
An early startup I worked on used a NoSQL database for data that was clearly relational. Even if you ignore the fact that we weren't using the right tool for the job, we would constantly run into issues that had no answers on StackOverflow/Google.
In fact, the only result on Google for our problems was often an article written by us.
Since that project, I've pushed for using Postgres (released in 1996) as much as possible, and in the last five years the only outage I've seen was due to under-provisioning (we tried to use a lower RAM server to save on running costs, and it backfired spectacularly).
Choose technology you already know
In a similar vein to picking boring technology, using technology you already know leaves you with just the "simple" task of building your product.
For example, if every project your team has worked on in the last year was built in Rails + React, maybe just use Rails + React to build your project. Building your product is time-intensive as it is, without having to worry about whether you're doing it the $INSERT_LANGUAGE_HERE way.
Opt for simplicity over complexity, especially considering your team size
By simplicity, I mean that running your service on a sharded multi-region database probably isn't the best idea for a team of two devs - a single large database instance with redundancy would work just as well (up to a certain point).
While horizontal scaling may technically be "better" or "correct", when you have fewer resources to investigate problems, it's sometimes easier to just bump your database up a few resource tiers (scaling vertically) than to spread it across multiple servers (scaling horizontally). Of course, if it looks like your service needs to handle hundreds of thousands of users, then you should consider horizontal scaling.
If you find yourself constantly getting alerted and having to fight fires to keep your service running, it might be time to simplify your architecture.
While running the business
Use managed services where possible
Sure, you can run your database on any VPS hosting provider in the world for cheap, but then it's on you to handle:
- Keeping the server updated
- Monitoring the server
- Ensuring the backup script runs
- Keeping backups, deleting old ones
- Fighting fires when things go wrong
Alternatively, you can outsource the actual running of your database to AWS. When things go wrong, you then have access to their immense support resources to resolve the issue.
I take a similar approach with payments (Stripe), email (Postmark/ConvertKit), and tracking errors (Sentry/Bugsnag/Rollbar).
Deploy at a good time
Generally speaking, deploying before heading to lunch, dinner, a holiday, or a weekend trip isn't a great idea. You need to ensure you've got time to roll back or hotfix the change you made.
Some people prefer only releasing at quiet times of the day, when the system doesn't have many users. The upside of this is that any resulting outage would impact the fewest users, but the downside is that it probably won't be the best time for you.
Depending on the number of users you have, it might be worth looking into feature flags. Feature flags let you separate deployments from releases when it comes to delivering features.
In other words, you "deploy" your change with the feature flag turned off, then once you've checked that it works on a small subset of users, you can roll out the change to your whole userbase.
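As a rough illustration of the "small subset of users" idea, here's a minimal sketch of a percentage-based flag check. The function name `is_enabled`, the flag name, and the hashing scheme are my own assumptions for the example, not any particular feature flag product's API.

```python
import hashlib


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage for the flag.

    Hypothetical helper: hashing the flag name together with the user ID gives
    each user a stable bucket from 0 to 99, so the same user keeps the same
    answer between requests, and raising the percentage only ever adds users.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


# Deployed with the flag off: nobody sees the new code path.
print(is_enabled("new-checkout", "user-1", 0))    # False
# Fully released: everybody sees it.
print(is_enabled("new-checkout", "user-1", 100))  # True
```

In between, you'd raise `rollout_percent` in small steps while keeping an eye on your error tracker, and drop it back to 0 if something looks off.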
Write two-way database migrations
Sometimes the fastest way to resolve an outage is to roll back the service to the last known "good version". A key part of this is to ensure your database migrations work in both directions - both when deploying the new version, and when tearing down the new version to release a previous version.
The alternative here is to roll forward with new database migrations when there's an issue, but I prefer being able to revert my database to the last known "good version".
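To make "both directions" concrete, here's a small sketch using Python's built-in `sqlite3` module. The `MIGRATIONS` list, the table and index names, and the `migrate` helper are all made up for the example; in practice your framework's migration tool would do this job, but the shape is the same: every "up" step ships with a "down" step that undoes it.

```python
import sqlite3

# Hypothetical migrations: each "up" statement is paired with its "down".
MIGRATIONS = [
    {
        "version": 1,
        "up": "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)",
        "down": "DROP TABLE users",
    },
    {
        "version": 2,
        "up": "CREATE INDEX idx_users_email ON users (email)",
        "down": "DROP INDEX idx_users_email",
    },
]


def current_version(conn: sqlite3.Connection) -> int:
    # SQLite stores a free-form integer per database file; we use it as the schema version.
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn: sqlite3.Connection, target: int) -> None:
    """Move the schema up or down, one step at a time, until it matches target."""
    version = current_version(conn)
    while version < target:
        step = MIGRATIONS[version]       # the migration that takes us to version + 1
        conn.execute(step["up"])
        version = step["version"]
        conn.execute(f"PRAGMA user_version = {version}")
    while version > target:
        step = MIGRATIONS[version - 1]   # the migration that created the current version
        conn.execute(step["down"])
        version = step["version"] - 1
        conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()


conn = sqlite3.connect(":memory:")
migrate(conn, 2)   # deploy the new version
migrate(conn, 1)   # roll back to the previous "good version"
```

Keep in mind that a "down" step which drops a table or column also drops the data in it, so it's worth testing rollbacks against a copy of production data before you rely on them at 3am.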
Write good error pages
Communication is key. Good error pages let the user know that it's not their fault. The last thing you want is for your user to feel stupid after trying to submit a form and your service not responding to the request.
On top of that, a bit of communication goes a long way towards reducing the number of users messaging you when things go wrong.
Avoid alert fatigue
I previously wrote about saving your team from alert fatigue, but the gist of it is that you should tailor your alerting to the business impact of the outage.
If you've got a well-tested application where you know nothing should go wrong, you should monitor your uptime every minute, and send an alert immediately via phone call/SMS when the app becomes unresponsive.
On the other hand, if your business's legacy app becomes unreachable at the same time each day due to a database backup job running, you can set up your alerts to send after several minutes of downtime.
For more general alerts, like disk space reaching 75% on your server, or a database backup job failing, send the alerts to a Slack/Discord/Microsoft Teams channel as an FYI.
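The three cases above boil down to two settings per check: where the alert goes, and how long to wait before sending it. Here's a minimal sketch of that idea; the `Check` class, the `route_alert` function, and the specific grace periods are assumptions for illustration, and most monitoring tools expose something similar as configuration rather than code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Check:
    name: str
    channel: str        # "page" (phone call/SMS) or "chat" (Slack/Discord/Teams)
    grace_minutes: int  # how long the problem may last before anyone is told


def route_alert(check: Check, minutes_failing: int) -> Optional[str]:
    """Return the channel to notify, or None if we're still inside the grace period."""
    if minutes_failing < check.grace_minutes:
        return None
    return check.channel


main_app = Check("main app uptime", channel="page", grace_minutes=0)
legacy_app = Check("legacy app uptime", channel="page", grace_minutes=10)
disk_space = Check("disk space over 75%", channel="chat", grace_minutes=0)

print(route_alert(main_app, 0))     # page
print(route_alert(legacy_app, 3))   # None
print(route_alert(legacy_app, 10))  # page
print(route_alert(disk_space, 0))   # chat
```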
Write yourself a runbook
Keep a list of steps to run through when your service has issues. It can act as a checklist so you don't miss anything when you inevitably try to fix things after getting an alert at 3am.
As an added bonus, when you start building your dev team, you'll already have a procedure in place for what to do when the service goes down - they won't have to repeat the same mistakes.
Mike Julian's Practical Monitoring recommends the following contents for a runbook:
- What is this service, what does it do?
- Who is responsible for it?
- What dependencies does it have?
- What does the infrastructure for it look like?
- What metrics and logs does it emit, and what do they mean?
- What alerts are set up for it, and why?
As well as that, make sure your alerts contain links to your runbook, so you're not frantically searching for it.
Automate your recovery
If parts of your runbook involve running a series of commands in a terminal, chances are you've got yourself a script you can automatically run without waking up a human.
It's worth keeping in mind - if the system can automatically recover, it's not worth waking up a human to check what happened.
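Here's one way that could look: try the runbook's fix automatically a few times, and only wake someone up if it doesn't work. This is a sketch under my own assumptions; `recover_or_escalate` and its parameters are hypothetical, and the health check, restart command, and paging call are passed in as functions because they'll be specific to your setup.

```python
import time
from typing import Callable


def recover_or_escalate(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    page_human: Callable[[str], None],
    max_attempts: int = 3,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the automated fix a few times; only page a human if it keeps failing."""
    if is_healthy():
        return "healthy"
    for _ in range(max_attempts):
        restart()            # e.g. the commands from your runbook
        sleep(wait_seconds)  # give the service a moment to come back up
        if is_healthy():
            return "recovered"
    page_human(f"automatic recovery failed after {max_attempts} attempts")
    return "escalated"
```

It's still worth logging every automatic recovery somewhere you'll see it during working hours, since a service that restarts itself every night is telling you something.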
Keep track of what goes wrong
Over time, you'll get a chance to observe your system in action. You'll find that some parts of the codebase cause more outages than others, which will nudge you to write better tests for them, refactor them to a better implementation, or add better exception handling or validation.
Schedule yourself some time for these fixes, or at least add tasks to the backlog so you can track the effort you put into fixing these hot zones.
Run post-mortem meetings
You need to know what caused an issue to ensure you don't repeat the mistake.
After you resolve incidents, be sure to give yourself some time (it doesn't have to be immediately after the incident) to review root causes, and come up with actions to ensure the incident doesn't happen again.
As your team grows, you want to foster a blameless culture around post-mortems. If your team fears being in trouble for mistakes, they'll either try to hide them or downplay their impact. Google's SRE book has some handy tips on building a postmortem culture.