How to handle monitoring as a solo founder
When you're a single person running a system with thousands of users or more, it can be pretty daunting to think about going on holiday, or even relaxing for a weekend.
"What if it goes down, and I'm not there to fix it?!" you ask yourself.
While you can never guarantee that nothing will go wrong, you can take some steps to minimise the risk.
Table of contents
- Before you start
- While running the business
Pick boring technology: something that has been around for ages, with a community behind it. This mainly comes to mind for databases, but it also applies to the frameworks you choose. That way when things go wrong, you won't be the only person in the world trying to solve your problem.
An early startup I worked on used a NoSQL database for clearly very relational data. Even if you ignore the fact that we weren't using the right tool for the job, we constantly ran into issues we couldn't find answers to on Stack Overflow or Google.
Since that project, I've pushed for using Postgres (released in 1996) as much as possible, and in the last five years the only outage I've seen was due to under-provisioning (we tried to use a lower RAM server to save on running costs).
In a similar vein to picking boring technology, using technology you already know means the only new thing you have to figure out is running the business.
For example, if every project you've worked on in the last year was built in Rails + React, maybe just use Rails + React to build your project. Building features is time-intensive as it is, without having to worry about whether you're doing it the "<INSERT_NEW_LANGUAGE_HERE> way".
Keep your architecture simple. Running your service on a sharded multi-region database probably isn't the best idea for a team of two devs: a single large database instance with redundancy would work just as well (up to a certain point).
While horizontal scaling may technically be "better" or "correct", when you have fewer resources to investigate problems it's sometimes easier to just bump your database up a few resource tiers (scaling vertically) than to spread it across multiple servers (scaling horizontally). Of course, if it looks like your service will need to handle hundreds of thousands of users, perhaps then you should consider horizontal scaling.
If you find yourself constantly getting alerted and having to fight fires to keep your service running, it might be time to simplify your architecture.
Sure, you can run your database on any VPS hosting provider in the world for cheap, but then it's on you to handle:
- Keeping the server updated
- Monitoring the server
- Ensuring the backup script runs
- Keeping backups, deleting old ones
- Fighting fires when things go wrong
Alternatively, you can outsource the actual running of your database to Heroku or AWS. When things go wrong, you then have access to their immense support resources to resolve the issue.
I take a similar approach with payments (Stripe), email (Mailgun/Postmark/ConvertKit), and tracking errors (Sentry/Bugsnag/Rollbar).
Generally speaking, deploying before heading to lunch, dinner, a holiday, or a weekend trip isn't a great idea. You need to ensure you've got time to roll back or hotfix the change you made.
Some people prefer only releasing at quiet times of the day, when the system doesn't have many users. The upside of this is that any resulting outage would impact the fewest users, but the downside is that it probably won't be the best time for you.
Depending on the number of users you have, it might be worth looking into feature flags. Feature flags let you separate the deployment stage from the roll-out stage when it comes to delivering features.
In other words, you "deploy" your change with the feature flag turned off, turn it on for a small subset of users, and once you've checked that it works for them, roll out the change to your whole userbase.
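As a rough illustration of that gradual roll-out, here is a minimal sketch of a percentage-based flag check in Python. The function name `is_enabled`, the flag name, and the idea of bucketing users by hashing their ID are my own assumptions for the example; a hosted feature flag service or a library for your framework will do this (and more) for you.

```python
import hashlib


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the flag's rollout percentage.

    Hypothetical helper: hashing flag + user gives each user a stable
    bucket from 0 to 99, so the same user keeps seeing the same behaviour
    as the percentage is raised.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


if __name__ == "__main__":
    # Deployed but switched off: nobody sees the new checkout.
    print(is_enabled("new-checkout", "user-42", 0))
    # Fully rolled out: everybody sees it.
    print(is_enabled("new-checkout", "user-42", 100))
```

Because the bucket depends only on the flag name and the user ID, raising the percentage from 5 to 50 to 100 only ever adds users; nobody who already had the feature loses it along the way.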
Sometimes the fastest way to resolve an outage is to roll back the service to the last known "good version". A key part of this is to ensure your database migrations work in both directions: both when deploying the new version, and when tearing down the new version to release a previous version.
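To make "both directions" concrete, here is a small sketch using Python's built-in `sqlite3` module. The `MIGRATIONS` list, the table names, and the `migrate_to` helper are all made up for the example; in a real project you'd lean on your framework's migration tool rather than rolling your own.

```python
import sqlite3

# Hypothetical migrations: every "up" step has a matching "down" step.
MIGRATIONS = [
    {
        "version": 1,
        "up": "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)",
        "down": "DROP TABLE users",
    },
    {
        "version": 2,
        "up": "CREATE TABLE invoices (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL)",
        "down": "DROP TABLE invoices",
    },
]


def current_version(conn: sqlite3.Connection) -> int:
    # SQLite keeps a free-form integer per database that we use as the schema version.
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate_to(conn: sqlite3.Connection, target: int) -> None:
    """Apply "up" steps to move forward, or "down" steps in reverse to roll back."""
    version = current_version(conn)
    if target > version:
        steps = [(m["up"], m["version"])
                 for m in MIGRATIONS if version < m["version"] <= target]
    else:
        steps = [(m["down"], m["version"] - 1)
                 for m in reversed(MIGRATIONS) if target < m["version"] <= version]
    for sql, new_version in steps:
        conn.execute(sql)
        # PRAGMA values can't be bound as parameters; new_version is always an int.
        conn.execute(f"PRAGMA user_version = {new_version}")
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    migrate_to(conn, 2)  # deploy the new version
    migrate_to(conn, 1)  # roll back to the previous release
    print(current_version(conn))
```

The useful habit is testing the "down" path before you need it: run the migration forward and back against a copy of your data as part of the release, not for the first time during an outage.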
Communication is key. Good error pages let the user know that it's not their fault. The last thing you want is for your user to feel stupid after trying to submit a form and your service not responding to the request.
On top of that, a bit of communication goes a long way towards reducing the number of users messaging you when things go wrong.
I previously wrote about where to send your monitoring alerts, but the gist of it is that you should tailor your alerting to the business impact of the outage.
If you've got a well-tested application where you know nothing should go wrong, you should monitor your uptime every minute, and send an alert immediately via phone call/SMS when the app becomes unresponsive.
On the other hand, if your business's legacy app becomes unreachable at the same time each day for 2 minutes due to a database backup job running, you can set up your alerts to send after 4 minutes of downtime (shameless plug: this is a feature OnlineOrNot offers).
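If you're curious what "alert after 4 minutes of downtime" looks like as logic, here is a minimal sketch. The `DowntimeTracker` class and its field names are hypothetical, and this isn't how any particular monitoring product implements it; it just shows the idea of ignoring short blips.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DowntimeTracker:
    """Hypothetical helper: alert only once downtime exceeds a threshold."""

    alert_after_seconds: float
    down_since: Optional[float] = None
    alerted: bool = False

    def record_check(self, is_up: bool, now: float) -> bool:
        """Record one uptime check; return True when an alert should be sent."""
        if is_up:
            # Recovered: forget the outage so the next one starts fresh.
            self.down_since = None
            self.alerted = False
            return False
        if self.down_since is None:
            self.down_since = now
        if not self.alerted and now - self.down_since >= self.alert_after_seconds:
            self.alerted = True  # alert once per outage, not on every failed check
            return True
        return False


if __name__ == "__main__":
    tracker = DowntimeTracker(alert_after_seconds=240)
    # A 2-minute blip (checks once a minute) never reaches the threshold.
    for second, is_up in [(0, False), (60, False), (120, True)]:
        print(second, tracker.record_check(is_up, now=second))
```

Setting `alert_after_seconds=0` gives you the "alert immediately" behaviour from the well-tested app case: the very first failed check triggers the alert.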
For more general alerts, like disk space reaching 75% on your server, or a database backup job failing, send the alerts to Slack/Discord/Microsoft Teams as an FYI.
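Put together, the routing rule is short enough to write down. Here's a sketch of it; the `Alert` and `Channel` types and the `users_affected` field are names I've invented for the example, and your monitoring tool will most likely let you configure this instead of coding it.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    PHONE = "phone call/SMS"
    CHAT = "Slack/Discord/Microsoft Teams"


@dataclass(frozen=True)
class Alert:
    name: str
    users_affected: bool  # hypothetical field: can users feel this right now?


def route(alert: Alert) -> Channel:
    """Wake someone up only when users are affected; everything else is an FYI."""
    return Channel.PHONE if alert.users_affected else Channel.CHAT


if __name__ == "__main__":
    print(route(Alert("app unresponsive", users_affected=True)).value)
    print(route(Alert("disk space at 75%", users_affected=False)).value)
```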
Keep a list of steps to run through when your service has issues. It can act as a checklist so you don't miss anything when you inevitably try to fix things after getting an alert at 3am.
As an added bonus, when you start building your dev team, you'll already have a procedure in place for what to do when the service goes down, so they won't have to repeat the same mistakes.
Mike Julian's Practical Monitoring recommends the following contents for a runbook:
- What is this service, what does it do?
- Who is responsible for it?
- What dependencies does it have?
- What does the infrastructure for it look like?
- What metrics and logs does it emit, and what do they mean?
- What alerts are set up for it, and why?
As well as that, make sure your alerts contain links to your runbook, so you're not frantically searching for it.
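One way to keep the runbook and the alert tied together is to store the runbook's location next to the alert definition. A sketch, assuming a made-up `Runbook` structure that mirrors the questions above and an example URL that doesn't point anywhere real:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Runbook:
    """Hypothetical structure mirroring the runbook questions listed above."""

    service: str            # what is this service, what does it do?
    owner: str              # who is responsible for it?
    dependencies: List[str]
    infrastructure: str
    metrics_and_logs: str
    alerts: str             # what alerts are set up for it, and why?
    url: str                # where the full runbook lives (assumed location)


def format_alert(summary: str, runbook: Runbook) -> str:
    """Build an alert message that always carries the runbook link."""
    return (
        f"[{runbook.service}] {summary}\n"
        f"Owner: {runbook.owner}\n"
        f"Runbook: {runbook.url}"
    )


if __name__ == "__main__":
    api = Runbook(
        service="API",
        owner="me",
        dependencies=["Postgres", "Stripe"],
        infrastructure="single app server + managed database",
        metrics_and_logs="request latency, error rate, app logs",
        alerts="uptime check every minute",
        url="https://example.com/runbooks/api",
    )
    print(format_alert("uptime check failed", api))
```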
If parts of your runbook involve running a series of commands in the Terminal, chances are you've got yourself a script you can automatically run without waking up a human.
It's worth keeping in mind - if the system can automatically recover, it's not worth waking up a human to check what happened.
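Here's a rough sketch of that idea: run the runbook's commands in order, check the service afterwards, and only page someone if recovery didn't work. The function `try_auto_recover` and its `is_healthy` and `page_human` callbacks are placeholders I've made up; you'd plug in your own health check and alerting.

```python
import subprocess
from typing import Callable, Sequence


def try_auto_recover(
    steps: Sequence[Sequence[str]],
    is_healthy: Callable[[], bool],
    page_human: Callable[[str], None],
) -> bool:
    """Run runbook commands in order; page a human only if recovery fails."""
    for command in steps:
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except OSError as error:
            # The command couldn't even be started (e.g. missing binary).
            page_human(f"Could not run {' '.join(command)}: {error}")
            return False
        if result.returncode != 0:
            page_human(f"Recovery step failed: {' '.join(command)}\n{result.stderr}")
            return False
    if is_healthy():
        return True  # recovered on its own, nobody gets woken up
    page_human("Recovery steps ran, but the service is still unhealthy")
    return False
```

You would call it with the same commands you'd otherwise type at 3am, for example a service restart followed by a cache clear, and let it stay quiet whenever those are enough.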
Over time, you'll get a chance to observe your system in action. You'll find some parts of the codebase cause more outages than others, which will nudge you to write better tests for them, refactor to a better implementation, or add better exception handling or validation.
Schedule yourself some time for these fixes, or at least add tasks to the backlog that let you track your progress on fixing these hot zones.
You need to know what caused an issue to ensure you don't repeat the mistake.
After you resolve incidents, be sure to give yourself some time (it doesn't have to be immediately after the incident) to review root causes, and come up with actions to ensure the incident doesn't happen again.
As your team grows, you want to foster a blameless culture around post-mortems. If your team fears being in trouble for mistakes, they'll either try to hide them or downplay their impact. Google's SRE book has some handy tips on building a postmortem culture.