Guidelines for writing better runbooks
You're getting paged for the latest service your team built and deployed. You have no idea how to debug it, and it's the first time you've been paged for the service.
"No worries, I'll just go through the list of steps in the runbook, I'm sure it'll be fine... wait, runbook just says 'Ask Dave'?!"
You start paging the developers who built the service (especially Dave), and over a few anxious hours, you and the team manage to resolve the issue, and go back to bed.
In this article, we're going to look at how to improve our runbooks, so that no one on your team has to experience the scenario above.
These guidelines aren't in any particular order, and come from a few years' experience working at startups, scale-ups, and enterprises.
The first runbook your team writes is probably going to suck, but that's okay!
Getting started is the first step towards being good at something.
You're going to want to put your runbooks somewhere easy to edit. While a git repo is fine, something like Notion, Confluence or Google Docs is going to be easier to update.
Note that an internal wiki/repo probably isn't the best idea - particularly if your team needs VPN/office network access to read the runbooks, and your VPN goes down.
As the authors of your runbooks, you're affected by what's known as the curse of knowledge - essentially, you assume people will have the background knowledge to understand what to do.
Newcomers to your team aren't yet affected by this, and can help you uncover assumed knowledge in your runbooks. Have them pair with your on-call team members as they resolve incidents - they'll get more comfortable with on-call before their first roster, and will notice the undocumented steps that your team just knows.
As I mentioned earlier, while the initial goal is to document the manual process required to resolve an incident, a later goal is to automate those steps.
The more often you use a runbook, the more evidence you gather that perhaps parts of the runbook should be automated.
A simple table at the bottom of the runbook would help with this:
| Date last used | Used by |
|----------------|---------|
|                |         |
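As a rough sketch of how that usage log could feed an automation decision, the snippet below counts recent uses of each runbook and flags the ones used often enough to be worth automating. The record layout, the runbook names, and the threshold of three uses in 90 days are all assumptions made for illustration; adjust them to whatever your team actually records.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical usage log: one (runbook name, date last used, used by) entry per use.
USAGE_LOG = [
    ("restart-payment-worker", date(2024, 3, 1), "alice"),
    ("restart-payment-worker", date(2024, 3, 9), "bob"),
    ("restart-payment-worker", date(2024, 3, 20), "alice"),
    ("rotate-api-keys", date(2024, 1, 15), "dave"),
]


def automation_candidates(log, today, window_days=90, min_uses=3):
    """Return runbooks used at least `min_uses` times in the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    # Count only the uses that fall inside the window.
    recent_uses = Counter(name for name, used_on, _ in log if used_on >= cutoff)
    return sorted(name for name, count in recent_uses.items() if count >= min_uses)


print(automation_candidates(USAGE_LOG, today=date(2024, 4, 1)))
```

A runbook that keeps showing up in this list is a reasonable first candidate for scripting, since the log is the evidence that the manual steps are being repeated.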
People shouldn't need to interpret what the steps in your runbook could mean. Write as though the person reading it just woke up at 2am, and just wants to go back to bed.
The steps should explicitly tell them what to do.
For example, instead of writing:
- Investigate issues in Splunk
a better runbook step would be:
- Log in to the Splunk dashboard at URL
- Run this query: "QUERY GOES HERE"
- In case of weird results, run this command: "COMMAND HERE"
- Weird results look like this: "SCREENSHOT GOES HERE"
It helps to add key phrases from your alerts, or even the exact error message thrown by your system.
# How to fix: "Error: Query defined in resolvers, but not in schema"
1. If you see this error in environment X, you need to run this command: ...
It'll make it easier to find a solution to the exact problem the system is having (by including the searched phrase in your heading, you'll also game certain documentation search engines to rank the terms higher).
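To illustrate why exact error text in a heading helps, here is a minimal sketch of a heading lookup: given the message from an alert, it returns the runbooks whose heading contains that text. The index, the second heading, and the file paths are hypothetical, and a real documentation search engine will do something more sophisticated than a substring match.

```python
# Hypothetical runbook index: heading text mapped to where the runbook lives.
RUNBOOK_HEADINGS = {
    'How to fix: "Error: Query defined in resolvers, but not in schema"': "runbooks/graphql-schema.md",
    "How to fix: disk usage above 90% on database hosts": "runbooks/db-disk.md",
}


def find_runbooks(alert_text, headings=RUNBOOK_HEADINGS):
    """Return runbook locations whose heading contains the alert text (case-insensitive)."""
    needle = alert_text.strip().lower()
    if not needle:
        # An empty search would match every heading, so return nothing instead.
        return []
    return [path for heading, path in headings.items() if needle in heading.lower()]


print(find_runbooks("Query defined in resolvers, but not in schema"))
```

Pasting the error straight from the alert only works if the heading quotes the message word for word, which is the point of copying it exactly rather than paraphrasing it.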