Writing your first runbooks
Incidents are a stressful time for your team: your service isn't working the way you expect and your customers want to know what's going on. The last thing you want is for your team to improvise its entire response.
Google's SRE book has great overall tips for incident management, one of which is "develop(ing) and document(ing) your incident management procedures in advance". This article dives into that advice.
Table of Contents
- What I mean by "runbook"
- When to write your first runbook
- What to write
- Do not allow perfectionism to block you from releasing your runbooks
- Use new team members as a starting point
- Do not worry about automation (yet)
Depending on where you work, you'll hear different words describing relatively similar concepts around runbooks:
- Standard Operating Procedure (or SOP): a standard set of steps required by the business to perform a process or task
- Runbook: a set of steps required to respond to an incident or alert
- Playbook: the overall set of steps for responding to an incident, which can involve several runbooks
You shouldn't worry too much about terminology. Just know that we're going to be writing instructions for the manual process of resolving an incident with a system your team runs.
Ideally, you would write a runbook as you're deploying a service for the first time. You would go through the manual process you take to resolve any issues that come up, and write down each step, in detail, so anyone on your team can follow it while half-asleep.
Realistically though, chances are you'll be writing your first runbook after something goes wrong. For a lot of organisations, it takes an incident to really examine the way you work and learn where the potential improvements are.
Most commonly, something will go wrong, your on-call engineer will take a few minutes to figure out what happened and page the team members who know what to do, and the team will then take a few hours to resolve the issue.
During the post-incident review, someone will ask "why did it take so long to fix?", and one of the inevitable answers will be "we didn't have a runbook in place".
Honestly, anything is better than nothing. As time goes on, and your team gains operational experience, your runbooks will improve.
To start, document the manual steps necessary to get your service working again. You want to tell folks exactly what to do:
- Log in to the Splunk dashboard at URL
- Run this query: "QUERY GOES HERE"
- In case of weird results, run this command: "COMMAND HERE"
- Weird results look like this: "SCREENSHOT GOES HERE"
In other words, favour explicit steps over implicit steps.
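One way to keep steps explicit is to store each one as structured data, so a step always carries a concrete instruction and, where relevant, the exact command and the expected result. The sketch below is only an illustration: the `Step` and `render` names and the field layout are assumptions made for this example, and the values are the same placeholders used in the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One explicit runbook step. All field names here are hypothetical."""
    action: str                      # what to do, in plain words
    command: Optional[str] = None    # exact query or command to run, if any
    expected: Optional[str] = None   # what the result should (or should not) look like


def render(title: str, steps: List[Step]) -> str:
    """Render the steps as numbered plain text that is easy to follow."""
    lines = [title]
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. {step.action}")
        if step.command:
            lines.append(f"   Run: {step.command}")
        if step.expected:
            lines.append(f"   Expect: {step.expected}")
    return "\n".join(lines)


# Placeholder values only; replace them with your own URL, query, and command.
steps = [
    Step("Log in to the Splunk dashboard at URL"),
    Step("Run this query", command="QUERY GOES HERE"),
    Step("In case of weird results, run this command",
         command="COMMAND HERE",
         expected="SCREENSHOT GOES HERE"),
]

if __name__ == "__main__":
    print(render("Example runbook", steps))
```

Keeping the command and the expected result as separate fields makes it obvious when a step is missing one of them.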
If any of your runbook steps say "Do the usual troubleshooting procedure", you need to rewrite that step, following the advice above.
If any step of your runbook says something like "If you need to do X, follow the procedure here...", you need to explain the decision-making process that tells the reader whether they need to do X.
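If your runbooks live in plain text, a small check can flag steps that are likely to be implicit. This is a rough sketch, not a real linter: the phrase list in `VAGUE_PATTERNS` and the `find_vague_steps` function are assumptions for illustration, and a match only means a human should take a second look at that step.

```python
import re
from typing import List

# Hypothetical phrases that often signal an implicit step; extend for your team.
VAGUE_PATTERNS = [
    r"\busual\b",
    r"\bas needed\b",
    r"\bif you need to\b",
    r"\bfollow the procedure\b",
]


def find_vague_steps(steps: List[str]) -> List[str]:
    """Return the steps that match any vague pattern, in their original order."""
    flagged = []
    for step in steps:
        if any(re.search(p, step, flags=re.IGNORECASE) for p in VAGUE_PATTERNS):
            flagged.append(step)
    return flagged


if __name__ == "__main__":
    draft = [
        "Do the usual troubleshooting procedure",
        "If you need to do X, follow the procedure here...",
        "Run this query: QUERY GOES HERE",
    ]
    for step in find_vague_steps(draft):
        print(f"Rewrite this step: {step}")
```

With the draft above, the first two steps are flagged and the explicit query step is not.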
Depending on your organisation, you may have stakeholders who want every aspect of the system documented in the runbook before it's "ready".
Without feedback on using the runbook (such as during real incidents), you'll quickly start getting diminishing returns on the time you spend trying to perfect your runbooks.
You're better off spending that time resolving issues that cause common alerts to fire for your on-call engineers.
New team members are a gift, both for getting started and for improving your runbooks. They aren't yet affected by the curse of knowledge, and tend to be curious about how things work.
Have them sit in on incidents, join incident response chat rooms, and write down any questions they have as the incident is worked on. The questions they come up with are perfect starting points for a runbook.
Automating your runbooks comes later (chances are, it won't cover every single step of your runbook anyway).
To start with, though, document the entire manual procedure before you begin to automate the easiest steps.
In short: write today, automate tomorrow.
If you're early in your incident response journey, your team isn't ready for automation. Your team likely has too much tribal knowledge, and needs to work on documenting their manual processes first.
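The "write today, automate tomorrow" approach can be sketched as a runbook where every step is written down as a manual instruction first, and automation is attached to individual steps later. This is a minimal illustration under assumptions: the `Step` class, the `automate` hook, and `run_query_stub` are hypothetical names, and the stub returns a fixed string instead of calling any real system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """A documented manual step, optionally with automation added later."""
    instruction: str                               # the written, manual procedure
    automate: Optional[Callable[[], str]] = None   # hypothetical hook, filled in later


def run(steps: List[Step]) -> List[str]:
    """Walk the runbook: call automation where it exists, else show the manual text."""
    log = []
    for step in steps:
        if step.automate is not None:
            log.append(f"AUTO: {step.instruction} -> {step.automate()}")
        else:
            log.append(f"MANUAL: {step.instruction}")
    return log


def run_query_stub() -> str:
    """Stand-in for real automation; returns a fixed string so the sketch runs."""
    return "placeholder result"


runbook = [
    # The easiest step has been automated...
    Step("Run this query: QUERY GOES HERE", automate=run_query_stub),
    # ...while the judgment call stays manual, but is still written down.
    Step("In case of weird results, run this command: COMMAND HERE"),
]

if __name__ == "__main__":
    for line in run(runbook):
        print(line)
```

Because the manual instruction is a required field, a step can never be automated without first being documented.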