Improving your on-call schedule with runbooks

Incidents are a stressful time for your team: your service isn't working the way you expect and your customers/stakeholders want to know what's going on. The last thing you want to do is let your team improvise everything when it comes to responding to incidents.

Google's own SRE book has great overall tips for incident management, including "develop(ing) and document(ing) your incident management procedures in advance". This article dives into that advice.

Table of Contents

What is a runbook
Your first runbook
Improving your existing runbooks

What is a runbook

Depending on where you work, you'll hear different words describing relatively similar concepts around runbooks:

- Standard Operating Procedure (or SOP): a standard set of steps required by the business to perform a process or task
- Runbook: a set of steps required to respond to an incident or alert
- Playbook: the overall set of steps for responding to an incident, which can involve several runbooks

You shouldn't worry too much about terminology. Just know that we're going to be writing instructions on how to perform the manual process of resolving an incident with a system your team runs.

Your first runbook

When to write your first runbook

Ideally, you would write a runbook as you're deploying a service for the first time. You would go through the manual process you take to resolve any issues that come up, and write down each step, in detail, so anyone on your team can follow it while half-asleep.

Realistically though, chances are you'll be writing your first runbook after something goes wrong. For a lot of organisations, it takes an incident to really examine the way you work and learn where the potential improvements are.

Most commonly, something will go wrong. It'll take your on-call engineer a few minutes to figure out what happened and page the team members who know what to do, and then it'll take the team a few hours to resolve the issue.

During the post-incident review, someone will ask "why did it take so long to fix?", and one of the inevitable answers will be "we didn't have a runbook in place". One of your team members will go off on a search for how to create a runbook, and they'll probably end up reading this article.

How to write a runbook

Honestly, anything is better than nothing. As time goes on, and your team gains operational experience, your runbooks will improve as gaps are found.

To start, you want to be documenting the manual steps necessary to get your service working again. You want to tell folks exactly what to do:

- Log in to the Splunk dashboard at URL
- Run this query: "QUERY GOES HERE"
- In case of weird results, run this command: "COMMAND HERE"
  - Weird results look like this: "SCREENSHOT GOES HERE"

In other words, favour explicit steps over implicit steps.

If any of your runbook steps say "Do the usual troubleshooting procedure", you need to rewrite that step, following the advice above.

If any step of your runbook contains "If you need to do X, follow the procedure here...", you need to explain the decision-making process that helps the reader figure out whether they need to do X.
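One way to catch such steps early is a small check that flags vague wording for a human to review. The following is a minimal sketch in Python, not an established tool: `VAGUE_PHRASES` and `find_vague_steps` are hypothetical names, and the phrase list only covers the two examples above, so you would extend it with wording from your own runbooks.

```python
import re

# Hypothetical list of phrases that hint at an implicit step.
# These two come from the examples in this article; add your own.
VAGUE_PHRASES = [
    r"usual troubleshooting",
    r"if you need to",
]


def find_vague_steps(steps):
    """Return (step_number, step_text) pairs for steps matching a vague phrase."""
    flagged = []
    for number, text in enumerate(steps, start=1):
        if any(re.search(pattern, text, re.IGNORECASE) for pattern in VAGUE_PHRASES):
            flagged.append((number, text))
    return flagged


if __name__ == "__main__":
    runbook_steps = [
        "Log in to the Splunk dashboard",
        "Do the usual troubleshooting procedure",
        "If you need to do X, follow the procedure here",
    ]
    for number, text in find_vague_steps(runbook_steps):
        print(f"Step {number} may need rewriting: {text}")
```

A flagged step isn't necessarily wrong. The check only points a reviewer at steps that may be leaning on assumed knowledge.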

Do not allow perfectionism to block you from releasing your runbooks

Depending on your organisation, you may have stakeholders that'll want every aspect of the system documented in the runbook before it's "ready".

Without feedback on using the runbook (such as during real or simulated incidents), you'll quickly start getting diminishing returns on the time you spend trying to perfect your runbooks.

You're better off spending that time resolving issues that cause common alerts to fire for your on-call engineers.

Use new team members as a starting point

New team members are a gift for getting started with, and improving, your runbooks. They aren't yet affected by the curse of knowledge, and tend to be curious about how things work.

Have them sit in on incidents, join incident response chat rooms, and have them write down any questions they have as the incident is worked on. The questions they come up with are perfect starting points for a runbook.

Do not worry about automation (yet)

Automating your runbooks comes later (chances are, it won't cover every single step of your runbook anyway).

To start with though, you're going to want to document the entire manual procedure before you begin to automate the easiest steps.

In short: write today, automate tomorrow.

If you're early in your incident response journey, your team isn't ready for automation. Your team likely has too much tribal knowledge, and needs to work on documenting their manual processes first.

Improving your existing runbooks

Now imagine it's 2am.

You're getting paged for the latest service your team built and deployed. You have no idea how to debug it, and it's the first time you've been paged for the service.

"No worries, I'll just go through the list of steps in the runbook, I'm sure it'll be fine... wait, runbook just says 'Ask Dave'?!"

You start paging the developers who built the service (especially Dave), and over a few anxious hours, you and the team manage to resolve the issue, and go back to bed.

In the rest of this article, we're going to improve your runbooks, so no one in your team needs to experience the above scenario.

Optimize for editing

The first runbook your team writes is probably going to suck, but that's okay!

Getting started is the first step towards being good at something.

You're going to want to put your runbooks somewhere easy to edit. While a git repo is fine, something like Notion, Confluence or Google Docs is going to be easier to update.

Note that a self-hosted wiki/repo probably isn't the best idea - particularly if your team needs VPN/office network access to read the runbooks, and your VPN goes down.

Encourage newcomers to use your runbooks and make edits

As mentioned earlier, the authors of your runbooks are affected by the curse of knowledge: they will assume readers have the background knowledge to understand what to do.

Newcomers to your team aren't yet affected by this, and can help you uncover assumed knowledge in your runbooks. Have them pair with your on-call team members as they resolve incidents - they'll get more comfortable with on-call before their first roster, and will notice undocumented steps that your team just knows.

This will also help the team feel comfortable with editing your runbooks - they're not perfect, and will need changing over time.

Keep track of how often you need to use your runbooks

As I mentioned earlier, while the initial goal is to document the manual process required to resolve an incident, a later goal is to automate those steps.

The more often you use a runbook, the more evidence you gather that perhaps parts of the runbook should be automated.

A simple table at the bottom of the runbook would help with this:

| Date last used | Used by |
| -------------- | ------- |
| 2022/01/01     | rozenmd |
| ...            | ...     |

Then, at a regular frequency (say, once a month or so), review your runbooks for automation opportunities.
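If you collect those usage rows in one place, the monthly review can start from a short list of the most-used runbooks. The following is a rough sketch in Python under some assumptions: the function name `automation_candidates`, the 30-day window, the threshold of three uses, and the runbook names in the example are all made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def automation_candidates(usage_log, today, window_days=30, min_uses=3):
    """Return runbook names used at least `min_uses` times in the last `window_days` days.

    `usage_log` is a list of (runbook_name, date_used) tuples, for example
    gathered from the usage table at the bottom of each runbook.
    """
    cutoff = today - timedelta(days=window_days)
    recent = Counter(
        name for name, used_on in usage_log if cutoff <= used_on <= today
    )
    return sorted(name for name, count in recent.items() if count >= min_uses)


if __name__ == "__main__":
    # Hypothetical usage data for two runbooks.
    log = [
        ("db-failover", date(2022, 1, 1)),
        ("db-failover", date(2022, 1, 10)),
        ("db-failover", date(2022, 1, 20)),
        ("cache-flush", date(2022, 1, 5)),
    ]
    print(automation_candidates(log, today=date(2022, 1, 31)))
```

The threshold is a judgment call. A runbook used once a year at 2am may still be worth automating if the manual steps are risky.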

Explicitly tell the reader what to do

People shouldn't need to interpret what the steps in your runbook could mean. Write as though the person reading it just woke up at 2am, and just wants to go back to bed.

The steps should explicitly tell them what to do. If you're starting with existing runbooks, it's worth auditing them to ensure the content is explicit, rather than implicit.

Make your runbook easier to search

It helps to add key phrases from your alerts, or even the exact error message thrown by your system.

For example:

# How to fix: "Error: Query defined in resolvers, but not in schema"

1. If you see this error in environment X, you need to run this command:
   ...

Including the error message makes it easier to find a solution to the exact problem the system is having. By putting the searched phrase in your heading, you'll also game certain documentation search engines into ranking the runbook higher for those terms.
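To see why the exact error text matters, here is a minimal sketch in Python of the simplest possible runbook search: a case-insensitive substring match that lists heading matches first. The function `find_runbooks` and the sample runbooks are hypothetical, and real documentation tools rank results in their own ways.

```python
def find_runbooks(error_message, runbooks):
    """Return titles of runbooks that contain the error message.

    `runbooks` maps each runbook title to its full text. Runbooks with the
    message in the title come first, followed by those with it in the body.
    """
    needle = error_message.strip().lower()
    if not needle:
        return []
    title_hits = [title for title in runbooks if needle in title.lower()]
    body_hits = [
        title
        for title, body in runbooks.items()
        if needle in body.lower() and title not in title_hits
    ]
    return title_hits + body_hits


if __name__ == "__main__":
    # Hypothetical runbook collection.
    runbooks = {
        'How to fix: "Error: Query defined in resolvers, but not in schema"':
            "1. If you see this error in environment X, you need to run this command: ...",
        "Cache flush": "Run the flush command.",
    }
    error = "Error: Query defined in resolvers, but not in schema"
    for title in find_runbooks(error, runbooks):
        print(title)
```

Pasting the error message straight from an alert only works if the runbook contains the same text, which is the reason to copy it in verbatim rather than paraphrase it.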