# Improving your team's on-call experience
Your engineers probably dislike going on-call for your services.
Some might even dread it.
It doesn't have to be this way.
With a few changes to how your team runs on-call and deals with recurring alerts, you might even find your team starting to enjoy it (as unimaginable as that sounds).
I wrote this article as a follow-up to Getting over on-call anxiety. While that article covers what individual engineers can do to psychologically prepare themselves for on-call, this one describes what the team can do to improve the on-call experience overall.
Table of Contents
- Pay your staff
- Good processes make a good on-call experience
- Continuous incremental improvement is key
- Have a primary on-call, and a secondary on-call
- Handover between your on-call engineers
## Pay your staff

This one's more for management than engineers, but going on-call overnight is additional work, whether or not an engineer gets paged.
Pay a bonus on top of their regular salary for the week your engineers go on-call, particularly if you expect them to do the job well.
## Good processes make a good on-call experience

I'm not going to sugar-coat it: if your team relies on engineers to regularly improvise while on-call, they're going to have a bad time.
Runbooks, up-to-date documentation, and standard operating procedures give your on-call engineers direction on how to respond to, and escalate, incoming alerts. You don't want them to be figuring this stuff out by themselves at 2am.
If you've already got these in place, have your newest team members review them. Chances are, there are blind spots that you and your team can't see due to familiarity with the services and their docs.
If you're the new member of the team, take the initiative to improve the documentation yourself. Try working through the runbooks and see whether you have enough context to complete each step. A common mistake senior engineers make is assuming their teammates know where to run commands, or where certain repos live (particularly when engineers refer to a service by one name while its codebase lives in a repo with a completely different name).
Your documentation should be as explicit as possible.
## Continuous incremental improvement is key

Treat each alert that fires as an opportunity to improve your service. Ideally, when something breaks, take the time to ensure the same alert never fires again.
Getting paged constantly while on-call is a symptom of a broken system. Either the system is too unstable and needs investment to make it resilient, or your alerts are too noisy and their thresholds need tuning, or they aren't actionable and should be deleted.
Over time, applying long-term fixes to the causes of your alerts will make your service more resilient, and make your team's time on-call a lot less stressful.
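To make that triage concrete, here's a minimal sketch in Python. The alert names and the history list are invented for illustration; in practice you'd pull this data from your paging tool. It buckets a week's alerts by how often they fired and whether anyone actually took action on them:

```python
from collections import Counter

# Hypothetical one-week alert history: (alert_name, action_taken).
# In practice this would come from your paging tool's records.
history = [
    ("disk_usage_high", False),
    ("disk_usage_high", False),
    ("disk_usage_high", False),
    ("api_error_rate", True),
    ("api_error_rate", True),
    ("cert_expiry", True),
]

fired = Counter(name for name, _ in history)
actioned = Counter(name for name, acted in history if acted)

for name, count in fired.items():
    action_rate = actioned[name] / count
    if action_rate == 0:
        verdict = "never actioned: delete the alert or raise its threshold"
    elif count >= 2:
        verdict = "fires repeatedly: invest in a long-term fix"
    else:
        verdict = "healthy: no change needed"
    print(f"{name}: fired {count}x, {verdict}")
```

Even a crude report like this makes it obvious which alerts deserve a long-term fix and which are just noise.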
Google makes this possible for its own SREs by blocking them from spending more than 50% of their time on operational work:
> We cap the amount of time SREs spend on purely operational work at 50%; at minimum, 50% of an SRE's time should be allocated to engineering projects that further scale the impact of the team through automation, in addition to improving the service.
>
> — Google's SRE Book
## Have a primary on-call, and a secondary on-call

To drive the continuous improvement aspect of on-call, it helps to have an engineer on secondary on-call.
While the primary on-call engineer answers pages at any time of day for a week, the secondary on-call engineer's main job is to make the primary's life easier. They can assist the primary with investigations when a particularly bad incident occurs, but their main focus should be improving service reliability (by improving the codebase) and ensuring the primary doesn't get paged for the same reason week after week.
While in the past we ran secondary on-call for only one week at a time, we found that, due to ramp-up time, a month of secondary on-call worked better.
As secondary on-call doesn't answer pages overnight, you can typically assign newer engineers to the role so they get familiar with your team's documentation and runbooks, and can help improve them further.
## Handover between your on-call engineers

You should hold a meeting to hand over on-call responsibilities between engineers as one shift ends and another begins. We typically hold these in the middle of the week, as most major issues for the week have surfaced by then.
In the meeting, the engineer whose shift is ending talks through the alerts that fired, which were resolved, and which still require attention.
Let anyone on your team attend these meetings (particularly new employees), as it helps them get familiar with the service that they'll eventually be on-call for.
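As a sketch of what that handover summary could look like, here's a tiny Python example. The alert records are invented; in practice they'd come from your incident tracker. It splits a week's alerts into "resolved" and "still needs attention" for the meeting:

```python
# Hypothetical week of alerts for the handover meeting: (alert_name, resolved).
week_alerts = [
    ("db_replica_lag", True),
    ("queue_backlog", False),
    ("db_replica_lag", True),
    ("oom_killer_invoked", False),
]

resolved = sorted({name for name, done in week_alerts if done})
open_items = sorted({name for name, done in week_alerts if not done})
# An alert that was resolved once but fired again unresolved stays on the open list.
resolved = [name for name in resolved if name not in open_items]

print("Resolved this week:")
for name in resolved:
    print(f"  - {name}")
print("Still needs attention (handing over):")
for name in open_items:
    print(f"  - {name}")
```

Writing the summary down (rather than relying on memory) also gives the incoming engineer something to refer back to mid-week.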
I wrote this article from the perspective of an employee at a large tech company (thousands of engineers) that implemented Google's SRE book while adapting it to its own needs.
In our case, the software engineers who build the service also handle on-call for it.
Be aware that blindly implementing the recommendations here won't automatically make your company successful. Large tech companies adopt these practices as a means of managing their success, not because these practices alone make them successful.
That being said, if your on-call staff are constantly fighting fires, or the on-call experience isn't generally improving week to week, hopefully some of these recommendations will help.