Improving your team's on-call experience
Your engineers probably dislike going on-call for your services.
Some might even dread it.
It doesn't have to be this way.
With a few changes to how your team runs on-call and deals with recurring alerts, you might find your team starting to enjoy it (as unimaginable as that sounds).
I wrote this article as a follow-up to Getting over on-call anxiety. While that article covers what individual engineers can do to psychologically prepare themselves for on-call, this article describes what the team can do to improve the on-call experience overall.
This point is important, which is why it's the first tip in this article: you cannot hire an SRE/DevOps team to magically make your operational problems go away.
If your engineers build a service and it's business-critical, they need to be on-call for it.
Handling on-call this way incentivizes your team to improve the service's operations over time, and reduces the risk of a "just throw it over the fence to the ops folks" mentality that can develop when a team isn't on the hook for their own code.
On-call shouldn't be an extra responsibility to respond to alerts stacked on top of your team's regular scheduled work. When folks are rostered on-call, take them off that work entirely (no sprint work, no picking up tasks from the backlog, nothing).
Instead, they should be working on tasks that improve the quality of life for on-call.
This stream of work comes up naturally if you treat each alert that fires as an opportunity to improve your service. When things break, take the time to ensure the same alert never fires again, especially if it was unactionable.
A quick note on unactionable alerts: any alert that does not have a single, clear human response should be deleted. If an alert is flaky, your team should prioritize fixing the underlying issue as soon as possible.
Getting paged constantly while on-call is a symptom of a broken system. Either the system is too unstable and needs time invested in making it resilient, or your alerts are too noisy and their thresholds need tweaking, or they require no action and should be deleted.
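To decide which alerts deserve that investment first, it helps to rank them by how often they page. Here's a minimal sketch in Python, assuming you can export your paging tool's incident history as a list of alert names (the function name and the sample history are made up for illustration):

```python
from collections import Counter

def top_noisy_alerts(pages, n=5):
    """Rank alert names by how often they paged, worst first.

    `pages` is a list of alert-name strings, e.g. pulled from an
    incident-history export (the exact export format is up to your tool).
    """
    return Counter(pages).most_common(n)

# A hypothetical week of pages.
history = [
    "HighLatency", "DiskAlmostFull", "HighLatency",
    "FlakyHealthCheck", "HighLatency", "FlakyHealthCheck",
]
for name, count in top_noisy_alerts(history):
    print(f"{count:>3}x {name}")
```

The top few entries of this list make a natural starting point for the on-call backlog: tune or delete the noisiest alerts first.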
Over time, applying long-term fixes to the causes of your alerts will make your service more resilient, and make your team's time on-call a lot less stressful.
Google make this possible for their own SREs by capping the time they spend on operational work at 50%:
We cap the amount of time SREs spend on purely operational work at 50%; at minimum, 50% of an SRE’s time should be allocated to engineering projects that further scale the impact of the team through automation, in addition to improving the service.
If your team relies on engineers to regularly improvise while on-call, they (and your customers) are going to have a bad time.
Runbooks, up-to-date documentation, and standard operating procedures give your on-call engineers direction on how to respond to, and escalate, incoming alerts. You don't want them to be figuring this stuff out by themselves at 2am.
If you've already got these in place, have your newest team members review them. Chances are, there are blind spots that you and your team can't see due to familiarity with the services and their docs.
If you're the new member of the team, take the initiative to improve the documentation yourself. Try going through the runbooks and see if you have enough context to complete the steps. A common mistake senior engineers make is assuming their colleagues know where to run commands, or where certain repos live (particularly when engineers refer to a service by one name while its codebase lives in a repo with a completely different name).
Your documentation should be as explicit as possible.
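One way to make "explicit" enforceable is to lint your alert definitions for a runbook link, so nobody gets paged by an alert with no documented response. A hypothetical sketch; the `runbook_url` field and the alert-dict shape are assumptions, not any particular tool's schema:

```python
# Sketch: flag alerts that lack a runbook link (e.g. run this in CI).
# The dict shape and `runbook_url` key are assumptions for illustration.

def missing_runbooks(alerts):
    """Return the names of alerts with no runbook link attached."""
    return [a["name"] for a in alerts if not a.get("runbook_url")]

alerts = [
    {"name": "HighLatency",
     "runbook_url": "https://wiki.example.com/runbooks/high-latency"},
    {"name": "DiskAlmostFull"},  # no runbook: should fail the lint
]

for name in missing_runbooks(alerts):
    print(f"alert {name!r} has no runbook link")
```

Failing the build on a non-empty result keeps the documentation requirement from silently eroding as new alerts are added.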
In order to help drive the continuous improvement aspect of on-call, it helps to have an engineer on secondary on-call.
While the primary on-call engineer answers pages at any time of the day for a week, the secondary on-call engineer's role is to make the primary's life easier. They can assist the primary with investigations when a particularly bad incident occurs, but their main focus should be improving service reliability (improving the codebase, writing tests, fixing flaky alerts, deleting unactionable alerts) and generally ensuring the primary doesn't get paged each week for the same reason.
While in the past we ran secondary on-call for only one week at a time, we found that, due to ramp-up time, a month of secondary on-call worked better.
As secondary on-call doesn't answer pages overnight, you can typically assign newer engineers to the role so they get familiar with your team's documentation and runbooks, and can help improve them further.
You should hold a meeting to hand over on-call responsibilities between engineers as one shift ends and another begins. We typically hold these in the middle of the week, as most of the week's major issues have surfaced by then.
In the meeting, the engineer whose shift is ending talks through the alerts that fired, whether they were resolved, and which still require attention. Commonly recurring themes (types of alerts that fire often) should be added to the on-call backlog and prioritized to be fixed by the on-call engineers.
Let anyone on your team attend these meetings (particularly new employees), as it helps them get familiar with the service that they'll eventually be on-call for.
Going on-call overnight is additional work, whether or not an engineer gets paged.
You need to pay a bonus on top of their regular salary for the weeks your engineers go on-call, particularly if you expect them to do the job well. In many countries, this is now a legal requirement (you may want to double-check this with your legal team).
I wrote this article as a former employee of a very large tech company (thousands of engineers). In our case, the regular software engineers who built the service also handled on-call for it.
Be aware that blindly implementing these recommendations will not automatically make your company successful. Large tech companies adopt these practices as a means of managing their success, not because the practices alone make them successful.
That being said, if your on-call staff are constantly fighting fires, or the on-call experience isn't generally improving week to week, hopefully some of these recommendations will help.