Saving your team from alert fatigue
It's a story as old as the web itself: someone on your team gets excited to install a new tool.
The tool promises to finally give you a clear view into the problems your users have with your product.
Your team agrees to give it a go.
The errors start coming...
...and they don't stop coming...
Soon enough, most of your team has either created an email filter to manage all the alerts, or has unsubscribed themselves entirely. Just like all the other tools.
Welcome to alert fatigue.
What this article covers: what alert fatigue is, and how to beat it.
What is Alert Fatigue
In short, alert fatigue happens when you start to become desensitized to alerts.
It tends to happen after receiving multiple alerts with similar content, often with so little context about the problem that there isn't enough information to actually fix it.
Alert fatigue can strike relatively quickly when there's a large number of alerts (often irrelevant to the developer receiving them) and not enough time to resolve all of the issues.
Beating Alert Fatigue
This section suggests a few ways you can beat alert fatigue.
It boils down to: making alert management a team effort, continuously cleaning up your alerts, and categorizing your alerts.
Make alert management a team effort
There is nothing worse than being paged for a system you didn't create, with an alert you have no autonomy over.
If your team has an on-call roster, it should have the autonomy to create, modify, and remove alerts as it needs. Sometimes an alert is flaky and needs to be temporarily disabled or have its thresholds changed, and sometimes an alert no longer fits the team's needs and should be removed entirely.
Clean up your alerts
Cleaning up your alerts doesn't have to be a big task done in one go; you can continuously clean up alerts as part of your on-call roster.
The idea is to look at every single alert as it fires and figure out whether there's a clear human response required. If an alert fires and no human action is required as a result, it shouldn't be taking your team's attention, and it needs to be removed.
Getting paged constantly while on-call is a symptom of a broken system.
Either the system is too unstable and needs time invested to make it resilient, or your alerts are too noisy and their thresholds need tweaking, or they require no action and need to be deleted.
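If you want to make that review a little more systematic, a lightweight audit log can help. Here's a minimal sketch (the record shape and the removal_candidates helper are purely illustrative, not part of any particular tool): the on-call person notes each firing and whether it actually required action, and anything that never did becomes a candidate for deletion.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class AlertFiring:
    """One record per alert firing, noted by the on-call engineer."""
    alert_name: str
    action_taken: bool  # did a human actually have to do something?

def removal_candidates(log: list[AlertFiring]) -> list[str]:
    """Return alerts that fired during the shift but never required action."""
    fired = defaultdict(list)
    for firing in log:
        fired[firing.alert_name].append(firing.action_taken)
    return [name for name, actions in fired.items() if not any(actions)]

# Example: review last week's shift (alert names are made up).
shift_log = [
    AlertFiring("api-p99-latency", action_taken=False),
    AlertFiring("api-p99-latency", action_taken=False),
    AlertFiring("payments-5xx-spike", action_taken=True),
]
print(removal_candidates(shift_log))  # ['api-p99-latency'] -> delete it or re-tune its threshold
```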
Categorize your alerts and notifications
The beauty of having an on-call roster within your team is that your team agrees that one person gets to focus entirely on either fighting fires or improving the on-call experience for the week.
Grooming your alerts greatly improves your team's on-call experience. Broadly speaking, sending every notification or alert your system generates to the same location is a mistake.
There are three types of alert that we care about, and they need to go to different places, as they help us do different things. Not every alert needs to wake someone up. Not every alert needs to go to a Slack channel.
While it's easy to sit in my armchair here and give you clear categories, in reality your team will often get high priority alerts that aren't really high priority, and low priority alerts that really should be looked at immediately. Part of your team's on-call responsibilities should be to review the alerts that fired each week and re-categorize them as necessary.
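To make "different places" concrete, here's a rough sketch of what routing by priority could look like. The priority names match the categories below, but the destinations, webhook URLs, and function names are assumptions about your setup rather than any specific tool's API.

```python
import json
import urllib.request

# Hypothetical destinations -- substitute your own pager and chat integrations.
PAGER_URL = "https://pager.example.com/trigger"      # high priority: wakes someone up
TEAM_CHAT_WEBHOOK = "https://chat.example.com/hook"  # medium priority: awareness only

def route_alert(name: str, priority: str, details: str) -> None:
    """Send an alert to a different place depending on its priority."""
    payload = json.dumps({"alert": name, "details": details}).encode()
    if priority == "high":
        # Immediate action required: page the on-call engineer.
        _post(PAGER_URL, payload)
    elif priority == "medium":
        # Awareness required: post to the team channel, no page.
        _post(TEAM_CHAT_WEBHOOK, payload)
    else:
        # Low priority: keep it out of chat entirely; log it for later triage.
        print(f"[low] {name}: {details}")

def _post(url: str, payload: bytes) -> None:
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=5)

# Example usage (the high/medium calls would hit your real integrations):
# route_alert("website-down", "high", "3 failed checks in 5 minutes")
# route_alert("db-backup-failed", "medium", "nightly backup did not complete")
route_alert("js-error", "low", "TypeError in checkout.js")
```

The point isn't the code itself; it's that each priority has exactly one agreed-upon destination.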
High priority alert - immediate action required
This covers things like: your website being completely unreachable after several checks in a 5-minute period, your SSL certificate expiring, or customers being unable to pay for your product.
These are the types of alerts that need to go to someone's phone (SMS, a call, or a pager app at high priority) and wake them up if they happen outside of business hours.
Once your team is aware of the alert, they spin up an incident room and start working through the runbook for the affected service.
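As a sketch of the "several checks in a 5-minute period" idea, the key is to only page after consecutive failures rather than on the first blip. The URL, check interval, and page_on_call function here are placeholders you'd replace with your own monitoring and paging setup.

```python
import time
import urllib.error
import urllib.request

CHECK_URL = "https://example.com/health"  # placeholder health endpoint
CHECK_INTERVAL_SECONDS = 60               # one check per minute
FAILURES_BEFORE_PAGING = 5                # i.e. roughly 5 minutes of downtime

def check_once(url: str) -> bool:
    """Return True if the site responds successfully."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status < 400
    except (urllib.error.URLError, TimeoutError):
        return False

def page_on_call(message: str) -> None:
    """Placeholder: wire this up to your paging tool of choice."""
    print(f"PAGE: {message}")

def watch(url: str) -> None:
    consecutive_failures = 0
    while True:
        if check_once(url):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
        if consecutive_failures >= FAILURES_BEFORE_PAGING:
            # Only now is this worth waking someone up for.
            page_on_call(f"{url} unreachable for {consecutive_failures} consecutive checks")
            consecutive_failures = 0
        time.sleep(CHECK_INTERVAL_SECONDS)
```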
Medium priority alert - awareness required
This covers things like: your database backups failing, or your database server starting to run out of disk space.
These are still serious problems, but nothing that immediately impacts your customers.
These types of alerts can go to a lower priority in your pager tool, or to a team channel in Slack/Discord/Microsoft Teams. Since they're lower priority, your team should only investigate when there's no active incident, and generally only during business hours.
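Here's a minimal sketch of the disk-space example, posting an awareness-level message to a team channel instead of paging anyone. The mount point, threshold, and webhook URL are assumptions you'd tune for your own environment.

```python
import json
import shutil
import urllib.request

TEAM_CHAT_WEBHOOK = "https://chat.example.com/hook"  # hypothetical team channel webhook
DATA_MOUNT = "/var/lib/postgresql"                   # assumed database data directory
FREE_SPACE_THRESHOLD = 0.15                          # warn below 15% free

def notify_team(message: str) -> None:
    """Post an awareness-level message to the team channel (no paging)."""
    payload = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        TEAM_CHAT_WEBHOOK, data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

def check_disk_space() -> None:
    usage = shutil.disk_usage(DATA_MOUNT)
    free_fraction = usage.free / usage.total
    if free_fraction < FREE_SPACE_THRESHOLD:
        # Medium priority: customers aren't affected yet, so no page --
        # just make sure the team sees it during business hours.
        notify_team(
            f"Disk space on {DATA_MOUNT} is at {free_fraction:.0%} free; "
            "plan a cleanup or resize before it becomes an incident."
        )

if __name__ == "__main__":
    check_disk_space()
```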
Low priority alert - things to fix during spare time/diagnostics
This covers things like: individual 5xx errors, server timeouts, and JavaScript errors.
Most commonly, these are the alerts your team gets from Sentry. It may be tempting to dump them into a Slack channel, but your team will quickly start to mute/ignore the channel.