Guidelines for picking where to send monitoring alerts
If you've ever been on the receiving end of a monitoring system that uses email for alerts, you know how noisy things can get, particularly if you work in an agency or freelance-like environment with dozens of client sites to maintain.
You get so many emails that you start looking into integrations with third-party services like Zapier, and coming up with more and more complex rules to try to reduce the noise, such as:
- wait until a site has been down for 5 minutes before sending an email
- wait until a site has been down for 30 minutes before forwarding the email to your helpdesk
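For illustration, delay rules like the two above could be sketched as follows. This is a minimal sketch: the `ESCALATION_RULES` table, the action names, and the `actions_due` helper are hypothetical, and none of them correspond to a real Zapier or monitoring-service API.

```python
from datetime import timedelta

# Hypothetical rule table mirroring the two example rules:
# (how long the site must have been down, action to take).
ESCALATION_RULES = [
    (timedelta(minutes=5), "send_email"),
    (timedelta(minutes=30), "forward_to_helpdesk"),
]


def actions_due(downtime: timedelta) -> list[str]:
    """Return every action whose delay threshold the outage has reached."""
    return [action for threshold, action in ESCALATION_RULES if downtime >= threshold]
```

A three-minute blip returns no actions at all, while a 45-minute outage returns both. Each extra rule is one more row in the table, which is how these setups tend to grow.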
The trouble isn't so much with email - it's a pretty decent way to receive and store important information - but rather that the alerts are for humans, and humans have limited attention. Every alert you get eats up a little more of that attention, and the less you have left, the more likely you are to miss the real "huge business impact" alerts.
So let's take a look at what type of alerts we should care about, and where to send them.
Basically, you should send an alert for anything that needs to be responded to immediately, such as your site going down and taking your business's ability to make money with it. These alerts should be able to wake an on-call person up, so they can fix the issue ASAP.
If you can't trust that your site is actually down when receiving an alert, perhaps it's time to re-evaluate your monitoring service.
When was the last time you woke up for an email? It's not particularly common, so here are some suggestions for where your alerts should go.
Mike Julian's Practical Monitoring categorises alerts into three groups, depending on their use-case:
- Immediate action required
  - Events that should trigger these kinds of alerts: your site being unreachable, users being unable to pay for products, an expired SSL certificate.
  - These are the alerts that should go to Phone/SMS/Pager, and should wake people up (if you respond to outages overnight).
- Awareness needed, but immediate action not required
  - Events that should trigger these kinds of alerts: your database backup failing, warnings that your server is starting to run out of disk space.
  - These are the types of alerts that should go to a team channel, such as Slack, Discord, or Microsoft Teams.
  - You could also send these types of alerts to email, though it might rapidly get too noisy for a single person.
- Record for historical/diagnostic purposes
  - Events that should trigger these kinds of alerts: your system returning a 5xx error for a request, server timeouts.
  - These are the types of alerts your system should be sending to a logging service.
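The three groups above can be sketched as a small routing table. This is only an illustration: the event names, the destination labels, and the `route` helper are hypothetical, and sending unknown events to the logging service is an assumption rather than something the book prescribes.

```python
from enum import Enum


class Category(Enum):
    IMMEDIATE_ACTION = "immediate action required"
    AWARENESS = "awareness needed, but immediate action not required"
    RECORD = "record for historical/diagnostic purposes"


# One destination per category, following the three groups above.
DESTINATIONS = {
    Category.IMMEDIATE_ACTION: "phone_sms_pager",
    Category.AWARENESS: "team_channel",
    Category.RECORD: "logging_service",
}

# Hypothetical event names based on the examples in each group.
EVENT_CATEGORIES = {
    "site_unreachable": Category.IMMEDIATE_ACTION,
    "payments_failing": Category.IMMEDIATE_ACTION,
    "ssl_certificate_expired": Category.IMMEDIATE_ACTION,
    "database_backup_failed": Category.AWARENESS,
    "disk_space_low": Category.AWARENESS,
    "http_5xx": Category.RECORD,
    "server_timeout": Category.RECORD,
}


def route(event: str) -> str:
    """Return the destination label for an event name."""
    # Assumption: anything unclassified is recorded, not paged.
    category = EVENT_CATEGORIES.get(event, Category.RECORD)
    return DESTINATIONS[category]
```

Keeping the event-to-category mapping separate from the category-to-destination mapping means you can swap Slack for Discord, or one pager service for another, without reclassifying any events.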
Don't use email as a "catch-all" for your monitoring system's alerts. You should split your alerts by use-case, and send them to different locations:
- Alerts that require immediate action should go to Phone/SMS/Pager
- Alerts that just need awareness, but no immediate action, should go to a team channel in Slack/Discord/Microsoft Teams
- General system errors should go to your logging service