Dealing with Noisy Error Monitoring

Say you've been tasked with monitoring an application, so you set up some alerts to let you know when errors are coming in.

The minutes roll by, the errors start coming...

...and they don't stop coming...

Oh my, there seem to be quite a few errors coming through. Alerting on each error isn't going to help, so it's better to report on changes in the error rate instead, right?

Not quite.

While there's no shortage of vendors that'll sell you on the benefits of error rate alerting, you need to get back to basics first.

Instrument your application

If you haven't already, you'll want to instrument your individual API endpoints.

For each endpoint, you'll want to be monitoring the requests, errors, and how long each call takes. Doing so makes it easier to find regressions in each endpoint when things do go wrong.
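As a rough illustration of what tracking requests, errors, and duration per endpoint can look like, here is a minimal Python sketch. The `instrument` decorator, the `METRICS` dictionary, and the `get_user` handler are all hypothetical names invented for this example, and the in-memory store stands in for whatever monitoring backend you actually use.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory store for illustration only; a real setup would ship these
# numbers to a monitoring backend instead.
METRICS = defaultdict(lambda: {"requests": 0, "errors": 0, "total_seconds": 0.0})


def instrument(endpoint):
    """Record requests, errors, and duration for one endpoint (hypothetical helper)."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            stats = METRICS[endpoint]
            stats["requests"] += 1
            started = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                # Count the failure, then let it propagate unchanged.
                stats["errors"] += 1
                raise
            finally:
                # Duration is recorded for successes and failures alike.
                stats["total_seconds"] += time.perf_counter() - started
        return wrapper
    return decorator


@instrument("GET /users")
def get_user(user_id):
    # Hypothetical handler used only to exercise the decorator.
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}
```

Keeping the counters keyed by endpoint is what makes it possible to see which endpoint regressed, rather than only knowing that the application as a whole got worse.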

There are a few services out there for instrumenting your app; it's worth looking at either Sentry or Honeycomb.

For example, tracking down the cause of an increase in 4xx errors becomes a matter of comparing the current week's requests/errors/duration to the previous week's for each endpoint.
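To make that week-over-week comparison concrete, here is a small Python sketch. It assumes you can export per-endpoint request and error counts as plain dictionaries; the function names, the 5-percentage-point threshold, and the sample numbers are all made up for illustration.

```python
def error_rate(stats):
    """Errors as a fraction of requests; 0.0 when there was no traffic."""
    return stats["errors"] / stats["requests"] if stats["requests"] else 0.0


def find_regressions(previous_week, current_week, min_increase=0.05):
    """Return endpoints whose error rate rose by at least min_increase, worst first."""
    regressions = []
    for endpoint, current in current_week.items():
        # Endpoints that did not exist last week are compared against zero traffic.
        previous = previous_week.get(endpoint, {"requests": 0, "errors": 0})
        increase = error_rate(current) - error_rate(previous)
        if increase >= min_increase:
            regressions.append((endpoint, round(increase, 4)))
    return sorted(regressions, key=lambda item: item[1], reverse=True)


# Made-up numbers purely for illustration.
previous_week = {
    "GET /users": {"requests": 1000, "errors": 10},
    "POST /orders": {"requests": 500, "errors": 5},
}
current_week = {
    "GET /users": {"requests": 1000, "errors": 12},
    "POST /orders": {"requests": 500, "errors": 100},
}

regressions = find_regressions(previous_week, current_week)
```

With these sample numbers only `POST /orders` is flagged, since its error rate jumps from 1% to 20% while `GET /users` barely moves. The same comparison could be repeated for request volume and duration.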

Clean up your alerts

Go through each type of error you're seeing that triggers an alert, and figure out what the appropriate response would be.

If the error isn't actionable, then it shouldn't create an alert. Have your developers change it to an INFO or a WARN log instead.

If the error IS actionable, and your application has already tried and failed to self-heal (assuming self-healing is possible), THEN you should alert.

If the error IS actionable, BUT only after a certain threshold, you might be dealing with errors caused by a retry loop. Errors that occur as part of a sequence that eventually succeeds aren't real errors. If after several retry attempts the action still fails, that whole sequence of events should be counted as one error. Have your developers modify the error reporting accordingly.
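One way to report a retry sequence as a single error is sketched below in Python. This is only an illustration under assumptions: `call_with_retries` is a hypothetical helper, and it assumes your alerting is driven by ERROR-level log records, so intermediate failures are logged at WARNING instead.

```python
import logging
import time

logger = logging.getLogger("app")


def call_with_retries(action, attempts=3, delay_seconds=0.0):
    """Run action, retrying on failure; report one error only if every attempt fails."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            last_error = error
            # Intermediate failures are expected noise, so keep them below ERROR.
            logger.warning("attempt %d of %d failed: %s", attempt, attempts, error)
            if attempt < attempts:
                time.sleep(delay_seconds)
    # The whole sequence failed: this is the single error worth alerting on.
    logger.error("action failed after %d attempts: %s", attempts, last_error)
    raise last_error
```

A call that fails twice and then succeeds produces two warnings and no error, while a call that never succeeds produces exactly one error record no matter how many attempts were made.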

As part of your team's on-call schedule, you should be documenting every alert that fires and the actions taken to resolve each issue. Your alerts should also point to a runbook with instructions on what to do. The causes of common alerts should be prioritised and fixed by your developers.

Summary

If you're struggling with the signal-to-noise ratio of your company's alerting, you need to reflect on what the point of all of those alerts is.

An alert should mean someone needs to do something NOW.

It should not mean "Hey Jordan, check it out, CPU is at 69%!"

Another way to think of this: if you're getting an alert for an error, it should be a priority to fix that error. If it isn't, what's the point?

Want to discuss this article? Let me know on Twitter @rozenmd
