Dealing with Noisy Error Monitoring
Say you've been tasked with monitoring an application, so you set up some alerts to let you know when errors are coming in.
The minutes roll by, the errors start coming...
...and they don't stop coming...
Oh my, there seem to be quite a few errors coming through. Alerting on each error isn't going to help. Better to alert on changes in the error rate instead, right?
While there's no shortage of vendors that'll sell you on the benefits of error rate alerting, you need to get back to basics first.
If you haven't already, you'll want to instrument your individual API endpoints.
For each endpoint, you'll want to be monitoring the requests, errors, and how long each call takes. Doing so makes it easier to find regressions in each endpoint when things do go wrong.
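As a rough illustration of what that instrumentation could look like, here's a minimal in-memory sketch. The `instrumented` decorator, the `STATS` dictionary, and the `get_users` handler are all made up for this example. In a real application you'd send these numbers to whatever metrics system you already use rather than keep them in a dict.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from functools import wraps


@dataclass
class EndpointStats:
    requests: int = 0
    errors: int = 0
    total_seconds: float = 0.0


# Hypothetical in-memory store, keyed by endpoint name.
STATS = defaultdict(EndpointStats)


def instrumented(endpoint):
    """Record requests, errors, and duration for one endpoint."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            stats = STATS[endpoint]
            stats.requests += 1
            started = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                # Count the failure, then let it propagate as usual.
                stats.errors += 1
                raise
            finally:
                # Duration is recorded for successes and failures alike.
                stats.total_seconds += time.perf_counter() - started
        return wrapper
    return decorator


# Hypothetical handler, only here to show the decorator in use.
@instrumented("GET /users")
def get_users(fail=False):
    if fail:
        raise RuntimeError("database unavailable")
    return ["alice", "bob"]
```

Each decorated handler then gets its own request count, error count, and total duration, which is enough to compute an error rate and an average latency per endpoint.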
For example, tracking down the cause of an increase in 4xx errors becomes a matter of comparing the current week's requests/errors/duration to the previous week's for each endpoint.
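A week-over-week comparison like that can be sketched in a few lines. This assumes you can export per-endpoint request and error counts for each week. The `find_regressions` function, its `min_increase` threshold, and the input shape are assumptions for illustration, and it only looks at error rate (the same idea applies to duration).

```python
def error_rate(requests, errors):
    """Errors as a fraction of requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def find_regressions(previous, current, min_increase=0.05):
    """Return endpoints whose error rate rose by at least min_increase.

    previous and current map an endpoint name to a
    (requests, errors) tuple for that week.
    """
    regressions = []
    for endpoint, (requests, errors) in current.items():
        prev_requests, prev_errors = previous.get(endpoint, (0, 0))
        delta = error_rate(requests, errors) - error_rate(prev_requests, prev_errors)
        if delta >= min_increase:
            regressions.append((endpoint, delta))
    # Biggest increase first, so the worst offender is at the top.
    return sorted(regressions, key=lambda item: item[1], reverse=True)
```

The 0.05 threshold is arbitrary. Pick whatever increase is meaningful for your traffic.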
Go through each type of error you're seeing that triggers an alert, and figure out what the appropriate response would be.
If the error isn't actionable, then it shouldn't create an alert. Have your developers change it to an INFO or a WARN log instead.
If the error IS actionable, and your application has already attempted to self-heal the issue (assuming self-healing is possible), THEN you should alert.
If the error IS actionable, BUT only after a certain threshold, you might be dealing with errors caused by a retry loop. Errors that occur as part of a sequence that eventually succeeds aren't real errors. If after several retry attempts the action still fails, that whole sequence of events should be counted as one error. Have your developers modify the error reporting accordingly.
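For the retry case, here's a minimal sketch of reporting a whole retry sequence as a single error, using Python's standard `logging` module. The `call_with_retries` helper and the logger name are hypothetical, and a real version would usually wait between attempts.

```python
import logging

# Hypothetical logger name for this example.
logger = logging.getLogger("retry_example")


def call_with_retries(action, attempts=3):
    """Run action, retrying on failure, and report at most one error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            # A failed attempt is only a warning: the sequence may still succeed.
            logger.warning("attempt %d of %d failed: %s", attempt, attempts, exc)
    # Every attempt failed, so the whole sequence counts as one error.
    logger.error("action failed after %d attempts: %s", attempts, last_error)
    raise last_error
```

If your alerting only looks at ERROR-level events, a call that succeeds on the third try produces no alert at all, and a call that never succeeds produces exactly one.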
As part of your team's on-call schedule, you should be documenting every alert that fires and the actions taken to resolve each issue (and your alerts should link to a runbook with instructions on what to do). Alerts that fire often should be prioritised and fixed by your developers.
If you're struggling with the signal-to-noise ratio of your company's alerting, you need to reflect on what the point of all those alerts is.
An alert should mean someone needs to do something NOW.
It should not mean "Hey Jordan, check it out, CPU is at 69%!"
Another way to think of this: if you're getting an alert for an error, it should be a priority to fix that error. If it isn't, what's the point?
Want to discuss this article? Let me know on Twitter @rozenmd