Communicating to Users During Incidents
Imagine you're having a regular day at work: you open up your browser to double-check something for a client in that web app your team built for them, when suddenly, instead of the app, you see an error screen.
You hit refresh a few times, just to be sure.
Nope. Still down.
What happens next depends on how well your team has planned for incidents like this (some folks call it unplanned downtime).
Incidents can be a frustrating time for everyone.
For you, as someone on the operational side of the app, it's frustrating because you're trying to fix the problem as fast as possible, with as little distraction as possible.
On the user side of things, their work day is interrupted, and without good communication, they're not sure if someone is currently working on the issue, or when it'll be fixed.
You can think of communicating with your users during an incident as a scale.
On the left, there's "ignore your users", or under-communicate. The vast majority of organisations are on this side of the scale.
On the right, there's "send each of your users an SMS every few minutes to let them know how it's going", or over-communicate.
Both of these extremes are annoying - the sweet spot is somewhere in the middle.
While it's rare for a company to intentionally ignore their users during an incident, even waiting to confirm exactly what's happening before acknowledging an issue can be extremely frustrating for users.
A famous example of under-communication is AWS's Service Health Dashboard. AWS regularly takes over an hour to even acknowledge anything is wrong, let alone describe the issue and whether it's being worked on.
As I mentioned earlier, chances are your organisation is on the left-hand side of the scale: under-communicating with your users during incidents.
To reduce frustration for everyone, you're going to want to do a few things during your incidents:
You'll want to let your users know something is not right, and someone is investigating.
If you're only working with internal users within your organisation, send an SMS, email or a Slack message letting your users know something is wrong, and that someone is looking into the issue.
If you're working with external users, you're going to want a dashboard that displays your product's uptime or status (on separate infrastructure to your app!), one that users know to check when things aren't working properly.
You might be wondering whether or not you should notify your users about incidents before you've confirmed there's a user-facing impact. As an idealist, I'd suggest all incidents should be posted on your status dashboard, as the cost of getting it wrong is your users' trust - but certain organisations value other things more than user trust.
At a previous employer, we would send an update as often as every 30 minutes, depending on how critical the system was to the business. In this case, "We're still looking into it" was a completely valid update.
For some audiences (particularly if you have a status dashboard), this might be too often, and you'll need to tweak how often you communicate.
General rule of thumb: if your users are reaching out to you (via phone call, SMS, tweets, etc.), you've probably left it too long without an update.
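If you'd like something more concrete than a rule of thumb, here's a minimal sketch of an update timer. The 30 minute interval for critical systems comes from my example above; the "standard" tier, its one hour interval, and the function names are assumptions made up for illustration, so tune them to your own audience.

```python
from datetime import datetime, timedelta

# How long we allow between updates, keyed by how critical the system is.
# Only the 30 minute figure comes from the article; "standard" is hypothetical.
UPDATE_INTERVALS = {
    "critical": timedelta(minutes=30),
    "standard": timedelta(hours=1),
}


def next_update_due(last_update: datetime, criticality: str = "critical") -> datetime:
    """Return the time by which the next update should be sent."""
    return last_update + UPDATE_INTERVALS[criticality]


def update_overdue(last_update: datetime, now: datetime, criticality: str = "critical") -> bool:
    """Return True once the allowed interval since the last update has elapsed."""
    return now >= next_update_due(last_update, criticality)
```

Whoever is handling communication could check `update_overdue(last_update, datetime.now())` on a timer and post "We're still looking into it" whenever it returns True.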
It shouldn't be up to one person to simultaneously communicate with stakeholders AND fix the issue.
At the very least, you'll want a dedicated person in charge of communicating updates from the team fixing the issue to the outside world. Having a separate person here can help reduce how long it takes to resolve the issue, as you spend less time context switching between "Oh no, need to fix the problem! aaaahhhh!" and "We are aware of the issue with [SERVICE], and are working to restore it ASAP. We will notify you once we have any updates".
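One way to make the communicator's job easier is to prepare the wording ahead of time, so nobody has to compose it mid-incident. Here's a rough sketch that turns the acknowledgement above into a reusable template. The function names and the service name in the usage note are hypothetical, and the follow-up wording is only loosely based on the "still looking into it" update I mentioned earlier.

```python
# Canned messages with a placeholder for the affected service.
ACK_TEMPLATE = (
    "We are aware of the issue with {service}, and are working to restore it ASAP. "
    "We will notify you once we have any updates."
)
FOLLOW_UP_TEMPLATE = "We're still looking into the issue with {service}."


def acknowledgement(service: str) -> str:
    """Build the first message, sent as soon as you're aware of an incident."""
    return ACK_TEMPLATE.format(service=service)


def follow_up(service: str) -> str:
    """Build a regular update for when there's nothing new to report yet."""
    return FOLLOW_UP_TEMPLATE.format(service=service)
```

For example, `acknowledgement("Billing")` gives you a message that's ready to paste into your status dashboard, an email, or Slack.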
In some organisations this role is called "Communications Engineer" or "Communications Manager" - either way, the person doesn't have to be technical, they just need to be capable of talking to engineers and stakeholders.
In short, once you're aware of an incident, tell your users!
Once you've told your users, send regular updates.
Finally, having a separate person in charge of communication between the team fixing the issue and the rest of the business can help reduce your Mean Time to Resolution by cutting down on context switching.