Learn
Postmortem Templates
After things go wrong and the incident is resolved, it's time to learn.
In tech, reviewing the cause of an incident is generally known as a postmortem, though other names are also used, such as "Post-Incident Review (PIR)" or "Correction of Error (COE)".
Whatever you call the process, use a template so that your team has a standardized way of documenting what went wrong and what your organization will do differently to avoid repeat issues. It also helps to store these documents in a centralized place (Google Docs, Confluence, Notion, etc.) so that your teams can learn from each other's mistakes.
A note on root causes in the cloud
It's common for postmortem templates to include a five whys analysis to determine a root cause.
Be aware that when building software in the cloud, there is rarely a "true" root cause due to the sheer complexity of building distributed software. It's still worth attempting to find improvements to your organization's processes to avoid repeat incidents.
Amazon's template
- What happened?
- What was the impact on customers and your business?
- What was the root cause?
- What data do you have to support this?
  - Especially metrics and graphs
- What were the critical pillar implications, especially security?
  - When architecting workloads you make trade-offs between pillars based upon your business context. These business decisions can drive your engineering priorities. You might optimize to reduce cost at the expense of reliability in development environments, or, for mission-critical solutions, you might optimize reliability with increased costs. Security is always job zero, as you have to protect your customers.
- What lessons did you learn?
- What corrective actions are you taking?
  - Action items
  - Related items (trouble tickets, etc.)
Source: https://wa.aws.amazon.com/wat.concept.coe.en.html
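One way to keep a template consistently used is to generate every new postmortem from the same skeleton and check drafts for unanswered questions. The following is a minimal sketch under stated assumptions, not part of Amazon's process: the names `COE_QUESTIONS`, `blank_postmortem`, and `unanswered`, and the `TODO` placeholder, are all hypothetical and only the top-level questions from the list above are used.

```python
# Top-level questions copied from the template above.
COE_QUESTIONS = [
    "What happened?",
    "What was the impact on customers and your business?",
    "What was the root cause?",
    "What data do you have to support this?",
    "What were the critical pillar implications, especially security?",
    "What lessons did you learn?",
    "What corrective actions are you taking?",
]


def blank_postmortem(title, questions=COE_QUESTIONS):
    """Return a plain-text skeleton: the title, then each question with a TODO line."""
    lines = [title, ""]
    for question in questions:
        lines += [question, "TODO", ""]
    return "\n".join(lines)


def unanswered(document, questions=COE_QUESTIONS):
    """Return the questions that are missing or still followed by the TODO placeholder."""
    doc_lines = document.splitlines()
    missing = []
    for question in questions:
        if question not in doc_lines:
            missing.append(question)
            continue
        idx = doc_lines.index(question)
        answer = doc_lines[idx + 1].strip() if idx + 1 < len(doc_lines) else ""
        if answer in ("", "TODO"):
            missing.append(question)
    return missing


if __name__ == "__main__":
    draft = blank_postmortem("Postmortem: checkout API outage")
    draft = draft.replace("What happened?\nTODO", "What happened?\nThe API returned errors.")
    for question in unanswered(draft):
        print("Still unanswered:", question)
```

The same approach works for any template: swap in a different list of headings and the skeleton and the check follow along.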
Atlassian's template
- Incident summary (what happened, why, incident severity, how long did the incident last?)
- Leadup (the series of events that led to the incident)
- Fault (describe how the software misbehaved)
- Impact (determine how many users were impacted, over which time period)
- Detection (when did the team detect the incident, and how)
- Response (who responded to the incident, when, and what did they do?)
- Recovery (how was the system restored, what steps were needed to restore the system to a functioning state?)
- Timeline (an incident timeline, using the UTC timezone to standardize time)
- Five whys (describe the incident, ask why it happened, then ask why that happened, recursively, five times)
- Root cause (what needs to be changed to avoid this happening again)
- Backlog check (review your backlog to see if any unplanned work could have prevented this issue)
- Recurrence (use the root cause to see if other incidents had the same root cause. Ask why the incident happened again)
- Corrective actions (describe the work needed to prevent the incident happening again, who is responsible for completing the work, and when)
Source: https://www.atlassian.com/incident-management/postmortem/templates
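The timeline item above asks for UTC timestamps, which matters when responders record events in their own local time. The sketch below shows one way to normalize and sort such entries. It is an illustration only: `TimelineEntry`, `to_utc_timeline`, and `render` are hypothetical names, and the fixed UTC offsets and event notes are made-up sample data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TimelineEntry:
    when: datetime  # must be timezone-aware
    note: str


def to_utc_timeline(entries):
    """Return the entries sorted by time, with every timestamp converted to UTC."""
    for entry in entries:
        if entry.when.tzinfo is None:
            # A naive timestamp is ambiguous, so refuse to guess its timezone.
            raise ValueError(f"naive timestamp for {entry.note!r}; attach a timezone first")
    converted = [
        TimelineEntry(entry.when.astimezone(timezone.utc), entry.note)
        for entry in entries
    ]
    return sorted(converted, key=lambda entry: entry.when)


def render(entries):
    """Format each entry as one line of a plain-text timeline."""
    return [f"{entry.when:%Y-%m-%d %H:%M} UTC - {entry.note}" for entry in entries]


if __name__ == "__main__":
    # Sample data: two responders logging events with different UTC offsets.
    utc_minus_5 = timezone(timedelta(hours=-5))
    utc_plus_1 = timezone(timedelta(hours=1))
    raw = [
        TimelineEntry(datetime(2024, 1, 15, 10, 5, tzinfo=utc_plus_1), "Rollback completed"),
        TimelineEntry(datetime(2024, 1, 15, 3, 30, tzinfo=utc_minus_5), "Alert fired"),
    ]
    timeline = to_utc_timeline(raw)
    for line in render(timeline):
        print(line)
    print("Duration:", timeline[-1].when - timeline[0].when)
```

Rejecting naive timestamps rather than assuming a timezone is a deliberate choice here: a silently wrong offset would reorder the timeline and distort the detection and recovery times.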
Summary
In short, it doesn't matter which template you pick (or if you create one from scratch), as long as it's consistently used by your organization.