After things go wrong and the incident is resolved, it's time to learn.
Whatever you call this process, use a template so that your team has a standardized way to document what went wrong and what your organization will do differently to avoid repeat incidents. It also helps to store these documents in a centralized place (Google Docs, Confluence, Notion, etc.) so that your teams can learn from each other's mistakes.
It's common for postmortem templates to include a five whys analysis to determine a root cause.
Be aware that when building software in the cloud, there is rarely a single "true" root cause, because distributed systems are so complex. It's still worth looking for improvements to your organization's processes to avoid repeat incidents.
For example, one template poses these questions:
- What happened?
- What was the impact on customers and your business?
- What was the root cause?
- What data do you have to support this?
  - Especially metrics and graphs
- What were the critical pillar implications, especially security?
  - When architecting workloads, you make trade-offs between pillars based on your business context. These business decisions can drive your engineering priorities. You might optimize to reduce cost at the expense of reliability in development environments, or, for mission-critical solutions, you might optimize reliability with increased costs. Security is always job zero, as you have to protect your customers.
- What lessons did you learn?
- What corrective actions are you taking?
  - Action items
  - Related items (trouble tickets, etc.)
Another template is organized into these sections:
- Incident summary (what happened, why, incident severity, how long did the incident last?)
- Leadup (the series of events that led to the incident)
- Fault (describe how the software misbehaved)
- Impact (determine how many users were impacted, over which time period)
- Detection (when did the team detect the incident, and how)
- Response (who responded to the incident, when, and what did they do?)
- Recovery (how was the system restored, what steps were needed to restore the system to a functioning state?)
- Timeline (an incident timeline, using the UTC timezone to standardize time)
- Five whys (describe the incident, ask why it happened, then ask why that happened, recursively, five times)
- Root cause (what needs to be changed to avoid this happening again)
- Backlog check (review your backlog to see if any unplanned work could have prevented this issue)
- Recurrence (check whether earlier incidents had the same root cause; if so, ask why the incident happened again)
- Corrective actions (describe the work needed to prevent the incident from happening again, who is responsible for completing it, and by when)
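
The timeline, detection, and recovery sections above lend themselves to a small amount of tooling. The following is a minimal sketch, not part of any particular template: the `TimelineEvent` class, the `elapsed` and `render_timeline` helpers, the event labels, and the sample incident data are all hypothetical names and values chosen for illustration. It normalizes every timestamp to UTC and derives how long detection and recovery took.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TimelineEvent:
    """One entry in an incident timeline (hypothetical structure)."""
    at: datetime   # must be timezone-aware
    label: str     # illustrative labels: "started", "detected", "recovered"
    note: str = ""

    def __post_init__(self):
        # Reject naive timestamps: without an offset we cannot convert to UTC.
        if self.at.tzinfo is None or self.at.utcoffset() is None:
            raise ValueError("timeline timestamps must be timezone-aware")
        # Store everything in UTC so entries from different regions line up.
        self.at = self.at.astimezone(timezone.utc)


def elapsed(events, start_label, end_label):
    """Return the time between two labeled events as a timedelta."""
    by_label = {e.label: e.at for e in events}
    return by_label[end_label] - by_label[start_label]


def render_timeline(events):
    """Render events in chronological order as markdown bullets."""
    lines = []
    for e in sorted(events, key=lambda e: e.at):
        lines.append(f"- {e.at:%Y-%m-%d %H:%M} UTC [{e.label}] {e.note}".rstrip())
    return "\n".join(lines)


# Made-up sample data: one responder logged the detection in UTC+1.
utc_plus_1 = timezone(timedelta(hours=1))
events = [
    TimelineEvent(datetime(2024, 1, 15, 10, 5, tzinfo=utc_plus_1), "detected", "alert fired"),
    TimelineEvent(datetime(2024, 1, 15, 8, 50, tzinfo=timezone.utc), "started", "bad deploy"),
    TimelineEvent(datetime(2024, 1, 15, 9, 40, tzinfo=timezone.utc), "recovered", "rollback"),
]

print(render_timeline(events))
print("time to detect:", elapsed(events, "started", "detected"))
print("incident duration:", elapsed(events, "started", "recovered"))
```

Converting at construction time means a timestamp recorded in a responder's local timezone cannot silently end up out of order in the rendered timeline.
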
In short, it doesn't matter which template you pick (or whether you create one from scratch), as long as your organization uses it consistently.