Every engineering organization should be continually striving to create more resilient applications and processes. Incidents are a reality in any organization, but effective post-mortems help improve the incident response process, resolve technical and organizational issues that cause incidents and build increasingly resilient applications.
Yet not all post-mortems are equally effective. Too often, post-mortems fail to provide actionable recommendations, or they focus too much on individual ‘self-improvement’ instead of process and tooling changes. I’ve run many post-mortems, and have come up with some best practices to ensure that time spent in post-mortems is not wasted. Here they are.
The first, and perhaps most important, best practice to follow during the post-mortem process is to make it completely blameless. Everyone involved in the process has to feel confident that there won’t be any negative repercussions as a result of the post-mortem, whether it’s losing a job or getting a poor performance review.
This is actually more difficult than it sounds. It requires not only an organizational culture that doesn’t punish engineers for things that come out in the post-mortem, but also a skilled facilitator who ensures that the post-mortem discussion doesn’t become accusatory. This is a fine line: Asking ‘why did you do X’ can be a neutral, inquisitive question that is very appropriate as you get to the bottom of an incident, but it can also sound like assigning blame. The person running the post-mortem has to proactively police the process for anything that sounds accusatory.
When possible, don’t mention anyone by name in the written post-mortem.
Find the lesson
It would be a mistake to think that the post-mortem’s goal is to find out what happened. If that is all the post-mortem achieves, it would be a waste of time. The reason it’s important to understand what happened is so that the company can take steps to prevent the same set of failures, whether they are technical, organizational or both, from happening again.
Ideally, the lessons from a post-mortem should be focused on processes or tools to implement rather than changes in individual behavior. Ultimately, no incident is caused purely by an individual action. Incidents happen because the right tools and processes were not in place, and that gap put an individual or system in a position to cause an incident.
Have a detailed timeline
The post-mortem is the time to pick apart the incident and incident response in detail. A timeline makes it much easier to understand what happened leading up to the incident and during the response, providing a starting point for understanding the delta between how the system works in reality and the team’s mental map of how it works. A timeline also helps uncover gaps in the incident response process.
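As an illustration only, here is a minimal Python sketch of how a timeline might be assembled by merging events from several sources and sorting them chronologically. The `Event` fields, the source labels and the sample events are all hypothetical, made up for this example rather than taken from any particular tool or real incident. The `largest_gap` helper is one assumed way to surface a stretch of the response where nothing was recorded, which can be a useful place to start asking questions.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    at: datetime   # when the event happened
    source: str    # where the record came from (hypothetical labels)
    summary: str   # what happened, described without naming individuals


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.at)


def largest_gap(timeline: list[Event]) -> tuple[Event, Event, timedelta]:
    """Return the adjacent pair of events with the longest silence between them.

    Expects a timeline with at least two events.
    """
    pairs = zip(timeline, timeline[1:])
    before, after = max(pairs, key=lambda pair: pair[1].at - pair[0].at)
    return before, after, after.at - before.at


# Made-up sample data standing in for deploy logs, alerts and chat history.
deploys = [Event(datetime(2024, 1, 1, 14, 0), "deploy log", "new release deployed")]
alerts = [Event(datetime(2024, 1, 1, 14, 1), "alerting", "error-rate alert fired")]
chat = [
    Event(datetime(2024, 1, 1, 14, 10), "chat", "rollback finished"),
    Event(datetime(2024, 1, 1, 14, 9), "chat", "rollback started"),
]

timeline = build_timeline(deploys, alerts, chat)
for event in timeline:
    print(event.at.strftime("%H:%M"), f"[{event.source}]", event.summary)

before, after, gap = largest_gap(timeline)
print(f"longest gap: {gap} between '{before.summary}' and '{after.summary}'")
```

In this made-up data the longest silence falls between the alert and the start of the rollback, which is the kind of gap in the response process a timeline makes visible. Summaries are written without names, in keeping with the blameless approach.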
Just as communication is critical during incident response, collaboration is crucial for post-mortems. Everyone in the team will have a slightly different perspective and a different area of expertise. The final post-mortem report should draw on as many voices as possible to get the fullest picture of what happened and how the problem can be avoided in the future.
Call out positives
One frequent oversight in post-mortems is failing to mention things that went well. For example:
- There was a bad deployment, but the engineer was alerted one minute later.
- We’ve set up our system to make rollbacks easy, so the rollback was done within another minute.
- After a previous incident we made a custom dashboard to address problems like this. It worked and allowed us to resolve the incident quickly.
Positive call-outs can be made by name. The goal is to create a culture that rewards actions that help the team and foster team unity.
The biggest mistake when it comes to post-mortems is a failure to take the recommendations seriously, at both an individual and organizational level. This manifests as:
- Engineers who don’t put in the effort to pull out the hard-hitting action items that will really help the organization improve either the application’s resiliency or the incident response process, so that this same type of post-mortem never has to be done again. Wishy-washy takeaways like “I’ll be more careful next time” are too frequent in post-mortems and are ultimately useless.
- Leaders who are unwilling to take action on the advice laid out in the post-mortem. Sometimes a post-mortem will recommend investments in new tools, for example. If leadership doesn’t buy in to either the organizational changes or the new automation tools that the post-mortem found would be helpful, the post-mortem lessons go unlearned.
Timelines are a critical part of effective post-mortems, but can be time-consuming to compile. With effx, you can pull up a detailed timeline for both an individual service and all of that service’s dependencies in minutes.