It’s two days after your trial-by-fire on-call rotation. Even with an outdated service catalog and an incomplete runbook, you eventually fixed the problem that was preventing users in Europe from logging in. It took a couple of hours and required reaching out to five of your colleagues, but the issue was resolved before the bulk of users in the United States started complaining.
Now it’s time to figure out what happened and why.
Of course, your regular work hasn’t disappeared. As you sort through every log you can think of that might be relevant, including checking your cloud provider’s announcements to see if they had any issues, your work keeps on piling up.
You’re also a little worried about one colleague, who seems to remember things differently from everyone else. You know it’s important to be as accurate as possible, but you don’t want to step on his toes, especially because he’s the most senior engineer who was involved in the incident response.
There are also some details no one agrees on. Did you fix it on the 5th or 6th try? Who came up with the solution that fixed the error? How long did it take each person to respond? One of the colleagues involved is on vacation, so there’s one less person to fact-check events with.
Best practices meet reality
In an ideal world, post-mortems would always be a collaborative effort to create an accurate timeline, one that pulls together human actions, infrastructure changes, code changes and external events like a shift in customer usage patterns or a cloud provider outage. In the real world, timeline creation is rarely truly collaborative. If the person primarily responsible for creating the timeline forgets something, or even intentionally excludes it, there’s often no organizational failsafe to make sure it still ends up in the timeline.
This isn’t just a memory problem, either — information about what happens during an incident isn’t perfect. One of the goals of a post-mortem is to identify places where communication broke down, which is hard to do if there’s only one perspective in the timeline.
Another key component of the post-mortem process is understanding the complete context and how everything is interconnected. If the only tools available to build the timeline are a Google doc, Slack messages to colleagues and sorting through logs, piecing together the complex dependencies is both time-consuming and hard to get right. In real life, it also requires tribal knowledge, which makes organizations overly reliant on a couple of senior engineers. The other problem with tribal knowledge is that it is assumption-based: it can lead individuals to prematurely discount theories about what happened that don’t fit their assumptions about the application. That filtering is usually subconscious, but it can undermine the entire post-mortem.
One source of truth
The challenge during a lot of post-mortems is that they lack a single source of truth. When one engineer writes up a timeline — perhaps soliciting input from other people who were involved but generally not getting any — it’s almost a miracle if something doesn’t get left out. There are plenty of reasons for errors, from the timeline writer simply not knowing that something happened (someone else was paged at the same time but neither knew the other was working on the same problem, or two people had a private conversation that didn’t include the person writing the timeline) to inaccurate memory to intentionally leaving out details that make the writer look bad.
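Mechanically, a single source of truth comes from collecting timestamped events from every channel — the paging system, chat, the deploy log — and merging them into one chronological record, so no single person’s memory decides what makes the timeline. A minimal sketch in Python, using hypothetical event data (the service names, people and times are invented for illustration):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import heapq

@dataclass(order=True)
class Event:
    ts: datetime      # when it happened (compared first when sorting)
    source: str       # which channel recorded it
    detail: str

# Hypothetical events pulled from three separate channels.
deploys = [Event(datetime(2023, 3, 1, 8, 55, tzinfo=timezone.utc),
                 "deploy", "auth-service v142 released")]
pager = [Event(datetime(2023, 3, 1, 9, 2, tzinfo=timezone.utc),
               "pager", "alice paged: login errors in EU")]
chat = [Event(datetime(2023, 3, 1, 9, 10, tzinfo=timezone.utc),
              "chat", "bob: rolling back auth deploy")]

# heapq.merge assumes each input stream is already sorted by timestamp,
# which event feeds naturally are; the result is one chronological record.
timeline = list(heapq.merge(deploys, pager, chat))
for e in timeline:
    print(e.ts.isoformat(), e.source, e.detail)
```

Because the merge is driven by timestamps rather than by whoever writes the doc, a deploy that nobody mentioned in chat still shows up in its correct place before the first page.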
Consider this scenario: The pricing service goes down. A couple minutes later the internal admin page goes down. The reason the internal admin page goes down might be that it needs to be able to load the prices of everything, and can’t because of the problem with the pricing service.
At a large company, these two events will likely be treated as two separate incidents. Not only is the operating assumption that the two incidents are isolated, but the team investigating each one might not even know about the other, because each service is handled by a different team in a different time zone.
During the incident, the team working on the admin page might never have realized that none of their actions mattered — the admin page went back up as soon as the pricing service came back online.
If the timeline is created manually, the admin page team might never know what really happened and might not even realize there is a hard dependency on the pricing service. They would not be able to take appropriate corrective action, which could include removing the hard dependency on the pricing service.
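With a merged, timestamped record, the hidden dependency in the scenario above shows up as simple ordering: the admin page’s outage starts minutes after the pricing service’s and ends when it does. A sketch of that check in Python, using hypothetical incident windows (the services, times and threshold are invented for illustration, not a real detection algorithm):

```python
from datetime import datetime, timedelta

# Hypothetical incident windows: service -> (outage start, outage end).
incidents = {
    "pricing-service": (datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 1, 9, 40)),
    "admin-page":      (datetime(2023, 3, 1, 9, 3), datetime(2023, 3, 1, 9, 40)),
}

def suspected_dependency(upstream, downstream, slack=timedelta(minutes=5)):
    """Flag downstream as possibly dependent on upstream if its outage
    begins shortly after upstream's and ends at roughly the same time."""
    u_start, u_end = incidents[upstream]
    d_start, d_end = incidents[downstream]
    starts_just_after = u_start <= d_start <= u_start + slack
    ends_together = abs(d_end - u_end) <= slack
    return starts_just_after and ends_together

print(suspected_dependency("pricing-service", "admin-page"))  # → True
```

A correlation like this is only a hint, not proof of causation, but it is exactly the kind of hint the admin page team never gets when each team builds its timeline by hand.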
Most engineers would agree that the goal of post-mortems is to learn from incidents, so that the company doesn’t keep having the same problems over and over again. You might continue having incidents, but they should at least be novel. From a technical perspective, this means fixing errors in a way that makes the codebase more resilient over time rather than more brittle. From an organizational perspective, it means understanding how the human element of the software lifecycle, from development through deployment and incident response, can be improved.
Whether you’re trying to understand why a particular deployment caused a problem, trying to improve the mean time to resolution or adjusting your pipeline to catch more bugs before they hit production, you need to start with an understanding of what really happened, not what a single individual thinks happened.
In real life, engineering teams don’t have the luxury of spending lots of time compiling an incident timeline. They need a way to easily see all the events that led up to the outage or bug and everything that happened before it was resolved. Making this information accessible and accurate allows everyone to focus on the part of the post-mortem that requires human intelligence: Figuring out the best way to ensure the same type of incident doesn’t happen again.
If you’d like to see how having an automated service-level timeline can help streamline your post-mortem process, we’d be happy to give you a demo.