At a previous job, I worked with a particularly effective incident manager. After working with him for a while, I learned that before going into software engineering he had been an air traffic controller with the Air Force, and he said that there are many similarities between managing aviation incidents and managing software engineering incidents.
My former colleague always came into incidents with a mental checklist that he would work through diligently. Although I don’t think he’s the only engineer with a mental checklist, he was more disciplined about the checklist approach than others. I suspect that is why he was so effective.
Software incidents are less likely to involve loss of life, but they have real-world consequences that range from loss of revenue and trust for the company to users not being able to access an application they depend on. Perhaps a user is unable to check in to an Airbnb and is stranded late at night; perhaps a doctor isn't able to pull up medical records. Even if the stakes are rarely deadly, there are a number of parallels between incident management in a software engineering context and incident management in other contexts, like firefighting or emergency medicine.
One common thread is the risk of cognitive load — and the need for tools and procedures to overcome it.
The cost of cognitive load
There are only so many things that humans can keep in our heads at the same time — and we can only take one action at a time. We’re also biased — if something bad happened as a result of X last time, we will often give X more weight than we should.
Not only can we not keep unlimited information in our head, there are also limits to the amount of information we can process at one time. When we have too much information — like hundreds of alerts coming in at the same time from different sources — we have a tendency to freeze unless there is already a clearly-established course of action to follow. This is a well-documented problem for pilots, which is why they are known for relying on a written checklist that is continually updated to prevent this cognitive overload.
Even if we don’t freeze, it’s easy to forget a step — especially if it’s a step the business cares about but doesn’t directly contribute to fixing the incident at hand. For example:
- Alerting the executives that there’s an outage
- Updating the customer-facing status page with information about the outage
- Checking links to downstream services after the incident has been resolved
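Steps like these are exactly what a lightweight, executable checklist is good at tracking. A minimal sketch of the idea (the step names and the `Checklist` structure are illustrative, not any particular tool's format):

```python
# Minimal incident checklist sketch. Step names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Checklist:
    steps: list[str]
    done: set[str] = field(default_factory=set)

    def complete(self, step: str) -> None:
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        self.done.add(step)

    def remaining(self) -> list[str]:
        # Preserve the original ordering so the next action is obvious.
        return [s for s in self.steps if s not in self.done]

incident = Checklist(steps=[
    "alert executives",
    "update status page",
    "check downstream links after resolution",
])
incident.complete("update status page")
print(incident.remaining())
# → ['alert executives', 'check downstream links after resolution']
```

The point isn't the code itself but the property it encodes: the remaining steps are computed for you, so nothing depends on a stressed engineer's memory.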
During a crisis, it’s important to free up our cognitive resources to process the information and use our skills and training to find a solution. When we’re overloaded with information inputs and/or trying to remember basic information like who the service owner is for an upstream dependency, it prevents us from doing so.
Reality vs the ideal world
How do you free up your mind to work when you're overloaded during an incident? I think there are two very important ways. First, make sure you have a checklist to follow; in our industry, these checklists are known as runbooks. Second, ensure that any information you'll predictably need during an incident, like who owns a service, what the service's dependencies are and how to contact the service owner, is easily available. It's also important to make information accessible in one place, to reduce the amount of toggling between dashboards and context switching necessary.
In real life, runbooks often don't contain the detailed, up-to-date checklists needed to relieve engineers of the cognitive load of remembering which steps to take, and in what order. For example:
- The runbook might say to check AWS, but now the service is running on GCP
- The runbook might have the service owner, but it’s someone who no longer works at the company
Having an out-of-date runbook can be almost worse than having no runbook. When some of the information is clearly wrong, it’s natural to doubt if any of it is accurate. It also adds to the stress, making it even more likely that in the heat of the moment you will forget an important step, make a poor decision or waste time following a red herring.
It's often possible to collect all the information you'd need about dependencies and service ownership during a crisis, but doing so requires checking multiple dashboards and toggling between tools and Google Docs or wiki pages. This wastes valuable time while also adding to the things you need to remember.
Incidents are not always predictable. As engineers, we use a mix of judgement, training and experience to find the quickest possible solution. To do so, we need to have access to an up-to-date, accurate runbook and service catalogue as well as an easy way to see each service’s dependencies.
Effx makes it easier for organizations to proactively ensure runbooks are kept up-to-date instead of learning the hard way when they are not. Schedule a demo and see how it works.