Up-to-date run books with actionable, relevant information give individuals and teams a clear, application-specific roadmap for incident response as well as application-specific best practices for maintenance tasks. When run books are done right, they help teams keep recovery times down and make incident response less stressful. They also dramatically lower the barrier to adding new team members to the on-call rotation by reducing the amount of tribal, organization-specific knowledge needed to respond to an incident.
A good run book eliminates trial-and-error cycles during incident response and gives engineers a clear set of steps to follow instead. It also helps organizations respond consistently. Without a run book, everyone is improvising: different engineers might handle the same incident or routine maintenance task in different ways rather than following the best practices specific to that application. That increases the risk of prolonged incidents and of problems during maintenance tasks.
Run books are also critical to maintaining services. Services aren’t deployed and forgotten — the longest phase in a service’s lifecycle is the maintenance phase. The run book is at least as important as the design document.
Yet run books are often overlooked. I’ve never worked at an organization that gets them completely right. Here are some best practices that will help organizations create run books that make onboarding easier, incident response faster and maintenance smoother.
Write the run book
Obviously, the first best practice is to have a run book — and, relatedly, to store it in a place that’s easily accessible to engineers in the organization when an incident happens or when someone needs to perform maintenance tasks on an application.
This might sound basic, but too often a great run book exists and the only way to find it is by digging through a Slack thread or tracking down a postmortem from a year ago. The end result is basically the same as not having a run book, because no one can find it when they need it.
Think about organization
After the basic best practice of having a run book, the next most important thing is to make information in it very easy to find. The run book’s primary audience is engineers who just got an alert, and they need a way to find relevant information as quickly as possible. The other reason to read the run book is that a routine maintenance task, like a database migration, needs to be handled. Starting with two major sections, one for incident response and the other for maintenance, is a good top-level approach to creating a run book.
For the incident response portion of the run book, a good way to think about the organization is as a flow chart. The most common types of alerts should be addressed specifically. Ideally the hit rate should be quite high: about 90% of the time that someone gets an alert, the run book should outline a course of action specific to that alert. Grouping those alerts thematically, so the run book has sections that cover different types of alerts, makes navigation easier.
Since you can’t anticipate every type of incident, the run book should also outline a clear debugging process for someone to follow if the problem isn’t one of the common alert types. Otherwise, engineers will be left making it up as they go in the most stressful, worst-case scenario, which is precisely when they need a clear debugging checklist to follow.
It’s helpful to do this organizational work before starting to write the run book. The resulting outline will also serve as the basis for the table of contents, which is how the run book should start.
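The flow-chart idea can be sketched as a lookup from alert name to run book section, with the general debugging checklist as the fallback. This is only a minimal illustration: the alert names, section titles and the `find_section` and `hit_rate` helpers are hypothetical, not taken from any real run book or tool.

```python
# Hypothetical mapping from alert name to the run book section that covers it.
RUNBOOK_SECTIONS = {
    "HighErrorRate": "Incident response > Traffic > High error rate",
    "HighLatency": "Incident response > Traffic > High latency",
    "DiskAlmostFull": "Incident response > Storage > Disk almost full",
}

# Fallback for alerts the run book does not address specifically.
GENERAL_DEBUGGING = "Incident response > General debugging checklist"


def find_section(alert_name):
    """Return the section for an alert, or the general checklist if none exists."""
    return RUNBOOK_SECTIONS.get(alert_name, GENERAL_DEBUGGING)


def hit_rate(recent_alerts):
    """Fraction of recent alerts that have a specific section in the run book."""
    if not recent_alerts:
        return 0.0
    covered = sum(1 for name in recent_alerts if name in RUNBOOK_SECTIONS)
    return covered / len(recent_alerts)
```

Running something like `hit_rate` over the last few weeks of pages is one way to check whether the run book really covers the alerts people receive, and which alert types deserve their own section next.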
Make it searchable
The run book should always start with a detailed table of contents, so it’s easy to see which types of alerts are covered in detail and to navigate to that part of the run book easily.
Even with a good table of contents, run books need to be searchable with Command-F (or Ctrl-F). Run books are often long, verbose documents, but engineers need to be able to jump to the relevant information very quickly. The best way to ensure the document is searchable is to include the keywords that appear in the alert itself, so the reader can search for the alert type. At the same time, you should limit the use of those keywords so they only appear where they are really needed and a search won’t return too many hits.
The goal of a run book is to help an engineer fix a problem as quickly as possible, not to explain why the team made certain architectural decisions. Background information should be available, but ideally as a link to a separate document, not as part of the run book. Including too much background information can also make the document more difficult to search, because search terms like ‘memory limits’ or even the service name will get so many hits that the search won’t be useful.
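One rough way to check searchability is to count how often each alert keyword appears in the run book text: zero hits means a search will find nothing, and many hits means the search is too noisy to help. The sketch below assumes the run book is available as plain text; the function names and the `max_hits` threshold of 3 are arbitrary choices made for illustration.

```python
import re


def keyword_hits(runbook_text, keywords):
    """Count case-insensitive occurrences of each keyword in the run book text."""
    return {
        kw: len(re.findall(re.escape(kw), runbook_text, flags=re.IGNORECASE))
        for kw in keywords
    }


def audit_keywords(runbook_text, keywords, max_hits=3):
    """Split keywords into those never mentioned and those mentioned too often."""
    hits = keyword_hits(runbook_text, keywords)
    missing = [kw for kw, n in hits.items() if n == 0]
    noisy = [kw for kw, n in hits.items() if n > max_hits]
    return missing, noisy
```

Keywords in the `missing` list point to alerts the run book never names, while keywords in the `noisy` list suggest background material that could move to a linked document.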
Instead, run books should focus on actionable information: If X happens, do Y to fix it.
The run book should also include warnings about actions NOT to take, if there’s a chance that someone less familiar with the service would reasonably try something that could have negative consequences. It’s also appropriate to warn readers about the possible or likely consequences of following the actions recommended in the run book.
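As an illustration of the "if X happens, do Y" format, an entry could be modeled as a small record holding the alert, the steps, the actions to avoid and the known consequences. The `RunbookEntry` structure and the `render` function below are hypothetical, and printing the warnings ahead of the steps is a design assumption rather than an established rule.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookEntry:
    """One 'if X happens, do Y' entry (hypothetical structure)."""
    alert: str
    steps: list
    do_not: list = field(default_factory=list)
    side_effects: list = field(default_factory=list)


def render(entry):
    """Render an entry as plain text, with warnings ahead of the steps."""
    lines = [f"Alert: {entry.alert}"]
    lines += [f"DO NOT: {warning}" for warning in entry.do_not]
    lines += [f"{i}. {step}" for i, step in enumerate(entry.steps, start=1)]
    lines += [f"Side effect: {effect}" for effect in entry.side_effects]
    return "\n".join(lines)
```

Putting the "DO NOT" lines first means that someone skimming under pressure sees the dangerous actions before they start working through the steps.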
Effx’s modern service catalogue encourages teams to write and update run books for their services, and provides an easy way to navigate to them from the service dashboard.