Incident response from monolith to microservices
Moving to a microservices architecture is a major change for any engineering organization. It affects everything, from how teams collaborate and the tools they use to how systems are deployed and how incidents are handled in production.
What's different about all-remote incident response?
At first glance, incident response seems to be the one thing relatively unaffected by the transition to all-remote. Even in a small company, only around 20 to 30% of incidents are handled ‘in person,’ with a team of people all huddled around the same conference table.
How to write a runbook
Up-to-date runbooks with actionable, relevant information give individuals and teams a clear, application-specific roadmap for incident response, as well as application-specific best practices for maintenance tasks.
Best practices for incident response
When things go wrong, money and reputation are on the line. Minutes matter. Incident response is stressful and high-stakes. Nonetheless, there is a right and a wrong way to go about incident response. Some organizations and some individuals seem to excel at incident response, and others struggle.
How to run a post-mortem
Every engineering organization should be continually striving to create more resilient applications and processes. Incidents are a reality in any organization, but effective post-mortems help improve the incident response process, resolve the technical and organizational issues that cause incidents, and build increasingly resilient applications.
How much visibility is enough?
During an incident, visibility into the system is the only way to approach debugging methodically. Without visibility, engineers who respond to an incident have nothing to rely on but tribal knowledge and gut instinct to find the cause of the problem and resolve it.
Why timelines matter during an incident
The first step during an incident is finding out what is broken and how it broke. In most cases, a specific action can be identified as the likely cause of the incident. Ideally, that action can be undone and the problem easily solved.
How to reduce cognitive load during an incident
At a previous job, I worked with a particularly effective incident manager. After working with him for a while, I learned that before going into software engineering he had been an air traffic controller with the Air Force — and he said that there are many similarities between managing incidents in an aircraft and managing software engineering incidents.