Service-level visibility during an incident

William Li
November 13, 2020

It’s 5 a.m. on your first on-call rotation at a new company. You were hired quickly, when the organization’s most senior engineer retired to Costa Rica, changed her phone number and stopped responding to emails.

Now the Twitter feed is going crazy with users in Europe who say they can’t log in.

What is the name of the service that handles authentication? Was it the Farm service or the Forest service or the Deep Space Nine service?

You spend 10 minutes scouring the service catalog, a spreadsheet, before figuring out it’s the Deep Space Nine service.

But Deep Space Nine looks fine. So what dependencies does it have? Back to the spreadsheet.

You’ve determined that Deep Space Nine handles authentication and depends on Chicken. Nothing has changed in Deep Space Nine, but Chicken is experiencing an elevated error rate.

Happily, you have a link to the run-book. Less happily, it looks like it hasn’t been updated in over two years.

The contact person is the same senior engineer who just moved to Costa Rica.

Most software issues are like aviation accidents — they happen as a result of human error. Occasionally a problem is due to your cloud provider or a customer’s change in usage, but it’s more likely to be something under your control. The good news is that means you can fix it — provided you have enough information.

In a service-oriented architecture, the things that humans make changes to are services. So when something goes wrong, the person in charge of fixing it has to be able to quickly identify which service or services are involved, what the upstream and downstream dependencies are, when each service was most recently updated and whom to contact for help rolling back the change.
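To make that list concrete, here is a minimal sketch in Python of what a single catalog entry and a dependency lookup might look like. It is an illustration only, not how effx stores data: the `Service` fields, the helper names, the contact addresses and the deploy dates are all hypothetical, and the service names are borrowed from the story above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Service:
    """One hypothetical catalog entry: the facts an on-call engineer needs."""
    name: str
    purpose: str
    owner_contact: str
    last_deployed: datetime
    depends_on: List[str] = field(default_factory=list)


def dependencies_of(catalog: Dict[str, Service], name: str) -> List[str]:
    """Return every service that `name` depends on, directly or transitively."""
    seen: List[str] = []
    stack = list(catalog[name].depends_on)
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # already visited; also guards against dependency cycles
        seen.append(dep)
        if dep in catalog:
            stack.extend(catalog[dep].depends_on)
    return seen


def most_recently_deployed(catalog: Dict[str, Service],
                           names: List[str]) -> Optional[Service]:
    """Among the named services, return the one with the latest deploy."""
    candidates = [catalog[n] for n in names if n in catalog]
    return max(candidates, key=lambda s: s.last_deployed, default=None)


# Made-up example data: contacts and dates are invented for illustration.
catalog = {
    "Deep Space Nine": Service(
        name="Deep Space Nine",
        purpose="authentication",
        owner_contact="auth-team@example.com",
        last_deployed=datetime(2020, 9, 1),
        depends_on=["Chicken"],
    ),
    "Chicken": Service(
        name="Chicken",
        purpose="unknown",
        owner_contact="platform-team@example.com",
        last_deployed=datetime(2020, 11, 12),
    ),
}

# Start from the service users are complaining about, then widen the search.
suspects = ["Deep Space Nine"] + dependencies_of(catalog, "Deep Space Nine")
first_look = most_recently_deployed(catalog, suspects)
```

With this made-up data, `first_look` is the Chicken entry, since it was deployed most recently, and its `owner_contact` is the person to call about a rollback. The point of the sketch is that the answer takes a lookup rather than a 10-minute spreadsheet search, provided the fields are kept current.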

At the time of writing, effx is an 8-person startup and we have 20 services. At our size, it’s still possible for every engineer to keep a mental service catalog in their head, but even at just 20 services we’re pushing up against the maximum that any person can keep track of without a good tool. As both teams and the number of services grow, the mental service catalog becomes impossible to maintain, especially in the middle of the night as you’re groggily trying to sort out what’s going on. Even a midsized company will have hundreds of services; enterprises often have thousands of developers and thousands of services.

With so many services, it’s nearly impossible to orient yourself during an incident and quickly understand where the problem is likely to be, especially when the complaints are coming from users rather than an alerting system. Even when you know which service to look at, most observability platforms don’t provide service-level granularity or timelines, so it’s challenging to see if the service experiencing a problem has been updated recently or if the problem is likely a result of an upstream dependency.

Understanding what needs to be fixed is the first step in actually fixing the problem. When service catalogs are out of date or just hard to make sense of, incident response times go up.

Leave the Service Catalog Better than You Found It

As engineers, we should all aspire to leave everything we touch in better condition than we found it. No one would intentionally create more technical debt or leave the codebase more brittle than it was before. But when service catalogs are spreadsheets and wiki pages, leaving them in better condition than we found them is a challenge. As a result, incremental improvement is difficult: updates are more likely to come from the sporadic, Herculean effort of one dedicated team member manually verifying service ownership and contact information. Crowdsourcing the information is next to impossible.

This system all but ensures that accurate, up-to-date information about service ownership is unavailable during an incident.

With effx, it’s easy to update the service catalog on-the-fly, so every engineer can easily change information that’s wrong. This encourages a shift in workflow that empowers engineers to actually leave the service catalog in better condition every time they touch it.

The platform also automatically collects service-level information about deployments and dependencies, so that during an incident it’s easy to find out which service is responsible for what functionality and to investigate all of the dependencies.

Having a service catalog that is updated both automatically (the service and dependency timelines) and through crowdsourced information (the service ownership and preferred contact information) means that not only will incident response times go down, but each incident will also contribute information that improves the service catalog, making recovery increasingly smooth.

If you’d like to see how the right service-level visibility can help during an outage, we’d be happy to give you a demo.