If you live in a town of 150 people, there’s a pretty good chance that not only do you know the names of everyone who lives in the same town, but you also know how everyone is connected to each other. A lost child doesn’t even need to know his name — everyone in the town already knows it, as well as his home address and his parents’ names.
You’re able to immediately spot things that are unusual too: two people who don’t usually get along having dinner together, or someone missing from a family outing. But once that town is 10 times larger, you’ll still know a lot of people by name, but not all of them. You won’t have a mental map of how everyone is related or be able to immediately understand how two people are connected.
According to Dunbar’s number, proposed by the anthropologist Robin Dunbar, humans are able to maintain about 150 stable relationships at a time. Beyond that, the mental map breaks down. This assumes, by the way, that everyone spends a considerable amount of time in social interactions to keep their mental map up-to-date.
So what does this have to do with service catalogues?
The information we need to have at our fingertips about services — especially during an incident — is like the information we would keep in our heads about people. We need to know, first of all, the service’s name. We need to understand how it’s related to other services, both in terms of functionality as well as upstream and downstream dependencies that may or may not be obvious. We need to understand what’s happened recently to that service, as well as to all the other services it depends on.
It’s possible to keep all of that information mentally when you’re dealing with a limited number of services. But by the time you have 50 or 100 services and 25 to 30 engineers, the mental service catalogue starts to break down.
Even with 50 services, you need to keep track of hundreds of relationships. Many of them form circular dependencies, where A depends on B, which depends on C, which depends on A. There are also edge cases to account for: maybe service B only depends on A in some cases.
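To make the circular-dependency problem concrete, here is a minimal sketch of how a tool, rather than a person, might spot a loop like A, B, C, A. The `DEPENDS_ON` map and the `find_cycle` helper are hypothetical names made up for illustration; a real catalogue would build the map from its own data.

```python
from typing import Dict, List, Optional

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A"],  # closes the loop A -> B -> C -> A
    "D": ["A"],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of service names, or None."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in graph.get(node, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # We walked back into a service we are still exploring.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for service in graph:
        if state.get(service, UNSEEN) == UNSEEN:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


cycle = find_cycle(DEPENDS_ON)
print(" -> ".join(cycle) if cycle else "no circular dependencies")
```

A depth-first walk like this is trivial for a machine at 50 services or 6,000. Doing the same walk from memory in the middle of an incident is not.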
Here’s a list of things you’d have to keep track of mentally with 30 engineers and 100 services:
- The name and function of each service
- How each service depends on the other services, including circular dependencies and edge cases
- What changes have been made to each service in the past day, week, month
- The names of 29 other engineers and which service(s) each one is responsible for
- How the other engineers are grouped into teams
- How to contact each engineer/team
- Who is on vacation right now
- Where each engineer is physically located, and whether it is the middle of the night there
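Written down instead of memorized, that list is a fairly small data model. The sketch below is one possible shape, not a description of any particular product: the `Engineer` and `Service` classes, the `who_to_contact` helper, and all the sample data are hypothetical, and local time is approximated with a fixed UTC offset that ignores daylight saving time.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List


@dataclass
class Engineer:
    name: str
    team: str
    contact: str
    utc_offset_hours: int  # assumption: fixed offset, no daylight saving
    on_vacation: bool = False


@dataclass
class Service:
    name: str
    function: str
    owners: List[str]  # engineer names
    depends_on: List[str] = field(default_factory=list)
    recent_changes: List[str] = field(default_factory=list)


def local_hour(engineer: Engineer, now_utc: datetime) -> int:
    """Approximate hour of day where the engineer is located."""
    return (now_utc + timedelta(hours=engineer.utc_offset_hours)).hour


def who_to_contact(
    service: Service, engineers: Dict[str, Engineer], now_utc: datetime
) -> List[Engineer]:
    """Owners who are not on vacation, with those in waking hours
    (08:00 to 22:00 local time) listed first."""
    available = [
        engineers[name]
        for name in service.owners
        if not engineers[name].on_vacation
    ]
    return sorted(available, key=lambda e: not (8 <= local_hour(e, now_utc) < 22))


# Hypothetical sample data.
ENGINEERS = {
    "Joe": Engineer("Joe", "payments", "joe@example.com", -8, on_vacation=True),
    "Ana": Engineer("Ana", "payments", "ana@example.com", 1),
    "Raj": Engineer("Raj", "payments", "raj@example.com", 8),
}
BILLING = Service(
    name="billing",
    function="Issues invoices",
    owners=["Joe", "Ana", "Raj"],
    depends_on=["ledger"],
    recent_changes=["Deployed retry fix"],
)

incident_time = datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc)
for person in who_to_contact(BILLING, ENGINEERS, incident_time):
    print(person.name, person.contact)
```

The point is not the code itself. Every bullet above becomes a field that can be looked up in seconds, as long as something keeps those fields current.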
Even if you are capable of keeping all of this information in your head (and continually updated), the process of memorizing it will be challenging. There’s also an opportunity cost: if there are only so many things you can keep in your head at one time, isn’t there some other information that you would be better off learning?
In reality, 30 engineers and 100 services is not a large engineering organization. In an enterprise setting, there might be 4,000 engineers and 6,000 services. Teams will be distributed across different cities and timezones — no one person will know the name of every engineer (or every service) in the organization. It might not even be possible to gather a team of people who would collectively have all that information in a mental map.
Would you rather take an open-note exam or one in which you had to have everything memorized? When an incident happens, it’s a high-stakes test of the on-call engineer’s understanding of the service catalogue. Even with perfect information, fixing the problem requires skill. As individuals and organizations, wouldn’t we rather have engineers focused on using their experience and training to find the most effective way to fix the problem rather than trying to remember if Joe’s vacation is this week or next week? Wouldn’t it be better for the on-call engineer to alert the right person immediately, instead of digging through Google Docs just to figure out who to call?
Operating at scale
At scale, when you have thousands of engineers and thousands of services, it’s impossible for one person — or even a handful of people — to keep track of the complex relationships between services and the engineers who are responsible for them. This increases the risk that an engineer won’t have enough information during an incident, making it more likely the ‘fix’ will make things worse. It also creates an unnecessary burden of uncertainty during incidents as well as during post-mortems.
As organizations get more complicated they often rely heavily on the most senior engineers to act as their ‘service catalogue,’ taking those engineers away from more valuable projects and running the risk of burnout when they’re interrupted around the clock to help fix incidents.
If you’re operating at even moderate scale, with more than 100 services, the lack of a coherent, up-to-date service catalogue becomes a major liability during incidents, makes it hard to run effective post-mortems, and increases your dependence on a couple of senior engineers.
Curious how a service catalogue that is always up-to-date would make things easier for your team? Set up a demo and we’ll show you.