During an incident, visibility into the system is the only way to approach debugging methodically. Without it, engineers who respond to an incident have nothing to rely on but tribal knowledge and gut instinct to find the cause of the problem and resolve it, and fixing incidents becomes a series of blind trial-and-error attempts.
But how much visibility is enough? What parts of the system do you need to be able to understand? How granular do you need that information? Is every minute enough? Every five minutes? Every second?
In my experience, most engineers have enough information for simple or routine incidents. For problems that happen often, teams frequently create custom dashboards specifically to troubleshoot that type of error. Information about how to resolve common problems will likely be shared during the onboarding process, and finding someone who can easily fix them isn’t challenging.
It’s when errors are novel that things start to go awry. More complicated errors — or just errors that haven’t occurred before — are always harder to find, but lack of visibility can make them much more time-consuming to resolve than needed. When an error can’t be explained by the ‘usual’ causes, most engineers find themselves wishing they had more visibility, in a more human-readable format, than they have available.
Ultimately, the answer to how much visibility is enough depends on how important it is to respond quickly to incidents. If quick recovery doesn’t matter and the organization doesn’t mind having engineers spending a lot of time fixing errors, visibility isn’t crucial. But if quick recovery is important and the organization would rather have engineers creating business value than digging through logs, visibility is essential — and most organizations don’t have enough.
Troubleshooting a complex problem
Let’s walk through the steps that an engineer might take when an alert comes in about a problem with a service.
- Check the dashboard for that service.
- Check the general health dashboard.
- Check general health broken down by geographic region or country.
In many cases, the on-call engineer will find enough information in one of those dashboards to fix the problem. But sometimes everything seems normal in all the regular dashboards — that’s when things get tricky.
The next step is generally to start looking at logs. If you’re lucky, you’ll have tools in place to filter the logs and make them easier to search through, but even so, logs are not very human-readable.
The log problem
If anyone has to dive into logs directly, it’s because the system does not have enough visibility. Even scanning logs requires a certain amount of tribal knowledge: each company will have a slightly different log structure, and checking logs requires knowing which logs are part of the baseline and which ones are anomalies. Finding information in logs requires an idea of what you’re looking for, which means it’s not very effective in situations that are truly novel.
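To make the baseline-versus-anomaly idea concrete, here is a minimal Python sketch. It assumes you have a sample of log messages from a normal period to compare against; the function names `template` and `new_templates` are hypothetical, and the normalization rule (replacing numbers with a placeholder) is just one simple choice, not a standard.

```python
import re
from collections import Counter

def template(message):
    """Collapse variable parts (decimal numbers, hex ids) so similar messages group together."""
    return re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<n>", message)

def new_templates(baseline_messages, incident_messages):
    """Count message shapes seen during the incident that never appear in the baseline."""
    baseline = {template(m) for m in baseline_messages}
    counts = Counter(template(m) for m in incident_messages)
    return {shape: count for shape, count in counts.items() if shape not in baseline}

# Made-up messages, for illustration only.
baseline_sample = ["request 12 served in 30 ms", "cache miss for key 991"]
incident_sample = [
    "request 44 served in 12 ms",
    "upstream timeout after 5000 ms",
    "upstream timeout after 5000 ms",
]
unusual = new_templates(baseline_sample, incident_sample)
```

Even this small sketch depends on knowing what a normal period looks like and how your messages are shaped, which is exactly the kind of tribal knowledge described above.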
Filtering out the noise to find the handful of log lines that contain useful information is challenging. I’ve seen companies with a billion log lines per hour. Narrowing that down to the 10 log lines you need is only possible if you understand ahead of time how to filter the logs and what types of logs you need.
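As a rough illustration of how much you need to know up front, here is a minimal Python sketch of a log filter. It assumes logs are JSON lines with hypothetical `ts`, `service`, `level`, and `msg` fields and timezone-aware ISO 8601 timestamps; your own log structure will almost certainly differ.

```python
import json
from datetime import datetime

# Hypothetical severity ranking; real systems may use different level names.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}

def filter_logs(lines, service, min_level, start, end):
    """Yield parsed records for one service, at or above a level, inside a time window."""
    threshold = LEVELS[min_level]
    for line in lines:
        try:
            record = json.loads(line)
            when = datetime.fromisoformat(record["ts"])
            level = LEVELS[record["level"]]
            in_window = start <= when <= end
        except (ValueError, KeyError, TypeError):
            continue  # skip unstructured lines, unexpected fields, or mismatched timezones
        if record.get("service") == service and level >= threshold and in_window:
            yield record

# Made-up log lines, for illustration only.
sample = [
    '{"ts": "2024-01-01T10:00:05+00:00", "service": "checkout", "level": "ERROR", "msg": "upstream timeout"}',
    '{"ts": "2024-01-01T10:00:06+00:00", "service": "search", "level": "ERROR", "msg": "index unavailable"}',
    '{"ts": "2024-01-01T10:00:07+00:00", "service": "checkout", "level": "INFO", "msg": "request served"}',
    "plain text line that is not JSON",
]
start = datetime.fromisoformat("2024-01-01T10:00:00+00:00")
end = datetime.fromisoformat("2024-01-01T10:05:00+00:00")
matches = list(filter_logs(sample, "checkout", "WARNING", start, end))
```

Every argument to `filter_logs` is something the engineer has to already know or guess: which service, which severity, which time window. When the incident is novel, those guesses are the hard part.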
Once you’ve found the relevant log entries, you’ll still need to compare the information from the logs with information from your high-level dashboards. Wouldn’t it be better if you could find everything you need in the high-level dashboards?
Avoiding the deep dive
With the right tools, you can avoid having a human dig through log lines and instead present the relevant information in a high-level dashboard. The fewer dashboards engineers need to look at, the faster they will be able to find the source of the error and fix it, and not having to dig through logs dramatically reduces the time it takes to find the error.
So how much visibility is enough?
The reality is that few engineers will ever feel like they have all the information they need for every possible scenario. As long as information is presented in a way that’s human-readable and easy to navigate, it’s not possible to have too much visibility. One of the challenges in modern applications is that there are so many layers and management tools that work automatically, creating a huge amount of information while also obscuring how the application works.
Visibility can seem like it’s not important — until it is. Many organizations expand visibility into their systems as a result of post-mortems after incidents that were challenging to fix.
Service-level visibility is critical for microservice-based applications. Better visibility into how services depend on each other, with all relevant changes presented in a timeline format, makes it possible for engineers to avoid digging into logs entirely. This makes it faster to debug incidents and easier for new team members to confidently join the on-call rotation.
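To sketch what a dependency-aware change timeline could look like, here is a minimal Python example. It assumes you already have a map of which services depend on which, plus a list of change events such as deploys and config changes; the `Change` type and the `upstream_of` and `change_timeline` functions are hypothetical names for illustration, not any particular tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Change:
    service: str
    kind: str       # for example "deploy" or "config change"
    at: datetime

def upstream_of(service, depends_on):
    """Return the service plus everything it depends on, directly or transitively."""
    seen, stack = set(), [service]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen

def change_timeline(service, depends_on, changes, alert_time, lookback):
    """List recent changes to a service or its dependencies, oldest first."""
    relevant = upstream_of(service, depends_on)
    window_start = alert_time - lookback
    hits = [c for c in changes
            if c.service in relevant and window_start <= c.at <= alert_time]
    return sorted(hits, key=lambda c: c.at)

# Made-up services and events, for illustration only.
depends_on = {"checkout": ["payments"], "payments": ["database"], "search": []}
alert_time = datetime(2024, 1, 1, 10, 0)
changes = [
    Change("search", "deploy", datetime(2024, 1, 1, 9, 50)),
    Change("database", "config change", datetime(2024, 1, 1, 9, 45)),
    Change("payments", "deploy", datetime(2024, 1, 1, 9, 30)),
    Change("checkout", "deploy", datetime(2024, 1, 1, 7, 0)),
]
timeline = change_timeline("checkout", depends_on, changes, alert_time, timedelta(hours=1))
```

In this made-up example, an alert on `checkout` surfaces the recent `payments` deploy and `database` config change, while the unrelated `search` deploy and the older `checkout` deploy are left out. That is the kind of answer an on-call engineer would otherwise have to piece together from logs.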