In an ideal scenario, a reasonably experienced engineer who is new to the team should become part of the on-call rotation as soon as he or she has a functioning company computer and credentials to access any systems needed to work on the code.
In the real world, it’s rare for companies to have all the resources in place that would let a new engineer successfully handle an issue, whether it means fixing a relatively minor problem or alerting the right people to fix a more complex issue. That means it takes way too long to get new engineers into the rotation, often leading to a vicious cycle of tribal knowledge and burnout.
So what if it’s always the same two people on call?
First of all, one metric used by DevOps teams and managers to evaluate the team’s health is workload balance: how much each person contributes to the overall team’s success. This isn’t just a matter of fairness. Teams work better when everyone contributes, if not exactly equally, then at least in a way that’s balanced.
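One rough way to put a number on workload balance is to look at what fraction of recent on-call shifts each person covered. The sketch below is only an illustration: the function name, the team members, and the idea of counting shifts (rather than, say, pages or hours) are all assumptions, not an established metric.

```python
from collections import Counter


def on_call_share(shifts, team):
    """Return each team member's fraction of the on-call shifts covered.

    `shifts` is a list with one entry per shift, naming who covered it.
    `team` is the full roster, so people with zero shifts still appear.
    Both inputs are hypothetical; adapt them to however shifts are tracked.
    """
    counts = Counter(shifts)
    total = len(shifts)
    return {person: (counts[person] / total if total else 0.0) for person in team}


# Hypothetical team of six where the same two people cover every shift.
team = ["ana", "ben", "chris", "dee", "eli", "fay"]
shifts = ["ana", "ben"] * 6
shares = on_call_share(shifts, team)
```

In this made-up example, two people each carry half of the shifts and the other four carry none, which is the kind of imbalance worth flagging.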
When only two out of six team members are part of the on-call rotation, it creates the following problems:
- It’s not a true rotation. The same people are essentially always on call, either as the primary or secondary responder.
- It discourages updating the run book. When engineers only ever need to share organization-specific or application-specific information about incident response with one colleague or their future selves, it’s easy to get sloppy about updating run books. They assume they’ll just remember, or they send the information to their colleague in an email, or they make notes in the run book that don’t make sense to anyone else.
- It wastes the organization’s most valuable engineering talent. The engineers who end up always on call are typically the most experienced ones, with the longest tenure at the company. When they are constantly being asked to fix even minor issues, they don’t have time to work on product development, i.e., using software to create value for the organization.
- It creates burnout. Constant interruptions are frustrating, even if those interruptions only occur during the workday and even if it’s a five-minute fix. Expecting the same people to handle all of the incidents risks burnout.
- It’s a business risk. If you depend on a very limited number of people to respond to incidents, even one of them leaving the company can be very disruptive, especially if they haven’t been updating the run books, as happens frequently.
- It limits newer team members’ career growth. Taking on additional responsibilities like on-call shifts is a way for newer engineers to demonstrate their skills. Without the opportunity to do so, they can become bored with the job and be unfairly passed over for promotions.
DevOps managers and team leaders should aim to get new team members into the on-call rotation as quickly as possible. Not only does this ease the burden on everyone else, but having the resources in place to make an on-call rotation successful will also help get new team members up to speed more quickly in other ways.
Getting a new team member ready for an on-call rotation requires giving that person confidence that resources exist to handle any situation. Specifically, there should be resources that will help them solve as many problems as possible without contacting anyone else, along with specific guidance about who to contact if the problem isn’t easily fixed.
Building confidence isn’t about giving pep talks. As humans, we feel confident when we know we’ll be able to handle a situation. Here’s what new team members need to have, at their fingertips, to feel confident going into the first on-call rotation at a new organization:
- Documentation about each service, including the service’s entire tech stack and its purpose.
- Up-to-date documentation and run books. Nothing kills confidence like being presented with a run book and being told it hasn’t been updated in two years.
- A list of each service’s dependencies. If nothing in the run book solves the problem, where should the engineer look next?
- The ability to read about past incidents through post-mortems. This is particularly useful when preparing for the on-call rotation.
- Lastly, who should the engineer contact if he or she can’t solve the problem? How should the person be contacted?
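The checklist above can be turned into a simple automated check that flags what is missing for a given service before someone new goes on call. The following is a minimal sketch under stated assumptions: the `ServiceRecord` fields, the `readiness_gaps` function, and the one-year staleness threshold are all hypothetical and would need to match however your team actually stores this information.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ServiceRecord:
    """Hypothetical catalog entry for one service; field names are illustrative."""
    name: str
    purpose: str = ""
    tech_stack: List[str] = field(default_factory=list)
    runbook_last_updated: Optional[date] = None  # None means no run book exists
    dependencies: List[str] = field(default_factory=list)
    postmortems: List[str] = field(default_factory=list)  # links or titles
    escalation_contact: str = ""   # who to contact
    escalation_channel: str = ""   # how to contact them (pager, chat, phone)


def readiness_gaps(svc, today, max_runbook_age_days=365):
    """Return a list of checklist items that are missing or stale for `svc`.

    The 365-day default is an arbitrary assumption, not a recommendation.
    An empty dependency list is treated as "not documented", which is also
    an assumption; a service with truly no dependencies would need a marker.
    """
    gaps = []
    if not svc.purpose or not svc.tech_stack:
        gaps.append("service documentation (purpose and tech stack)")
    if svc.runbook_last_updated is None:
        gaps.append("run book")
    elif today - svc.runbook_last_updated > timedelta(days=max_runbook_age_days):
        gaps.append("run book is stale")
    if not svc.dependencies:
        gaps.append("dependencies")
    if not svc.postmortems:
        gaps.append("post-mortems")
    if not svc.escalation_contact or not svc.escalation_channel:
        gaps.append("escalation contact and channel")
    return gaps


# Hypothetical example: documented service with an old run book and no escalation path.
checkout = ServiceRecord(
    name="checkout",
    purpose="Handles payments",
    tech_stack=["Python", "PostgreSQL"],
    runbook_last_updated=date(2020, 1, 1),
    dependencies=["payments-db"],
)
gaps = readiness_gaps(checkout, today=date(2023, 1, 1))
```

A check like this does not judge whether the documentation is any good. It only makes the gaps visible, so they can be closed before the new engineer's first shift.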
Part of the ethos of DevOps is making both teams and individuals responsible for their own services. That needs to include dramatically lowering the barrier to entry for people to go on call. Putting the right resources in place is the only way to do so without inviting disaster.
Going on call for the first time is nerve-wracking, and probably always will be.
Effx makes it easier for organizations to make sure they have all the resources engineers need to be successful during their first on-call rotation. Schedule a demo to see how.