When migrating from the grasp of a monolith to a service-oriented architecture, there are plenty of technical considerations to weigh. Teams will spend countless hours debating the merits (or not) of monorepos, the benefits of Kubernetes over ECS (or vice versa), which open-source or third-party monitoring and introspection tools they’ll use to diagnose issues, and countless other workflows that matter for the microservices platform they’re building.
However, one key effort that usually gets overlooked is defining what service ownership means within your engineering organization in this new, microservices world.
So what really is service ownership?
There are quite a few amazing articles written about the “code it, ship it, own it” portion of service ownership, especially as it pertains to on-call and incident management. See Julie Gunderson’s recent write-up, “Code it, ship it, own it with full-service ownership,” and the venerable Liz Fong-Jones’s article, “Sustainable Operations in Complex Systems with Production Excellence.”
Beyond on-call, there’s so much more that goes into the day-to-day of what a service owner is called upon for, including:
- Being the point of contact for the security team when any priority security issue is raised
- Staffing a Slack channel to answer questions about how to integrate with your service, and generally being helpful
- Keeping the service’s documentation and runbooks as up to date as you can
- Ensuring that observability dashboards are easy to find, easy to peruse, and backed by suitable tools for digging deeper and analyzing
- Planning and adapting for capacity and ensuring performance is as efficient as possible
- Making future on-call rotations better by learning from current and past ones
- Triaging and fixing bugs, diving into performance issues, speeding up builds, and of course, all the toil and yak-shaving that we spend a large share of our time on
Now imagine an engineering organization where service ownership isn’t well-defined or implemented and how all of the above breaks down.
Earlier this year, we researched the efforts of a handful of engineering teams, ranging from early-stage startups that went all-in on microservices from the get-go to large enterprise teams that had recently slogged through multi-year journeys (many still in progress) from monoliths to a microservices approach.
There were a few common themes, but most importantly, most teams struggled with the concept of service ownership. Cultural issues were a noted cause. The lack of tooling was another. So was a gap in understanding of which responsibilities the infrastructure team intended to hand off to the service owners.
Everyone struggled with all of these same things.
Unowned services, and the work of discovering them, cause unending pain.
Imagine a world where a vulnerability hits a language or platform you use in the lion’s share of your microservices. Now imagine someone in your organization, say a project manager or a security engineer, building a spreadsheet of all the affected services, noting each service’s owner for follow-up, and managing hundreds (or thousands) of points of coordination to ensure that everything is completed in a timely fashion. That person finds new owners for the orphaned services and all is well. Phew.
And then a new vulnerability hits that same language or platform 90 days later. Again, the fix could be as simple as build, validate, deploy (steps that could even be fully automated), but it still needs someone familiar with the service to confirm compatibility and to step in if something goes terribly wrong. Since the last issue, there have been a few changes in the engineering organization, and maybe the first few teams you contact to get the process started no longer exist. That project manager essentially has to start from zero on a new spreadsheet, re-validating that services are still owned by the same teams or finding the proper mapping.
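The spreadsheet exercise described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Service` record, its `runtime` and `owner_team` fields, the `plan_remediation` helper, and all service and team names are hypothetical, and it presumes you already have a list of services and a list of teams that currently exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    # Hypothetical record shape, not taken from any real tool.
    name: str
    runtime: str      # language or platform the service is built on
    owner_team: str   # team recorded as owner; may be out of date


def plan_remediation(services, affected_runtime, active_teams):
    """Group affected services by owning team and collect orphans.

    A service counts as orphaned when its recorded owner is not in
    the set of teams that currently exist.
    """
    by_team = {}
    orphaned = []
    for svc in services:
        if svc.runtime != affected_runtime:
            continue  # not touched by this vulnerability
        if svc.owner_team in active_teams:
            by_team.setdefault(svc.owner_team, []).append(svc.name)
        else:
            orphaned.append(svc.name)
    return by_team, orphaned


# Made-up example data.
services = [
    Service("checkout", "python3.8", "payments"),
    Service("search-api", "python3.8", "discovery"),
    Service("legacy-mailer", "python3.8", "growth-2019"),
    Service("image-resizer", "go1.15", "media"),
]
active_teams = {"payments", "discovery", "media"}

by_team, orphaned = plan_remediation(services, "python3.8", active_teams)
```

Here `legacy-mailer` lands in `orphaned` because its recorded team no longer exists. The code is the easy part; the hard part is keeping `owner_team` and `active_teams` accurate between incidents.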
From our research, this is happening at nearly every company we talked to.
HR systems manage to get updated when re-orgs happen or people change teams, but somehow, service ownership is left outdated.
One of the issues that came up frequently in our research was the opinion that service ownership problems stem from cultural problems in engineering: an apathy toward having the tough conversations about who should own what when a team’s charter changes and its old projects are left behind.
And yet, teams will take on the risk of leaving an unowned service in production and only have that tough conversation once downtime or tragedy befalls that service.
If your organization can keep Workday or Namely up to date with who reports to whom, shouldn’t an engineering team, in turn, be able to update a system that keeps track of who owns what?
One of the reasons I started building effx is that I saw some of these challenges while running reliability engineering at Airbnb. We built an internal tool to track ownership in code (YAML, of course) that we did our best to keep up to date. It still required a project manager frequently running around the halls to keep the information current.
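To make the upkeep problem concrete, here is a minimal sketch of a staleness check over ownership records. It is not the Airbnb tool or the effx implementation. The record shape, the `last_verified` field, the 90-day threshold, and the `stale_records` helper are all assumptions made for illustration, and the records are written as a Python dictionary rather than loaded from YAML so the example stays self-contained.

```python
from datetime import date, timedelta


def stale_records(records, today, max_age_days=90):
    """Return names of services whose ownership has not been
    confirmed within the last `max_age_days` days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, record in records.items()
        if record["last_verified"] < cutoff
    )


# Hypothetical ownership records, keyed by service name.
records = {
    "checkout": {"owner": "payments", "last_verified": date(2020, 9, 1)},
    "legacy-mailer": {"owner": "growth", "last_verified": date(2020, 3, 15)},
}

needs_review = stale_records(records, today=date(2020, 10, 1))
```

With this sample data, only `legacy-mailer` is flagged. A check like this can tell you which records to question, but someone still has to confirm or reassign the owner.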
When we were able to corroborate our findings with peer companies that had recently begun, were in the middle of, or had finished their journey to microservices, we set out to build a tool that could overcome the cultural challenges our friends had spoken of.
By invoking some timely flows and an elegant interface to continually validate ownership, keep it up-to-date, and resolve conflicts, we hope that we can help you and your team free up that project manager’s time, eliminate unowned services and prevent them from recurring.
Want to learn more about service ownership from an effx perspective? Sign up for a demo of our platform at https://effx.com, or reach out to chat more at @joeyparsons on Twitter or by email at joey [at] effx.com.