A curious quality of SREs is that academia didn’t prepare them for a role in this field. Instead, you find former software developers who found joy in preventing system outages. Others found downtime so reprehensible that they dedicate their careers to preventing it.
I began my career as an Embedded Systems developer with a background in Electrical Engineering. Uptime wasn’t in my vernacular yet, but if my program exited unexpectedly, the device would just turn off. I’d have to hope my mistake was recoverable by a power cycle, so the criticality of uptime was already in my blood by the time I transitioned into SRE.
A few years and a career change later, I worked as a backend developer at a startup. My day-to-day included building features, deploying our application on-premises for customers and managing a cloud deployment of our own. Building product features was rewarding, but allowing product development velocity to grow without impacting our customers was even more attractive to me.
In retrospect, I introduced Site Reliability Engineering techniques to the organization for more reasons than reliability alone. I wanted developers to trust that our automated tests would reproduce the complexities of our production environments. With that trust, they could focus on building a great product and not infrastructure. Beyond reliability, I value fostering trust and engineering excellence, which I believe are the bedrock of innovation on any team.
An SRE’s Growth Cycle
Site Reliability Engineering is a discipline that leverages software engineering principles to solve infrastructure and operational problems. An SRE plays a complementary role in a modern web development team. Their expertise in reliable production systems manifests through tools, testing and deployment automation, and scalable system architecture. The approach to solving reliability is unique to each organization. There’s one thing everyone agrees on: doing it right requires organizational cooperation.
Building a reliable production system requires a breadth of knowledge that’s acquired through years of experience. Those years expose you to critical issues and the pain of managing brittle applications. Painful as it is, this time is important for growth as an SRE, because you learn a few things along the way:
- The importance of testing your features before you ship them
- How to build features that work on your dev machine and for millions of users
- How to log the right information to solve issues that only happen in production
- How to set proper alerts if something goes wrong
- How to identify and monitor important metrics
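To make the alerting and metrics items above a little more concrete, here is a minimal Python sketch of an error-rate alert over a sliding window of recent requests. The class name `ErrorRateAlert`, the window size and the 5% threshold are illustrative assumptions rather than values from any particular monitoring tool; in practice a team would more likely express a rule like this in its monitoring system.

```python
from collections import deque


class ErrorRateAlert:
    """Track recent request outcomes and flag a high error rate.

    Hypothetical helper for illustration only.
    """

    def __init__(self, window_size=100, threshold=0.05):
        # Only the most recent `window_size` outcomes are kept;
        # older ones fall off the left side of the deque.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, failed):
        """Record one request; `failed` is True when the request errored."""
        self.outcomes.append(bool(failed))

    def error_rate(self):
        """Return the fraction of failed requests in the current window."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        """Alert only when the error rate exceeds the threshold."""
        return self.error_rate() > self.threshold
```

The metric being monitored (error rate) is kept separate from the alert condition (the threshold), so the threshold can be tuned without changing what gets measured.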
At this point in your career, you identify as an SRE but reflect on how your “training” mostly consisted of identifying what’s broken and then building things to prevent anyone from repeating the same mistake. Tomorrow brings another incident and the cycle repeats.
The incident response and prevention cycle does lead to positive outcomes. The iterative process of identifying system weaknesses translates to addressing issues in order of criticality. Each solution improves architecture, code quality and test coverage. The result is a monotonic increase in uptime. A key goal is to translate knowledge of edge cases and the quirks of a system into automated tests and informative monitoring. The reliability of an application matures in tandem with the engineers building it.
An unintended consequence of this “training process” is that we’ve all forgotten what it’s like to not be obsessed with reliability. The industry is full of incident-tested veterans, each with unique workflows and habits that reinforce constant learning. We’ve acquired knowledge that our teammates depend on. Our blind spot has become how to effectively share all of the knowledge we’ve spent years acquiring.
Luckily, we can rely on future innovations as a remedy.
State of the Field
Site Reliability Engineering has certainly blossomed in popularity, and for good reasons. A few Google searches and you can read about the latest tech company to have downtime and its consequences in lost revenue. Slack lost 5% of revenue in Q2 of 2019 due to service outages, prompting the company to promise investors that this was a “one-time” issue. Zoom experienced a three-hour outage in Q3 of 2020, at a time when the service hosted 300 million daily meeting participants. Spending the time and money to automate incident prevention is a no-brainer when the outcome has financial benefits!
As the field has grown, so has the number of approaches to the software development life cycle within distributed systems. Here’s a non-exhaustive list of processes in the space that ultimately tie back to reliability:
DevOps: A company-wide culture of high velocity application development and infrastructure management by consolidating the responsibilities into one team.
Continuous Integration: The development practice of frequent and small code changes, with each change undergoing a suite of automated tests to ensure program correctness. And because changes are small, developer teams can avoid feature-branch integration frustrations.
Continuous Deployment: The development practice where each code commit is deployed to production automatically. The practice extends to culture, where the teams are vigilant in bug prevention and test creation. Build your own safety net, because every PR commit meets the real world.
GitOps: A version-controlled approach to infrastructure management. Gone are the days of manipulating production environments manually. Instead, configuration and deployment changes endure the scrutiny of code review.
Infrastructure as Code: All infrastructure management and provisioning is maintained by configuration files in a version-controlled repository. Say goodbye to the admin console and say hello to repeatable infrastructure provisioning.
Chaos Engineering: The practice of systematically breaking things to improve team preparedness and system reliability. Introduce “chaos” outside of production so developers can write more resilient code and practice incident response.
Observe how all of the described approaches require company-wide participation. As much as we SREs feel like heroes day-to-day, Site Reliability isn’t a solitary job. Additionally, the industry has shown willingness to invent and adopt new techniques to address the perpetual challenge of keeping an application running smoothly. Adaptability is core to SRE DNA!
Improved Collaboration with (Micro)Service Catalogs
While SREs have proven to be adept at generating processes that make their company resilient, there is plenty of problem space yet unsolved. One topic comes to mind: knowledge transfer.
I recently onboarded at effx, the company with a mission to solve this very issue. The stack is a collection of microservices hosted in Kubernetes and other workloads managed across different cloud providers. The code is spread across different repositories, each with a designated responsibility. How can I quickly learn how the system is supposed to work and integrate myself into the same workflow that my new team has spent years perfecting?
A (micro)service catalog solves all those problems. Its purpose is to aggregate the online resources relevant to a service into a single pane of glass. Every resource a developer needs should be found here: service metadata, API documentation, troubleshooting tips, service architecture, runbooks, team ownership and contact information, and repository links.
With service catalogs, new hires have a place to explore the code, runbooks and team ownership information to learn how the system works and who to contact with questions. The seasoned SREs have a place to share what they’ve learned with the rest of the team. And those tasked with on-call duty can quickly navigate incident response using the troubleshooting tips for each service. The team works together to keep their collective knowledge in sync.
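As a rough illustration of the idea, a catalog entry can be modeled as a small record per service plus a few lookups over those records. The Python sketch below is a hypothetical data model, not effx's actual schema or API: the names `ServiceEntry` and `ServiceCatalog`, every field name, and the example.com URLs are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """One catalog record; all field names are illustrative assumptions."""
    name: str
    owner_team: str
    contact: str
    repository_url: str
    api_docs_url: str = ""
    runbook_urls: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)  # names of upstream services


class ServiceCatalog:
    """In-memory catalog keyed by service name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Registering the same name again replaces the earlier record.
        self._entries[entry.name] = entry

    def lookup(self, name):
        """Return the entry for `name`; raises KeyError if it is unknown."""
        return self._entries[name]

    def owned_by(self, team):
        """Names of the services a team owns, sorted alphabetically."""
        return sorted(e.name for e in self._entries.values() if e.owner_team == team)

    def dependents_of(self, name):
        """Names of the services that list `name` as a dependency."""
        return sorted(e.name for e in self._entries.values() if name in e.depends_on)


# Example usage with made-up services.
catalog = ServiceCatalog()
catalog.register(ServiceEntry(
    name="billing",
    owner_team="payments",
    contact="payments-oncall@example.com",
    repository_url="https://example.com/repos/billing",
    runbook_urls=["https://example.com/runbooks/billing"],
    depends_on=["accounts"],
))
catalog.register(ServiceEntry(
    name="accounts",
    owner_team="identity",
    contact="identity-oncall@example.com",
    repository_url="https://example.com/repos/accounts",
))
```

With records like these, an on-call engineer looking into `accounts` could call `catalog.dependents_of("accounts")` to see which services might be affected, then `catalog.lookup("billing").contact` to find who to reach.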
I grew into an SRE through a passion for building dependable systems that allow for innovation. The beauty of SRE comes from the synergy it brings into any company’s engineering practices. With effective inter-team collaboration, service outages can be avoided without sacrificing development velocity. Looking ahead, these values are why service catalogs excite me. They will help increase new engineers’ onboarding velocity while facilitating company-wide sharing of crucial knowledge that has otherwise been hard to distribute.
If you're curious and want to take effx for a spin, try it here for free.