Platform engineering vs. site reliability engineering (SRE): here’s what you need to know

Amir Kazemi
April 7, 2021

Over time we’ve seen quite a bit of change in the way engineering teams ship and operate their cloud-native applications. One of the most notable changes is the adoption of DevOps principles across an organization. We can think of this as more of a mindset, culture, or philosophy to help engineering teams get to market faster and strengthen their competitive advantage. 

Along with this new philosophy came new tools, processes, and different ways of structuring how cloud-native engineering organizations work together. Department silos are dissolved, and product engineering teams are now required to follow a “you build it, you run it” model where you own and operate everything you build.

These product-focused engineering teams are now also supported by two functional areas called site reliability engineering and platform engineering. 

In practice, the lines between these functions are often blurred, or the two can operate interchangeably (depending on the background and size of the organization). Below is a generalization of how you’d see these two teams get introduced and eventually evolve as an organization’s engineering team scales.

A maturity diagram showing when platform engineering and site reliability engineering are introduced in an organization.


Part of the reason the lines between these teams are blurred is that they typically share a similar and overlapping set of objectives.

  1. Self-service platform and automation. Workflows and best practices that meet the needs of the product engineers so that everything that they use on the platform is automated and ready to go. 

For example:

Old way: Provide me your source code on a USB stick. I’ll put it on the server and make sure it keeps working.

New way: Create a pull request <here> to automatically get a Git repo and a CI/CD pipeline, pre-filled with a Hello World application that deploys all the way to production.

  2. Reducing toil. Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

What is platform engineering? 

Platform engineering teams leverage software engineering principles to speed software delivery. Their goal is to make sure that product engineering teams are as productive as possible, from before a line of code is written all the way through to when an application is deployed into production.

Looking back at the objectives above, reducing toil is what has encouraged platform teams to build workflows that allow their product engineers to build and ship applications faster.

A few examples of toil experienced by engineering teams: 

  • Manually creating Git repositories and granting the correct people access 
  • Manually creating CI/CD pipelines based on another project 
  • Deploying any necessary infrastructure components

Platform teams can eliminate this toil by creating automated workflows that allow product engineers to connect their Git repo with their CI/CD pipelines in order to deploy faster into production.
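To make the idea of an automated workflow more concrete, here is a minimal sketch in Python of how a single self-service request could be expanded into the steps listed above. Everything in it is hypothetical and for illustration only: the `ServiceRequest` shape, the `plan_bootstrap` function, and the step wording are assumptions, not the API of any particular platform tool. It only builds a plan; a real workflow would call your Git host, CI/CD system, and infrastructure tooling at each step.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ServiceRequest:
    """What a product engineer submits (hypothetical shape)."""
    name: str
    owners: list[str]
    needs_database: bool = False


def plan_bootstrap(request: ServiceRequest) -> list[str]:
    """Expand one request into the ordered steps a platform workflow would run."""
    steps = [f"create Git repository '{request.name}' from the Hello World template"]
    # One access grant per owner replaces the manual permissions step.
    steps += [f"grant '{owner}' write access to '{request.name}'" for owner in request.owners]
    steps.append(f"create CI/CD pipeline for '{request.name}' from the shared template")
    # Infrastructure is only provisioned when the request asks for it.
    if request.needs_database:
        steps.append(f"provision database for '{request.name}'")
    steps.append(f"deploy '{request.name}' to production")
    return steps


if __name__ == "__main__":
    request = ServiceRequest("checkout", ["ana", "raj"], needs_database=True)
    for step in plan_bootstrap(request):
        print(step)
```

Keeping the plan as plain data is a deliberate choice in this sketch: the platform team can review, test, and recalibrate the workflow without touching any real systems.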

As the team eventually scales and the number of engineers grows, the needs of the team also grow and evolve. This requires the platform engineering team to recalibrate existing workflows and make sure that the most common workflows are available so that product engineering can stay productive.

If these workflows don’t exist or aren’t managed correctly, the result can be bottlenecks, which means slower shipping to production and a slower time to market.

What is site reliability engineering (SRE)?

While a DevOps culture and philosophy help teams collaborate, ship, and operate software faster, they don’t necessarily design the systems that increase site reliability and performance.

This is where SRE comes into play. Site reliability engineering teams leverage software engineering principles to improve system performance and reliability. The idea and function were pioneered by Google’s Ben Treynor, who described it as “what happens when a software engineer is tasked with what used to be called operations.”

These teams typically also build systems and services that help product engineering teams create workflows to meet uptime and reliability goals (e.g., SLOs). SRE teams may also build in-house tools to address weaknesses in software delivery or incident management within the organization.

Ultimately, coming back to the original set of objectives, SRE teams also need to make sure they consistently keep tabs on toil, reliability, and the overall health of their systems.
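As a rough illustration of what measuring reliability against an SLO can look like, the sketch below computes request-based availability and the share of the error budget that is still unspent. The error budget framing is a common SRE convention rather than something this article defines, and the function names, the request counts, and the 99.9% target are all assumptions chosen for the example.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully; treated as 1.0 when there is no traffic."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic means nothing has been spent
    # The budget is the number of failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 failures, 99.9% SLO target.
    print(availability(999_500, 1_000_000))
    print(error_budget_remaining(999_500, 1_000_000, slo_target=0.999))
```

In this hypothetical window the SLO tolerates about 1,000 failed requests and 500 have occurred, so roughly half of the budget remains. A team might use a number like this to decide whether to keep shipping features or to pause and invest in reliability work.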

Closing thoughts

At the end of the day, both teams are clearly important in helping engineering organizations build cloud-native software as productively and reliably as possible. You could say that, most of the time, product engineering teams ride on the shoulders of both the platform and site reliability engineering teams. But regardless of the size, scale, and complexity of your organization, there will always be a need for an individual or team to focus on reducing toil, shipping faster, and improving reliability.

If you haven’t yet tried effx and want to see how some of the world’s best engineering teams are solving these challenges, take it for a spin here for free today.