Miscellaneous

DevOps vs SRE – Reducing Technical Debt and Increasing Efficiency and Resiliency

One more blog topic stemming from the weekly office hours we hold with the field team here at Shipa. In our last office hours, I was asked, “What is the difference between DevOps Engineers and SREs?” Both are emerging disciplines and cultures that continue to evolve and play an important role in technology organizations.

I’ve been fortunate to have written and spoken about this before, though it’s worth taking a fresh look at what the two domains try to accomplish. Like any technical discipline, roles in your organization might vary depending on how specialized each team is. First, let’s dig into what a DevOps Engineer and a Site Reliability Engineer do.

DevOps Engineers and SREs in a Nutshell

In a nutshell, DevOps Engineers are ops-focused engineers who solve development pipeline problems, while Site Reliability Engineers are development-focused engineers who solve operational, scale, and reliability problems. DevOps Engineers concentrate on engineering efficiency; Site Reliability Engineers concentrate on resiliency. The two roles tackle different problem sets.

Two Different Problem Sets

Engineering efficiency and resiliency are two separate domains, but they overlap. There is a correlation between agility and more robust systems, though a counter-argument can be made that agility brings a fast velocity of change, and change is a detriment to reliability. Today’s challenges are faced at scale, and as we continue to push the boundaries, adjusting on the fly is important. The problems each team solves are telling of the culture and skills needed for both to thrive.

What Problems Do DevOps Teams Solve?

Engineering efficiency paints with a wide brush. If you look at certain DevOps job postings, it can sound like the poster is looking for an entire IT organization in one person. The SDLC (software delivery life cycle) can be a winding road to traverse. DevOps teams strive to remove bottlenecks across the entire SDLC by automating and removing barriers on the path to production. With the adoption of Agile, production changes are created and need to be deployed at a faster velocity, as incremental changes are now the expectation.

The DevOps teams are purveyors of development tools, from providing guidance at the inception of the SDLC with source code management (SCM) recommendations to enabling Continuous Integration and Continuous Delivery in an organization. With a wide gamut of responsibilities, DevOps teams can have ownership and oversight over a number of tools and platforms. SREs, on the other hand, focus on system health.

What Problems Do SREs Solve?

Site Reliability Engineering teams focus on safety, health, uptime, and the ability to remedy unforeseen problems. A romanticized idea is that SREs spring into action only during an incident, devising remedies until the engineering teams can make a proper fix. Combating incidents is certainly an important pillar of the job, but SREs also apply their vast expertise to making sure the firefight never occurs in the first place.

By removing some of the complex burdens in how to scale and maintain uptime in distributed systems, SRE practices allow development teams to focus on feature development instead of the nuances of achieving and maintaining service-level commitments. 

SRE Measurements: SLAs, SLOs, and SLIs

Both DevOps and SRE teams value metrics, as you can’t improve on what you can’t measure. Indicators and measurements of how well a system is performing can be represented by one of the Service Level (SLx) commitments. There is a trio of metrics, SLAs, SLOs, and SLIs, that paint a picture of the agreement made vs the objectives and actuals to meet the agreement. With SLOs and SLIs, you can garner insight into the health of a system. 

SLAs

Service Level Agreements are the commitment/agreement you make with your customers. Your customers might be internal, external, or another system. SLAs are usually crafted around customer expectations or system expectations. SLAs have been around for some time, and most engineers would consider an SLA to be “we need to reply in 2000ms or less,” which in today’s nomenclature would actually be an SLO. An SLA, in that case, would be “we require 99% uptime.”

SLOs 

Service Level Objectives are the goals that need to be met in order to meet SLAs. Tom Wilkie’s RED method can help you come up with good metrics for SLOs: rate (requests), errors, and duration. In the example of “we need to reply in 2000ms or less 99% of the time,” that falls under duration, the amount of time it takes to complete a request in your system. Google’s Four Golden Signals are also great metrics to base SLOs on, and additionally include saturation. Measuring SLOs is the purpose of SLIs.
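The flip side of an SLO target is an error budget: the slice of requests allowed to miss the objective before the SLO is broken. As a minimal sketch of that arithmetic (the `error_budget` helper is illustrative, not a standard API):

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Requests allowed to miss the objective before the SLO is breached."""
    return int(total_requests * (1 - slo_target))

# With a 99% objective over one million requests,
# up to 1% (10,000 requests) may miss the target.
print(error_budget(0.99, 1_000_000))  # 10000
```

Teams commonly spend this budget deliberately, e.g., shipping riskier changes while budget remains and freezing releases once it is exhausted.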

SLIs 

Service Level Indicators measure compliance with an SLO. Returning to the “we need to reply in 2000ms or less 99% of the time” SLO from above, the SLI is the actual measurement. Maybe only 98% of requests reply in less than 2000ms, which falls short of the SLO’s goal. If SLOs/SLIs are being broken, time should be spent remedying the issues behind the slowdowns.
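A minimal sketch of how that SLI might be computed, assuming request latencies are collected in milliseconds (the function name and sample data are illustrative):

```python
def latency_sli(latencies_ms, threshold_ms=2000):
    """SLI: fraction of requests answered within the latency threshold."""
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)

# 98 of 100 requests reply within 2000ms: an SLI of 0.98,
# which misses the 99% objective from the SLO above.
latencies = [150] * 98 + [2500, 3100]
sli = latency_sli(latencies)
print(sli, sli >= 0.99)  # 0.98 False
```

In practice the raw latencies would come from your monitoring stack rather than an in-memory list, but the comparison against the objective is the same.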

On the other hand, DevOps teams focus on efficiency through the development pipeline. 

DevOps Measurements

You can have the most resilient and robust system in the world, but if your customers are not completing their journeys, adoption and success will be hard to attain. In Accelerate, a book by Nicole Forsgren, Jez Humble, and Gene Kim, the authors dig into the organizational science of high-performing technology teams.

The authors recommend measuring software delivery performance with four key metrics: Lead Time, Deployment Frequency, Mean Time to Restore (MTTR), and Change Failure Percentage.

Lead Time

In lean manufacturing, the lead time is the amount of time it takes from a customer request to the fulfillment of that request. In the technology domain, this can be the time from when code is checked in to when the code is deployed into production. 
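Assuming you can pair each commit timestamp with its production deploy timestamp, the calculation is a simple difference; this sketch uses hypothetical timestamps:

```python
from datetime import datetime

def lead_time_hours(commit_time: datetime, deploy_time: datetime) -> float:
    """Hours from code check-in to the code running in production."""
    return (deploy_time - commit_time).total_seconds() / 3600

# A commit checked in Tuesday morning that ships Wednesday afternoon.
checked_in = datetime(2021, 6, 1, 9, 0)
deployed = datetime(2021, 6, 2, 15, 0)
print(lead_time_hours(checked_in, deployed))  # 30.0
```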

Deployment Frequency

The number of deployments to production that occur in a given amount of time. Are you deploying to production every day, week, month, or year? The more frequently your internal customers can deploy, the more efficient the software delivery process is.
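Given a log of deploy dates, the frequency over a window is a count divided by the window length; a sketch with hypothetical dates:

```python
from datetime import date

def deploys_per_week(deploy_dates, window_start: date, window_end: date) -> float:
    """Average number of production deployments per week in a window."""
    weeks = (window_end - window_start).days / 7
    in_window = sum(1 for d in deploy_dates if window_start <= d <= window_end)
    return in_window / weeks

# Six deployments over a two-week window: three per week.
deploys = [date(2021, 6, day) for day in (1, 2, 4, 8, 10, 14)]
print(deploys_per_week(deploys, date(2021, 6, 1), date(2021, 6, 15)))  # 3.0
```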

Mean Time to Restore

Taking another page from lean manufacturing, MTTR is an incident metric that calculates the average time to restore a system. In the software sense, restoring often means rolling back to the last known good version of the application. Mean Time to Repair starts when the repair begins, e.g., the start of the rollback; the “restore” in Mean Time to Restore ends when the system is back to its previous functionality.
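Assuming incidents are recorded as (repair started, service restored) timestamp pairs, MTTR is the mean of the differences; a sketch with two hypothetical incidents:

```python
from datetime import datetime

def mean_time_to_restore_minutes(incidents) -> float:
    """incidents: (repair_started, service_restored) datetime pairs."""
    minutes = [(restored - started).total_seconds() / 60
               for started, restored in incidents]
    return sum(minutes) / len(minutes)

# Two incidents: one restored in 30 minutes, one in 90.
incidents = [
    (datetime(2021, 6, 1, 10, 0), datetime(2021, 6, 1, 10, 30)),
    (datetime(2021, 6, 3, 2, 0), datetime(2021, 6, 3, 3, 30)),
]
print(mean_time_to_restore_minutes(incidents))  # 60.0
```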

Change Failure Percentage

This represents the percentage of changes to production that fail. Even after navigating all of the confidence-building exercises leading up to production, given the number of unknowns in production, some changes will fail. Lowering the change failure percentage allows for more confidence in production.
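The calculation itself is just failed changes over total changes; a sketch, assuming each deployment's outcome is recorded as a boolean:

```python
def change_failure_percentage(outcomes) -> float:
    """outcomes: booleans, True where a production change failed."""
    return 100 * sum(outcomes) / len(outcomes)

# One failed change out of four deployments.
print(change_failure_percentage([True, False, False, False]))  # 25.0
```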

With both sets of measurements, DevOps and SRE, in mind, it is also interesting to look at how each team would approach common concerns.

DevOps and SRE Concern Table

| Concern | DevOps | SRE |
| --- | --- | --- |
| What do you say you do around here? | Development pipeline | Resilience, scaling, uptime, robustness |
| tl;dr | Systems engineers focusing on development problems. | Software engineers focusing on operational problems. |
| Does the application cluster? | Yes, the application does. We need five nodes. | We use a RAFT-based, leader-elected clustering mechanism built on Apache ZooKeeper. We front the application with Apache Mesos to work through Dominant Resource Fairness constraints. |
| Can we have monitoring? | Yes, we use Prometheus and Fluentd and can provide hooks into each. | Concerned with the science of how the monitoring tool works: black-box vs. white-box monitoring and the specific metrics of each. Advising teams on pros/cons. |
| Our deployment failed. | The pipelines we created allow you to re-run. If additional debugging is needed, we can connect the dots with log/trace systems. | Unless it caused an outage, we wouldn’t get involved in the remedy. If the deployment regularly fails, we can help decipher why. |
| Typical metrics | Deployment frequency, deployment failure rate | Error budgets, SLOs, SLIs, the Four Golden Signals |
| War chant | “People, process, technology!” | “There is no root cause!” |

Even though these approaches are different, both are needed to work together to reduce technical debt and produce a more robust system. 

DevOps, SRE, and Shipa – Better Together

Both SRE and DevOps engineers can be viewed as leveraged resources; clearly, there is not a 1:1 ratio of Software Engineers to DevOps Engineers or Site Reliability Engineers, though it can feel that way as organizations try to scale. O’Reilly’s Building Secure and Reliable Systems, when compared to the first rendition of Google’s SRE Book, discusses team structure, positioning SREs as advisors/experts.

Building software at scale requires specialized engineers to help tackle problems and further capabilities. DevOps Engineers, SREs, and other engineers such as Application Security Engineers fall into the category of specialized advisors. Google, in its SRE Book, described all the expertise across multiple domains needed to launch and maintain a product like Gmail, which surprised even me.

No matter what engineering team you belong to, Shipa can help get ideas out to production more easily by reducing toil, especially the Kubernetes-focused toil developers face. Your internal customers will be delighted with Shipa.

Feel free to sign up for a free Shipa Cloud Account today and take Shipa for a spin. 

Cheers,

-Ravi