Top 3 Benefits of Site Reliability Engineering

October 12, 2020 Ray Stoner

The advent of cloud computing has increased the need for speed, as organizations look to take advantage of platforms such as cloud-native, serverless, and hybrid computing.

According to Deloitte's Tech Trends 2020 Insights report, digital experience, analytics, and cloud are enabling technologies that have proven their value over the last decade. They are the basis of numerous successful corporate strategies and new business models.

Today, a key DevOps concern for organizations is how they deliver services at scale in the cloud. As businesses work to achieve and maintain a high level of customer satisfaction while meeting evolving business needs, they are continually modernizing existing applications and deploying new applications and services.

This continuous cycle of modernizing existing systems challenges how Information Technology organizations approach service management.

Google recognized this challenge and, in 2016, published the book Site Reliability Engineering.

What is a Site Reliability Engineer?

A Site Reliability Engineer (SRE) is an engineer with a high tolerance for risk, a deep skill set in development and tooling, and a profound understanding of how the system is built end to end, in both infrastructure and application. An SRE has a shift-left mindset to ensure reliability throughout the software development lifecycle.

Commenting on Site Reliability Engineering, Ben Treynor, Google’s VP of Engineering, said, “Fundamentally, it's what happens when you ask a software engineer to design an operations function.”

Core Principles

The SRE maintains a passionate focus on a system by ensuring:

Availability – The proportion of time a system is readily servicing requests. This depends on the resiliency, redundancy, and maintainability of the system, and is supported by service level agreements (SLAs) and service level objectives (SLOs).
Observability – Metrics, logs, and traces that give the ability to solve issues of unknown origin.
Automation – The ability of the system to operate itself with limited human intervention.
Toil Elimination – Removing the manual, repetitive, automatable tasks required to keep the system running.
Incident management toolchain and process – The SRE is involved as a first responder to incidents and leverages incident data to improve system reliability.
Deployments / Release Engineering – Delivering software products and services in a consistent, agile manner.
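
To make the availability principle concrete, here is a minimal Python sketch of how availability might be measured against a service level objective. The 30-day window, the 30 minutes of downtime, and the 99.9% target are hypothetical figures chosen for illustration, and the function names are ours, not part of any standard tool. The downtime an objective tolerates is often called an error budget.

```python
from datetime import timedelta


def availability(total: timedelta, downtime: timedelta) -> float:
    """Return the fraction of the period during which the system served requests."""
    if total <= timedelta(0):
        raise ValueError("total must be a positive duration")
    if downtime < timedelta(0) or downtime > total:
        raise ValueError("downtime must be between zero and total")
    return (total - downtime) / total


def allowed_downtime(total: timedelta, slo_target: float) -> timedelta:
    """Return the downtime an SLO tolerates over the period (the error budget)."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return total * (1.0 - slo_target)


# Hypothetical figures for illustration only.
period = timedelta(days=30)
observed_downtime = timedelta(minutes=30)
slo = 0.999

measured = availability(period, observed_downtime)
budget = allowed_downtime(period, slo)
remaining = budget - observed_downtime

print(f"Measured availability: {measured:.5f}")
print(f"Error budget: {budget}, remaining: {remaining}")
```

A negative remaining budget would signal that the objective was missed for the period, which is one way a team might decide to slow down releases and prioritize reliability work.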

Top Benefits of Site Reliability Engineering

There are three top advantages of an SRE:

1. Addresses the gap or wall between operations and development teams: Many organizations have built excellent DevOps practices. However, the reality is that these organizations are typically doing more Dev and not much Ops. Applications are still being built or updated and then “thrown” over the wall for the Operations team to deal with.

Developers want to deliver the latest features and functions quickly to remain competitive, while operations teams are focused on stable systems. This mode of operation causes contention, based on the premise that change brings instability. The SRE role addresses this contention by combining skills in coding, observability, configuration management, and capacity planning with excellent troubleshooting capabilities.

2. Improves visibility and reporting: The SRE uses key performance indicators to provide transparency into service health and performance, the impact of downtime, the blast radius of changes, and the automation of mundane tasks.

3. Stimulates culture change: The SRE has a team-player mindset and organically fosters a culture of collaboration, so the impact of the SRE goes beyond tools and technology. The SRE helps DevOps teams proactively build reliability into their services without disrupting the continuous integration/continuous delivery (CI/CD) pipeline. A CloudBees/Hurwitz and Associates survey found that over 50% of respondents were using continuous integration company-wide, while the other half were utilizing continuous delivery processes.
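
As an illustration of the second benefit above, visibility and reporting, the following Python sketch shows one way incident records might be turned into simple indicators. The `Incident` record, the service names, the timestamps, and the choice of indicators (total downtime per service and mean time to resolve) are all assumptions made for this example rather than a prescribed set of KPIs.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record: which service was affected, and when."""
    service: str
    started: datetime
    resolved: datetime


def downtime_by_service(incidents):
    """Sum incident durations per service.

    Overlapping incidents on the same service are counted twice in this sketch.
    """
    totals = defaultdict(timedelta)
    for incident in incidents:
        totals[incident.service] += incident.resolved - incident.started
    return dict(totals)


def mean_time_to_resolve(incidents):
    """Return the average incident duration, or None if there are no incidents."""
    durations = [i.resolved - i.started for i in incidents]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident data for illustration only.
incidents = [
    Incident("checkout", datetime(2020, 10, 1, 10, 0), datetime(2020, 10, 1, 10, 20)),
    Incident("checkout", datetime(2020, 10, 3, 14, 0), datetime(2020, 10, 3, 14, 10)),
    Incident("search", datetime(2020, 10, 5, 9, 0), datetime(2020, 10, 5, 9, 45)),
]

totals = downtime_by_service(incidents)
mttr = mean_time_to_resolve(incidents)

for service, total in sorted(totals.items()):
    print(f"{service}: {total} of downtime")
print(f"Mean time to resolve: {mttr}")
```

In practice these figures would typically come from an incident management tool rather than a hand-written list, and a team would choose the indicators that matter for its own services.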

In short, an SRE practice is characterized by:

a willingness to take on risk to improve a system
blameless incident management and postmortems
automation; and
sharing of data and tools with others in the organization

As organizations buy into the SRE process, it becomes ingrained in the organizational culture – making reliability a core principle for all business and engineering operations.

Conclusion

Organizations of any size and complexity can benefit from SRE practices by improving processes and procedures, encouraging risk-taking, and, most importantly, providing a richer, more reliable customer experience. According to a ZDNet finding, “50% of the principles are good advice though organizations will need to tweak them for your enterprise. This includes balancing tickets between operations and development, writing your own application programming interfaces (APIs) to automate processes, and bringing down production systems to test resiliency.”

Transforming to an SRE practice is an investment in improvement and should not be looked at as a cost-saving measure. As the practice evolves and improvements are made, the organization will eventually realize cost savings.

ABOUT STONE DOOR GROUP

Stone Door Group® modernizes the digital enterprise through skilled DevOps and Hybrid Cloud professional services. We make it easy to quickly access and deploy DevOps solutions to transform your business and provide certified consultants to deliver your projects.

ABOUT THE AUTHOR

Ray Stoner is a Consultant focused on Observability, Service Management and Site Reliability Engineering at Stone Door Group, a Cloud and DevOps consulting company that delivers successful digital transformation projects in the private and public sectors. To speak with Ray and our team, send us an email at letsdothis@stonedoorgroup.com.