What is SRE (Site Reliability Engineering)?

Site Reliability Engineering, SRE for short, is one of the 'hottest' acronyms in the development world.

The aim of the SRE approach is to increase the reliability of systems: it is a set of principles, practices and organisational constructs that makes it possible both to keep existing systems running and to evolve them. This second aspect is fundamental, because the objective of SRE is not only to keep the promises made about how systems are run, but to do so while services are constantly and incrementally improved with new functionality.

Let us start with an important clarification: the word 'Site' in the name should not lead us to think that this approach applies only to building and running websites. The same principles are suited to improving the performance of any software system (websites, web or mobile applications, and so on). Any system administrator can therefore aspire to bring the SRE approach into his or her team.

SRE and Google: from birth to today

SRE practices originated and developed within Google, starting around 2003. More recently, Google has decided to make public the approach that has enabled the company to build, monitor, improve and maintain some of the most widely used online services in the world.

To understand the SRE philosophy even before the more practical aspects, we can quote Ben Treynor Sloss, the man who coined the term SRE and who is now vice-president of engineering at Google.

"SRE is what happens when you ask a software engineer to design the Operations function," the manager said in an interview.

The SRE team or manager thus performs a job that has historically been done by the Operations team, but does so by adding the mindset and skills of software engineering: the cornerstone is the (typically engineering) ability to replace human work with automation.

With these premises in mind, it is easy to see the value of the key principles Google has defined for SRE practice, namely:

  • 'Neutral' risk management: do not pretend that errors will never occur in the life of an application workload; rather, accept this fact and be prepared.
  • Recognise that those who use the system and those who keep it operational have different service objectives at heart; these must therefore be carefully defined and jointly evaluated from the outset, so that they converge and the value delivered is correctly perceived by every stakeholder.
  • Minimise non-value-creating and repetitive activities.
  • Set up monitoring to keep the situation under control even in distributed contexts.
  • Design releases that do not jeopardise the operability of the platform or system.
  • Keep the overall complexity of the system low, allowing it to increase only gradually over time.

Alongside these principles, Google also defines a set of practices that implement them and keep the system up and running. An SRE team must organise itself to respect what Google considers the hierarchy of a reliable service.

  1. At the base of the hierarchical pyramid we find monitoring and the ability to identify a problem before users notice it.
  2. Immediately above we have the team's ability to respond to a problem with a root cause analysis and easily testable and applicable corrective fixes. 
  3. Finally, at the top of the pyramid is the focus on design and the computational resources that a reliable product requires.

Reliability as a basic principle of good architecture

Reliability, or dependability, is the ability of an application to operate correctly and consistently when expected. According to AWS, reliability is one of the pillars of the Well-Architected Framework, a collection of best practices and checklists to keep in mind when designing any Cloud application.

AWS lays down a number of guidelines that can increase the reliability of a workload, among them: 

  • the ability to recover a workload from an abnormal situation automatically, thanks to appropriately tested monitoring tools and recovery procedures (see the sketch after this list);
  • increasing horizontal scalability, to reduce the impact of a failure on any individual node or instance;
  • finally, a more diligent analysis and tuning of the computational capacity allocated in the Cloud, with the aim of avoiding saturation of the available resources and keeping costs under control. By observing and adapting the sizing of resources, it is possible to avoid over-provisioning, i.e. paying for oversized resources that sit idle most of the time.
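To make the first guideline more concrete, here is a minimal sketch in Python of what "recovering a workload automatically" can look like: a health check that, after repeated failures, triggers a tested recovery procedure. The endpoint URL and the restart command are hypothetical placeholders; in real environments this logic is usually delegated to the platform itself (load balancer health checks, auto scaling groups, Kubernetes liveness probes) rather than to a custom script.

```python
# Illustrative sketch only: a minimal "detect and recover" loop.
# The endpoint URL and the recovery command are hypothetical placeholders.
import subprocess
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://my-service.internal/health"  # hypothetical endpoint
CHECK_INTERVAL_S = 30
FAILURES_BEFORE_RECOVERY = 3


def is_healthy(url: str) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def recover() -> None:
    """Run the (hypothetical) tested recovery procedure."""
    subprocess.run(["systemctl", "restart", "my-service"], check=False)


consecutive_failures = 0
while True:
    if is_healthy(HEALTH_URL):
        consecutive_failures = 0
    else:
        consecutive_failures += 1
        if consecutive_failures >= FAILURES_BEFORE_RECOVERY:
            recover()
            consecutive_failures = 0
    time.sleep(CHECK_INTERVAL_S)
```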

SRE and DevOps: similar but not equal approaches

Having reached this point, we cannot avoid mentioning another approach that favours collaboration between Operations and Development teams: we are talking, of course, about DevOps. There is an undeniable closeness between SRE and DevOps, even though the two practices developed independently. Both aim to bridge the gap between the two teams, with the objective of improving the release life cycle and product quality.

Although the two approaches have similar goals, they are not mutually exclusive: on the contrary, we can see SRE as a concretisation of the DevOps approach. SRE embraces the DevOps philosophy, but focuses its attention on the development and consolidation of practices to measure and achieve reliability. 

In other words, SRE sets operational rules for success in the various DevOps areas. If DevOps focuses on the 'What', SRE is decidedly unbalanced towards the 'How'.

Tools to support SRE

SRE is a technology-intensive practice, which is why a reference technology stack for SRE engineers has been defined over time.

As mentioned above, SRE is very close to DevOps, so it is possible to liken the tools used by this type of professional to those of a DevOps Engineer. 

Specifically, the following are needed:

  • tools for planning the work (Jira, Azure Boards);
  • tools for writing the software (Eclipse, Visual Studio Code);
  • tools for building and packaging artefacts (Jenkins, Azure Pipelines);
  • tools for configuring the installed software and infrastructure (Terraform, Ansible);
  • an installation environment, preferably containerised (OpenShift or Azure Kubernetes Service);
  • finally, a monitoring suite capable of collecting, aggregating and presenting metrics and KPIs from the application (Grafana, Azure Monitor), which is essential for every SRE team.

What does an SRE team or engineer do?

The SRE manager (or SRE team) is in charge of a very wide range of functions, including system availability, latency, performance, efficiency, monitoring, change management, emergency response and capacity planning. 

To understand how crucial this figure is in some projects, one only has to read the first sentence of Benjamin Treynor Sloss' LinkedIn profile: 'If Google stops working, it's my fault'. Ironic, but not without a kernel of truth.

The importance of the SRE role lies not so much in the activities and skills (which are not unique, and could also be covered by other teams) as in the modus operandi adopted. The SRE team works with data, which it learns to collect, read and exploit, and with process automation, which allows it to increase control and standardise its code. This approach reduces the overload of repetitive tasks (so-called 'toil') and the errors that come with them.
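As a small illustration of what turning toil into automation can mean, the sketch below replaces a manual, repetitive check (skimming a log file for errors before an on-call handover) with a script that can be scheduled; the log path and the error marker are hypothetical examples, not part of any specific SRE toolchain.

```python
# A minimal sketch of "toil" reduction: a scheduled summary of error lines
# replaces a manual log review. Path and marker are hypothetical examples.
from collections import Counter
from pathlib import Path

LOG_FILE = Path("/var/log/my-service/app.log")  # hypothetical path
ERROR_MARKER = "ERROR"


def summarize_errors(log_file: Path, marker: str) -> Counter:
    """Count error lines grouped by message, so the engineer reviews a summary instead of raw logs."""
    counts: Counter = Counter()
    for line in log_file.read_text(errors="replace").splitlines():
        if marker in line:
            # Use the message after the marker as the grouping key.
            counts[line.split(marker, 1)[1].strip()] += 1
    return counts


if __name__ == "__main__":
    for message, count in summarize_errors(LOG_FILE, ERROR_MARKER).most_common(10):
        print(f"{count:5d}  {message}")
```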

If you are interested in frameworks that exploit a data-driven approach and team collaboration, DevSecOps might be of interest to you.

The Error Budget: what it is and why it is important

The SRE team is also in charge of calculating and managing the error budget. This is a fundamental strategic concept, which we could define as the tool used by SRE to balance service reliability with innovation. 

The assumption of this approach is that systems are dynamic objects that develop and change over time, evolving positively but also bringing a downside: changes are a major source of instability. The error budget provides a control mechanism to shift the focus from innovation to stability when necessary. 

The SRE team defines an error budget based on a set of metrics (we will look at these in a moment) and a time frame in which to measure it. If incidents consume a significant share of the available error budget, the SRE team's priority immediately shifts to helping solve the problems, bringing the development and operations teams together to collaborate.

Thanks to the error budget, therefore, it is possible to plan and schedule change without sacrificing too much availability.
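To give a sense of the mechanism, here is a minimal numeric sketch, assuming a 99.9% availability SLO measured over a 30-day window (both values are purely illustrative):

```python
# A minimal sketch of how an error budget follows from an SLO.
# The 99.9% target and the 30-day window are illustrative, not a recommendation.
SLO_TARGET = 0.999             # e.g. 99.9% availability
WINDOW_MINUTES = 30 * 24 * 60  # 30-day measurement window, in minutes

error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Allowed unavailability over the window: {error_budget_minutes:.1f} minutes")
# -> roughly 43.2 minutes of "acceptable" downtime per 30 days

# During the window, the team tracks how much of the budget has been consumed.
downtime_minutes_so_far = 25   # hypothetical measured downtime
budget_remaining = error_budget_minutes - downtime_minutes_so_far
print(f"Error budget remaining: {budget_remaining:.1f} minutes")
# When the remaining budget approaches zero, the focus shifts from releasing
# new features to restoring stability, as described above.
```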

Service Levels: SLI, SLO and SLA

In order to align all stakeholders on the objectives of service reliability and availability, SRE introduces three key concepts: SLI, SLO and SLA.

  • SLI, Service Level Indicators. These are the metrics used to measure the performance or general behaviour of the application. In other words, we could call them the KPIs to be taken into consideration. These are often particularly relevant KPIs for users, such as response time or error rate.
  • SLOs, Service Level Objectives. These are the objectives to be achieved for each metric, i.e. for each SLI. The objectives set must wisely balance quality, to be assessed according to business needs, and the cost of achieving it.
  • SLA, Service Level Agreement. It includes the legal aspects that come into play if the system does not achieve its SLOs. 

It should be noted that one of the basic elements of the SRE philosophy is that errors are expected and accepted by all involved. SLIs, SLOs and SLAs serve to set a goal and make it achievable: not by avoiding errors altogether, but by analysing them afterwards in order to improve the entire process (and without ever pointing fingers). 
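As a minimal sketch of how these three concepts relate, assume an availability SLI defined as the ratio of successful requests to total requests, and a hypothetical 99.5% SLO (the numbers below are invented for illustration):

```python
# Hypothetical request counts taken from a monitoring system over the window.
good_requests = 998_734     # requests served correctly and within the latency target
total_requests = 1_002_100

sli_availability = good_requests / total_requests   # the SLI: what is actually measured
slo_target = 0.995                                  # the SLO: the objective agreed with the business

print(f"SLI: {sli_availability:.4%}  vs  SLO: {slo_target:.2%}")
if sli_availability < slo_target:
    print("SLO missed: the error budget is being burned faster than planned.")
else:
    print("SLO met: there is still room in the error budget for new releases.")
# The SLA sits one level above: it defines the contractual consequences
# if the SLO is repeatedly missed.
```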

Site Reliability Engineering: the benefits

Investing time in developing SRE practices can bring significant returns. The advantages are easy to see:

  • Higher quality
  • Constant product evolution
  • Reduction of errors and malfunctions

Let us try to break down these benefits in more detail and see what the positive consequences are for business and work teams.

CONSTANT MONITORING OF THE PROJECT

The complexity of some projects is such that they require a high-level view, clear and focused on the truly relevant aspects. This is precisely what the advanced monitoring system required by SRE provides: carefully selected metrics allow key parameters to be measured without ever losing sight of the project as a whole.

What you get is a concise and timely view of what is happening throughout the project, with valuable information for other business areas as well: think of marketing, sales and support, but also of the main company stakeholders, who need to be kept aligned on the status of the work and on performance.

TIMELY ERROR RESOLUTION

SRE practices have the great advantage of favouring the proactive detection and resolution of software bugs and vulnerabilities. In the absence of a monitoring and automation system such as that fostered by SRE, it often happens that errors enter production causing delays, malfunctions and downtime of services. 

The consequences for the business and for turnover can be severe; that is why selecting the most relevant KPIs and setting an achievable, well-considered target for each SLI is so important.

CLARIFY AND MEET CUSTOMER EXPECTATIONS

Another great advantage of using SLAs, SLOs and SLIs is the ability to focus on the end-user's expectations at an early stage and to draw up a plan to meet them. 

Having clearly defined targets and service thresholds makes it possible to judge the status of the work at any time and to proactively align actions against predefined KPIs, all while keeping the end user and their expectations in mind (with all the benefits this brings in terms of the digital services offered, and therefore the turnover generated).

INCREASED FOCUS ON VALUE CREATION

A more efficient system, with fewer problems and where repetitive tasks are automated, is a system that leaves more free time for the teams working on it. And that time should be invested productively, for example by creating new functionalities or improving existing ones. The Operations team, on the other hand, has the opportunity to work more on improving the configuration and creating tests that can identify possible defects in the system. 

The greater availability of time and less stress due to mistakes and delays also leads to greater collaboration between teams, together with the ability to discuss priorities and objectives more responsibly and creatively. 

CONTINUOUS CULTURAL IMPROVEMENT

There is a final and very important benefit of SRE: the creation of a culture of collaboration between people and between teams, where decisions are made considering not only one's own work, but above all the consequences for users and colleagues.

SRE fosters an open and trusting mentality among the teams, which results in a better business climate on the one hand, and a high quality output on the other.