Site Reliability Engineering (SRE): 6 best practices you should use

Written by
SparkFabrik Team

The web is taking on ever more important and demanding tasks, which brings to the fore a theme that is now more central than ever to web application development: reliability.

To meet this increasingly urgent need, Site Reliability Engineering (SRE) has become central. It is a set of principles and practices introduced in 2004 at Google that has since been adopted by countless companies (such as Netflix and Amazon) that need to make large websites and applications efficient, scalable and reliable.

Site Reliability Engineering: the advantages

Typically, the implementation of SRE has the following objectives:

Reducing or eliminating repetitive and inefficient system maintenance work;
Developing scalable solutions for complex problems;
Making room for innovation in a stable and mature technological context.

The role of the Site Reliability Engineer fits closely with DevOps practices: the engineer is tasked with developing automated solutions that reduce not only the risks associated with manual processes, but also release times and costs. This approach lowers the barriers between development and operations teams and makes software production and maintenance as fast and as secure as possible.

6 SRE Best Practices

Below, we will take a look at the main best practices that allow you to effectively adopt and apply this approach.

1. Don’t reason in watertight compartments

Every action you take has an impact on the rest of the team: the right approach is to consider the consequences for others before acting. Always keep the big picture in mind and do not act solely under the pressure of a momentary need.

For example, developers in an SRE team need to decide how to build a new feature, choosing between a container-based service stack and the serverless solutions offered by their cloud vendor. Before making a decision with strong implications for the development team, they should discuss the options with the people who will manage the system at runtime, weighing the pros and cons of each solution and the impact the chosen one will have downstream of development.

2. Use automation to eliminate repetitive and time-consuming tasks

Imagine that your team is working on a project that requires very frequent backups. Of course, it’s a good idea to test every backup, but managing this operation manually would be extremely inefficient. To avoid this, it would be good practice to invest time in developing an automation solution that tests the backups, reducing the team’s human workload.
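As a rough illustration of what such an automated backup test could look like, the sketch below restores an archive into a scratch directory and compares file checksums with the source. The archive format (a gzip-compressed tarball) and the names make_backup, file_digests and verify_backup are assumptions made for this example, not a prescribed tool; a real setup would restore from wherever the team's backups actually live.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def make_backup(source_dir: Path, archive: Path) -> None:
    """Write source_dir into a gzip-compressed tar archive (hypothetical backup step)."""
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores paths relative to source_dir.
        tar.add(source_dir, arcname=".")


def file_digests(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_backup(source_dir: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it with the source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source_dir)
```

A scheduled job could call verify_backup after every backup run and raise an alert when it returns False, so nobody has to restore archives by hand. The comparison assumes the source has not changed since the backup was taken; on a live system the check would compare against digests recorded at backup time instead.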

3. Looking back to see ahead: recognize and correct what went wrong with a backward-looking analysis

Part of a site reliability engineer's job is to analyze what went wrong in a project or in a specific incident. When something goes wrong, it is essential to carry out an analysis and understand the dynamics that led to the problem. The key is to focus on the “what” and not on the “who”: the goal is always good collaboration within the team, and pointing the finger rarely leads to good results.

Moreover, to help the developers, it is also important to use your time effectively and understand when you’ve hit a dead end. If the road you’re on is not leading to the desired results or, indeed, is hindering the team, it is a good idea to change direction and focus on something else. The same backward-looking analysis can become an excellent tool to better understand the functioning of the system, ensuring better support in the future.

4. Believe in the solutions you propose and learn to get management to buy in

As a site reliability engineer, you have one very important job: to ensure that your systems are reliable. Don't be afraid to ask management for resources and tools, when you believe they are justified, that may cost money today but will prove useful to the team in the long run.

Prepare a document to present and justify your request, showing how investing in a tool you believe in will not merely be a cost amortized over time, but will bring significant benefits to the company.

5. Carry out analyses and measurements, putting yourself in the user’s shoes

Just like the audience at a play, the end user does not see the processes that happen behind the scenes, only the experience that is presented to them.

So try to put yourself in their shoes and live their experience: this will help you spot any errors and weaknesses at the application level, not from the server side, but from the point of view of those who will actually use the service. It is equally important to keep this perspective in mind when handling errors, so that your changes improve not only the server-side functionality but the overall end user experience.
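The difference between the two points of view can be made concrete with a small sketch. It assumes requests are measured at the client, and the ObservedRequest type, the two functions and the one-second latency threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class ObservedRequest:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time measured at the client, not on the server


def server_side_success(requests: list[ObservedRequest]) -> float:
    """Share of requests that did not fail with a 5xx server error."""
    if not requests:
        raise ValueError("requests must not be empty")
    return sum(r.status < 500 for r in requests) / len(requests)


def user_side_success(
    requests: list[ObservedRequest], max_latency_ms: float = 1000.0
) -> float:
    """Share of requests a user would call good: no error and fast enough."""
    if not requests:
        raise ValueError("requests must not be empty")
    good = sum(
        r.status < 400 and r.latency_ms <= max_latency_ms for r in requests
    )
    return good / len(requests)
```

A request that returns 200 after several seconds, or a 404 on a broken link, counts as a success on the server side but not for the user, so the second number is usually the lower and the more honest one.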

6. Use every opportunity to increase the observability of your system’s critical elements

Every incident and every new development must become an opportunity to ask the right questions about the reliability of your system: what are the key objectives and operational requirements? Which indicators help me keep the focus on these goals? Which ones, on the other hand, create “noise” and reduce the efficiency of technical support activities? Should I collect more data, visualize it more effectively, process it and cross-reference it with other data?

Producing and processing relevant data is yet another activity that needs to be closely shared with the development team. It often intersects with other best practices such as overcoming watertight compartments, enhancing automation, and carrying out and documenting a backward-looking analysis.
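One common way to tie objectives to indicators is an availability indicator paired with an error budget. The sketch below is a minimal example under assumed definitions: the function names and the 99.9% target in the usage note are illustrative, not figures from any specific system.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Service level indicator: share of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(
    good_events: int, total_events: int, slo_target: float
) -> float:
    """Fraction of the error budget still unspent.

    A negative value means more failures occurred than the objective allows.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure_rate = 1.0 - slo_target
    observed_failure_rate = 1.0 - availability_sli(good_events, total_events)
    return 1.0 - observed_failure_rate / allowed_failure_rate
```

With a hypothetical 99.9% objective, 999,400 good requests out of 1,000,000 leave roughly 40% of the budget unspent. A single number like this is easier to alert on and to discuss with the development team than a wall of raw metrics.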

Conclusions: the requirements of the SRE approach

Becoming a Site Reliability Engineer requires a broad set of specific skills covering both software engineering and development and typical IT systems operations, such as load balancing or backups. For the SRE model to work at its best within a company, however, individual skills must be accompanied by trust and collaboration between the production and development teams.

To adopt the SRE culture effectively, you therefore need to prepare your team and trust the method, always keeping its best practices in mind. Only then will the SRE approach help you build and maintain efficient, reliable applications.