There are many phases in the software development cycle, from requirements to development and testing, but at the tail of the process is an often overlooked aspect: deployment and delivery. With the paradigm shift from delivering on-site software to offering software-as-a-service, Site Reliability Engineering is beginning to take a greater role in product delivery.
This session aims to give a glimpse of the work that goes into site reliability engineering (SRE) and the effort that goes into keeping a service going 24/7.
4. Goals of this Session
• Introduction to SRE
• What goes into SRE work
• What goes into keeping a service up 24/7
#ISSLearningFest
5. On Site Reliability Engineering
• Primary shift in how products are delivered to
customers, driven by a boom of as-a-service, cloud-native
offerings
• This shift triggers a change in how products are
built and what new roles are required.
#ISSLearningFest
6. So what exactly is SRE ?
There are a lot of different explanations and definitions,
and it's really hard to pin down exactly what SRE is
Applying software engineering principles to
infrastructure and operations to create
reliable systems
What software engineering principles? SLOs, reducing toil,
release engineering …
#ISSLearningFest
7. On Site Reliability Engineering
As a concept, SRE deals with the engineering approach to
several non-functional requirements: availability,
scalability, elasticity, capacity planning, and monitoring,
among others.
Practices differ widely: SRE is a very opinionated
approach. Different organizations would
prioritize differently.
SRE is not a one-size-fits-all
#ISSLearningFest
9. What goes into SRE work
And what we need to keep services running 24/7
#ISSLearningFest
10. Principles of SRE work
Most literature will mention the 7 pillars or principles
1. Embrace Risk
2. Use Service Level Objectives
3. Eliminate Toil
4. Monitoring (distributed systems)
5. Automate Automate Automate!
6. Release Engineering
7. Simplicity
These 7 principles are what SRE work is based on, and what
we leverage when prioritizing tasks.
#ISSLearningFest
What drives our tasks
11. My SRE team
• Our SRE team is embedded into our Service development
itself. Our focus is keeping our service alive. The SRE
team also works on DevOps/Operational tasks.
• Shared infra forms a big part of our complex ecosystem.
Because other teams maintain these systems, much of our
time is freed up to work on service reliability instead.
#ISSLearningFest
Some background on the Service
12. What goes into SRE work
• In this aspect, SRE functions a lot like a DevOps/Ops team.
• Routinely, we deal with
• Monthly patching
• Change/Release Management
• Dealing with Incident tickets (Both internal and customer)
• We also deal with
• feature development (and supporting new features)
• automation
• Infrastructure Updates
• Region Expansion (and automation)
#ISSLearningFest
The Routine Work
13. Qualities of SRE
#ISSLearningFest
SRE is a multidisciplinary team.
We need a wide range of skillsets
1. Development and Coding
2. Operations and Infrastructure
• Change Management and
Deployments
• Capacity Planning
3. Security and Compliance
4. Incident Management
SRE also needs to have the ability
to see the big picture and
influence architectural design
decisions.
[Diagram: Component → Deployment, Observability (Logging, Telemetry, Alarms), Support and Runbooks, Availability, Security, Compliance]
14. What enables SRE work
And what we need to keep services running 24/7
#ISSLearningFest
15. Keeping the site up 24/7
If we want to talk about keeping the service up 24/7, we
can condense it into 2 key areas:
1. Make sure they don't fail (Availability, Redundancy)
2. When it fails, I want to know when and why (Observability) …
and how to solve it.
Of course, there are really a lot of other things that we
need, but this is a good place to start
#ISSLearningFest
2 key areas
16. Availability
• Use a High Availability setup to
introduce redundancy.
• There are also many other non-
functional requirements that are
tied to this: resiliency, redundancy
#ISSLearningFest
[Diagram: Users → Web UI / API Gateway → Service Load Balancer → Computes]
Enabling Availability through Hardware
17. Observability
When an outage occurs, we want
to be able to know the
current state of the service.
Instrumentation is a key part
of this.
• Observability is a key
enabler for SLOs and SLIs
#ISSLearningFest
[Diagram: Component → Telemetry/Metrics, Logs, Alarms]
Observability and Instrumentation
19. Quantifying Reliability
• For my team, we don't have
SLAs; we're a free service.
However, we do set SLOs,
which are objectives that the
SRE team wants to hit.
• E.g. 99.5% availability
• Our SLOs are set against
specific operations of the
service (CRUDL).
#ISSLearningFest
Site Availability
20. Service Level Objectives and Indicators
• We look at each individual REST service (CRUDL)
• Error rate (reliability)
• How long before an asynchronous request is served? (latency)
• Backend processing of an entity needs to complete within 2
minutes
• Every REST service has its own SLOs and SLIs, plus an
overall compounded one for reporting as well.
#ISSLearningFest
SLOs and SLIs
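As a rough illustration of these per-operation SLIs, here is a minimal sketch. The function names, the 99.5% objective, and the 2-minute threshold are just the example figures from the slides, not the service's real implementation.

```python
# Illustrative SLI calculations for a single REST operation.

def availability_sli(success_count, total_count):
    """Fraction of requests served without error (error-rate SLI)."""
    if total_count == 0:
        return 1.0  # no traffic: treat the objective as met
    return success_count / total_count

def latency_sli(latencies_s, threshold_s=120.0):
    """Fraction of asynchronous requests completed within the threshold
    (e.g. backend processing of an entity within 2 minutes)."""
    if not latencies_s:
        return 1.0
    return sum(1 for t in latencies_s if t <= threshold_s) / len(latencies_s)

def meets_slo(sli, objective=0.995):
    """Check a measured SLI against the SLO target (e.g. 99.5%)."""
    return sli >= objective
```

For example, 9,970 successes out of 10,000 requests gives an SLI of 0.997, which meets a 99.5% objective; a compounded SLO for reporting could then be built from these per-operation values.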
21. Tracking metrics
• Collecting the information is just part of it. What we want
to do with the information is more important. We want
alerts and alarms to be actionable!
• From the metrics, we can also pinpoint issues in our
components, e.g. spikes in CPU utilization, memory leaks.
• Component developers and SRE need to agree on what
metrics to emit.
#ISSLearningFest
Other takeaways
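One common way to keep alerts actionable is to fire only on sustained breaches rather than single spikes. A minimal sketch, with a hypothetical CPU-utilization threshold and window (not the team's actual alarm configuration):

```python
# Sketch: a "sustained breach" alarm over periodic metric samples
# (e.g. CPU utilization %). Requiring several consecutive breaches
# avoids paging on a single transient spike.

def sustained_breach(samples, threshold=90.0, min_consecutive=3):
    """True if `samples` contains `min_consecutive` or more
    consecutive values above `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```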
22. Conclusion
• There are many aspects and concepts of SRE work that we
did not cover here as well, like error budgets, toil, and
automation.
• Hopefully this gives you a glimpse into my world and you
have some insights to take away.
#ISSLearningFest
TODO: Log in to PollEverywhere first and initialize the polls!
Here’s something I’m trying out… So while I’m out talking about myself, I’d like you guys to participate in a poll
SRE is a HUGE topic, so I obviously won't be able to cover it in the short time I have, but what I'd like to share with you is a few of the more critical and interesting points drawn from working in an SRE team. Some of this is common information you can find via Google, so there are really no surprises there, but, as much as I can, I'd like to put it into the context of how my team actually leverages this, as well as a bit of the more practical approaches, or how these are being implemented.
I'm working as the SRE lead for one of the Oracle Cloud services – the Java Management Service, or JMS for short. It's a free service available on OCI that deals a lot with how Java usage is managed at scale. The service itself is owned by the Java Platform group.
The service is quite new; we launched about a year ago. It is currently deployed into around 40 regions, both commercial and non-commercial.
I'm not going to talk too much about this, but if you're curious, you can run a quick search for Java Management Service and find out for yourself.
I'm from a development background: I started off doing development work before transitioning into DevOps work and then SRE.
First Poll: What industry are you from ?
It was in the 1990s that we saw one of the first SaaS offerings, and a few years after, we had a huge explosion with media content providers and social media bursting onto the scene. There was a primary shift in the paradigm of how products were delivered to customers. The tail end of the software delivery phases became more important, and it got tied directly to revenue. This, of course, triggered a change to the fundamental way these products are built and created new roles that needed to be filled.
I'm not going to touch too much on these engineering principles. For SRE, most literature will mention the 7 key principles, but I believe that may be a bit too dry to cover in this session. I'd like to just touch on what is immediately recognizable in my SRE work.
Site reliability engineering was coined by Google engineering teams and fundamentally involves applying engineering principles to help balance functional requirements with reliability. Note that SRE is a very opinionated approach to how organizations want to run or achieve reliability. The principles may be the same, but different approaches are possible, and different organizations would build up their SRE teams focusing on different things.
DevOps is quite similar to SRE. In some literature, people have proposed that DevOps is an implementation of SRE concepts, since both bridge Development and Operations. In my opinion, they focus on different aspects: while SRE focuses on solving issues around operations, scale, and reliability, DevOps focuses more on the development and release pipeline.
According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.[11][12]
In the 2022 report by Dynatrace, out of 450 SREs polled, only 20% claim to have a mature practice.
Also, there have been teams that have simply rebranded their Ops teams as SRE and/or DevOps teams. At the end of the day, we need to consider what kind of role they are really fulfilling.
I'm not going to drill too much into the details of the principles, but I thought it deserved a slide, because it is really what guides almost all the SRE work that happens. The principles are mostly quite closely related to each other, though some stand on their own.
E.g. Eliminating toil, and automation as you can guess are quite closely related.
What I wanted to impress upon everyone here is that yes, we do have a set of overarching principles that we rely on.
Embrace risk (and manage it). Understand that no service is 100% reliable. By allowing some small amount of risk, like a 99.5% target, we can trade off that 0.5% risk for some other benefits, like faster deployments. We need to justify whether the associated risk is worth the benefits we gain.
Using SLOs and SLIs allows us to measure the actual performance of the service.
Eliminating toil is about minimizing the manual, repetitive operational work that scales with the service, largely through automation.
As I mentioned, SRE is very opinionated, so here is a bit of background on how my organization has implemented SRE.
JMS is about a year old, with plenty of new features in the backlog. The SRE team is still maturing, and there is still a lot of work to be done as well.
In our organization, our SRE team deals with the DevOps aspects as well. The team owns the development pipeline and the process, on top of the SRE work. A lot of our pipeline is shared infrastructure: Git repository, CI, artifact store, and CD/deployment. We have teams that deal with those aspects of the pipeline, so we can concentrate on the customization parts that we need.
We deal with routine tasks, which are mostly automated. But at the end of the day, the process still needs someone to approve and execute it.
In SRE work, change management refers to our component deployment process. There is already an established process in place, and SRE's responsibilities involve deployment of the components to production. In our team, we run 2-week sprints, which, generally 95% of the time, end with a production deployment. The process is 90% automated through a CD pipeline, and SRE's job is generally to approve a release deployment and deal with any incidents. Our service requires deployment to 30+ sites, so we need to adopt a progressive rollout approach, and it takes a fair bit of time to complete; plus we have safeguards, like ensuring that a deployment is stable enough on a region/site before we move forward with the rest.
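The progressive rollout with a stability gate could be sketched roughly like this. The `deploy` and `is_stable` callables are hypothetical stand-ins, not the actual CD tooling:

```python
# Sketch of a progressive, region-by-region rollout with a stability
# gate, loosely mirroring the safeguard described above.

import time

def progressive_rollout(regions, deploy, is_stable, soak_seconds=0):
    """Deploy region by region; halt if a region looks unstable."""
    completed = []
    for region in regions:
        deploy(region)
        time.sleep(soak_seconds)  # let the new build soak before the check
        if not is_stable(region):
            return completed, region  # halt the rollout at the bad region
        completed.append(region)
    return completed, None  # all regions done, nothing failed
```

A run that fails its stability check in the second region stops there, leaving the remaining regions on the previous build.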
We also work on incident tickets. The shared-services ecosystem that supports our system also generates tickets for us, in areas like resolving faulty components that failed in our production environments (e.g. a Chef failure).
SRE work does affect development cadences. Sometimes we do become a bottleneck.
There are really many aspects of the SRE/Ops work we could talk about, but we're not going to. Disaster recovery, load testing … these are all intricately tied to the work that we do.
My SRE team's priorities and responsibilities:
Availability
Latency
Performance
Monitoring
Change Management
Emergency Response
Capacity Planning
The SRE team touches many aspects of the overall product, and a lot of non-functional requirements. Building up an SRE team is no easy feat. Our SRE engineers are a mix: some come from a development background, and some from operations and infrastructure.
As SRE, we need that big-picture view of how these pieces fit together, then feed the requirements back into the development teams. Sometimes development does get short-sighted on reliability requirements, especially in areas of scaling and workloads. Their focus is often on the functional requirements needed to complete the story.
There are parts of the system that need to be built into the product backlog. Logging, Telemetry, Audit to name a few.
We need to influence architectural design decisions. Reliability needs to be built into the backlog. We need to build reliability from the beginning.
In a 2022 survey, about 50% of 450 SRE engineers polled said that they dedicate a significant amount of time to influencing design decisions.
A lot of the aspects of SRE work are interconnected. Like our topic: keeping the site up 24/7 involves many layers of complexity.
This is where I start to condense the content a bit. SRE is a huge multidisciplinary field, and there is no easy way to tell you everything about it. Instead, in this session, I've picked out 2 aspects, or key areas, that I think contribute to the reliability of the site.
Remember Murphy's law! What can fail will fail!
Of course, SRE work is a lot more than that, and these 2 don't cover even 10% of the work we do. But since we're going to talk about 24/7, I thought I would pick out these 2, and they are very important ones.
The most common way to solve availability is hardware. At a bare minimum, we need to put in an HA setup. Being HA requires us to have at least 3 nodes in place to help service requests. We also distribute the compute instance deployments to different availability domains (different sites) and fault domains (different hardware) to reduce the risks.
HA also opens up the options for having rolling updates when doing deployments and patching.
Scaling (Horizontal / Vertical) with no downtime.
Of course, we also need to ensure the components are able to support the configuration. Requirements like component heartbeat, being stateless and/or asynchronous.
Availability is an outcome of the infrastructure engineering work that is being done.
Balancing cost is important! This is the key tradeoff in managing the redundancy aspect of site availability. Being redundant adds a layer of robustness into the infrastructure.
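As a toy illustration of the redundancy idea at the load-balancer layer, here is a sketch that routes requests round-robin across healthy nodes only. The node names and health-check predicate are made up; a real load balancer does this (and far more) at the infrastructure level:

```python
# Toy illustration: round-robin routing over healthy nodes only,
# assuming at least three nodes as described above.

import itertools

def route_requests(nodes, is_healthy, n_requests):
    """Return a target node for each request, skipping unhealthy nodes."""
    pool = [n for n in nodes if is_healthy(n)]
    if not pool:
        raise RuntimeError("no healthy nodes available")
    cycle = itertools.cycle(pool)
    return [next(cycle) for _ in range(n_requests)]
```

With one of three nodes marked unhealthy, traffic simply alternates between the remaining two, which is the behaviour that lets rolling updates and patching happen with no downtime.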
Observability is a really important aspect of SRE work. There are generally 3 key pillars: telemetry, logs, and tracing. I'm going to use the words observability and instrumentation very loosely here.
Java Management is a FREE service. Any OCI customer can use it. Our SLO is 99.95%, which allows for about 21.6 minutes of downtime monthly, or roughly 4.4 hours a year.
Planned / Regular maintenance does not count towards our downtime
Failure of downstream services does not count towards our downtime
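The downtime arithmetic behind an availability SLO is easy to sketch: the allowed downtime is simply (1 − SLO) times the period length.

```python
# Converting an availability SLO into an allowed-downtime budget.

def downtime_budget_minutes(slo, period_days=30.0):
    """Minutes of allowed downtime per period for a given SLO."""
    return (1.0 - slo) * period_days * 24 * 60

# 99.95% over a 30-day month allows about 21.6 minutes of downtime;
# over a 365-day year it allows about 4.4 hours.
```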
SLA is also something that is usually business driven. SRE teams generally have no part in defining SLAs. (In our case, we’re a free OCI service, so we don’t have SLAs, only SLOs)
Site availability is generally something that is too coarse-grained to be useful for SRE work. It's useful for reporting to customers and for management-level reports, but from an SRE perspective, we need something a lot more granular.
Usually in development and pre-production, we may not be able to generate sufficient load or traffic behaviour to pinpoint issues, and these are the issues that will come up when we deploy to production.
We need to look beyond what our metrics are collecting and analyze the information! Not all collected metrics end up in an SLO/SLI.
Also, to circle back, observability is a very important property of the component. All this needs to be enabled by instrumentation built into the component itself. We also collect this information from our supporting tools, like our databases and queues, so choosing tools is important!
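As a sketch of instrumentation built into a component, here is a hypothetical timing decorator that records a per-call latency metric and emits a log line. The in-memory dict stands in for a real telemetry client:

```python
# Sketch: minimal instrumentation inside a component, recording call
# latency into a metrics store and logging each call.

import functools
import logging
import time

def instrumented(metrics):
    """Decorator that records call latency into `metrics` and logs it."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                metrics.setdefault(fn.__name__, []).append(elapsed)
                logging.info("%s took %.4fs", fn.__name__, elapsed)
        return inner
    return wrap
```

Decorating the component's request handlers this way yields the raw latency samples that the SLIs and alarms discussed earlier are computed from.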
There are many aspects of SRE work that we do not cover in these 45 minutes: things like security, compliance, incident management, error budgets, and toil vs. automation.
I didn't intend for this session to be a highly technical one where we condense the entire SRE doctrine into 45 minutes, but hopefully I've shared enough little nuggets of information to give everyone an insight into what SRE, as in both the engineering work and the engineer role, is like.