There are many phases in the software development cycle, from requirements to development and testing, but at the tail of the process is an often overlooked aspect: deployment and delivery. With the paradigm shift from delivering on-site software to offering software-as-a-service, Site Reliability Engineering is beginning to take a greater role in product delivery.
This session aims to give a glimpse of the work that goes into site reliability engineering (SRE) and the effort that goes into keeping a service going 24/7.
4. Goals of this Session
• Introduction to SRE
• What goes into SRE work
• What goes into keeping a service up 24/7
#ISSLearningFest
5. On Site Reliability Engineering
• Primary shift in how products are delivered to
customers, driven by a boom of as-a-service, cloud-native
offerings
• This shift triggers a change in how products are
built and what new roles are required.
#ISSLearningFest
6. So what exactly is SRE ?
There are a lot of different explanations and definitions,
and it's really hard to pin down exactly what SRE is
Applying software engineering principles to
infrastructure and operations to create
reliable systems
What software engineering principles? SLOs, reducing toil,
release engineering …
#ISSLearningFest
7. On Site Reliability Engineering
As a concept, SRE deals with the engineering approach to
several non-functional requirements: availability,
scalability, elasticity, capacity planning, and monitoring,
among others.
Practices differ widely: SRE is a very opinionated
approach. Different organizations would
prioritize differently.
SRE is not a one-size-fits-all
#ISSLearningFest
9. What goes into SRE work
And what we need to keep services running 24/7
#ISSLearningFest
10. Principles of SRE work
Most literature will mention the 7 pillars or principles
1. Embrace Risk
2. Use Service Level Objectives
3. Eliminate Toil
4. Monitoring (distributed systems)
5. Automate Automate Automate!
6. Release Engineering
7. Simplicity
These 7 principles are what SRE work is based on, and what
we leverage when prioritizing tasks.
#ISSLearningFest
What drives our tasks
11. My SRE team
• Our SRE team is embedded into our Service development
itself. Our focus is keeping our service alive. The SRE
team also works on DevOps/Operational tasks.
• Shared infra forms a big part of our complex ecosystem.
Because other teams maintain these systems, much of our
time is freed up to work on service reliability instead.
#ISSLearningFest
Some background on the Service
12. What goes into SRE work
• In this aspect, SRE functions a lot like a DevOps/Ops team.
• Routinely, we deal with
• Monthly patching
• Change/Release Management
• Dealing with Incident tickets (Both internal and customer)
• We also deal with
• feature development (and supporting new features)
• automation
• Infrastructure Updates
• Region Expansion (and automation)
#ISSLearningFest
The Routine Work
13. Qualities of SRE
#ISSLearningFest
SRE is a multidisciplinary team.
We need a wide range of skillsets
1. Development and Coding
2. Operations and Infrastructure
• Change Management and
Deployments
• Capacity Planning
3. Security and Compliance
4. Incident Management
SRE also needs to have the ability
to see the big picture and
influence architectural design
decisions.
[Diagram: Component → Deployment, Observability (Logging, Telemetry, Alarms), Support and Runbooks, Availability, Security, Compliance]
14. What enables SRE work
And what we need to keep services running 24/7
#ISSLearningFest
15. Keeping the site up 24/7
If we want to talk about keeping the service up 24/7, we
can condense it into 2 key areas:
1. Make sure they don't fail (Availability, Redundancy)
2. When it fails, I want to know when and why (Observability) …
and how to solve it.
Of course, there are really a lot of other things that we
need, but this is a good place to start
#ISSLearningFest
2 key areas
16. Availability
• Use a High Availability setup to
introduce redundancy.
• There are also many other non-
functional requirements that are
tied to this: resiliency, redundancy
#ISSLearningFest
[Diagram: Users → Web UI / API Gateway → Service Load Balancer → Computes]
Enabling Availability through Hardware
17. Observability
When an outage occurs, we want
to be able to know the
current state of the service.
Instrumentation is a key part
of this.
• Observability is a key
enabler for SLOs and SLIs
#ISSLearningFest
[Diagram: Component → Telemetry/Metrics, Logs, Alarms]
Observability and Instrumentation
19. Quantifying Reliability
• For my team, we don't have
SLAs; we're a free service.
However, we do set SLOs,
which are objectives that the
SRE team wants to hit.
• E.g. 99.5% availability
• Our SLOs are set against
specific operations of the
service (CRUDL).
#ISSLearningFest
Site Availability
20. Service Level Objectives and Indicators
• We look at each individual REST service (CRUDL)
• Error rate (reliability)
• How long before an asynchronous request is served? (latency)
• Backend processing of an entity needs to complete within 2
minutes
• Every REST service has its own SLOs and SLIs, plus an
overall compounded one for reporting as well.
#ISSLearningFest
SLOs and SLIs
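As a rough illustration of these per-operation SLIs, here is a minimal sketch. The function names, the 99.5% objective, and the 2-minute threshold are just the example figures from the slides, not the service's real implementation.

```python
# Illustrative SLI calculations for a single REST operation.

def availability_sli(success_count, total_count):
    """Fraction of requests served without error (error-rate SLI)."""
    if total_count == 0:
        return 1.0  # no traffic: treat the objective as met
    return success_count / total_count

def latency_sli(latencies_s, threshold_s=120.0):
    """Fraction of asynchronous requests completed within the threshold
    (e.g. backend processing of an entity within 2 minutes)."""
    if not latencies_s:
        return 1.0
    return sum(1 for t in latencies_s if t <= threshold_s) / len(latencies_s)

def meets_slo(sli, objective=0.995):
    """Check a measured SLI against the SLO target (e.g. 99.5%)."""
    return sli >= objective
```

For example, 9,970 successes out of 10,000 requests gives an SLI of 0.997, which meets a 99.5% objective; a compounded SLO for reporting could then be built from these per-operation values.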
21. Tracking metrics
• Collecting the information is just part of it. What we want
to do with the information is more important. We want
alerts and alarms to be actionable!
• From the metrics, we can also pinpoint issues in our
components, e.g. spikes in CPU utilization, memory leaks.
• Component developers and SRE need to agree on what
metrics to emit.
#ISSLearningFest
Other takeaways
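One common way to keep alerts actionable is to fire only on sustained breaches rather than single spikes. A minimal sketch, with a hypothetical CPU-utilization threshold and window (not the team's actual alarm configuration):

```python
# Sketch: a "sustained breach" alarm over periodic metric samples
# (e.g. CPU utilization %). Requiring several consecutive breaches
# avoids paging on a single transient spike.

def sustained_breach(samples, threshold=90.0, min_consecutive=3):
    """True if `samples` contains `min_consecutive` or more
    consecutive values above `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```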
22. Conclusion
• There are many aspects and concepts of SRE work that we
did not cover here as well, like error budgets, toil, and
automation.
• Hopefully this gives you a glimpse into my world and you
have some insights to take away.
#ISSLearningFest
TODO: Log in to PollEverywhere first and initialize the polls!
Here’s something I’m trying out… So while I’m out talking about myself, I’d like you guys to participate in a poll
SRE is a HUGE topic, so I obviously won't be able to cover it in the short time I have, but what I'd like to share with you is a few of the more critical and interesting points drawn from working in an SRE team. Some of this is common information you can find via Google, so there are really no surprises there, but, as much as I can, I'd like to put it into the context of how my team actually leverages this, as well as a bit of the more practical approaches, or how these are being implemented.
I'm working as the SRE lead for one of the Oracle Cloud services – the Java Management Service, or JMS for short. It's a free service available on OCI that deals a lot with how Java usage is managed at scale. The service itself is owned by the Java Platform group.
The service is quite new; we launched about a year ago. It is currently deployed into around 40 regions, both commercial and non-commercial.
I'm not going to talk too much about this, but if you're curious, you can run a quick search for Java Management Service and find out for yourself.
I'm from a development background: I started off doing development work before transitioning into DevOps work and then SRE.
First Poll: What industry are you from ?
It was in the 1990s that we saw one of the first SaaS offerings, and a few years after, we had a huge explosion with media content providers and social media bursting onto the scene. There was a primary shift in the paradigm of how products were delivered to customers. The tail end of the software delivery phases became more important, and it got tied directly to revenue. This, of course, triggered a change to the fundamental way these products are built and created new roles that needed to be filled.
I'm not going to touch too much on these engineering principles. For SRE, most literature will mention the 7 key principles, but I believe that may be a bit too dry to cover in this session. I'd like to just touch on what is immediately recognizable in my SRE work.
Site reliability engineering was coined by Google engineering teams and fundamentally involves applying engineering principles to help balance functional requirements with reliability. Note that SRE is a very opinionated approach to how organizations want to run or achieve reliability. The principles may be the same, but different approaches are possible, and different organizations would build up their SRE teams focusing on different things.
DevOps is quite similar to SRE. In some literature, people have proposed that DevOps is an implementation of SRE concepts, since both bridge Development and Operations. In my opinion, they focus on different aspects: while SRE focuses on solving issues around operations, scale, and reliability, DevOps focuses more on the development and release pipeline.
According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.[11][12]
In the 2022 report by Dynatrace, out of 450 SREs polled, only 20% claim to have a mature practice.
Also, there have been teams that have simply rebranded their Ops teams as SRE and/or DevOps teams. At the end of the day, we need to consider what kind of role they are really fulfilling.
I'm not going to drill too much into the details of the principles, but I thought it deserved a slide, because it is really what guides almost all the SRE work that happens. The principles are mostly quite closely related to each other, though some stand on their own.
E.g. Eliminating toil, and automation as you can guess are quite closely related.
What I wanted to impress upon everyone here is that yes, we do have a set of overarching principles that we rely on.
Embrace risk (and manage it). Understand that no service is 100% reliable. By allowing some small amount of risk, like a 99.5% target, we can trade off that 0.5% risk for some other benefits, like faster deployments. We need to justify whether the associated risk is worth the benefits we gain.
Using SLOs and SLIs allows us to measure the actual performance of the service.
Eliminating toil is about minimizing the manual, repetitive operational work that scales with the service, largely through automation.
As I mentioned, SRE is very opinionated, so here is a bit of background on how my organization has implemented SRE.
JMS is about a year old, with plenty of new features in the backlog. The SRE team is still maturing, and there is still a lot of work to be done as well.
In our organization, our SRE team deals with the DevOps aspects as well. The team owns the development pipeline and the process, on top of the SRE work. A lot of our pipeline is shared infrastructure: Git repository, CI, artifact store, and CD/deployment. We have teams that deal with those aspects of the pipeline, so we can concentrate on the customization parts that we need.
We deal with routine tasks, which are mostly automated. But at the end of the day, the process still needs someone to approve and execute it.
In SRE work, change management refers to our component deployment process. There is already an established process in place, and SRE's responsibilities involve deployment of the components to production. In our team, we run 2-week sprints, which, generally 95% of the time, end with a production deployment. The process is 90% automated through a CD pipeline, and SRE's job is generally to approve a release deployment and deal with any incidents. Our service requires deployment to 30+ sites, so we need to adopt a progressive rollout approach, and it takes a fair bit of time to complete; plus we have safeguards, like ensuring that a deployment is stable enough on a region/site before we move forward with the rest.
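The progressive rollout with a stability gate could be sketched roughly like this. The `deploy` and `is_stable` callables are hypothetical stand-ins, not the actual CD tooling:

```python
# Sketch of a progressive, region-by-region rollout with a stability
# gate, loosely mirroring the safeguard described above.

import time

def progressive_rollout(regions, deploy, is_stable, soak_seconds=0):
    """Deploy region by region; halt if a region looks unstable."""
    completed = []
    for region in regions:
        deploy(region)
        time.sleep(soak_seconds)  # let the new build soak before the check
        if not is_stable(region):
            return completed, region  # halt the rollout at the bad region
        completed.append(region)
    return completed, None  # all regions done, nothing failed
```

A run that fails its stability check in the second region stops there, leaving the remaining regions on the previous build.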
We also work on incident tickets. The shared-services ecosystem that supports our system also generates tickets for us, in areas like resolving faulty components that failed in our production environments (e.g. a Chef failure).
SRE work does affect development cadences. Sometimes we do become a bottleneck.
There are really many aspects of the SRE/Ops work we could talk about, but we're not going to. Disaster recovery, load testing … these are all intricately tied to the work that we do.
My SRE team's priorities and responsibilities:
Availability
Latency
Performance
Monitoring
Change Management
Emergency Response
Capacity Planning
The SRE team touches many aspects of the overall product, and a lot of non-functional requirements. Building up an SRE team is no easy feat. Our SRE engineers are a mix: some come from a development background, and some from operations and infrastructure.
As SRE, we need that big-picture view of how these pieces fit together, then feed the requirements back into the development teams. Sometimes development does get short-sighted on reliability requirements, especially in areas of scaling and workloads. Their focus is often on the functional requirements needed to complete the story.
There are parts of the system that need to be built into the product backlog. Logging, Telemetry, Audit to name a few.
We need to influence architectural design decisions. Reliability needs to be built into the backlog. We need to build reliability from the beginning.
In a 2022 survey, about 50% of 450 SRE engineers polled said that they dedicate a significant amount of time to influencing design decisions.
A lot of the aspects of SRE work are interconnected. Like our topic: keeping the site up 24/7 involves many layers of complexity.
This is where I start to condense the content a bit. SRE is a huge multidisciplinary field, and there is no easy way to tell you everything about it. Instead, in this session, I've picked out 2 aspects, or key areas, that I think contribute to the reliability of the site.
Remember Murphy's law! What can fail will fail!
Of course, SRE work is a lot more than that, and these 2 don't cover even 10% of the work we do. But since we're going to talk about 24/7, I thought I would pick out these 2, and they are very important ones.
The most common way to solve availability is hardware. At a bare minimum, we need to put in an HA setup. Being HA requires us to have at least 3 nodes in place to help service requests. We also distribute the compute instance deployments to different availability domains (different sites) and fault domains (different hardware) to reduce the risks.
HA also opens up the options for having rolling updates when doing deployments and patching.
Scaling (Horizontal / Vertical) with no downtime.
Of course, we also need to ensure the components are able to support the configuration. Requirements like component heartbeat, being stateless and/or asynchronous.
Availability is an outcome of the infrastructure engineering work that is being done.
Balancing cost is important! This is the key tradeoff in managing the redundancy aspect of site availability. Being redundant adds a layer of robustness into the infrastructure.
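As a toy illustration of the redundancy idea at the load-balancer layer, here is a sketch that routes requests round-robin across healthy nodes only. The node names and health-check predicate are made up; a real load balancer does this (and far more) at the infrastructure level:

```python
# Toy illustration: round-robin routing over healthy nodes only,
# assuming at least three nodes as described above.

import itertools

def route_requests(nodes, is_healthy, n_requests):
    """Return a target node for each request, skipping unhealthy nodes."""
    pool = [n for n in nodes if is_healthy(n)]
    if not pool:
        raise RuntimeError("no healthy nodes available")
    cycle = itertools.cycle(pool)
    return [next(cycle) for _ in range(n_requests)]
```

With one of three nodes marked unhealthy, traffic simply alternates between the remaining two, which is the behaviour that lets rolling updates and patching happen with no downtime.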
Observability is a really important aspect of SRE work. There are generally 3 key pillars: telemetry, logs, and tracing. I'm going to use the words observability and instrumentation very loosely here.
Java Management is a FREE service. Any OCI customer can use it. Our SLO is 99.95%, which allows for about 21.6 minutes of downtime monthly, or roughly 4.4 hours a year.
Planned / Regular maintenance does not count towards our downtime
Failure of downstream services does not count towards our downtime
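The downtime arithmetic behind an availability SLO is easy to sketch: the allowed downtime is simply (1 − SLO) times the period length.

```python
# Converting an availability SLO into an allowed-downtime budget.

def downtime_budget_minutes(slo, period_days=30.0):
    """Minutes of allowed downtime per period for a given SLO."""
    return (1.0 - slo) * period_days * 24 * 60

# 99.95% over a 30-day month allows about 21.6 minutes of downtime;
# over a 365-day year it allows about 4.4 hours.
```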
SLA is also something that is usually business driven. SRE teams generally have no part in defining SLAs. (In our case, we’re a free OCI service, so we don’t have SLAs, only SLOs)
Site availability is generally something that is too coarse-grained to be useful for SRE work. It's useful for reporting to customers and for management-level reports, but from an SRE perspective, we need something a lot more granular.
Usually in development and pre-production, we may not be able to generate sufficient load or traffic behaviour to pinpoint issues, and these are the issues that will come up when we deploy to production.
We need to look beyond what our metrics are collecting and analyze the information! Not all collected metrics end up in an SLO/SLI.
Also, to circle back, observability is a very important property of the component. All this needs to be enabled by instrumentation built into the component itself. We also collect this information from our supporting tools, like our databases and queues, so choosing tools is important!
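As a sketch of instrumentation built into a component, here is a hypothetical timing decorator that records a per-call latency metric and emits a log line. The in-memory dict stands in for a real telemetry client:

```python
# Sketch: minimal instrumentation inside a component, recording call
# latency into a metrics store and logging each call.

import functools
import logging
import time

def instrumented(metrics):
    """Decorator that records call latency into `metrics` and logs it."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                metrics.setdefault(fn.__name__, []).append(elapsed)
                logging.info("%s took %.4fs", fn.__name__, elapsed)
        return inner
    return wrap
```

Decorating the component's request handlers this way yields the raw latency samples that the SLIs and alarms discussed earlier are computed from.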
There are many aspects of SRE work that we do not cover in these 45 minutes: things like security, compliance, incident management, error budgets, and toil vs. automation.
I didn't intend for this session to be a highly technical one where we condense the entire SRE doctrine into 45 minutes, but hopefully I've shared enough little nuggets of information to give everyone an insight into what SRE, as in both the engineering work and the engineer role, is like.