Site Reliability Engineer
Keeping the lights on 24/7
Darrel Chia
SRE Lead
Oracle Corporation, Singapore
#ISSLearningFest
Darrel Chia
Consulting Member of Technical Staff / SRE Lead
Java Management Service
Oracle Corporation
Goals of this Session
• Introduction to SRE
• What goes into SRE work
• What goes into keeping a service up 24/7
On Site Reliability Engineering
• A primary shift in how products are delivered to
customers, driven by a boom in as-a-service, cloud-native
offerings
• This shift changes how products are built and creates
a need for new roles.
So what exactly is SRE?
There are a lot of different explanations and definitions,
and it’s really hard to be clear about what exactly SRE is.
Applying software engineering principles to
infrastructure and operations to create
reliable systems
Which software engineering principles? SLOs, reducing toil,
release engineering …
On Site Reliability Engineering
As a concept, SRE deals with the engineering approach to
several non-functional requirements: availability,
scalability, elasticity, capacity planning and monitoring,
among others.
Practices differ widely: SRE is a very opinionated
approach, and different organizations
prioritize differently.
SRE is not one-size-fits-all.
What goes into SRE work
And what we need to keep services up 24/7
Principles of SRE work
Most literature will mention the 7 pillars or principles
1. Embrace Risk
2. Use Service Level Objectives
3. Eliminate Toil
4. Monitoring (distributed systems)
5. Automate Automate Automate!
6. Release Engineering
7. Simplicity
These 7 principles are what SRE work is based on, and what
we rely on when prioritizing tasks.
What drives our tasks
My SRE team
• Our SRE team is embedded in service development
itself. Our focus is keeping our service alive. The SRE
team also works on DevOps/operational tasks.
• Shared infrastructure forms a big part of our complex ecosystem.
This frees up much of the time we would otherwise spend maintaining
these systems, so we can work on service reliability instead.
Some background on the Service
What goes into SRE work
• In this respect, SRE functions a lot like a DevOps/Ops team.
DevOps
• Routinely, we deal with
  • Monthly patching
  • Change/release management
  • Incident tickets (both internal and customer)
• We also deal with
  • Feature development (and supporting new features)
  • Automation
  • Infrastructure updates
  • Region expansion (and automation)
The Routine Work
Qualities of SRE
SRE is a multidisciplinary team.
We need a wide range of skillsets
1. Development and Coding
2. Operations and Infrastructure
• Change Management and Deployments
• Capacity Planning
3. Security and Compliance
4. Incident Management
SRE also needs the ability to see the big picture and
influence architectural design decisions.
[Diagram labels: Component, Deployment, Observability, Logging, Telemetry, Alarms, Support and Runbooks, Availability, Security, Compliance]
What enables SRE work
And what we need to keep services up 24/7
Keeping the site up 24/7
If we want to talk about keeping the service up 24/7, we
can condense it into 2 key areas:
1. Make sure it doesn’t fail (Availability, Redundancy)
2. When it fails, know when and why (Observability) …
and how to solve it.
Of course, there are a lot of other things that we
need, but this is a good place to start.
2 key areas
Availability
• Use a High Availability setup to introduce redundancy.
• There are also many other non-functional requirements that are
tied to this: resiliency, redundancy
[Diagram labels: Users, Web UI, API Gateway, Service, Load Balancer, Computes]
Enabling Availability through Hardware
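As a rough illustration of why redundancy helps, the sketch below estimates the availability of several identical compute instances behind a load balancer. It assumes that instances fail independently and that the load balancer itself never fails, which real systems rarely satisfy. The function name and the 99% per-instance figure are made up for the example and are not taken from the service.

```python
def redundant_availability(instance_availability: float, instances: int) -> float:
    """Availability of a group that is up while at least one instance is up.

    Assumes independent failures, which is an idealisation.
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The group is down only when every instance is down at the same time.
    return 1.0 - (1.0 - instance_availability) ** instances


if __name__ == "__main__":
    # Illustrative per-instance availability of 99%.
    for n in (1, 2, 3):
        print(n, f"{redundant_availability(0.99, n):.6f}")
```

Under these assumptions each extra instance multiplies the chance of a total outage by the per-instance failure probability, so the gain from the second instance is much larger than the gain from the third.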
Observability
When an outage occurs, we want to know the
current state of the service.
Instrumentation is a key part of this.
• Observability is a key enabler for SLOs and SLIs
[Diagram labels: Component, Telemetry/Metrics, Logs, Alarms]
and Instrumentation
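One possible way to instrument a component is sketched below: a decorator that records call count, error count and total latency for an operation. The in-memory `METRICS` dictionary, the `instrumented` decorator and the `create_entity` handler are all hypothetical stand-ins for illustration. A real service would emit to its own telemetry backend instead.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a real metrics backend (hypothetical).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(operation: str):
    """Record call count, error count and total latency for an operation."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                # Count the failure, then let the caller see the error.
                METRICS[operation]["errors"] += 1
                raise
            finally:
                # Runs on success and on failure alike.
                METRICS[operation]["calls"] += 1
                METRICS[operation]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("create_entity")
def create_entity(name: str) -> dict:
    # Hypothetical handler; rejects empty names to exercise the error path.
    if not name:
        raise ValueError("name is required")
    return {"name": name}
```

After a few calls, `METRICS["create_entity"]` holds the raw counts from which an error rate or an average latency can be derived.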
Measuring Reliability
Our metrics for success – SLAs, SLOs, SLIs
Quantifying Reliability
• My team doesn’t have SLAs, because we’re a free service.
However, we do set SLOs, which are objectives that the
SRE team wants to hit.
• E.g. 99.5% availability
• Our SLOs are set against specific operations of the
service (CRUDL).
Site Availability
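To make an availability figure concrete, it can be translated into the downtime it permits. The sketch below does that arithmetic. The 30-day window and the function name are assumptions for the example, since the slides do not say what window the SLO is measured over. Over 30 days, a 99.5% objective leaves 216 minutes (3.6 hours) of allowed unavailability.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unavailability a window can absorb while still meeting the SLO."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    window_minutes = window_days * 24 * 60
    # The share of the window not covered by the objective is the allowance.
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    print(allowed_downtime_minutes(99.5))
```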
Service Level Objectives and Indicators
• We look at each individual REST service (CRUDL)
• Error rate (reliability)
• How long before an asynchronous request is served? (latency)
• Backend processing of an entity needs to complete within 2
minutes
• Every REST service has its own SLOs and SLIs, plus an
overall compounded one for reporting.
SLOs and SLIs
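A minimal sketch of per-operation SLIs checked against an SLO, with an overall figure for reporting, might look like the following. The type and function names are hypothetical. The overall figure here is request-weighted, which is only one possible reading of a "compounded" SLI, and a latency SLI (for example, the share of entities processed within 2 minutes) could be computed the same way from different counts.

```python
from dataclasses import dataclass


@dataclass
class OperationStats:
    """Request counts for one REST operation over a reporting window."""
    total: int
    errors: int


def success_ratio(stats: OperationStats) -> float:
    """SLI: fraction of requests that did not fail."""
    if stats.total == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return (stats.total - stats.errors) / stats.total


def report(per_operation: dict[str, OperationStats],
           slo: float) -> dict[str, tuple[float, bool]]:
    """Return (SLI, meets SLO) per operation, plus an 'overall' entry."""
    result = {}
    for name, stats in per_operation.items():
        sli = success_ratio(stats)
        result[name] = (sli, sli >= slo)
    # Request-weighted aggregate across all operations (an assumption).
    combined = OperationStats(
        total=sum(s.total for s in per_operation.values()),
        errors=sum(s.errors for s in per_operation.values()),
    )
    overall = success_ratio(combined)
    result["overall"] = (overall, overall >= slo)
    return result
```

Keeping one entry per operation makes it visible when, say, list requests miss the objective while create requests comfortably meet it, which a single aggregate number would hide.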
Tracking metrics
• Collecting the information is just part of it. What we
do with the information is more important. We want
alerts and alarms to be actionable!
• From the metrics, we can also pinpoint issues in our
components, e.g. spikes in CPU utilization, memory leaks.
• Component developers and SRE need to agree on what
metrics to emit.
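One common way to keep an alarm actionable is to fire only when a metric stays above a threshold for several consecutive readings, so that a single momentary spike does not page anyone. The sketch below shows that idea. The function name, the threshold and the number of readings are illustrative assumptions, not the team's actual alarm rules.

```python
def sustained_breach(samples: list[float], threshold: float,
                     min_consecutive: int) -> bool:
    """True if samples stay above threshold for min_consecutive readings in a row."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise start over.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    cpu_percent = [50, 95, 60, 96, 97, 98]  # made-up readings
    print(sustained_breach(cpu_percent, threshold=90, min_consecutive=3))
```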
Other takeaways
Conclusion
• There are many aspects and concepts of SRE work that we
did not cover here, such as error budgets, toil and
automation.
• Hopefully this gives you a glimpse into my world and
some insights to take away.
Poll
Give Us Your Feedback
Day 3 Programme
Thank You!
issxxx@nus.edu.sg

 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 

Último (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 

Site Reliability Engineer (SRE), We Keep The Lights On 24/7

  • 1. Site Reliability Engineer Keeping the lights on 24/7 Darrel Chia SRE Lead Oracle Corporation, Singapore #ISSLearningFest
  • 2. Darrel Chia, Consulting Member of Technical Staff / SRE Lead, Java Management Service, Oracle Corporation
  • 4. Goals of this Session • Introduction to SRE • What goes into SRE work • What goes into keeping a service up 24/7
  • 5. On Site Reliability Engineering • A primary shift in how products are delivered to customers, driven by a boom in "as-a-service", cloud-native offerings • This shift triggers a change in how products are built and creates new roles that need to be filled.
  • 6. So what exactly is SRE? There are a lot of different explanations and definitions, and it’s really hard to be clear about what exactly SRE is. Using software engineering principles and applying them to infrastructure and operations to create reliable systems. Which software engineering principles? SLOs, reducing toil, release engineering …
  • 7. On Site Reliability Engineering: As a concept, SRE deals with the engineering approach to several non-functional requirements: availability, scalability, elasticity, capacity planning and monitoring, to name a few. Practices differ widely: SRE is a very opinionated approach, and different organizations prioritize differently. SRE is not one-size-fits-all.
  • 9. What goes into SRE work, and what we need to keep services up 24/7
  • 10. Principles of SRE work (what drives our tasks): Most literature will mention the 7 pillars or principles: 1. Embrace Risk 2. Use Service Level Objectives 3. Eliminate Toil 4. Monitoring (distributed systems) 5. Automate, Automate, Automate! 6. Release Engineering 7. Simplicity. These 7 principles are what SRE work is based on, and what we lean on when prioritizing tasks.
  • 11. My SRE team (some background on the service) • Our SRE team is embedded in the service development team itself. Our focus is keeping our service alive. The SRE team also works on DevOps/operational tasks. • Shared infrastructure forms a big part of our complex ecosystem. This frees up a lot of the time we would otherwise need to maintain these systems, so we can work on service reliability instead.
  • 12. What goes into SRE work (the routine work) • In this respect, SRE functions a lot like a DevOps/Ops team. • Routinely, we deal with • Monthly patching • Change/Release Management • Incident tickets (both internal and customer) • We also deal with • Feature development (and supporting new features) • Automation • Infrastructure updates • Region expansion (and automation)
  • 13. Qualities of SRE: SRE is a multidisciplinary team. We need a wide range of skillsets: 1. Development and Coding 2. Operations and Infrastructure • Change Management and Deployments • Capacity Planning 3. Security and Compliance 4. Incident Management. SRE also needs the ability to see the big picture and influence architectural design decisions. (Diagram labels: Component, Deployment, Observability, Logging, Telemetry, Alarms, Support and Runbooks, Availability, Security, Compliance)
  • 14. What enables SRE work, and what we need to keep services up 24/7
  • 15. Keeping the site up 24/7 (2 key areas): If we want to talk about keeping the service up 24/7, we can condense it into 2 key areas: 1. Make sure things don’t fail (Availability, Redundancy) 2. When something fails, I want to know when and why (Observability) … and how to solve it. Of course, there are a lot of other things that we need, but this is a good place to start.
  • 16. Availability (enabling availability through hardware) • Use a High Availability setup to introduce redundancy. • There are also many other non-functional requirements tied to this: resiliency, redundancy. (Diagram labels: Users, Web UI, API Gateway, Service, Load Balancer, Computes)
  • 17. Observability and Instrumentation: When an outage occurs, we want to be able to know the current state of the service. Instrumentation is a key part of this. • Observability is a key enabler for SLOs and SLIs. (Diagram labels: Component, Telemetry/Metrics, Logs, Alarms)
  • 18. Measuring Reliability: Our metrics for success: SLAs, SLOs, SLIs
  • 19. Quantifying Reliability (site availability) • My team doesn’t have SLAs, because we’re a free service. However, we do set SLOs, which are objectives that the SRE team wants to hit. • E.g. 99.5% availability • Our SLOs are set against specific operations of the service (CRUDL).
  • 20. Service Level Objectives and Indicators (SLOs and SLIs) • We look at each individual REST service (CRUDL) • Error rate (reliability) • How long before an asynchronous request is served? (latency) • Backend processing of an entity needs to complete within 2 minutes • Every REST service has its own SLOs and SLIs, plus an overall compounded one for reporting.
  • 21. Tracking metrics (other takeaways) • Collecting the information is just part of it. What we do with the information is more important. We want alerts and alarms to be actionable! • From the metrics, we can also pinpoint issues in our components, e.g. spikes in CPU utilization or memory leaks. • Component developers and SRE need to agree on which metrics to emit.
  • 22. Conclusion • There are many aspects and concepts of SRE work that we did not cover here, such as error budgets, toil and automation. • Hopefully this gives you a glimpse into my world and some insights to take away.
  • 25. Give Us Your Feedback (Day 3 Programme)
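
The per-operation SLOs described in slides 19 and 20 could be evaluated roughly as follows. This is a minimal sketch, not how the JMS team actually computes its indicators: the `RequestRecord` type, the operation names and the sample data are hypothetical, and the only figures taken from the slides are the 99.5% example target and the 2-minute processing limit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestRecord:
    """One observed request (hypothetical shape, for illustration only)."""
    operation: str      # one of the CRUDL operations, e.g. "create"
    succeeded: bool     # False if the request returned an error
    duration_s: float   # time until the request was fully processed


SLO_TARGET = 0.995          # example target from slide 19
MAX_PROCESSING_S = 120.0    # "within 2 minutes" from slide 20


def sli_for(records: List[RequestRecord], operation: str) -> Optional[float]:
    """Fraction of requests for one operation that were both successful
    and processed within the time limit. None if there was no traffic."""
    relevant = [r for r in records if r.operation == operation]
    if not relevant:
        return None
    good = sum(
        1 for r in relevant
        if r.succeeded and r.duration_s <= MAX_PROCESSING_S
    )
    return good / len(relevant)


def meets_slo(records: List[RequestRecord], operation: str) -> bool:
    """True if the measured SLI for the operation reaches the SLO target."""
    sli = sli_for(records, operation)
    return sli is not None and sli >= SLO_TARGET


if __name__ == "__main__":
    sample = (
        [RequestRecord("create", True, 3.0)] * 999
        + [RequestRecord("create", False, 1.0)]
        + [RequestRecord("list", True, 0.2)] * 9
        + [RequestRecord("list", True, 200.0)]   # too slow, counts as bad
    )
    for op in ("create", "list"):
        print(op, sli_for(sample, op), meets_slo(sample, op))
```

Keeping one indicator per operation, rather than a single site-wide number, matches the point made in the slides that each REST service has its own SLO and SLI.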

Editor's notes

  1. TODO: Log in to PollEverywhere first and initialize the polls!
  2. Here’s something I’m trying out: while I’m talking about myself, I’d like you to participate in a poll. First poll: What industry are you from?
  3. SRE is a HUGE topic, so I obviously won’t be able to cover it in the short time I have, but what I’d like to share with you is a few of the more critical and interesting points drawn from working in an SRE team. Some of this is common information you can find on Google, so there are really no surprises there, but, as much as I can, I’d like to put it into the context of how my team actually uses it, along with some of the more practical approaches and how they are implemented. I’m working as the SRE lead for one of the Oracle Cloud services, the Java Management Service, or JMS for short. It’s a free service that’s available on OCI and deals a lot with how Java usage is managed at scale. The service is owned by the Java Platform group. It is quite new; we launched about a year ago. It is currently deployed into around 40 regions, both commercial and non-commercial. I’m not going to talk too much about this, but if you’re curious, you can run a quick search for Java Management Service and find out for yourself. I’m from a development background: I started off doing development work before transitioning into DevOps work and then SRE.
  4. It was in the 1990s that we saw one of the first SaaS offerings, and a few years after, we had a huge explosion with media content providers and social media bursting onto the scene. There was a primary shift in the paradigm of how products were delivered to customers. The tail end of the software delivery phases became more important and got tied directly to revenue. This, of course, triggered a change in the fundamental way these products are built and created new roles that needed to be filled.
  5. I’m not going to touch too much on these engineering principles. For SRE, most literature will mention the 7 key principles, but I believe that may be a bit too dry to cover in this session. I’d like to just touch on what is immediately recognizable in my SRE work.
  6. The term site reliability engineering was coined by Google engineering teams, and it fundamentally involves applying engineering principles to help balance functional requirements with reliability. Note that SRE is a very opinionated view of how organizations want to run or achieve reliability. The principles may be the same, but different approaches are possible, and different organizations build up their SRE teams focusing on different things. DevOps is quite similar to SRE. In some of the literature, people have proposed that DevOps is an implementation of SRE concepts, since both bridge Development and Operations. In my opinion, they focus on different aspects. While SRE focuses on solving issues around operations, scale and reliability, DevOps focuses more on the development and release pipeline.
  7. According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model. In the 2022 report by Dynatrace, out of 450 SREs polled, only 20% claimed to have a mature practice. There have also been teams that simply rebranded their Ops teams as SRE and/or DevOps teams. At the end of the day, we need to consider what kind of role they are really fulfilling.
  8. I’m not going to drill too much into the details of the principles, but I thought they deserved a slide, because they are really what guides almost all the SRE work that happens. Many of the principles are closely related to each other, while some are independent. E.g. eliminating toil and automation, as you can guess, are quite closely related. What I want to impress upon everyone here is that yes, we do have a set of overarching principles that we rely on. Embrace Risk (and manage it): understand that no service is 100% reliable. By allowing some small amount of risk, like targeting 99.5% availability, we can trade off that 0.5% risk for some other benefits, like faster deployments. We need to justify whether the associated risk is worth the benefits we gain. Using SLOs and SLIs allows us to measure the actual performance of the service. Eliminating Toil is about minimizing the
  9. As I mentioned, SRE is very opinionated, so here is a bit of background on how our organization has implemented SRE. JMS is about a year old, with plenty of new features in the backlog. The SRE team is still maturing. There is still a lot of work to be done as well.
  10. In our organization, our SRE team deals with the DevOps aspects as well. The team owns the development pipeline and the process, on top of the SRE work. A lot of our pipeline is shared infrastructure: Git repository, CI, artifact store, and CD/deployment. We have teams that deal with those aspects of the pipeline, so we can concentrate on the customization parts that we need. We deal with routine tasks, which are mostly automated. But at the end of the day, the process still needs someone to approve and execute it. In SRE work, change management refers to our component deployment process. There is already an established process in place, and SRE’s responsibilities involve deployment of the components to production. In our team, we run 2-week sprints, which about 95% of the time end with a production deployment. The process is 90% automated through a CD process, and SRE’s job is generally to approve a release deployment and deal with any incidents. Our service requires deployment to 30+ sites, so we need to adopt a progressive rollout approach, and it takes a fair bit of time to complete. We also have safeguards, like ensuring that a deployment is stable enough on a region/site before we move forward with the rest. We also work on incident tickets. The shared-services ecosystem that supports our system also generates tickets for us, in areas like resolving faulty components that failed in our production environments (e.g. a Chef failure). SRE work does affect development cadences; sometimes we do become a bottleneck. There are really many aspects of the SRE/Ops work we could talk about, but we’re not going to. Disaster recovery, load testing … these are all intricately tied to the work that we do. My service SRE team’s priorities and responsibilities: availability, latency, performance, monitoring, change management, emergency response, capacity planning.
  11. The SRE team touches many aspects of the overall product, and a lot of non-functional requirements. Building up an SRE team is no easy feat. Our SRE engineers are a mix: some come from a development background, and some from operations and infrastructure. As SRE, we need that big-picture view of how these pieces fit together, and then we feed the requirements back to the development teams. Sometimes development does get short-sighted on reliability requirements, especially in areas of scaling and workloads. Their focus is often on the functional requirements needed to complete the story. There are parts of the system that need to be built into the product backlog: logging, telemetry and audit, to name a few. We need to influence architectural design decisions. Reliability needs to be built into the backlog. We need to build reliability in from the beginning. In a 2022 survey, about 50% of 450 SRE engineers polled said that they dedicate a significant amount of time to influencing design decisions.
  12. A lot of the aspects of SRE work are interconnected. Our topic, keeping the site up 24/7, involves many layers of complexity.
  13. This is where I start to condense the content a bit. SRE is a huge multidisciplinary field and there is no easy way to tell you everything about it. Instead, in this session, I’ve picked out 2 aspects, or key areas, that I think contribute to the reliability of the site. Remember Murphy’s law! What can fail will fail! Of course SRE work is a lot more than that, and these 2 don’t cover even 10% of the work we do. But since we’re going to talk about 24/7, I thought I would pick out these 2. And they are very important ones that
  14. The most common way to solve availability is hardware. At a bare minimum, we need to put in an HA setup. To be HA requires us to have at least 3 nodes in place to help service requests. We also distribute the compute instance deployments across different availability domains (different sites) and fault domains (different hardware) to reduce the risks. HA also opens up the option of rolling updates when doing deployments and patching, and of scaling (horizontal/vertical) with no downtime. Of course, we also need to ensure the components are able to support the configuration, with requirements like a component heartbeat and being stateless and/or asynchronous. Availability is an outcome of the infrastructure engineering work that is being done. Balancing cost is important! This is the key tradeoff in managing the redundancy aspect of site availability. Being redundant adds a layer of robustness to the infrastructure.
  15. Observability is a really important aspect of SRE work. There are generally 3 key pillars: telemetry, logs and tracing. I’m going to use the words observability and instrumentation very loosely here.
  16. Java Management Service is a FREE service; any OCI customer can use it. Our SLO is 99.95%, which allows for roughly 21 minutes of downtime per month, or about 4.4 hours per year (1.83 days per year would correspond to a 99.5% target). Planned/regular maintenance does not count towards our downtime. Failure of downstream services does not count towards our downtime. An SLA is usually business driven; SRE teams generally have no part in defining SLAs. (In our case, we’re a free OCI service, so we don’t have SLAs, only SLOs.) Site availability is generally too coarse-grained to be useful for SRE work. It’s useful for reporting to customers and for management-level reports, but from an SRE perspective, we need something a lot more granular.
  17. Usually in development and pre-production, we may not be able to generate sufficient load or traffic behaviour to pinpoint issues, and these are the issues that will come up when we deploy to production. We need to look beyond what our metrics are collecting and analyze the information! Not all collected metrics end up in an SLO/SLI. Also, to circle back, observability is a very important property of the component. All this needs to be enabled by instrumentation built into the component itself. We also collect this information from our supporting tools, like our databases and queues, so choosing tools is important!
  18. There are many aspects of SRE work that we did not cover in these 45 minutes: things like security, compliance, incident management, error budgets, and toil vs automation. I didn’t intend for this session to be a highly technical one, where we condense the entire SRE doctrine into 45 minutes, but hopefully I’ve shared enough little nuggets of information to give everyone an insight into what SRE, both the engineering work and the engineer role, is like.
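
The downtime allowances mentioned in note 16 follow from simple arithmetic on the availability target. The sketch below is only an illustration: the function name is made up, and the 30-day month and 365-day year are assumptions. It ignores the exclusions the speaker mentions, such as planned maintenance and downstream failures.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted by an availability SLO over a period.

    slo_percent: availability target as a percentage, e.g. 99.95
    period_days: length of the measurement window in days (assumed 30 or 365)
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    budget_fraction = 1 - slo_percent / 100      # share of time that may be lost
    return budget_fraction * period_days * 24 * 60


if __name__ == "__main__":
    for slo in (99.5, 99.95):
        per_month_min = allowed_downtime_minutes(slo, 30)
        per_year_hours = allowed_downtime_minutes(slo, 365) / 60
        print(f"{slo}%: {per_month_min:.1f} min/month, {per_year_hours:.1f} h/year")
```

With these assumptions, a 99.95% target works out to about 21.6 minutes per 30-day month, while a 99.5% target allows about 1.8 days per year. That gap is the error budget the speaker refers to when talking about trading a small amount of risk for faster deployments.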