Abstract:
Site Reliability Engineering (SRE) and AIOps are two of the most discussed topics in the IT world these days. SRE incorporates infrastructure and operations aspects to create scalable and reliable software systems that are highly automated and self-healing. Artificial Intelligence for IT Operations (AIOps) takes a further step to automate and enhance IT operations by using data analytics and machine learning. This session covers the benefits of SRE & AIOps and how to adapt to them.
Key Takeaways:
1. Understand the concepts of SRE & AIOps
2. Understand the importance and benefits of SRE & AIOps
3. How do we adapt to SRE & AIOps?
2. AGILE, DEVOPS AND SRE..
Agile Development
• Transformed the way software is built
• Collaboration & quicker feedback loop
• Better control, early value
DevOps
• Cultural transformation focused on delivery speed
• Enable automation wherever possible
• Make development and operation process frictionless
Site Reliability Engineering
Focuses on improving the reliability of software in production by implementing best practices in engineering and operations
3. SITE RELIABILITY ENGINEERING
SRE incorporates engineering, infrastructure and operations aspects to create scalable and reliable software systems that are highly automated and self-healing.
SRE aims to move from DevOps toward NoOps: “what happens when a software engineer is tasked with what used to be called operations.” - Ben Treynor, Founder of Google SRE
The purpose of SRE is to achieve reliability by implementing best practices in engineering and operations.
SRE can be thought of as an extreme implementation of DevOps.
5. SITE RELIABILITY ENGINEER
The ideal site reliability engineer is either a software engineer with a good administration background or a highly skilled system administrator with knowledge of coding and automation: “Part systems administrator, part second tier support and part developer”
There is a 50% cap on the aggregate "ops" work for all SREs. The SRE team must spend the remaining 50% of its time on development activities.
An SRE team is responsible for:
• availability
• latency
• performance
• efficiency
• change management
• monitoring
• emergency response
• capacity planning
7. SRE - METRICS & MEASUREMENTS
SLI: Service Level Indicators are quantitative measures of service behaviour, such as failures per request or request latency
SLO: Service Level Objectives set goals for system availability, performance and success rates
SLA: Service Level Agreements are derived from SLOs and dictate commercial penalties
Error Budget: A measure of risk and of the amount of headroom you have above the SLA
MTTR: Mean time to repair is the average time required to repair a failure
MTTF: Mean time to failure is the predicted elapsed time a system operates before an inherent failure
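The relationship between these metrics can be sketched in a few lines. This is a minimal illustration, assuming an availability SLI computed as good requests over total requests and an error budget computed as 1 minus the target; the function names, the `Outage` type and the sample numbers are assumptions made for the example and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_min: float  # minute at which the failure began
    end_min: float    # minute at which service was restored


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = 1.0 - target   # allowed failure ratio
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    spent = 1.0 - sli       # observed failure ratio
    return (budget - spent) / budget


def mttr(outages: list[Outage]) -> float:
    """Mean time to repair: average outage duration in minutes."""
    return sum(o.end_min - o.start_min for o in outages) / len(outages)


# Illustrative numbers only.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, target=0.999)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.0%}")
print(f"MTTR={mttr([Outage(0, 30), Outage(100, 110)]):.1f} min")
```

With a 99.9% target and 99.95% measured availability, about half of the budget is still available for risky changes; once it reaches zero, a team would typically slow down releases.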
8. TAKE AWAY..
..and AIOps takes a further step from SRE towards automating IT operations using advanced analytics!
9. COGNITIVE LEARNING – INTELLIGENT OPERATIONS (AIOps)
[Figure labels: Insight, Predict, Big Data, Machine Learning]
10. Definition - What Does AIOps Mean?
AIOps is a methodology on the frontier of enterprise IT operations. It automates various aspects of IT and uses artificial intelligence to create self-learning programs that help revolutionize IT services.
It is the application of advanced analytics, in the form of machine learning (ML) and artificial intelligence (AI), towards automating operations so that your IT Ops team can move at the speed that your business expects today.
AIOps refers to multi-layered technology platforms that automate and enhance IT operations by 1) using analytics and machine learning to analyze big data collected from various IT operations tools and devices, in order to 2) automatically spot and react to issues in real time.
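As a rough illustration of "automatically spot and react to issues in real time", the sketch below flags metric values that deviate strongly from a rolling window and calls a remediation hook. It uses a simple z-score test as a stand-in for the machine learning a real AIOps platform would apply; `RollingAnomalyDetector`, `restart_service`, the thresholds and the sample latencies are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class RollingAnomalyDetector:
    """Flags values far from the mean of a rolling window (z-score test)."""

    def __init__(self, window: int = 30, threshold: float = 3.0):
        self.history = deque(maxlen=window)  # recent normal values
        self.threshold = threshold           # allowed standard deviations

    def observe(self, value: float) -> bool:
        """Return True if value deviates strongly from recent history."""
        is_anomaly = False
        if len(self.history) >= 5:  # wait for a small baseline first
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            # Keep anomalies out of the baseline so they do not skew it.
            self.history.append(value)
        return is_anomaly


def restart_service(name: str) -> str:
    # Hypothetical remediation hook; a real one would call an automation tool.
    return f"restart requested for {name}"


detector = RollingAnomalyDetector(window=20, threshold=3.0)
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 450, 101]
flags = [detector.observe(v) for v in latencies_ms]
anomalies = [v for v, flagged in zip(latencies_ms, flags) if flagged]
actions = [restart_service("checkout") for _ in anomalies]
print(anomalies, actions)
```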
11. What Will Tomorrow Look Like ?
….Function Follows Need
2010: Distributed Computing → 2020: Software Defined Everything
• Monitoring platforms: ISV platforms (2010) → patchwork, open source, departmental (2020)
• Source events: custom/standard/fixed, ~ 100 – 1000 eps (2010) → chaotic, unstructured, ~ 1000 – 100,000 eps (2020)
• Configuration: flexible, TBC ~ hours (2010) → chaotic, TBC < 1 second (2020)
• Infrastructure: multi vendor, UNIX/IP/Windows client server (2010) → virtualised/containers, fluid/UNIX/mobile/micro (2020)
• Digital transformation: demands DevOps & elastic
12. Current and Future Demands
Scale
• 10^5+ Moving Parts
• 10^6+ Notifications
• 10^9+ Data Points
• 10^12 → 10^120+ Possible Failure Modes
+ Bounded by the estimated information content of the universe!
Complexity
• Reduction in the unit of compute: Mainframe → Server → VM → Container
Compulsion of Change
• Multiple orders of magnitude increase in change cycle
• Fully fluid CI/CD cycle
13. Traditional IT Ops Caught Flat-Footed
Overwhelmed by DATA and a lack of INFORMATION
• Siloed teams and tools
• Too many alerts
• No context when an incident occurs
• No early warning
• DevOps lacks proactive assurance
[Figure: statistics shown on the slide: 75-80%, ~ 90%, > 45%, > 73%; labels: "Many Siloed", "War room"]
14. IT Ops Priorities Driven by Digital Transformation
1. INCREASE frequency of change, stability and availability of IT services
2. REDUCE resource operations workload and INCREASE productivity
3. CONSOLIDATE tools
4. MIGRATE to the cloud
5. SUPPORT software-defined services
6. SUPPORT microservices-based software architecture
15. AIOps Agile and Proactive Event-to-Resolution Workflow
Early detection, fewer tickets, reduced MTTR
• Industrialised data ingestion from multiple sources
• Automatically separates signals from alert noise
• Proactively and automatically detects incidents and probable root causes (reduced MTTD)
• Enables collaborative workflows (reduces adverse business impact)
• Triggers automation to restore services
• Predictive insights (reduced support escalations and MTTR)
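The noise-reduction step of this workflow can be sketched as a small correlation routine. This is an assumption-laden illustration: it groups alerts for the same service that arrive within a fixed time window into one incident, whereas real AIOps tools use richer correlation (topology, text similarity, learned patterns). The `Alert` and `Incident` types, the `correlate` function and the 300-second window are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str
    message: str


@dataclass
class Incident:
    service: str
    first_seen: float
    last_seen: float
    alerts: list[Alert] = field(default_factory=list)


def correlate(alerts: list[Alert], window_s: float = 300.0) -> list[Incident]:
    """Group alerts per service; a gap longer than window_s opens a new incident."""
    incidents: list[Incident] = []
    open_by_service: dict[str, Incident] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_by_service.get(alert.service)
        if current is not None and alert.timestamp - current.last_seen <= window_s:
            # Same service, still within the window: attach to the open incident.
            current.alerts.append(alert)
            current.last_seen = alert.timestamp
        else:
            current = Incident(alert.service, alert.timestamp, alert.timestamp, [alert])
            open_by_service[alert.service] = current
            incidents.append(current)
    return incidents


raw_alerts = [
    Alert(0, "db", "connection pool exhausted"),
    Alert(30, "web", "HTTP 500 rate high"),
    Alert(60, "db", "slow queries"),
    Alert(120, "db", "replication lag"),
    Alert(1000, "db", "disk usage high"),
]
incidents = correlate(raw_alerts)
print(f"{len(raw_alerts)} alerts -> {len(incidents)} incidents")
```

Operators then work on a handful of incidents instead of every raw alert, which is where the "fewer tickets" benefit comes from.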
16. How AIOps Makes ITOps Robust?
• Determine the service health of mission-critical services or applications.
• Gain control of and visibility into spiraling consumption of cloud resources.
• Accelerate MTTR with automated incident management and real-time configuration management database (CMDB) updates.
• Build context-rich data lakes integrating disparate, third-party data sources.
17. AIOps Makes Teams Faster, Smarter, and More Productive
Level 0/NOC Operators
• Improve efficiency by consolidating related alerts
• Reduce catch-n-dispatch activities
Support SMEs & Developers
• Pass incident resolution knowledge to lower support tiers
• Collaborate across complex multi-disciplinary incidents
IT Operations Managers
• Deliver service-level state monitoring
• Improve efficiency and job satisfaction
• Identify and address repeating mundane work with run book automation
• Investigate and problem-solve for frequently repeating P3-P5 incidents
IT Senior Management
• Achieve an overall reduction in per-alert effort
• Repurpose the savings towards the business’s bottom line
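For the IT Operations Managers items on repeating mundane work and frequently repeating P3-P5 incidents, a first pass can be as simple as counting recurring incident signatures. The sketch below is a hypothetical illustration: the ticket fields (`service`, `symptom`, `priority`), the `runbook_candidates` function and the repeat threshold are assumptions, not something taken from a specific ITSM tool.

```python
from collections import Counter


def runbook_candidates(tickets, min_repeats=3, priorities=("P3", "P4", "P5")):
    """Return (service, symptom) pairs that recur often enough to automate."""
    counts = Counter(
        (t["service"], t["symptom"])
        for t in tickets
        if t["priority"] in priorities
    )
    # most_common() sorts by count, highest first.
    return [(key, n) for key, n in counts.most_common() if n >= min_repeats]


# Illustrative ticket history only.
tickets = [
    {"service": "mail", "symptom": "queue full", "priority": "P4"},
    {"service": "mail", "symptom": "queue full", "priority": "P4"},
    {"service": "mail", "symptom": "queue full", "priority": "P3"},
    {"service": "web", "symptom": "cert expiring", "priority": "P5"},
    {"service": "db", "symptom": "failover", "priority": "P1"},
    {"service": "mail", "symptom": "queue full", "priority": "P4"},
]
for (service, symptom), n in runbook_candidates(tickets):
    print(f"{service}: '{symptom}' repeated {n} times, consider a runbook")
```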