Helping operations top-heavy
teams the smart way
(Lessons from my experience being loaned out to SRE teams)
Michael Kehoe
Staff Site Reliability Engineer
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years
American
• Former Network Engineer at the
University of Queensland
Production-SRE Team @ LinkedIn
$ WHOAMI
• Disaster Recovery - Planning &
Automation
• Incident Response – Process &
Automation
• Visibility Engineering – Making use of
operational data
• Reliability Principles – Defining best
practice & automating it
This talk is not
• How to quickly erase all your
technical debt
• How to change your engineering
culture
This talk is
• How to identify team anti-patterns
• How to work through high toil
• How to create sustainable
workloads
Today’s
agenda
1 Background
2 Scenario 1: Resource Allocation
3 Scenario 2: Technical Debt
4 Scenario 3: High Toil
5 Building A Formula For Success
6 Key Learnings
7 Q&A
Background
Personal Experience in the past 15 months
ASSISTANCE RENDERED
• Traffic-SRE: Resource Allocation
• Voyager-SRE: Technical Debt
• Capacity War-room
• Espresso-SRE: Reliability
Scenario 1: Resource
Allocation
Problem Statement
Resource Allocations
• Lack of written documentation
• Backlog of work for clients
• Alert Fatigue
Scenario 2: Technical Debt
Problem Statement
Technical Debt
• New frontend service
• Understanding performance is
complicated
• Management of dependent
services was difficult
Scenario 3: High Toil
Problem Statement
High Toil
• Large multi-tenant/multi-cluster
database team
• Lack of maturity in team-specific
automation
• Alert Fatigue
Building a formula for
success
Code Yellow
Building a formula for success
Define the areas
that need attacking
Problem Statement
Communicate
expectations with
clients & partners
Communication &
Partnerships
Define success
criteria
Exit Criteria
Get the help that
you require
Resource
Acquisition
Plan for short-term
& long-term
Planning
Define the areas that need attacking
Problem Statement
• Admit there is a problem
• Measure the problem
• Understand the problem
• Determine the underlying causes that
need to be fixed
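As an illustration of "Measure the problem", here is a minimal sketch of how a team might quantify toil from a simple work log. The record format, the category names ("alerts", "tickets", "project") and the choice of which categories count as toil are all assumptions made for this example, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class WorkEntry:
    """One logged block of work (hypothetical record format)."""
    category: str  # assumed labels, e.g. "alerts", "tickets", "project"
    hours: float


def toil_fraction(entries, toil_categories=("alerts", "tickets")):
    """Return the share of logged hours spent in toil categories (0.0 to 1.0)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0  # nothing logged, so nothing to report
    toil = sum(e.hours for e in entries if e.category in toil_categories)
    return toil / total


# Hypothetical week for one engineer.
week = [
    WorkEntry("alerts", 18),
    WorkEntry("tickets", 6),
    WorkEntry("project", 16),
]
print(f"Toil share this week: {toil_fraction(week):.0%}")
```

Tracking this number week over week gives a baseline to compare against once remediation work starts.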
Building a formula for success
Define success criteria
Exit Criteria
• Define concrete goals
• Define concrete success criteria
• Measure via an operational metric
• Measure via a project being
completed
• Define timelines for completion
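One way to keep exit criteria concrete is to write them down as data, each with a target, a current value and a deadline. The sketch below is only an illustration: the class, the field names and the example numbers are hypothetical and do not come from the talk. A completed project is modeled as a metric that goes from 0 to 1.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExitCriterion:
    """A single measurable goal (hypothetical structure)."""
    name: str
    target: float
    current: float
    deadline: date
    lower_is_better: bool = True

    def met(self):
        """True once the current value has reached the target."""
        if self.lower_is_better:
            return self.current <= self.target
        return self.current >= self.target


def can_exit(criteria):
    """The effort is done only when every criterion is met."""
    return all(c.met() for c in criteria)


def overdue(criteria, today):
    """Names of criteria that are unmet and past their deadline."""
    return [c.name for c in criteria if not c.met() and today > c.deadline]


# Hypothetical example values.
criteria = [
    ExitCriterion("pages per on-call shift", target=5, current=12,
                  deadline=date(2019, 3, 1)),
    ExitCriterion("runbook automation project done", target=1, current=1,
                  deadline=date(2019, 2, 1), lower_is_better=False),
]
print(can_exit(criteria))
print(overdue(criteria, today=date(2019, 3, 15)))
```

Keeping operational metrics and project completions in the same list makes it easy to report progress against all criteria in one place.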
Building a formula for success
Get the help you require
Resource Acquisition
• Ask other teams for help
• Get dedicated engineers/project
managers/other roles as required
• Set exit-date for resources
Building a formula for success
Plan for the short-term & long-term
Planning
• Plan out short-term work
• Plan out longer-term projects
• Do they need to be rescheduled?
• Prioritize work that will reduce toil &
burnout (Automation +
Measurement)
Building a formula for success
Communicate expectations with
clients & partners
Communication &
Partnerships
• Communicate problem statement &
exit criteria
• Send regular progress updates
• Ensure that stakeholders
understand delays & expected
outcomes
Building a formula for success
When Operations Isn’t Perfect
Code Yellow
https://devops.com/code-yellow-when-operations-isnt-perfect/
Key Learnings
Measure toil/overhead
Measure
Prioritize efforts to
remove overhead/toil
Prioritize
Communicate with
partners & teams
Communicate
Q&A
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Último (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Helping operations top-heavy teams the smart way

  • 1. Helping operations top-heavy teams the smart way (Lessons from my experience being loaned out to SRE teams) Michael Kehoe Staff Site Reliability Engineer
  • 2. Michael Kehoe $ WHOAMI • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  • 3. Production-SRE Team @ LinkedIn $ WHOAMI • Disaster Recovery - Planning & Automation • Incident Response – Process & Automation • Visibility Engineering – Making use of operational data • Reliability Principles – Defining best practice & automating it
  • 4. This talk is not • How to quickly erase all your technical debt • How to change your engineering culture
  • 5. This talk is • How to identify team anti-patterns • How to work through high-toil • How to create sustainable workloads
  • 6. Today’s agenda 1 Background 2 Scenario 1: Resource Allocation 3 Scenario 2: Technical Debt 4 Scenario 3: High Toil 5 Building A Formula For Success 6 Key Learnings 7 Q&A
  • 8. Personal Experience in the past 15 months ASSISTANCE RENDERED • Traffic-SRE: Resource Allocation • Voyager-SRE: Technical Debt • Capacity War-room • Espresso-SRE: Reliability
  • 10. Problem Statement Resource Allocations • Lack of written documentation • Backlog of work for clients • Alert Fatigue
  • 12. Problem Statement Technical Debt • New frontend service • Understanding performance is complicated • Management of dependent services was difficult
  • 14. Problem Statement High Toil • Large multi-tenant/ multi-cluster database team • Lack of maturity in team-specific automation • Alert Fatigue
  • 15. Building a formula for success
  • 17. Building a formula for success Define the areas that need attacking Problem Statement Communicate expectations with clients & partners Communication & Partnerships Define success criteria Exit Criteria Get the help that you require Resource Acquisition Plan for short-term & long-term Planning
  • 18. Define the areas that need attacking Problem Statement • Admit there is a problem • Measure the problem • Understand the problem • Determine underlying causes that need to be fixed Building a formula for success
  • 19. Define success criteria Exit Criteria • Define concrete goals • Define concrete success criteria • Measure via an operational metric • Measure via a project being completed • Define timelines for completion Building a formula for success
  • 20. Get the help you require Resource Acquisition • Ask other teams for help • Get dedicated engineers/ project managers/ other roles as required • Set exit-date for resources Building a formula for success
  • 21. Plan for the short-term & long-term Planning • Plan out short-term work • Plan out longer-term projects • Do they need to be rescheduled? • Prioritize work that will reduce toil & burnout (Automation + Measurement) Building a formula for success
  • 22. Communicate expectations with clients & partners Communication & Partnerships • Communicate problem statement & exit criteria • Send regular progress updates • Ensure that stakeholders understand delays & expected outcomes Building a formula for success
  • 23. When Operations Isn’t Perfect Code Yellow https://devops.com/code-yellow-when-operations-isnt-perfect/
  • 25. Key Learnings Measure toil/ overhead Measure Prioritize efforts to remove overhead/toil Prioritize Communicate with partners & teams Communicate
  • 26. Q&A
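
Slides 19 and 25 recommend measuring toil and tying exit criteria to an operational metric. As a minimal sketch of what that could look like, the Python below computes the share of logged hours spent on toil and compares it to a threshold. The `WorkEntry` record, the function names, the sample hours and the 40% threshold are all assumptions made for illustration; none of them come from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WorkEntry:
    """One logged block of engineering time (hypothetical record shape)."""
    hours: float
    is_toil: bool  # True for manual, repetitive, interrupt-driven work


def toil_fraction(entries: list[WorkEntry]) -> float:
    """Return the share of logged hours spent on toil (0.0 if nothing is logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


def exit_criterion_met(entries: list[WorkEntry], max_toil_fraction: float) -> bool:
    """Check logged work against an assumed exit threshold for toil."""
    return toil_fraction(entries) <= max_toil_fraction


# Illustrative week of logged work (made-up numbers).
week = [
    WorkEntry(hours=12, is_toil=True),   # paging and manual restarts
    WorkEntry(hours=8, is_toil=True),    # hand-run client requests
    WorkEntry(hours=20, is_toil=False),  # automation project work
]

print(toil_fraction(week))            # 0.5
print(exit_criterion_met(week, 0.4))  # False
```

Tracking one number like this week over week gives a team something concrete to report in progress updates and a clear signal for when loaned engineers can return to their own teams.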

Editor's notes

  1. Michael: So we’re a part of a team at LinkedIn called Production-SRE. The key tenets of Production-SRE at LinkedIn are: assisting in restoring stability during site-critical issues; developing applications to reduce MTTD and MTTR; providing direction and guidelines for site troubleshooting; and building tools for efficient site-issue troubleshooting, issue detection and correlation. As this presentation goes on, you’ll notice how an Event Correlation system fits into these.
  2. This talk isn’t about how to magically erase all of your technical debt. Neither is it a talk on changing your engineering culture.
  3. This talk is about how to identify team anti-patterns, how to work through high toil, and how to create sustainable workloads.
  5. So the first scenario I want to discuss is when I got pulled into the Traffic team due to severe resource allocation issues. We had a team that lacked written documentation on how their platform worked and was deployed. They had a large backlog of work for clients. And there was a large amount of alert fatigue due to some poorly defined alerts and some infrastructure that needed upgrading but that they hadn’t gotten to yet. On top of that, 4 of the 5 team members left in a short period of time and started doing reliability operations at another company together. So we’re in a bit of a pickle here. In response, we took 5 staff SREs from other teams and dedicated them to the Traffic team for a period of 3 months, stopped all non-critical client work for a number of weeks, completely recreated all monitoring systems, spent a large chunk of time removing complexity, and focused on infrastructure reliability.
  6. The second team I worked with was our frontend API service team.
  7. Thousands of instances; lack of maturity in automation for the team; alert fatigue given the size of their infrastructure; poor visibility into ops metrics.