SRE_Ch12_effective_troubleshooting

Pjack Chen

SRE Study Group Ch12 Effective Troubleshooting

Effective Troubleshooting
Dec. 28, 2017
Eric Chen, Trendmicro

https://www.slideshare.net/jallspaw/responding-to-outages-maturely

Problem Report
● Expected behavior
● Actual behavior
● How to reproduce
● Store in a searchable location, e.g. bug tracking system
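As a minimal sketch of the fields listed above, a problem report can be modeled as a small record and searched by keyword. The class name, field names, and the in-memory list are assumptions made for illustration; a real team would use its bug tracking system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProblemReport:
    """Hypothetical record holding the three items a report should state."""
    expected: str
    actual: str
    steps_to_reproduce: Tuple[str, ...]


def search(reports: List[ProblemReport], keyword: str) -> List[ProblemReport]:
    """Return reports whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        r for r in reports
        if needle in " ".join((r.expected, r.actual, *r.steps_to_reproduce)).lower()
    ]


# Illustrative data only; these incidents are made up for the example.
reports = [
    ProblemReport(
        expected="Login page loads within 2 seconds",
        actual="Login requests end in a gateway timeout",
        steps_to_reproduce=("Open the login page", "Submit valid credentials"),
    ),
    ProblemReport(
        expected="Nightly report email is sent",
        actual="No email arrives",
        steps_to_reproduce=("Wait for the nightly job",),
    ),
]
matches = search(reports, "timeout")
```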

Triage
● Measure the severity
Severity   Initial Response   Update Time   Resolution Time / SLO
P0         15 min             Real time     4 hours
P1         30 min             2 hours       8 hours
P2         1 hour             6 hours       36 hours
Triage incident in Trendmicro
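The severity targets in the table can be turned into concrete deadlines for a given incident. The sketch below is one possible encoding, not a description of any real tooling: the names `Targets`, `TARGETS`, and `deadlines` are hypothetical, and `None` is used here to stand for "real time" updates.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Targets(NamedTuple):
    initial_response: timedelta
    update_interval: Optional[timedelta]  # None stands for "real time" updates
    resolution: timedelta


# Values copied from the triage table; the structure itself is an assumption.
TARGETS = {
    "P0": Targets(timedelta(minutes=15), None, timedelta(hours=4)),
    "P1": Targets(timedelta(minutes=30), timedelta(hours=2), timedelta(hours=8)),
    "P2": Targets(timedelta(hours=1), timedelta(hours=6), timedelta(hours=36)),
}


def deadlines(severity: str, opened_at: datetime) -> dict:
    """Compute when the first response and the resolution are due."""
    targets = TARGETS[severity]
    return {
        "respond_by": opened_at + targets.initial_response,
        "resolve_by": opened_at + targets.resolution,
    }


opened = datetime(2017, 12, 28, 10, 0)
p1 = deadlines("P1", opened)
p2 = deadlines("P2", opened)
```

An unknown severity raises `KeyError`, which is deliberate in this sketch: an incident should not proceed without a severity.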

Stopping the bleeding should be your first priority

Workaround
● Roll back
● Restart / Reboot
● Deploy a new node
● Scale the instance up/out
● Fail over the database
● Shut down the service
My experience

Examine
● Graphing time-series metrics
● Logging
○ Structured binary format
○ Multiple verbosity levels, changeable on the fly
○ Searchable
● Exposing current state
○ Endpoint to show error rate and latency
○ Configuration
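Two of the points above, changing log verbosity on the fly and exposing current state, can be sketched with the Python standard library. This is an illustration under assumptions: the `ServiceState` class, the logger name `payments`, and the sample numbers are hypothetical, and a real service would serve the snapshot over an HTTP endpoint, not keep it in a variable.

```python
import io
import logging
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceState:
    """Hypothetical in-memory counters that a status endpoint could serialize."""
    requests: int = 0
    errors: int = 0
    latencies_ms: List[float] = field(default_factory=list)

    def record(self, latency_ms: float, ok: bool) -> None:
        self.requests += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    def snapshot(self) -> dict:
        n = self.requests
        return {
            "error_rate": self.errors / n if n else 0.0,
            "avg_latency_ms": sum(self.latencies_ms) / n if n else 0.0,
        }


# Log to an in-memory buffer so the effect of the level change is visible.
buffer = io.StringIO()
log = logging.getLogger("payments")
log.handlers.clear()
log.propagate = False
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
log.addHandler(handler)
log.setLevel(logging.WARNING)

log.debug("cache miss for key=42")  # dropped: DEBUG is below WARNING
log.setLevel("DEBUG")               # raise verbosity without restarting
log.debug("cache miss for key=42")  # recorded this time
captured = buffer.getvalue()

state = ServiceState()
state.record(100.0, ok=True)
state.record(300.0, ok=False)
current = state.snapshot()
```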

Diagnose
● Simplify and reduce
○ Divide and conquer
● Ask “what”, “where”, and “why”
● What touched it last
● Diagnostic tools
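Divide and conquer combines naturally with "what touched it last": given an ordered list of changes, bisection finds the first bad one in a logarithmic number of checks. The sketch below assumes the changes are ordered oldest to newest, that the newest one is known to be bad, and that once a change is bad every later one is bad too. The function name and the release labels are hypothetical.

```python
from typing import Callable, Sequence


def first_bad(changes: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest change for which is_bad returns True.

    Assumes the last change is bad and that badness, once introduced,
    persists in all later changes.
    """
    if not changes:
        raise ValueError("no changes to search")
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(changes[mid]):
            hi = mid        # the culprit is at mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return changes[lo]


# Illustrative data: pretend the regression arrived with v4.
releases = ["v1", "v2", "v3", "v4", "v5"]
checked = []


def regression_present(release: str) -> bool:
    checked.append(release)
    return releases.index(release) >= 3


culprit = first_bad(releases, regression_present)
```

In practice `is_bad` would deploy or check out the change and run a reproduction of the problem, which is why keeping the number of checks small matters.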

https://www.slideshare.net/brendangregg/linux-performance-tools

Test and Treat
● Tests should have mutually exclusive alternatives
● Consider the obvious first
● Experiments may provide misleading results
● Active tests may have side effects
● Take clear notes before performing active testing
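A test with mutually exclusive alternatives splits the open hypotheses into two groups, so either outcome rules one group out. The sketch below tracks that elimination and keeps notes along the way. The `Investigation` class, the hypotheses, and the test descriptions are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Investigation:
    """Hypothetical helper that narrows hypotheses and records notes."""
    hypotheses: Set[str]
    notes: List[str] = field(default_factory=list)

    def run_test(self, description: str, consistent_if_pass: Set[str], passed: bool) -> None:
        """Keep only the hypotheses consistent with the observed outcome."""
        before = set(self.hypotheses)
        keep = consistent_if_pass if passed else before - consistent_if_pass
        self.hypotheses = before & keep
        ruled_out = sorted(before - self.hypotheses)
        outcome = "pass" if passed else "fail"
        self.notes.append(f"{description}: {outcome}; ruled out {ruled_out}")


inv = Investigation({"network partition", "bad config push", "database overload"})

# If the app server can reach the database host, a partition is ruled out.
inv.run_test(
    "app server can reach database host",
    consistent_if_pass={"bad config push", "database overload"},
    passed=True,
)

# If errors persist after rolling the config back, the config push is ruled out.
inv.run_test(
    "errors persist after config rollback",
    consistent_if_pass={"network partition", "database overload"},
    passed=True,
)
remaining = inv.hypotheses
```

Note that the second test is an active test: rolling back a config changes the system, which is why the note is written at the same time the result is recorded.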

Cure
● Proving the root cause may be hard
○ Systems are complex, with multiple contributing factors
○ Reproducing the problem in live production may not be an option
● Postmortem is important

Postmortem
● Problem report
● Business impact
● Workaround / Fix
● Root cause
● Technical details
● Action items
My experience
