Nothing Good Ever Happens After 2am
Reversim 2019
Daniel Korn
Engineering Team Lead at BigPanda 
korndaniel1
BigPanda’s Outage Procedure
Roles and responsibilities
• On-call
• Incident Manager On-Call (IMOC)
• Tech Lead On-Call (TLOC)
• Support On-Call (SOC)
Incident Priority Definitions
Priority | Affects | Outage Resolution
P1 | Core feature, multiple customers | 24/7
P2 | Core feature, single customer | 24/7
P3 | Secondary feature, no workaround | Next business day
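The priority definitions above can be read as a small decision rule. The sketch below is one possible encoding, not BigPanda's actual tooling: the names `Priority`, `RESOLUTION` and `classify` are hypothetical, and the slide does not say what happens to a secondary-feature issue that has a workaround, so that case returns `None` here as an assumption.

```python
from enum import Enum
from typing import Optional


class Priority(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


# Resolution window per priority, copied from the definitions above.
RESOLUTION = {
    Priority.P1: "24/7",
    Priority.P2: "24/7",
    Priority.P3: "Next business day",
}


def classify(core_feature: bool, customers_affected: int,
             has_workaround: bool) -> Optional[Priority]:
    """Map an incident's impact to a priority (hypothetical helper)."""
    if customers_affected < 1:
        raise ValueError("at least one customer must be affected")
    if core_feature:
        # Core feature: multiple customers is P1, a single customer is P2.
        return Priority.P1 if customers_affected > 1 else Priority.P2
    if not has_workaround:
        # Secondary feature with no workaround.
        return Priority.P3
    # Not covered by the slide; treated as "no outage priority" (assumption).
    return None
```

For example, `classify(True, 3, False)` yields `Priority.P1`, and `RESOLUTION[Priority.P3]` gives the next-business-day window.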
Tools
• Alerting
• Communication
• Observability
1. Alert/Support notifies On-call
2. On-call performs simple mitigation
3. On-call escalates to IMOC
4. IMOC assesses impact, determines P1/P2/P3
5. IMOC escalates to TLOC and SOC
6. On-call: if (P1) { StatusPage; dedicated channel; }
7. SOC updates customers
8. R&D mitigates until solved, updates StatusPage
9. IMOC verifies resolved, posts summary in channel
10. IMOC runs postmortem, shares with stakeholders
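The procedure above can be sketched as a checklist generator. This is only an illustration under stated assumptions: the function name `outage_steps` is hypothetical, the step order is inferred from the roles (the slide layout does not make it explicit), and only the StatusPage and dedicated-channel step is taken to depend on the incident being a P1, as the slide's `if (P1)` suggests.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (role, action)


def outage_steps(priority: str) -> List[Step]:
    """Return the ordered (role, action) checklist for an incident."""
    if priority not in ("P1", "P2", "P3"):
        raise ValueError(f"unknown priority: {priority}")

    steps: List[Step] = [
        ("Alert/Support", "notify On-call"),
        ("On-call", "perform simple mitigation"),
        ("On-call", "escalate to IMOC"),
        ("IMOC", "assess impact, determine P1/P2/P3"),
        ("IMOC", "escalate to TLOC and SOC"),
    ]
    if priority == "P1":
        # Only P1 incidents get a StatusPage entry and a dedicated channel.
        steps.append(("On-call", "open StatusPage and a dedicated channel"))
    steps += [
        ("SOC", "update customers"),
        ("R&D", "mitigate until solved, update StatusPage"),
        ("IMOC", "verify resolved, post summary in channel"),
        ("IMOC", "run postmortem, share with stakeholders"),
    ]
    return steps
```

Calling `outage_steps("P1")` returns all ten steps; for P2 and P3 the StatusPage step is left out.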
The Long Night
THIS IS A TRUE STORY.
The events depicted in this postmortem took place in Tel Aviv and San Francisco in 2018.

Despite the request of the survivors, the names have not been changed.
Out of respect for our customers, the story has been told exactly as it occurred.
Michal
On-call
Almog & Pini
TLOCs
Daniel (Me)
TLOC
Shmeff Andru
SOC Support
Julio
Support
Background
• REMINDER: BigPanda’s SLA
• New Access Control (RBAC) service
• Not all customers migrated
• Sunday: Multi-service deployment
[MON 05:03 PM] SOC
Multiple tickets: “cannot update environments”

[05:05 PM] On-call
Asks SOC for details, opens a dedicated Slack channel

[05:08 PM] On-call
Identifies as Auth-related, notifies TLOCs

[05:33 PM] SOC
Considers opening a status page, but “might be a P3”

[05:35 PM] On-call
“We think it’s related to a deploy, working on a fix”

[06:16 PM] SOC
Opens status page
TAKEAWAY: Stick to the Plan
[06:50-07:30 PM] TLOCs
Fix is tested, not reproduced; debate fix or revert

[07:41 PM] TLOCs
Deploy fix to production

[07:45-08:05 PM] SOC
Verifies together with TLOCs that the issue is resolved

[08:10 PM] SOC
Closes status page
On-call and TLOCs leaving
TAKEAWAY: Rule of Thumb: REVERT FIRST
[12:45 AM] Support
“Some customers can’t see roles in the env editor”

[12:57 AM] SOC
“So it appears to be just a UI issue”. Notifies On-call

[12:59 AM] On-call
Notifies TLOC

[01:01 AM] TLOC
Starts investigating the issue
“If it looks like an outage, and (support) sounds like an outage, then it might be just a bug”
– Someone smart
TAKEAWAY: Do not Assume an Outage
[01:20 AM] TLOCs
Identify the cause, start working on a fix

[01:54 AM] TLOCs
Deploy fix to production, ask SOC to verify with customers
“If you think this has a happy ending, you haven’t been paying attention.”
– Ramsay Bolton
[01:57 AM] Support
Customers reporting the initial issue - “cannot update environments”

[02:00 AM] SOC + Support
Debating whether to re-open StatusPage

[02:03 AM] TLOCs
Start investigating the issue
[02:10 AM] TLOCs
Identify the cause - lack of permissions (migration)

[02:15-02:51 AM] TLOC
Manually adds missing permissions to customers DB
TAKEAWAY: Time to Call it a Night
[02:52 AM] TLOC
Having problems with a specific customer

[02:56 AM] SOC
Verifies this customer is facing the issue

[02:56-03:25 AM] TLOCs
Identify the problem - edge case involving FT and manual customizations

[03:25 AM] SOC
Asks TLOC to discuss the situation on a phone call
[03:29 AM -] SOC + TLOC
Sensitive customer, no changes, issue remains

[- 04:07 AM] SOC + TLOC
SOC asks TLOC to commit to fix by EOD

[09:30 AM - 05:12 PM] TLOCs
Implement a fix, deploy to production, ask SOC to verify

[05:25 PM] SOC
Verifies issue resolved
TAKEAWAY: Do not Commit to Action Items
[WED 11:00 AM] R&D
Conduct a postmortem, share with R&D and CS

[07:00 PM] CS + R&D + PM
Joint postmortem, preparing customer updates
“Chaos isn’t a pit. Chaos is a ladder.”
– Petyr “Littlefinger” Baelish
Recap
• Stick to the plan
• Rule of thumb: REVERT FIRST
• Do not assume an outage
• Time to call it a night
• Do not commit to action items