Amazon Major Cloud Outage Analysis
Author: Rahul Tyagi
2
The Agenda
• The Issue
• The Goals
• Analysis Methodology
• The Analysis
3
The Issue
• Amazon's cloud is now deeply embedded in
enterprises, so a major Amazon cloud outage
has widespread impact…
• Organizations such as Netflix, Dropbox, Airbnb
and Pinterest have been affected by Amazon cloud
outages
4
The Issue
• Major cloud outages have been fairly regular events in
the recent past. Some of the major outages:
• Dec/24/2012
• Oct/22/2012
• Jun/29/2012
• Apr/21/2011
5
The Goals
• We want to analyze the chain of events behind major
Amazon cloud outages (based on official Amazon
statements)…
• We analyzed the major outages of the past 2 years…
• The goal is to identify probable root causes and
areas with room for improvement…
6
Analysis Methodology
We use the “Analytic Hierarchy Process”
to identify root causes…
7
Analysis Methodology
• Step 1: Analyze Amazon’s statements about the outage
• Step 2: Identify the “chain of events” causing the outage
• Step 3: Categorize the “chain of events”
• Step 4: Analysis and conclusion
8
The Analysis > Analyze Amazon’s Statements about Outages
We analyzed the following official Amazon statements…
Outage Date    Amazon’s Statement
Dec/24/2012    http://aws.amazon.com/message/680587/
Oct/22/2012    http://aws.amazon.com/message/680342/
Jun/29/2012    http://aws.amazon.com/message/67457/
Apr/21/2011    http://aws.amazon.com/message/65648/
9
The Analysis > Identify “Chain of Events” causing outages
Outage    Core Issue
Dec-12    “The [ELB State] data was deleted by a maintenance process that was inadvertently run against the production ELB state data”
Oct-12    “The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers”
Jun-12    “In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply (“UPS”) power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT”
Apr-11    “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”
The statements in double quotes are from
Amazon’s press releases…
10
The Analysis > Identify “Chain of Events” causing outages
Outage    Chain of Events
Dec-12
• "Maintenance process inadvertently run against production ELB state data"
• Process for incident approval had loose ends
• Validation of the output of the maintenance process (which ran inadvertently) was missing
• "load balancers that were modified were improperly configured by the control plane"
Oct-12
• "latent bug in an (EBS) operational data collection agent"
• "latent memory leak bug in the reporting agent"; monitoring for the memory leak was nonexistent
• "the DNS update did not successfully propagate to all of the internal DNS servers"
• "the (aggressive) throttling policy that was put in place was too aggressive"
Jun-12
• "datacenter that did not successfully transfer to the generator backup"
• "As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT"
• "a small number of Multi-AZ RDS instances did not complete failover, due to a software bug"
• "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before"
Apr-11
• “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”
• "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures"
• "We will audit our change process and increase the automation to prevent this mistake from happening in the future"
• "We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster"
11
The Analysis > Categorize “Chain of Events”
Outage    Chain of Events    Hardware Software Automation Process
Dec-12
• "Maintenance process inadvertently run against production ELB state data"    X X
• Process for incident approval had loose ends    X
• Validation of the output of the maintenance process (which ran inadvertently) was missing    X X X
• "load balancers that were modified were improperly configured by the control plane"    X
Oct-12
• "latent bug in an (EBS) operational data collection agent"    X X
• "latent memory leak bug in the reporting agent"; monitoring for the memory leak was nonexistent    X X
• "the DNS update did not successfully propagate to all of the internal DNS servers"    X X
• "the (aggressive) throttling policy that was put in place was too aggressive"    X X
Jun-12
• "datacenter that did not successfully transfer to the generator backup"    X
• "As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT"    X
• "a small number of Multi-AZ RDS instances did not complete failover, due to a software bug"    X X
• "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before"    X X
Apr-11
• “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”    X
• "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures"    X
• "We will audit our change process and increase the automation to prevent this mistake from happening in the future"    X
• "We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster"    X X
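The marks in the categorization table can be tallied per outage as a quick sanity check. The sketch below is illustrative only: it uses the number of X marks on each row as transcribed from the table, without assuming which category column each mark belongs to. The variable names are made up for this example.

```python
# Number of X marks on each chain-of-events row, in table order.
# Assumption: only the count of marks per row is used here; the column
# (Hardware / Software / Automation / Process) of each mark is not modeled.
marks_per_row = {
    "Dec-12": [2, 1, 3, 1],
    "Oct-12": [2, 2, 2, 2],
    "Jun-12": [1, 1, 2, 2],
    "Apr-11": [1, 1, 1, 2],
}

# Sum the marks for each outage, then across all outages.
marks_per_outage = {outage: sum(rows) for outage, rows in marks_per_row.items()}
total_marks = sum(marks_per_outage.values())

for outage, count in marks_per_outage.items():
    events = len(marks_per_row[outage])
    print(f"{outage}: {count} category marks across {events} events")
print(f"Total category marks: {total_marks}")
```

The total gives the number of issue-to-category assignments that feed the conclusions that follow.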
12
The Analysis > Analysis and Conclusions
Process issues are a common theme in major
outages of the Amazon cloud…
13
The Analysis > Analysis and Conclusions
Chart: Amazon Cloud Major Outage - Issue Categories (# of issues)
Software: 8
Automation: 4
Process: 14
Process and Software are the leading contributing
factors to major outages at Amazon…
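To make the chart's proportions explicit, the three counts can be turned into shares of the total. This is a minimal sketch using only the counts shown in the chart. The names `issue_counts` and `shares` are illustrative, not part of the original analysis.

```python
# Issue counts per category, as shown in the chart.
issue_counts = {"Software": 8, "Automation": 4, "Process": 14}

# Total number of category assignments and each category's share of it.
total = sum(issue_counts.values())
shares = {category: count / total for category, count in issue_counts.items()}

# Print categories from most to least frequent.
for category, count in sorted(issue_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<10} {count:>2}  {shares[category]:.1%}")
```

Process accounts for 14 of the 26 assignments, a little over half, and Process and Software together account for 22 of 26.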
14
The Analysis > Analysis and Conclusions
• The majority of issues contributing to outages are
related to process or software
• “Process” rigor in cloud operations and the
SDLC at Amazon appears to have room for improvement
• Culture? We have heard that Amazon has a “just do it”
culture; process rigor may require more than
“just do it”
15
Thank You! You are Awesome! You deserve applause!!
