SlideShare una empresa de Scribd logo
1 de 13
Automated Fault-Tolerance Testing
Ajay Vaddadi, Groupon
April 11th, 2016
About Me and My Team
● Lead Internal Tools Developer @ Groupon.
● Employed with Groupon for 4 years.
● Built various Internal tools ranging from Test Automation Frameworks to
Continuous Performance and Deployment Tools (Story for another time …)
● My Team -
○ Adithya Nagarajan (Team Manager and Co-Author on the Paper)
○ Amrutha Badami
○ Kavin Arasu
Fault Tolerance - What is it and why is it important ?
● Fault Tolerance is an ability of computer software to continue its normal
operation despite the presence of system or hardware faults.
● Fault in Software terms can vary from high latency from dependent or sub-
systems to complete failure of the System under test (SUT).
● Fault injection can easily be simulated and overall environment behaviour
can be observed to make changes in design to be better tolerant to the
failure.
● Many downtimes and economic losses would have been avoided if were
tested for fault tolerance -
○ ATT Complete Network Failure (1990) - SPOF switch failure.
○ Amazon Outage (2011) - Interestingly Netflix was not affected.
About Groupon …
Engineering Scale @ Groupon
● Microservices Based Architecture
● 500+ Production Services
● In-house, Self-Managed Data
Centres spread across NA,
Europe & Asia Pacific
● Complex Interdependencies
between Services
What are the Requirements ?
● Ability to Simulate Failures and Identify resiliency issues within the
Groupon Architecture
● Ability to understand the Failure propagation pattern when a failure occurs.
● Ability to understand service architecture to better identify and group faults
on a unique set of application servers/apps/machines.
● Ability to continuously monitor critical metrics during the faults on the SUT
and dependent Services and terminate when anomaly is detected.
● Ability to recover the System under Test to pre-fault state.
● Companies like Google, Amazon, Twitter and Netflix conduct
fault tolerance testing on their platform.
● NETFLIX was one of the early players of in this Fault testing
domain by developing widely popular CHAOS MONKEY /
SIMIAN ARMY
● Other tools such as Varien, Pantera are mocked versions of
SIMIAN ARMY developed in other ecosystems like python.
● Either very company specific and are not reusable OR highly
geared towards Cloud based Infra such as Amazon Web
Services etc.
What is already done ?
Available Tools : Are we reinventing the wheel ?
ScrewDriver - Our Proposed Solution ..
● Capsule Builder - Generates time based expirable, machine specific
capsule/container to be deployed on the machines specified by Topology service.
● Controller Core - Starts/Stops the fault/capsule with proper authentication. During
fault execution works with Anomaly detector using Publisher/Subscriber model.
● Anomaly Detector - Interacts with Metric Client and identifies Anomalies. If found -
aborts the current fault execution. Sometimes, Static Threshold checks are not
enough to identify anomalies and need to delve into 2nd and 3rd derivatives.
● Notification Module - Communicates the status about the fault execution and post-
execution report to the service owners.
Controller (and it’s Friends)
Controller is a combination of 4 major modules focussing on 3 Key aspects - Building and Deploying Capsule,
Injecting Fault(Core), Monitoring the Fault(Anomaly Detector).
How does it work?
Topology Translation Engine
Service to parse and understand the building blocks (components/layers) of any service. Picks the machines and
the corresponding faults to be executed and identifies the dependencies for the given service to monitor.
● Consumes topology YAML file for a given service and identifies underlying
components/layers, machines, monitors and dependencies for the service.
● Information is pushed to GraphDB(Neo4j) for persistent storage.
● GraphDB allows us to identify dependency chain between systems without complex
joins.
● Identifies list of machines for fault to be executed based on various selection criteria
like - Traffic %, Rack, Colo, Random etc.
How does it work?
Capsule : Self Contained Execution Engine.
Self contained module which acts as an agent and executes faults on the Machine under Test.
● Deployed on demand to the machines under test. Self contained entity on its own,
only dependent on Tower for basic instructions
● Bootstrap checks ensures that the capsule was not tampered with, only used on the
machines it was signed for and also expires after a specific TTL.
● All the actions performed by Capsule are time bound thus ensuring no system will
go into unresponsive state and will automatically trigger fault termination and system
restore to bring back the system to healthy state as soon as possible.
● Executes the faults based on the Ansible styled - Fault Playbook (user defined
steps).
How does it work?
Milestones and Roadmap
● Basic Prototype was completed in January 2016 and it showed promising results.
● Plan to run this tool extensively against Groupon services over next few months (Q2
2016).
● Plan to publish a follow up article with Detailed Implementation, Learnings and Issues
Faced by Q3 2016.
● Long Term goal to Open Source the tool and evangelize for adaptation across industry
and contribution from academia.
Questions?

Más contenido relacionado

La actualidad más candente

Qtp training session I
Qtp training session IQtp training session I
Qtp training session I
Aisha Mazhar
 
Automation Testing
Automation TestingAutomation Testing
Automation Testing
Rajat Tiwari
 

La actualidad más candente (20)

Introduction to Unified Functional Testing 12 (UFT)
Introduction to Unified Functional Testing 12 (UFT)Introduction to Unified Functional Testing 12 (UFT)
Introduction to Unified Functional Testing 12 (UFT)
 
Александр Махомет "Feature Flags. Уменьшаем риски при выпуске изменений"
Александр Махомет "Feature Flags. Уменьшаем риски при выпуске изменений" Александр Махомет "Feature Flags. Уменьшаем риски при выпуске изменений"
Александр Махомет "Feature Flags. Уменьшаем риски при выпуске изменений"
 
100 tests per second - 40 releases per week
100 tests per second - 40 releases per week100 tests per second - 40 releases per week
100 tests per second - 40 releases per week
 
Test automation
Test automationTest automation
Test automation
 
Java Exceptions Best Practices
Java Exceptions Best PracticesJava Exceptions Best Practices
Java Exceptions Best Practices
 
Top 5 pitfalls of software test automatiion
Top 5 pitfalls of software test automatiionTop 5 pitfalls of software test automatiion
Top 5 pitfalls of software test automatiion
 
Case study: QTP to Selenium migration
Case study: QTP to Selenium migrationCase study: QTP to Selenium migration
Case study: QTP to Selenium migration
 
Ug. marketplace testing
Ug. marketplace testingUg. marketplace testing
Ug. marketplace testing
 
Qtp training session I
Qtp training session IQtp training session I
Qtp training session I
 
Feature toggling
Feature togglingFeature toggling
Feature toggling
 
test
testtest
test
 
Automated Regression Test Pitch Presentation
Automated Regression Test Pitch PresentationAutomated Regression Test Pitch Presentation
Automated Regression Test Pitch Presentation
 
Session 01 - Introduction to UFT and Features - Slides
Session 01 - Introduction to UFT and Features - SlidesSession 01 - Introduction to UFT and Features - Slides
Session 01 - Introduction to UFT and Features - Slides
 
How to Select the Right Automation Testing Tool
How to Select the Right Automation Testing ToolHow to Select the Right Automation Testing Tool
How to Select the Right Automation Testing Tool
 
Introduction to Automation Testing
Introduction to Automation TestingIntroduction to Automation Testing
Introduction to Automation Testing
 
Choosing the Best Open Source Test Automation Tool for You
Choosing the Best Open Source Test Automation Tool for YouChoosing the Best Open Source Test Automation Tool for You
Choosing the Best Open Source Test Automation Tool for You
 
Selenium Camp 2016 - Kiev, Ukraine
Selenium Camp 2016 -  Kiev, UkraineSelenium Camp 2016 -  Kiev, Ukraine
Selenium Camp 2016 - Kiev, Ukraine
 
Automation Testing
Automation TestingAutomation Testing
Automation Testing
 
Java Exceptions and Exception Handling
 Java  Exceptions and Exception Handling Java  Exceptions and Exception Handling
Java Exceptions and Exception Handling
 
Merge hells - Feature Toggles to the rescue
Merge hells - Feature Toggles to the rescueMerge hells - Feature Toggles to the rescue
Merge hells - Feature Toggles to the rescue
 

Destacado

Каталог кабельных креплений ПЭОТЭК
Каталог кабельных креплений ПЭОТЭККаталог кабельных креплений ПЭОТЭК
Каталог кабельных креплений ПЭОТЭК
Aleksandr Yakovlev
 

Destacado (20)

Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...
Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...
Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...
 
Web Scale Applications using NeflixOSS Cloud Platform
Web Scale Applications using NeflixOSS Cloud PlatformWeb Scale Applications using NeflixOSS Cloud Platform
Web Scale Applications using NeflixOSS Cloud Platform
 
I Don't Test Often ...
I Don't Test Often ...I Don't Test Often ...
I Don't Test Often ...
 
D1.03 ppt market research-v5
D1.03 ppt market research-v5D1.03 ppt market research-v5
D1.03 ppt market research-v5
 
Каталог кабельных креплений ПЭОТЭК
Каталог кабельных креплений ПЭОТЭККаталог кабельных креплений ПЭОТЭК
Каталог кабельных креплений ПЭОТЭК
 
The Journey of Chaos Engineering Begins with a Single Step
The Journey of Chaos Engineering Begins with a Single StepThe Journey of Chaos Engineering Begins with a Single Step
The Journey of Chaos Engineering Begins with a Single Step
 
Tema 2 apps
Tema 2 apps Tema 2 apps
Tema 2 apps
 
Imágenes representando conflicto
Imágenes representando conflicto Imágenes representando conflicto
Imágenes representando conflicto
 
Embracing Failure - Fault Injection and Service Resilience at Netflix
Embracing Failure - Fault Injection and Service Resilience at NetflixEmbracing Failure - Fault Injection and Service Resilience at Netflix
Embracing Failure - Fault Injection and Service Resilience at Netflix
 
Release the Monkeys ! Testing in the Wild at Netflix
Release the Monkeys !  Testing in the Wild at NetflixRelease the Monkeys !  Testing in the Wild at Netflix
Release the Monkeys ! Testing in the Wild at Netflix
 
TMPA-2015: Formal Methods in Robotics
TMPA-2015: Formal Methods in RoboticsTMPA-2015: Formal Methods in Robotics
TMPA-2015: Formal Methods in Robotics
 
TMPA-2015: Expanding the Meta-Generation of Correctness Conditions by Means o...
TMPA-2015: Expanding the Meta-Generation of Correctness Conditions by Means o...TMPA-2015: Expanding the Meta-Generation of Correctness Conditions by Means o...
TMPA-2015: Expanding the Meta-Generation of Correctness Conditions by Means o...
 
TMPA-2015: Automated process of creating test scenarios for financial protoco...
TMPA-2015: Automated process of creating test scenarios for financial protoco...TMPA-2015: Automated process of creating test scenarios for financial protoco...
TMPA-2015: Automated process of creating test scenarios for financial protoco...
 
TMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light SystemTMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light System
 
TMPA-2015: Multi-Module Application Tracing in z/OS Environment
TMPA-2015: Multi-Module Application Tracing in z/OS EnvironmentTMPA-2015: Multi-Module Application Tracing in z/OS Environment
TMPA-2015: Multi-Module Application Tracing in z/OS Environment
 
TMPA-2015: ClearTH: a Tool for Automated Testing of Post Trade Systems
TMPA-2015: ClearTH: a Tool for Automated Testing of Post Trade SystemsTMPA-2015: ClearTH: a Tool for Automated Testing of Post Trade Systems
TMPA-2015: ClearTH: a Tool for Automated Testing of Post Trade Systems
 
TMPA-2015: Lexical analysis of dynamically formed string expressions
TMPA-2015: Lexical analysis of dynamically formed string expressionsTMPA-2015: Lexical analysis of dynamically formed string expressions
TMPA-2015: Lexical analysis of dynamically formed string expressions
 
TMPA-2015: The Application of Parameterized Hierarchy Templates for Automated...
TMPA-2015: The Application of Parameterized Hierarchy Templates for Automated...TMPA-2015: The Application of Parameterized Hierarchy Templates for Automated...
TMPA-2015: The Application of Parameterized Hierarchy Templates for Automated...
 
TMPA-2015: Information Support System for Autonomous Spacecraft Control Macro...
TMPA-2015: Information Support System for Autonomous Spacecraft Control Macro...TMPA-2015: Information Support System for Autonomous Spacecraft Control Macro...
TMPA-2015: Information Support System for Autonomous Spacecraft Control Macro...
 
TMPA-2015: Automated Testing of Multi-thread Data Structures Solutions Lineri...
TMPA-2015: Automated Testing of Multi-thread Data Structures Solutions Lineri...TMPA-2015: Automated Testing of Multi-thread Data Structures Solutions Lineri...
TMPA-2015: Automated Testing of Multi-thread Data Structures Solutions Lineri...
 

Similar a Automated Fault Tolerance Testing

WSO2Con Asia 2014 - Effective Test Automation in an Agile Environment
WSO2Con Asia 2014 - Effective Test Automation in an Agile EnvironmentWSO2Con Asia 2014 - Effective Test Automation in an Agile Environment
WSO2Con Asia 2014 - Effective Test Automation in an Agile Environment
WSO2
 
Progressive Mobile Test Automation
Progressive Mobile Test AutomationProgressive Mobile Test Automation
Progressive Mobile Test Automation
Rakhi Jain Rohatgi
 

Similar a Automated Fault Tolerance Testing (20)

Monitoring & alerting presentation sabin&mustafa
Monitoring & alerting presentation sabin&mustafaMonitoring & alerting presentation sabin&mustafa
Monitoring & alerting presentation sabin&mustafa
 
unit-5 SPM.pptx
unit-5 SPM.pptxunit-5 SPM.pptx
unit-5 SPM.pptx
 
Test automation: Are Enterprises ready to bite the bullet?
Test automation: Are Enterprises ready to bite the bullet?Test automation: Are Enterprises ready to bite the bullet?
Test automation: Are Enterprises ready to bite the bullet?
 
WSO2Con Asia 2014 - Effective Test Automation in an Agile Environment
WSO2Con Asia 2014 - Effective Test Automation in an Agile EnvironmentWSO2Con Asia 2014 - Effective Test Automation in an Agile Environment
WSO2Con Asia 2014 - Effective Test Automation in an Agile Environment
 
Wso2con test-automation
Wso2con test-automationWso2con test-automation
Wso2con test-automation
 
The Final Frontier, Automating Dynamic Security Testing
The Final Frontier, Automating Dynamic Security TestingThe Final Frontier, Automating Dynamic Security Testing
The Final Frontier, Automating Dynamic Security Testing
 
JUST EAT: Embracing DevOps
JUST EAT: Embracing DevOpsJUST EAT: Embracing DevOps
JUST EAT: Embracing DevOps
 
Ensuring Performance in a Fast-Paced Environment (CMG 2014)
Ensuring Performance in a Fast-Paced Environment (CMG 2014)Ensuring Performance in a Fast-Paced Environment (CMG 2014)
Ensuring Performance in a Fast-Paced Environment (CMG 2014)
 
Testing strategies and best practices using MUnit
Testing strategies and best practices using MUnitTesting strategies and best practices using MUnit
Testing strategies and best practices using MUnit
 
Android testing part i
Android testing part iAndroid testing part i
Android testing part i
 
Progressive Mobile Test Automation
Progressive Mobile Test AutomationProgressive Mobile Test Automation
Progressive Mobile Test Automation
 
Automation for developers
Automation for developersAutomation for developers
Automation for developers
 
TestIstanbul 2015
TestIstanbul 2015TestIstanbul 2015
TestIstanbul 2015
 
Automated Exploratory Testing
Automated Exploratory TestingAutomated Exploratory Testing
Automated Exploratory Testing
 
PuppetConf 2017: Deploying is Only Half the Battle! Operationalizing Applicat...
PuppetConf 2017: Deploying is Only Half the Battle! Operationalizing Applicat...PuppetConf 2017: Deploying is Only Half the Battle! Operationalizing Applicat...
PuppetConf 2017: Deploying is Only Half the Battle! Operationalizing Applicat...
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
DEPLOYMENT OF CALABASH AUTOMATION FRAMEWORK TO ANALYZE THE PERFORMANCE OF AN ...
DEPLOYMENT OF CALABASH AUTOMATION FRAMEWORK TO ANALYZE THE PERFORMANCE OF AN ...DEPLOYMENT OF CALABASH AUTOMATION FRAMEWORK TO ANALYZE THE PERFORMANCE OF AN ...
DEPLOYMENT OF CALABASH AUTOMATION FRAMEWORK TO ANALYZE THE PERFORMANCE OF AN ...
 
Justin Ison
Justin IsonJustin Ison
Justin Ison
 
Project Onion unit test environment
Project Onion unit test environmentProject Onion unit test environment
Project Onion unit test environment
 
Software testing mtech project in jalandhar
Software testing mtech project in jalandharSoftware testing mtech project in jalandhar
Software testing mtech project in jalandhar
 

Último

Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
MayuraD1
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
chumtiyababu
 

Último (20)

Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptx
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 

Automated Fault Tolerance Testing

  • 1. Automated Fault-Tolerance Testing Ajay Vaddadi, Groupon April 11th, 2016
  • 2. About Me and My Team ● Lead Internal Tools Developer @ Groupon. ● Employed with Groupon for 4 years. ● Built various Internal tools ranging from Test Automation Frameworks to Continuous Performance and Deployment Tools (Story for another time …) ● My Team - ○ Adithya Nagarajan (Team Manager and Co-Author on the Paper) ○ Amrutha Badami ○ Kavin Arasu
  • 3. Fault Tolerance - What is it and why is it important ? ● Fault Tolerance is an ability of computer software to continue its normal operation despite the presence of system or hardware faults. ● Fault in Software terms can vary from high latency from dependent or sub- systems to complete failure of the System under test (SUT). ● Fault injection can easily be simulated and overall environment behaviour can be observed to make changes in design to be better tolerant to the failure. ● Many downtimes and economic losses would have been avoided if were tested for fault tolerance - ○ ATT Complete Network Failure (1990) - SPOF switch failure. ○ Amazon Outage (2011) - Interestingly Netflix was not affected.
  • 5. Engineering Scale @ Groupon ● Microservices Based Architecture ● 500+ Production Services ● In-house, Self-Managed Data Centres spread across NA, Europe & Asia Pacific ● Complex Interdependencies between Services
  • 6. What are the Requirements ? ● Ability to Simulate Failures and Identify resiliency issues within the Groupon Architecture ● Ability to understand the Failure propagation pattern when a failure occurs. ● Ability to understand service architecture to better identify and group faults on a unique set of application servers/apps/machines. ● Ability to continuously monitor critical metrics during the faults on the SUT and dependent Services and terminate when anomaly is detected. ● Ability to recover the System under Test to pre-fault state.
  • 7. ● Companies like Google, Amazon, Twitter and Netflix conduct fault tolerance testing on their platform. ● NETFLIX was one of the early players of in this Fault testing domain by developing widely popular CHAOS MONKEY / SIMIAN ARMY ● Other tools such as Varien, Pantera are mocked versions of SIMIAN ARMY developed in other ecosystems like python. ● Either very company specific and are not reusable OR highly geared towards Cloud based Infra such as Amazon Web Services etc. What is already done ? Available Tools : Are we reinventing the wheel ?
  • 8. ScrewDriver - Our Proposed Solution ..
  • 9. ● Capsule Builder - Generates time based expirable, machine specific capsule/container to be deployed on the machines specified by Topology service. ● Controller Core - Starts/Stops the fault/capsule with proper authentication. During fault execution works with Anomaly detector using Publisher/Subscriber model. ● Anomaly Detector - Interacts with Metric Client and identifies Anomalies. If found - aborts the current fault execution. Sometimes, Static Threshold checks are not enough to identify anomalies and need to delve into 2nd and 3rd derivatives. ● Notification Module - Communicates the status about the fault execution and post- execution report to the service owners. Controller (and it’s Friends) Controller is a combination of 4 major modules focussing on 3 Key aspects - Building and Deploying Capsule, Injecting Fault(Core), Monitoring the Fault(Anomaly Detector). How does it work?
  • 10. Topology Translation Engine Service to parse and understand the building blocks (components/layers) of any service. Picks the machines and the corresponding faults to be executed and identifies the dependencies for the given service to monitor. ● Consumes topology YAML file for a given service and identifies underlying components/layers, machines, monitors and dependencies for the service. ● Information is pushed to GraphDB(Neo4j) for persistent storage. ● GraphDB allows us to identify dependency chain between systems without complex joins. ● Identifies list of machines for fault to be executed based on various selection criteria like - Traffic %, Rack, Colo, Random etc. How does it work?
  • 11. Capsule : Self Contained Execution Engine. Self contained module which acts as an agent and executes faults on the Machine under Test. ● Deployed on demand to the machines under test. Self contained entity on its own, only dependent on Tower for basic instructions ● Bootstrap checks ensures that the capsule was not tampered with, only used on the machines it was signed for and also expires after a specific TTL. ● All the actions performed by Capsule are time bound thus ensuring no system will go into unresponsive state and will automatically trigger fault termination and system restore to bring back the system to healthy state as soon as possible. ● Executes the faults based on the Ansible styled - Fault Playbook (user defined steps). How does it work?
  • 12. Milestones and Roadmap ● Basic Prototype was completed in January 2016 and it showed promising results. ● Plan to run this tool extensively against Groupon services over next few months (Q2 2016). ● Plan to publish a follow up article with Detailed Implementation, Learnings and Issues Faced by Q3 2016. ● Long Term goal to Open Source the tool and evangelize for adaptation across industry and contribution from academia.