Interactive Session on "Chaos engineering: Break it to make it" by Anupam Agarwal, Nagarro, and Peeyush Girdhar, Cloud / DevOps, Nagarro, at #ATAGTR2021.
#ATAGTR2021 was the 6th Edition of the Global Testing Retreat.
The video recording of the session is now available on the following link: https://www.youtube.com/watch?v=4bM4f8xNp2A
To know more about #ATAGTR2021, please visit: https://gtr.agiletestingalliance.org/
3. AGENDA
01 Concept of Chaos Engineering
02 Need for Chaos Engineering
03 Chaos Engineering vs Normal Testing
04 Start your journey with Chaos Engineering
4. Why the World Needs More Resilient Systems?
1 BREACH: Organizations confirmed or suspected breaches tied to their applications or infrastructure.
2 MATURITY: Organizations that are in an immature or improving state with respect to environment resilience.
3 TEAMS: Teams have not incorporated resilience testing in their design during the initial stages of the SDLC.
4 TESTING: Traditional testing is still not helping them find issues within their ecosystems.
24%
86%
65%
47%
Common issues faced by multiple organizations
5. Chaos Engineering: Where are we?
The art of breaking things purposefully
Ever since Netflix introduced Chaos Engineering through their Simian Army toolset in 2012, the idea of inducing failure as a preventative means has become one of the preferred resilience techniques for cloud-native distributed systems.
“Chaos Engineering is the discipline of experimenting on a distributed system in order to induce artificial failures to build confidence in the system's capability to withstand turbulent conditions in production.”
Here's how Netflix describes why they built these chaos tools:
The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee
100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud
architecture where individual components can fail without affecting the availability of the entire
system. In effect, we have to be stronger than our weakest link.
6. Why Chaos Engineering?
Chaos Engineering is Preventive Medicine
LEARN: Chaos Engineering is an approach to learning how your system behaves by applying a discipline of empirical exploration.
BUILD CONFIDENCE: Chaos Engineering enables organizations to develop reliable and fault-tolerant software systems and builds your team’s confidence in them. The more stable your systems are, the more confident you can be that they will function properly.
PREVENT OUTAGES: By designing and executing Chaos Engineering experiments, you will learn about weaknesses in your system that could potentially lead to outages in customer environments.
7. Getting Started with Chaos Engineering
Disciplined approach to find failures before they become outages.
DEFINE ‘STEADY STATE’: Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
CREATE HYPOTHESIS: Hypothesize that this steady state will continue in both the control group and the experimental group.
RUN EXPERIMENTS: Introduce attacks that reflect real-world events like a server crash, a hard drive malfunction, a network outage, etc.
INTERPRET THE RESULTS: Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
LEARN & IMPROVE: Improve functionalities in the existing system based on the above experiments and their results.
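The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real chaos tool: the "services" are simulated functions, the steady-state metric is assumed to be the error rate, and the names `make_service`, `measure_error_rate`, `run_experiment` and the 5% tolerance are hypothetical choices made for this example.

```python
import random


def make_service(rng, failure_probability):
    """Return a simulated service call (hypothetical stand-in for a real system).

    The call returns True on success and False on failure.
    """
    def service():
        return rng.random() >= failure_probability
    return service


def measure_error_rate(service, requests=1000):
    """Steady-state metric (assumed here): fraction of failed requests."""
    failures = sum(1 for _ in range(requests) if not service())
    return failures / requests


def run_experiment(control, experimental, tolerance):
    """Compare steady state between the control and experimental groups.

    The hypothesis is that both groups stay within `tolerance` of each other.
    """
    control_rate = measure_error_rate(control)
    experimental_rate = measure_error_rate(experimental)
    deviation = abs(experimental_rate - control_rate)
    return {
        "control_error_rate": control_rate,
        "experimental_error_rate": experimental_rate,
        "hypothesis_holds": deviation <= tolerance,
    }


rng = random.Random(42)
# Control group: healthy service. Experimental group: simulated total outage.
control = make_service(rng, failure_probability=0.0)
experimental = make_service(rng, failure_probability=1.0)

result = run_experiment(control, experimental, tolerance=0.05)
print(result["hypothesis_holds"])  # False: the outage moved the error rate
```

A disproven hypothesis is the useful outcome here: it points at a weakness to fix in the "learn and improve" step.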
8. Chaos Engineering Meets DevOps
Maximize benefits by practicing automated Chaos Engineering within your CI/CD pipelines.
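One possible way to wire experiments into a pipeline is a gate step that fails the build when any hypothesis is disproven. The sketch below assumes the experiment outcomes are already available as a dictionary of name to boolean; the function name `pipeline_gate` and the experiment names are hypothetical, and how the results are produced depends on the chaos tool in use.

```python
import sys


def pipeline_gate(experiment_results):
    """Return a process exit code for a CI/CD step.

    `experiment_results` maps an experiment name to True when the
    steady-state hypothesis held, and to False when it was disproven.
    """
    failed = [name for name, held in experiment_results.items() if not held]
    for name in failed:
        print(f"chaos experiment disproved hypothesis: {name}", file=sys.stderr)
    return 1 if failed else 0


# Hypothetical results collected from an earlier pipeline stage.
results = {"kill-one-instance": True, "add-200ms-latency": False}
exit_code = pipeline_gate(results)
# A pipeline step would typically finish with: sys.exit(exit_code)
```

Most CI systems treat a non-zero exit code as a failed step, so the gate blocks promotion until the weakness is addressed.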
10. What is Game Day?
Game Days are like fire drills: dedicated days for running chaos engineering experiments on our systems.
How to run a Game Day
• Define the timelines
• Whiteboarding
• Execution
• Review
• Define the targets
Promote Chaos Days!
11. How Does Chaos Engineering Differ from Testing?
Practice for generating new information
• Experiments propose a hypothesis, and if the hypothesis is not disproven, confidence grows in that hypothesis. If it is disproven, then we learn something new.
GENERATE NEW INFORMATION
• An important distinction can be drawn between testing and experimentation. Tests make an assertion, based on existing knowledge, and then running the test collapses the valence of that assertion, usually into either true or false.
DRAW DISTINCTION
• When you want to explore the many ways a complex system can misbehave, injecting communication failures like latency and errors is one good approach.
EXPLORATION OF UNKNOWN
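Injecting latency and errors into a call can be illustrated with a small Python decorator. This is a sketch for local experimentation only: `inject_faults`, `InjectedFault` and `fetch_profile` are hypothetical names invented for this example, and real chaos tools usually inject such failures at the network or infrastructure level rather than in application code.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real communication error (hypothetical name)."""


def inject_faults(latency_seconds=0.0, error_rate=0.0, rng=None):
    """Wrap a function so each call is delayed and may fail at random."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_seconds > 0:
                time.sleep(latency_seconds)  # simulated network latency
            if rng.random() < error_rate:    # simulated communication error
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(latency_seconds=0.01, error_rate=0.0)
def fetch_profile(user_id):
    """Hypothetical downstream call used only for this illustration."""
    return {"id": user_id}


profile = fetch_profile(7)  # delayed by about 10 ms, never fails at error_rate=0.0
```

Raising `error_rate` toward 1.0 lets you observe how callers behave when the dependency fails: whether they retry, fall back, or propagate the error.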
• Testing, strictly speaking, does not create new knowledge. Testing requires that the engineer writing the test knows specific properties about the system that they are looking for in advance.
COMPLEX ECOSYSTEM
12. Tools to kickstart your Chaos Journey
AWS Fault Injection
Which one to choose?
13. Is it even worth embracing?
Opportunities & Obstacles
Pros
• Insights received after running chaos testing can lead to a reduction in production incidents in the future.
• Helps improve the confidence and engagement of team members in carrying out disaster recovery methods and makes applications highly reliable.
• On a high level, Chaos Engineering gives us an advantage in overall system availability.
• Production outages can lead to huge losses; therefore, chaos engineering helps prevent large losses in revenue.
• The team can verify the system's behavior on failure to take
Cons
• Implementing chaos tools for a large-scale system and experimenting can lead to an increase in cost.
• Carelessness or incorrect steps in formulation and implementation can impact the application, thereby affecting the customer.
• It doesn't support all kinds of deployment.
• Most chaos engineering tools do not cover all types of environments and their components.