Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Chaos is a ladder!
1. FULLSTACK TECH RADAR DAY
CHAOS is a Ladder
Haggai Philip Zagury (hagzag) | DevOps Group
& Tech Lead @ Tikal Knowledge
2. FULLSTACK TECH RADAR DAY
Haggai Philip Zagury
DevOps Group & Tech Lead -> 10+ years @ Tikal
My open-thinking, open-techniques ideology is driven by open source technologies and the
collaborative manner that defines my M.O.
My solution-driven approach is based on hands-on work and a deep understanding of operating
systems, application stacks, programming languages, networking, cloud in general and, today, more
and more cloud native solutions.
@hagzag
3. FULLSTACK TECH RADAR DAY
What is Chaos Engineering ?
The philosophy behind Chaos Engineering
4. FULLSTACK TECH RADAR DAY
http://bit.ly/2VQGCup
Chaos means many different
things to different people…
5. FULLSTACK TECH RADAR DAY
In 1 Sentence
‣ Chaos Engineering is the discipline of
experimenting on a distributed system in
order to build confidence in the system’s
capability to withstand turbulent
conditions in production.
Building Trust
6. FULLSTACK TECH RADAR DAY
Building Resilient Trust in systems is hard !
(Backend, DevOps, Frontend & Mobile)
12. FULLSTACK TECH RADAR DAY
Building confidence in computer systems is hard !
● Systems fail (Some “Design to Fail”)
● “Best Effort” Infra
● *aaS
● Cloud
● Cloud native
● Hybrid Cloud
● …
14. FULLSTACK TECH RADAR DAY
In addition to “Traditional Testing”
● Chaos Engineering goes beyond
traditional (failure) testing in that it's not
only about verifying assumptions. It also
helps us explore the many unpredictable
things that could happen and discover
new properties of our inherently chaotic
systems.
18. FULLSTACK TECH RADAR DAY
Hypothesis-Driven Experiments
● Hypothesis - Define your steady state
● Experiment by challenging it
● Analyse your findings - spread the word
● Action items should be noted
● Perhaps run another round with
other limits / variables
● Immunize your system (eventually)
Immune
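The hypothesis-driven loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: the names run_experiment, at_least_two_up, kill_one and restore, and the three-replica toy system, are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_after: bool

    @property
    def hypothesis_held(self) -> bool:
        return self.steady_before and self.steady_after


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check the steady state, inject turbulence, check again, always roll back."""
    if not steady_state():
        # Never experiment on a system that is already unhealthy.
        return ExperimentResult(steady_before=False, steady_after=False)
    try:
        inject()
        after = steady_state()
    finally:
        rollback()
    return ExperimentResult(steady_before=True, steady_after=after)


# Hypothetical toy system: three replicas, "steady" while at least two are up.
replicas = {"a": True, "b": True, "c": True}


def at_least_two_up() -> bool:
    return sum(replicas.values()) >= 2


def kill_one() -> None:
    replicas["a"] = False


def restore() -> None:
    replicas["a"] = True


result = run_experiment(at_least_two_up, kill_one, restore)
print(result.hypothesis_held)
```

Changing kill_one to take down two replicas is the "another round with other limits / variables" step: the hypothesis then fails, and that finding becomes an action item.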
19. FULLSTACK TECH RADAR DAY
Chaos engineering is:
● Like injecting a vaccine to immunize yourself.
● Increasing system resilience by discovering vulnerabilities.
● Identifying failure before it becomes an outage.
● Better defining your steady state (iteratively) and constantly challenging it.
20. FULLSTACK TECH RADAR DAY
Chaos engineering isn’t:
● Breaking down production on purpose.
● A (new) blame mechanism
● Surprising partial outages.
● Taking down the whole system at the same time.
26. FULLSTACK TECH RADAR DAY
DevOps
1998 2010 2011
How Complex Systems Fail (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)
25 years Resilience practitioner
http://erikhollnagel.com/ideas/resilience-engineering.html
Resilience Engineering
A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and unexpected conditions.
29. FULLSTACK TECH RADAR DAY
DevOps
2010 2011 2014 2015 2017
Unleash the Army
Chaos Engineer role announced
gremlin.com: Failure as a service
Resilience Engineering
30. FULLSTACK TECH RADAR DAY
DevOps
1998 2010 2011 2014 2015 2016 2017
Resilience Engineering: http://erikhollnagel.com/ideas/resilience-engineering.html
Chaos Engineer role announced
Building trust in Chaos Engineering
33. FULLSTACK TECH RADAR DAY
In 1 Sentence
‣ Chaos Engineering is the discipline of experimenting on a
distributed system in order to build confidence in the
system’s capability to withstand turbulent
conditions in production.
‣ Preparing for the unknown …
Building Trust
34. FULLSTACK TECH RADAR DAY
Turbulent condition - failing node in a cluster
[Diagram: namespace "default", pods of services a and b spread across the cluster nodes]
● 2 services in a 3 node cluster
35. FULLSTACK TECH RADAR DAY
Turbulent conditions
[Diagram: namespace "default", pods of services a and b spread across the cluster nodes]
● What’s my application going to suffer from ?
36. FULLSTACK TECH RADAR DAY
Turbulent conditions
[Diagram: namespace "default", pods of services a and b in a different arrangement across the nodes]
● 2 services in a 3 node cluster
● What’s my application going
to suffer from ?
● Is this OK ?
37. FULLSTACK TECH RADAR DAY
Turbulent conditions
[Diagram: namespace "default", pods of services a and b spread across the cluster nodes]
● Back to Normal
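The "what is my application going to suffer from?" question for a failing node can be explored with a small simulation before touching a real cluster. The sketch below is only illustrative: the node names, the pod placement and the min_replicas threshold are assumptions, since the slides do not state the exact layout.

```python
from collections import Counter

# Hypothetical placement: node name -> service replicas running on it.
placement = {
    "node-1": ["a", "b"],
    "node-2": ["b", "a"],
    "node-3": ["a"],
}


def surviving_replicas(placement, failed_node):
    """Count the replicas per service that remain when one node is lost."""
    counts = Counter()
    for node, services in placement.items():
        if node != failed_node:
            counts.update(services)
    return counts


def impact(placement, failed_node, min_replicas=1):
    """Return the services that drop below min_replicas after the failure."""
    all_services = {s for services in placement.values() for s in services}
    remaining = surviving_replicas(placement, failed_node)
    return sorted(s for s in all_services if remaining[s] < min_replicas)


for node in placement:
    survivors = dict(surviving_replicas(placement, node))
    print(node, survivors, impact(placement, node, min_replicas=2))
```

Whether the result "is OK" depends on the steady state you defined: with min_replicas=1 every single-node failure is tolerated in this layout, while a stricter threshold exposes the nodes whose loss would hurt.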
45. FULLSTACK TECH RADAR DAY
Not just graphs and logs (that too)
● RCAs: record them, and make sure people can find them later!
● Document, Document, Document - there are great resources on how to do that.
● We don’t Chaos everything …
● Only what makes sense / repeats
● Game / Chaos Days -> keep experiment definitions to define your
GameDay / ChaosDay
46. FULLSTACK TECH RADAR DAY
SLA … is innovation driven - how fast did you move without
failing?
https://cloudplatformonline.com/rs/248-TPC-286/images/DORA-State%20of%20DevOps.pdf
49. FULLSTACK TECH RADAR DAY
Application
Caching
Database
Hardware
Network
What layer ? - All !
50. FULLSTACK TECH RADAR DAY
The ultimate chaos: “Butterfly Effect” / “Domino Effect”
● How will my application do
● without cache?
● without a certain API available?
● with n sessions?
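The "without cache" question can be rehearsed in code by injecting the fault into a stand-in cache and measuring what reaches the database. This is a minimal sketch under stated assumptions: FlakyCache, Database and fetch are hypothetical names, and the read-through pattern is just one common way an application might use a cache.

```python
class CacheDown(Exception):
    """Raised by the cache client while the fault is injected."""


class FlakyCache:
    def __init__(self):
        self.data = {}
        self.down = False  # flip to True to simulate a cache outage

    def get(self, key):
        if self.down:
            raise CacheDown(key)
        return self.data.get(key)

    def set(self, key, value):
        if self.down:
            raise CacheDown(key)
        self.data[key] = value


class Database:
    def __init__(self):
        self.reads = 0  # counts how much load reaches the database

    def load(self, key):
        self.reads += 1
        return f"value-for-{key}"


def fetch(key, cache, db):
    """Read-through lookup that degrades to the database when the cache fails."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheDown:
        return db.load(key)  # degrade: skip the cache entirely
    value = db.load(key)
    try:
        cache.set(key, value)
    except CacheDown:
        pass  # the cache went away between get and set; the value is still valid
    return value


cache, db = FlakyCache(), Database()
fetch("user:1", cache, db)  # miss, fills the cache
fetch("user:1", cache, db)  # hit, no database read
reads_with_cache = db.reads

cache.down = True  # the experiment: take the cache away
for _ in range(5):
    fetch("user:1", cache, db)
print(reads_with_cache, db.reads)
```

The application keeps answering, but every request now lands on the database. That is the domino to watch for: the next experiment would check whether the database survives that load with n sessions.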
52. FULLSTACK TECH RADAR DAY
Applying Chaos Engineering practices
Log | Measure
Monitor
Break Things & Auto Recover
Experiment
Full Cycle - Chaos
Immune
Application
Caching
Database
Hardware
Network
Security
53. FULLSTACK TECH RADAR DAY
Where is Chaos going ?
"the discipline of experimenting on
a distributed system in order to
build confidence in the system's
capability to withstand turbulent
conditions in production."
56. FULLSTACK TECH RADAR DAY
Game-day resources
https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/
Planning your GameDay?
Feel free to contact me directly,
we’d be happy to help -> hagzag@tikalk.com
58. FULLSTACK TECH RADAR DAY
Experiment: terminate a pod!
● What to do
● When to do it
{
  "type": "action",
  "name": "terminate-db-pod",
  "provider": {
    "type": "python",
    "module": "chaosk8s.pod.actions",
    "func": "terminate_pods",
    "arguments": {
      "label_selector": "app=my-app",
      "name_pattern": "my-app-[0-9]$",
      "rand": true,
      "ns": "default"
    }
  },
  "pauses": {
    "after": 5
  }
}
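The action above covers "what to do"; a complete experiment also states the steady state it is expected to preserve. The Python sketch below assembles one and prints it as JSON. The overall layout (version, title, description, steady-state-hypothesis, method, rollbacks) is assumed to follow the Chaos Toolkit experiment format and should be checked against its documentation; the probe name, the title text and the health URL are hypothetical and not taken from the slides.

```python
import json

# The action from the slide, unchanged.
action = {
    "type": "action",
    "name": "terminate-db-pod",
    "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
            "label_selector": "app=my-app",
            "name_pattern": "my-app-[0-9]$",
            "rand": True,
            "ns": "default",
        },
    },
    "pauses": {"after": 5},
}

experiment = {
    "version": "1.0.0",
    "title": "Terminating one my-app pod does not break the service",
    "description": "Hypothetical wrapper around the terminate-db-pod action.",
    "steady-state-hypothesis": {
        "title": "Service answers with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "service-is-up",  # hypothetical probe name
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    # Hypothetical health endpoint; replace with your own.
                    "url": "http://my-app.default.svc.cluster.local/health",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [action],
    "rollbacks": [],
}

print(json.dumps(experiment, indent=2))
```

Saved to a file such as experiment.json, a definition like this would typically be run with `chaos run experiment.json`. Keeping the file in version control covers the "when to do it" part, since the same experiment can be replayed on the next GameDay.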
60. FULLSTACK TECH RADAR DAY
Chaoskube
● chaoskube is a “chaos-monkey lite”: it takes down pods on a
schedule to test your resilience (with some tweaks available via
configuration)
● use --dry-run
https://github.com/linki/chaoskube
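The idea of picking a random pod on a schedule, with a dry-run mode that only reports the choice, can be illustrated with a toy sketch. This is not chaoskube's implementation and it does not talk to Kubernetes: pick_victim, chaos_round and the pod names are made up for the example.

```python
import random


def pick_victim(pods, rng):
    """Pick one pod at random, or None when there is nothing to pick."""
    return rng.choice(pods) if pods else None


def chaos_round(pods, rng, dry_run=True):
    """Run one scheduled round; in dry-run mode only report the choice."""
    victim = pick_victim(pods, rng)
    if victim is None:
        return pods, None
    if dry_run:
        print(f"[dry-run] would terminate {victim}")
        return pods, victim
    print(f"terminating {victim}")
    return [p for p in pods if p != victim], victim


rng = random.Random(42)  # seeded so that a run can be reproduced
pods = ["web-1", "web-2", "worker-1"]

pods_after_dry, _ = chaos_round(pods, rng, dry_run=True)
pods_after_real, killed = chaos_round(pods, rng, dry_run=False)
```

Starting in dry-run mode shows which pods match the selection before anything is actually terminated.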
61. FULLSTACK TECH RADAR DAY
kube-bench
Find vulnerabilities, configuration flags, define your own policies.
62. FULLSTACK TECH RADAR DAY
kube-hunter (Security)
1. Remote scanning: to specify remote machines for hunting, select option 1 or use
the --remote option. Example: ./kube-hunter.py --remote some.node.com
2. Internal scanning: to specify internal scanning, use the --internal option
(this will scan all of the machine's network interfaces). Example: ./kube-hunter.py --internal
3. Network scanning: to specify a specific CIDR to scan, use the --cidr option.
Example: ./kube-hunter.py --cidr 192.168.0.0/24
63. FULLSTACK TECH RADAR DAY
Many, many more…
● Stay tuned for more stuff about Chaos Engineering
● https://www.tikalk.com/community
64. Thank you for joining us
Haggai Philip Zagury
DevOps Group & Tech Lead @ Tikal