A case study of how we used Helix not only to build a distributed system but also to test it. We built a chaos monkey to simulate failures and developed tools in Helix to parse ZooKeeper transaction logs and controller and participant logs, reconstructing the exact sequence of steps that led to a failure. Once we have the exact sequence of steps, we reproduce the events using Helix for orchestration.
Data driven testing: Case study with Apache Helix
1. Data Driven Testing for Distributed Systems: Case study with Apache Helix
Kishore Gopalakrishna, @kishoreg1980
http://www.linkedin.com/in/kgopalak
2. Outline
• Intro to Helix
• Use case: Distributed data store
• Traditional approach
• Data driven testing
• Q & A
3. What is Helix
• Generic cluster management framework
  – Partition management
  – Failure detection and handling
  – Elasticity
4. Terminologies
• Node: a single machine
• Cluster: a set of nodes
• Resource: a logical entity, e.g. database, index, task
• Partition: a subset of a resource
• Replica: a copy of a partition
• State: the status of a partition replica, e.g. Master, Slave
• Transition: an action that lets a replica change state, e.g. Slave -> Master
5. Core concept: Augmented finite state machine
State Machine
• States: S1, S2, S3
• Transitions: S1->S2, S2->S1, S2->S3, S3->S1
Constraints
• States: S1 max=1, S2 min=2
• Transitions: concurrent(S1->S2) across cluster < 5
Objectives
• Partition placement
• Failure semantics
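The augmented FSM above can be captured as plain data: an ordinary state machine plus per-state bounds and per-transition concurrency limits. This is an illustrative sketch only, not the Helix API; the `StateModel` class and its field names are hypothetical.

```python
# Hypothetical sketch of an augmented FSM: states + transitions,
# plus constraints layered on top (not the real Helix API).
from dataclasses import dataclass, field

@dataclass
class StateModel:
    states: set
    transitions: set                                   # (from_state, to_state) pairs
    state_bounds: dict = field(default_factory=dict)   # state -> (min, max) replicas
    transition_limits: dict = field(default_factory=dict)  # (from, to) -> max concurrent

    def is_legal(self, src, dst):
        """A replica may only move along a declared transition."""
        return (src, dst) in self.transitions

# The example from the slide: S1 max=1, S2 min=2,
# at most 5 concurrent S1->S2 transitions across the cluster.
model = StateModel(
    states={"S1", "S2", "S3"},
    transitions={("S1", "S2"), ("S2", "S1"), ("S2", "S3"), ("S3", "S1")},
    state_bounds={"S1": (0, 1), "S2": (2, None)},
    transition_limits={("S1", "S2"): 5},
)

print(model.is_legal("S1", "S2"))  # True
print(model.is_legal("S1", "S3"))  # False
```

The point of separating the FSM from the constraints is that the same state machine can be reused with different bounds per deployment.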
7. Use case: Distributed data store
• Timeline consistent partitioned data store
• One master replica per partition
• Even distribution of masters/slaves
• On failure: promote slave to master
(Diagram: partitions P.1 to P.12 with master and slave replicas spread across Node 1, Node 2, and Node 3)
8. Helix based solution
State Machine
• States: Offline, Slave, Master
• Transitions: O->S, S->O, S->M, M->S
Constraints
• States: MASTER count = 1, SLAVE count = 2
• Transitions: concurrent(O->S) < 5, t1 <= 5
Objectives
• Partition placement: minimize(max over nodes nj in N of S(nj)) and minimize(max over nodes nj in N of M(nj)), i.e. spread slaves and masters evenly across nodes
• Failure semantics
(Diagram: state machine with states O, S, M connected by transitions t1..t4)
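The MasterSlave constraints and placement objective above can be checked mechanically. A minimal sketch, assuming a simple dict-based representation of the assignment (the function names and data layout here are illustrative, not from Helix):

```python
# Sketch of checking the slide's MasterSlave model: each partition has
# exactly one MASTER and at most two SLAVEs, and the objective is to
# keep the max number of masters on any one node small.
from collections import Counter

def check_partition(replica_states):
    """replica_states: dict node -> state, for one partition."""
    counts = Counter(replica_states.values())
    return counts["MASTER"] == 1 and counts["SLAVE"] <= 2

def max_masters_per_node(assignment):
    """assignment: dict partition -> {node: state}."""
    per_node = Counter()
    for replicas in assignment.values():
        for node, state in replicas.items():
            if state == "MASTER":
                per_node[node] += 1
    return max(per_node.values(), default=0)

assignment = {
    "P1": {"node1": "MASTER", "node2": "SLAVE", "node3": "SLAVE"},
    "P2": {"node1": "SLAVE", "node2": "MASTER", "node3": "SLAVE"},
}
print(all(check_partition(r) for r in assignment.values()))  # True
print(max_masters_per_node(assignment))                      # 1
```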
9. Testing
• Happy path functionality
  – Meet SLA
    • 99th percentile latency etc.
  – Writes to master
• Non happy path
  – System failures
  – Application failures
  – How does the system behave in such scenarios?
10. Non happy path: Traditional approach
• Identify scenarios of interest
  – Node failure
  – System upgrade
• Tested each scenario in isolation via a test case
  – All tests passed :)
• Deployed in alpha
  – First software upgrade failed ... but we tested it!
11. What was missing
• Failures don't happen in isolation
• The induction principle does not work
  – If something works once, that does not mean it will always work
• Lack of tools to debug issues
  – Could not identify the cause from one log file
• Poor coverage
  – Impossible to think of all possible test cases
12. What we learnt
• Test with all components integrated
• Simulate the real production environment
  – Generate load
  – Random failures of multiple components
• Better debugging tools
  – Need to correlate messages from multiple logs
  – Failure is a symptom; the actual reason is in past logs of a different machine
13. Data driven testing
• Instrument
  – Zookeeper, controller, participant logs
• Simulate
  – Chaos monkey
• Analyze
  – Invariants:
    • Respect state transition constraints
    • Respect state count constraints
    • And so on
• Debugging made easy
  – Reproduce the exact sequence of events
14. Chaos monkey
• Select random component(s) to fail
• How should it fail?
  – Hard/soft failure
  – Network partition
  – Garbage collection
  – Process freeze
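The selection step of such a chaos monkey is just randomized sampling over components and failure modes. A minimal sketch, where the component names, failure-mode labels, and `pick_failure` helper are all hypothetical stand-ins (actual failure injection is left out):

```python
# Minimal chaos-monkey sketch: each round, pick one or two random
# components and a random failure mode from the slide's list.
# Names here are illustrative; injection itself is not implemented.
import random

COMPONENTS = ["node1", "node2", "node3", "controller"]
FAILURES = ["hard_failure", "soft_failure", "network_partition",
            "gc_pause", "process_freeze"]

def pick_failure(rng=random):
    victims = rng.sample(COMPONENTS, k=rng.randint(1, 2))
    mode = rng.choice(FAILURES)
    return victims, mode

# Seeded RNG makes a chaos run reproducible, which matters later
# when you want to replay the exact sequence of events.
victims, mode = pick_failure(random.Random(0))
print(victims, mode)
```

Seeding the random generator is worth the small extra effort: a failing run can then be re-executed with the same seed.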
15. Automation of chaos monkey
• Helix agent on each node
• Modify the behavior of each service using Helix
  – Component 1
    • Node1: RUNNING
    • Node2: STOPPED
    • Node3: KILLED
  – Component 2
    • Node1: STOPPED
(Diagram: agent state machine with states STOPPED, RUNNING, FREEZED, KILLED and transitions START, STOP, PAUSE, UNPAUSE, KILL)
16. Pseudo test case
setup cluster
generate load
do
  (c, t) = components to fail and type of failure
  simulate failure
  verify system_is_stable
  restart failed components
while (verify system_is_stable)
Test case failed & here is the sequence of events
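The pseudo test case above can be rendered as a concrete loop. In this sketch, `FakeCluster` and its methods (`setup`, `generate_load`, `simulate_failure`, `restart`, `is_stable`) are hypothetical stand-ins for the real harness; the key behavior is that the loop stops at the first instability and returns the event history.

```python
# Python rendering of the pseudo test case: inject random failures
# until the cluster becomes unstable, recording every injected event.
import random

def run_chaos_test(cluster, rounds, rng=random):
    cluster.setup()
    cluster.generate_load()
    history = []  # the auto-generated "sequence of events"
    for _ in range(rounds):
        component = rng.choice(cluster.components)
        failure = rng.choice(["hard_kill", "freeze", "partition"])
        history.append((component, failure))
        cluster.simulate_failure(component, failure)
        if not cluster.is_stable():
            # Stop immediately: the cluster is left as-is for debugging,
            # and history is the reproduction recipe.
            return False, history
        cluster.restart(component)
        if not cluster.is_stable():
            return False, history
    return True, history

class FakeCluster:
    """Trivial stub so the loop runs; a real harness drives real nodes."""
    components = ["node1", "node2", "node3"]
    def setup(self): pass
    def generate_load(self): pass
    def simulate_failure(self, c, f): pass
    def restart(self, c): pass
    def is_stable(self): return True

ok, history = run_chaos_test(FakeCluster(), rounds=3, rng=random.Random(1))
print(ok, len(history))  # True 3
```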
17. Cluster verification
• Verify all constraints are satisfied
  – Is there a master for every partition?
  – Is the slave replicating?
  – A node/component being down should not matter
  – Validate every action, not just the end result
    • Having a master is not good enough if two nodes became master and later one of them died
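The "validate every action, not just the end result" point can be made concrete by replaying the per-partition state changes in time order and checking the master count at every step. A minimal sketch (event format assumed, not from Helix):

```python
# Replay state-change events for one partition and track, at each
# point in time, which replicas believe they are MASTER. The end
# state can look fine while the history still violates the invariant.
def masters_over_time(events):
    """events: list of (time, node, new_state), sorted by time.
    Yields (time, set_of_current_masters)."""
    state = {}
    for t, node, new_state in events:
        state[node] = new_state
        masters = {n for n, s in state.items() if s == "MASTER"}
        yield t, masters

events = [
    (1, "node1", "MASTER"),
    (2, "node3", "MASTER"),   # second master: violation happens here
    (3, "node3", "OFFLINE"),  # end state looks fine, history does not
]
violations = [t for t, m in masters_over_time(events) if len(m) > 1]
print(violations)  # [2]
```

An end-state-only check on this history would pass; the per-action replay catches the window at t=2 where two masters coexisted.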
18. Log analysis
• Log important events
  – "Became master from slave for this partition at this time"
• Tools to collect, merge & analyze logs
  – Parsed zookeeper transaction logs
  – Gathered helix controller and participant logs
  – Sorted on time
• Helix provides these tools out of the box
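The collect-merge-sort step boils down to interleaving several time-sorted event streams. A sketch of that step, assuming each source has already been parsed into `(timestamp, source, message)` tuples (the sample log lines are invented for illustration):

```python
# Merge per-source logs (each already sorted by time) into one
# time-ordered stream; heapq.merge interleaves sorted iterables
# without loading everything into memory.
import heapq

zk_log = [(42632, "zk", "znode for node1 set to OFFLINE")]
controller_log = [(42700, "controller", "send O->S to node1")]
participant_log = [(42796, "node1", "became SLAVE")]

merged = list(heapq.merge(zk_log, controller_log, participant_log))
for t, source, msg in merged:
    print(t, source, msg)
```

Because tuples compare element-by-element, the merge orders primarily by timestamp, which is exactly the "sorted on time" view used to reconstruct the failure sequence.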
20. Benefits
• Test case stops as soon as the system is unstable
  – The cluster is available for debugging
• Provides the exact sequence of events
  – Makes it easy to debug and reproduce
  – Best part: we auto-generated the test case
21. Reproduce the issue
• Start state: Helix brings the system to the start state
• Orchestrate the sequence: use the Helix messaging api to replay the events

Start state (custom ideal state):
{
  "id" : "MyDataStore",
  "simpleFields" : {
    "IDEAL_STATE_MODE" : "CUSTOM",
    "NUM_PARTITIONS" : "2",
    "REPLICAS" : "3",
    "STATE_MODEL_DEF_REF" : "MasterSlave"
  },
  "mapFields" : {
    "MyDataStore_0" : {
      "node1" : "MASTER",
      "node2" : "OFFLINE",
      "node3" : "SLAVE"
    },
    "MyDataStore_1" : {
      "node1" : "SLAVE",
      "node2" : "OFFLINE",
      "node3" : "MASTER"
    }
  }
}

Sequence of events:
1. Node1:MyDataStore_0: Master -> Slave
2. Node1: HARD KILL
3. Node2: START
22. Constraint violation
No more than R=2 slaves:

Time   State    Number of Slaves  Instance
42632  OFFLINE  0                 10.117.58.247_12918
42796  SLAVE    1                 10.117.58.247_12918
43124  OFFLINE  1                 10.202.187.155_12918
43131  OFFLINE  1                 10.220.225.153_12918
43275  SLAVE    2                 10.220.225.153_12918
43323  SLAVE    3                 10.202.187.155_12918
85795  MASTER   2                 10.220.225.153_12918
23. How long was it out of whack?

Number of Slaves  Time       Percentage
0                 1082319    0.5
1                 35578388   16.46
2                 179417802  82.99
3                 118863     0.05

83% of the time, there were 2 slaves for a partition.

Number of Masters  Time       Percentage
0                  15490456   7.16
1                  200706916  92.84

93% of the time, there was 1 master for a partition.
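Tables like the one above can be derived from the merged event log by summing the durations between consecutive events, bucketed by the count in force during each interval. A sketch with invented sample numbers (the helper name and event format are assumptions):

```python
# Compute "percentage of time with N slaves" from a sorted timeline
# of (time, slave_count) change events, as in the slide's table.
from collections import defaultdict

def time_per_count(events, end_time):
    """events: sorted list of (time, count); returns count -> percentage."""
    totals = defaultdict(int)
    for (t, count), (t_next, _) in zip(events, events[1:] + [(end_time, None)]):
        totals[count] += t_next - t
    span = end_time - events[0][0]
    return {c: 100.0 * d / span for c, d in totals.items()}

# Illustrative timeline: 1 slave from t=0, 2 slaves from t=100,
# back to 1 slave from t=900, observation window ends at t=1000.
events = [(0, 1), (100, 2), (900, 1)]
print(time_per_count(events, 1000))  # {1: 20.0, 2: 80.0}
```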
24. Invariant 2: State Transitions

FROM     TO       COUNT
MASTER   SLAVE    55
OFFLINE  DROPPED  0
OFFLINE  SLAVE    298
SLAVE    MASTER   155
SLAVE    OFFLINE  0
25. Fun facts
• For almost a month, the test could not run successfully for a whole night
• Most issues were found using one test case
• Reproduced almost all failures
26. Conclusion
• The traditional approach is not good enough
• Data driven testing is the way to go
  – Focus on workload and analysis
  – Production system always in test mode
  – Leverage tools built for testing to debug production issues
In this slide, we look at the problem from a different perspective and possibly re-define the cluster management problem. To recap: to solve the distributed data store problem, we need to define the number of partitions and replicas, and each replica needs a role such as master or slave. One well-proven way to express such behavior is a state machine.
Used in production to manage core infrastructure components in the company. Operation is easy, and it is easy for dev ops to operate multiple systems.