TMPA-2017: Tools and Methods of Program Analysis
3-4 March, 2017, Hotel Holiday Inn Moscow Vinogradovo, Moscow
Live testing distributed system fault tolerance with fault injection techniques
Alexey Vasyukov (Inventa), Vadim Zherder (MOEX)
For video follow the link: https://youtu.be/mGLRH2gqZwc
Would you like to know more?
Visit our website:
www.tmpaconf.org
www.exactprosystems.com/events/tmpa
Follow us:
https://www.linkedin.com/company/exactpro-systems-llc?trk=biz-companies-cym
https://twitter.com/exactpro
1. Live Testing of Distributed System Fault Tolerance With Fault Injection Techniques
Alexey Vasyukov, Inventa
Vadim Zherder, Moscow Exchange
7. Messaging
UP – transactions
DOWN – responses and messages to nodes of the cluster
A CRC (cyclic redundancy check) is used to control the transaction flow, as sketched below
Full history is kept, so a “late join” is possible
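To make the CRC idea concrete, here is a minimal sketch assuming each message carries a cumulative CRC32 over the transaction stream; the class and field names are ours, not MOEX's:

```python
import zlib

# Minimal sketch: each node keeps a running CRC32 over the serialized
# transaction stream; a mismatch with the cumulative CRC carried in a
# message signals a lost or corrupted transaction.
class TransactionStreamChecker:
    def __init__(self):
        self.crc = 0  # running CRC32 over all transactions seen so far

    def apply(self, payload: bytes, expected_crc: int) -> bool:
        """Fold one transaction into the running CRC and compare it with
        the cumulative CRC published by the sender (assumed field)."""
        self.crc = zlib.crc32(payload, self.crc)
        return self.crc == expected_crc

checker = TransactionStreamChecker()
ok = checker.apply(b"tx1", zlib.crc32(b"tx1"))  # True: stream intact so far
```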
8. Role: “Main”
“Main” = the main TS instance
• Get new transactions from incoming stream
• Broadcast transactions within cluster
• Process transactions
• Check transaction result (compare to results obtained from other nodes)
• Broadcast results within cluster
• Publish replies to clients
9. Role: “Backup”
“Backup” = a special state of TS instance
• Can be switched to Main quickly
• Get all transactions from Main
• Process transactions independently
• Check transaction result (compare to results from other nodes)
• Write its own TransLog
• Do not send replies to clients
10. Role: “Backup”
2 modes: SYNC (“Hot Backup”) and ASYNC (“Warm Backup”)
If Main fails, a SYNC can switch to Main automatically
SYNC publishes transaction results
If a SYNC fails, an ASYNC can switch to SYNC automatically
ASYNC does not publish transaction results
An ASYNC can be switched to Main manually by the Operator
The number of SYNCs is a static parameter set by the Operator
11. Role: Governor
Governor can
• Force a node to change its role
• Force a node to stop
• Start elections to assign a new Main
There is only one Governor in the cluster
The Governor can be assigned only by the Operator
The Governor role cannot be changed
If a node asks the Governor while it is unavailable, the node stalls until it recovers connectivity to the Governor
The Governor can be recovered or restarted only by the Operator
12. Roles Summary

                               Governor   Main   SYNC Backup   ASYNC Backup
Send Table of states              V        V         V              V
Get Client Transaction                     V
Broadcast Transaction                      V
Process Transaction                        V         V              V
Broadcast Transaction result               V         V
Compare transaction results                V         V              V
Send replies to clients                    V
Can switch to Main                                   V
13. If something goes wrong…
IF we detect
• A mismatch in transaction results
• A node does not respond
• No new transactions incoming
• A wrong CRC
• The Governor does not respond
• A mismatch in tables of states
• …
THEN
ASK Governor
14. Elections
Elections start to assign a new Main
Transaction processing stops during elections
Two-part generation counter (G:S)
Initial value (0:0)
Every successful election increments G and resets S to 0 (G:0)
Every round of elections increments S
Example: (1:0) -> (1:1) -> (1:2) -> (2:0)
The generation counter is carried in every message to/from the Governor
Two-phase commit approach: the Governor sends the new table of states and waits for confirmation from all nodes
A small sketch of the counter follows below.
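The counter logic is small enough to sketch directly; the class and method names below are ours, not from the MOEX code base:

```python
# Minimal sketch of the (G:S) generation counter described above.
# G counts successful elections, S counts rounds within the current attempt.
class GenerationCounter:
    def __init__(self, g: int = 0, s: int = 0):
        self.g, self.s = g, s

    def new_round(self):
        """Another round of the same election: (1:1) -> (1:2)."""
        self.s += 1

    def election_succeeded(self):
        """Successful election: increment G, reset S, e.g. (1:2) -> (2:0)."""
        self.g, self.s = self.g + 1, 0

    def __repr__(self):
        return f"({self.g}:{self.s})"

gc = GenerationCounter()           # (0:0)
gc.election_succeeded()            # (1:0)
gc.new_round(); gc.new_round()     # (1:1) -> (1:2)
gc.election_succeeded()            # (2:0), matching the example above
```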
16. MOEX Consensus Protocol
(by Sergey Kostanbaev, MOEX)
The Tables of states must be consistent at all nodes during normal operation
The Tables of states must become consistent again after some nodes fail
Table of States (each node, Node 1 through Node 4, holds an identical copy):
  uuid1  S_MAIN
  uuid2  S_GOVERNOR
  uuid3  S_BACKUP_SYNC
  uuid4  S_BACKUP_ASYNC
A consistency check for these tables is sketched below.
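A minimal sketch of the consistency predicate, with an assumed data layout (each node holds a uuid-to-state mapping; names are illustrative):

```python
# Each node holds a table mapping node uuid -> state; the cluster is
# consistent when all live nodes hold identical tables.
table = {"uuid1": "S_MAIN",
         "uuid2": "S_GOVERNOR",
         "uuid3": "S_BACKUP_SYNC",
         "uuid4": "S_BACKUP_ASYNC"}

# copies[i] is the table of states as seen by node i (here: all identical)
copies = [dict(table) for _ in range(4)]

def tables_consistent(copies) -> bool:
    """True when every node's copy of the table of states is identical."""
    return all(c == copies[0] for c in copies)

assert tables_consistent(copies)
copies[2]["uuid1"] = "S_BACKUP_SYNC"   # one node diverges after a fault
assert not tables_consistent(copies)   # -> ask Governor / start elections
```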
17. MOEX Consensus Protocol
Thus, MOEX CP is an example of a distributed consensus protocol
Other examples:
• Paxos, 1998, 2001, …
  LAMPORT, L. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 18–25.
• Raft, 2014, https://raft.github.io/raft.pdf
  ONGARO, D., AND OUSTERHOUT, J. In search of an understandable consensus algorithm. In Proc. ATC'14, USENIX Annual Technical Conference (2014), USENIX.
• DNCP, 2016, https://tools.ietf.org/html/rfc7787
Open questions:
Is MOEX CP equivalent to any of the known protocols?
Hypotheses on MOEX CP features:
H1. Byzantine fault tolerance
H2. Safety
H3. No liveness
18. Cluster Normal State Requirements
• There is exactly 1 Governor in the cluster
• There is exactly 1 Main in the cluster
• Tables of states at all nodes are consistent
• All active nodes in the cluster have the same value of the Generation Counter
• The cluster is available (for client connections) and processes transactions
• All nodes process the same sequence of transactions
• Either the number of SYNCs equals the predefined value, or it is less than the predefined value and there are no ASYNCs
• … (a code sketch of these checks follows below)
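A minimal sketch of checking a few of these invariants, assuming each node reports its role and generation counter; the function and role names are ours:

```python
from collections import Counter

def is_normal_state(roles, generations, expected_syncs):
    """roles: list of role strings, one per active node;
    generations: list of (G, S) tuples, one per active node."""
    counts = Counter(roles)
    return (counts["GOVERNOR"] == 1                      # exactly 1 Governor
            and counts["MAIN"] == 1                      # exactly 1 Main
            and len(set(generations)) == 1               # same (G:S) everywhere
            and (counts["SYNC"] == expected_syncs        # full SYNC quota, or
                 or (counts["SYNC"] < expected_syncs     # fewer SYNCs and
                     and counts["ASYNC"] == 0)))         # no ASYNCs left

roles = ["GOVERNOR", "MAIN", "SYNC", "ASYNC"]
gens = [(2, 0)] * 4
assert is_normal_state(roles, gens, expected_syncs=1)
```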
19. Main “Theorem”
• Assume the cluster is in Normal state and one Main or Backup node fails. Then the cluster returns to Normal state within finite time.
20. MOEX CP Testing
Investigate
• Fault detection
• Implementation correctness
• Timing
• Dependence on load profile
• Dependence on environment configuration
• Statistics
Integration with CI/CD processes
21. Typical Test Scenario
1. Start all nodes
2. Wait for normal state
3. Start the transaction generator
4. Keep the transaction flow running for some time
5. Fault injection: emulate a fault (single or multiple)
6. Wait for normal state (with a check timeout)
7. Check the state at each node
8. Collect artifacts
A scripted sketch of this scenario is shown below.
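A minimal scripted version of the scenario, written against a hypothetical framework API; cluster, generator, inject and their methods are assumed names, not the real MOEX interface:

```python
import time

def run_scenario(cluster, generator, inject, timeout_s=60, load_s=30):
    cluster.start_all()                        # 1. start all nodes
    cluster.wait_normal_state(timeout_s)       # 2. wait for normal state
    generator.start()                          # 3. start transaction generator
    time.sleep(load_s)                         # 4. keep transaction flow
    inject.apply()                             # 5. inject fault(s)
    cluster.wait_normal_state(timeout_s)       # 6. wait, with check timeout
    for node in cluster.nodes:                 # 7. check state at each node
        assert node.state_is_valid()
    return cluster.collect_artifacts()         # 8. get artifacts (logs, stats)
```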
22. References
WIDDER, J. Introduction into Fault-tolerant Distributed Algorithms and their Modeling. TMPA (2014).
LAMPORT, L. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 18–25.
ONGARO, D., AND OUSTERHOUT, J. In search of an understandable consensus algorithm. In Proc. ATC'14, USENIX Annual Technical Conference (2014), USENIX. https://raft.github.io/raft.pdf
ONGARO, D. Consensus: Bridging theory and practice. Doctoral dissertation, Stanford University, 2014.
25. MOEX Fault Injection Framework
Concepts
• End-to-end testing of the cluster implementation
• Starts the complete real system on real infrastructure
• Provides modules to inject predictable faults on selected servers
• Provides domain-specific libraries for writing tests
• System, network, and application issues are injected directly
• Misconfiguration problems are tested indirectly (real infrastructure, config push before test start)
27. Injection Techniques
OS Processes
• Kill (SIGKILL)
• Hang (SIGSTOP for N seconds + SIGCONT)
Network
• Interface “blink” (DROP 100% of packets for N seconds)
• Interface “noise” (DROP X% of packets for N seconds)
• Content filtering: allows a “smart” injection into the protocol, dropping selected messages from the flow
Application
• Data corruption (with a gdb script): emulates application-level issues from incorrect calculations
A sketch of the process-level injections is shown below.
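A minimal sketch of the kill and hang injections, plus an illustrative interface “blink”; the exact commands the MOEX framework uses are not shown in the talk, so the iptables rule here is only an example (and needs root privileges):

```python
import os, signal, subprocess, time

def inject_kill(pid: int):
    """Kill: terminate the target process abruptly."""
    os.kill(pid, signal.SIGKILL)

def inject_hang(pid: int, seconds: float):
    """Hang: freeze the process for N seconds, then let it continue."""
    os.kill(pid, signal.SIGSTOP)
    time.sleep(seconds)
    os.kill(pid, signal.SIGCONT)

def inject_blink(interface: str, seconds: float):
    """Interface 'blink': drop 100% of inbound packets for N seconds
    (illustrative iptables rule, not the framework's actual mechanism)."""
    subprocess.run(["iptables", "-A", "INPUT", "-i", interface, "-j", "DROP"],
                   check=True)
    time.sleep(seconds)
    subprocess.run(["iptables", "-D", "INPUT", "-i", interface, "-j", "DROP"],
                   check=True)
```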
28. Basic Cluster State Validations

#   Code                  Description
00  ALIVE_ON_START        Cluster nodes should start correctly
01  SINGLE_MAIN           Only one node should consider itself MAIN
02  GW_OK                 All gateways should be connected to the correct MAIN
03  GEN_OK                All active cluster nodes should have the same generation
04  TE_VIEW_OK            Current MAIN should be connected to all alive nodes
05  CLU_VIEW_CONSISTENT   All alive nodes should have the same cluster view
06  ELECTIONS_OK          Elections count during the test should match the inject scenario
07  DEAD_NODES_OK         The number of lost nodes should match the inject scenario
08  CLIENTS_ALIVE         Clients should not notice any issue; fault handling logic is completely hidden from them

A sketch of running such validations as code follows below.
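One way to wire such coded validations together, with placeholder predicates rather than the framework's real ones:

```python
# Placeholder checks illustrating two of the validation codes above.
def single_main(cluster) -> bool:
    return sum(1 for n in cluster.nodes if n.role == "MAIN") == 1

def gen_ok(cluster) -> bool:
    return len({n.generation for n in cluster.nodes if n.alive}) == 1

VALIDATIONS = {
    "01_SINGLE_MAIN": single_main,
    "03_GEN_OK": gen_ok,
    # ... the remaining codes from the table would be registered the same way
}

def validate(cluster) -> dict:
    """Run every registered validation; the result feeds the test summary."""
    return {code: check(cluster) for code, check in VALIDATIONS.items()}
```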
29. Test Targets
• Basic system faults
• Multiple system faults on different nodes
• Application level faults
• Random network instabilities
• Recovery after faults
• Governor stability (failures, restarts, failures during elections)
30. Test Summary
• Logs from all nodes for root cause analysis
• Cluster state validations summary
• Cluster node states (Sync Backup is dead, Async Backup switched to Sync)
31. Basic Fault: Overall System Behavior
Event log timeline: BS died, elections started → elections, no transactions → resumed operation
32. Restore After Fault: Overall System Behavior
Event log timeline: BS hanged, elections started → elections, no transactions → resumed operation → BS is alive again → BS rejoins the cluster, receiving missed transactions
33. Performance Metrics
• Key performance data from all cluster nodes
• How do faults influence service quality for consumers?
• Compare configurations (indirectly, together with config push)
34. Domain Specific Language
• Useful for ad-hoc tests and quick analysis
• Complements the set of 'default' tests (written in Python)
An illustrative ad-hoc check follows below.
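The deck does not show the DSL's syntax itself, so here is a purely illustrative ad-hoc check that reuses the hypothetical Python API from the sketches above (cluster, inject_hang, validate are assumed names):

```python
def adhoc_hang_sync_backup(cluster, timeout_s=60):
    """Ad-hoc check: hang the SYNC backup briefly, then confirm recovery."""
    cluster.wait_normal_state(timeout_s)
    inject_hang(cluster.node("sync_backup").pid, seconds=10)
    cluster.wait_normal_state(timeout_s)
    return validate(cluster)   # quick look at which validations passed
```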
35. Statistics
• Multiple runs to identify problems without stable reproducers
• Heatmap to quickly analyze which tests and which validations fail
36. References
Similar tools:
1. Netflix Simian Army; http://techblog.netflix.com/2011/07/netflix-simian-army.html
2. Jepsen; https://jepsen.io/
Reading:
1. Caitie McCaffrey. 2015. The Verification of a Distributed System. Queue 13, 9 (December 2015), 11 pages. DOI=http://dx.doi.org/10.1145/2857274.2889274
2. Alvaro, P., Rosen, J., and Hellerstein, J.M. 2015. Lineage-driven Fault Injection. http://www.cs.berkeley.edu/~palvaro/molly.pdf
3. Yuan, D., Luo, Y., Zhuang, X., Rodrigues, G. R., Zhao, X., Zhang, Y., Jain, P. U., and Stumm, M. 2014. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan
4. Ghosh, S., et al. 1997. Software Fault Injection Testing on a Distributed System – A Case Study.
5. Lai, M.-Y., and Wang, S.Y. 1995. Software Fault Insertion Testing for Fault Tolerance. In Software Fault Tolerance, edited by Lyu, Chapter 13.