May 2016
Raft in details
Ivan Glushkov
ivan.glushkov@gmail.com
@gliush
Raft is a consensus algorithm for managing a
replicated log
Why
❖ Very important topic
❖ [Multi-]Paxos, the key player, is very difficult to
understand
❖ Paxos -> Multi-Paxos -> different approaches
❖ Different optimizations, closed-source implementations
Raft goals
❖ Main goal - understandability:
❖ Decomposition: leader election, log replication, safety,
and membership changes are handled separately
❖ State space reduction
Replicated state machines
❖ The same state on each server
❖ Compute the state from replicated log
❖ Consensus algorithm: keep the replicated log consistent
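A minimal sketch of the idea, assuming a hypothetical key-value state machine whose commands are (key, value) pairs (the slides do not specify a command format): replaying the same log in the same order gives the same state on every server.

```python
def apply_log(log):
    """Replay (key, value) commands in log order and return the resulting state."""
    state = {}
    for key, value in log:
        state[key] = value  # later entries overwrite earlier ones
    return state


# Two servers holding the same replicated log compute the same state.
replicated_log = [("x", 1), ("y", 2), ("x", 3)]
server_a = apply_log(replicated_log)
server_b = apply_log(replicated_log)
```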
Properties of consensus algorithm
❖ Never return an incorrect result: Safety
❖ Functional as long as any majority of the servers is
operational: Availability
❖ Do not depend on timing to ensure consistency (at
worst - unavailable)
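As a small worked example of the availability property, the sketch below computes the majority size and the number of server failures a cluster can tolerate. The function names are illustrative only.

```python
def majority(cluster_size):
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """Servers that may fail while a majority is still operational."""
    return cluster_size - majority(cluster_size)
```

Under this arithmetic a 5-server cluster stays available with 2 failed servers, while a 4-server cluster tolerates only 1, which is why odd cluster sizes are the usual choice.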
Raft in a nutshell
❖ Elect a leader
❖ Give full responsibility for replicated log to leader:
❖ Leader accepts log entries from client
❖ Leader replicates log entries on other servers
❖ Leader tells servers when it is safe to apply log entries to
their state machines
Raft basics: States
❖ Follower
❖ Candidate
❖ Leader
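The three states and the transitions between them can be sketched as a lookup table. The event names below are hypothetical labels chosen for this illustration; the transitions themselves follow the usual description of Raft.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# (current role, event) -> next role. Event names are illustrative assumptions.
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # start a new election
    (Role.CANDIDATE, "won_majority"): Role.LEADER,
    (Role.CANDIDATE, "saw_current_leader"): Role.FOLLOWER,
    (Role.CANDIDATE, "saw_higher_term"): Role.FOLLOWER,
    (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,
}


def next_role(role, event):
    """Return the role after the event; unknown pairs leave the role unchanged."""
    return TRANSITIONS.get((role, event), role)
```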
RequestVote RPC
❖ A follower becomes a candidate if it hears nothing from a leader
within the election timeout (randomized, e.g. 150-300 ms)
❖ Spec: (term, candidateId, lastLogIndex, lastLogTerm) -> (term, voteGranted)
❖ Receiver:
❖ false if term < currentTerm
❖ true if it has not yet voted in this term (or already voted for this
candidate) and the candidate's log is at least as up-to-date as its own
❖ false otherwise
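The receiver rules above can be sketched as follows. This is a simplified, in-memory illustration: the Server class and its field names are assumptions made for the example, log entries are (term, command) tuples with 1-based indexes, and persistence and networking are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    current_term: int = 0
    voted_for: object = None                 # candidate id voted for in current_term
    log: list = field(default_factory=list)  # list of (term, command)


def handle_request_vote(server, term, candidate_id, last_log_index, last_log_term):
    """Return (term, voteGranted) for a RequestVote call."""
    if term < server.current_term:
        return server.current_term, False    # stale candidate
    if term > server.current_term:
        server.current_term = term           # newer term: forget any earlier vote
        server.voted_for = None
    my_last_index = len(server.log)          # 0 means the log is empty
    my_last_term = server.log[-1][0] if server.log else 0
    # Up-to-date rule: compare last terms first, then last indexes.
    log_ok = (last_log_term, last_log_index) >= (my_last_term, my_last_index)
    if server.voted_for in (None, candidate_id) and log_ok:
        server.voted_for = candidate_id
        return server.current_term, True
    return server.current_term, False
```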
Leader Election
❖ Heartbeats from leader
❖ Election timeout
❖ Candidate results:
❖ It wins
❖ Another server wins
❖ No winner for a period of time -> new election
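A rough sketch of the two moving parts on this slide: picking a randomized election timeout and classifying the result of one election round. The function names and the string labels for the outcomes are assumptions for illustration.

```python
import random


def random_election_timeout(low_ms=150, high_ms=300, rng=random):
    """Pick a timeout in [low_ms, high_ms]; randomization makes split votes rare."""
    return rng.uniform(low_ms, high_ms)


def election_outcome(votes_granted, cluster_size, saw_valid_leader):
    """Classify one election round from the candidate's point of view."""
    if votes_granted >= cluster_size // 2 + 1:
        return "won"      # the candidate becomes leader
    if saw_valid_leader:
        return "lost"     # another server won; return to follower
    return "retry"        # no winner: time out and start a new election
```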
Leader Election Property
❖ Election Safety: at most one leader can be elected in a
given term
Log Replication
❖ log index
❖ log term
❖ log command
Log Replication
❖ Leader appends command from client to its log
❖ Leader issues AppendEntries RPC to replicate the entry
❖ Once the entry is replicated on a majority, the leader applies it
to its state machine -> “committed entry”
❖ Follower learns from the leader that the entry is committed ->
applies it locally
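How a leader might decide that an entry is committed can be sketched as below. The names are assumptions: match_index maps each follower to the highest log index known to be stored on it, and log_terms holds the term of each entry in the leader's log. The check that the entry belongs to the leader's current term is not on the slide; it comes from the Raft paper's commitment rule.

```python
def advance_commit_index(log_terms, current_term, commit_index, match_index):
    """Return the new commit index for the leader.

    log_terms[i - 1] is the term of the entry at 1-based index i.
    """
    # Include the leader's own last index, then sort from highest to lowest.
    indexes = sorted(list(match_index.values()) + [len(log_terms)], reverse=True)
    # The element at position len // 2 is stored on a majority of servers.
    candidate = indexes[len(indexes) // 2]
    if candidate > commit_index and log_terms[candidate - 1] == current_term:
        return candidate
    return commit_index
```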
Log Replication Properties
❖ Leader Append-Only: a leader never overwrites or
deletes entries in its log; it only appends new entries
❖ Log Matching: If two entries in different logs have the
same index and term, then they store the same
command.
❖ Log Matching: If two entries in different logs have the
same index and term, then the logs are identical in all
preceding entries.
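The Log Matching property is maintained by a consistency check on every AppendEntries call: the follower accepts new entries only if its log contains the entry that immediately precedes them. A minimal sketch, assuming logs are lists of (term, command) tuples with 1-based indexes:

```python
def consistency_check(log, prev_log_index, prev_log_term):
    """Return True if the follower's log holds the entry preceding the new ones."""
    if prev_log_index == 0:
        return True                      # appending from the very start of the log
    if prev_log_index > len(log):
        return False                     # the follower is missing entries
    return log[prev_log_index - 1][0] == prev_log_term
```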
Inconsistency
Solving Inconsistency
❖ Leader forces the followers’ logs to duplicate its own
❖ Conflicting entries in the follower logs will be
overwritten with entries from the Leader’s log:
❖ Find the latest log entry where the two logs agree
❖ Remove everything in the follower's log after this point
❖ Append the leader's entries after this point
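The three steps above can be sketched as a local simulation. In Raft itself the leader discovers the agreement point over several AppendEntries round trips; here both logs are compared directly for illustration, and the function names are assumptions. Logs are lists of (term, command) tuples.

```python
def last_agreed_index(leader_log, follower_log):
    """Highest 1-based index at which both logs hold an entry with the same term."""
    i = min(len(leader_log), len(follower_log))
    # By Log Matching, equal terms at index i imply equal logs up to i.
    while i > 0 and leader_log[i - 1][0] != follower_log[i - 1][0]:
        i -= 1
    return i


def repair_follower(leader_log, follower_log):
    """Drop the follower's conflicting suffix and append the leader's entries."""
    agreed = last_agreed_index(leader_log, follower_log)
    return follower_log[:agreed] + leader_log[agreed:]
```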
Safety Property
❖ Leader Completeness: if a log entry is committed in a
given term, then that entry will be present in the logs of
the leaders for all higher-numbered terms
❖ State Machine Safety: if a server has applied a log entry
at a given index to its state machine, no other server will
ever apply a different log entry for the same index.
Safety Restriction
❖ Up-to-date: if the logs have last entries with different
terms, then the log with the later term is more up-to-date.
If the logs end with the same term, then whichever
log is longer is more up-to-date.
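The comparison rule translates directly into code. This sketch answers the question a voter asks about a candidate: is the candidate's log at least as up-to-date as mine? The parameter names are illustrative.

```python
def is_at_least_as_up_to_date(cand_last_term, cand_last_index,
                              voter_last_term, voter_last_index):
    """Compare last terms first; on a tie, the longer log wins."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index
```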
Follower and Candidate Crashes
❖ RPCs to a crashed server are retried indefinitely
❖ Raft RPCs are idempotent, so a repeated request causes no harm
Raft Joint Consensus
❖ Two-phase approach to changing the membership configuration
❖ In Joint Consensus state:
❖ replicate log entries to both configurations
❖ any server from either configuration may be a leader
❖ agreement (for election and commitment) requires
separate majorities from both configurations
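The agreement rule during joint consensus can be sketched with sets of server ids. The function names are assumptions for this example; acks is the set of servers that granted a vote or acknowledged an entry.

```python
def has_majority(acks, config):
    """True if the acknowledging servers include a majority of config."""
    return len(acks & config) >= len(config) // 2 + 1


def joint_agreement(acks, old_config, new_config):
    """Joint consensus needs separate majorities in both configurations."""
    return has_majority(acks, old_config) and has_majority(acks, new_config)
```

With old = {a, b, c} and new = {c, d, e}, acknowledgements from a, b and c are a majority of the old configuration only, so they do not count as agreement.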
Log compaction
❖ Independent snapshotting
❖ Snapshot metadata
❖ InstallSnapshot RPC
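A simplified sketch of taking a snapshot on one server. It assumes the log has not been compacted before (so list position and log index line up), that log entries are (term, command) tuples, and that the state is a plain dict; the metadata field names follow the Raft paper's lastIncludedIndex and lastIncludedTerm.

```python
def take_snapshot(log, last_applied, state):
    """Fold entries 1..last_applied into a snapshot and return (snapshot, remaining log)."""
    snapshot = {
        # Metadata lets later AppendEntries consistency checks still work.
        "last_included_index": last_applied,
        "last_included_term": log[last_applied - 1][0] if last_applied else 0,
        "state": dict(state),  # copy of the state machine at last_applied
    }
    remaining_log = log[last_applied:]
    return snapshot, remaining_log
```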
Client Interaction
❖ Works with leader
❖ Leader responds to a request once it has committed the entry
❖ Client assigns a unique ID to every command; the leader stores the
latest ID together with its response.
❖ Read-only requests could be served without writing to the log, at the
risk of stale data. Alternatively, write a “no-op” entry at the start
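Why the unique ID matters: if a client retries a command after a timeout, the command must not be applied twice. A minimal sketch, assuming a hypothetical increment command and a per-client record of the latest command ID and response (the class and method names are illustrative):

```python
class DedupStateMachine:
    """Counter store that ignores a retried command with an already-seen ID."""

    def __init__(self):
        self.state = {}
        self.last_response = {}  # client_id -> (command_id, response)

    def apply(self, client_id, command_id, key, amount):
        seen = self.last_response.get(client_id)
        if seen is not None and seen[0] == command_id:
            return seen[1]       # duplicate: return the stored response
        self.state[key] = self.state.get(key, 0) + amount
        response = self.state[key]
        self.last_response[client_id] = (command_id, response)
        return response


sm = DedupStateMachine()
first = sm.apply("client-1", 1, "n", 5)
retry = sm.apply("client-1", 1, "n", 5)   # same ID: not applied again
second = sm.apply("client-1", 2, "n", 5)  # new ID: applied
```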
Correctness
❖ TLA+ formal specification of the consensus algorithm (400 LOC)
❖ Proof of linearizability in Raft using Coq (community)
Questions?
