Copyright Push Technology 2014
Principles of Reliable Communication & Shared State
What Can Go Wrong & Why You Should Care
Dr Andy Piper – CTO, Push Technology
andy@pushtechnology.com
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/communicatoin-shared-state
http://www.infoq.com/presentations/nasa-big-data
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
About Me?
• Dr Andy Piper
• CTO at Push Technology
• Ex-BEA/Oracle
• Architect for WebLogic Server Core
• Architect and then Engineering Director for
Oracle Event Processing
• Spring contributor and author
• Contributed to many standards – OMG, JCP, OSGi Alliance
• PhD, Cambridge, Distributed Systems
• MBA, Warwick Business School
Agenda
• Communication failure modes
• Basic fault tolerance
• Process failure and resilience
• Classic problems
• Reliable group communications
Communication
What Can Go Wrong?
What Can Go Wrong?
• Crash failure or fail-stop
– A participant disappears, but was ok up until then. Server reboot, JVM crash, data centre flood.
• Omission failure
– A participant fails to respond to an incoming message
– Receive omission
• A participant fails to receive incoming messages. Network failure, buffer overflow.
– Send omission
• A participant fails to send messages. Network failure, buffer overflow, queue overflow.
• Timing failure
– A participant’s response is outside the specified time interval. Stop-the-world GC
• Response failure
– A participant’s response is incorrect
– Value failure
• The value of the response is wrong. JVM bugs, concurrency bugs, over-zealous proxies.
– State transition failure
• A participant deviates from the correct flow of control. Application bugs, concurrency bugs.
• Arbitrary or Byzantine failure
– A participant produces arbitrary responses at arbitrary times. Malicious intent; application retries combined with network retries.
[Cristian, 1991; Hadzilacos & Toueg, 1993]
Why Does it Matter?
• The key question is “What effect does failure have on the state of my system?”
• Actions and failures without consequences can be ignored
• Actions and failures with consequences cannot be ignored
• Often difficult to tell which is which and what the consequences are
• Theoretical models which prove certain outcomes under certain conditions are therefore
essential
“Reports that say that something hasn't happened are always interesting to me,
because as we know, there are known knowns; there are things we know that we know.
There are known unknowns; that is to say, there are things that we now know we don't
know. But there are also unknown unknowns – there are things we do not know we
don't know.”
—United States Secretary of Defense, Donald Rumsfeld
Law of Unintended Consequences
• Robert K. Merton 1936
– Ignorance – impossible to anticipate everything (complexity)
– Error – incorrect analysis (complexity)
– Immediate interest – not considering long-term outcomes
– Basic values – prohibition of certain outcomes
– Self-defeating prophecy – fixing a problem that’s not there
• Without sound theoretical models the cure can be worse than the problem
– Addition of complexity to solve unanticipated edge cases increases overall risk of unknowns and hence failure
• The Rumsfeld effect
– Theoretical models are provably able to deal with all circumstances within the scope of the defined problem
– My own experience
• WebLogic clustering, sliding windows circa 1999 – major outages at globally visible brand
• But … sound theoretical models can be really hard to implement
– My own experience
• TOTEM implementation for WebLogic Event Server circa 2006
• KISS is generally a good policy where reliability is concerned
Coping With Failures – aka Fault Tolerance
• We can’t actually prevent failures – only mask the fact that they occurred
• If a failure can occur it generally will occur at some point and you have to either cope or not care
– Not caring is a cheap, viable strategy
– If you care then basing decisions on the happy path is largely pointless, get over it!
• The key to masking faults is redundancy
– Information redundancy – adding extra information to allow the original to be recovered
– Time redundancy – actions can be re-performed until success, e.g. transactions
– Physical redundancy – extra pieces are added in hardware or software to cope with loss
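Time redundancy is easy to picture as a retry loop. A minimal Python sketch (the names `retry` and `flaky` are illustrative, not from any particular library):

```python
import time

def retry(action, attempts=3, delay=0.0):
    """Time redundancy: re-perform an action until it succeeds
    or the retry budget is exhausted."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except OSError as err:   # in practice, catch only transient errors
            last_error = err
            time.sleep(delay)    # back off before retrying
    raise last_error

calls = {"n": 0}
def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient failure")
    return "ok"
```

The retry masks the first two failures; the caller sees only success.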
Reliable Communication
• It helps if your transport is reliable
– TCP/IP – reliable
– UDP – not reliable
– UDP Multicast – not reliable
• Reliable transports mask omission failures
• Unreliable transports do not!
– Not necessarily bad - some algorithms are tolerant of unreliable transports – e.g. TOTEM
• Crash failures are not masked by reliable transports
– Most algorithms focus on masking crash failures
Acks
[Diagram: client sends “Adrian?”; server replies “Hi!”]
Acks - What Can Go Wrong?
[Diagram: client sends “Adrian?”; server replies “Hi!” – but either message can be lost in transit]
Acks – What Can Go Wrong?
1. The server cannot be found
2. The request message is lost
3. The server crashes after it receives a request
4. The ack/reply is lost
5. The client crashes after sending a request
• If the server cannot be found then we cannot proceed, but we know nothing bad happened
• If a request message is lost we can resend after a certain time period
– Requires that the server deal with retransmissions if it wasn’t actually lost
• If the server crashes we need to know whether it did something before it crashed
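To see why the lost ack/reply (case 4) is the awkward one, consider a hypothetical non-idempotent server with no duplicate detection: the client cannot tell a lost request from a lost reply, so its resend duplicates the side effect (toy Python sketch, all names invented):

```python
class NaiveServer:
    """Hypothetical server with a side effect and no duplicate detection."""
    def __init__(self):
        self.balance = 0

    def handle(self, amount):
        self.balance += amount   # the side effect happens here
        return "ack"

def call_with_retry(server, amount, ack_lost):
    """The client resends when no ack arrives within its timeout.
    It cannot tell whether the request or only the ack was lost."""
    server.handle(amount)        # request arrives and takes effect...
    if ack_lost:                 # ...but the ack is dropped, so the client
        server.handle(amount)    # times out and resends: duplicate effect
    return "ack"

s = NaiveServer()
call_with_retry(s, 100, ack_lost=True)   # the money moves twice
```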
What Did The Server Do?
[Diagram: client sends “Adrian?”, hears nothing, and sends “Adrian?” again]
What Did The Server Do?
• Essentially we don’t know from a client perspective
• So what should we do?
– Three schools of thought
• Always retry – at-least-once semantics
• Never retry – at-most-once semantics
• Make no guarantees
• None of these are attractive
– We really want exactly-once semantics
– But in general we cannot arrange this
• Lost replies have similar problems
– Cannot distinguish between a reply that was truly lost and one that is merely slow
Idempotency
• One way round these issues is to guarantee that making a request multiple times is safe
• This is called idempotency
• If requests are idempotent they can always be retried and at-least-once semantics save the day
• For many distributed problems this works well, but not all requests can be idempotent
IDEMPOTENT
“We will transfer your money at least once”
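A toy sketch of the distinction, using a hypothetical `Account` class: an assignment-style operation is idempotent under at-least-once delivery, while an increment-style one (like a money transfer) is not:

```python
class Account:
    def __init__(self):
        self.balance = 0

    def add(self, amount):          # NOT idempotent: retries compound
        self.balance += amount

    def set_balance(self, value):   # idempotent: retries are harmless
        self.balance = value

a = Account()
for _ in range(3):       # at-least-once delivery: the same request, three times
    a.set_balance(100)   # final balance is still 100

b = Account()
for _ in range(3):
    b.add(100)           # retries compound: balance ends up at 300
```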
Sequencing
• Another way of de-duping multiple retries is to add sequence numbers to messages.
• This adds a per-client overhead and, more crucially, state to the server that must be maintained across a crash
• Sequence numbers do not resolve lost replies – replies must still be sent again
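A minimal sketch of server-side de-duplication with sequence numbers (hypothetical names; a real implementation must persist this state across crashes, and caches the last reply so a lost reply can simply be resent):

```python
class DedupingServer:
    """Discards retransmissions using per-client sequence numbers.
    last_seq/last_reply are exactly the per-client state that must
    survive a server crash."""
    def __init__(self):
        self.last_seq = {}    # client id -> highest sequence number seen
        self.last_reply = {}  # client id -> cached reply
        self.balance = 0

    def handle(self, client, seq, amount):
        if seq <= self.last_seq.get(client, -1):
            return self.last_reply[client]   # duplicate: resend cached reply
        self.balance += amount               # the side effect, applied once
        self.last_seq[client] = seq
        self.last_reply[client] = ("ack", seq)
        return ("ack", seq)

srv = DedupingServer()
srv.handle("c1", 0, 100)
r = srv.handle("c1", 0, 100)   # retransmission of the same request
```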
Reliable Communications to Fault Tolerant Systems
• Reliable communications is a building block for reliable systems as a whole
• Often systems builders are interested in data rather than communication
– Communication is a way of manipulating data
• Reliable systems need sender and receiver to be reliable, not just the communication layer
• In many circumstances reliable communication is not even a pre-requisite
– For a reliable system as a whole
• We define a system as k fault tolerant if it can survive k faults before failing
Process Resilience
• Making sender and receiver resilient requires process resilience
• The design principle is to create a group of homogeneous processes that can be used to tolerate process failure
• For well-behaved processes it is sufficient to have k+1 members to be k fault tolerant
• When a message is sent to a group all members receive it
• If a process fails another can take over because all have the same world view
• This simple premise involves a number of challenges
– Group membership needs to be maintained
• Membership can either be agreed or coordinated
• However leader election generally must be agreed anyway, or the leader becomes a single point of failure
– Group membership needs to be synchronized with messages
• Don’t send messages to crashed members
• Do send messages to new members
• Otherwise your world view is corrupted
– Group formation must survive catastrophe
– Managing groups requires communication!
• How easy is it to reach agreement?
Agreement and the Two-Army Problem
• Famous problem – the two-army problem
• The red army has 5000 men
• The blue army has two groups of 3000 men
• If the blue army can coordinate its attack it can win
• If either blue army attacks on its own it will be wiped out
• Communication is inherently unreliable
Agreement (or Consensus)
• Neither party can ever be sure that the other is going to do what they expect
– No matter how many steps the last one is always the decider
• Agreement between two parties is impossible even if the parties are perfect
– Note that only communications is unreliable, the parties always do the right thing
Agreement and the Byzantine Generals
• Another classic problem
• Let’s now assume that communications are reliable but the armies are not
• Still one red army but n blue generals head armies nearby
• Communications are pairwise, instantaneous and perfect
• m of the generals are traitors!
• Can the loyal generals reach agreement on an attack?
Agreement and the Byzantine Generals
• The generals can reach agreement, but only if enough of them agree
• Lamport (1982) showed that in general agreement can be reached but only if 2m+1 correctly
functioning processes are present
– Thus you need to have a minimum of 3m+1 processes to tolerate m faulty processes
– In other words more than 2/3 need to be working – hence a minimum of 4
– 100% reliability requires infinite processes
• Lamport also provided an algorithm that achieved this in certain circumstances
• Agreement is an active area of research and several famous algorithms exist
– The key is reaching agreement in as short a time as possible while minimizing communications
– For example Paxos – a Lamport invention, used by Google Chubby/Spanner, Heroku etc
– Another example is the Chandra–Toueg consensus algorithm
• http://www.andrew.cmu.edu/course/15-749/READINGS/required/resilience/lamport82.pdf
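The arithmetic of the bound can be captured in two one-liners (a sketch of the size calculation only, not of Lamport's algorithm itself):

```python
def min_processes(m):
    """Minimum total processes needed to tolerate m Byzantine (traitorous)
    processes: 3m + 1 (Lamport, Shostak & Pease, 1982)."""
    return 3 * m + 1

def tolerable_faults(n):
    """Largest m such that n >= 3m + 1 still holds."""
    return (n - 1) // 3
```

So 4 processes tolerate one traitor, while 6 still tolerate only one; full reliability would indeed need infinitely many.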
Consensus and Split Brain
• How common is byzantine failure?
• Actually pretty common when dealing with network partitions
• If a network partition occurs in a process group, both sides of the partition can believe they are the only ones alive
– Leads to inconsistent updates/messages
– Commonly known as split-brain behaviour
• Consensus is the only reasonable way of dealing rigorously with split-brain
– Typically consensus is separately reached on which side is active and the minority side fences itself
• Consensus is an expensive way of dealing with split-brain
– Potentially a significant number of servers cannot be used
– It’s easy to partition in a way that there are more than m faults
– Many systems instead optimistically ignore split-brain and reconcile updates on re-joining
• E.g. Hazelcast
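The fencing decision typically rests on a strict-majority rule, which is why "a significant number of servers cannot be used". A sketch of the invariant (hypothetical helper name):

```python
def has_quorum(side, cluster):
    """Majority quorum: a partition may stay active only if it holds a
    strict majority of the full cluster. At most one side can hold a
    majority, so both sides can never act at once; the minority fences
    itself."""
    return side > cluster // 2

# A 5-node cluster partitioned 3 / 2: only the larger side keeps running.
# A 4-node cluster split 2 / 2: neither side may proceed at all.
```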
Consensus and Failure Detection
• Read the small print!
• Most consensus algorithms assume failure detection
• Failure detection is a non-trivial problem
• Failure detectors can themselves suffer from failure
– False positives (especially with fail fast) and false negatives
– “Glitches”, for instance GC, can look like failures
• Some algorithms assume particular characteristics of failure detectors
• Some algorithms have failure detection built-in
"Quis custodiet ipsos custodes?"
FLP Impossibility
• Fischer, Lynch and Paterson proved that consensus in a fully asynchronous system cannot be
guaranteed within bounded time
• In practice this is highly unlikely to occur…
• …but illustrates the point that ultimately there are no guarantees
• …but wait – I thought you said Lamport proved it was possible with 3m+1 processes
• Yes – possible, but not guaranteed
– in other words algorithms will sometimes not terminate
• http://www.ict.kth.se/courses/ID2203/material/assignments/misunderstanding.pdf
Group Management to Reliable Group Communication
• Group management using consensus (or other means) is a necessary precondition to:
– Masking process failure
– Communicating reliably in a group
• Once we know who comprises a group how do we use that information to communicate reliably?
Reliable Group Communications
• In order to ensure that processes maintain the same world view you want to be able to communicate
reliably with them as a group
• Messages must either reach all or none of the non-faulty processes and each must know this
– Thus if a process fails the others know what it received and what to send it if it re-joins
– Absolutely cannot afford to have some processes receive messages and others not
Reliable Multicast with Unreliable Multicast
• As long as processes don’t fail it’s relatively easy to achieve reliable group communication via
reliable multicast
• Reliable multicast can be achieved using unreliable multicast
• In the simple case sender tags messages with sequence numbers and receiver ACKs
• This leads to feedback implosion
• A better solution is to NACK with timeouts
– Does require that the sender retain all messages
– Feedback implosion still occurs for larger numbers of participants
• An even better solution is to multicast NACKs with timeouts
– Avoids duplicate NACKs
• How likely is feedback implosion?
• Actually pretty likely
– WebLogic suffered from this circa 2003
– Feedback implosion also causes collisions with other multicast messages – multicast storms
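The receiver side of a NACK-based scheme can be sketched as gap detection over the sender's sequence numbers (hypothetical class; timers and the sender's retransmission buffer are omitted):

```python
class NackReceiver:
    """NACK-based reliable multicast, receiver side: instead of ACKing
    every message (feedback implosion), detect gaps in the sequence
    numbers and request only what is missing."""
    def __init__(self):
        self.next_expected = 0
        self.delivered = []
        self.missing = set()

    def receive(self, seq, payload):
        if seq > self.next_expected:
            # gap detected: NACK the missing range
            self.missing.update(range(self.next_expected, seq))
        self.missing.discard(seq)        # a retransmission fills its gap
        self.delivered.append((seq, payload))
        self.next_expected = max(self.next_expected, seq + 1)
        return sorted(self.missing)      # sequence numbers to NACK

r = NackReceiver()
r.receive(0, "a")
nacks = r.receive(2, "c")   # message 1 was lost, so NACK [1]
r.receive(1, "b")           # sender retransmits after the NACK
```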
Virtual Synchrony
• Process failure is less easy to handle
• Reliable multicast in the presence of process failures can be modelled as virtual synchrony
• Essentially each process has a consistent group view of membership and the processes to which a
message should be delivered
• View changes are then communicated and accepted interleaved with actual messages
• The key is that message delivery to the target application is not the same as receiving a message
Multicast Ordering and Atomic Multicast
• Virtual synchrony provides a synchronization primitive around group changes
– No messages may pass the barrier
– It means each process has a consistent view of the world
• But what about the ordering of messages? There are four possibilities
1. Unordered multicasts
2. FIFO-order multicasts - messages from the same process delivered in the order they were sent
3. Causally ordered multicasts - causality between messages preserved regardless of sender
4. Totally-ordered multicasts - messages delivered in the same order to all group members
• Virtual synchrony of reliable multicast with total ordering is called atomic multicast
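One textbook route to total order is a fixed sequencer, assumed here purely for illustration (protocols such as TOTEM achieve the same property differently, with a rotating token):

```python
class Sequencer:
    """Total ordering via a single sequencer: every multicast is stamped
    with a global sequence number, and members deliver in stamp order,
    so all see the identical sequence."""
    def __init__(self):
        self.next_stamp = 0

    def stamp(self, msg):
        stamped = (self.next_stamp, msg)
        self.next_stamp += 1
        return stamped

seq = Sequencer()
stamped = [seq.stamp("from-A"), seq.stamp("from-B")]  # concurrent senders

# Members may *receive* in different orders but deliver in stamp order:
member1 = sorted(stamped)
member2 = sorted(reversed(stamped))
```

Both members deliver the same totally-ordered sequence, regardless of arrival order.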
Implementing Virtual Synchrony
• Messages are only declared stable when each process has received the message
• Only stable messages can be delivered
• Stability is ensured through a two-phase-commit (2PC) style protocol across group view changes
• When a view change occurs each process with unstable messages sends them to the others
• Each process then sends a flush message
• When a process has received a flush message from every other it installs the new group view
• Isis is an example of a system using virtual synchrony
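The flush steps above can be caricatured in a few lines of Python (names invented; a single synchronous function stands in for the real exchange of unstable messages and flush messages):

```python
def install_view(processes, new_view):
    """View-change sketch: every process first resends its unstable
    messages so the whole group delivers the same set; each then sends a
    flush, and installs the new view once it holds a flush from every
    other member (modelled here as one synchronous step)."""
    resent = set()
    for p in processes:
        resent |= p["unstable"]        # 1. exchange unstable messages
    for p in processes:
        p["delivered"] |= resent       # everyone delivers the same set...
        p["unstable"] = set()          # ...and those messages are now stable
    for p in processes:
        p["view"] = new_view           # 2. flush barrier passed: install view

procs = [
    {"unstable": {"m1"}, "delivered": set(), "view": 1},
    {"unstable": set(),  "delivered": set(), "view": 1},
]
install_view(procs, new_view=2)
```

After the change, both processes enter view 2 having delivered m1, even though only one had it before.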
Summary
• The possibilities for communication failure are endless
• Failures can be masked through information, time or hardware redundancy
• Reliable communication is one dimension of reliability in general
• The problem needs to be considered holistically
• The problem needs to be considered rigorously as the opportunities for mistakes are high
“We know of no area in computer science or
mathematics in which informal reasoning is
more likely to lead to errors than in the study
of this type of algorithm”
- Lamport et al., 1982
One More Thing…
• The fallacies of distributed computing
– The network is reliable
– Latency is zero
– Bandwidth is infinite
– The network is secure
– Topology doesn't change
– There is one administrator
– Transport cost is zero
– The network is homogeneous
• It happens
– Don’t assume it doesn’t
– All solutions are compromises …
– … but understand the compromise
• Ideally have someone else solve the problem
– Many products do this well
Q&A
Más contenido relacionado

Más de C4Media

Más de C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Principles of Reliable Communication & Shared State

  • 1. Copyright Push Technology 2014 Principles of Reliable Communication & Shared State What Can Go Wrong & Why You Should Care Dr Andy Piper – CTO, Push Technology andy@pushtechnology.com
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /communicatoin-shared-state http://www.infoq.com/presentati ons/nasa-big-data http://www.infoq.com/presentati ons/nasa-big-data
  • 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 5. Copyright Push Technology 2014 About Me? • Dr Andy Piper • CTO at Push Technology • Ex-BEA/Oracle • Architect for WebLogic Server Core • Architect and then Engineering Director for Oracle Event Processing • Spring contributor and Author • Contributed to many standards – OMG, JCP, OSGi Alliance • PhD, Cambridge, Distributed Systems • MBA, Warwick Business School
  • 6. Copyright Push Technology 2014 Agenda • Communication failure modes • Basic fault tolerance • Process failure and resilience • Classic problems • Reliable group communications
  • 7. Copyright Push Technology 2014 Communication
  • 8. Copyright Push Technology 2014 What Can Go Wrong? ? ? ??
  • 9. Copyright Push Technology 2014 What Can Go Wrong? • Crash failure or fail-stop – A participant disappears, but was ok up until then. Server reboot, JVM crash, data centre flood. • Omission failure – A participant fails to respond to an incoming message – Receive omission • A participant fails to receive incoming messages. Network failure, buffer overflow. – Send omission • A participant fails to send messages. Network failure, buffer overflow, queue overflow. • Timing failure – A participant’s response is outside the specified time interval. Stop-the-world GC • Response failure – A participant’s response is incorrect – Value failure • The value of the response is wrong. JVM bugs, concurrency bugs, over-zealous proxies. – State transition failure • A participant deviates from the correct flow of control. Application bugs, concurrency bugs. • Arbitrary or Byzantine failure – A participant produces arbitrary responses at arbitrary time. Malicious intent. Application retries with network retries. [Cristian, 1991; Hadzilacos & Toueg, 1993]
  • 10. Copyright Push Technology 2014 Why Does it Matter? • The key question is “What effect does failure have on the state of my system?” • Actions and failures without consequences can be ignored • Actions and failures with consequences cannot be ignored • Often difficult to tell which is which and what the consequences are • Theoretical models which prove certain outcomes under certain conditions are therefore essential “Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know that we know. There are known unknowns; that is to say, there are things that we now know we don't know. But there are also unknown unknowns – there are things we do not know we don't know.” —United States Secretary of Defense, Donald Rumsfeld
  • 11. Copyright Push Technology 2014 Law of Unintended Consequences • Robert K. Merton 1936 – Ignorance – impossible to anticipate everything (complexity) – Error – incorrect analysis (complexity) – Immediate interest – not considering long-term outcomes – Basic values – prohibition of certain outcomes – Self-defeating prophecy – fixing a problem that’s not there • Without sound theoretical models the cure can be worse than the problem – Addition of complexity to solve unanticipated edge cases increases overall risk of unknowns and hence failure • The Rumsfeld effect – Theoretical models are provably able to deal with all circumstances within the scope of the defined problem – My own experience • WebLogic clustering, sliding windows circa 1999 – major outages at globally visible brand • But … sound theoretical models can be really hard to implement – My own experience • TOTEM implementation for WebLogic Event Server circa 2006 • Generally KISS is always good where reliability is concerned
  • 12. Copyright Push Technology 2014 Coping With Failures – aka Fault Tolerance • We can’t actually prevent failures – only mask the fact that they occurred • If a failure can occur it generally will occur at some point and you have to either cope or not care – Not caring is a cheap, viable strategy – If you care then basing decisions on the happy path is largely pointless, get over it! • The key to masking faults is redundancy – Information redundancy – adding extra information to allow the original to be recovered – Time redundancy – actions can be re-performed until success, e.g. transactions – Physical redundancy – extra pieces are added in hardware or software to cope with loss
  • 13. Copyright Push Technology 2014 Reliable Communication • It helps if your transport is reliable – TCP/IP – reliable – UDP – not reliable – UDP Multicast – not reliable • Reliable transports mask omission failures • Unreliable transports do not! – Not necessarily bad - some algorithms are tolerant of unreliable transports – e.g. TOTEM • Crash failures are not masked by reliable transports – Most algorithms focus on masking crash failures
  • 14. Copyright Push Technology 2014 Acks Adrian? Hi!
  • 15. Copyright Push Technology 2014 Acks - What Can Go Wrong? Adrian? Hi!  
  • 16. Copyright Push Technology 2014 Acks – What Can Go Wrong? 1. The server cannot be found 2. The request message is lost 3. The server crashes after it receives a request 4. The ack/reply is lost 5. The client crashes after sending a request • If the server cannot be found then we cannot proceed, but we know nothing bad happened • If a request message is lost we can resend after a certain time period – Requires that the server deal with retransmissions if it wasn’t actually lost • If the server crashes we need to know whether it did something before it crashed
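The resend-after-timeout strategy above can be sketched as follows (the channel and all names are hypothetical; here only the ack is lost, which from the client's viewpoint is indistinguishable from a lost request):

```python
import random

class LossyChannel:
    """Stands in for an unreliable network: the request always arrives,
    but the ack may be lost on the way back."""
    def __init__(self, loss_rate, seed=1):
        self.loss_rate = loss_rate
        self.rng = random.Random(seed)  # fixed seed for a repeatable demo

    def send(self, server, request):
        ack = server(request)                 # request reaches the server
        if self.rng.random() < self.loss_rate:
            return None                       # ...but the ack is lost
        return ack

def request_with_retries(channel, server, request, max_retries=5):
    """At-least-once semantics: resend after each timeout until acked."""
    for _ in range(max_retries):
        ack = channel.send(server, request)
        if ack is not None:
            return ack
    raise TimeoutError("no ack after %d attempts" % max_retries)

calls = []
def server(request):
    calls.append(request)   # executed once per delivery, retries included
    return "ack:" + request

result = request_with_retries(LossyChannel(loss_rate=0.5), server, "hello")
# With this seed the first ack is lost, so the server executes the
# request twice - exactly the retransmission problem the slide notes.
```

The client eventually gets its ack, but the server has done the work more than once; the following slides show the two standard ways out (idempotency and sequencing).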
  • 17. Copyright Push Technology 2014 What Did The Server Do? Adrian? Adrian?
• 18. Copyright Push Technology 2014 What Did The Server Do? • Essentially we don’t know from the client’s perspective • So what should we do? – Three schools of thought • Always retry – at-least-once semantics • Never retry – at-most-once semantics • Make no guarantees • None of these is attractive – We really want exactly-once semantics – But in general we cannot arrange this • Lost replies have similar problems – Cannot distinguish between a reply that was truly lost and one that is merely slow
  • 19. Copyright Push Technology 2014 Idempotency • One way round these issues is to guarantee that making a request multiple times is safe • This is called idempotency • If requests are idempotent they can always be retried and at-least-once semantics save the day • For many distributed problems this works well, but not all requests can be idempotent IDEMPOTENT “We will transfer your money at least once”
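A tiny sketch (class and method names hypothetical) of the distinction the slide draws: a "set"-style operation is naturally idempotent and safe under at-least-once retries, while an "add"-style operation (like a money transfer) is not:

```python
class Register:
    """Illustrates idempotent vs non-idempotent operations."""
    def __init__(self):
        self.value = 0

    def set_value(self, x):
        # Idempotent: applying it twice leaves the same state as once
        self.value = x

    def add(self, x):
        # Not idempotent: every retry changes the state again
        self.value += x

r = Register()
for _ in range(3):               # client retries under at-least-once delivery
    r.set_value(42)
idempotent_result = r.value      # still 42 - the retries were harmless

for _ in range(3):
    r.add(1)
non_idempotent_result = r.value  # 45 - each retry was applied
```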
  • 20. Copyright Push Technology 2014 Sequencing • Another way of de-duping multiple retries is to add sequence numbers to messages. • This adds a per-client overhead and, more crucially, state to the server that must be maintained across a crash • Sequence numbers do not resolve lost replies, they must always be sent again
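The sequencing scheme above can be sketched as follows (names hypothetical); note that the per-client table is exactly the state that would have to survive a server crash:

```python
class DedupingServer:
    """Discards retransmissions by tracking each client's highest
    sequence number seen so far."""
    def __init__(self):
        self.last_seq = {}   # per-client state: must be made crash-durable
        self.applied = []

    def handle(self, client_id, seq, op):
        if seq <= self.last_seq.get(client_id, -1):
            return "ack (duplicate ignored)"   # re-ack, but don't re-apply
        self.last_seq[client_id] = seq
        self.applied.append(op)
        return "ack"

server = DedupingServer()
server.handle("c1", 0, "debit 10")
server.handle("c1", 0, "debit 10")   # client retry after a lost ack
# The operation was applied exactly once despite the retransmission.
```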
  • 21. Copyright Push Technology 2014 Reliable Communications to Fault Tolerant Systems • Reliable communications is a building block for reliable systems as a whole • Often systems builders are interested in data rather than communication – Communication is a way of manipulating data • Reliable systems need sender and receiver to be reliable, not just the communication layer • In many circumstances reliable communication is not even a pre-requisite – For a reliable system as a whole • We define a system as k fault tolerant if it can survive k faults before failing
• 22. Copyright Push Technology 2014 Process Resilience • Making sender and receiver resilient requires process resilience • The design principle is to create a group of homogeneous processes that can be used to tolerate process failure • For well-behaved processes it is sufficient to have k+1 members to be k fault tolerant • When a message is sent to a group all members receive it • If a process fails another can take over because all have the same world view • This simple premise involves a number of challenges – Group membership needs to be maintained • Membership can either be agreed or coordinated • However leader election must generally be agreed anyway, or the leader becomes a single point of failure – Group membership needs to be synchronized with messages • Don’t send messages to crashed members • Do send messages to new members • Otherwise your world view is corrupted – Group formation must survive catastrophe – Managing groups requires communication! • How easy is it to reach agreement?
  • 23. Copyright Push Technology 2014 Agreement and the Two-Army Problem • Famous problem – the two-army problem • The red army has 5000 men • The blue army has two groups of 3000 men • If the blue army can coordinate its attack it can win • If either blue army attacks on its own it will be wiped out • Communication is inherently unreliable ?
  • 24. Copyright Push Technology 2014 Agreement (or Consensus) • Neither party can ever be sure that the other is going to do what they expect – No matter how many steps the last one is always the decider • Agreement between two parties is impossible even if the parties are perfect – Note that only communications is unreliable, the parties always do the right thing
  • 25. Copyright Push Technology 2014 Agreement and the Byzantine Generals • Another classic problem • Let’s now assume that communications are reliable but the armies are not • Still one red army but n blue generals head armies nearby • Communications are pairwise, instantaneous and perfect • m of the generals are traitors! • Can the loyal generals reach agreement on an attack?
• 26. Copyright Push Technology 2014 Agreement and the Byzantine Generals • The generals can reach agreement, but only if enough of them agree • Lamport (1982) showed that in general agreement can be reached but only if 2m+1 correctly functioning processes are present – Thus you need a minimum of 3m+1 processes to tolerate m faulty processes – In other words more than two-thirds need to be working – hence a minimum of 4 – 100% reliability requires infinite processes • Lamport also provided an algorithm that achieves this in certain circumstances • Agreement is an active area of research and several famous algorithms exist – The key is reaching agreement in as short a time as possible while minimizing communication – For example Paxos – a Lamport invention, used by Google Chubby/Spanner, Heroku, etc. – Another example is the Chandra–Toueg consensus algorithm • http://www.andrew.cmu.edu/course/15-749/READINGS/required/resilience/lamport82.pdf
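The 3m+1 bound is easy to capture as a check (function names hypothetical):

```python
def min_processes(m):
    """Minimum total processes needed to tolerate m Byzantine faults
    (Lamport 1982): 3m + 1."""
    return 3 * m + 1

def is_tolerable(n, m):
    """Can n processes survive m traitors? Needs more than 2/3 correct."""
    return n >= 3 * m + 1

# One traitor requires four generals in total (3*1 + 1)
four_needed = min_processes(1)
# Three processes cannot tolerate even a single Byzantine fault
three_enough = is_tolerable(3, 1)
```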
• 27. Copyright Push Technology 2014 Consensus and Split Brain • How common is byzantine failure? • Actually pretty common when dealing with network partitions • If a network partition occurs in a process group both sides of the partition can believe they are the only ones alive – Leads to inconsistent updates/messages – Commonly known as split-brain behaviour • Consensus is the only reasonable way of dealing rigorously with split-brain – Typically consensus is separately reached on which side is active and the minority side fences itself • Consensus is an expensive way of dealing with split-brain – Potentially a significant number of servers cannot be used – It’s easy to partition in a way that there are more than m faults – Many systems instead optimistically ignore split-brain and reconcile updates on re-joining • E.g. Hazelcast
• 28. Copyright Push Technology 2014 Consensus and Failure Detection • Read the small print! • Most consensus algorithms assume failure detection • Failure detection is a non-trivial problem • Failure detectors can themselves suffer from failure – False positives (especially with fail fast) and false negatives – “Glitches”, for instance GC, can look like failures • Some algorithms assume particular characteristics of failure detectors • Some algorithms have failure detection built-in "Quis custodiet ipsos custodes?"
• 29. Copyright Push Technology 2014 FLP Impossibility • Fischer, Lynch and Paterson proved that in a fully asynchronous system no deterministic algorithm can guarantee that consensus is reached if even one process may crash • In practice this is highly unlikely to occur… • …but it illustrates the point that ultimately there are no guarantees • But wait – I thought you said Lamport proved it was possible with 3m+1 processes • Yes – possible, but not guaranteed – in other words algorithms will sometimes not terminate • http://www.ict.kth.se/courses/ID2203/material/assignments/misunderstanding.pdf
  • 30. Copyright Push Technology 2014 Group management to reliable group communication • Group management using consensus (or other means) is a necessary precondition to: – Masking process failure – Communicating reliably in a group • Once we know who comprises a group how do we use that information to communicate reliably?
  • 31. Copyright Push Technology 2014 Reliable Group Communications • In order to ensure that processes maintain the same world view you want to be able to communicate reliably with them as a group • Messages must either reach all or none of the non-faulty processes and each must know this – Thus if a process fails the others know what it received and what to send it if it re-joins – Absolutely cannot afford to have some processes receive messages and others not
• 32. Copyright Push Technology 2014 Reliable Multicast with Unreliable Multicast • As long as processes don’t fail it’s relatively easy to achieve reliable group communication via reliable multicast • Reliable multicast can be achieved using unreliable multicast • In the simple case the sender tags messages with sequence numbers and each receiver ACKs • This leads to feedback implosion • A better solution is to NACK with timeouts – Does require that the sender retain all messages – Feedback implosion still occurs for larger numbers of participants • An even better solution is to multicast NACKs with timeouts – Avoids duplicate NACKs • How likely is feedback implosion? • Actually pretty likely – WebLogic suffered from this circa 2003 – Feedback implosion also causes collisions with other multicast messages – multicast storms
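A minimal receiver-side sketch (names hypothetical) of the NACK-with-timeouts idea: the receiver only generates feedback when it detects a gap in the sequence numbers, so the happy path produces no traffic at all:

```python
class Receiver:
    """Detects gaps in a multicast stream and NACKs the missing
    sequence numbers; out-of-order arrivals are buffered until the
    gap is filled, then delivered in order."""
    def __init__(self):
        self.expected = 0     # next in-order sequence number
        self.delivered = []
        self.buffer = {}      # out-of-order messages held back

    def on_message(self, seq, payload):
        nacks = []
        if seq > self.expected:
            # Gap detected: request retransmission of everything missing
            nacks = list(range(self.expected, seq))
            self.buffer[seq] = payload
        elif seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
            # Drain any buffered messages that are now in order
            while self.expected in self.buffer:
                self.delivered.append(self.buffer.pop(self.expected))
                self.expected += 1
        # In the multicast-NACK variant these would go to the whole group
        return nacks

r = Receiver()
r.on_message(0, "a")
gap = r.on_message(2, "c")   # message 1 was lost: NACK it
r.on_message(1, "b")         # retransmission fills the gap
```

Note the sender-side cost the slide mentions: the sender must retain every message it has sent until it knows no receiver can still NACK it.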
  • 33. Copyright Push Technology 2014 Virtual Synchrony • Process failure is less easy to handle • Reliable multicast in the presence of process failures can be modelled as virtual synchrony • Essentially each process has a consistent group view of membership and the processes to which a message should be delivered • View changes are then communicated and accepted interleaved with actual messages • The key is that message delivery to the target application is not the same as receiving a message
• 34. Copyright Push Technology 2014 Multicast Ordering and Atomic Multicast • Virtual synchrony provides a synchronization primitive around group changes – No messages may pass the barrier – It means each process has a consistent view of the world • But what about the ordering of messages? There are four possibilities 1. Unordered multicasts 2. FIFO-ordered multicasts - messages from the same process delivered in the order they were sent 3. Causally ordered multicasts - causality between messages preserved regardless of sender 4. Totally-ordered multicasts - messages delivered in the same order to all group members • Virtually synchronous reliable multicast with total ordering is called atomic multicast
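One common way to obtain total ordering (not the only one, and not specifically the deck's) is a sequencer process that stamps every message with a global sequence number; members may receive messages in different orders but all deliver in stamp order. A sketch, names hypothetical:

```python
class Sequencer:
    """Stamps each message with a globally unique, increasing sequence
    number before it is multicast to the group."""
    def __init__(self):
        self.next_seq = 0

    def stamp(self, msg):
        seq = self.next_seq
        self.next_seq += 1
        return (seq, msg)

def deliver_in_order(received):
    """Every member sorts by the global stamp, so all members deliver
    the same messages in the same order."""
    return [msg for _, msg in sorted(received)]

s = Sequencer()
a, b, c = s.stamp("x"), s.stamp("y"), s.stamp("z")
# Members may *receive* the messages in different orders...
member1 = deliver_in_order([b, a, c])
member2 = deliver_in_order([c, b, a])
# ...yet both deliver identically: total order.
```

The sequencer is simple but is itself a single point of failure, which is why group-management machinery (view changes, leader election) is still needed underneath.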
• 35. Copyright Push Technology 2014 Implementing Virtual Synchrony • Messages are only declared stable when each process has received the message • Only stable messages can be delivered • Stability is ensured through a kind of two-phase commit (2PC) protocol across group view changes • When a view change occurs each process with unstable messages sends them to the others • Each process then sends a flush message • When a process has received a flush message from every other process it installs the new group view • Isis is an example of a system using virtual synchrony
• 36. Copyright Push Technology 2014 Summary • The possibilities for communication failure are endless • Failures can be masked through information, time or hardware redundancy • Reliable communication is one dimension of reliability in general • The problem needs to be considered holistically • The problem needs to be considered rigorously as the opportunities for mistakes are high “We know of no area in computer science or mathematics in which informal reasoning is more likely to lead to errors than in the study of this type of algorithm” – Lamport et al., 1982
  • 37. Copyright Push Technology 2014 One More Thing… • The fallacies of distributed computing – The network is reliable – Latency is zero – Bandwidth is infinite – The network is secure – Topology doesn't change – There is one administrator – Transport cost is zero – The network is homogeneous • It happens – Don’t assume it doesn’t – All solutions are compromise … – … but understand the compromise • Ideally have someone else solve the problem – Many products do this well
  • 40. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/communicatoin -shared-state