From Mainframe to Microservice: An Introduction to Distributed Systems

From Mainframe to Microservice
An Introduction to
Distributed Systems
@tyler_treat
Workiva

An Introduction to Distributed Systems
❖ Building a foundation of understanding
❖ Why distributed systems?
❖ Universal fallacies
❖ Characteristics and the CAP theorem
❖ Common pitfalls
❖ Digging deeper
❖ Byzantine Generals Problem and consensus
❖ Split-brain
❖ Hybrid consistency models
❖ Scaling shared data and CRDTs

“A distributed system is one in which the failure of
a computer you didn't even know existed can
render your own computer unusable.”
–Leslie Lamport

Scale Up vs. Scale Out
Vertical Scaling
❖ Add resources to a node
❖ Increases node capacity, load
is unaffected
❖ System complexity unaffected
Horizontal Scaling
❖ Add nodes to a cluster
❖ Decreases load, capacity is
unaffected
❖ Availability and throughput w/
increased complexity

A distributed system
is a collection of independent computers
that behave as a single coherent system.

Why Distributed Systems?
Availability
Fault Tolerance
Throughput
Architecture
Economics
serve every request
resilient to failures
parallel computation
decoupled, focused services
scale-out becoming manageable/
cost-effective

“You have to design distributed systems with the
expectation of failure.”
–Ken Arnold

Distributed systems engineers are
the world’s biggest pessimists.

Universal Fallacy #1
The network is reliable.
❖ Message delivery is never guaranteed
❖ Best effort
❖ Is it worth it?
❖ Resiliency/redundancy/failover

Latency is zero.
❖ We cannot defy the laws of physics
❖ LAN to WAN deteriorates quickly
❖ Minimize network calls (batch)
❖ Design asynchronous systems

Bandwidth is infinite.
❖ Out of our control
❖ Limit message sizes
❖ Use message queueing

The network is secure.
❖ Everyone is out to get you
❖ Build in security from day 1
❖ Multi-layered
❖ Encrypt, pentest, train developers

Topology doesn’t change.
❖ Network topology is dynamic
❖ Don’t statically address hosts
❖ Collection of services, not nodes
❖ Service discovery

There is one administrator.
❖ May integrate with third-party systems
❖ “Is it our problem or theirs?”
❖ Conflicting policies/priorities
❖ Third parties constrain; weigh the risk

Transport cost is zero.
❖ Monetary and practical costs
❖ Building/maintaining a network is not
trivial
❖ The “perfect” system might be too costly

The network is homogenous.
❖ Networks are almost never homogenous
❖ Third-party integration?
❖ Consider interoperability
❖ Avoid proprietary protocols

These problems apply to LAN and WAN systems
(single-data-center and cross-data-center)
No one is safe.

“Anything that can go
wrong will go wrong.”
–Murphy’s Law

Characteristics of a Reliable Distributed System
Fault-tolerant
Available
Scalable
Consistent
Secure
Performant
nodes can fail
serve all the requests, all the time
behave correctly with changing
topologies
state is coordinated across nodes
access is authenticated
it’s fast!

Distributed systems are
all about trade-offs.

CAP Theorem
❖ Presented in 1998 by Eric
Brewer
❖ Impossible to guarantee
all three:
❖ Consistency
❖ Availability
❖ Partition tolerance

Consistency
❖ Linearizable - there exists a total order of all state
updates and each update appears atomic
❖ E.g. mutexes make operations appear atomic
❖ When operations are linearizable, we can assign a unique
“timestamp” to each one (total order)
❖ A system is consistent if every node shares the same
total order
❖ Consistency which is both global and instantaneous is
impossible

Consistency
Eventual consistency
replicas allowed to diverge,
eventually converge
Strong consistency
replicas can’t diverge;
requires linearizability

Availability
❖ Every request received by a non-failing node must be
served
❖ If a piece of data required for a request is unavailable,
the system is unavailable
❖ 100% availability is a myth

Partition Tolerance
❖ A partition is a split in the network—many causes
❖ Partition tolerance means partitions can happen
❖ CA is easy when your network is perfectly reliable
❖ Your network is not perfectly reliable

Common Pitfalls
❖ Halting failure - machine stops
❖ Network failure - network connection breaks
❖ Omission failure - messages are lost
❖ Timing failure - clock skew
❖ Byzantine failure - arbitrary failure

Digging Deeper
Exploring some higher-level concepts

Byzantine Generals Problem
❖ Consider a city under siege by two allied armies
❖ Each army has a general
❖ One general is the leader
❖ Armies must agree when to attack
❖ Must use messengers to communicate
❖ Messengers can be captured by defenders

Byzantine Generals Problem
❖ Send 100 messages, attack no matter what
❖ A might attack without B
❖ Send 100 messages, wait for acks, attack if confident
❖ B might attack without A
❖ Messages have overhead
❖ Can’t reliably make decision (provenly impossible)

Distributed Consensus
❖ Replace 2 generals with N
generals
❖ Nodes must agree on data
value
❖ Solutions:
❖ Multi-phase commit
❖ State replication

Two-Phase Commit
❖ Blocking protocol
❖ Coordinator waits for
cohorts
❖ Cohorts wait for
commit/rollback
❖ Can deadlock

Three-Phase Commit
❖ Non-blocking
protocol
❖ Abort on timeouts
❖ Susceptible to
network partitions

State Replication
❖ E.g. Paxos, Raft protocols
❖ Elect a leader (coordinator)
❖ All changes go through leader
❖ Each change appends log entry
❖ Each node has log replica

State Replication
❖ Must have quorum (majority)
to proceed
❖ Commit once quorum acks
❖ Quorums mitigate partitions
❖ Logs allow state to be rebuilt

Split-Brain
❖ Optimistic (AP) - let partitions work as usual
❖ Pessimistic (CP) - quorum partition works, fence others

Hybrid Consistency Models
❖ Weak == available, low latency, stale reads
❖ Strong == fresh reads, less available, high latency
❖ How do you choose a consistency model?
❖ Hybrid models
❖ Weaker models when possible (likes, followers, votes)
❖ Stronger models when necessary
❖ Tunable consistency models (Cassandra, Riak, etc.)

Scaling Shared Data
❖ Sharing mutable data at large scale is difficult
❖ Solutions:
❖ Immutable data
❖ Last write wins
❖ Application-level conflict resolution
❖ Causal ordering (e.g. vector clocks)
❖ Distributed data types (CRDTs)

Scaling Shared Data
Imagine a shared, global
counter…
“Get, add 1, and put”
transaction will not
scale

CRDT
❖ Conflict-free Replicated Data Type
❖ Convergent: state-based
❖ Commutative: operations-based
❖ E.g. distributed sets, lists, maps, counters
❖ Update concurrently w/o writer coordination

CRDT
❖ CRDTs always converge (provably)
❖ Operations commute (order doesn’t matter)
❖ Highly available, eventually consistent
❖ Always reach consistent state
❖ Drawbacks:
❖ Requires knowledge of all clients
❖ Must be associative, commutative, and idempotent

CRDT
❖ Add to set is associative, commutative, idempotent
❖ add(“a”), add(“b”), add(“a”) => {“a”, “b”}
❖ Adding and removing items is not
❖ add(“a”), remove(“a”) => {}
❖ remove(“a”), add(“a”) => {“a”}
❖ CRDTs require interpretation of common data
structures w/ limitations

Two-Phase Set
❖ Use two sets, one for adding, one for removing
❖ Elements can be added once and removed once
❖ {
“a”: [“a”, “b”, “c”],
“r”: [“a”]
}
❖ => {“b”, “c”}
❖ add(“a”), remove(“a”) => {“a”: [“a”], “r”: [“a”]}
❖ remove(“a”), add(“a”) => {“a”: [“a”], “r”: [“a”]}

Distributed architectures allow us to build
highly available, fault-tolerant systems.

We can't live in this fantasy land
where everything works perfectly
all of the time.

Shit happens — network partitions,
hardware failure, GC pauses,
latency, dropped packets…

Consider the trade-off between
consistency and availability.

Partition tolerance is not an option,
it’s required.
(if you’re building a distributed system)

Use weak consistency when possible,
strong when necessary.

Sharing data at scale is hard,
let’s go shopping.
(or consider your options)

Further Readings
❖ Jepsen series
Kyle Kingsbury (aphyr)
❖ A Comprehensive Study of Convergent and Commutative
Replicated Data Types
Shapiro et al.
❖ In Search of an Understandable Consensus Algorithm
Ongaro et al.
❖ CAP Twelve Years Later
Eric Brewer
❖ Many, many more…

Thanks!
@tyler_treat
github.com/tylertreat
bravenewgeek.com

From Mainframe to Microservice: An Introduction to Distributed Systems

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a From Mainframe to Microservice: An Introduction to Distributed Systems

Similar a From Mainframe to Microservice: An Introduction to Distributed Systems (20)

Más de Tyler Treat

Más de Tyler Treat (10)

Último

Último (20)

From Mainframe to Microservice: An Introduction to Distributed Systems