6. ● Several TB in Cassandra clusters with tens of nodes each
● Close to a million messages/second in streaming pipeline
● Several TB in Elasticsearch clusters with several hundred
nodes each
● Many PB archived to S3 every month
● Multi-AZ Multi-Region
● And growing…
Distributed Systems
8. “Need to run logical backup on a fleet without disruption
to ingress traffic”
“Run anti-entropy repair on Cassandra cluster without
spiking read latency”
“Reboot 1000 instances without taking a millennium, but
without bringing down the site either”
“Upgrade an Elasticsearch cluster from m3.medium to
m3.xlarge safely without downtime”
17. ● Schedulable
● Reusable
● Auditability
○ Not Ad-hoc
○ More Declarative, Less Imperative
■ Configuration Management
● Maintainability
● Observability
● Resilience
Desirable
18. ● Paramount*
● Serialized execution
○ ‘m’ out of ‘n’
○ Disjoint jobs.
● Avoid cascade
● Privilege escalation
● Pull-based
* Unless oncall is automated too.
Safety
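The "‘m’ out of ‘n’" and "Avoid cascade" bullets can be sketched as a rolling loop that touches at most m nodes at a time and halts as soon as a health check fails. This is only a minimal illustration, not Taskerman's code: `rolling_batches`, `run_rolling`, `action`, and `healthy` are hypothetical names, and a real fleet would coordinate across machines rather than inside one process.

```python
from typing import Callable, Iterator, List, Sequence


def rolling_batches(nodes: Sequence[str], m: int) -> Iterator[List[str]]:
    """Yield consecutive, disjoint batches of at most m nodes."""
    if m < 1:
        raise ValueError("m must be at least 1")
    for start in range(0, len(nodes), m):
        yield list(nodes[start:start + m])


def run_rolling(
    nodes: Sequence[str],
    m: int,
    action: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Apply action to m nodes at a time; stop at the first unhealthy node.

    Returns the nodes that were acted on and then passed the health check.
    """
    done: List[str] = []
    for batch in rolling_batches(nodes, m):
        for node in batch:
            action(node)
        for node in batch:
            if not healthy(node):
                return done  # halt the rollout so the damage stays in one batch
            done.append(node)
    return done
```

Keeping m small bounds the blast radius: if a batch comes back unhealthy, the remaining nodes are never touched.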
21. ● Network is reliable
● Latency is zero
● Bandwidth is infinite
● Network is secure
● One administrator
● Transport cost is zero
● Network is homogeneous
● Topology doesn't change
Fallacies of Distributed Systems
22. Quotes
There are 2 hard problems in computer science: cache
invalidation, naming things, and off-by-1 errors.
@secretGeek
There are only two hard problems in distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery
@mathiasverraes
35. ● The executor of Taskerman
● Dequeues a task and executes it
○ Pre-defined reviewed code.
● Cron-ed on node
● Zookeeper for coordination
● Task deleted upon success
○ Crash safety
● Dead letter queue
TaskRunner
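The "task deleted upon success" and "dead letter queue" bullets can be illustrated with a small in-memory loop. This is a sketch under stated assumptions, not the real TaskRunner: the queue is a local `deque` rather than a distributed one, and `run_once`, `MAX_ATTEMPTS`, and the retry budget of 3 are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_ATTEMPTS = 3  # assumed retry budget; the slides do not give a number


def run_once(
    queue: Deque[str],
    attempts: Dict[str, int],
    dead_letters: List[str],
    execute: Callable[[str], None],
) -> None:
    """Process the task at the head of the queue, if there is one."""
    if not queue:
        return
    task = queue[0]  # peek only: the task stays queued while it runs
    attempts[task] = attempts.get(task, 0) + 1
    try:
        execute(task)
    except Exception:
        if attempts[task] >= MAX_ATTEMPTS:
            queue.popleft()
            dead_letters.append(task)  # park it for a human to inspect
        return
    queue.popleft()  # delete only after the task has succeeded
```

Because the task is removed only after `execute` returns, a runner that crashes mid-task leaves it in the queue for the next cron-triggered run to pick up.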
36. class TestTaskRunner(TaskRunner):
        def __init__(self, task, *args, **kwargs):
            # State management and datastore-specific setup
            # (the remaining arguments are elided on the slide)
            pass

        def pre_check(self):
            # Is the task safe to execute on this cluster?
            pass

        def execute_action(self):
            # Actual execution of task:action
            pass

        def post_check(self):
            # Is the cluster healthy after execution, or is it on fire?
            pass
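One way the three hooks could be driven is a template method on the base class. The slide does not show `TaskRunner` itself, so everything here is an assumption: the `run` method, the boolean return values of the checks, the status strings, and the `NoopTaskRunner` subclass are all hypothetical.

```python
class TaskRunner:
    """Hypothetical base class that runs the three hooks in a fixed order."""

    def __init__(self, task):
        self.task = task

    def pre_check(self) -> bool:
        raise NotImplementedError

    def execute_action(self) -> None:
        raise NotImplementedError

    def post_check(self) -> bool:
        raise NotImplementedError

    def run(self) -> str:
        if not self.pre_check():
            return "skipped"  # cluster is not safe to touch; do nothing
        self.execute_action()
        if not self.post_check():
            return "failed"  # action ran but the cluster looks unhealthy
        return "succeeded"


class NoopTaskRunner(TaskRunner):
    """Toy subclass used only to exercise the control flow."""

    def __init__(self, task, cluster_healthy=True):
        super().__init__(task)
        self.cluster_healthy = cluster_healthy
        self.executed = False

    def pre_check(self) -> bool:
        return self.cluster_healthy

    def execute_action(self) -> None:
        self.executed = True

    def post_check(self) -> bool:
        return self.cluster_healthy
```

With this shape, `NoopTaskRunner("restart").run()` goes through all three hooks, while a runner created with `cluster_healthy=False` is skipped before any action is taken.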
46. ● Heartbeat ping
○ End-to-end monitoring
● Dead Letter Queue
○ Retry-based
○ Recycle bin of failed tasks.
○ Hooks into human side of
monitoring
● Status logging
Failure detection
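A heartbeat check reduces to comparing each runner's last ping against a maximum age. The sketch below assumes heartbeats are stored as a mapping from runner name to a timestamp in seconds; `stale_runners` and `max_age_s` are hypothetical names, and the slides do not say how Taskerman stores or evaluates heartbeats.

```python
from typing import Dict, List


def stale_runners(
    last_seen: Dict[str, float], now: float, max_age_s: float
) -> List[str]:
    """Return, sorted by name, the runners whose last heartbeat is too old."""
    return sorted(
        name for name, ts in last_seen.items() if now - ts > max_age_s
    )
```

A monitoring job could call this periodically and alert on a non-empty result, which also covers a runner that has silently stopped being scheduled.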
48. ● End-to-end logging
○ Un/structured
● Metrics
○ Counters
○ Queue lengths
● Aggregation and dashboards
● Staleness checks
● Dead Letter Queue
● Multi-modal Alerting
Monitoring
49. ● Restarts
● Reboots
● Scale Up
● Instance updates
● Kafka config reload
● Failure injection
● Backup and restore
● Search indexing
● … and many more.
Use cases