Taskerman: A Distributed Cluster Task Manager

Raghavendra D Prabhu
rprabhu@yelp.com
@randomsurfer
Distributed Systems

Yelp’s Mission
Connecting people with great
local businesses.

Cassandra
Elasticsearch
Zookeeper
PostgreSQL

● Memcached
● Redis
● Spark
● Redshift
● DynamoDB
● PaaStorm
● S3
And many more...

● Several TB in Cassandra clusters with tens of nodes each
● Close to a million messages/second in streaming pipeline
● Several TB in Elasticsearch with several hundred nodes in each
● Many PB archived to S3 every month
● Multi-AZ Multi-Region
● And growing…
Distributed Systems

“Need to run logical backup on a fleet without disruption
to ingress traffic”
“Run anti-entropy repair on Cassandra cluster without
spiking read latency”
“Reboot 1000 instances without taking a millennium, but without
bringing down the site either”
“Upgrade an Elasticsearch cluster from m3.medium to
m3.xlarge safely without downtime”

Maintenance Cost
Engineering Efficiency
Scalability

● Safe
● Security
● Generic and Extensible
● Distributed
● Loosely coupled
● Cluster awareness
Requirements

● Schedulable
● Reusable
● Auditability
○ Not Ad-hoc
○ More Declarative, Less Imperative
■ Configuration Management
● Maintainability
● Observability
● Resilience
Desirable

● Paramount*
● Serialized execution
○ ‘m’ out of ‘n’
○ Disjoint jobs.
● Avoid cascade
● Privilege escalation
● Pull-based
* Unless oncall is automated too.
Safety

● Network is reliable
● Latency is zero
● Bandwidth is infinite
● Network is secure
● One administrator
● Transport cost is zero
● Network is homogeneous
● Topology doesn't change
Fallacies of Distributed Computing

Quotes
There are 2 hard problems in computer science: cache
invalidation, naming things, and off-by-1 errors.
@secretGeek
There are only two hard problems in distributed
systems: 2. Exactly-once delivery 1. Guaranteed order
of messages 2. Exactly-once delivery
@mathiasverraes

● Scheduler
● Router
● Co-ordinator
● Transport
● Executor
● Error handler
● Configuration
● Monitoring
● Tooling
Building Blocks

[Diagram: Flow of task. Labels shown: Task Scheduler, Workqueue, Router, Queue, Node Queues (Q1, Q2, Q3), T1, T2, T3, Cluster, Lease, Zookeeper, EC2 API, Retries, Failure, Dead Letter Queue]

# Anatomy of a Taskerman Task
# Restart action for 2 nodes of the geo_counter
# Cassandra cluster owned by gsi
{
    'action': 'cassandra_task:restart',
    'version': 1.2,
    'limit': 2,
    'cluster_name': 'cassandra:geo_counter',
    'discovery': 'aws_tags',
    'owner': 'gsi',
    'task_id': <uuid>,
    'taskerman_params': {
        'action_args': {'force': True},         # force=True for restart
        'workqueue_args': {'retry_count': 3},   # retry_count for queue
    },
    'nodes': [],       # e.g. [a, b, c, d] to skip discovery
    'destnode': '',
}
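As a rough illustration of the task anatomy above, the following Python sketch builds a task of that shape and runs a few sanity checks on it. The helper names `make_task` and `validate_task`, the set of required keys, and the validation rules are assumptions made for this example; they are not Taskerman's actual API.

```python
import uuid

# Keys taken from the example task; treating them all as required is an assumption.
REQUIRED_KEYS = {'action', 'version', 'limit', 'cluster_name',
                 'discovery', 'owner', 'task_id'}


def make_task(action, cluster_name, owner, limit,
              discovery='aws_tags', nodes=None, **action_args):
    """Build a task dict shaped like the example task (hypothetical helper)."""
    return {
        'action': action,
        'version': 1.2,
        'limit': limit,
        'cluster_name': cluster_name,
        'discovery': discovery,
        'owner': owner,
        'task_id': str(uuid.uuid4()),
        'taskerman_params': {
            'action_args': action_args,
            'workqueue_args': {'retry_count': 3},
        },
        'nodes': list(nodes or []),  # a non-empty list skips discovery
        'destnode': '',
    }


def validate_task(task):
    """Return a list of problems; an empty list means the task looks well formed."""
    problems = ['missing key: %s' % key for key in sorted(REQUIRED_KEYS - set(task))]
    if ':' not in task.get('action', ''):
        problems.append("action should look like 'module:action'")
    if not isinstance(task.get('limit'), int) or task['limit'] < 1:
        problems.append('limit must be a positive integer')
    return problems


task = make_task('cassandra_task:restart', 'cassandra:geo_counter', 'gsi',
                 limit=2, force=True)
```

Validating at enqueue time is one way to keep malformed tasks out of the queues, where they would otherwise only surface as retries.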

● Runs on Chronos
● Emits a task
● Enqueues into global queue
● Ad-hoc invocation
● Deployment granularities
● Task tracking
● Yelpsoa-configs
Task Scheduler
PaaSTA

● AWS SQS
● Best-effort FIFO
● Reliable and simple
● Low latency
● Properties
○ Read without delete
○ Visibility timeout
○ Retry
○ Dead Letter Queue
WorkQueue
AWS SQS
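The queue properties listed above (read without delete, visibility timeout, retry, dead letter queue) can be modelled in a few lines. This is a toy in-memory model meant only to show the semantics; it is not the SQS API, and the class name `ToyQueue` and its parameters are invented for the example. Time is passed in explicitly so the behaviour is deterministic.

```python
class ToyQueue:
    """In-memory model of receive-without-delete, visibility timeout, and a DLQ."""

    def __init__(self, visibility_timeout=30.0, max_receives=3):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.messages = []      # each entry: [body, visible_at, receive_count]
        self.dead_letters = []  # bodies that exhausted their retries

    def send(self, body, now=0.0):
        self.messages.append([body, now, 0])

    def receive(self, now):
        """Return a visible message without deleting it, or None."""
        for msg in list(self.messages):
            body, visible_at, count = msg
            if visible_at > now:
                continue  # still hidden: another consumer is working on it
            if count >= self.max_receives:
                # Too many failed attempts: park it in the dead letter queue.
                self.messages.remove(msg)
                self.dead_letters.append(body)
                continue
            msg[1] = now + self.visibility_timeout  # hide while being worked on
            msg[2] = count + 1
            return body
        return None

    def delete(self, body):
        """Called by the consumer only after the work succeeded."""
        self.messages = [m for m in self.messages if m[0] != body]
```

If a consumer crashes before calling `delete`, the message simply becomes visible again after the timeout, which is what gives retry and crash safety. In real SQS the receive limit and the dead letter target are configured on the queue through its redrive policy.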

● Stateless Marathon worker
● Routes tasks to clusters
● ‘DNS’ of Taskerman
● At-least-once delivery
○ Crash safety
● Pluggable discovery
○ AWS
○ Smartstack
Task Router
PaaSTA
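One way to picture "pluggable discovery" and routing to per-node queues is a small registry of discovery functions keyed by name. The sketch below is an assumption about how such a router could look, not the actual Taskerman code: the `static` backend, the decorator, and the dict-of-lists standing in for node queues are all hypothetical.

```python
DISCOVERY_BACKENDS = {}


def discovery(name):
    """Register a discovery function under a name such as 'aws_tags'."""
    def register(func):
        DISCOVERY_BACKENDS[name] = func
        return func
    return register


@discovery('static')
def static_discovery(cluster_name):
    # Hypothetical backend with hard-coded membership, for illustration only.
    members = {'cassandra:geo_counter': ['node-a', 'node-b', 'node-c']}
    return members.get(cluster_name, [])


def route(task, node_queues):
    """Copy the task onto the queue of each target node and return the targets."""
    # An explicit node list on the task skips discovery.
    nodes = task.get('nodes') or DISCOVERY_BACKENDS[task['discovery']](task['cluster_name'])
    for node in nodes:
        node_queues.setdefault(node, []).append(dict(task, destnode=node))
    return nodes
```

Because the router keeps no state of its own, re-running `route` after a crash only produces duplicate deliveries, which is consistent with at-least-once delivery as long as the executing side tolerates duplicates.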

● The executor of Taskerman
● Dequeues a task and executes it
○ Pre-defined reviewed code.
● Cron-ed on node
● Zookeeper for coordination
● Task deleted upon success
○ Crash safety
● Dead letter queue
TaskRunner

class TestTaskRunner(TaskRunner):
    def __init__(self, task, *args, **kwargs):
        # State management and datastore-specific setup
        pass

    def pre_check(self):
        # Is the task safe to execute on this cluster?
        pass

    def execute_action(self):
        # Actual execution of task:action
        pass

    def post_check(self):
        # Is the cluster good after execution, or is it on fire?
        pass
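The three hooks above suggest a template-method flow: check, act, verify. The slides do not show the `TaskRunner` base class, so the version below is a guess at its shape. The `run` method, its return labels, and the `EchoTaskRunner` subclass are assumptions made for this sketch.

```python
class TaskRunner(object):
    """Hypothetical base class: runs pre_check, execute_action, post_check in order."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        return True

    def execute_action(self):
        raise NotImplementedError

    def post_check(self):
        return True

    def run(self):
        """Return 'skipped', 'failed', or 'succeeded' (labels are assumptions)."""
        if not self.pre_check():
            return 'skipped'  # not safe right now; leave the task for a retry
        self.execute_action()
        return 'succeeded' if self.post_check() else 'failed'


class EchoTaskRunner(TaskRunner):
    """Toy subclass that records the action instead of touching a cluster."""

    def __init__(self, task, healthy=True):
        super(EchoTaskRunner, self).__init__(task)
        self.healthy = healthy
        self.executed = []

    def pre_check(self):
        return self.healthy

    def execute_action(self):
        self.executed.append(self.task['action'])
```

Keeping the checks separate from the action means a datastore-specific runner only has to say what "safe" and "healthy" mean for its cluster.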

[Diagram: Flow of task. Labels shown: Task Scheduler, Workqueue, Router, Queue, Node Queues (Q1, Q2, Q3), T1, T2, T3, Cluster, Lease, EC2 API, Zookeeper, Retries, Failure, Dead Letter Queue]

● Distributed Coordinator
● Non-blocking lease
○ Time-based lease
○ Namespaces
● Ephemeral locks
● Atomic Counters
○ Statistics
○ Circuit breaker
Zookeeper
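A non-blocking, time-based lease with namespaces can be sketched without Zookeeper to show the intended behaviour: at most `limit` holders per namespace, no waiting, and automatic expiry so a crashed holder cannot block the rest. The `LeaseTable` class is an in-memory stand-in invented for this example; it does not use or imitate any Zookeeper client API, and time is passed in explicitly.

```python
class LeaseTable:
    """In-memory stand-in for lease state that would live in a coordinator."""

    def __init__(self):
        self.leases = {}  # namespace -> {holder: expires_at}

    def try_acquire(self, namespace, holder, limit, ttl, now):
        """Return True if holder gets (or renews) a lease; never blocks or waits."""
        active = self.leases.setdefault(namespace, {})
        # Drop expired leases first so a crashed holder cannot block others forever.
        for name in [n for n, expires_at in active.items() if expires_at <= now]:
            del active[name]
        if holder in active or len(active) < limit:
            active[holder] = now + ttl
            return True
        return False

    def release(self, namespace, holder):
        """Give the lease back early, for example after the task succeeded."""
        self.leases.get(namespace, {}).pop(holder, None)
```

A node that fails to get a lease simply returns and tries again on its next run, which fits an executor that is invoked periodically rather than one that waits on a lock.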

● Staleness
○ Nodes can go down
● Garbage collection
○ Cleanup of ZK data structures
● Composition
● Starvation
● Uptime
Zookeeper: Challenges

● Puppet
● Terraform & EC2
● Yelpsoa-configs
● SQS
● PaaSTA
● Jenkins
● AWS Lambda
Infrastructure
PaaSTA

● Multiple vectors of failure
● Idempotency
● Pessimistic approach
○ Job retry
● Separation of state
○ Highly available components
● Mutability
● Circuit breakers
Failure handling
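A circuit breaker, as named above, stops further work once failures pile up so that one bad action does not cascade across a cluster. The sketch below counts consecutive failures in memory; the threshold, the class name, and the manual `reset` are assumptions for illustration and say nothing about how Taskerman stores or trips its breakers.

```python
class CircuitBreaker:
    """Refuses new work once too many consecutive failures are recorded."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def call(self, func, *args, **kwargs):
        if self.is_open:
            raise RuntimeError('circuit open: refusing to run more tasks')
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success resets the consecutive-failure count
        return result

    def reset(self):
        """Close the breaker again, for example after a human has investigated."""
        self.failures = 0
```

Wrapping each task execution in `breaker.call(...)` keeps the pessimistic default: when in doubt, stop and let a person look.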

● Heartbeat ping
○ End-to-end monitoring
● Dead Letter Queue
○ Retry-based
○ Recycle bin of failed tasks.
○ Hooks into human side of monitoring
● Status logging
Failure detection

● End-to-end logging
○ Structured and unstructured
● Metrics
○ Counters
○ Queue lengths
● Aggregation and dashboards
● Staleness checks
● Dead Letter Queue
● Multi-modal Alerting
Monitoring
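A staleness check can be as simple as comparing the last time each node reported against the current time. The function below is a minimal sketch under that assumption; the name `stale_nodes`, the input shape, and the threshold are hypothetical.

```python
def stale_nodes(last_seen, now, max_age):
    """Return sorted names of nodes whose last report is older than max_age seconds."""
    return sorted(node for node, seen_at in last_seen.items()
                  if now - seen_at > max_age)


# Example: node-b last reported 900 seconds ago and exceeds a 600 second threshold.
report = stale_nodes({'node-a': 1000, 'node-b': 200}, now=1100, max_age=600)
```

The resulting list could feed a counter or an alert, alongside queue lengths and dead letter queue depth.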

● Restarts
● Reboots
● Scale Up
● Instance updates
● Kafka config reload
● Failure injection
● Backup and restore
● Search indexing
● ... and many more.
Use cases

● Safety
● Cassandra
● Elasticsearch
● Common issues
● Constraints
○ Limit
○ Healthcheck
○ Mutual exclusion
Scheduled Backups

Secure Infrastructure
$ uptime
06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07
$ ps -eo pid,cmd,lstart | grep ..
10058 zookeeper Tue Dec 5 05:23:43 2017

www.yelp.com/careers/
We're Hiring!

@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp

Q & A
● Slides will also be uploaded to slideshare.net/slidunder.

Q & A
❖ Q: What challenges remain with Taskerman?
➢ A:
❖ Q: …
➢ A: …

● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html
● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
● https://martinfowler.com/bliki/TwoHardThings.html
● https://zookeeper.apache.org/
● https://www.terraform.io/
● https://github.com/Yelp/service-principles
● https://en.wikipedia.org/wiki/Law_of_Demeter
Further Reading
