Taskerman: A Distributed Cluster Task Manager

This is the talk presented at Percona Live 2018: https://www.percona.com/live/18/sessions/taskerman-a-distributed-cluster-task-manager

Published in: Software

  1. Taskerman: A Distributed Cluster Task Manager
     Raghavendra D Prabhu, rprabhu@yelp.com, @randomsurfer
     Distributed Systems
  2. Yelp’s Mission
     Connecting people with great local businesses.
  3. Datastore Ecosystem @
  4. Cassandra, Elasticsearch, Zookeeper, PostgreSQL
  5. …
     ● Memcached
     ● Redis
     ● Spark
     ● Redshift
     ● DynamoDB
     ● PaaStorm
     ● S3
     And many more...
  6. Distributed Systems
     ● Several TB in Cassandra clusters with tens of nodes each
     ● Close to a million messages/second in streaming pipeline
     ● Several TB in Elasticsearch with several hundred nodes in each
     ● Many PB archived to S3 every month
     ● Multi-AZ Multi-Region
     ● And growing…
  7. “Need to run logical backup on a fleet without disruption to ingress traffic”
     “Run anti-entropy repair on Cassandra cluster without spiking read latency”
     “Reboot 1000 instances without taking a millennium, but without bringing down the site either”
     “Upgrade an Elasticsearch cluster from m3.medium to m3.xlarge safely without downtime”
  8. Pets vs Cattle
  9. Maintenance Cost Engineering Efficiency Scalability
  10. Taskerman
  11. Requirements
      ● Safe
      ● Security
      ● Generic and Extensible
      ● Distributed
      ● Loosely coupled
      ● Cluster awareness
  13. Desirable
      ● Schedulable
      ● Reusable
      ● Auditability
        ○ Not Ad-hoc
        ○ More Declarative, Less Imperative
          ■ Configuration Management
      ● Maintainability
      ● Observability
      ● Resilience
  14. Safety
      ● Paramount*
      ● Serialized execution
        ○ ‘m’ out of ‘n’
        ○ Disjoint jobs
      ● Avoid cascade
      ● Privilege escalation
      ● Pull-based
      * Unless oncall is automated too.
  15. Fallacies of Distributed Systems
      ● Network is reliable
      ● Latency is zero
      ● Bandwidth is infinite
      ● Network is secure
      ● One administrator
      ● Transport cost is zero
      ● Network is homogeneous
      ● Topology doesn't change
  16. Quotes
      “There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.” @secretGeek
      “There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery” @mathiasverraes
  17. Building Blocks
      ● Scheduler
      ● Router
      ● Co-ordinator
      ● Transport
      ● Executor
      ● Error handler
      ● Configuration
      ● Monitoring
      ● Tooling
  18. [Architecture diagram: flow of a task. Labels: Task Scheduler, Workqueue, Router, Queue, Cluster Node Queues (Q1, Q2, Q3), tasks (T1, T2, T3), Lease, Retries, Failure, Dead Letter Queue, Zookeeper, EC2 API]
  19. Anatomy of a Taskerman Task
      # Restart action for 2 nodes of the geo_counter
      # Cassandra cluster owned by gsi
      {
          'action': 'cassandra_task:restart',
          'version': 1.2,
          'limit': 2,
          'cluster_name': 'cassandra:geo_counter',
          'discovery': 'aws_tags',
          'owner': 'gsi',
          'task_id': <uuid>,
  20. Anatomy of a Taskerman Task
          'taskerman_params': {
              'action_args': {'force': true},        # force=true for restart
              'workqueue_args': {'retry_count': 3},  # retry_count for queue
          },
          'nodes': [],                               # [a,b,c,d] to skip discovery
          'destnode': '',
      }
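The task shown on the two slides above can be written out as a plain Python dictionary. The following is a minimal sketch, not Taskerman's actual code: the field names and values come from the slides, while `build_restart_task`, `validate_task`, and the set of required keys are hypothetical helpers added here for illustration.

```python
import uuid

# Assumption: which keys are mandatory is a guess; the slides do not say.
REQUIRED_KEYS = {"action", "version", "cluster_name", "discovery", "owner", "task_id"}


def build_restart_task(cluster_name, owner, limit=2, nodes=None):
    """Assemble a task shaped like the one on the slides."""
    return {
        "action": "cassandra_task:restart",
        "version": 1.2,
        "limit": limit,
        "cluster_name": cluster_name,
        "discovery": "aws_tags",
        "owner": owner,
        "task_id": str(uuid.uuid4()),
        "taskerman_params": {
            "action_args": {"force": True},
            "workqueue_args": {"retry_count": 3},
        },
        # An explicit node list skips discovery; empty means "discover".
        "nodes": list(nodes or []),
        "destnode": "",
    }


def validate_task(task):
    """Hypothetical sanity check: return a list of problems, empty if none."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(task))]
    if ":" not in task.get("action", ""):
        problems.append("action must look like '<module>:<action>'")
    if task.get("limit", 1) < 1:
        problems.append("limit must be at least 1")
    return problems


task = build_restart_task("cassandra:geo_counter", "gsi")
```

Validating the payload before it is enqueued is one way to keep malformed tasks out of the pipeline, in line with the declarative, non-ad-hoc goals listed earlier.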
  22. Task Scheduler (PaaSTA)
      ● Runs on Chronos
      ● Emits a task
      ● Enqueues into global queue
      ● Ad-hoc invocation
      ● Deployment granularities
      ● Task tracking
      ● Yelpsoa-configs
  24. WorkQueue (AWS SQS)
      ● AWS SQS
      ● Best-effort FIFO
      ● Reliable and simple
      ● Low latency
      ● Properties
        ○ Read without delete
        ○ Visibility timeout
        ○ Retry
        ○ Dead Letter Queue
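The queue properties listed on the WorkQueue slide (read without delete, visibility timeout, retry, dead letter queue) can be illustrated with a toy in-memory model. This is a sketch of the semantics only, not the SQS API: `ToyWorkQueue` and its parameters are made up for this example, and time is passed in explicitly so the behaviour is deterministic.

```python
class ToyWorkQueue:
    """In-memory model of the listed queue properties; not the SQS API."""

    def __init__(self, visibility_timeout=30.0, max_receives=3):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self._messages = []      # each: {"body", "visible_at", "receives"}
        self.dead_letters = []   # bodies of messages that ran out of retries

    def send(self, body, now=0.0):
        self._messages.append({"body": body, "visible_at": now, "receives": 0})

    def receive(self, now):
        """Return the first visible message without deleting it, or None."""
        for msg in list(self._messages):
            if msg["visible_at"] > now:
                continue  # still hidden by an earlier receive
            if msg["receives"] >= self.max_receives:
                # Retries exhausted: park the message in the dead letter queue.
                self._messages.remove(msg)
                self.dead_letters.append(msg["body"])
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg):
        """Acknowledge success; only now does the message leave the queue."""
        self._messages.remove(msg)
```

Because a received message is only hidden, not removed, a consumer that crashes before calling `delete` leaves the task to reappear after the timeout, which is what makes retries and crash safety fall out of the queue itself.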
  26. Task Router (PaaSTA)
      ● Stateless Marathon worker
      ● Routes tasks to clusters
      ● ‘DNS’ of Taskerman
      ● At-least once delivery
        ○ Crash safety
      ● Pluggable discovery
        ○ AWS
        ○ Smartstack
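The routing step with pluggable discovery might look roughly like the sketch below. Everything here is an assumption made for illustration: `StaticDiscovery`, `route`, and the idea that the router places one copy of the task per target node are not confirmed by the slides, which only state that the router is stateless, routes tasks to clusters, and supports AWS and Smartstack discovery.

```python
class StaticDiscovery:
    """Hypothetical discovery backend backed by a fixed mapping."""

    def __init__(self, clusters):
        self._clusters = clusters

    def nodes_for(self, cluster_name):
        return list(self._clusters.get(cluster_name, []))


def route(task, discovery_backends, node_queues):
    """Copy the task onto the queue of each target node; return the targets."""
    # An explicit 'nodes' list skips discovery, as noted in the task anatomy.
    targets = task.get("nodes") or discovery_backends[task["discovery"]].nodes_for(
        task["cluster_name"]
    )
    for node in targets:
        node_queues.setdefault(node, []).append(dict(task, destnode=node))
    return targets
```

If the caller deletes the task from the global queue only after `route` returns, a router crash midway leads to the task being routed again rather than lost. That gives at-least-once delivery, at the cost of possible duplicates that downstream steps must tolerate.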
  28. TaskRunner
      ● The executor of Taskerman
      ● Dequeues a task and executes it
        ○ Pre-defined, reviewed code
      ● Cron-ed on node
      ● Zookeeper for coordination
      ● Task deleted upon success
        ○ Crash safety
      ● Dead letter queue
  29. class TestTaskRunner(TaskRunner):
          def __init__(self, task,..):
              # State mgmt and datastore specific
          def pre_check(self):
              # Is the task safe to execute on this cluster
          def execute_action(self):
              # Actual execution of task:action
          def post_check(self):
              # cluster good after execution or is it on fire
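The three hooks named on the slide above (`pre_check`, `execute_action`, `post_check`) suggest a template-method structure. The sketch below is one way such a base class could drive them. The `run` method, its status strings, and `RestartRunner` are hypothetical; the slides show only the hook names.

```python
class TaskRunner:
    """Hypothetical base class that runs the three hooks in order."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        return True

    def execute_action(self):
        raise NotImplementedError

    def post_check(self):
        return True

    def run(self):
        """Return a status; only 'success' should lead to deleting the task."""
        if not self.pre_check():
            return "skipped"  # not safe right now, leave the task for a retry
        self.execute_action()
        return "success" if self.post_check() else "failed_post_check"


class RestartRunner(TaskRunner):
    """Toy subclass; the restart itself is replaced by a flag."""

    def __init__(self, task, cluster_healthy=True):
        super().__init__(task)
        self.cluster_healthy = cluster_healthy
        self.restarted = False

    def pre_check(self):
        return self.cluster_healthy

    def execute_action(self):
        self.restarted = True  # stand-in for the real restart

    def post_check(self):
        return self.restarted
```

Keeping the checks separate from the action means every datastore-specific runner gets the same safety envelope, and a failed `pre_check` never touches the node.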
  31. Zookeeper
      ● Distributed Coordinator
      ● Non Blocking Lease
        ○ Time-based lease
        ○ Namespaces
      ● Ephemeral locks
      ● Atomic Counters
        ○ Statistics
        ○ Circuit breaker
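A non-blocking, time-based lease that admits at most ‘m’ out of ‘n’ nodes per namespace can be sketched as follows. This is an in-memory stand-in written for illustration, not the Zookeeper API and not Taskerman's implementation; `LeaseTable` and its method names are hypothetical, and time is passed in explicitly.

```python
class LeaseTable:
    """In-memory stand-in for the coordinator; not the Zookeeper API."""

    def __init__(self):
        self._leases = {}  # namespace -> {holder: expiry time}

    def try_acquire(self, namespace, holder, limit, ttl, now):
        """Non-blocking: True if `holder` obtains one of `limit` leases."""
        holders = self._leases.setdefault(namespace, {})
        # Drop expired leases so a crashed holder cannot block others forever.
        for name in [h for h, expiry in holders.items() if expiry <= now]:
            del holders[name]
        if holder in holders or len(holders) < limit:
            holders[holder] = now + ttl  # acquire, or renew an existing lease
            return True
        return False

    def release(self, namespace, holder):
        self._leases.get(namespace, {}).pop(holder, None)
```

A node that fails to get a lease simply returns and tries again on its next run, which fits the cron-ed, pull-based executor described earlier. The expiry addresses the staleness concern of holders going down while holding a lease.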
  32. Zookeeper: Challenges
      ● Staleness
        ○ Nodes can go down
      ● Garbage collection
        ○ Cleanup of ZK data structures
      ● Composition
      ● Starvation
      ● Uptime
  33. Infrastructure
      ● Puppet
      ● Terraform & EC2
      ● Yelpsoa-configs
      ● SQS
      ● PaaSTA
      ● Jenkins
      ● AWS Lambda
  34. Failure handling
      ● Multiple vectors of failure
      ● Idempotency
      ● Pessimistic approach
        ○ Job retry
      ● Separation of state
        ○ Highly available components
      ● Mutability
      ● Circuit breakers
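The circuit-breaker idea from the failure handling slide can be shown with a small in-process sketch. It is an assumption-laden illustration: the slides indicate that the counters live in Zookeeper, whereas this `CircuitBreaker` class keeps its count in local memory and its threshold is arbitrary.

```python
class CircuitBreaker:
    """Counts consecutive failures and refuses work once a threshold is hit."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, func, *args):
        if self.open:
            raise RuntimeError("circuit open: refusing to run more tasks")
        try:
            result = func(*args)
        except Exception:
            self.failures += 1  # count the failure, then let it propagate
            raise
        self.failures = 0  # any success resets the streak
        return result
```

Once the breaker is open, further tasks are rejected until someone investigates and resets it, which limits how far a bad task can cascade across a cluster.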
  35. Debugging
  36. Failure detection
      ● Heartbeat ping
        ○ End-to-end monitoring
      ● Dead Letter Queue
        ○ Retry-based
        ○ Recycle bin of failed tasks
        ○ Hooks into human side of monitoring
      ● Status logging
  37. Monitoring
      ● End-to-end logging
        ○ Un/structured
      ● Metrics
        ○ Counters
        ○ Queue lengths
      ● Aggregation and dashboards
      ● Staleness checks
      ● Dead Letter Queue
      ● Multi-modal Alerting
  38. Use cases
      ● Restarts
      ● Reboots
      ● Scale Up
      ● Instance updates
      ● Kafka config reload
      ● Failure injection
      ● Backup and restore
      ● Search indexing
      ● ...and many more
  39. Scheduled Backups
      ● Safety
      ● Cassandra
      ● Elasticsearch
      ● Common issues
      ● Constraints
        ○ Limit
        ○ Healthcheck
        ○ Mutual exclusion
  40. Secure Infrastructure
      $ uptime
      06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07
      $ ps -eo pid,cmd,lstart | grep ..
      10058 zookeeper Tue Dec 5 05:23:43 2017
  41. We're Hiring!
      www.yelp.com/careers/
  42. @YelpEngineering, fb.com/YelpEngineers, engineeringblog.yelp.com, github.com/yelp
  43. Q & A
      ● Slides will also be uploaded to slideshare.net/slidunder.
  44. Q & A
      ❖ Q: What challenges remain with Taskerman?
        ➢ A:
      ❖ Q: …
        ➢ A: …
  47. Further Reading
      ● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html
      ● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
      ● https://martinfowler.com/bliki/TwoHardThings.html
      ● https://zookeeper.apache.org/
      ● https://www.terraform.io/
      ● https://github.com/Yelp/service-principles
      ● https://en.wikipedia.org/wiki/Law_of_Demeter