
Espresso Database Replication with Kafka, Tom Quiggle

The initial deployment of Espresso relies on MySQL’s built-in mechanism for Master-Slave replication. Storage hosts running MySQL masters service HTTP requests to store and retrieve documents, while hosts running slave replicas remain mostly idle. Since replication is at the MySQL instance level, masters and slaves must contain the exact same partitions – precluding flexible and dynamic partition placement and migration within the cluster.

Espresso is migrating to a new deployment topology in which each Storage Node may host a combination of master and slave partitions, distributing application requests evenly across all available hardware. This topology requires per-partition replication between master and slave nodes, and Kafka will be used as the replication transport.

As the replication stream for the source-of-truth store of LinkedIn’s most valuable data, Kafka must be as reliable as MySQL replication. The session will cover Kafka configuration options that ensure highly reliable, in-order message delivery. Additionally, the application logic maintains state, both within the Kafka event stream and externally, to detect message re-delivery, out-of-order delivery, and messages inserted out-of-band. These application protocols for guaranteeing high fidelity will be discussed.

Published in: Engineering


  1. ESPRESSO Database Replication with Kafka
     Tom Quiggle, Principal Staff Software Engineer
     tquiggle@linkedin.com | www.linkedin.com/in/tquiggle | @TomQuiggle
     Distributed Data Systems, ©2016 LinkedIn Corporation. All Rights Reserved.
  2. Agenda
     • ESPRESSO Overview
       – Architecture
       – GTIDs and SCNs
       – Per-instance replication (0.8)
       – Per-partition replication (1.0)
     • Kafka Per-Partition Replication
       – Requirements
       – Kafka Configuration
       – Message Protocol
       – Producer
       – Consumer
     • Q&A
  3. ESPRESSO Overview
  4. ESPRESSO¹
     • Hosted, Scalable, Data as a Service (DaaS) for LinkedIn’s Online Structured Data Needs
     • Databases are partitioned
     • Partitions are distributed across the available hardware
     • An HTTP proxy routes requests to the appropriate database node
     • Apache Helix provides centralized cluster management
     ¹ Elastic, Scalable, Performant, Reliable, Extensible, Stable, Speedy and Operational
  5. ESPRESSO Architecture
     [Diagram: HTTP clients send requests to a tier of Routers, which consult a routing table to reach Storage Nodes (API Server + MySQL). Apache Helix, backed by ZooKeeper, provides the control plane.]
  6. GTIDs and SCNs
     MySQL 5.6 Global Transaction Identifier:
     • A unique, monotonically increasing identifier for each committed transaction
     • GTID ::= source_id:transaction_id
     • ESPRESSO conventions:
       – source_id encodes the database name and partition number
       – transaction_id is a 64-bit numeric value: the high-order 32 bits are a generation count, the low-order 32 bits are the sequence within that generation
       – The generation increments with every change in mastership
       – The sequence increases with each transaction
       – We refer to the transaction_id component as a Sequence Commit Number (SCN)
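The generation/sequence packing is plain bit arithmetic. A minimal sketch of the convention in Java (class and method names are ours, for illustration only):

        public final class Scn {
            // Pack (generation, sequence) into the 64-bit transaction_id:
            // high-order 32 bits = generation, low-order 32 bits = sequence.
            public static long toTransactionId(int generation, int sequence) {
                return ((long) generation << 32) | (sequence & 0xFFFFFFFFL);
            }

            public static int generation(long transactionId) {
                return (int) (transactionId >>> 32);
            }

            public static int sequence(long transactionId) {
                return (int) transactionId; // low-order 32 bits
            }

            public static void main(String[] args) {
                long txnId = toTransactionId(3, 104);
                // Prints "3:104", the SCN notation used on the following slides.
                System.out.println(generation(txnId) + ":" + sequence(txnId));
            }
        }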
  7. GTIDs and SCNs
     Example binlog transaction:
        SET @@SESSION.GTID_NEXT= 'hash(db_part):(gen<<32 + seq)';
        SET TIMESTAMP=<seconds_since_Unix_epoch>
        BEGIN
        Table_map: `db_part`.`table1` mapped to number 1234
        Update_rows: table id 1234
        BINLOG '...'
        BINLOG '...'
        Table_map: `db_part`.`table2` mapped to number 5678
        Update_rows: table id 5678
        BINLOG '...'
        COMMIT
  8. ESPRESSO: 0.8 Per-Instance Replication
     [Diagram: two replica groups of three nodes each. Every node in a group holds the same full set of partitions (P1–P3 or P4–P6); one node masters all of them while the others hold slave replicas. Legend: Master / Slave / Offline.]
  9. ESPRESSO: 0.8 Per-Instance Replication
     [Diagram: a master node fails, taking its entire MySQL instance, and therefore all of its partitions, offline.]
  10. ESPRESSO: 0.8 Per-Instance Replication
      [Diagram: a single surviving slave instance is promoted to master for all of the failed node's partitions.]
  11. Issues with Per-Instance Replication
      • Poor resource utilization: only 1/3 of nodes service application requests
      • Partitions unnecessarily share fate
      • Cluster expansion is an arduous process
      • Upon node failure, 100% of the traffic is redirected to one node
  12. ESPRESSO: 1.0 Per-Partition Replication
      Per-instance MySQL replication is replaced with per-partition replication over Kafka.
      [Diagram: Helix LIVEINSTANCES and EXTERNALVIEW (e.g. P4: Master: 1, Slave: 3); Nodes 1–3 each host a mix of master and slave partitions (P1–P12), replicating through Kafka.]
  13. Cluster Expansion
      Initial state: 12 partitions, 3 storage nodes, replication factor r=2.
      [Diagram: Helix EXTERNALVIEW with P4: Master: 1, Slave: 3; partitions P1–P12 spread across Nodes 1–3. Legend: Master / Slave / Offline.]
  14. Cluster Expansion
      Adding a node: Helix sends OfflineToSlave transitions for the new node's partitions.
      [Diagram: Node 4 joins LIVEINSTANCES and begins bootstrapping slave replicas; P4: Master: 1, Slave: 3, Offline: 4.]
  15. Cluster Expansion
      Once a new replica is ready, transfer ownership and drop the old replica.
      [Diagram: mastership of P4 moves from Node 1 to Node 4; P4: Master: 4, Slave: 3.]
  16. Cluster Expansion
      Continue migrating master and slave partitions.
      [Diagram: further replicas move to Node 4; P9: Master: 3, Slave: 1, Offline: 4.]
  17. Cluster Expansion
      Rebalancing is complete after the last partition migration.
      [Diagram: partitions P1–P12 are spread evenly across Nodes 1–4; P9: Master: 4, Slave: 3.]
  18. Node Failover
      During failure or planned maintenance, slaves are promoted to master.
      [Diagram: steady state with Nodes 1–4 live; P9: Master: 4, Slave: 3.]
  19. Node Failover
      [Diagram: Node 3 drops out of LIVEINSTANCES; each of its master partitions is promoted on a different surviving node; P9: Master: 4, Offline: 3.]
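The OfflineToSlave and SlaveToMaster transitions above arrive as callbacks on each storage node via Helix's participant state machine. A hedged sketch of such a state model using Apache Helix's API; the callback bodies describe what an ESPRESSO node plausibly does and are not the actual implementation:

        import org.apache.helix.NotificationContext;
        import org.apache.helix.model.Message;
        import org.apache.helix.participant.statemachine.StateModel;
        import org.apache.helix.participant.statemachine.StateModelInfo;
        import org.apache.helix.participant.statemachine.Transition;

        // Sketch of a Helix MasterSlave participant state model.
        @StateModelInfo(initialState = "OFFLINE", states = {"MASTER", "SLAVE", "OFFLINE"})
        public class PartitionStateModel extends StateModel {

            @Transition(from = "OFFLINE", to = "SLAVE")
            public void onBecomeSlaveFromOffline(Message message, NotificationContext context) {
                // Bootstrap the partition, then consume its Kafka topic to catch up.
            }

            @Transition(from = "SLAVE", to = "MASTER")
            public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
                // Publish a control message with the new generation, consume until
                // it is read back (end of stream reached), then enable writes.
            }

            @Transition(from = "MASTER", to = "SLAVE")
            public void onBecomeSlaveFromMaster(Message message, NotificationContext context) {
                // Stop taking writes and resume consuming the partition's topic.
            }
        }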
  20. Advantages of Per-Partition Replication
      • Better hardware utilization: all nodes service application requests
      • Mastership hand-off is done in parallel
      • After a node failure, the full replication factor can be restored in parallel
      • Cluster expansion is as easy as: add node(s) to the cluster, then rebalance
      • A single platform for all Change Data Capture:
        – Internal replication
        – Cross-colo replication
        – Application CDC consumers
  21. Kafka Per-Partition Replication
  22. Kafka for Internal Replication
      [Diagram: on the master Storage Node, a client HTTP PUT/POST reaches the API Server and is committed to MySQL; an Open Replicator instance tails the MySQL binlog, and a Kafka producer publishes each binlog event as a Kafka message to the partition's Kafka topic. On the slave Storage Node, a Kafka consumer reads the partition and applies the events to its MySQL replica as SQL INSERT/UPDATE statements.]
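A hedged sketch of the binlog-to-Kafka leg of that pipeline using the Open Replicator listener API; the topic name, serialization, and connection details are assumptions, not the production code:

        import com.google.code.or.OpenReplicator;
        import com.google.code.or.binlog.BinlogEventV4;
        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;

        public class BinlogPublisher {
            public static void main(String[] args) throws Exception {
                Properties props = new Properties();
                props.put("bootstrap.servers", "kafka:9092"); // hypothetical address
                props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
                props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
                KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);

                OpenReplicator or = new OpenReplicator();
                or.setHost("localhost");
                or.setPort(3306);
                or.setUser("replicator");      // hypothetical credentials
                or.setPassword("secret");
                or.setServerId(6789);          // must be unique among replicas
                or.setBinlogFileName("mysql-bin.000001");
                or.setBinlogPosition(4);
                // Republish every binlog event to the partition's Kafka topic.
                or.setBinlogEventListener((BinlogEventV4 event) -> {
                    byte[] payload = event.toString().getBytes(); // real code would use a binary codec
                    producer.send(new ProducerRecord<>("replication-DB_0", payload));
                });
                or.start();
            }
        }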
  23. Requirements
      Delivery must be:
      • Guaranteed
      • In-order
      • Exactly once (sort of)
  24. Broker Configuration
      • Replication factor = 3 (most LinkedIn clusters use 2)
      • min.isr = 2
      • Disable unclean leader elections
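min.isr here is shorthand for Kafka's min.insync.replicas setting, and unclean leader elections are disabled with unclean.leader.election.enable=false. A sketch of creating a replication topic with these guarantees through the Java AdminClient; the topic name and bootstrap address are assumptions, and the original 0.9-era deployment would have set these in the broker's server.properties instead:

        import java.util.Map;
        import java.util.Properties;
        import java.util.Set;
        import org.apache.kafka.clients.admin.AdminClient;
        import org.apache.kafka.clients.admin.NewTopic;

        public class CreateReplicationTopic {
            public static void main(String[] args) throws Exception {
                Properties props = new Properties();
                props.put("bootstrap.servers", "kafka:9092"); // hypothetical address
                try (AdminClient admin = AdminClient.create(props)) {
                    NewTopic topic = new NewTopic("replication-DB_0", 1, (short) 3) // RF = 3
                            .configs(Map.of(
                                    "min.insync.replicas", "2",              // the slide's "min.isr=2"
                                    "unclean.leader.election.enable", "false"));
                    admin.createTopics(Set.of(topic)).all().get();
                }
            }
        }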
  25. Message Protocol
      Legend: B = begin txn, E = end txn, C = control message.
      [Diagram: the master's producer publishes transactions 3:100 through 3:104 to the partition's Kafka topic. A single-message transaction is marked B,E, while a multi-message transaction spans a B message, continuation messages, and an E message (e.g. 3:102 B, 3:102, 3:102 E). The slave's consumer applies them in order.]
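A sketch of the per-message metadata this protocol implies. Only the B/E/C roles and the generation:sequence SCN come from the talk; the field names and the use of a Java record are ours:

        // Hypothetical envelope for a replication message, as implied by the
        // protocol above: GTID components plus transaction markers.
        public record ReplicationMessage(
                int generation,    // increments on every mastership change
                int sequence,      // sequence within the generation
                boolean txnBegin,  // "B": first message of a transaction
                boolean txnEnd,    // "E": last message of a transaction ("B,E" = single-message txn)
                boolean control,   // "C": control message written during mastership handoff
                byte[] payload) {  // the binlog event itself (empty for control messages)

            public String scn() {
                return generation + ":" + sequence; // e.g. "3:104"
            }
        }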
  26. Message Protocol – Mastership Handoff
      [Diagram: the promoted slave publishes a control message, 4:0 C, announcing the new generation.]
  27. Message Protocol – Mastership Handoff
      The promoted slave consumes its own control message, confirming it has reached the end of the replication stream.
      [Diagram: the 4:0 C control message arrives back at the promoted slave's consumer.]
  28. Message Protocol – Mastership Handoff
      The new master enables writes with the new generation.
      [Diagram: the first write of the new generation, 4:0 B, follows the 4:0 C control message.]
  29. Kafka Producer Configuration
      • acks = "all"
      • retries = Integer.MAX_VALUE
      • block.on.buffer.full = true
      • max.in.flight.requests.per.connection = 1
      • linger.ms = 0
      • On a non-retryable exception:
        – destroy the producer
        – create a new producer
        – resume from the last checkpoint
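A sketch of a producer built with these settings. The bootstrap address and serializers are assumptions; note that block.on.buffer.full was the 0.9-era client's config name, later deprecated in favor of max.block.ms:

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;

        public final class ReplicationProducer {
            public static KafkaProducer<byte[], byte[]> create() {
                Properties props = new Properties();
                props.put("bootstrap.servers", "kafka:9092"); // hypothetical address
                props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
                props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
                props.put("acks", "all");                     // wait for all in-sync replicas
                props.put("retries", Integer.toString(Integer.MAX_VALUE)); // retry retryable errors forever
                props.put("max.in.flight.requests.per.connection", "1");   // preserve order across retries
                props.put("linger.ms", "0");                  // no batching delay
                props.put("block.on.buffer.full", "true");    // 0.9-era name; max.block.ms in later clients
                return new KafkaProducer<>(props);
            }
        }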
  30. Kafka Producer Checkpointing
      The producer periodically writes (SCN, Kafka offset) to a MySQL table, but may only checkpoint an offset at the end of a complete transaction.
      [Diagram: a mid-transaction position in the stream is marked "Can't Checkpoint Here".]
  31. Kafka Producer Checkpointing
      The Kafka offset is obtained from the producer callback, so the checkpoint lags the producer's current offset.
      [Diagram: the position after the last complete transaction is marked "Last Checkpoint Here".]
  32. Kafka Producer Checkpointing
      A send() fails at some point after the last checkpoint.
      [Diagram: send() FAILS while publishing 3:104.]
  33. Kafka Producer Checkpointing
      Recreate the producer and resume from the last checkpoint; messages after the checkpoint are replayed.
      [Diagram: the producer resumes from the checkpoint, starting again at 3:102 B.]
  34. Kafka Producer Checkpointing
      The Kafka stream now contains replayed transactions, possibly including partial transactions, and the producer can checkpoint again at the next transaction boundary.
      [Diagram: the original 3:102 B .. 3:104 messages are followed by replayed copies of 3:102 B through 3:104.]
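A sketch of that checkpoint discipline: remember the offset reported in the send callback, but persist the (SCN, offset) pair only when the acknowledged message ends a transaction. The CheckpointStore interface stands in for the MySQL checkpoint table and is hypothetical:

        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;

        public class CheckpointingPublisher {
            private final KafkaProducer<byte[], byte[]> producer;
            private final CheckpointStore store; // hypothetical (SCN, offset) MySQL table

            public CheckpointingPublisher(KafkaProducer<byte[], byte[]> producer,
                                          CheckpointStore store) {
                this.producer = producer;
                this.store = store;
            }

            public void publish(String topic, long scn, boolean endOfTxn, byte[] payload) {
                producer.send(new ProducerRecord<>(topic, payload), (metadata, exception) -> {
                    if (exception != null) {
                        // Non-retryable failure: the caller destroys this producer,
                        // creates a new one, and resumes from the last checkpoint
                        // (messages after the checkpoint are replayed).
                        return;
                    }
                    if (endOfTxn) {
                        // Safe point: this offset marks the end of a complete
                        // transaction, so a restart replays whole transactions only.
                        store.save(scn, metadata.offset());
                    }
                });
            }

            public interface CheckpointStore { // hypothetical persistence API
                void save(long scn, long kafkaOffset);
            }
        }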
  35. Kafka Consumer
      • Uses the low-level consumer
      • Consumes the Kafka partitions slaved on this node
      [Diagram: a single consumer thread (EspressoKafkaConsumer) polls Partitions 1–3 from the Kafka brokers and hands messages to per-partition EspressoReplicationApplier threads, which apply them to MySQL.]
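A sketch of that consume-and-dispatch loop using the Java consumer's manual partition assignment (poll(Duration) is the newer client API; the 0.9-era client used poll(long)). The applier hand-off interfaces are assumptions:

        import java.time.Duration;
        import java.util.List;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.common.TopicPartition;

        public class ReplicationConsumerLoop implements Runnable {
            private final KafkaConsumer<byte[], byte[]> consumer;
            private final Appliers appliers; // hypothetical per-partition applier pool

            public ReplicationConsumerLoop(Properties props,
                                           List<TopicPartition> slavedPartitions,
                                           Appliers appliers) {
                this.consumer = new KafkaConsumer<>(props);
                this.appliers = appliers;
                // Manual assignment: no consumer group, no broker-driven rebalancing.
                consumer.assign(slavedPartitions);
            }

            @Override
            public void run() {
                while (true) {
                    for (ConsumerRecord<byte[], byte[]> record :
                            consumer.poll(Duration.ofMillis(500))) {
                        // Hand each record to the applier thread for its partition.
                        appliers.forPartition(record.partition()).apply(record);
                    }
                }
            }

            public interface Appliers {  // hypothetical
                Applier forPartition(int partition);
            }

            public interface Applier {   // hypothetical: applies records to local MySQL
                void apply(ConsumerRecord<byte[], byte[]> record);
            }
        }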
  36. Kafka Consumer
      The slave updates its (SCN, Kafka offset) row for every committed transaction.
      [Diagram: the slave has committed through 3:101 at offset 2, recorded as 3:101@2.]
  37. Kafka Consumer
      The consumer only applies messages whose SCN is greater than the last committed SCN.
      [Diagram: with 3:103@6 committed, the consumer has begun applying transaction 3:104 when the replayed messages arrive.]
  38. Kafka Consumer
      The incomplete transaction is rolled back.
      [Diagram: ROLLBACK of 3:104 upon encountering the replayed messages.]
  39. Kafka Consumer
      Again, the consumer only applies messages whose SCN is greater than the last committed SCN.
      [Diagram: the replayed 3:102..3:103 messages are skipped, since 3:103@6 is already committed.]
  40. Kafka Consumer
      [Diagram: transaction 3:104 begins again from the replayed copy and this time runs to completion, through 3:104 E.]
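A sketch of the replay handling in slides 36–40: roll back an open transaction when a replay is detected, skip anything at or below the last committed SCN, and apply the rest. ReplicationMessage is the envelope sketched earlier; the ReplicaDb interface is a hypothetical stand-in for the local MySQL session:

        public class ReplayFilter {
            private long lastCommittedScn; // restored from the (SCN, offset) row
            private long openTxnScn = -1;  // SCN of the transaction being applied, if any

            public void onMessage(ReplicationMessage msg, ReplicaDb db) {
                if (msg.control()) {
                    return; // control messages drive handoff logic, not the database
                }
                long scn = ((long) msg.generation() << 32) | (msg.sequence() & 0xFFFFFFFFL);

                // A replay arriving mid-transaction (slide 38): roll back the
                // incomplete transaction and start over from the replayed copy.
                if (openTxnScn != -1
                        && (scn < openTxnScn || (scn == openTxnScn && msg.txnBegin()))) {
                    db.rollback();
                    openTxnScn = -1;
                }
                // Already committed (slide 39): skip, e.g. replayed 3:102..3:103.
                if (scn <= lastCommittedScn) {
                    return;
                }
                if (msg.txnBegin()) {
                    db.begin();
                    openTxnScn = scn;
                }
                db.apply(msg.payload());
                if (msg.txnEnd()) {
                    // Commit the transaction and the (SCN, offset) row together.
                    db.commitWithCheckpoint(scn);
                    lastCommittedScn = scn;
                    openTxnScn = -1;
                }
            }

            public interface ReplicaDb { // hypothetical
                void begin();
                void rollback();
                void apply(byte[] binlogEvent);
                void commitWithCheckpoint(long scn);
            }
        }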
  41. Zombie Write Filtering
      What if a stalled master continues writing after the transition?
  42. Zombie Write Filtering
      [Diagram: the master stalls mid-transaction, after publishing 3:104 B but before 3:104 E.]
  43. Zombie Write Filtering
      Helix sends a SlaveToMaster transition to one of the slaves.
      [Diagram: the promoted slave publishes the control message 4:0 C while the old master is still stalled.]
  44. Zombie Write Filtering
      The slave becomes master and starts taking writes.
      [Diagram: the new master publishes 4:1 B,E and 4:2 B .. 4:2 E.]
  45. Zombie Write Filtering
      The stalled master resumes and sends its buffered binlog entries to Kafka.
      [Diagram: zombie messages 3:104 E and 3:105 B,E appear in the stream after generation 4 has begun.]
  46. Zombie Write Filtering
      The former master goes into the ERROR state. Zombie writes are filtered by all consumers based on the increasing-SCN rule.
      [Diagram: consumers discard the generation-3 stragglers because generation-4 messages, through 4:3 B,E, have already been seen.]
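Because the generation occupies the SCN's high-order 32 bits, the increasing-SCN comparison filters zombies for free: any generation-3 message compares below the first generation-4 message. A simplified sketch (it ignores mid-transaction continuation messages, which share their transaction's SCN):

        public final class ZombieFilter {
            private long highestScnSeen; // packed (generation << 32 | sequence)

            // Returns true if the message should be applied, false if it is a
            // zombie write (or an already-seen duplicate).
            public boolean admit(int generation, int sequence) {
                long scn = ((long) generation << 32) | (sequence & 0xFFFFFFFFL);
                if (scn <= highestScnSeen) {
                    return false; // e.g. 3:104 E arriving after 4:2 E: generation 3 < 4
                }
                highestScnSeen = scn;
                return true;
            }
        }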
  47. Current Status
  48. ESPRESSO Kafka Replication: Current Status
      • Pre-production integration environment migrated to Kafka replication
      • 8 production clusters migrated (as of 4/11)
      • Migration will continue through Q3 2016
      • Average replication latency < 90 ms
  49. Conclusions
      • Configure Kafka for reliable, at-least-once delivery. See: http://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
      • Carefully control producer and consumer checkpoints along transaction boundaries
      • Embed sequence information in the message stream to implement exactly-once application of messages
  50. Even our workspace is Horizontally Scalable!
