
Kafka Reliability - When it absolutely, positively has to be there

Best practices for making sure messages don't get lost - ever.


  1. When it absolutely, positively has to be there: Reliability Guarantees in Apache Kafka. Gwen Shapira, Confluent (@gwenshap); Jeff Holoman, Cloudera (@jeffholoman)
  2. Kafka: High Throughput, Low Latency, Scalable, Centralized, Real-time
  3. “If data is the lifeblood of high technology, Apache Kafka is the circulatory system” - Todd Palino, Kafka SRE @ LinkedIn
  4. If Kafka is a critical piece of our pipeline: Can we be 100% sure that our data will get there? Can we lose messages? How do we verify? Whose fault is it?
  5. Distributed Systems: Things fail. Systems are designed to tolerate failure. We must expect failures, and design our code and configure our systems to handle them.
  6. Data flow, produce path: application thread -> Kafka client -> O/S socket buffer -> NIC (client machine) -> network -> NIC -> O/S socket buffer -> broker -> page cache -> disk (broker machine), with the ack or exception returned to an async callback. [Diagram: a failure (✗) can occur at every one of these hops.]
  7. Data flow, consume path: disk -> page cache -> broker -> O/S socket buffer -> NIC (broker machine) -> network -> NIC -> O/S socket buffer -> Kafka client -> application thread, with offsets tracked in ZooKeeper or Kafka. [Diagram: a failure (✗), including a page-cache miss, can occur at every one of these hops.]
  8. Replication is your friend: Kafka protects against failures by replicating data. The unit of replication is the partition. One replica is designated as the leader; follower replicas fetch data from the leader. The leader holds the list of “in-sync” replicas.
  9. Replication and ISRs: [Diagram: topic my_topic with 3 partitions and 3 replicas spread across brokers 100, 101 and 102. Partition 0: leader 100, ISR 101,102. Partition 1: leader 101, ISR 100,102. Partition 2: leader 102, ISR 101,100.]
  10. ISR: Two things make a replica in-sync: (1) lag behind the leader - replica.lag.time.max.ms (a replica that did not fetch or is behind) and replica.lag.max.messages (will go away in 0.9); (2) a connection to ZooKeeper.
  11. Terminology: Acked - producers will not retry sending; depends on the producer setting. Committed - consumers can read it; only when the message has reached all in-sync replicas. replica.lag.time.max.ms - how long can a dead replica prevent consumers from reading?
  12. Replication: acks = all only waits for the in-sync replicas to reply. [Diagram: replicas 1, 2 and 3 all have offset 100.]
  13. Replication: Replica 3 stopped replicating for some reason. [Diagram: replicas 1 and 2 have offsets 100-101; replica 3 stopped after 100.] Offset 100 is acked under acks = all and is “committed”; offset 101 is acked under acks = 1 but is not “committed”.
  14. Replication: One replica drops out of the ISR, or goes offline. All messages are now acked and committed. [Diagram: replicas 1 and 2 have offsets 100-101.]
  15. Replication: The second replica drops out, or is offline. [Diagram: replica 1 alone continues with offsets 100-104.]
  16. Replication: Now we’re in trouble. [Diagram: the last remaining replica fails (✗).]
  17. Replication: [Diagram: replica 3 comes back with only offset 100, replica 2 with 100-101.] All of the messages beyond that point were “acked” and “committed”.
  18. So what to do: Disable unclean leader election (unclean.leader.election.enable = false). Set the replication factor (default.replication.factor = 3). Set minimum ISRs (min.insync.replicas = 2).
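
To make slide 18 concrete, here is a minimal sketch that creates a topic with those settings applied. It assumes the Java AdminClient from later Kafka client releases (this deck predates it; at the time the equivalents were server.properties and kafka-topics.sh), and the broker address and topic name are placeholders.

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class ReliableTopicSetup {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, replication factor 3, and topic-level overrides
                // matching the slide's recommendations.
                NewTopic topic = new NewTopic("my_topic", 3, (short) 3)
                        .configs(Map.of(
                                "min.insync.replicas", "2",
                                "unclean.leader.election.enable", "false"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }
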
  19. Warning: min.insync.replicas is applied at the topic level. You must alter the topic configuration manually if the topic was created before the server-level change, and before 0.9.0 you must alter the topic manually in any case (KAFKA-2114).
  20. Replication with replication factor = 3 and min ISR = 2: [Diagram: replicas 1, 2 and 3 all have offset 100.]
  21. Replication: One replica drops out of the ISR, or goes offline. [Diagram: replicas 1 and 2 have offsets 100-101.]
  22. Replication: The second replica fails, or is out of sync. [Diagram: replica 1 has offsets 100-101, while messages 102-104 buffer in the producer.]
  23. Producer Internals: The producer sends batches of messages to a buffer. [Diagram: application threads call send(); messages M0-M3 are drained into batches 1-3; on failure a batch may be retried, and the future/callback is completed with metadata or an exception.]
  24. Basics: Durability can be configured with the producer configuration request.required.acks: 0 - the message is written to the network (buffer); 1 - the message is written to the leader; all - the producer gets an ack after all ISRs receive the data and the message is committed. Make sure the producer doesn’t just throw messages away: block.on.buffer.full = true.
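
As a hedged illustration of slide 24, here is a small producer-configuration sketch. It uses the new Java producer's property names (acks rather than request.required.acks); block.on.buffer.full belongs to the older client, while newer clients instead block for up to max.block.ms when the buffer is full. The broker address is a placeholder.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DurableProducerConfig {
        public static KafkaProducer<String, String> create() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");          // assumption: local broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            props.put("acks", "all");                                   // ack only once all ISRs have the data
            props.put("retries", Integer.toString(Integer.MAX_VALUE));  // retry instead of dropping messages
            props.put("max.in.flight.requests.per.connection", "1");    // avoid re-ordering on retry
            return new KafkaProducer<>(props);
        }
    }
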
  25. All calls are non-blocking and async. Two options for checking for failures: (1) immediately block for the response with send().get(); (2) do follow-up work in a Callback and close the producer after an error threshold - be careful about buffering these failures (future work: KAFKA-1955). Don’t forget to close the producer: producer.close() will block until in-flight requests complete. retries (new producer config) defaults to 0; message.send.max.retries (old producer config) defaults to 3. Multiple in-flight requests could lead to message re-ordering.
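
The two failure-checking options on slide 25 look roughly like this with the new Java producer; the broker address, topic, key and value are placeholders, and the error handling is deliberately minimal.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SendPatterns {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            props.put("acks", "all");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("my_topic", "key", "value");

            // Option 1: block immediately for the broker's response.
            try {
                RecordMetadata md = producer.send(record).get();
                System.out.println("acked at offset " + md.offset());
            } catch (Exception e) {
                e.printStackTrace();   // send failed after exhausting retries
            }

            // Option 2: stay asynchronous and handle failures in a callback.
            producer.send(record, new Callback() {
                @Override
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    if (exception != null) {
                        // Count or buffer the failure; shut down past an error threshold.
                        exception.printStackTrace();
                    }
                }
            });

            // Always close: blocks until in-flight requests complete.
            producer.close();
        }
    }
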
  26. Consumer: Two choices for the Consumer API - the Simple Consumer and the High-Level Consumer.
  27. Consumer Offsets: [Diagram: one consumer with threads 1-4 reading partitions P0, P2-P6.]
  28. Consumer Offsets: [Diagram: same setup - when should we commit?]
  29. Consumer Offsets: [Diagram: same setup - when should we commit?]
  30. Consumer Offsets: [Diagram: auto-commit enabled; a commit fires as one thread fails (✗).]
  31. Consumer Offsets: [Diagram: auto-commit enabled; the failed thread (✗).]
  32. Consumer Offsets: [Diagram: auto-commit enabled; the consumer picks up here, after the committed offset.]
  33. Consumer Offsets: [Diagram: a commit.]
  34. Consumer Offsets: [Diagram: one commit records offsets for all threads.]
  35. Consumer Offsets: [Diagram: auto-commit DISABLED; consumers 1-4 each handle their own commit.]
  36. Consumer Recommendations: Set auto.commit.enable = false. Manually commit offsets after the message data is processed/persisted: consumer.commitOffsets(). Run each consumer in its own thread.
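
A rough sketch of these recommendations with the old (0.8-era) high-level consumer, which is the API consumer.commitOffsets() comes from. The ZooKeeper address, group id and topic are placeholders, and process() stands in for the application's own logic.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class ManualCommitConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181");
            props.put("group.id", "reliable-group");
            props.put("auto.commit.enable", "false");     // we commit ourselves

            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
            topicCountMap.put("my_topic", 1);             // one stream, consumed by this one thread
            List<KafkaStream<byte[], byte[]>> streams =
                    connector.createMessageStreams(topicCountMap).get("my_topic");

            ConsumerIterator<byte[], byte[]> it = streams.get(0).iterator();
            while (it.hasNext()) {
                MessageAndMetadata<byte[], byte[]> msg = it.next();
                process(msg.message());                   // persist / process first ...
                connector.commitOffsets();                // ... then commit the offset
            }
        }

        private static void process(byte[] message) { /* application logic */ }
    }
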
  37. New Consumer! No ZooKeeper, at all. Rebalance listener. Commit variants: commit, async commit, commit(offsets). seek(offset).
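
For comparison, the same "process first, commit second" loop with the new consumer described on slide 37, including a rebalance listener. Broker address, group id and topic are placeholders; the poll(Duration) signature shown is from later client versions (the 0.9 consumer used poll(long)).

    import java.time.Duration;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class NewConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "reliable-group");
            props.put("enable.auto.commit", "false");      // no auto-commit; we commit after processing
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("my_topic"),
                    new ConsumerRebalanceListener() {
                        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                            // Last chance to commit before these partitions move away.
                        }
                        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                            // Could seek() to offsets stored elsewhere here.
                        }
                    });

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process / persist record.value() first ...
                }
                consumer.commitSync();   // ... then commit; commitAsync() is the non-blocking variant
            }
        }
    }
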
  38. Exactly Once Semantics: At-most-once is easy. At-least-once is not bad either - commit after you are 100% sure the data is safe. Exactly-once is tricky: commit data and offsets in one transaction, or use an idempotent producer.
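
The "commit data and offsets in one transaction" idea can be sketched like this: the consumer manages its own offsets, stores them in the same external transaction as the processed results, and seeks back to the stored position on restart. The helpers loadStoredOffset and saveResultAndOffset, the topic and the partition are hypothetical placeholders for whatever transactional store the application uses.

    import java.time.Duration;
    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class TransactionalOffsetsSketch {
        // Expects a consumer configured with enable.auto.commit = false.
        public static void run(KafkaConsumer<String, String> consumer) {
            TopicPartition partition = new TopicPartition("my_topic", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, loadStoredOffset(partition));   // resume from our own store

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // One external transaction writes both the result and the next offset,
                    // so the two can never disagree.
                    saveResultAndOffset(record.value(), record.partition(), record.offset() + 1);
                }
            }
        }

        private static long loadStoredOffset(TopicPartition tp) { return 0L; }                // placeholder
        private static void saveResultAndOffset(String value, int partition, long offset) { } // placeholder
    }
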
  39. Monitoring for Data Loss: Monitor for producer errors - watch the retry numbers. Monitor consumer lag - MaxLag or via offsets. Use a standard schema: each message should contain a timestamp and the originating service and host. Each producer can report message counts and offsets to a special topic; a “monitoring consumer” reports message counts to another special topic; “important consumers” also report message counts. Reconcile the results.
  40. Be Safe, Not Sorry: acks = all; block.on.buffer.full = true; retries = MAX_INT; (max.in.flight.requests.per.connection = 1); producer.close(); replication factor >= 3; min.insync.replicas = 2; unclean.leader.election.enable = false; disable auto offset commit and commit after processing; monitor!
