Common issues with Apache Kafka® Producer
Badai Aqrandista, Senior Technical Support Engineer
Introduction
• My name is BADAI AQRANDISTA
• I started as a web developer in 2005, building websites with Perl and PHP.
• I have experience supporting applications in Linux/UNIX environments, ranging from a hotel
booking engine and a telecommunication billing system to a mining equipment monitoring
system.
• Currently working for Confluent as Senior Technical
Support Engineer.
Kafka in a nutshell
• Kafka is a Pub/Sub system
• A Kafka Producer sends records to a Kafka broker
• A Kafka Consumer fetches records from a Kafka broker
• A Kafka broker persists the data it receives until the retention period expires
(Diagram labels: PRODUCER, CONSUMER)
Kafka Producer Internals
• KafkaProducer API:
• public Future<RecordMetadata> send(ProducerRecord<K,V> record)
• public Future<RecordMetadata> send(ProducerRecord<K,V> record, Callback callback)
• KafkaProducer#send method is asynchronous.
• It does not immediately send the record to Kafka broker.
• It puts the record in an internal queue, and an internal thread later sends multiple queued
records to the broker as a batch.
(Diagram: a Batch containing several Records, each with a Key and a Value)
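The asynchronous behaviour described above can be illustrated with a toy model. This is only a sketch using the Java standard library, not the Kafka client: the class ToyProducer, its flush() return value and the made-up offsets are all assumptions introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Toy model only: NOT the Kafka client. It illustrates that send() queues the
// record and returns a future immediately, and that records leave as a batch.
class ToyProducer {
    private final List<String> queue = new ArrayList<>();
    private final List<CompletableFuture<Integer>> pending = new ArrayList<>();
    private int nextOffset = 0;

    // Returns at once; the future completes only when the batch is "sent".
    CompletableFuture<Integer> send(String value) {
        CompletableFuture<Integer> future = new CompletableFuture<>();
        queue.add(value);
        pending.add(future);
        return future;
    }

    // Stand-in for the internal sender thread: ships every queued record as
    // one batch, completes each future with a made-up offset, and returns the
    // number of records in the batch.
    int flush() {
        int batchSize = queue.size();
        for (CompletableFuture<Integer> future : pending) {
            future.complete(nextOffset++);
        }
        queue.clear();
        pending.clear();
        return batchSize;
    }

    public static void main(String[] args) {
        ToyProducer producer = new ToyProducer();
        CompletableFuture<Integer> a = producer.send("a");
        CompletableFuture<Integer> b = producer.send("b");
        System.out.println(a.isDone());        // false: nothing has been sent yet
        System.out.println(producer.flush());  // 2: both records leave in one batch
        System.out.println(b.join());          // 1: the second record's made-up offset
    }
}
```

With the real KafkaProducer, the same pattern means that an error for a record may only surface later, through the returned Future or the Callback, rather than from the send() call itself.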
• Each Kafka Producer batch corresponds to one partition.
• Kafka Producer determines which batch to append a record to based on the record key.
• If the record key is null, Kafka Producer chooses the partition itself. Older clients spread such
records across partitions; clients since Apache Kafka 2.4 fill a batch for one partition before
switching to another.
• If the record key is not null, Kafka Producer uses the hash of the record key to determine
the partition number.
• One or more batches are sent to the Kafka broker in a PRODUCE request.
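As a rough sketch of the keyed case: the partition is the key hash reduced modulo the number of partitions. The class and method names below are hypothetical, and Arrays.hashCode is only a stand-in so the example runs with the standard library; the real Java client hashes the serialized key with murmur2, so the partition numbers this sketch produces will not match a real cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class KeyPartitionSketch {
    // Returns the partition for a keyed record, or -1 when the key is null,
    // meaning "the producer picks" (which this sketch does not model).
    static int partitionFor(String key, int numPartitions) {
        if (key == null) {
            return -1;
        }
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // Stand-in hash (assumption): the real client uses murmur2 here.
        int hash = Arrays.hashCode(keyBytes);
        // Clear the sign bit so the modulo result is never negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int first = partitionFor("user-42", 6);
        int second = partitionFor("user-42", 6);
        // The same key always maps to the same partition.
        System.out.println(first == second);
    }
}
```

The practical consequence is that all records with the same key land in the same partition, and therefore in the same batch sequence, as long as the partition count does not change.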
• Kafka Producer internal thread sends a batch to Kafka broker based on these
configurations:
• “batch.size” – defaults to 16 kB
• “linger.ms” – defaults to 0
• So, Kafka Producer internal thread sends a batch to Kafka broker when:
• The total size of records in the batch reaches “batch.size”, or
• The time since batch creation reaches “linger.ms”, or
• Kafka Producer “flush()” method is called (directly or indirectly via “close()”).
• Kafka Producer creates only one connection to each broker.
• As a result, every batch for a Kafka broker must be sent sequentially through this one
connection.
• The maximum number of unacknowledged requests (each carrying one or more batches) sent to
each broker at any one time is controlled by
“max.in.flight.requests.per.connection”, which defaults to 5.
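The three send conditions above can be written as a single predicate. This is a minimal sketch of the rule as the slide states it, not the client's actual code; the class, method and parameter names are hypothetical.

```java
class BatchReadySketch {
    // True when a batch should be handed to the sender, following the three
    // conditions on the slide: size reached, linger time reached, or flush.
    static boolean readyToSend(int batchBytes, long ageMs, boolean flushRequested,
                               int batchSize, long lingerMs) {
        boolean full = batchBytes >= batchSize;
        boolean lingered = ageMs >= lingerMs;
        return full || lingered || flushRequested;
    }

    public static void main(String[] args) {
        // With linger.ms = 0, even a tiny, brand-new batch is ready at once.
        System.out.println(readyToSend(100, 0, false, 16384, 0));   // true
        // With linger.ms = 10, a small 3 ms old batch keeps waiting.
        System.out.println(readyToSend(100, 3, false, 16384, 10));  // false
    }
}
```

This also shows why raising “linger.ms” tends to produce fuller batches: the small batch in the second call is held back and can keep accepting records.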
Kafka Producer Issues
1. Failure to connect to Kafka broker
2. Record is too large
3. Batch expires before sending
4. Not enough replicas error
Failure to connect to Kafka broker
• The log message does not say so directly, but it means the producer failed to connect to a Kafka broker.
• The error message looks like this:
• [2021-08-02 12:57:44,097] WARN [Producer clientId=producer-1] Connection to node -1
(kafka1/172.20.0.6:9093) could not be established. Broker may not be available.
(org.apache.kafka.clients.NetworkClient)
• How to fix this:
• Check the broker configuration to confirm the listener port and security protocol
• Check the hostname or the IP address of the broker
• Confirm that Kafka Producer’s bootstrap.servers configuration is correct
• Confirm that connectivity exists between Kafka Producer’s host and Kafka broker hosts with commands
such as:
• ping {BROKER_HOST}
• nc {BROKER_HOST} {BROKER_PORT}
• openssl s_client -connect {BROKER_HOST}:{BROKER_PORT}
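When the producer runs on a host where nc is not installed, the same TCP check can be done from the JVM. The following is a minimal sketch using only the standard library; the class name, the default host "localhost", the default port 9092 and the 3000 ms timeout are assumptions. A successful connect only shows that the port is reachable. It does not validate the listener's security protocol.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class BrokerPortCheck {
    // Tries to open a plain TCP connection, roughly what "nc host port" does.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts and unresolvable hostnames.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder defaults; pass the broker host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9092;
        System.out.println(host + ":" + port + " reachable = "
                + canConnect(host, port, 3000));
    }
}
```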
Record is too large
• This error occurs because the serialized record is larger than the producer’s “max.request.size”
configuration, which defaults to 1048576 bytes (1 MB).
• The error message looks like this:
• org.apache.kafka.common.errors.RecordTooLargeException: The message is 1600088 bytes when
serialized which is larger than 1048576, which is the value of the max.request.size configuration.
• How to fix it:
• Reduce the record size. This requires a change in the application that generates the record.
• If you cannot reduce the record size, you can increase producer configuration “max.request.size”. If you
do this, you also need to increase topic configuration “max.message.bytes”.
• Note: “max.request.size” is the maximum request size AFTER serialization but BEFORE
compression. So enabling compression will not fix this.
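An application can guard against this error by checking the serialized size before calling send(). The sketch below is a simplification under stated assumptions: the class and method names are hypothetical, and it counts only the key and value bytes, whereas the client's own estimate also includes some per-record overhead, so a record very close to the limit can still be rejected.

```java
import java.nio.charset.StandardCharsets;

class RecordSizeCheck {
    // Lower-bound check: serialized key plus value against max.request.size.
    static boolean fitsInRequest(byte[] key, byte[] value, int maxRequestSize) {
        int keyLength = key == null ? 0 : key.length;
        int valueLength = value == null ? 0 : value.length;
        return (long) keyLength + valueLength <= maxRequestSize;
    }

    public static void main(String[] args) {
        int maxRequestSize = 1048576; // producer default from the slide
        byte[] key = "order-1".getBytes(StandardCharsets.UTF_8);
        byte[] small = new byte[100];
        byte[] large = new byte[1600000];
        System.out.println(fitsInRequest(key, small, maxRequestSize)); // true
        System.out.println(fitsInRequest(key, large, maxRequestSize)); // false
    }
}
```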
Batch expires before sending
• This error is a symptom of slow transfer time (on network) or slow processing (on Kafka
broker).
• The error looks like this:
• org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for test1-0:1500 ms has
passed since batch creation
• Sanity checks:
• Is the topic partition online? A topic partition is online if at least one of the Kafka brokers
hosting its replicas is online.
• Use “kafka-topics --bootstrap-server {BROKER HOST:PORT} --describe --topic {TOPIC NAME}”
• “delivery.timeout.ms” – An upper bound on the time to report success or failure after a call to send()
returns.
• The default value is 120000 ms (2 minutes).
• If “delivery.timeout.ms” is set to a very low value, batches can expire too early.
• “batch.size” – The maximum size of a record batch.
• The default value is 16384 bytes (16 kB).
• If the messages are large, this configuration may need to be increased to allow more records per
batch. More records per batch means higher throughput and lower latency per record.
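One way to spot a “delivery.timeout.ms” that is too low is to compare it with the other producer timeouts. The Java client expects “delivery.timeout.ms” to be at least “linger.ms” plus “request.timeout.ms”. Note that “request.timeout.ms” and its default of 30000 ms are not mentioned in the slides, and the class and method names are hypothetical; this is a sketch of the check, not the client's own validation code.

```java
import java.util.Properties;

class DeliveryTimeoutCheck {
    // True when delivery.timeout.ms >= linger.ms + request.timeout.ms.
    // Missing keys fall back to the producer defaults.
    static boolean isConsistent(Properties props) {
        long delivery = Long.parseLong(props.getProperty("delivery.timeout.ms", "120000"));
        long linger = Long.parseLong(props.getProperty("linger.ms", "0"));
        long request = Long.parseLong(props.getProperty("request.timeout.ms", "30000"));
        return delivery >= linger + request;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        System.out.println(isConsistent(defaults)); // true: 120000 >= 0 + 30000

        Properties tooLow = new Properties();
        tooLow.setProperty("delivery.timeout.ms", "1500");
        System.out.println(isConsistent(tooLow));   // false: 1500 < 0 + 30000
    }
}
```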
• How to investigate this issue:
• First, we need to identify whether this is caused by slow transfer time or slow processing.
• To check for slow transfer time, execute “ping {BROKER HOST}” from the producer host. The round trip time
(RTT) should be reasonable. For example, if both the producer and the Kafka brokers are in the same data center, the
RTT should be less than 10 ms, and usually under 1 ms.
• If the “ping” result is good (i.e. consistently under 10 ms with 0% packet loss), then network latency is unlikely
to be the cause.
• To check for slow processing, check the following on the Kafka brokers:
• Number of connections on the Kafka broker, with “netstat -n | grep 9092 | wc -l”. More than 1000
connections is usually too high and can cause slow processing or connectivity issues.
• Number of topic partitions per broker. More than 1500 partitions per broker is usually too high and can
cause slow processing. Check it with “kafka-topics --bootstrap-server {BROKER HOST:PORT} --describe | awk '{print $5, $6}' | sort | uniq -c”.
• If the Kafka broker host has enough CPU and memory, you can increase “num.replica.fetchers” to 2 or 3 to allow
more partitions per broker.
• Inter-broker “ping” latency. If the brokers are running in multiple data centers (e.g. multiple Availability
Zones), this may be a significant contributor to produce latency.
• CPU usage of the Kafka brokers. The following JMX metrics also show how idle the internal threads are:
• kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent – if this is low (< 0.5), the
broker needs a higher “num.io.threads”, if CPU allows.
• kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent – if this is low (< 0.5), the broker
needs a higher “num.network.threads”, if CPU allows.
Not enough replicas error
• This means the number of replicas in the ISR is less than the “min.insync.replicas” configuration.
• The error looks like this:
• [2021-08-03 01:34:05,077] WARN [Producer clientId=producer-1] Got error produce response with
correlation id 3 on topic-partition test2-0, retrying (2147483646 attempts left). Error:
NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender)
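• This error occurs, for example, when all of the following are true:
• Topic replication factor is 3.
• Topic configuration includes “min.insync.replicas=2”.
• Producer uses “acks=all” configuration.
• Only one replica of the partition is still in the ISR.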
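The condition behind this error can be expressed as a small predicate. This is a sketch of the rule, not broker code, and the class and method names are hypothetical. It relies on the fact that “min.insync.replicas” is only enforced for producers using “acks=all” (which can also be written as “acks=-1”).

```java
class MinIsrSketch {
    // True when a produce request would be rejected with NOT_ENOUGH_REPLICAS:
    // the producer asked for acks=all and the ISR is smaller than
    // min.insync.replicas.
    static boolean rejectsWrite(String acks, int isrSize, int minInsyncReplicas) {
        boolean acksAll = acks.equals("all") || acks.equals("-1");
        return acksAll && isrSize < minInsyncReplicas;
    }

    public static void main(String[] args) {
        // Replication factor 3, min.insync.replicas=2, only one replica in sync.
        System.out.println(rejectsWrite("all", 1, 2)); // true
        // Two replicas in sync: the write is accepted.
        System.out.println(rejectsWrite("all", 2, 2)); // false
        // acks=1 does not check min.insync.replicas.
        System.out.println(rejectsWrite("1", 1, 2));   // false
    }
}
```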
• What is ISR? Short for “In-Sync Replicas”. This means the replicas that are in sync
with the leader. In other words, the leader itself plus the follower replicas that have all the
records that the leader replica has.
• How can a replica become out of sync? Because the broker is offline, because replication
has failed, or because replication is slow.
• How to fix this error:
• If it is out of sync because the Kafka broker is offline, start the broker hosting the offline replicas.
• If it is out of sync because of a replication failure, fix the failure. This is a separate discussion, but the most
common cause is disk failure. If the disk storing the replica data is full, the Kafka broker will stop replicating all
replicas on that disk.
• If it is out of sync because of slow replication, fix the slow replication. This is also a separate discussion,
but the most common causes are inter-broker latency and too many topic partitions per broker.
Thank you. Any questions?
