Deep Dive into Kafka for Big Data Application
Presented By: Amit Kumar
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to the session start time. We start on time and conclude on time!
Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during the session.
Our Agenda
01 Kafka Introduction
02 Topic Config
03 Producer Config
04 Consumer Config
05 Demo
Introduction
● Apache Kafka is used primarily to build real-time data streaming pipelines.
● It stores streams of data and is built on the pub/sub model.
● Some streams receive tens of thousands of records per second, while others receive only one or two records per hour.
● This makes Kafka an important tool for storing the intermediate data/events of a big data application.
● These streams of data are stored in Kafka as Kafka topics.
● Once a stream is stored in Kafka, it can be consumed by multiple applications for different use cases, such as storing in a database or analytics.
● Apache Kafka acts as a cushion between two applications, especially when the downstream application is slower.
Kafka Features
● Low Latency:- it offers latencies as low as roughly 10 milliseconds.
● High Throughput:- thanks to its low latency, it can handle high-velocity and high-volume data.
● Fault Tolerance:- it handles node/machine failures within the cluster.
● Durability:- provided by its replication feature.
● Distributed:- Kafka has a distributed architecture, which makes it scalable.
● Real-Time Handling:- Kafka is able to handle real-time data pipelines.
● Batching:- Kafka works with batch-like use cases and supports batching.
Kafka Components
● Apache Kafka stores data streams in topics.
● The Producer API is used to write/publish data to a Kafka topic.
● The Consumer API is used to consume data from the topic for further use.
Kafka Topic
A topic is similar to a database table: Kafka uses topics to organise messages of a particular category.
Unlike a database table, however, we cannot query a Kafka topic directly. We need to create a producer to write data and a consumer to read it, and reads are sequential.
Data in a topic is deleted according to the retention period.
Important Kafka topic configs:-
Number of Partitions:-
Replication Factor:-
Message Size:-
Log Cleanup Policy:-
Config Details
The number of partitions governs the parallelism of the application.
In order to do parallel computation we need multiple consumer instances, and since one partition cannot feed data to more than one consumer in the same group, we have to increase the partition count to achieve that parallelism. A minimal topic-creation sketch follows.
Continue _ _ _ _
The replication factor is the number of copies of the data kept on different brokers. It helps us deal with data loss when a broker goes offline or fails, because the replicated data keeps serving reads.
In the typical case we set the replication factor to 3.
Increasing the replication factor further hurts performance, while keeping it lower risks losing data.
Message Size:- Kafka has a default limit of 1 MB per message in a topic.
In a few scenarios we need to send data larger than 1 MB. In that case we can raise the default message size, for example up to 10 MB:
replica.fetch.max.bytes=10485880
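Raising the limit usually involves several related settings on the broker, producer, and consumer; the config names below are standard Kafka settings, but the 10 MB values are illustrative assumptions:

message.max.bytes=10485880             # broker-wide limit (or max.message.bytes per topic)
max.request.size=10485880              # producer request size limit
max.partition.fetch.bytes=10485880     # consumer fetch limit per partition
replica.fetch.max.bytes=10485880       # broker, so replicas can copy the larger messages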
Continue _ _ _ _
The log cleanup policy makes sure that older messages in the topic are cleaned up, so that disk space on the broker is freed.
It is controlled by the following two configurations.
log.retention.hours :- The most common configuration for how long Kafka will retain
messages is by time. The default is specified in the configuration file using the
log.retention.hours parameter, and it is set to 168 hours, the equivalent of one week.
log.retention.bytes:- Another way to expire messages is based on the total number of bytes of
messages retained. This value is set using the log.retention.bytes parameter, and it is applied per partition.
The default is -1, meaning that there is no limit and only a time limit is applied.
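These broker-wide defaults can also be overridden per topic via retention.ms and retention.bytes. A hedged sketch using the AdminClient, reusing the admin client and the assumed "orders" topic from the earlier sketch; the values are illustrative:

import org.apache.kafka.clients.admin.{AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource
import java.util.{Arrays, Collections}

// Keep the topic for 7 days by time and cap each partition at ~1 GB (illustrative values)
val resource = new ConfigResource(ConfigResource.Type.TOPIC, "orders")
val ops: java.util.Collection[AlterConfigOp] = Arrays.asList(
  new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET),
  new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"), AlterConfigOp.OpType.SET)
)
admin.incrementalAlterConfigs(Collections.singletonMap(resource, ops)).all().get()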
Kafka Producer
Once a topic has been created with Kafka, the next step is to send data into the topic. This is
where Kafka Producers come in.
A Kafka producer sends messages to a topic, and messages are distributed to partitions according to a mechanism such as key hashing (a simplified sketch follows).
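As a hedged, simplified sketch of what the default partitioner does when a key is present (the helper name partitionFor is ours, not Kafka's):

import org.apache.kafka.common.utils.Utils

// Simplified view of key hashing: murmur2 hash of the serialized key,
// made non-negative, modulo the topic's partition count.
def partitionFor(keyBytes: Array[Byte], numPartitions: Int): Int =
  Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions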
Continue _ _ _ _
Kafka messages are created by the producer. A Kafka message consists of the following elements:
Continue _ _ _ _
Key: optional in a Kafka message and may be null. A key can be a string, a number, or any object, and is then serialized into binary format.
Value: represents the content of the message and can also be null. The value format is arbitrary and is also serialized into binary format.
Compression Type: Kafka messages can be compressed. The compression options are none, gzip, lz4, snappy, and zstd.
Headers: key-value pairs added mainly for tracing the message.
Partition + Offset: once a message is sent to a Kafka topic, it receives a partition number and an offset id. The combination of topic + partition + offset uniquely identifies the message.
Timestamp: a timestamp is added to the message either by the user or by the system.
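A small hedged sketch of building such a record; the topic name, key, value, and header are illustrative assumptions:

import org.apache.kafka.clients.producer.ProducerRecord

// Record with an explicit key, a JSON-ish value, and a tracing header.
val record = new ProducerRecord[String, String]("orders", "order-42", """{"amount": 100}""")
record.headers().add("traceId", "abc-123".getBytes("UTF-8"))
// The partition and offset are assigned by Kafka when the record is written;
// the timestamp defaults to the send time unless set explicitly.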
Continue _ _ _ _
// Must-have config (bootstrapServers and stringSerializer are assumed to be defined elsewhere)
val producerConfig = Map(
  "bootstrap.servers" -> bootstrapServers,
  "key.serializer" -> stringSerializer,
  "value.serializer" -> stringSerializer,
  // safe producer
  "enable.idempotence" -> "true",
  "acks" -> "all",
  "retries" -> Integer.MAX_VALUE.toString,
  "max.in.flight.requests.per.connection" -> "5",
  // high-throughput producer, at the expense of a bit of latency and CPU usage
  "compression.type" -> "snappy",
  "linger.ms" -> "20",
  "batch.size" -> Integer.toString(32 * 1024) // 32 KB
)
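A hedged sketch of wiring these settings into an actual producer and sending the record built on the previous slide; it assumes producerConfig is the map above with bootstrapServers and stringSerializer defined elsewhere:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
producerConfig.foreach { case (k, v) => props.put(k, v) }

val producer = new KafkaProducer[String, String](props)
producer.send(
  new ProducerRecord[String, String]("orders", "order-42", """{"amount": 100}"""),
  (metadata, exception) =>
    if (exception == null)
      println(s"written to partition ${metadata.partition()} at offset ${metadata.offset()}")
    else exception.printStackTrace()
)
producer.flush()
producer.close()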
Ack
Safe Producer:-
acks is the number of brokers that need to acknowledge receiving a message before it is considered a successful write.
acks=0:- the producer considers messages "written successfully" the moment they are sent, without waiting for the broker to accept them at all. This is the fastest approach, but data loss is possible.
acks=1:- the producer considers messages "written successfully" when the message is acknowledged by the leader only.
acks=all:- the producer considers messages "written successfully" when the message is accepted by all in-sync replicas (ISR).
Retry
Retries ensure that no messages are dropped when sent to Apache Kafka.
For Kafka > 2.1, the default retries value is the max int: retries = 2147483647.
This doesn't mean the producer will keep retrying forever; retrying is bounded by delivery.timeout.ms.
The default setting for the timeout is 2 minutes: delivery.timeout.ms=120000
max.in.flight.requests.per.connection = 1:- if we want to keep ordering maintained without idempotence, we have to set the max in-flight requests to 1, but that hurts performance. To keep performance high we set it to 5, which still preserves ordering when the idempotent producer is enabled.
Compression
Producers group messages in a batch before sending.
If the producer is sending compressed messages, all the messages in a single producer batch are compressed
together and sent as the "value" of a "wrapper message".
Compression is more effective the bigger the batch of messages being sent to Kafka.
The compression options are compression.type = none, gzip, lz4, snappy, and zstd.
Batching
By default, Kafka producers try to send records as soon as possible.
If we want to increase throughput, we have to enable batching.
Batching is mainly controlled by two producer settings: linger.ms and batch.size.
The default value of linger.ms is 0 ms and the default batch.size is 16 KB; in the config shown earlier we set linger.ms = 20.
With those settings, the producer waits until either 16 KB of messages have accumulated or 20 ms have passed before sending the batch.
Kafka Consumer
Applications that pull event data from one or more Kafka topics are known as Kafka consumers.
Consumers can read from one or more partitions at a time in Apache Kafka.
Data is read in order within each partition. A minimal consumer sketch follows.
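As a hedged, minimal consumer sketch (not from the slides); the broker address, group id, and topic name are assumptions:

import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumed broker
props.put("group.id", "orders-consumer-group")     // assumed consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("orders"))

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))   // one batch per poll
  for (record <- records.asScala)
    println(s"partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
}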
Delivery Semantics
A consumer reading from a Kafka partition may choose when to commit offsets.
At Most Once Delivery:- offsets are committed as soon as a message batch is received after calling poll(). If processing fails, the message is lost: it will not be read again because its offset has already been committed.
At Least Once Delivery:- here we don't want to lose messages, so we commit the offset only after processing is done; because of retries this can lead to duplicate processing. This is suitable for consumers that cannot afford any data loss (see the sketch below).
Exactly Once Delivery:- here we want each message to be processed exactly once, with no duplicate data. It applies to payments and similar sensitive use cases. In Kafka Streams this is enabled with:
processing.guarantee=exactly_once
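A hedged sketch of the at-least-once pattern, reusing the consumer from the previous sketch with "enable.auto.commit" set to "false"; process is an assumed application-specific function:

// At-least-once: commit offsets only after the batch has been processed successfully.
while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach(record => process(record))   // 'process' is application-specific (assumed)
  if (!records.isEmpty) consumer.commitSync()          // a crash before this line means re-delivery, not data loss
}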
Polling
Kafka consumers poll the Kafka broker to receive batches of data.
Polling allows consumers to control:-
● From where in the log they want to consume
● How fast they want to consume
● Ability to replay events
The consumer sends a heartbeat at a regular interval (heartbeat.interval.ms). If it stops sending heartbeats, the group coordinator waits for session.timeout.ms and then triggers a rebalance.
Consumers poll brokers periodically using the .poll() method. If two .poll() calls are separated by more than max.poll.interval.ms, the consumer is considered dead and removed from the group.
The default value of max.poll.interval.ms is 5 minutes (300000 ms).
Continue _ _ _
max.poll.records (default 500):- controls the maximum number of records that a single call to poll() will return.
This is an important config that controls how fast data flows into our application.
If your application processes data slowly, set max.poll.records lower; if it processes data quickly, you can set it higher.
If processing a particular batch takes longer than max.poll.interval.ms, the consumer will be removed from the group. To make sure this doesn't happen, reduce max.poll.records. Illustrative tuning values are shown below.
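Putting the polling-related settings together as a hedged sketch, reusing the props object from the consumer sketch above; the values are illustrative only and the defaults vary by Kafka version:

// Consumer liveness and flow-control settings (illustrative values)
props.put("heartbeat.interval.ms", "3000")    // how often heartbeats are sent
props.put("session.timeout.ms", "45000")      // no heartbeat for this long => rebalance
props.put("max.poll.interval.ms", "300000")   // max gap between poll() calls (5 minutes)
props.put("max.poll.records", "100")          // smaller batches for a slow-processing application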
Auto Offsets Reset
● A consumer is expected to read from a log continuously.
● But due to a network error or a bug in the application, we may be unable to process messages for a while.
● When we restart the application, the previously committed offset may no longer be available (deleted under the retention policy).
● The consumer then has two options: read from the beginning or read from the latest offset.
This behaviour is controlled by auto.offset.reset, which can take the following values:-
latest:- the consumer reads messages from the tail of the partition
earliest:- the consumer reads from the oldest offset in the partition
none:- an exception is thrown to the consumer if no previous offset is found
Thank You !
Get in touch with us: