#JCConf
Apache Kafka

A high-throughput distributed messaging system
陸振恩 (popcorny)
popcorny@cacafly.com

Outline
● What is Kafka
● Basic concepts
● Why Kafka is fast
● Programming Kafka
● Usage scenarios
● Recap
What is Kafka

Motivation
● More and more data and metrics need to be collected
- Web activity tracking
- Operational metrics
- Application log aggregation
- Commit logs
- …
● We need a message bus to collect and relay this data
- High volume
- Fast and scalable

Kafka
● Developed by LinkedIn
● Messaging system
- Queue
- Pub / Sub
● Written in Scala
● Features
- Durability
- Scalability
- High availability
- High throughput

BigData World
                    Traditional          BigData
File System         NFS                  HDFS, S3
Database            RDBMS                Cassandra, HBase
Batch Processing    SQL                  Hadoop MapReduce, Spark
Stream Processing   In-App Processing    Storm, Spark Streaming
Message Service     AMQP-compliant       Kafka

Features
● Durability
- All messages are persisted
- Sequential reads and writes (like a log file)
- Consumers keep track of the message offset (like a file descriptor)
- Log files are rotated (like logrotate)
- Messages are deleted only when they expire (like logrotate)
- Supports both batch loading and real-time usage
• Batch: cat access.log | grep 'jcconf'
• Real time: tail -F access.log | grep 'jcconf'
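The log-file analogy above can be sketched in a few lines. This is a minimal in-memory illustration, not Kafka's implementation: the `Log` class and its `append` and `read` methods are hypothetical names, and nothing here touches a disk or a broker. It only shows that one append-only log serves both a batch reader (like `cat`) and a tailing reader (like `tail -F`), because each reader keeps its own offset.

```python
class Log:
    """Append-only message log; readers track their own position."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=None):
        """Return messages starting at `offset`; the log itself is unchanged."""
        end = None if max_messages is None else offset + max_messages
        return self._messages[offset:end]


log = Log()
for line in ["GET /jcconf", "GET /about", "GET /jcconf/2014"]:
    log.append(line)

# Batch usage (like `cat`): read everything from offset 0.
batch = [m for m in log.read(0) if "jcconf" in m]

# Real-time usage (like `tail -F`): remember the offset and poll for new data.
offset = 3
log.append("POST /jcconf/talks")
fresh = log.read(offset)
offset += len(fresh)
```

Reading never removes anything from the log, which is why the same data can be replayed in batch later.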

Designed like a message queue
Implemented like a distributed log file

Features
● Scalability
- Horizontal scale-out
- Topics are partitioned (sharded)
● High Availability
- Partitions can be replicated

Features
● High Throughput
source: http://www.infoq.com/articles/apache-kafka

Basic Concepts

Physical Components
● Producer - The role that sends messages to a broker
● Consumer - The role that receives messages from a broker
● Broker - One node of the Kafka cluster
● ZooKeeper - Coordinator of the Kafka cluster and consumer groups
[Diagram: a Producer, a Kafka Cluster of three Brokers with ZooKeeper, and a Consumer Group containing Consumers]

Logical Components
● Topic
- The named destination for messages, made up of partitions
● Partition
- One topic can have multiple partitions
- Unit of parallelism
● Message
- Key/value pair
- Message offset
[Diagram: one Topic with three partitions, each an ordered sequence of messages numbered by offset (Partition 1: offsets 0-8, Partition 2: offsets 0-9, Partition 3: offsets 0-7)]

One Partition, One Consumer (Queue)
[Diagram: producer P appends to Partition 1 (messages A-J at offsets 0-9); consumer C reads at offset = 8]
Consumers keep the offset.
The broker has no idea whether a message has been processed.

One Partition, Multiple Consumers (Pub/Sub)
[Diagram: producer P appends to Partition 1 (messages A-J at offsets 0-9); consumers C1, C2, and C3 read independently, with offsets 8, 7, and 9]
Each consumer keeps its own offset.
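A small sketch of the pub/sub case may help. It assumes a partition can be modelled as a plain list and uses a hypothetical `Consumer` class; the starting offsets 8, 7, and 9 are taken from the slide, though which consumer holds which offset is an assumption. Each consumer advances only its own offset, and the partition stores no per-consumer state.

```python
class Consumer:
    """Keeps its own read position; the partition stores no per-consumer state."""

    def __init__(self, name, offset=0):
        self.name = name
        self.offset = offset

    def poll(self, partition):
        """Return the next unread message, or None when caught up."""
        if self.offset >= len(partition):
            return None
        message = partition[self.offset]
        self.offset += 1
        return message


partition = list("ABCDEFGHIJ")  # messages A-J at offsets 0-9

c1, c2, c3 = Consumer("C1", 8), Consumer("C2", 7), Consumer("C3", 9)
received = {c.name: c.poll(partition) for c in (c1, c2, c3)}
```

All three consumers read from the same partition, yet each receives the message at its own position, and a slow consumer never holds the others back.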

Multiple Partitions
[Diagram: producer P writes to three partitions spread over broker1 and broker2 (Partition 1: offsets 0-8, Partition 2: offsets 0-9, Partition 3: offsets 0-7); a single consumer C1 reads all three, with p1.offset = 7, p2.offset = 9, p3.offset = 7]
Messages are dispatched to partitions by hashed key.
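Dispatching by hashed key can be sketched as follows. This is an illustration only: `partition_for` is a hypothetical helper, and CRC32 is a stand-in chosen because it is deterministic across runs, not the hash function Kafka's own partitioner uses. The point is that every message with the same key lands in the same partition, so its relative order is preserved there.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a message key to a partition index (CRC32 is an assumed stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = defaultdict(list)
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
for key, value in events:
    partitions[partition_for(key)].append((key, value))

# All events for one key sit in one partition, in the order they were sent.
user1_events = [v for k, v in partitions[partition_for("user-1")] if k == "user-1"]
```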

Multiple Partitions
[Diagram: the same three partitions on broker1 and broker2, now read by three consumers C1, C2, and C3, each keeping its own offset (the offsets shown are 9, 7, and 7)]

Can we automatically rebalance consumers across partitions?
Yes, with a Consumer Group!

Consumer Group
● A group of workers
● Share the offsets
● Offsets are synced to ZooKeeper
● Automatic rebalancing

Consumer Group
[Diagram: three partitions on broker1 and broker2 (Partition 1: offsets 0-8, Partition 2: offsets 0-9, Partition 3: offsets 0-7) are consumed by consumer group 'group1', made up of C1 and C2; the group shares p1.offset = 7, p2.offset = 9, p3.offset = 7]

Consumer Group
[Diagram: the same three partitions are consumed by two consumer groups, 'group1' (C1 and C2) and 'group2' (C1)]

Consumer Group
[Diagram: producer P appends to Partition 1 (messages A-J at offsets 0-9); a consumer group containing C1, C2, and C3 reads it with a single shared offset = 9]
Within one consumer group, the partition-to-consumer relation is many-to-one.
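The many-to-one rule and automatic rebalancing can be sketched with a simple assignment function. This is a hedged illustration: `assign` is a hypothetical helper using round-robin, which is an assumption and not necessarily the strategy Kafka applies. Rebalancing is modelled as simply recomputing the assignment when group membership changes.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


partitions = ["p1", "p2", "p3"]

# Two workers share three partitions.
two_members = assign(partitions, ["C1", "C2"])

# C2 leaves the group: recomputing the assignment hands everything to C1.
one_member = assign(partitions, ["C1"])

# One partition, three workers: only one of them receives it, the rest stay idle.
single_partition = assign(["p1"], ["C1", "C2", "C3"])
```

A consumer may own several partitions, but within a group no partition is ever owned by more than one consumer, so adding consumers beyond the partition count adds no parallelism.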

Message Ordering
● Messages from the same partition are guaranteed FIFO semantics
● A traditional MQ can only guarantee that messages are delivered in order
● Kafka can guarantee that messages are handled in order (within the same partition)
[Diagram: Traditional MQ: producer P -> broker B -> consumers C1 and C2. Kafka: producer P -> partition P1 -> C1 and partition P2 -> C2]

Delivery Semantics
● At most once - Messages may be lost but are never redelivered.
● At least once - Messages are never lost but may be redelivered.
● Exactly once - Each message is delivered once and only once. (This is what people actually want.)
- Two-phase commit
- At least once + idempotence
Idempotence: applying an operation multiple times does not change the final result
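The "at least once + idempotence" route can be sketched as below. The message IDs, the `deliveries` list, and both handler functions are hypothetical; deduplicating by ID is one common way to make a handler idempotent, not the only one. The same redelivered stream is fed to a naive handler and to an idempotent one.

```python
def handle_naive(deliveries):
    """Adds every delivery, so a redelivered message is counted twice."""
    total = 0
    for _message_id, amount in deliveries:
        total += amount
    return total


def handle_idempotent(deliveries):
    """Remembers processed message IDs, so redelivery does not change the result."""
    seen = set()
    total = 0
    for message_id, amount in deliveries:
        if message_id in seen:
            continue
        seen.add(message_id)
        total += amount
    return total


# At-least-once delivery: message "m2" arrives twice.
deliveries = [("m1", 10), ("m2", 5), ("m2", 5)]

naive_total = handle_naive(deliveries)
idempotent_total = handle_idempotent(deliveries)
```

With the idempotent handler, the observable result matches exactly-once delivery even though the transport only promises at least once.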

Delivery Semantics
● Which part are we discussing?
[Diagram: Producer -> Broker -> Consumer, shown twice]

Producer to Broker
● At most once - Async send
● At least once - Sync send (with retry count)
● Exactly once
- Idempotent delivery is not supported yet (planned for the next version, 0.9)
[Diagram: Producer -> Broker -> Consumer]
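The producer side can be simulated without a real cluster. Everything here is hypothetical: the `Broker` class, its failure switches, and the two send helpers only mimic the slide's simplification (async means fire-and-forget, sync means wait for an acknowledgement and retry). A lost request shows at-most-once behaviour, and a lost acknowledgement shows why retries can duplicate a message.

```python
class Broker:
    """Toy broker that can lose a request or an acknowledgement on chosen calls."""

    def __init__(self, drop_request_on=(), drop_ack_on=()):
        self.log = []
        self.calls = 0
        self.drop_request_on = set(drop_request_on)
        self.drop_ack_on = set(drop_ack_on)

    def send(self, message):
        """Return True if the producer receives an acknowledgement."""
        self.calls += 1
        if self.calls in self.drop_request_on:
            return False  # request lost: nothing is stored
        self.log.append(message)
        return self.calls not in self.drop_ack_on  # stored, but the ack may be lost


def send_async(broker, message):
    """Fire and forget: the result is ignored, so a loss goes unnoticed."""
    broker.send(message)


def send_sync(broker, message, retries=3):
    """Wait for the acknowledgement and retry when it does not arrive."""
    for _ in range(1 + retries):
        if broker.send(message):
            return True
    return False


lossy = Broker(drop_request_on={1})
send_async(lossy, "m1")              # at most once: the message is lost

retried = Broker(drop_request_on={1})
delivered = send_sync(retried, "m1")  # at least once: the retry succeeds

ack_lost = Broker(drop_ack_on={1})
send_sync(ack_lost, "m1")             # at least once: the retry duplicates
```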

Broker to Consumer
● At most once - Store the offset before handling the message
● At least once - Store the offset after handling the message
● Exactly once - At least once + idempotent operation
[Diagram: Producer -> Broker -> Consumer]
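The effect of the ordering (store the offset first, or handle the message first) shows up when a consumer crashes between the two steps. The sketch below is a simulation under stated assumptions: `run` and `consume_with_restart` are hypothetical helpers, the crash is injected at a chosen offset, and the restarted consumer resumes from the last stored offset.

```python
def run(messages, offset, commit_first, crash_at=None):
    """Consume from `offset`; return (handled messages, stored offset).

    If `crash_at` is set, the consumer dies between its two steps
    while working on the message at that offset.
    """
    handled = []
    while offset < len(messages):
        if commit_first:
            committed = offset + 1            # step 1: store the offset
            if offset == crash_at:
                return handled, committed     # crash before handling
            handled.append(messages[offset])  # step 2: handle the message
        else:
            handled.append(messages[offset])  # step 1: handle the message
            if offset == crash_at:
                return handled, offset        # crash before storing the offset
            committed = offset + 1            # step 2: store the offset
        offset = committed
    return handled, offset


def consume_with_restart(messages, commit_first, crash_at):
    """Run until the crash, then restart from the stored offset."""
    before, committed = run(messages, 0, commit_first, crash_at)
    after, _ = run(messages, committed, commit_first)
    return before + after


messages = ["A", "B", "C"]
at_most_once = consume_with_restart(messages, commit_first=True, crash_at=1)
at_least_once = consume_with_restart(messages, commit_first=False, crash_at=1)
```

Storing the offset first loses "B"; handling first processes "B" twice, which is harmless only if the handler is idempotent.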

Replication
● The unit of replication is the partition
● Each partition has a single leader and zero or more followers
● All reads and writes go to the leader of the partition
[Diagram: the Producer writes to the Leader and the Consumer reads from it; two Followers sync from the Leader]
source: http://www.infoq.com/articles/apache-kafka

Log Compaction
● Many data systems retain only the latest state of the data for each key.
● Log compaction adds an alternative retention mechanism that retains messages by key instead of purely by time.
● This describes many common data systems: a search index, a cache, etc.
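Retention by key can be sketched as a single pass over the log. The `compact` function below is a hypothetical illustration, not Kafka's log cleaner: it keeps only the most recent record for each key and preserves the order of the surviving records.

```python
def compact(log):
    """Keep only the latest record for each key, preserving log order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", "e")]
compacted = compact(log)

# Replaying the compacted log rebuilds the same latest state as the full log.
state_from_full = dict(log)
state_from_compacted = dict(compacted)
```

This is why a compacted topic can back a cache or a search index: a new reader needs only the latest value per key, not the full history.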

Why Is Kafka Fast?

Persistent and fast?

Sequential vs Random
● Don't fear the file system
● Six 7,200 RPM SATA drives in a RAID-5 array
- Sequential write: 600 MB/sec
- Random write: 100 KB/sec (roughly 6,000 times slower)
● Can sequential disk reads be faster than random access in memory?
source: http://queue.acm.org/detail.cfm?id=1563874

If we persist data, should we cache the data in memory?

In-Process Cache vs Page Cache
● In-process cache
- Messages are held as objects
- Cached in the JVM heap
● Page cache
- Disk cache managed by the OS

In-Process Cache vs Page Cache
                     In-Process Cache    Disk Page Cache
Memory usage         In-heap memory      Free memory
Overhead             Object overhead     None
Garbage collection   Yes                 No
Process restart      Lost                Still warm
Controlled by        App                 OS

In-Process Cache vs Page Cache
● Fact
- All disk reads and writes go through the page cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, that data will likely be duplicated in the OS page cache, effectively storing everything twice.
● Conclusion
- Relying on the page cache is superior to maintaining an in-process cache or other structure

How is data transferred to consumers?

Application Copy vs Zero Copy
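The difference between the two transfer paths can be sketched with Python's standard library. This is an analogy rather than Kafka's code: the broker runs on the JVM, and the helper names `send_with_copy` and `send_zero_copy` are hypothetical. `socket.sendfile` uses the operating system's `sendfile` call where available and falls back to ordinary sends elsewhere, so the zero-copy path is an assumption about the platform.

```python
import os
import socket
import tempfile


def send_with_copy(path, sock, chunk_size=64 * 1024):
    """Application copy: read into user space, then write to the socket."""
    sent = 0
    with open(path, "rb") as source:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            sock.sendall(chunk)
            sent += len(chunk)
    return sent


def send_zero_copy(path, sock):
    """Zero copy: ask the kernel to move file data to the socket directly."""
    with open(path, "rb") as source:
        return sock.sendfile(source)


payload = b"kafka" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

left, right = socket.socketpair()
try:
    sent = send_zero_copy(tmp.name, left)
    left.close()  # signal end of stream to the reader
    received = b""
    while True:
        chunk = right.recv(65536)
        if not chunk:
            break
        received += chunk
finally:
    right.close()
    os.unlink(tmp.name)
```

Both helpers deliver the same bytes; the zero-copy version avoids moving the data through the application's own buffers.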

Constant Time
● Traditional queue
- The broker keeps message state and metadata
- B-tree: O(log n)
- Random access
● Kafka
- Consumers keep the offset
- Sequential disk read/write: O(1)
Programming Kafka
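
Producer: Sync Send
[Code listing not preserved in this transcript]

Producer: Async Send
[Code listing not preserved in this transcript]

High Level Consumer
- Open the consumer connector
- Open the stream for the topic
[Code listing not preserved in this transcript]

High Level Consumer
- Receive the message
[Code listing not preserved in this transcript]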

Usage Scenarios

Source of Stream Processing
● Real-time processing and analysis
● Stream processing frameworks
- Storm
- Spark Streaming
- Samza
● Distributed stream source + distributed stream processing
● All three of these frameworks support Kafka as a stream source.
[Diagram: Kafka Cluster -> Stream Processing]

Source of Stream Processing
● The most reliable source for stream processing
source: http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

Source and/or Sink of Distributed Log Collectors
● Centralized log framework
● Distributed log collectors
- Logstash
- Fluentd
- Flume
[Diagram: Kafka Cluster -> Distributed Log Collector -> Other Sink; Other Source -> Distributed Log Collector -> Kafka Cluster]

Source and/or Sink of Distributed Log Collectors (cont.)
● Push vs Pull
[Diagram: Distributed Log Collector and Kafka Cluster connected by push and pull arrows]
● Distributed log collectors provide configurable producers and consumers
● The Kafka cluster provides a distributed, highly available, reliable messaging system

Source of Lambda Architecture
● What is the lambda architecture?
- Stream processing for real-time data
- Batch processing for historical data
- Queries run against a merged view
source: http://lambda-architecture.net/

Lambda Architecture (cont.)
source: https://metamarkets.com/2014/building-a-data-pipeline-that-handles-billions-of-events-in-real-time/

Recap
● Features
- Durability
- Scalability
- High availability
- High throughput
● Basic concepts
- Producer, Broker, Consumer, Consumer Group
- Topic, Partition, Message
- Message ordering
- Delivery semantics
- Replication
● Why Kafka is fast
● Usage scenarios
- Source of stream processing
- Source or sink of distributed log framework
- Source of lambda architecture

Reference
● Kafka Documentation
kafka.apache.org/documentation.html
● Kafka Wiki
https://cwiki.apache.org/confluence/display/KAFKA/Index
● The Log: What every software engineer should know about real-time data's unifying abstraction
engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
● Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)
engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
● Apache Kafka for Beginners
blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/

producer.send("thanks");
// any questions?
question = consumer.receive();

 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 

Último (20)

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014

  • 1. #JCConf Apache Kafka
 A high-throughput distributed messaging system 陸振恩 (popcorny) popcorny@cacafly.com
  • 2. Outline ● What is Kafka ● Basic concepts ● Why Kafka is fast ● Programming Kafka ● Usage scenarios ● Recap
  • 4. Motivation ● More and more data and metrics need to be collected - Web activity tracking - Operational metrics - Application log aggregation - Commit logs - … ● We need a message bus to collect and relay this data - High volume - Fast and scalable
  • 5. Kafka ● Developed by LinkedIn ● Messaging system - Queue - Pub/Sub ● Written in Scala ● Features - Durability - Scalability - High availability - High throughput
  • 6. BigData World (Traditional vs. BigData)
 File system: NFS vs. HDFS, S3
 Database: RDBMS vs. Cassandra, HBase
 Batch processing: SQL vs. Hadoop MapReduce, Spark
 Stream processing: in-app processing vs. Storm, Spark Streaming
 Message service: AMQP-compliant vs. Kafka
  • 7. Features ● Durability - All messages are persisted - Sequential reads and writes (like a log file) - Consumers keep the message offset (like a file descriptor) - Log files are rotated (like logrotate) - Messages are deleted only when they expire (like logrotate) - Supports both batch loading and real-time usage ● cat access.log | grep 'jcconf' ● tail -F access.log | grep 'jcconf'
  • 8. Designed like a message queue, implemented like a distributed log file
  • 9. Features ● Scalability - Horizontal scale-out - Topics are partitioned (sharded) ● High availability - Partitions can be replicated
  • 10. Features ● High throughput (source: http://www.infoq.com/articles/apache-kafka)
  • 12. Physical Components ● Producer - Sends messages to brokers ● Consumer - Receives messages from brokers ● Broker - One node of a Kafka cluster ● ZooKeeper - Coordinator of the Kafka cluster and consumer groups (Diagram: a producer, a Kafka cluster of brokers with ZooKeeper, and a consumer group of consumers)
  • 13. Logical Components ● Topic - A named destination for messages, made up of partitions ● Partition - One topic can have multiple partitions - Unit of parallelism ● Message - Key/value pair - Message offset (Diagram: a topic with three partitions, each an ordered sequence of messages indexed by offset)
  • 14. One Partition, One Consumer (Queue) ● Consumers keep the offset. ● The broker does not know whether a message has been processed. (Diagram: producer P appends to Partition 1, offsets 0 to 9; consumer C reads at offset = 8)
  • 15. One Partition, Multiple Consumers (Pub/Sub) ● Each consumer keeps its own offset. (Diagram: producer P appends to Partition 1, offsets 0 to 9; three consumers C1, C2, C3, with offsets 8, 7, and 9 shown)
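The offset model on slides 14 and 15 can be sketched in a few lines of Python. This is a toy in-memory model, not the Kafka client API: `PartitionLog` and `OffsetConsumer` are hypothetical names introduced only for illustration. The point it shows is that the log itself keeps no per-consumer state, so any number of consumers can read the same partition at different positions.

```python
class PartitionLog:
    """Append-only list standing in for one partition (illustrative only)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return the offset it was written at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset):
        """Return the message at `offset`, or None if nothing is there yet."""
        if 0 <= offset < len(self._messages):
            return self._messages[offset]
        return None


class OffsetConsumer:
    """Consumer that remembers its own position; the log never tracks it."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self):
        """Read the next message, advancing the offset only on success."""
        message = self.log.read(self.offset)
        if message is not None:
            self.offset += 1
        return message


log = PartitionLog()
for letter in "ABCDEFGHIJ":
    log.append(letter)

# Two independent consumers on the same partition, at different offsets.
c1 = OffsetConsumer(log, offset=8)
c2 = OffsetConsumer(log, offset=7)
first_c1 = c1.poll()
first_c2 = c2.poll()
```

Because reading is just indexing by offset, a consumer can also rewind by resetting `offset`, which is what makes batch replay and real-time tailing possible on the same log.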
  • 16. Multiple Partitions ● Messages are dispatched to partitions by hashed key. (Diagram: producer P writes to Partitions 1 to 3 spread across broker1 and broker2; a single consumer C1 tracks p1.offset = 7, p2.offset = 9, p3.offset = 7)
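A minimal sketch of dispatching by hashed key, assuming a CRC32 hash taken modulo the partition count. The function name `choose_partition` and the choice of CRC32 are assumptions for illustration; Kafka's own partitioner uses its own hash function. What matters is the property that the same key always maps to the same partition.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(key) % num_partitions


# The same key is always routed to the same partition.
first = choose_partition(b"user-42", 3)
second = choose_partition(b"user-42", 3)
```

Routing by key is what lets per-key ordering survive partitioning: all messages for one key land in one partition, and each partition is read in order.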
  • 17. Multiple Partitions (Diagram: the same three partitions on broker1 and broker2, now read by three consumers C1, C2, C3, each tracking a single offset: 9, 7, and 7 shown)
  • 18. Can we auto-rebalance the consumers across partitions? Yes: consumer groups!
  • 19. Consumer Group ● A group of workers ● Members share the offsets ● Offsets are synced to ZooKeeper ● Auto-rebalancing
  • 20. Consumer Group (Diagram: consumer group 'group1' with consumers C1 and C2 reading Partitions 1 to 3 on broker1 and broker2; group offsets p1.offset = 7, p2.offset = 9, p3.offset = 7)
  • 21. Consumer Group (Diagram: two consumer groups reading the same three partitions: 'group1' with C1 and C2, and 'group2' with C1)
  • 22. Consumer Group ● Within one consumer group, partition-to-consumer is a many-to-one relation. (Diagram: Partition 1, offsets 0 to 9, with a consumer group of C1, C2, C3; offset = 9)
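The many-to-one rule can be illustrated with a simplified round-robin assignment. This is a sketch under stated assumptions, not Kafka's actual rebalancing algorithm: `assign_partitions` is a hypothetical helper, and the partition and consumer names are made up. Each partition gets exactly one owner within the group, while a consumer may own several partitions or none.

```python
def assign_partitions(partitions, consumers):
    """Give every partition exactly one owner within a single group."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    ordered = sorted(consumers)
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


partitions = ["p1", "p2", "p3"]

# Two consumers share three partitions.
two_consumers = assign_partitions(partitions, ["C1", "C2"])

# "Rebalancing" after members join is just recomputing the assignment.
four_consumers = assign_partitions(partitions, ["C1", "C2", "C3", "C4"])
```

With four consumers and three partitions, one consumer is left idle, which is why adding consumers beyond the partition count does not add parallelism.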
  • 23. Message Ordering ● Messages from the same partition are guaranteed FIFO semantics ● A traditional MQ can only guarantee that messages are delivered in order ● Kafka can guarantee that messages are handled in order (within the same partition) (Diagram: traditional MQ, with producer P, broker B, and consumers C1 and C2, vs. Kafka, with producer P, partitions P1 and P2, and consumers C1 and C2)
  • 24. Delivery Semantics ● At most once - Messages may be lost but are never redelivered. ● At least once - Messages are never lost but may be redelivered. ● Exactly once - Each message is delivered once and only once (this is what people actually want). - Two-phase commit - At least once + idempotence (an idempotent operation can be applied multiple times without changing the final result)
  • 25. Delivery Semantics ● Which part do we discuss? (Diagram: the Producer, Broker, Consumer pipeline, shown twice)
  • 26. Producer to Broker ● At most once - Async send ● At least once - Sync send (with retry count) ● Exactly once - Idempotent delivery is not supported until the next version (0.9)
  • 27. Broker to Consumer ● At most once - Store the offset before handling the message ● At least once - Store the offset after handling the message ● Exactly once - At least once + idempotent operation
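The difference between storing the offset before or after handling can be made concrete with a small crash simulation. This is a toy model, not consumer client code: `Crash`, `run_consumer`, `crashing_handler`, and `consume_with_restart` are all hypothetical names, and the "crash" is just an exception raised once on a chosen message.

```python
class Crash(Exception):
    """Simulated consumer crash."""


def run_consumer(messages, start, handle, commit_first):
    """Consume from `start` until the end or a crash; return the stored offset."""
    stored = start
    try:
        for offset in range(start, len(messages)):
            if commit_first:
                stored = offset + 1          # at most once: store, then handle
                handle(messages[offset])
            else:
                handle(messages[offset])     # at least once: handle, then store
                stored = offset + 1
    except Crash:
        pass
    return stored


def crashing_handler(sink, crash_on, before_effect):
    """Handler that crashes once on `crash_on`, before or after its side effect."""
    crashed = []

    def handle(message):
        fail = message == crash_on and not crashed
        if fail and before_effect:
            crashed.append(True)
            raise Crash()
        sink.append(message)                 # the side effect
        if fail:
            crashed.append(True)
            raise Crash()

    return handle


def consume_with_restart(messages, handle, commit_first):
    """Run once, then restart from whatever offset was stored."""
    offset = run_consumer(messages, 0, handle, commit_first)
    run_consumer(messages, offset, handle, commit_first)


messages = ["A", "B", "C"]

at_most_once = []
consume_with_restart(messages, crashing_handler(at_most_once, "B", True), True)

at_least_once = []
consume_with_restart(messages, crashing_handler(at_least_once, "B", False), False)

# An idempotent sink (here, a set) absorbs the redelivered message.
idempotent_result = set(at_least_once)
```

In this run the at-most-once consumer loses "B", the at-least-once consumer sees "B" twice, and applying the at-least-once stream to an idempotent sink yields each message once, which is the "at least once + idempotent operation" recipe from the slide.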
  • 28. Replication ● The unit of replication is the partition ● Each partition has a single leader and zero or more followers ● All reads and writes go to the leader of the partition (Diagram: the producer writes to the leader and the consumer reads from it; followers sync from the leader; source: http://www.infoq.com/articles/apache-kafka)
  • 30. Log Compaction ● Many data systems retain only the latest state of the data for each key. ● Log compaction adds an alternative retention mechanism that retains messages by key instead of purely by time. ● This describes many common data systems: a search index, a cache, etc.
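Retention by key can be sketched as keeping only the newest record per key while preserving the original offsets of the survivors. This is an illustrative model only: `compact` is a hypothetical function and the keys and values are made up. It does not reflect how Kafka schedules or performs compaction internally.

```python
def compact(log):
    """Keep only the latest (key, value) record per key, in offset order.

    `log` is a list of (key, value) pairs whose list index is the offset.
    Returns (offset, key, value) tuples for the surviving records.
    """
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset          # later records overwrite earlier ones
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


log = [
    ("user1", "a@example.com"),
    ("user2", "b@example.com"),
    ("user1", "c@example.com"),   # supersedes offset 0
]
compacted = compact(log)
```

After compaction a reader that replays the log from the beginning still ends up with the latest value for every key, which is why a compacted topic can rebuild a cache or a search index.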
  • 35. Sequential vs Random ● Don't fear the file system ● RAID-5 array of six 7,200 RPM SATA disks - Sequential write: 600MB/sec - Random write: 100KB/sec ● Can sequential disk reads be faster than random memory access? (source: http://queue.acm.org/detail.cfm?id=1563874)
  • 36. If we persist data, should we also cache it in memory?
  • 37. In-Process Cache vs Page Cache ● In-process cache - Messages as objects - Cached in the JVM heap ● Page cache - Disk cache managed by the OS
  • 38. In-Process Cache vs Page Cache (in-process cache vs. disk page cache)
 Memory usage: in-heap memory vs. free memory
 Overhead: object overhead vs. none
 Garbage collection: yes vs. no
 Process restart: lost vs. still warm
 Controlled by: app vs. OS
  • 39. In-Process Cache vs Page Cache ● Fact - All disk reads and writes go through the page cache. This cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, the data will likely be duplicated in the OS page cache, effectively storing everything twice. ● Conclusion - Relying on the page cache is superior to maintaining an in-process cache or other structure.
  • 40. How are messages transferred to consumers?
  • 41. Application Copy vs Zero Copy
  • 42. Constant Time ● Traditional queue - Broker keeps the message state and metadata - B-tree: O(log n) - Random access ● Kafka - Consumers keep the offset - Sequential disk read/write: O(1)
  • 46. High-Level Consumer ● Open the consumer connector ● Open the stream for the topic
  • 49. Source of Stream Processing ● Real-time processing and analysis ● Stream processing frameworks - Storm - Spark Streaming - Samza ● Distributed stream source + distributed stream processing ● All three frameworks support Kafka as a stream source. (Diagram: Kafka Cluster feeding Stream Processing)
  • 50. Source of Stream Processing ● The most reliable source for stream processing (source: http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming)
  • 51. Source and/or Sink of Distributed Log Collectors ● Centralized log framework ● Distributed log collectors - Logstash - Fluentd - Flume (Diagrams: Kafka Cluster, Distributed Log Collector, and Other Sink; Kafka Cluster, Distributed Log Collector, and Other Source)
  • 52. Source and/or Sink of Distributed Log Collectors (cont.) ● Push vs pull ● Distributed log collectors provide configurable producers and consumers ● A Kafka cluster provides a distributed, highly available, reliable messaging system (Diagram: Distributed Log Collector and Kafka Cluster connected by push and pull arrows)
  • 53. Source of Lambda Architecture ● What is lambda architecture? - Stream for real-time data - Batch for historical data - Queries run against a merged view (source: http://lambda-architecture.net/)
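A query over a merged view can be sketched as combining a precomputed batch result with a small real-time delta. This is a deliberately simplified illustration, assuming both views are plain page-view counts keyed by page name; `merged_view` and the sample data are hypothetical and not taken from the slides.

```python
from collections import Counter


def merged_view(batch_view, realtime_view):
    """Combine historical counts with counts that arrived since the last batch run."""
    merged = Counter(batch_view)
    merged.update(realtime_view)     # Counter.update adds counts per key
    return dict(merged)


batch_view = {"home": 100, "about": 10}     # computed by the batch layer
realtime_view = {"home": 3, "jobs": 1}      # computed by the stream layer
view = merged_view(batch_view, realtime_view)
```

In this picture Kafka sits upstream of both layers: the same topic can be replayed in bulk for the batch layer and tailed for the stream layer.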
  • 54. Lambda Architecture (cont.) (source: https://metamarkets.com/2014/building-a-data-pipeline-that-handles-billions-of-events-in-real-time/)
  • 55. Recap ● Features - Durability - Scalability - High availability - High throughput ● Basic concepts - Producer, broker, consumer, consumer group - Topic, partition, message - Message ordering - Delivery semantics - Replication ● Why Kafka is fast ● Usage scenarios - Source of stream processing - Source or sink of a distributed log framework - Source of lambda architecture
  • 56. References
 ● Kafka Documentation: kafka.apache.org/documentation.html
 ● Kafka Wiki: https://cwiki.apache.org/confluence/display/KAFKA/Index
 ● The Log: What every software engineer should know about real-time data's unifying abstraction: engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
 ● Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines): engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
 ● Apache Kafka for Beginners: blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/
  • 58. // Any questions?
 question = consumer.receive();