SlideShare una empresa de Scribd logo
1 de 53
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Apache Kafka at LinkedIn
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
About Me
2
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Agenda
3
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Why We Build Kafka?
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
We Have a lot of Data
5
• User activity tracking
• Page views, ad impressions, etc
• Server logs and metrics
• Syslogs, request-rates, etc
• Messaging
• Emails, news feeds, etc
• Computation derived
• Results of Hadoop / data warehousing, etc
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
.. and We Build Products on Data
6
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Newsfeed
7
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Recommendation
8HADOOP SUMMIT 2013
People you may know
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Recommendation
9
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Search
10
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Metrics and Monitoring
11
HADOOP SUMMIT 2013
System and application metrics/logging
LinkedIn Corporation ©2013 All Rights Reserved 5
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
.. and a LOT of Monitoring
12
The Problem:
How to integrate this variety of data
and make it available to all products?
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 14
Life back in 2010:
Point-to-Point Pipeplines
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 15
Example: User Activity Data Flow
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 16
What We Want
• A centralized data pipeline
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 17
Apache Kafka
We tried some systems off-
the-shelf, but…
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 18
What We REALLY Want
• A centralized data pipeline that is
• Elastically scalable
• Durable
• High-throughput
• Easy to use
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
• A distributed pub-sub messaging system
• Scale-out from groundup
• Persistent to disks
• High-Throughput (10s MB/sec per server)
19
Apache Kafka
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 20
Life Since Kafka in Production
Apache Kafka
• Developed and maintained by 5 Devs + 2 SRE
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Agenda
21
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Key Idea #1:
Data-parallelism leads to scale-out
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
• Produce/consume requests are randomly balanced
among brokers
23
Distribute Clients across Partitions
Key Idea #2:
Disks are fast when used sequentially
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
• Appends are effectively O(1)
• Reads from known offset are fast still, when cached
25
Store Messages as a Log
3 4 5 5 7 8 9 10 11 12...
Producer Write
Consumer1
Reads (offset 7)
Consumer2
Reads (offset 7)
Partition i of Topic A
Key Idea #3:
Batching makes best use of network/IO
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
• Batched send and receive
• Batched compression
• No message caching in JVM
• Zero-copy from file to socket (Java NIO)
27
Batch Transfer
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 28
The API (0.8)
Producer:
send(topic, message)
Consumer:
Iterable stream = createMessageStreams(…).get(topic)
for (message: stream) {
// process the message
}
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Agenda
29
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 30
Kafka Usage at LinkedIn
• Mainly used for tracking user-activity and metrics data
• 16 - 32 brokers in each cluster (615+ total brokers)
• 527 billion messages/day
• 7500+ topics, 270k+ partitions
• Byte rates:
• Writes: 97 TB/day
• Reads: 430 TB/day
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 31
Kafka Usage at LinkedIn
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 32
Kafka Usage at LinkedIn
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 33
Kafka Usage at LinkedIn
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Agenda
34
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Problems
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 36
Standardized Schema on Avro
• Schema
• Message structure contract
• Performance gain
• Workflow
• Check in schema
• Auto compatibility check
• Code review
• “Ship it!”
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Agenda
37
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 38
Kafka to Hadoop
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 39
Hadoop ETL (Camus)
• Map/Reduce job does data load
• One job loads all events
• ~10 minute ETA on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
• Open sourced:
– https://github.com/linkedin/camus
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Agenda
40
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Does it really work?
“All published messages must be delivered to all consumers (quickly)”
Audit Trail
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 43
More Features in Kafka 0.8
• Intra-cluster replication (0.8.0)
• Highly availability,
• Reduced latency
• Log compaction (0.8.1)
• State storage
• Operational tools (0.8.2)
• Topic management
• Automated leader rebalance
• etc ..
Checkout our page for more: http://kafka.apache.org/
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 44
Kafka 0.9
• Clients Rewrite
• Remove ZK dependency
• Even better throughput
• Security
• More operability, multi-tenancy ready
• Transactional Messaing
• From at-least-one to exactly-once
Checkout our page for more: http://kafka.apache.org/
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Kafka Users: Next Maybe You?
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 46
Acknowledgements
Questions? Guozhang Wang
guwang@linkedin.com
www.linkedin.com/in/guozhangwang
Backup Slides
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 49
Real-time Analysis with Kafka
• Analytics from Hadoop can be slow
• Production -> Kafka: tens of milliseconds
• Kafka - > Hadoop: < 1 minute
• ETL in Hadoop: ~ 45 minutes
• MapReduce in Hadoop: maybe hours
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 50
Real-time Analysis with Kafka
• Solution No.1: directly consuming from Kafka
• Solution No. 2: other storage than HDFS
• Spark, Shark
• Pinot, Druid, FastBit
• Solution No. 3: stream processing
• Apache Samza
• Storm
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 51
How Fast can Kafka Go?
• Bottleneck #1: network bandwidth
• Producer: 100 Mb/s for 1 Gig-Ethernet
• Consumer can be slower due to multi-sub
• Bottleneck #2: disk space
• Data may be deleted before consumed at peak time•
• Configurable time/size-based retention policy
• Bottleneck #3: Zookeeper
• Mainly due to offset commit, will be lifted in 0.9
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 52
Intra-cluster Replication
• Pick CA within Datacenter (failover < 10ms)
• Network partition is rare
• Latency less than an issue
• Separate data replication and consensus
• Consensus => Zookeeper
• Replication => primary-backup (f to tolerate f-1 failure)
• Configurable ACK (durability v.s. latency)
• More details:
• http://www.slideshare.net/junrao/kafka-replication-apachecon2013
©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 53
Replication Architecture
Producer
Consumer
Producer
Broker Broker Broker Broker
Consumer
ZK

Más contenido relacionado

La actualidad más candente

Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...HostedbyConfluent
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connectKnoldus Inc.
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaKai Wähner
 
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®confluent
 
Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka
Kafka Streams vs. KSQL for Stream Processing on top of Apache KafkaKafka Streams vs. KSQL for Stream Processing on top of Apache Kafka
Kafka Streams vs. KSQL for Stream Processing on top of Apache KafkaKai Wähner
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...confluent
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!Guido Schmutz
 
Introducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumIntroducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumChengKuan Gan
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connectconfluent
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explainedconfluent
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka IntroductionAmita Mirajkar
 

La actualidad más candente (20)

Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
 
Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka
Kafka Streams vs. KSQL for Stream Processing on top of Apache KafkaKafka Streams vs. KSQL for Stream Processing on top of Apache Kafka
Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
KSQL Intro
KSQL IntroKSQL Intro
KSQL Intro
 
Introducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumIntroducing Change Data Capture with Debezium
Introducing Change Data Capture with Debezium
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 

Similar a Apache Kafka at LinkedIn

GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023Timothy Spann
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
 
CA Technologies Customer Presentation
CA Technologies Customer PresentationCA Technologies Customer Presentation
CA Technologies Customer PresentationSplunk
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Data Con LA
 
Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to KafkaAkash Vacher
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelinesconfluent
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Timothy Spann
 
Oracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureOracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureRiccardo Romani
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Michael Noll
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015Wes McKinney
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 

Similar a Apache Kafka at LinkedIn (20)

GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
CA Technologies Customer Presentation
CA Technologies Customer PresentationCA Technologies Customer Presentation
CA Technologies Customer Presentation
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to Kafka
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
Oracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureOracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and Architecture
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 

Más de Guozhang Wang

Consensus in Apache Kafka: From Theory to Production.pdf
Consensus in Apache Kafka: From Theory to Production.pdfConsensus in Apache Kafka: From Theory to Production.pdf
Consensus in Apache Kafka: From Theory to Production.pdfGuozhang Wang
 
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...Guozhang Wang
 
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...Guozhang Wang
 
Introduction to the Incremental Cooperative Protocol of Kafka
Introduction to the Incremental Cooperative Protocol of KafkaIntroduction to the Incremental Cooperative Protocol of Kafka
Introduction to the Incremental Cooperative Protocol of KafkaGuozhang Wang
 
Performance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams ApplicationsPerformance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams ApplicationsGuozhang Wang
 
Apache Kafka from 0.7 to 1.0, History and Lesson Learned
Apache Kafka from 0.7 to 1.0, History and Lesson LearnedApache Kafka from 0.7 to 1.0, History and Lesson Learned
Apache Kafka from 0.7 to 1.0, History and Lesson LearnedGuozhang Wang
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsGuozhang Wang
 
Apache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream ProcessingApache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream ProcessingGuozhang Wang
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingGuozhang Wang
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaGuozhang Wang
 
Building a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache KafkaBuilding a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache KafkaGuozhang Wang
 
Behavioral Simulations in MapReduce
Behavioral Simulations in MapReduceBehavioral Simulations in MapReduce
Behavioral Simulations in MapReduceGuozhang Wang
 
Automatic Scaling Iterative Computations
Automatic Scaling Iterative ComputationsAutomatic Scaling Iterative Computations
Automatic Scaling Iterative ComputationsGuozhang Wang
 

Más de Guozhang Wang (13)

Consensus in Apache Kafka: From Theory to Production.pdf
Consensus in Apache Kafka: From Theory to Production.pdfConsensus in Apache Kafka: From Theory to Production.pdf
Consensus in Apache Kafka: From Theory to Production.pdf
 
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa...
 
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...
 
Introduction to the Incremental Cooperative Protocol of Kafka
Introduction to the Incremental Cooperative Protocol of KafkaIntroduction to the Incremental Cooperative Protocol of Kafka
Introduction to the Incremental Cooperative Protocol of Kafka
 
Performance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams ApplicationsPerformance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams Applications
 
Apache Kafka from 0.7 to 1.0, History and Lesson Learned
Apache Kafka from 0.7 to 1.0, History and Lesson LearnedApache Kafka from 0.7 to 1.0, History and Lesson Learned
Apache Kafka from 0.7 to 1.0, History and Lesson Learned
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka Streams
 
Apache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream ProcessingApache Kafka, and the Rise of Stream Processing
Apache Kafka, and the Rise of Stream Processing
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
 
Building a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache KafkaBuilding a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache Kafka
 
Behavioral Simulations in MapReduce
Behavioral Simulations in MapReduceBehavioral Simulations in MapReduce
Behavioral Simulations in MapReduce
 
Automatic Scaling Iterative Computations
Automatic Scaling Iterative ComputationsAutomatic Scaling Iterative Computations
Automatic Scaling Iterative Computations
 

Último

Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...Amil Baba Dawood bangali
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptxNikhil Raut
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadaditya806802
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 

Último (20)

Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptx
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasad
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 

Apache Kafka at LinkedIn

  • 1. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Apache Kafka at LinkedIn
  • 2. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure About Me 2
  • 3. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Agenda 3 • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Roadmap • Q & A
  • 4. Why We Build Kafka?
  • 5. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure We Have a lot of Data 5 • User activity tracking • Page views, ad impressions, etc • Server logs and metrics • Syslogs, request-rates, etc • Messaging • Emails, news feeds, etc • Computation derived • Results of Hadoop / data warehousing, etc
  • 6. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure .. and We Build Products on Data 6
  • 7. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Newsfeed 7
  • 8. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Recommendation 8HADOOP SUMMIT 2013 People you may know
  • 9. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Recommendation 9
  • 10. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Search 10
  • 11. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Metrics and Monitoring 11 HADOOP SUMMIT 2013 System and application metrics/logging LinkedIn Corporation ©2013 All Rights Reserved 5
  • 12. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure .. and a LOT of Monitoring 12
  • 13. The Problem: How to integrate this variety of data and make it available to all products?
  • 14. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 14 Life back in 2010: Point-to-Point Pipeplines
  • 15. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 15 Example: User Activity Data Flow
  • 16. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 16 What We Want • A centralized data pipeline
  • 17. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 17 Apache Kafka We tried some systems off- the-shelf, but…
  • 18. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 18 What We REALLY Want • A centralized data pipeline that is • Elastically scalable • Durable • High-throughput • Easy to use
  • 19. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure • A distributed pub-sub messaging system • Scale-out from groundup • Persistent to disks • High-Throughput (10s MB/sec per server) 19 Apache Kafka
  • 20. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 20 Life Since Kafka in Production Apache Kafka • Developed and maintained by 5 Devs + 2 SRE
  • 21. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Agenda 21 • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Roadmap • Q & A
  • 22. Key Idea #1: Data-parallelism leads to scale-out
  • 23. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure • Produce/consume requests are randomly balanced among brokers 23 Distribute Clients across Partitions
  • 24. Key Idea #2: Disks are fast when used sequentially
  • 25. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure • Appends are effectively O(1) • Reads from known offset are fast still, when cached 25 Store Messages as a Log 3 4 5 5 7 8 9 10 11 12... Producer Write Consumer1 Reads (offset 7) Consumer2 Reads (offset 7) Partition i of Topic A
  • 26. Key Idea #3: Batching makes best use of network/IO
  • 27. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure • Batched send and receive • Batched compression • No message caching in JVM • Zero-copy from file to socket (Java NIO) 27 Batch Transfer
  • 28. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 28 The API (0.8) Producer: send(topic, message) Consumer: Iterable stream = createMessageStreams(…).get(topic) for (message: stream) { // process the message }
  • 29. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Agenda 29 • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Pipeline deployment • Schema for data cleanliness • O(1) ETL • Auditing for correctness • Roadmap • Q & A
  • 30. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 30 Kafka Usage at LinkedIn • Mainly used for tracking user-activity and metrics data • 16 - 32 brokers in each cluster (615+ total brokers) • 527 billion messages/day • 7500+ topics, 270k+ partitions • Byte rates: • Writes: 97 TB/day • Reads: 430 TB/day
  • 31. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 31 Kafka Usage at LinkedIn
  • 32. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 32 Kafka Usage at LinkedIn
  • 33. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 33 Kafka Usage at LinkedIn
  • 34. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Agenda 34 • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Pipeline deployment • Schema for data cleanliness • O(1) ETL • Auditing for correctness • Roadmap • Q & A
  • 35. Problems • Hundreds of message types • Thousands of fields • What do they all mean? • What happens when they change?
  • 36. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 36 Standardized Schema on Avro • Schema • Message structure contract • Performance gain • Workflow • Check in schema • Auto compatibility check • Code review • “Ship it!”
  • 37. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Agenda 37 • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Pipeline deployment • Schema for data cleanliness • O(1) ETL • Auditing for correctness • Roadmap • Q & A
  • 38. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 38 Kafka to Hadoop
  • 39. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 39 Hadoop ETL (Camus) • Map/Reduce job does data load • One job loads all events • ~10 minute ETA on average from producer to HDFS • Hive registration done automatically • Schema evolution handled transparently • Open sourced: – https://github.com/linkedin/camus
  • 40. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Agenda 40 • Overview of Kafka • Kafka Design • Kafka Usage at LinkedIn • Pipeline deployment • Schema for data cleanliness • O(1) ETL • Auditing for correctness • Roadmap • Q & A
  • 41. Does it really work? “All published messages must be delivered to all consumers (quickly)”
  • 43. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 43 More Features in Kafka 0.8 • Intra-cluster replication (0.8.0) • Highly availability, • Reduced latency • Log compaction (0.8.1) • State storage • Operational tools (0.8.2) • Topic management • Automated leader rebalance • etc .. Checkout our page for more: http://kafka.apache.org/
  • 44. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 44 Kafka 0.9 • Clients Rewrite • Remove ZK dependency • Even better throughput • Security • More operability, multi-tenancy ready • Transactional Messaing • From at-least-one to exactly-once Checkout our page for more: http://kafka.apache.org/
  • 45. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Kafka Users: Next Maybe You?
  • 46. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 46 Acknowledgements
  • 49. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 49 Real-time Analysis with Kafka • Analytics from Hadoop can be slow • Production -> Kafka: tens of milliseconds • Kafka - > Hadoop: < 1 minute • ETL in Hadoop: ~ 45 minutes • MapReduce in Hadoop: maybe hours
  • 50. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 50 Real-time Analysis with Kafka • Solution No.1: directly consuming from Kafka • Solution No. 2: other storage than HDFS • Spark, Shark • Pinot, Druid, FastBit • Solution No. 3: stream processing • Apache Samza • Storm
  • 51. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 51 How Fast can Kafka Go? • Bottleneck #1: network bandwidth • Producer: 100 Mb/s for 1 Gig-Ethernet • Consumer can be slower due to multi-sub • Bottleneck #2: disk space • Data may be deleted before consumed at peak time• • Configurable time/size-based retention policy • Bottleneck #3: Zookeeper • Mainly due to offset commit, will be lifted in 0.9
  • 52. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 52 Intra-cluster Replication • Pick CA within Datacenter (failover < 10ms) • Network partition is rare • Latency less than an issue • Separate data replication and consensus • Consensus => Zookeeper • Replication => primary-backup (f to tolerate f-1 failure) • Configurable ACK (durability v.s. latency) • More details: • http://www.slideshare.net/junrao/kafka-replication-apachecon2013
  • 53. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure 53 Replication Architecture Producer Consumer Producer Broker Broker Broker Broker Consumer ZK

Notas del editor

  1. Data-serving websites, LinkedIn has a lot of data
  2. Based on relevence
  3. We have this variety of data and and we need to build all these products around such data.
  4. We have this variety of data and and we need to build all these products around such data.
  5. Messaging: ActiveMQ User Activity: In house log aggregation Logging: Splunk Metrics: JMX => Zenoss Database data: Databus, custom ETL
  6. ActiveMQ: they do not fly
  7. Now you maybe wondering why it works so well? For example, why it can be both highly durable by persisting data to disks while still maintaining high throughput?
  8. Topic = message stream Topic has partitions, partitions are distributed to brokers
  9. Do not be afraid of disks
  10. File system caching
  11. And finally after all these tricks, the client interface we exposed to the users, are very simple.
  12. Now I will switch my gear and talk a little bit about Kafka usage at Linkedin
  13. 21st, October.
  14. Multi-colo
  15. 99.99%
  16. 0.8.2: Delete topic Automated leader rebalancing Controlled shutdown Offset management Parallel recovery min.isr and clean leader election
  17. Non-Java / Scala C / C++ / .NET Go Clojure Ruby Node.js PHP Python Erlang HTTP REST Command line etc .. https://cwiki.apache.org/confluence/display/KAFKA/Clients Python - Pure Python implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported. C - High performance C library with full protocol support C++ - Native C++ library with protocol support for Metadata, Produce, Fetch, and Offset. Go (aka golang) Pure Go implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported. Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2. Clojure - Clojure DSL for the Kafka API JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation stdin & stdout https://cwiki.apache.org/confluence/display/KAFKA/Clients
  18. Non-Java / Scala C / C++ / .NET Go Clojure Ruby Node.js PHP Python Erlang HTTP REST Command line etc ..