SlideShare una empresa de Scribd logo
1 de 32
Descargar para leer sin conexión
Stateful Stream Processing with Kafka and
Samza
George Li
georgexiaoxiaoli@acm.org
What is Apache Kafka?
● A distributed log system developed by
LinkedIn
● Logs are stored as files on disk
● Logs are organized into topics (think tables
in db).
● Topics are sharded into partitions
What is Apache Kafka?
● Each message is associated with an offset
number
● Reads are almost always sequential: from a
starting offset to the end of log
● Don’t use Kafka for random read.
https://github.com/abhioncbr/Kafka-Message-Server/wiki/Apache-of-Kafka-
Architecture-(As-per-Apache-Kafka-0.8.0-Dcoumentation)
What is Apache Samza?
● A stream processing framework developed
by LinkedIn
● Most commonly used in conjunction with
Kafka and Yarn
● Often compared with Spark Streaming and
Storm
What is Yarn?
● Resource and node management for
Hadoop
● You can plug in different resource
managers,e.g., Mesos, for Samza
● We use Yarn as resource manager in our
solution.
Why not Spark Streaming?
● Our processing logic is mostly stateful
● Spark Streaming’s updateStateByKey is not
flexible enough for us.
● We want to keep states across mini-batches
and update single k-v pairs
Why not Apache Storm?
● In a Samza solution, topology can be
decomposed into standalone nodes (Yarn
job) and its in/out edges (Kafka topics)
● We were advised that Storm is relatively
hard to configure and tune properly
How does Samza work?
● Our Samza code runs as a Yarn job
● You implement the StreamTask interface,
which defines a process() call.
● StreamTask runs inside a task instance,
which itself is inside a Yarn container.
How does Samza work?
● Task instance is single-threaded
● Each task instance reads only one partition
for each Kafka topic it subscribes to.
● Process() called for each message task
instance receives
Samza’s local state
● Local state is stored in a local k-v store such
as RocksDB
● The content of the local k-v store is
checkpointed into a separate Kafka topic
● Why not use Cassandra for quick random k-
v read?
Avoid remote calls inside process()
● Task instance is single-threaded,i.e., remote
call is expensive
● Because process() is called for each
message, small performance hits add up
● In our use case, Kafka is Samza’s only input
source and output destination.
Ask log for the ultimate truth
● To update the state of processing logic, we
send a “update state” message to Kafka log
● Task instance knows how to handle such
message
● In effect, our StreamTask becomes an
interpreter evaluates incoming DSL
message
Fault tolerance
● Process some messages, send result to
output topics, and update local state
● Task instance restarts. Now we need restart
from the last input checkpoint
● Output messages may be received already
● Local states change may have been
checkpointed
Any part could fail. Need to simulate these cases
Fault tolerance with local state
● When a task instance restarts, local state is
repopulated by reading its own Kafka log
● Yes, reading and repopulating will take a few
minutes
● Ideally, message’s effect on state should
idempotent
● What if my scenario is not ideal?
Testing fault tolerance
● A stackable trait that randomly redelivers
message
● Another stackable trait that randomly injects
failure at different functions
● Chaos Monkey on cluster
Reprocessing
● I want to upload a new version of code, but I
dont want to kill my current stream job.
● An important problem. Need to design for it
from the beginning
Reprocessing Options
1. Pull Kafka log into a data sink, and run batch
job against the sink,i.e., lambda architecture
2. Since Kafka holds the ultimate truth, we just
replay the log with new code, as long as you
can do it fast enough
Is my replay fast enough?
● 5 input partitions. Each holds 10 mil
messages
● If our Samza job can process 5k
message/sec (which is not a high speed)
● Less than 40 mins to reprocess all 50 mil
messages
More problems with Reprocessing
● When can you kill the old Yarn job?
● How does old result/new result affect user
experience?
● You need to write your own tool to manage
new topics and data sinks
Multi-tenancy Options
● Each tenant gets standalone topic/partitions
● Each topic partition holds data from multiple
tenants
In our case, the second option gives better
resource utilization
Multi-tenancy with local state
● Local k-v store holds data from different
tenants
● User should not play with tenant info - need
abstraction
● Solution similar to how we handle multi-
tenancy in Redis
Multi-tenant solution
● Each message has to carry tenant info
● Use a stackable trait to extract tenant info.
● This trait also controls access to local k-v
store.
● Code inside process() has no knowledge of
tenant info at all
Testing with local states
● Unit testing is slightly tricky due to API
● Use Kafka.utils.{TestUtils, TestZKUtils} to
spawn a test zookeeper/Kafka cluster in
process.
● org.apache.samza.test.integration.
TestStatefulTask
Performance troubleshooting
● Samza can write metrics to Kafka log. This
metrics often shows obvious bottlenecks
● Most performance problems come from
process() logic
Performance troubleshooting
● Often Samza metrics does not lead to the
root cause
● We rely mostly on jstack and tcpdump
● Tools do not replace problem-solving
Common performance problems
● Excessive JSON parsing
● Inefficient serialization/deserialization
● Accidental blocking call
● Evaluation of logging parameters. Can use
Samza’s own logging trait
Questions?
Feel free to send me emails/messages

Más contenido relacionado

La actualidad más candente

Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014Chen-en Lu
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafkadatamantra
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecPeter Bakas
 
Function Mesh for Apache Pulsar, the Way for Simple Streaming Solutions
Function Mesh for Apache Pulsar, the Way for Simple Streaming SolutionsFunction Mesh for Apache Pulsar, the Way for Simple Streaming Solutions
Function Mesh for Apache Pulsar, the Way for Simple Streaming SolutionsStreamNative
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetesdatamantra
 
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache KafkaKafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafkaconfluent
 
Apache samza past, present and future
Apache samza  past, present and futureApache samza  past, present and future
Apache samza past, present and futureEd Yakabosky
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Serverless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsServerless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsStreamNative
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin PodvalMartin Podval
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark MLdatamantra
 
How Pulsar Stores Your Data - Pulsar Summit NA 2021
How Pulsar Stores Your Data - Pulsar Summit NA 2021How Pulsar Stores Your Data - Pulsar Summit NA 2021
How Pulsar Stores Your Data - Pulsar Summit NA 2021StreamNative
 
Samza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next LevelSamza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next LevelMartin Kleppmann
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
Introduction to Akka-Streams
Introduction to Akka-StreamsIntroduction to Akka-Streams
Introduction to Akka-Streamsdmantula
 
Mystery Machine Overview
Mystery Machine OverviewMystery Machine Overview
Mystery Machine OverviewIvan Glushkov
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streamingdatamantra
 
Pulsarctl & Pulsar Manager
Pulsarctl & Pulsar ManagerPulsarctl & Pulsar Manager
Pulsarctl & Pulsar ManagerStreamNative
 

La actualidad más candente (20)

Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Apache samza
Apache samzaApache samza
Apache samza
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Function Mesh for Apache Pulsar, the Way for Simple Streaming Solutions
Function Mesh for Apache Pulsar, the Way for Simple Streaming SolutionsFunction Mesh for Apache Pulsar, the Way for Simple Streaming Solutions
Function Mesh for Apache Pulsar, the Way for Simple Streaming Solutions
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache KafkaKafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
Kafka Summit NYC 2017 - Introducing Exactly Once Semantics in Apache Kafka
 
Apache samza past, present and future
Apache samza  past, present and futureApache samza  past, present and future
Apache samza past, present and future
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Serverless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsServerless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar Functions
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
How Pulsar Stores Your Data - Pulsar Summit NA 2021
How Pulsar Stores Your Data - Pulsar Summit NA 2021How Pulsar Stores Your Data - Pulsar Summit NA 2021
How Pulsar Stores Your Data - Pulsar Summit NA 2021
 
Samza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next LevelSamza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next Level
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015 ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015
 
Introduction to Akka-Streams
Introduction to Akka-StreamsIntroduction to Akka-Streams
Introduction to Akka-Streams
 
Mystery Machine Overview
Mystery Machine OverviewMystery Machine Overview
Mystery Machine Overview
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Pulsarctl & Pulsar Manager
Pulsarctl & Pulsar ManagerPulsarctl & Pulsar Manager
Pulsarctl & Pulsar Manager
 

Destacado

Functional and scale performance tests using zopkio
Functional and scale performance tests using zopkio Functional and scale performance tests using zopkio
Functional and scale performance tests using zopkio Marcelo Araujo
 
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...confluent
 
Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan confluent
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkroutconfluent
 
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...Michael Noll
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka StreamsGuozhang Wang
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystemconfluent
 

Destacado (9)

Functional and scale performance tests using zopkio
Functional and scale performance tests using zopkio Functional and scale performance tests using zopkio
Functional and scale performance tests using zopkio
 
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
 
Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
 
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystem
 

Similar a Stateful stream processing with kafka and samza

Messaging queue - Kafka
Messaging queue - KafkaMessaging queue - Kafka
Messaging queue - KafkaMayank Bansal
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache KafkaAmir Sedighi
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...Athens Big Data
 
AWS Lambdas are cool - Cheminfo Stories Day 1
AWS Lambdas are cool - Cheminfo Stories Day 1AWS Lambdas are cool - Cheminfo Stories Day 1
AWS Lambdas are cool - Cheminfo Stories Day 1ChemAxon
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesAlexis Seigneurin
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkDemi Ben-Ari
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureYaroslav Tkachenko
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaSteven Wu
 
The Cost of Kafka’s High Availability on Cloud with Geetha Anne
The Cost of Kafka’s High Availability on Cloud with Geetha AnneThe Cost of Kafka’s High Availability on Cloud with Geetha Anne
The Cost of Kafka’s High Availability on Cloud with Geetha AnneHostedbyConfluent
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-CamusDeep Shah
 
Distributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDistributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDavid van Geest
 
Samza portable runner for beam
Samza portable runner for beamSamza portable runner for beam
Samza portable runner for beamHai Lu
 
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...StreamNative
 

Similar a Stateful stream processing with kafka and samza (20)

kafka
kafkakafka
kafka
 
Messaging queue - Kafka
Messaging queue - KafkaMessaging queue - Kafka
Messaging queue - Kafka
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
AWS Lambdas are cool - Cheminfo Stories Day 1
AWS Lambdas are cool - Cheminfo Stories Day 1AWS Lambdas are cool - Cheminfo Stories Day 1
AWS Lambdas are cool - Cheminfo Stories Day 1
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda Architecture
 
EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
The Cost of Kafka’s High Availability on Cloud with Geetha Anne
The Cost of Kafka’s High Availability on Cloud with Geetha AnneThe Cost of Kafka’s High Availability on Cloud with Geetha Anne
The Cost of Kafka’s High Availability on Cloud with Geetha Anne
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-Camus
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Distributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and CassandraDistributed Task Scheduling with Akka, Kafka and Cassandra
Distributed Task Scheduling with Akka, Kafka and Cassandra
 
Samza portable runner for beam
Samza portable runner for beamSamza portable runner for beam
Samza portable runner for beam
 
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
 

Último

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 

Último (20)

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 

Stateful stream processing with kafka and samza

  • 1. Stateful Stream Processing with Kafka and Samza George Li georgexiaoxiaoli@acm.org
  • 2. What is Apache Kafka? ● A distributed log system developed by LinkedIn ● Logs are stored as files on disk ● Logs are organized into topics (think tables in db). ● Topics are sharded into partitions
  • 3. What is Apache Kafka? ● Each message is associated with an offset number ● Reads are almost always sequential: from a starting offset to the end of log ● Don’t use Kafka for random read.
  • 5. What is Apache Samza? ● A stream processing framework developed by LinkedIn ● Most commonly used in conjunction with Kafka and Yarn ● Often compared with Spark Streaming and Storm
  • 6. What is Yarn? ● Resource and node management for Hadoop ● You can plug in different resource managers,e.g., Mesos, for Samza ● We use Yarn as resource manager in our solution.
  • 7. Why not Spark Streaming? ● Our processing logic is mostly stateful ● Spark Streaming’s updateStateByKey is not flexible enough for us. ● We want to keep states across mini-batches and update single k-v pairs
  • 8. Why not Apache Storm? ● In a Samza solution, topology can be decomposed into standalone nodes (Yarn job) and its in/out edges (Kafka topics) ● We were advised that Storm is relatively hard to configure and tune properly
  • 9. How does Samza work? ● Our Samza code runs as a Yarn job ● You implement the StreamTask interface, which defines a process() call. ● StreamTask runs inside a task instance, which itself is inside a Yarn container.
  • 10. How does Samza work? ● Task instance is single-threaded ● Each task instance reads only one partition for each Kafka topic it subscribes to. ● Process() called for each message task instance receives
  • 11.
  • 12. Samza’s local state ● Local state is stored in a local k-v store such as RocksDB ● The content of the local k-v store is checkpointed into a separate Kafka topic ● Why not use Cassandra for quick random k- v read?
  • 13. Avoid remote calls inside process() ● Task instance is single-threaded,i.e., remote call is expensive ● Because process() is called for each message, small performance hits add up ● In our use case, Kafka is Samza’s only input source and output destination.
  • 14. Ask log for the ultimate truth ● To update the state of processing logic, we send a “update state” message to Kafka log ● Task instance knows how to handle such message ● In effect, our StreamTask becomes an interpreter evaluates incoming DSL message
  • 15. Fault tolerance ● Process some messages, send result to output topics, and update local state ● Task instance restarts. Now we need restart from the last input checkpoint ● Output messages may be received already ● Local states change may have been checkpointed
  • 16. Any part could fail. Need to simulate these cases
  • 17. Fault tolerance with local state ● When a task instance restarts, local state is repopulated by reading its own Kafka log ● Yes, reading and repopulating will take a few minutes ● Ideally, message’s effect on state should idempotent ● What if my scenario is not ideal?
  • 18. Testing fault tolerance ● A stackable trait that randomly redelivers message ● Another stackable trait that randomly injects failure at different functions ● Chaos Monkey on cluster
  • 19. Reprocessing ● I want to upload a new version of code, but I dont want to kill my current stream job. ● An important problem. Need to design for it from the beginning
  • 20. Reprocessing Options 1. Pull Kafka log into a data sink, and run batch job against the sink,i.e., lambda architecture 2. Since Kafka holds the ultimate truth, we just replay the log with new code, as long as you can do it fast enough
  • 21.
  • 22. Is my replay fast enough? ● 5 input partitions. Each holds 10 mil messages ● If our Samza job can process 5k message/sec (which is not a high speed) ● Less than 40 mins to reprocess all 50 mil messages
  • 23. More problems with Reprocessing ● When can you kill the old Yarn job? ● How does old result/new result affect user experience? ● You need to write your own tool to manage new topics and data sinks
  • 24. Multi-tenancy Options ● Each tenant gets standalone topic/partitions ● Each topic partition holds data from multiple tenants In our case, the second option gives better resource utilization
  • 25.
  • 26. Multi-tenancy with local state ● Local k-v store holds data from different tenants ● User should not play with tenant info - need abstraction ● Solution similar to how we handle multi- tenancy in Redis
  • 27. Multi-tenant solution ● Each message has to carry tenant info ● Use a stackable trait to extract tenant info. ● This trait also controls access to local k-v store. ● Code inside process() has no knowledge of tenant info at all
  • 28. Testing with local states ● Unit testing is slightly tricky due to API ● Use Kafka.utils.{TestUtils, TestZKUtils} to spawn a test zookeeper/Kafka cluster in process. ● org.apache.samza.test.integration. TestStatefulTask
  • 29. Performance troubleshooting ● Samza can write metrics to Kafka log. This metrics often shows obvious bottlenecks ● Most performance problems come from process() logic
  • 30. Performance troubleshooting ● Often Samza metrics does not lead to the root cause ● We rely mostly on jstack and tcpdump ● Tools do not replace problem-solving
  • 31. Common performance problems ● Excessive JSON parsing ● Inefficient serialization/deserialization ● Accidental blocking call ● Evaluation of logging parameters. Can use Samza’s own logging trait
  • 32. Questions? Feel free to send me emails/messages