Learnings From Shipping 1000+ Streaming Data Pipelines To Production with Hakan Lofcali & Stefan Sprenger

{hakan,stefan}@datacater.io
‣We develop tools for developers working with streaming data
‣With Kafka, Kubernetes, and fewer than 5 developers, we built a platform that helped teams deploy more than 1,000 streaming data pipelines to production
‣Let us take you through our journey: the tools we adopted, the hurdles we encountered, and the solutions we found
‣Infra Space
‣Customer Solutions Space
2
WHAT WE DO / WHO WE ARE
3
STREAMING DATA PIPELINES
Continuous Applications of Data Transformations
[Diagram: Kafka Connect Source Connector -> Kafka Topic -> Kafka Streams -> Kafka Topic -> Kafka Connect Sink Connector]
4
STREAMING PIPELINES IN THE WILD
Customer communications in real-time
[Diagram labels: Actionable data; Clickstream data; Outbox service; Raw data; Process clickstream events]
5
GOALS FOR THIS TALK
Avoid common pitfalls in streaming ETL
‣How to operate streaming data pipelines in an efficient and robust manner?
‣How to deal with resource-leaking Kafka Connect connectors?
‣How to monitor and debug running pipelines?
‣What are ways to deal with large data sources or slow data sinks?
‣What is missing in today’s ecosystem for streaming to become a commodity?
How to operate streaming data pipelines in an efficient and robust manner?
6
7
SEPARATE BY NODE POOL
Kafka NodePool
Apache Kafka
Broker*
Strimzi Kafka
Operator
Control Plane Nodepool
Pod
We operate one K8s Cluster - Multiple Node pools
K8s StatefulSet
K8s Deployment
DataCater
Control Plane
K8s Deployment
…
Kafka Connect Nodepool
Quarkus
Pipeline
Pod
K8s Deployment(s)
▸ Max 110 pods per node
▸ Max 5,000 nodes per cluster
▸ Max 150,000 pods in total
▸ *Separate Kafka and Kafka Connect clusters
State-of-the-art Orchestration
8
PROCESS ORCHESTRATION
▸ We started out on a single VM and moved to a distributed process orchestration tool
▸ Kafka’s ecosystem lags behind state-of-the-art process orchestration tools such as Kubernetes and Nomad
▸ ksqlDB and Kafka Connect manage processes, but, as we will see, they lack fundamental patterns needed to operate them at scale
Kafka Streams
Single VM Docker
Java Quarkus
Kubernetes
Quarkus SmallRye Reactive Messaging
9
STARTUP TIME
[Timeline figure comparing startup from "Scheduled" through "Liveness OK" to "First Event Processed"; tick labels: 0s, 5s, 10s, 30s, 60s; one series is labelled Kafka Streams]
10
WORKLOAD DENSITY
[Diagram: Docker on Single VM runs two Kafka Streams processes, RAM < 1.5GB each; Kubernetes runs three Quarkus processes, RAM < 300MB each]
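The density difference can be made concrete with a back-of-the-envelope calculation. This is only a sketch: the per-process memory ceilings and the 110-pods-per-node limit come from the slides, while the 16 GiB of allocatable node memory is a hypothetical figure chosen for illustration.

```python
# Memory ceilings from the slide; the node size below is an assumption.
KAFKA_STREAMS_MB = 1536   # "RAM < 1.5GB" per Kafka Streams process
QUARKUS_MB = 300          # "RAM < 300MB" per Quarkus process
MAX_PODS_PER_NODE = 110   # Kubernetes per-node pod limit quoted earlier


def pipelines_per_node(node_memory_mb: int, per_pipeline_mb: int) -> int:
    """Upper bound on pipelines per node, capped by the pod limit."""
    by_memory = node_memory_mb // per_pipeline_mb
    return min(by_memory, MAX_PODS_PER_NODE)


NODE_MB = 16 * 1024  # hypothetical allocatable memory per node

print(pipelines_per_node(NODE_MB, KAFKA_STREAMS_MB))  # 10
print(pipelines_per_node(NODE_MB, QUARKUS_MB))        # 54
```

Under these assumptions memory, not the pod limit, is the binding constraint in both cases; the pod limit only starts to matter on much larger nodes.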
11
STRIMZI FOR KAFKA
Kubernetes; Dedicated node pool for Kafka
[Diagram: the Strimzi Kafka Operator manages a Kubernetes StatefulSet of Apache Kafka Broker pods (three shown, more implied)]
12
CONSUMER RE-BALANCING
[Diagram: pods running "summit" consumers 0, 1, 2, … reading from "summit" partitions 0, 1, 2, …]
13
▸ During a consumer rebalance, nothing is consumed until the group coordinator has completed the rebalance
▸ The number of consumers can change due to errors, disconnections, or scaling triggered by new load requirements
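To illustrate why a change in the number of consumers forces partitions to move, here is a minimal sketch of a round-robin assignment. It is a simplified stand-in, not Kafka's actual assignor implementation, and the consumer names are hypothetical.

```python
def assign_round_robin(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions over consumers round-robin.

    Simplified illustration only; Kafka's real assignors are more involved.
    """
    if not consumers:
        raise ValueError("at least one consumer is required")
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2]
before = assign_round_robin(partitions, ["consumer-0", "consumer-1", "consumer-2"])
# One pod disappears (error, disconnect, or scale-down): the group rebalances.
after = assign_round_robin(partitions, ["consumer-0", "consumer-1"])

print(before)  # {'consumer-0': [0], 'consumer-1': [1], 'consumer-2': [2]}
print(after)   # {'consumer-0': [0, 2], 'consumer-1': [1]}
```

Partition 2 has to be handed to another consumer, and until the new assignment is agreed on, consumption pauses.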
14
UNEXPECTED SHUTDOWN
[Chart: Startup Time versus Partition Size (Log Size); annotations: "Not Linear", "Point of No Recovery"]
How to deal with resource-leaking Kafka Connect connectors?
16
17
KAFKA CONNECT SELF-MANAGED
…
Kubernetes (K8s); Dedicated Kafka Connect Nodepool
ElasticSearch Sink
Task C
S3 Source
Task A
PostgreSQL Source
Task A
K8s Deployment / Connect Cluster
Pod Pod Pod
PostgreSQL Source
Task B
MySQL CDC Source
Task A
MySQL CDC Source
Task B
K8s Deployment / Connect Cluster
18
KAFKA CONNECT SELF-MANAGED
…
Kubernetes (K8s); Dedicated Kafka Connect Nodepool
ElasticSearch Sink
Task C
MySQL CDC Source
Task A
MySQL CDC Source
Task B
S3 SOURCE
TASK A
Pod Pod
PostgreSQL Source
Task A
PostgreSQL Source
Task B
K8s Deployment / Connect Cluster
19
KAFKA CONNECT SELF-MANAGED
…
Kubernetes (K8s); Dedicated Kafka Connect Nodepool
ElasticSearch Sink
Task C
S3 SOURCE
TASK A
Pod Pod
MySQL CDC Source
Task A
MySQL CDC Source
Task B
PostgreSQL Source
Task A
PostgreSQL Source
Task B
K8s Deployment / Connect Cluster
20
KAFKA CONNECT SELF-MANAGED
…
Kubernetes (K8s); Dedicated Kafka Connect Nodepool
ELASTICSEARCH SINK
TASK C
S3 SOURCE
TASK A
Pod Pod
MySQL CDC Source
Task A
MySQL CDC Source
Task B
PostgreSQL Source
Task A
PostgreSQL Source
Task B
Connect Cluster Connect Cluster
21
KAFKA CONNECT SELF-MANAGED
…
Kubernetes (K8s); Dedicated Kafka Connect Nodepool
ElasticSearch Sink
Task C
S3 Source
Task A
Connect Cluster
Pod Pod Pod
MySQL CDC Source
Task A
MySQL CDC Source
Task B
PostgreSQL Source
Task A
PostgreSQL Source
Task B
Connect Cluster Connect Cluster
22
KAFKA CONNECT SELF-MANAGED
…
Kubernetes (K8s); Dedicated Kafka Connect Nodepool
ElasticSearch Sink
Task A
PostgreSQL Source
Task A
Connect Cluster
Pod Pod
S3 Source
Task A
Pod
MySQL CDC Source
Task A
Connect Cluster Connect Cluster
23
KAFKA CONNECT SELF-MANAGED
…
Kubernetes (K8s); Dedicated Kafka Connect Nodepool
S3 SOURCE
TASK A
ElasticSearch Sink
Task A
PostgreSQL Source
Task A
Connect Cluster
Pod Pod
MySQL CDC Source
Task A
Connect Cluster Connect Cluster
25
KAFKA CONNECT SELF-MANAGED
…
Kubernetes (K8s); Dedicated Kafka Connect Nodepool
ElasticSearch Sink
Task A
PostgreSQL Source
Task A
Connect Cluster
Pod Pod
S3 Source
Task A
Pod
MySQL CDC Source
Task A
▸ Utilise state-of-the-art orchestration tools.
▸ Running Kafka on Kubernetes does not bring automatic elasticity.
▸ Kafka Connect is not self-contained. This becomes a bigger headache the more connector tasks run in a given cluster.
▸ Think about startup time throughout your tech stack, from Kafka brokers through Connect tasks to streaming applications.
26
TAKE-AWAYS
Key Learnings
How to monitor and debug pipelines?
27
28
MONITORING STREAMING DATA PIPELINES
[Diagram: Kafka Connect Source Connector -> Kafka Topic -> Kafka Streams -> Kafka Topic -> Kafka Connect Sink Connector]
▸ External data sources or data sinks are unavailable (temporarily)
▸ Consumers (processors or sink connectors) are slower than producers
▸ Processing of events fails
29
POTENTIAL PRODUCTION ISSUES
Most common issues in streaming data pipelines
31
MONITORING CONNECTORS
[Diagram: Kafka Connect Source Connector -> Kafka Topic -> Kafka Streams -> Kafka Topic -> Kafka Connect Sink Connector]
Monitoring the health of connectors
‣Periodically call /connectors/:connector_name/status and investigate the response
32
MONITORING CONNECTORS
GET /connectors/hdfs-sink/status
{
  "name": "hdfs-sink",
  "connector": {
    "state": "RUNNING",
    "worker_id": "localhost:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "localhost:8083"
    }
  ]
}
Healthy
33
MONITORING CONNECTORS
GET /connectors/hdfs-sink/status
{
  "name": "hdfs-sink",
  "connector": {
    "state": "FAILED",
    "worker_id": "localhost:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "FAILED",
      "worker_id": "localhost:8083",
      "trace": "org.apache.kafka.common.errors.RecordTooLargeException\n"
    }
  ]
}
Unhealthy
34
MONITORING CONNECTORS
[Diagram: Kafka Connect Source Connector -> Kafka Topic -> Kafka Streams -> Kafka Topic -> Kafka Connect Sink Connector]
‣Periodically call /connectors/:connector_name/status and investigate the response
‣If a connector has failed, try restarting it (this handles, e.g., temporary API outages), and escalate or alert after X restarts
‣Sometimes, escalating directly might be reasonable
Monitoring the health of connectors
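The check-restart-escalate loop described above could be sketched as follows. This is a minimal sketch, not a production monitor: the status endpoint and payload shape are those shown on the slides, while the function names and the threshold of 3 restarts (standing in for "X") are assumptions made for illustration. A restart itself would be a POST to /connectors/:connector_name/restart or /connectors/:connector_name/tasks/:task_id/restart, which is omitted here.

```python
import json
import urllib.request


def fetch_status(base_url: str, connector: str) -> dict:
    """GET /connectors/<name>/status from the Kafka Connect REST API."""
    with urllib.request.urlopen(f"{base_url}/connectors/{connector}/status") as resp:
        return json.load(resp)


def failed_parts(status: dict) -> list[str]:
    """Return the failed parts of a connector: 'connector' and/or 'task-<id>'."""
    failed = []
    if status["connector"]["state"] == "FAILED":
        failed.append("connector")
    for task in status.get("tasks", []):
        if task["state"] == "FAILED":
            failed.append(f"task-{task['id']}")
    return failed


def decide_action(status: dict, restarts_so_far: int, max_restarts: int = 3) -> str:
    """Map a status payload to 'ok', 'restart', or 'alert'."""
    if not failed_parts(status):
        return "ok"
    return "restart" if restarts_so_far < max_restarts else "alert"


# The unhealthy payload from the slide, abbreviated.
unhealthy = {
    "name": "hdfs-sink",
    "connector": {"state": "FAILED", "worker_id": "localhost:8083"},
    "tasks": [{"id": 0, "state": "FAILED", "worker_id": "localhost:8083"}],
}

print(failed_parts(unhealthy))                      # ['connector', 'task-0']
print(decide_action(unhealthy, restarts_so_far=0))  # restart
print(decide_action(unhealthy, restarts_so_far=3))  # alert
```

Keeping the decision logic separate from the HTTP call makes it easy to test against recorded status payloads.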
36
MONITORING BACKPRESSURE
Consumer Lags
Kafka Topic Consumer
‣Consumer lag: the difference between the latest offset available in the Kafka topic (partition) and the latest offset processed by the consumer
‣Indicates how far consumers are behind producers, measured in number of records
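As a small worked example of this definition, lag can be computed per partition from two offset maps. The offsets below are made-up numbers, and treating a partition without a committed offset as offset 0 is a simplifying assumption of this sketch.

```python
def consumer_lag(log_end_offsets: dict[int, int], committed_offsets: dict[int, int]) -> dict[int, int]:
    """Per-partition lag: latest available offset minus latest processed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
log_end = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200, 2: 1010}

lag = consumer_lag(log_end, committed)
print(lag)                # {0: 0, 1: 280, 2: 500}
print(sum(lag.values()))  # 780
```

A lag that keeps growing over time, rather than its absolute value at one instant, is the signal that consumers cannot keep up with producers.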
37
MONITORING BACKPRESSURE
[Diagram: Kafka Connect Source Connector -> Kafka Topic -> Kafka Streams -> Kafka Topic -> Kafka Connect Sink Connector]
Consumer Lags in Streaming Data Pipelines
38
MONITORING BACKPRESSURE
[Diagram: Kafka Connect Source Connector -> Kafka Topic -> Kafka Streams -> Kafka Topic -> Kafka Connect Sink Connector]
Kafka Streams Consumer Lag
‣Number of records that have been extracted by the data source connector but have not
yet been processed by the Kafka Streams app
‣If data processing is slower than extraction, you might want to increase the degree of
parallelism of the Kafka Streams app
39
MONITORING BACKPRESSURE
[Diagram: Kafka Connect Source Connector -> Kafka Topic -> Kafka Streams -> Kafka Topic -> Kafka Connect Sink Connector]
Sink Connector Consumer Lag
‣Number of records that have been processed by the Kafka Streams app but have not yet
been published by the sink connector
‣If publishing data to the data sinks is slower than processing, you might want to increase
the number of tasks of the sink connector
41
DEAD-LETTER QUEUES
Keep track of errors in processing
‣By default, Kafka Connect connectors fail when they encounter errors in processing
‣We recommend configuring a dead-letter queue (topic) for storing records that could not be processed
‣Monitor the dead-letter queue topic and manually investigate failed records
errors.tolerance = all
errors.deadletterqueue.topic.name = topic-dlq
[Diagram: Kafka Connect Source Connector; successful processing -> Topic; failed processing -> Dead-letter queue topic]
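A connector configuration carrying the two settings above might be assembled as in the sketch below. Kafka Connect's dead-letter queue properties apply to sink connectors, so a sink is used here. The connector name, topic, and replication factor are hypothetical, and the extra `errors.deadletterqueue.*` and `errors.log.enable` properties are optional additions that are not on the slide.

```python
import json


def with_dead_letter_queue(config: dict[str, str], dlq_topic: str) -> dict[str, str]:
    """Return a copy of a sink connector config with error tolerance and a DLQ enabled."""
    updated = dict(config)
    updated.update({
        "errors.tolerance": "all",
        "errors.deadletterqueue.topic.name": dlq_topic,
        # Optional extras, not from the slide:
        "errors.deadletterqueue.context.headers.enable": "true",  # record failure context in headers
        "errors.deadletterqueue.topic.replication.factor": "3",   # assumed cluster size >= 3
        "errors.log.enable": "true",
    })
    return updated


# Hypothetical sink connector; only the properties relevant here are shown.
base_config = {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "clickstream",
}

payload = {
    "name": "clickstream-sink",
    "config": with_dead_letter_queue(base_config, "topic-dlq"),
}
print(json.dumps(payload, indent=2))
```

The resulting payload has the shape expected by a POST to the Connect REST API's /connectors endpoint; connection settings for the sink are left out.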
What are ways to deal with large data sources or slow data sinks?
42
43
DEALING WITH LARGE DATA SOURCES
‣Large data sources hurt most during initial snapshots, which can take hours
‣Use multiple connectors for the same database
and make use of table.include.list
‣Adjust the snapshot query and consider only a
subset of the data source
‣Mitigate pain with incremental snapshotting
‣Accelerate snapshotting with parallelisation
[Diagram: PostgreSQL (TBs of data) -> Debezium Source Connector]
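Splitting one database across several connectors with table.include.list could look like the sketch below. The table names and the "inventory" prefix are hypothetical, and only the properties relevant to the split are shown; connection settings and other required properties are omitted. Giving each connector its own slot.name reflects that each PostgreSQL connector needs its own replication slot.

```python
def split_tables(tables: list[str], n_connectors: int) -> list[list[str]]:
    """Distribute tables round-robin over at most n_connectors non-empty groups."""
    if n_connectors < 1:
        raise ValueError("n_connectors must be >= 1")
    groups: list[list[str]] = [[] for _ in range(n_connectors)]
    for i, table in enumerate(tables):
        groups[i % n_connectors].append(table)
    return [g for g in groups if g]


def connector_payloads(prefix: str, tables: list[str], n_connectors: int) -> list[dict]:
    """Build one Debezium PostgreSQL connector payload per table group."""
    payloads = []
    for i, group in enumerate(split_tables(tables, n_connectors)):
        payloads.append({
            "name": f"{prefix}-{i}",
            "config": {
                "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                "table.include.list": ",".join(group),
                "slot.name": f"{prefix}_{i}",  # one replication slot per connector
            },
        })
    return payloads


# Hypothetical tables in schema "public".
tables = ["public.orders", "public.customers", "public.events", "public.payments"]
for p in connector_payloads("inventory", tables, 2):
    print(p["name"], p["config"]["table.include.list"])
# inventory-0 public.orders,public.events
# inventory-1 public.customers,public.payments
```

Round-robin is the simplest split; in practice it may be worth grouping by table size so that the largest tables do not land on the same connector.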
44
DEALING WITH SLOW DATA SINKS
[Diagram: Kafka Connect Sink Connector -> Elasticsearch]
‣Detect slow data sinks by monitoring the sink
connector consumer lag
‣Parallelise sending records to the data sink by
increasing the number of connector tasks
‣If available, batch multiple records and send
them with one request to the data sink
‣Avoid duplicate data delivery by tuning max.poll.records or max.poll.interval.ms, so that a slow batch does not exceed the poll interval and get redelivered after a rebalance
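The relationship between the two settings can be sketched as a simple time budget: a polled batch must be fully delivered before max.poll.interval.ms expires, otherwise the consumer is removed from the group and the batch is delivered again. The 300,000 ms default for max.poll.interval.ms is Kafka's consumer default; the per-record latency and the 50% headroom factor are assumptions chosen for illustration.

```python
def max_safe_poll_records(
    per_record_ms: float,
    max_poll_interval_ms: int = 300_000,  # Kafka consumer default (5 minutes)
    headroom: float = 0.5,                # assumed safety margin: use half the interval
) -> int:
    """Largest batch size that should finish within the poll interval."""
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    budget_ms = max_poll_interval_ms * headroom
    return max(int(budget_ms // per_record_ms), 1)


# Hypothetical slow sink: about one second per record.
print(max_safe_poll_records(per_record_ms=1000))  # 150
```

With a sink this slow, Kafka's default max.poll.records of 500 would need roughly 500 seconds per batch, well beyond the 300-second interval, so either the batch size has to come down or the interval has to go up.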
What is missing in today’s ecosystem for streaming to become a commodity?
45
46
SERVERLESS TOPICS
‣Partitioned topics are the de facto standard for persisting events
‣# partitions = maximum degree of parallelism
‣Choosing the number of partitions remains a crucial question with significant impact on future cost and performance, and it needs to be answered at topic creation time (!)
‣The ability to choose the degree of parallelism dynamically would make it easier to cope with peak loads
"Horizontal Partition Autoscaler"
[Diagram: a topic scales up from 1 partition (Partition 0) to 3 partitions (Partition 0, Partition 1, Partition 2) and scales back down to 1 partition]
47
EASE OPERATIONS
More and better managed services
‣Operating streaming data pipelines boils down to running multiple distributed
systems and remains one of the big hurdles for its adoption
‣Managed services can reduce the operational pain
‣We witness the rise of cloud/SaaS offerings but believe there is still lots of room for
improvement
Summary
48
49
TAKE-AWAYS
Summary
‣Running Kafka and Kafka Connect on Kubernetes is beneficial but does not by itself provide a true cloud-native experience. It takes a few extra steps to, for instance, apply the self-containment principle to Kafka Connect.
‣If possible, handle errors of connectors or streaming applications in an automated manner, without bringing the pipeline down
‣Many issues occur when integrating external systems that you do not control, e.g., snapshotting a very large database table or sending events to slow APIs
Questions?
50
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...
ShapeBlue126 vistas
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue por ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue263 vistas
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... por ShapeBlue
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
ShapeBlue173 vistas
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... por ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue139 vistas
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... por ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue166 vistas
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And... por ShapeBlue
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
ShapeBlue106 vistas
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue por ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
ShapeBlue138 vistas
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... por Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker54 vistas
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue por ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue147 vistas
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... por ShapeBlue
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
ShapeBlue180 vistas

Learnings From Shipping 1000+ Streaming Data Pipelines To Production with Hakan Lofcali & Stefan Sprenger

  • 1. Learnings From Shipping 1000+ Streaming Data Pipelines To Production Hakan Lofcali, Stefan Sprenger {hakan,stefan}@datacater.io
  • 2. WHAT WE DO / WHO WE ARE ‣ We develop tools for developers working with streaming data ‣ With Kafka, Kubernetes, and fewer than 5 developers, we built a platform that helped teams deploy more than 1,000 streaming data pipelines to production ‣ Let’s take you on our journey: the tools we adopted, the hurdles we encountered, and the solutions we found ‣ Infra Space ‣ Customer Solutions Space
  • 3. STREAMING DATA PIPELINES: Continuous applications of data transformations [Diagram: Kafka Connect source connector → Kafka topic → Kafka Streams → Kafka topic → Kafka Connect sink connector]
  • 4. STREAMING PIPELINES IN THE WILD: Customer communications in real time [Diagram labels: Clickstream data, Raw data, Process clickstream events, Actionable data, Outbox service]
  • 5. GOALS FOR THIS TALK: Avoid common pitfalls in streaming ETL ‣ How to operate streaming data pipelines in an efficient and robust manner? ‣ How to deal with resource-leaking Kafka Connect connectors? ‣ How to monitor and debug running pipelines? ‣ What are ways to deal with large data sources or slow data sinks? ‣ What is missing in today’s ecosystem for streaming to become a commodity?
  • 6. How to operate streaming data pipelines in an efficient and robust manner?
  • 7. SEPARATE BY NODE POOL: We operate one K8s cluster with multiple node pools [Diagram labels: Kafka NodePool, Apache Kafka Broker*, Strimzi Kafka Operator, K8s StatefulSet, Control Plane Nodepool, DataCater Control Plane, K8s Deployment, Kafka Connect Nodepool, Quarkus Pipeline Pod, K8s Deployment(s)] ▸ Max 110 pods per node ▸ Max 5,000 nodes per cluster ▸ Max 150,000 pods in total ▸ *Separate Kafka and Kafka Connect clusters
  • 8. PROCESS ORCHESTRATION: State-of-the-art orchestration ▸ We started out on a single VM and moved to a distributed process orchestration tool ▸ Kafka’s ecosystem lags behind state-of-the-art process orchestrators such as Kubernetes and Nomad ▸ ksqlDB and Kafka Connect manage processes, but, as we will see, they lack fundamental patterns needed to operate them at scale [Diagram labels: Kafka Streams, Single VM, Docker, Java, Quarkus, Kubernetes]
  • 9. STARTUP TIME [Timeline figure comparing Quarkus SmallRye Reactive Messaging and Kafka Streams; labels: Scheduled, Liveness OK, First Event Processed; tick marks: 0s, 5s, 10s, 30s, 60s]
  • 10. WORKLOAD DENSITY [Diagram: Docker on a single VM running two Kafka Streams apps (RAM < 1.5GB each) versus Kubernetes running three Quarkus apps (RAM < 300MB each)]
  • 11. STRIMZI FOR KAFKA [Diagram: Kubernetes with a dedicated node pool for Kafka; Strimzi Kafka Operator and three Apache Kafka Broker pods in a Kubernetes StatefulSet]
  • 12. CONSUMER RE-BALANCING [Diagram: pods running summit consumers 0, 1, 2, … reading summit partitions 0, 1, 2, …]
  • 13. CONSUMER RE-BALANCING [Diagram: pods running summit consumers 0, 1, 2, … reading summit partitions 0, 1, 2, …] ▸ During a consumer re-balance, nothing is consumed until the group coordinator has completed the re-balance ▸ The number of consumers can change because of errors, disconnections, or scaling triggered by new load requirements
  • 15. UNEXPECTED SHUTDOWN [Chart labels: Startup Time, Partition Size, Log Size, Not Linear, Point of No Recovery]
  • 16. How to deal with resource-leaking Kafka Connect connectors?
  • 17. KAFKA CONNECT SELF-MANAGED [Diagram: Kubernetes (K8s), dedicated Kafka Connect node pool; one K8s Deployment / Connect cluster with three pods; tasks: ElasticSearch Sink Task C, S3 Source Task A, PostgreSQL Source Task A, PostgreSQL Source Task B, MySQL CDC Source Task A, MySQL CDC Source Task B]
  • 18. KAFKA CONNECT SELF-MANAGED [Diagram: Kubernetes (K8s), dedicated Kafka Connect node pool; one K8s Deployment / Connect cluster with two pods; tasks: ElasticSearch Sink Task C, MySQL CDC Source Task A, MySQL CDC Source Task B, S3 SOURCE TASK A, PostgreSQL Source Task A, PostgreSQL Source Task B]
  • 19. KAFKA CONNECT SELF-MANAGED [Diagram: Kubernetes (K8s), dedicated Kafka Connect node pool; one K8s Deployment / Connect cluster with two pods; tasks: ElasticSearch Sink Task C, S3 SOURCE TASK A, MySQL CDC Source Task A, MySQL CDC Source Task B, PostgreSQL Source Task A, PostgreSQL Source Task B]
  • 20. KAFKA CONNECT SELF-MANAGED [Diagram: Kubernetes (K8s), dedicated Kafka Connect node pool; one K8s Deployment / Connect cluster with two pods; tasks: ELASTICSEARCH SINK TASK C, S3 SOURCE TASK A, MySQL CDC Source Task A, MySQL CDC Source Task B, PostgreSQL Source Task A, PostgreSQL Source Task B]
  • 21. KAFKA CONNECT SELF-MANAGED [Diagram: Kubernetes (K8s), dedicated Kafka Connect node pool; three Connect clusters with three pods; tasks: ElasticSearch Sink Task C, S3 Source Task A, MySQL CDC Source Task A, MySQL CDC Source Task B, PostgreSQL Source Task A, PostgreSQL Source Task B]
  • 22. KAFKA CONNECT SELF-MANAGED [Diagram: Kubernetes (K8s), dedicated Kafka Connect node pool; three Connect clusters with three pods; tasks: ElasticSearch Sink Task A, PostgreSQL Source Task A, S3 Source Task A, MySQL CDC Source Task A]
  • 23. KAFKA CONNECT SELF-MANAGED [Diagram: Kubernetes (K8s), dedicated Kafka Connect node pool; three Connect clusters with two pods; tasks: S3 SOURCE TASK A, ElasticSearch Sink Task A, PostgreSQL Source Task A, MySQL CDC Source Task A]
  • 24. KAFKA CONNECT SELF-MANAGED [Diagram: Kubernetes (K8s), dedicated Kafka Connect node pool; three Connect clusters with two pods; tasks: S3 SOURCE TASK A, ElasticSearch Sink Task A, PostgreSQL Source Task A, MySQL CDC Source Task A]
  • 25. KAFKA CONNECT SELF-MANAGED [Diagram: Kubernetes (K8s), dedicated Kafka Connect node pool; three Connect clusters with three pods; tasks: ElasticSearch Sink Task A, PostgreSQL Source Task A, S3 Source Task A, MySQL CDC Source Task A]
  • 26. TAKE-AWAYS: Key Learnings ▸ Utilise state-of-the-art orchestration tools. ▸ Running Kafka on Kubernetes does not bring automatic elasticity. ▸ Kafka Connect is not self-contained. This becomes a bigger headache the more connector tasks run in a given cluster. ▸ Think about startup time throughout your tech stack, from Kafka brokers through Connect tasks to streaming applications.
  • 27. How to monitor and debug pipelines?
  • 28. MONITORING STREAMING DATA PIPELINES [Diagram: Kafka Connect source connector → Kafka topic → Kafka Streams → Kafka topic → Kafka Connect sink connector]
  • 29. POTENTIAL PRODUCTION ISSUES: Most common issues in streaming data pipelines ▸ External data sources or data sinks are (temporarily) unavailable ▸ Consumers (processors or sink connectors) are slower than producers ▸ Processing of events fails
  • 31. MONITORING CONNECTORS: Monitoring the health of connectors [Diagram: Kafka Connect source connector → Kafka topic → Kafka Streams → Kafka topic → Kafka Connect sink connector] ‣ Periodically call /connectors/:connector_name/status and inspect the response
  • 32. MONITORING CONNECTORS: Healthy. GET /connectors/hdfs-sink/status
      {
        "name": "hdfs-sink",
        "connector": { "state": "RUNNING", "worker_id": "localhost:8083" },
        "tasks": [
          { "id": 0, "state": "RUNNING", "worker_id": "localhost:8083" }
        ]
      }
  • 33. MONITORING CONNECTORS: Unhealthy. GET /connectors/hdfs-sink/status
      {
        "name": "hdfs-sink",
        "connector": { "state": "FAILED", "worker_id": "localhost:8083" },
        "tasks": [
          { "id": 0, "state": "FAILED", "worker_id": "localhost:8083", "trace": "org.apache.kafka.common.errors.RecordTooLargeException\n" }
        ]
      }
  • 34. MONITORING CONNECTORS: Monitoring the health of connectors [Diagram: Kafka Connect source connector → Kafka topic → Kafka Streams → Kafka topic → Kafka Connect sink connector] ‣ Periodically call /connectors/:connector_name/status and inspect the response ‣ If the connector has failed, try restarting it (this handles, e.g., temporary API outages), and escalate or alert after X restarts ‣ Sometimes escalating directly is the reasonable choice
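The poll, restart, and escalate loop from the monitoring slides could be sketched roughly as follows. This is a minimal illustration, not the speakers' implementation: `CONNECT_URL`, the function names, and the default of three restarts are assumptions. The endpoints used are the Kafka Connect REST API's status and restart routes.

```python
import json
import urllib.request

# Assumption: address of a Kafka Connect worker's REST API.
CONNECT_URL = "http://localhost:8083"


def failed_components(status):
    """Return ("connector", None) and ("task", id) entries whose state is FAILED."""
    failed = []
    if status.get("connector", {}).get("state") == "FAILED":
        failed.append(("connector", None))
    for task in status.get("tasks", []):
        if task.get("state") == "FAILED":
            failed.append(("task", task["id"]))
    return failed


def next_action(status, restarts_so_far, max_restarts=3):
    """Decide what to do after one status poll: "ok", "restart", or "escalate"."""
    if not failed_components(status):
        return "ok"
    return "restart" if restarts_so_far < max_restarts else "escalate"


def fetch_status(name):
    """GET /connectors/{name}/status and parse the JSON response."""
    with urllib.request.urlopen(f"{CONNECT_URL}/connectors/{name}/status") as resp:
        return json.load(resp)


def restart(name, component):
    """POST to the restart endpoint of the connector or of a single task."""
    kind, task_id = component
    if kind == "connector":
        path = f"/connectors/{name}/restart"
    else:
        path = f"/connectors/{name}/tasks/{task_id}/restart"
    request = urllib.request.Request(CONNECT_URL + path, data=b"", method="POST")
    urllib.request.urlopen(request).close()
```

A scheduler would call `fetch_status` periodically, feed the result to `next_action`, and keep a restart counter per connector. Keeping the decision logic free of HTTP calls makes it easy to test against recorded status responses.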
  • 35. POTENTIAL PRODUCTION ISSUES: Most common issues in streaming data pipelines ▸ External data sources or data sinks are (temporarily) unavailable ▸ Consumers (processors or sink connectors) are slower than producers ▸ Processing of events fails
  • 36. MONITORING BACKPRESSURE: Consumer lags [Diagram: Kafka topic → Consumer] ‣ Consumer lag is the difference between the latest offset available in the Kafka topic (partition) and the latest offset processed by the consumer ‣ It indicates how far consumers are behind producers, measured in number of records
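As a worked example of the lag definition above, the following sketch computes per-partition and total lag from two offset maps. The function names and the sample offsets are made up for illustration; in practice the offsets would come from the consumer group tooling or a metrics exporter.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: latest offset in the log minus the consumer's committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition)
        # Simplification: a partition without a committed offset is counted as fully unread.
        lag[partition] = end if committed is None else max(end - committed, 0)
    return lag


def total_lag(end_offsets, committed_offsets):
    """Sum of the per-partition lags, a single number that is convenient for alerting."""
    return sum(consumer_lag(end_offsets, committed_offsets).values())


# Hypothetical offsets for a three-partition topic.
end = {0: 120, 1: 80, 2: 50}
committed = {0: 100, 1: 80}
print(consumer_lag(end, committed))
print(total_lag(end, committed))
```

A lag that keeps growing over several polls is the signal to act on; a briefly non-zero lag is normal.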
  • 38. MONITORING BACKPRESSURE: Kafka Streams consumer lag [Diagram: Kafka Connect source connector → Kafka topic → Kafka Streams → Kafka topic → Kafka Connect sink connector] ‣ Number of records that have been extracted by the data source connector but have not yet been processed by the Kafka Streams app ‣ If data processing is slower than extraction, you might want to increase the degree of parallelism of the Kafka Streams app
  • 39. MONITORING BACKPRESSURE: Sink connector consumer lag [Diagram: Kafka Connect source connector → Kafka topic → Kafka Streams → Kafka topic → Kafka Connect sink connector] ‣ Number of records that have been processed by the Kafka Streams app but have not yet been published by the sink connector ‣ If publishing data to the data sinks is slower than processing, you might want to increase the number of tasks of the sink connector
  • 40. POTENTIAL PRODUCTION ISSUES: Most common issues in streaming data pipelines ▸ External data sources or data sinks are (temporarily) unavailable ▸ Consumers (processors or sink connectors) are slower than producers ▸ Processing of events fails
  • 41. DEAD-LETTER QUEUES: Keep track of errors in processing ‣ By default, Kafka Connect connectors fail when they encounter errors in processing ‣ We recommend configuring a dead-letter queue (topic) for storing records that could not be processed ‣ Monitor the dead-letter queue topic and manually investigate failed records ‣ Configuration: errors.tolerance = all, errors.deadletterqueue.topic.name = topic-dlq [Diagram: Kafka Connect Source Connector; successful processing → Topic; failed processing → Dead-letter queue topic]
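The two properties on the dead-letter queue slide could be merged into a connector configuration as sketched below. The helper name, the placeholder connector class, and the topic names are hypothetical. The `errors.*` keys are Kafka Connect's own error-handling properties; as far as I know, Connect applies the `errors.deadletterqueue.*` settings to sink connectors, so the example uses a sink configuration.

```python
def with_dead_letter_queue(connector_config, dlq_topic):
    """Return a copy of the connector config with dead-letter queue handling enabled."""
    config = dict(connector_config)
    config.update({
        # Keep the task running when a record fails instead of failing the task.
        "errors.tolerance": "all",
        # Topic that receives the records that could not be processed.
        "errors.deadletterqueue.topic.name": dlq_topic,
        # Attach failure context (e.g. the exception) as record headers.
        "errors.deadletterqueue.context.headers.enable": "true",
    })
    return config


# Hypothetical sink connector config; the class name is a placeholder.
base = {"connector.class": "com.example.HypotheticalSinkConnector", "topics": "orders"}
print(with_dead_letter_queue(base, "topic-dlq"))
```

The resulting dictionary is what would be sent as the JSON body when creating or updating the connector through the Connect REST API.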
  • 42. What are ways to deal with large data sources or slow data sinks?
  • 43. DEALING WITH LARGE DATA SOURCES [Diagram: PostgreSQL (TBs of data) → Debezium Source Connector] ‣ Large sources hurt a lot when performing initial snapshots, which can take hours ‣ Use multiple connectors for the same database and make use of table.include.list ‣ Adjust the snapshot query and consider only a subset of the data source ‣ Mitigate pain with incremental snapshotting ‣ Accelerate snapshotting with parallelisation
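The "multiple connectors for the same database" idea could look roughly like this: split the table list into groups and derive one connector configuration per group. The helper names, the round-robin split, and the naming scheme are assumptions for illustration. `table.include.list` and `slot.name` are Debezium PostgreSQL connector properties, and each connector needs its own replication slot; other per-connector settings may also have to differ and are not handled here.

```python
def split_tables(tables, connector_count):
    """Distribute tables round-robin into at most connector_count non-empty groups."""
    groups = [[] for _ in range(connector_count)]
    for index, table in enumerate(sorted(tables)):
        groups[index % connector_count].append(table)
    return [group for group in groups if group]


def connector_configs(base_config, tables, connector_count, name_prefix="pg-source"):
    """Build one config per table group, keyed by a generated connector name."""
    configs = {}
    for i, group in enumerate(split_tables(tables, connector_count)):
        config = dict(base_config)
        config["table.include.list"] = ",".join(group)
        # Each connector gets its own replication slot (underscores instead of dashes).
        config["slot.name"] = f"{name_prefix.replace('-', '_')}_{i}"
        configs[f"{name_prefix}-{i}"] = config
    return configs


# Hypothetical tables of one PostgreSQL database.
tables = ["public.orders", "public.customers", "public.events"]
for name, config in connector_configs({}, tables, 2).items():
    print(name, config)
```

A round-robin split ignores table size; grouping by estimated row count would balance snapshot time better when a few tables dominate.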
  • 44. DEALING WITH SLOW DATA SINKS [Diagram: Kafka Connect Sink Connector → Elasticsearch] ‣ Detect slow data sinks by monitoring the sink connector consumer lag ‣ Parallelise sending records to the data sink by increasing the number of connector tasks ‣ If available, batch multiple records and send them with one request to the data sink ‣ Avoid duplicated data delivery by adjusting max.poll.records or max.poll.interval.ms
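The last bullet can be made concrete with a little arithmetic. If writing one polled batch to the sink takes longer than max.poll.interval.ms, the consumer is treated as failed, the group re-balances, and the uncommitted records are delivered again. The sketch below checks whether a batch fits and suggests a batch size. The function names and the 50% safety factor are assumptions; the defaults used (500 records, 300,000 ms) are the Kafka consumer defaults.

```python
def batch_fits(max_poll_records, ms_per_record, max_poll_interval_ms=300_000):
    """True if a full batch can be written to the sink before the poll interval expires."""
    return max_poll_records * ms_per_record < max_poll_interval_ms


def max_safe_poll_records(ms_per_record, max_poll_interval_ms=300_000, safety_factor=0.5):
    """Largest batch size that uses at most safety_factor of the poll interval."""
    budget_ms = int(max_poll_interval_ms * safety_factor)
    return max(budget_ms // ms_per_record, 1)


# Hypothetical sink that needs one second per record.
print(batch_fits(500, 1000))
print(max_safe_poll_records(1000))
```

With one second per record, the default batch of 500 records needs 500 seconds, which exceeds the default interval of 300 seconds; lowering max.poll.records or raising max.poll.interval.ms restores headroom.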
  • 45. What is missing in today’s ecosystem for streaming to become a commodity?
  • 46. SERVERLESS TOPICS: "Horizontal Partition Autoscaler" ‣ Partitioned topics are the de-facto standard for persisting events ‣ # partitions = maximum degree of parallelism ‣ Choosing the number of partitions remains a crucial question with significant impact on future cost and performance, and it needs to be answered at topic creation time (!) ‣ The ability to choose the degree of parallelism dynamically would make it easier to cope with peak loads [Diagram: 1 partition → Scale Up → 3 partitions → Scale Down → 1 partition]
  • 47. EASE OPERATIONS: More and better managed services ‣ Operating streaming data pipelines boils down to running multiple distributed systems and remains one of the big hurdles for its adoption ‣ Managed services can reduce the operational pain ‣ We witness the rise of cloud/SaaS offerings but believe there is still lots of room for improvement
  • 49. TAKE-AWAYS: Summary ‣ Throwing Kafka and Kafka Connect at Kubernetes is beneficial but does not provide a true cloud-native experience. It takes a few extra steps to, for instance, apply the self-containment principle to Kafka Connect. ‣ If possible, handle errors of connectors or streaming applications in an automated manner without bringing the pipeline down ‣ A lot of issues occur when integrating external systems that you do not control, e.g., snapshotting a very large database table or sending events to slow APIs