SlideShare una empresa de Scribd logo
1 de 31
Ricardo Paiva
How Criteo is managing one
of the largest Kafka
Infrastructure in Europe
DevLead @ Criteo
2 •
SRE Data Rivers
Hervé
Rivière
Martin
Richard
Axel
Rovillé
Ricardo
Paiva
Qinghui
Xu
Volume, Time and Location
4 •
Criteo is a high performance online
advertising platform.
Our goal is to show the right campaign
to the right user at the right moment.
5 •
How does it work?
$1.12 $0.83 $0.74 $1.03
www
1 2
AD Server
3
Criteo
RTB
567
< 100 ms
6 •
All messages through the same pipeline
HDFS
Streaming
jobs
Paris
Show me your numbers
8 •
 Up to 7 millions msgs/s
 (400 billions msgs/day)
 180 TB / day (compressed)
 Around 200 brokers
 Kafka in production for 4 years
Some Figures
9 •
The Infrastructure
2
3
2
Multiple Datacenters
 Bare-metal servers
 Servers managed by Chef
 User applications running on
Mesos or YARN
 13 Kafka clusters
10 •
32 brokers
2 x 12 cores / 1.8 GHz
< 5% cpu
256GB RAM
< 20GB Used memory (230GB cache)
12 disks of 8TB
RAID 10
48 TB total
~ 40% used
10GB NIC
Kafka setup per Datacenter
2.5 million msg/sec
1.1Gbps
11 •
Partition size
• How we define the number of partitions?
• 1GB/partition/hour
• 72h retention
• We try to keep 72GB per partition
• No key, no problem when increasing partitions
• We have topics with more than 1.300 partitions
Run, Forest, Run!
15
13 •
• First C# implementation
• Open Source (https://github.com/criteo/kafka-sharp/)
• Built for high-troughput
• Ability to blacklist partitions
• It discards messages if needed
• Our use:
• No Key Partitioning
• No order per partition guarantee
In-house C# Kafka Client
14 •
In-house C# Kafka Client
Broker 1
Broker 2
Broker 3
X
Slow
Application
15 •
Cons
• Costly to mantain
• Difficult to keep up to date
Pros
• Highly customizable
• Optimized for our use case
• Full control during the migration
Trade-off
Tic tac, tic tac
20
17 •
• Lag is our main metric for the clients
• We should be able to measure the lag in all conditions:
• No messages sent
• Blacklisted partitions
• New partitions added
• Offline partitions
SLA based on lag
18 •
Watermarks
46
6
6
6
6
5
5
partition 1
partition 2
partition 4
• Special messages sent to each partition.
• They contain a monotonic timestamp.
• They provide a clock for the stream of messages.
• If a message has a timestamp lower than the previous timestamp, it’s late.
2
3
3
2
2
3
2
2
2
1
1
1
3
3
44
4
4
44
5 4
new old
19 •
{
"partition": 36,
"partition_count": 582,
"kafka_topic": "bid_request",
"timestamp": 1547394552,
"region": "par",
"cluster": "local",
"environment": "prod"
}
Watermark message
20 •
Consensus
2
6
4
4
5
4
1
4
2
2
3
1
partition 1
partition 2
partition 4
3
8
5
partition 312345
• Watermarks are not aligned across
partitions.
• (At Criteo) We can have unordered
messages per partition.
• Empty partitions contain only
watermarks.
• partition_time = max(watermark)
• Global time: min(partitions_time)
21 •
Watermark Injector
Consumer
Watermark Tracker
Map<partition, timestamp>
Online
Service
Watermark
Injector
Chronometer
Move!
30
23 •
• Producers/Consumers can overload the cluster.
• Overloaded brokers may lead to losing data.
• Pipeline Cluster
• All data, that goes to HDFS
• General Purpose clusters
• Streaming
Local and Stream cluster
Pipeline General purpose
Online
Service
HDFS
24 •
Mesos
• Kafka Connect application running on Mesos.
• Custom connector.
• Writes offsets on the destination.
• Replication inter or cross datacenters.
Replication
25 •
Replication
Replicator
worker 1
Replicator
worker 2
26 •
Watermark problem
P=1
P=2
P=3
P=4
Replicator
worker 1
Replicator
worker 2
P=2
P=4P=3
P=1
27 •
Watermark reinjection
Watermark
Reinjection
New consensus
Current challenges
40
29 •
We migrated to Kafka 2
But still using old message format :(
We are upgrading our C# Kafka client
We look forward for new features:
• Idempotent producers
• Transaction
• Headers
Kafka new features
30 •
• Challenges:
• More and more streaming use cases.
• Multiple frameworks: Flink, Kafka Connect, Kafka Stream, Plain
Consumers.
• Clients running on Mesos, Yarn or bare-metal.
• We are working on a framework to help:
• Release
• Schedule
• Scale
• Monitor
• Maintain
Streaming
Thank you!
criteo.com/careers

Más contenido relacionado

La actualidad más candente

A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache KafkaPaul Brebner
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache KafkaAmir Sedighi
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practicesconfluent
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, DigitalisCapacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, DigitalisHostedbyConfluent
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in SparkDatabricks
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersPrateek Maheshwari
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafkaemreakis
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceSijie Guo
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaSpark Summit
 
Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to KafkaAkash Vacher
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsKetan Gote
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 

La actualidad más candente (20)

A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache Kafka
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practices
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, DigitalisCapacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in Spark
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clusters
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage Service
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
kafka
kafkakafka
kafka
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
 
Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to Kafka
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 

Similar a How Criteo is managing one of the largest Kafka Infrastructure in Europe

Oleksandr Nitavskyi "Kafka deployment at Scale"
Oleksandr Nitavskyi "Kafka deployment at Scale"Oleksandr Nitavskyi "Kafka deployment at Scale"
Oleksandr Nitavskyi "Kafka deployment at Scale"Fwdays
 
Highly available, scalable and secure data with Cassandra and DataStax Enterp...
Highly available, scalable and secure data with Cassandra and DataStax Enterp...Highly available, scalable and secure data with Cassandra and DataStax Enterp...
Highly available, scalable and secure data with Cassandra and DataStax Enterp...Johnny Miller
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent
 
Alexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraAlexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraPyData
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Resolving Firebird performance problems
Resolving Firebird performance problemsResolving Firebird performance problems
Resolving Firebird performance problemsAlexey Kovyazin
 
OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...
OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...
OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...OVHcloud
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaGuozhang Wang
 
ECS19 - Ingo Gegenwarth - Running Exchange in large environment
ECS19 - Ingo Gegenwarth -  Running Exchangein large environmentECS19 - Ingo Gegenwarth -  Running Exchangein large environment
ECS19 - Ingo Gegenwarth - Running Exchange in large environmentEuropean Collaboration Summit
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st...
Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st...Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st...
Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st...Flink Forward
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingApache Apex
 
High-performance 32G Fibre Channel Module on MDS 9700 Directors:
High-performance 32G Fibre Channel Module on MDS 9700 Directors:High-performance 32G Fibre Channel Module on MDS 9700 Directors:
High-performance 32G Fibre Channel Module on MDS 9700 Directors:Tony Antony
 
Matrix, The Year To Date, Ben Parsons, TADSummit 2018
Matrix, The Year To Date, Ben Parsons, TADSummit 2018Matrix, The Year To Date, Ben Parsons, TADSummit 2018
Matrix, The Year To Date, Ben Parsons, TADSummit 2018Alan Quayle
 
The End of IPv4: What It Means for Incident Responders
The End of IPv4: What It Means for Incident RespondersThe End of IPv4: What It Means for Incident Responders
The End of IPv4: What It Means for Incident RespondersCarlos Martinez Cagnazzo
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchHakka Labs
 

Similar a How Criteo is managing one of the largest Kafka Infrastructure in Europe (20)

Oleksandr Nitavskyi "Kafka deployment at Scale"
Oleksandr Nitavskyi "Kafka deployment at Scale"Oleksandr Nitavskyi "Kafka deployment at Scale"
Oleksandr Nitavskyi "Kafka deployment at Scale"
 
Highly available, scalable and secure data with Cassandra and DataStax Enterp...
Highly available, scalable and secure data with Cassandra and DataStax Enterp...Highly available, scalable and secure data with Cassandra and DataStax Enterp...
Highly available, scalable and secure data with Cassandra and DataStax Enterp...
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
Alexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraAlexander Sibiryakov- Frontera
Alexander Sibiryakov- Frontera
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Resolving Firebird performance problems
Resolving Firebird performance problemsResolving Firebird performance problems
Resolving Firebird performance problems
 
OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...
OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...
OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
 
ECS19 - Ingo Gegenwarth - Running Exchange in large environment
ECS19 - Ingo Gegenwarth -  Running Exchangein large environmentECS19 - Ingo Gegenwarth -  Running Exchangein large environment
ECS19 - Ingo Gegenwarth - Running Exchange in large environment
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st...
Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st...Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st...
Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st...
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 
Corralling Big Data at TACC
Corralling Big Data at TACCCorralling Big Data at TACC
Corralling Big Data at TACC
 
High-performance 32G Fibre Channel Module on MDS 9700 Directors:
High-performance 32G Fibre Channel Module on MDS 9700 Directors:High-performance 32G Fibre Channel Module on MDS 9700 Directors:
High-performance 32G Fibre Channel Module on MDS 9700 Directors:
 
Matrix, The Year To Date, Ben Parsons, TADSummit 2018
Matrix, The Year To Date, Ben Parsons, TADSummit 2018Matrix, The Year To Date, Ben Parsons, TADSummit 2018
Matrix, The Year To Date, Ben Parsons, TADSummit 2018
 
The End of IPv4: What It Means for Incident Responders
The End of IPv4: What It Means for Incident RespondersThe End of IPv4: What It Means for Incident Responders
The End of IPv4: What It Means for Incident Responders
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series search
 

Último

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Último (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

How Criteo is managing one of the largest Kafka Infrastructure in Europe

  • 1. Ricardo Paiva How Criteo is managing one of the largest Kafka Infrastructure in Europe DevLead @ Criteo
  • 2. 2 • SRE Data Rivers Hervé Rivière Martin Richard Axel Rovillé Ricardo Paiva Qinghui Xu
  • 3. Volume, Time and Location
  • 4. 4 • Criteo is a high performance online advertising platform. Our goal is to show the right campaign to the right user at the right moment.
  • 5. 5 • How does it work? $1.12 $0.83 $0.74 $1.03 www 1 2 AD Server 3 Criteo RTB 567 < 100 ms
  • 6. 6 • All messages through the same pipeline HDFS Streaming jobs Paris
  • 7. Show me your numbers
  • 8. 8 •  Up to 7 millions msgs/s  (400 billions msgs/day)  180 TB / day (compressed)  Around 200 brokers  Kafka in production for 4 years Some Figures
  • 9. 9 • The Infrastructure 2 3 2 Multiple Datacenters  Bare-metal servers  Servers managed by Chef  User applications running on Mesos or YARN  13 Kafka clusters
  • 10. 10 • 32 brokers 2 x 12 cores / 1.8 GHz < 5% cpu 256GB RAM < 20GB Used memory (230GB cache) 12 disks of 8TB RAID 10 48 TB total ~ 40% used 10GB NIC Kafka setup per Datacenter 2.5 million msg/sec 1.1Gbps
  • 11. 11 • Partition size • How we define the number of partitions? • 1GB/partition/hour • 72h retention • We try to keep 72GB per partition • No key, no problem when increasing partitions • We have topics with more than 1.300 partitions
  • 13. 13 • • First C# implementation • Open Source (https://github.com/criteo/kafka-sharp/) • Built for high-troughput • Ability to blacklist partitions • It discards messages if needed • Our use: • No Key Partitioning • No order per partition guarantee In-house C# Kafka Client
  • 14. 14 • In-house C# Kafka Client Broker 1 Broker 2 Broker 3 X Slow Application
  • 15. 15 • Cons • Costly to mantain • Difficult to keep up to date Pros • Highly customizable • Optimized for our use case • Full control during the migration Trade-off
  • 16. Tic tac, tic tac 20
  • 17. 17 • • Lag is our main metric for the clients • We should be able to measure the lag in all conditions: • No messages sent • Blacklisted partitions • New partitions added • Offline partitions SLA based on lag
  • 18. 18 • Watermarks 46 6 6 6 6 5 5 partition 1 partition 2 partition 4 • Special messages sent to each partition. • They contain a monotonic timestamp. • They provide a clock for the stream of messages. • If a message has a timestamp lower than the previous timestamp, it’s late. 2 3 3 2 2 3 2 2 2 1 1 1 3 3 44 4 4 44 5 4 new old
  • 19. 19 • { "partition": 36, "partition_count": 582, "kafka_topic": "bid_request", "timestamp": 1547394552, "region": "par", "cluster": "local", "environment": "prod" } Watermark message
  • 20. 20 • Consensus 2 6 4 4 5 4 1 4 2 2 3 1 partition 1 partition 2 partition 4 3 8 5 partition 312345 • Watermarks are not aligned across partitions. • (At Criteo) We can have unordered messages per partition. • Empty partitions contain only watermarks. • partition_time = max(watermark) • Global time: min(partitions_time)
  • 21. 21 • Watermark Injector Consumer Watermark Tracker Map<partition, timestamp> Online Service Watermark Injector Chronometer
  • 23. 23 • • Producers/Consumers can overload the cluster. • Overloaded brokers may lead to losing data. • Pipeline Cluster • All data, that goes to HDFS • General Purpose clusters • Streaming Local and Stream cluster Pipeline General purpose Online Service HDFS
  • 24. 24 • Mesos • Kafka Connect application running on Mesos. • Custom connector. • Writes offsets on the destination. • Replication inter or cross datacenters. Replication
  • 26. 26 • Watermark problem P=1 P=2 P=3 P=4 Replicator worker 1 Replicator worker 2 P=2 P=4P=3 P=1
  • 29. 29 • We migrated to Kafka 2 But still using old message format :( We are upgrading our C# Kafka client We look forward for new features: • Idempotent producers • Transaction • Headers Kafka new features
  • 30. 30 • • Challenges: • More and more streaming use cases. • Multiple frameworks: Flink, Kafka Connect, Kafka Stream, Plain Consumers. • Clients running on Mesos, Yarn or bare-metal. • We are working on a framework to help: • Release • Schedule • Scale • Monitor • Maintain Streaming

Notas del editor

  1. Special Slide Instructions: Customize this slide with your name and title, add an account name or logo to the slide if presenting to a client Speaker Notes: Good morning/afternoon I’m excited to be with you today to share how Criteo is evolving. We have a new vision and are making some exciting changes as a result of the return of our founder JB Rudelle to lead Criteo…