KSQL Performance Tuning
For Fun and Profit
Nick Dearden
KSQL Performance Tuning Planning
For Fun and Profit
Nick Dearden
Anatomy of a KSQL Query
Tuning goals
Performance Factors
What to Monitor
Rules of Thumb

Apps, CPUs, Topologies and Threads, Oh My!
● Every KSQL continuous query results in a Kafka Streams Application
● An Application has a Topology…
● …which may have sub-topologies…
● …which are executed on StreamThreads

[Diagram: Read, Process, Write]

Topologies, Tasks, & Partitions
• Topologies are divided into sub-topologies at read-write boundaries
- Each sub-topology is a read-process-write loop
• Within a sub-topology, the number of tasks created equals the maximum input partition count
- Multiple input topics mean the topics are being co-processed, e.g. in joins
- Internal topics, such as the *-rekey ones, are counted too
• Each task is assigned to at most one StreamThread
- A StreamThread results in at least 3 JVM threads being created
- A StreamThread has its own Consumer and Producer instance
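
The task arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names, the topic names and partition counts, and the round-robin spreading are assumptions made for the example, and Kafka Streams uses its own task assignor.

```python
def task_count(input_partitions):
    """Tasks in a sub-topology = max partition count across its input topics."""
    return max(input_partitions.values())


def spread_tasks(num_tasks, num_threads):
    """Illustrative round-robin spread of task ids over StreamThreads.

    Each task lands on exactly one thread; a thread may hold several tasks.
    This is NOT the real Kafka Streams assignor, just a stand-in.
    """
    threads = [[] for _ in range(num_threads)]
    for task_id in range(num_tasks):
        threads[task_id % num_threads].append(task_id)
    return threads


# Hypothetical join sub-topology: a stream topic plus an internal *-rekey topic.
join_inputs = {"clickstream": 6, "users-rekey": 4}

tasks = task_count(join_inputs)            # the larger partition count wins
assignment = spread_tasks(tasks, num_threads=4)

print("tasks:", tasks)
print("tasks per thread:", assignment)
```

With more threads than tasks, the extra threads would sit idle, which is why the partition count is the upper bound on useful parallelism for a sub-topology.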

Topologies, Tasks, & Partitions
Divide a topology into read-process-write sub-topologies
Thanks to Andy Bryant for the diagram!

Can I just explain it?
● ksql> show queries;
● ksql> explain CSAS_R2_0;

ksql> create stream r2 as select stars, user_id, channel from ratings;
ksql> show queries;
Query ID | Kafka Topic | Query String
--------------------------------------------------------------------------------
CSAS_R2_0 | R2 | CREATE STREAM R2 WITH (KAFKA_TOPIC='R2', PARTITIONS=1, REPLICAS=1) AS SELECT
RATINGS.STARS "STARS",
RATINGS.USER_ID "USER_ID",
RATINGS.CHANNEL "CHANNEL"
FROM RATINGS;
--------------------------------------------------------------------------------
For detailed information on a Query run: EXPLAIN <Query ID>;

ksql> explain CSAS_R2_0;
Execution plan
--------------
> [ SINK ] | Schema: [ROWKEY STRING KEY, STARS INTEGER, USER_ID INTEGER, CHANNEL STRING]
> [ PROJECT ] | Schema: [ROWKEY STRING KEY, STARS INTEGER, USER_ID INTEGER, CHANNEL STRING]
> [ SOURCE ] | Schema: [RATINGS.ROWKEY STRING KEY, RATINGS.ROWTIME BIGINT, RATINGS.ROWKEY STRING, RATINGS.RATING_ID BIGINT, RATINGS.USER_ID INTEGER, RATINGS.STARS INTEGER, RATINGS.ROUTE_ID INTEGER, RATINGS.RATING_TIME BIGINT, RATINGS.CHANNEL STRING, RATINGS.MESSAGE STRING]

Multiple Queries -> Multiple Topologies

Performance Goals
● Latency?
● Throughput?
● Elasticity?

Breaking Rules?
● Consumer / producer configs
● Message format (Avro uses less CPU than JSON)
● Compression

Network
1 Gb/s ~= 100-110 MB/s
Message Size
100-byte messages: 1,000,000 / sec
1 KB messages: 100,000 / sec
10 KB messages: 10,000 / sec
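
The message-rate figures above follow from dividing the usable link bandwidth by the message size. A minimal sketch, assuming the lower end of the 100-110 MB/s range, decimal units (1 KB = 1,000 bytes), and no protocol overhead; the function name is made up for the example:

```python
def max_messages_per_sec(link_mb_per_sec, message_bytes):
    """Upper bound on messages/sec a link can carry, ignoring protocol overhead."""
    return int(link_mb_per_sec * 1_000_000 // message_bytes)


# Assumption: ~100 MB/s usable on a 1 Gb/s link (low end of the 100-110 MB/s range).
LINK_MB_PER_SEC = 100

for size in (100, 1_000, 10_000):
    rate = max_messages_per_sec(LINK_MB_PER_SEC, size)
    print(f"{size:>6} bytes: {rate:>9,} msgs/sec")
```

These are ceilings for traffic in one direction only; a query that both reads and writes over the same link shares that budget.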

Stream/Table Duality

State Stores (RocksDB)
Tables - consider key-space cardinality and message-size
Joins - join type, join windows
Aggregates - window sizes, group cardinality

Fault-Tolerance, powered by Kafka
A key challenge of distributed stream processing is fault-tolerant state.
State is automatically migrated in case of server failure.
[Diagram: KSQL / Kafka Streams App, with a Changelog Topic in Kafka]
Server A: “I do stateful stream processing, like tables, joins, aggregations.”
“Streaming backup” of A’s local state to the Changelog Topic
“Streaming restore” of A’s local state to B
Server B: “I restore the state and continue processing where server A stopped.”

Some Measurements
● KSQL Servers: i3.xlarge
○ 4 vCPUs
○ 30.5 GB memory
○ “up to 10Gbit network” (experimentally measured at ~1.2 Gb/s full-duplex baseline)
○ 200 GB EBS SSD
● JVM Settings
○ Heap size 16 GB (~50% of RAM, to leave space for state stores)

Test Highlights
• Simple project query (“speed-of-light”)
• CREATE STREAM foo AS SELECT * FROM bar;
# Queries | msg/s | MB/s | msg size | CPU % | MB Mem Max
2 | 193k | 59.14 | 320 | 99.19 | 18,949
10 | 189k | 57.67 | 320 | 99.74 | 20,101
20 | 175k | 53.43 | 320 | 99.68 | 23,377
50 | 168k | 51.37 | 320 | 96.61 | 28,291
• 4 cores can’t saturate a 1 Gb/s network link in this test (but larger messages get close)
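
To see why the link is not saturated, the measured MB/s can be compared with the ~100-110 MB/s that a 1 Gb/s link carries. A rough sketch, assuming the table's MB/s column describes traffic in one direction (the slides do not say) and using the lower bandwidth bound; the names are illustrative:

```python
# Assumption: ~100 MB/s usable on a 1 Gb/s link (low end of the 100-110 MB/s range).
LINK_MB_PER_SEC = 100


def link_utilization(measured_mb_per_sec, link_mb_per_sec=LINK_MB_PER_SEC):
    """Fraction of the link consumed by the measured throughput."""
    return measured_mb_per_sec / link_mb_per_sec


# MB/s column from the table above, keyed by number of concurrent queries.
measured = {2: 59.14, 10: 57.67, 20: 53.43, 50: 51.37}

utilization = {q: link_utilization(mb) for q, mb in measured.items()}

for queries, fraction in utilization.items():
    print(f"{queries:>2} queries: {fraction:.0%} of the link")
```

Even the best run stays below 60% of the link under these assumptions while CPU sits near 100%, which suggests CPU is the bottleneck in this test.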

Test Highlights
• Simple project query (“speed-of-light”)
• CREATE STREAM foo AS SELECT * FROM bar;
# Queries | # Servers | msg/s | msg/s/host | MB/s | CPU %
2 | 1 | 193k | 193k | 59 | 99
2 | 3 | 585k | 195k | 179 | 96
2 | 10 | 1,855k | 185k | 567 | 96
Message throughput scales with server count (same query, same data, msg-size=300 bytes)
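
One way to quantify "scales with server count" is to divide the cluster throughput by what perfect linear scaling from the single-server result would give. The sketch below uses the rounded figures from the table; the function names are chosen for illustration only.

```python
def per_host(total_msgs_per_sec, servers):
    """Average throughput contributed by each server."""
    return total_msgs_per_sec / servers


def scaling_efficiency(total_msgs_per_sec, servers, single_server_msgs_per_sec):
    """1.0 means perfectly linear scaling from the single-server baseline."""
    return total_msgs_per_sec / (servers * single_server_msgs_per_sec)


# (servers, total msg/s) taken from the table above, rounded as on the slide.
results = [(1, 193_000), (3, 585_000), (10, 1_855_000)]
baseline = results[0][1]

for servers, total in results:
    print(
        f"{servers:>2} servers: {per_host(total, servers):>9,.0f} msg/s/host, "
        f"efficiency {scaling_efficiency(total, servers, baseline):.1%}"
    )
```

Because the inputs are rounded to the nearest thousand, small deviations from 100% (including the slightly better-than-linear 3-server result) should not be over-interpreted.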
CREATE STREAM vip_actions AS
SELECT userid, page, action, zipcode
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';

Test Highlights
• Stream-Table join
• Stream-table join runs at ~50% throughput of project query
# Queries | msg/s | MB/s | msg size | CPU % | MB Mem Max
2 | 88k | 26 | 314 | 99.8 | 18,022
10 | 80k | 24 | 314 | 99.8 | 19,931
Further Results
• A non-windowed aggregate on the same data ran at ~47k msgs/sec
• A windowed aggregate ran at ~24k msgs/sec (varies with window params)
• Re-partitioning can cut these results further
Miscellaneous Factors
• UDFs / UDAFs
• Scaling horizontally vs vertically
• Planning for elasticity

Take-Aways (1)
• Establish c, the “speed-of-light” throughput of a simple project (select) query
• Project and filter queries are cheap and fast
• Joins are slower, aggregates more so
• If select throughput (c) is 100%, then
- Joins run at about 50% of c
- Aggregates run at about 25% of c
- Windowed aggregates run at ~10-15% of c
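
These rules of thumb can be turned into a quick planning estimate. The sketch below applies the percentages to a measured baseline c; the dictionary, the function name, and the choice of 193k msg/s (the single-server project-query result from the measurements) are illustrative assumptions, not a sizing tool.

```python
# Fractions of baseline throughput c, as (low, high) ranges from the take-aways.
RULES_OF_THUMB = {
    "project/filter": (1.00, 1.00),
    "join": (0.50, 0.50),
    "aggregate": (0.25, 0.25),
    "windowed aggregate": (0.10, 0.15),
}


def estimate(c_msgs_per_sec):
    """Estimated (low, high) msgs/sec per query type for a measured baseline c."""
    return {
        kind: (round(c_msgs_per_sec * low), round(c_msgs_per_sec * high))
        for kind, (low, high) in RULES_OF_THUMB.items()
    }


# Assumption: c = 193k msg/s, the single-server "speed-of-light" measurement.
estimates = estimate(193_000)

for kind, (low, high) in estimates.items():
    span = f"{low:,}" if low == high else f"{low:,} to {high:,}"
    print(f"{kind:<20} ~{span} msgs/sec")
```

For comparison, the measured figures reported in the talk were ~88k msg/s for the stream-table join, ~47k for the non-windowed aggregate and ~24k for the windowed aggregate, so the estimates land in the right neighbourhood but are no substitute for measuring your own workload.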

Take-Aways (2)
• (De)serialization is the most expensive part of any query
• Use the Avro message format
• Start with 4 CPU cores for “serious” message volumes
• Use SSDs for any state stores (speed > size)
KSQL Performance Tuning for Fun and Profit (Nick Dearden, Confluent), Kafka Summit SF 2019
