Evening Out the Uneven:
Dealing with Skews in Flink
Jun Qin, Head of Solutions Architecture
Karl Friedrich, Architect
Contents
01 Skew & Its Impact
02 Data Skew
03 Key Skew
04 State Skew
05 Scheduling Skew
06 Event Time/Watermark Skew
07 Key Takeaways
Skew
● Workload imbalance among
subtasks/TaskManagers
○ in data to be processed
○ in state size
○ in event time
○ in resource usage
■ CPU/Memory/Disk
● Poor resource utilization
● Back pressure
● Low throughput and/or high latency
● Potential high memory usage
○ JVM Garbage Collection
○ TM heartbeats timeout
○ Task failure
○ Job restart
Impact of Skew
Data Skew
Examples
● File Source: some files are much larger than the others
● Kafka Source: some Kafka partitions hold much more data than the others
[Diagram: one file/partition is ~10x larger than the others]
Data Skew (Cont’d)
Symptoms
One task/TM is much busier than the others:
Bytes/Records Received is skewed:
Data Skew (Cont’d)
Solutions
● Option 1: (basic) do filter() in your pipeline as early as possible to reduce the data volume
● Option 2: re-partition data by calling
○ rebalance()
○ shuffle()
○ partitionCustom()
○ keyBy()
⇨ pay attention to the shuffle cost!
sourceStream
    .assignTimestampsAndWatermarks()
    .rebalance()
    .map()
    .otherTransformations()
    .addSink()/sinkTo()
● Option 3: implement a custom operator to do a Hadoop-style map-side aggregation (a.k.a. combiner); a sketch follows below
○ See the class MapBundleOperator in flink-table-runtime
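The following is a minimal sketch of the combiner idea using a plain RichFlatMapFunction rather than Flink's MapBundleOperator: it pre-aggregates (key, count) pairs locally and only forwards the partial sums downstream. The record type, the bundle size, and the fact that the in-memory buffer is not checkpointed are simplifying assumptions.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical map-side pre-aggregation: forwards partial sums instead of raw records.
public class PreAggregatingFlatMap
        extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private static final int BUNDLE_SIZE = 1_000; // flush after this many input records
    private transient Map<String, Long> bundle;
    private transient int recordsInBundle;

    @Override
    public void open(Configuration parameters) {
        bundle = new HashMap<>();
        recordsInBundle = 0;
    }

    @Override
    public void flatMap(Tuple2<String, Long> value, Collector<Tuple2<String, Long>> out) {
        // accumulate locally instead of emitting every raw record
        bundle.merge(value.f0, value.f1, Long::sum);
        if (++recordsInBundle >= BUNDLE_SIZE) {
            for (Map.Entry<String, Long> e : bundle.entrySet()) {
                out.collect(Tuple2.of(e.getKey(), e.getValue()));
            }
            bundle.clear();
            recordsInBundle = 0;
        }
    }
}

Note that records still buffered at the time of a failure would be lost with this simplification; a production version would also flush the bundle before checkpoints, as MapBundleOperator does.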
Data Skew (Cont’d)
Throughput improved by using rebalance()
Rebalance then consume
Consume directly
➤ If the cost of the network shuffle is significant compared to the actual data processing, you may get worse results!
Key Skew
Record → Key → keyGroup → taskSlot
● record → key: determined by the KeySelector (the KeySelector determines whether there is a hot key)
● key → keyGroup: MathUtils.murmurHash(key.hashCode()) % maxParallelism
  (maxParallelism determines the total number of keyGroups)
● keyGroup → taskSlot: keyGroup * parallelism / maxParallelism
  (parallelism determines how the keyGroups are grouped together and assigned to a taskSlot)
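To see where a particular key lands, the two formulas above can be applied directly, as in the sketch below; the key value and settings are only examples, and Flink's own implementation lives in org.apache.flink.runtime.state.KeyGroupRangeAssignment.

import org.apache.flink.util.MathUtils;

// Sketch: compute the keyGroup and the target subtask (taskSlot) for a given key.
public class KeyPlacement {

    static int keyGroupFor(Object key, int maxParallelism) {
        return MathUtils.murmurHash(key.hashCode()) % maxParallelism;
    }

    static int subtaskFor(Object key, int maxParallelism, int parallelism) {
        // key groups are split into contiguous ranges, one range per subtask
        return keyGroupFor(key, maxParallelism) * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        // e.g., check where a suspected hot key lands with the current settings
        System.out.println(subtaskFor("A", 128, 3));
    }
}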
Key Skew (Cont’d)
Hot keys
[Diagram: records → keys A-D → keyGroups 0-11 → taskSlots, with maxParallelism=12 and parallelism=4]
taskSlot 1: 1 record
taskSlot 2: 8 records ⇦ hot key!
taskSlot 3: 2 records
taskSlot 4: 3 records
Key Skew (Cont’d)
Solutions for hot keys
● Option 1: (basic) use a different key that has less or no skew.
● Option 2: (general aggregation) local-global aggregation, similar to a Hadoop combiner
➤ Flink SQL does this automatically if table.optimizer.agg-phase-strategy = TWO_PHASE (see the sketch below)
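A minimal sketch of enabling this in a Table API/SQL job is shown below; the mini-batch values are illustrative assumptions (in streaming mode the local aggregation builds on mini-batching, so it is usually enabled together with TWO_PHASE).

TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
Configuration conf = tEnv.getConfig().getConfiguration();
conf.setString("table.exec.mini-batch.enabled", "true");        // assumption: mini-batching on
conf.setString("table.exec.mini-batch.allow-latency", "1 s");
conf.setString("table.exec.mini-batch.size", "5000");
conf.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");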
Key Skew (Cont’d)
Solutions for hot keys
● Option 3: two-stage keyBy (a sketch follows below). Instead of
stream
    .keyBy(key)
    .process(...)
use
stream
    .keyBy(randomizedKey)
    .process(...)
    .keyBy(key)
    .process(...)
➤ randomizedKey must be deterministic, e.g., key + hashCode(anotherField)
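A sketch of how the two stages might look with the DataStream API; the record types, field accessors, process functions, and the bucket count 16 are illustrative assumptions.

// Stage 1: key by a deterministic "randomized" key (original key + a bucket derived
// from another field), and pre-aggregate to shrink the hot key's data volume.
DataStream<Partial> partials = input
    .keyBy(r -> r.getKey() + "#" + (Math.abs(r.getOtherField().hashCode()) % 16))
    .process(new PreAggregateFunction());      // emits partial aggregates

// Stage 2: key by the original key to combine the (much smaller) partial aggregates.
DataStream<Result> results = partials
    .keyBy(p -> p.getKey())
    .process(new CombinePartialsFunction());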
● Option 4: if hot keys are known in advance,
split the input stream in your pipeline into
hot-key streams and non-hot-key streams (a sketch follows below):
splitStream = sourceStream
    .process(...)

// stream of records with hot keys
splitStream
    .getSideOutput(...)
    .keyBy(anotherField)
    .process(...)
    .keyBy(key)
    .process(...)

// stream of records with non-hot keys
splitStream
    .keyBy(key)
    .process(...)
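Below is a sketch of the split using Flink's side outputs; HOT_KEYS (e.g., a Set of key values), the record type, and the downstream process functions are illustrative assumptions.

final OutputTag<MyRecord> hotTag = new OutputTag<MyRecord>("hot-keys") {};

SingleOutputStreamOperator<MyRecord> split =
    source.process(new ProcessFunction<MyRecord, MyRecord>() {
        @Override
        public void processElement(MyRecord value, Context ctx, Collector<MyRecord> out) {
            if (HOT_KEYS.contains(value.getKey())) {
                ctx.output(hotTag, value);   // hot keys go to the side output
            } else {
                out.collect(value);          // non-hot keys stay on the main output
            }
        }
    });

// non-hot keys: keyBy the original key as usual
split.keyBy(r -> r.getKey()).process(new NormalAggregation());

// hot keys: parallelize on another field first, then gather by the original key
split.getSideOutput(hotTag)
    .keyBy(r -> r.getOtherField())
    .process(new PartialAggregation())
    .keyBy(p -> p.getKey())
    .process(new FinalAggregation());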
Key Skew (Cont’d)
Hot keys: experimental results
[Charts: CPU usage per TaskManager for one keyBy (skewed) vs. two-stage keyBy vs. stream split]
Key Skew (Cont’d)
Hot keys
[Charts: job throughput for one keyBy vs. two-stage keyBy vs. stream split]
Key Skew (Cont’d)
Multiple keys are mapped to the same keyGroup
[Diagram: keys A-D mapped to keyGroups 0-11 → taskSlots, with maxParallelism=12 and parallelism=4; two keys land in the same keyGroup]
taskSlot 1: 3 records
taskSlot 2: 8 records ⇦ hot keyGroup!
taskSlot 3: 3 records
taskSlot 4: 0 records
Key Skew (Cont’d)
Solution: adjust maxParallelism
maxParallelism=128 (default), parallelism=3:
“A” → 104 → 2
“B” → 17 → 0 ⇦ hot keyGroup!
“C” → 17 → 0 ⇦ hot keyGroup!
⇨ increase maxParallelism
maxParallelism=256⇧, parallelism=3:
“A” → 232 → 2
“B” → 17 → 0
“C” → 145 → 1
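maxParallelism can be raised job-wide or per operator, as in the sketch below (the operator and variable names are illustrative). Note that changing maxParallelism changes the key-group assignment, so existing keyed state can only be migrated with the State Processor API.

env.setMaxParallelism(256);                      // job-wide default

DataStream<Result> out = input
    .keyBy(r -> r.getKey())
    .process(new MyKeyedProcessFunction())
    .setMaxParallelism(256);                     // or per keyed operator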
Key Skew (Cont’d)
Multiple keys are mapped to the same taskSlot
[Diagram: keys A-D mapped to keyGroups 0-11 → taskSlots, with maxParallelism=12 and parallelism=4; two keys land in the same taskSlot]
taskSlot 1: 3 records
taskSlot 2: 8 records ⇦ hot taskSlot!
taskSlot 3: 3 records
taskSlot 4: 0 records
Key Skew (Cont’d)
Solution: adjust parallelism or maxParallelism
[Diagrams: how keyGroups 0-11 are grouped into taskSlots for maxParallelism=12 with parallelism=4, 6, 3, and 5]
Key Skew (Cont’d)
Solution: adjust parallelism or maxParallelism
maxParallelism=128 (default), parallelism=4:
“1” → 54 → 1
“2” → 27 → 0
“3” → 33 → 1
“4” → 4 → 0
⇨ reduce maxParallelism
maxParallelism=64⇩, parallelism=4:
“1” → 54 → 3
“2” → 27 → 1
“3” → 33 → 2
“4” → 4 → 0
Key Skew (Cont’d)
Adjust the ratio between maxParallelism and parallelism
➤ Best practice: maxParallelism = ~5-10 x parallelism
● maxParallelism = 128, parallelism = 127
○ 126 taskSlots each gets 1 keyGroup,
○ 1 taskSlot gets 2 keyGroups
⇨ one taskSlot processes 100% more records than each of the others
● maxParallelism = 1280, parallelism = 127
○ 117 taskSlots each gets 10 keyGroups,
○ 10 taskSlots each gets 11 keyGroups
⇨ a fluctuation of 10% among all taskSlots
State Skew
● Some subtasks have much bigger state than others
● Often caused by data skew or key skew
Recap
Data skew, key skew, state skew ⇨ addressed by distributing records among task slots evenly
Is distributing records among task slots evenly sufficient? Does it guarantee even resource utilization ❓
Scheduling Skew
[Diagram: records → keyGroups → taskSlots spread over TaskManager 1 and TaskManager 2; maxParallelism=12, parallelism=4, number of keyGroups=12]
taskSlot 1: 3 records
taskSlot 2: 4 records
taskSlot 3: 3 records
taskSlot 4: 4 records
Scheduling Skew (Cont’d)
Scheduling 4 subtasks to two TMs, each having 3 slots:
cluster.evenly-spread-out-slots: false (default) ⇨ TM1 gets more subtasks than TM2: skew!
cluster.evenly-spread-out-slots: true ⇨ task slots are evenly distributed among TM1 and TM2
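A minimal flink-conf.yaml sketch of the setting above; it is a cluster-level option, so it only takes effect when the cluster (or session) is started with it:

# flink-conf.yaml: spread slots evenly across all available TaskManagers
cluster.evenly-spread-out-slots: true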
That’s all for
Scheduling Skew!
Time-based Processing
Background Knowledge
● Event time vs. processing time
○ Event time is the time when the event actually occurred
○ Processing time is the time when the event is processed
● Event-time-based computations (a small sketch follows below)
○ Windowing
○ Timers
● Apache Flink uses watermarks to keep track of the progress in event time
● An operator may have multiple input channels in Flink
○ Keyed streams
○ Joining streams
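As a small sketch of an event-time computation, the snippet below counts events per key in 1-minute tumbling event-time windows, driven by watermarks; the record type, accessors, kafkaSource, and the aggregate function are illustrative assumptions.

DataStream<MyEvent> events = env.fromSource(
    kafkaSource,
    WatermarkStrategy
        .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, ts) -> event.getEventTimeMillis()),
    "events");

events
    .keyBy(e -> e.getKey())
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new CountPerKey());   // a window fires when the watermark passes its end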
Event Time/Watermark Skew
An Example
[Diagram: an operator with two input channels; records 1-9 arrive, and the watermarks of Channel 1 and Channel 2 are far apart]
Event Time/Watermark Skew (Cont’d)
[Diagram: the operator buffers records from the faster channel in its state, because the window [0,5) cannot fire until the watermark of the slower channel advances]
Event Time/Watermark Skew (Cont’d)
Event Time Skew:
The distribution of events over time is skewed among the input channels
of an operator. It can be caused by:
1. the nature of the data sources, or
2. some upstream tasks progressing faster than others.
As a result, the watermarks of subtasks or input channels may deviate from each other.
Impacts:
● Backpressure
● Large state and long checkpoint duration
● Job failures
Event Time/Watermark Skew (Cont’d)
Watermark alignment in Flink 1.15
[Diagram: the same two-channel operator; consuming from the faster channel is put on hold because of maxAllowedWatermarkDrift, e.g., 1 in this case]
WatermarkStrategy watermarkStrategy =
    WatermarkStrategy
        .<T>forBoundedOutOfOrderness(...)
        .withTimestampAssigner(...)
        .withWatermarkAlignment(
            watermarkGroup,
            maxAllowedWatermarkDrift,
            updateInterval);

sourceStream =
    env.fromSource(
            kafkaSource,
            watermarkStrategy,
            sourceName)
        .map(...)
        …
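A concrete sketch of the strategy above with assumed values: a hypothetical MyEvent type, up to 5 s of out-of-orderness, and the watermarks of all sources in the group "alignment-group" kept within 20 s of each other, re-evaluated every second.

WatermarkStrategy<MyEvent> alignedStrategy =
    WatermarkStrategy
        .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, ts) -> event.getEventTimeMillis())
        .withWatermarkAlignment(
            "alignment-group",            // watermarkGroup
            Duration.ofSeconds(20),       // maxAllowedWatermarkDrift
            Duration.ofSeconds(1));       // updateInterval

DataStream<MyEvent> sourceStream =
    env.fromSource(kafkaSource, alignedStrategy, "aligned-kafka-source");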
Event Time/Watermark Skew (Cont’d)
Watermark alignment reduces checkpoint size and duration
[Charts: checkpoint size and duration, with watermark alignment vs. without watermark alignment]
Event Time/Watermark Skew (Cont’d)
● Use JobManagerWatermarkTracker
○ Event time alignment in Amazon Kinesis Data Streams Connector in Flink documentation
○ The Flink Forward 2020 talk by Shahid from Stripe: Streaming, Fast and Slow: Mitigating
Watermark Skew in Large, Stateful Jobs
What if we have to use an earlier version of Flink?
That’s all for
Event Time Skew!
Key Takeaways
✓ Skew can cause problems or failures
✓ Even out the workload across not only task slots but also TaskManagers
✓ maxParallelism = ~5-10 x parallelism
✓ Pay attention to the network shuffle cost when solving skew issues

Data Skew: Filter, Re-partition, Local aggregation
Key Skew / Hot key: Use a different key, Local-global aggregation, Stream split, Two-stage keyBy
Key Skew / Hot keyGroup: Adjust maxParallelism
Key Skew / Hot taskSlot: Adjust parallelism, Adjust maxParallelism
State Skew: Fix data skew and/or key skew
Scheduling Skew: Set cluster.evenly-spread-out-slots: true
Event Time Skew: Use watermark alignment (Flink 1.15+), JobManagerWatermarkTracker
Happy streaming!
Q & A

Speaker notes

  1. Hello everyone, welcome back to Flink Forward. When you work in the data domain, it is likely that you face some skew issues as data are not always evenly distributed. This is the same when you do stream processing with Flink. Skews can result in wasted resources and limited scalability. So, how can we even out the uneven? My name is Jun, head of solutions architecture in Ververica. In the past years, our team have helped customers and users solve various skew-related issues in their Flink jobs or clusters. Today, together with my colleague Karl, we will present the various skew situations that users often run into and discuss the solutions for each of them. We hope this can serve as a guideline to help you reduce skew in your Flink environment.
  2. We will start with the definition of skew and its impact on Flink jobs. Then I will present data skew & key skew. My colleague Karl will continue with state skew, scheduling skew and event time or watermark skew. We will also show experimental results for some of the solutions. At the end, we will summarize the talk with the key takeaways.
  3. Let us first define skew. Skew means the workload imbalance among subtasks of a Flink job or TaskManagers of a Flink cluster. Skew can happen <CLICK> in terms of data to be processed, <CLICK> in terms of state size, <CLICK> in terms of event time, <CLICK> or in terms of CPU/Memory/Disk usage.
  4. As we saw in the previous slide, in a skew situation, some TaskManagers may be 100% busy while the others do not even reach 50%. This leads to poor resource utilization. In a situation where the majority of the data is processed by a single task slot, it will cause backpressure to upstream operators, resulting in low throughput and/or high latency. When you have skew in event time, you may need to buffer lots of events in state, which can cause high memory usage and long garbage collection times. This can then lead to TM heartbeat timeouts, task failures and job restarts.
  5. The first type of skew is Data Skew. For example, your job consumes messages from a directory of files where some files are much larger than other files. A similar thing can happen when your job consumes from a Kafka topic where the topic has several partitions, but some partitions have much more data than other partitions.
  6. What you will see from the Flink UI is that the number of records received by some subtasks is much larger than that of other subtasks. <CLICK> Consequently, that task is much busier than the others. And the corresponding TaskManager also has higher CPU usage than others. How can we deal with this situation in Flink?
  7. The first basic option is to check whether some records can be filtered out at the beginning of your pipeline. If so, you can then reduce the data volume sent to downstream operators. Then skew will be less of an issue. <CLICK> The second option is to re-partition data among the subtasks of the operators following the source operator. You can call rebalance() to distribute records in a round-robin fashion, or shuffle() to select downstream operator subtasks randomly. You can also supply your own partitioning scheme, or call keyBy() if you are in a keyed context. Obviously, data re-partitioning implies a network shuffle. So you will get a performance improvement only if the network shuffle cost is insignificant compared to the computation of the rest of the pipeline. Typically, we suggest re-partitioning only if some TaskManager reaches 100% CPU usage because of the skew. <CLICK> The third option is to implement a custom operator to do a Hadoop-style map-side aggregation. The purpose here is to reduce the overall workload of downstream operators. For example, instead of sending 10 raw records downstream, you can send 1 record with the aggregated value. You can see such an example in the MapBundleOperator class in the flink-table-runtime package.
  8. Here is an experiment we conducted. We have 4 subtasks consuming directly from a Kafka topic with 4 partitions where one of the partitions contains 80% of the total data volume. The throughput of our job was 30K records per second. After we re-partitioned the data with rebalance(), we increased the job throughput to 40K. <CLICK> But as mentioned before, if the network shuffle cost is significant compared to the actual data processing, you may get worse performance with data re-partitioning.
  9. As mentioned in the data skew section, we can use keyBy to re-partition data to solve data skew issue. But if the data are not evenly distributed among keys, it then become a key skew issue. Before we deep dive into key skew issues, let us have a look at first how Flink maps records to keys, and then to keyGroups and taskSlots. Flink maps records to keys by KeySelector. So KeySelector determines the number of records that are mapped to a particularly key. If the number is large, the key becomes a hot key. To compute keyGroups, Flink applies murmurHash to keys’ hashCode, modulo maxParallelism. So, maxParallelism determines the total number of keyGroups. For a given parallelism, all keyGroups are split into several ranges, each range is assigned to a taskSlot. So, parallelism here determines how the keyGroups are grouped together and mapped to taskSlots.
  10. Let us look at a concrete example. Here, every record is mapped to a key. 8 of them are mapped to key A. Key A is mapped to keyGroup 3. Given a maxParalellism of 12 and parallelism of 4, all 12 keyGroups are split into 4 ranges, represented by orange/yellow/green/red. keyGroup 3 is mapped to taskSlot 2. This means, taskSlot 2 will process 8 records, while other taskSlots only get 1/2/3 records. <CLICK> This is the Hot Key issue.
  11. The first basic solution to the hot key issue is to use a different key that has less or no skew. For example, instead of keyBy currency, you can keyBy accountId. If you are doing general aggregation, you can try the local-global aggregation approach, aks. two-phase aggregation. This is similar to Hadoop combiner. As shown in the picture here, instead of keyBy color directly, you first aggregate locally in each subtask. The local aggregation can help to accumulate a certain amount of input records which have the same key into a single accumulator. The global aggregation will then receive the reduced accumulators instead of large number of raw input records. This can significantly reduce the network shuffle and make the key kew less of an issue. Flink SQL does this automatically if you enable the TWO-PHASE aggregation strategy.
  12. The third approach to the hot key issue is two-stage keyBy. Because the key is skewed, we first keyBy a randomized key that consists of the original key and a random part. The assumption here is that the amount of the output data from the first keyBy(randomizedKey) and its process function is significantly reduced in comparison to the amount of original input data. Then in the second step, we can keyBy the original key. Because of the reduced data volume, the hot skew is not an issue any more. One thing to note here is that, given an input record, the randomzied key must be deterministic. For example, you can use the original key, plus the hashCode of another field of the input records. <CLICK> If you know the hot key in advance, another approach to solve the hot key issue is to split the input stream in your pipeline into streams of hot keys and streams of other keys, by using Flink’s side output. For the streams of other keys, you do as usual with your keyBy. For the streams of hot keys, you can keyBy another field to parallelize the data processing and then keyBy the original key to gather the aggregated results.
  13. Here, we simulated a hot key in our Kafka topic and tested the two-keyBy solution and the stream split solution. As we can see on the left-hand side, due to the existence of a hot key, one of the TMs is bottlenecked on CPU if we just keyBy with the original key. The CPU usage is well balanced in the two-keyBy solution and the stream split solution, as seen on the right-hand side.
  14. As a result, the job throughput increased from 38K to 42K with the two-keyBy approach and to 50K with the stream split approach. <PAUSE> This is all about the hot key issue.
  15. Let us go back to the original picture. Here, we do not have a hot key, because each key is associated with only 3-4 records. But both key A and key B are mapped to keyGroup 3. For a similar reason as mentioned before, taskSlot 2 gets the majority of the data to be processed. This is a hot keyGroup issue, because multiple keys are mapped to the same keyGroup. The solution here is to adjust maxParallelism.
  16. Let us look at a concrete example. With the default maxParallelism of 128, if you want to process records with keys of string A, B, C in your pipeline with a parallelism of 3, keys B and C will be mapped to keyGroup 17 and processed by taskSlot 0, key A will be processed by taskSlot 2, and taskSlot 1 is idle. If you change the maxParallelism to 256, the three keys are evenly distributed across the three taskSlots.
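  Adjusting the maximum parallelism is a one-line change; a minimal sketch with the values from this example (the per-operator variant at the end is optional, and the function name is illustrative):

  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  env.setParallelism(3);        // number of parallel subtasks / task slots used
  env.setMaxParallelism(256);   // number of keyGroups; changing it redistributes keys over slots

  // maxParallelism can also be set on an individual operator:
  stream.keyBy(r -> r.key)
        .process(new MyKeyedFunction())
        .setMaxParallelism(256);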
  17. There is another scenario. Here we have no hot keys and no hot keyGroups. But because both key A and key B are mapped to taskSlot 2, taskSlot 2 is hot again. The solution for this particular example is to adjust the parallelism.
  18. For example, given a maxParallelism of 12, the keyGroups (represented by the numbers) are grouped and mapped to taskSlots (represented by the colors). This slide shows how the keyGroups are distributed among taskSlots when adjusting the parallelism. <MOUSE> For example, when parallelism=3, keyGroups 0-3 are mapped to taskSlot 0, keyGroups 4-7 are mapped to taskSlot 1, and keyGroups 8-11 are mapped to taskSlot 2.
  19. You can also adjust maxParallelism to achieve an even distribution. With the default maxParallelism of 128, if you want to process records with keys of string 1, 2, 3, 4 in your pipeline with a parallelism of 4, keys 1 and 3 are processed by taskSlot 1, keys 2 and 4 are processed by taskSlot 0, and taskSlots 2 and 3 are idle. If you change the maxParallelism to 64, the four keys are evenly distributed across the 4 taskSlots.
  20. When changing maxParallelism and parallelism, you should also pay attention to the ratio between maxParallelism and parallelism, as it can also impact the data distribution among taskSlots. With the default maxParallelism of 128 and a parallelism of 127, most of the taskSlots will get one keyGroup each, but one taskSlot gets two keyGroups. This taskSlot will get 100% more workload compared with the other taskSlots. If we now change the maxParallelism to 1280, then all of the taskSlots will get either 10 or 11 keyGroups, meaning there is a fluctuation of only about 10% in workload among all taskSlots. The workload is more evenly distributed in this example than in the previous one. So the best practice is to set maxParallelism to 5-10 times the parallelism. Keep in mind that setting the maximum parallelism to a very large value can negatively impact performance, because state backends have to keep internal data structures that scale with the number of key-groups. Also because of this, if you want to change the maxParallelism of an existing job without discarding the state, you need to convert your state via the State Processor API. With this, I am now handing over to my colleague Karl for the other types of skew. <FINISH>
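  A small sketch, in case you want to check the keyGroup distribution for a given maxParallelism/parallelism combination yourself (same formula as before; the commented calls use the numbers from this example):

  static void printKeyGroupsPerSlot(int maxParallelism, int parallelism) {
      int[] keyGroupsPerSlot = new int[parallelism];
      for (int kg = 0; kg < maxParallelism; kg++) {
          keyGroupsPerSlot[kg * parallelism / maxParallelism]++;   // keyGroup -> taskSlot
      }
      for (int slot = 0; slot < parallelism; slot++) {
          System.out.printf("taskSlot %d -> %d keyGroups%n", slot, keyGroupsPerSlot[slot]);
      }
  }
  // printKeyGroupsPerSlot(128, 127);   // one slot gets 2 keyGroups, all others get 1
  // printKeyGroupsPerSlot(1280, 127);  // every slot gets 10 or 11 keyGroups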
  21. Thanks! The 3rd kind of skew is "State Skew". It refers to the case when some subtasks have much bigger state than others. In this example, subtask 2's state is much bigger than the others', and it takes much longer to checkpoint. State skew is often caused by data skew or key skew, which we discussed earlier. And typically it can be solved by a combination of the solutions to data skew or key skew, depending on the situation. [Skip, Audience Qs] KQ: Is the ID a TM ID or a slot? In the Flink UI, click an operator: the Subtask ID, which is mapped to a TaskSlot. An operator can have many instances (tasks/subtasks). Each subtask is scheduled to a TaskSlot, but it's not always a 1-to-1 mapping. E.g. a TaskSlot can run an operator chain of 4 Ops, i.e. 4 subtasks are scheduled to 1 TaskSlot.
  22. So far we've discussed data skews, key skews and state skews. They are very important, and avoiding them helps us distribute records evenly among task slots. So, does that guarantee even resource utilization? Unfortunately, no.
  23. Suppose we have our data evenly distributed in key groups and then into 4 subtasks. (Everything looks nice.) <CLICK> Next, the subtasks will be scheduled in TaskManagers. <NEXT> [Skip] For example, with maxParallelism=128 (default) and parallelism=4 (key → keyGroup → taskSlot): key 1 → keyGroup 86 → taskSlot 2; key 2 → keyGroup 127 → taskSlot 3; key 3 → keyGroup 113 → taskSlot 3; key 4 → keyGroup 7 → taskSlot 0; key 5 → keyGroup 126 → taskSlot 3.
  24. Let’s assume that we have 2 Task Managers, who have 3 Task Slots each. And we have 4 subtasks to do. By default, Flink would schedule 3 subtasks to TaskManager 1, and the left 1 subtask to TaskManager 2. Now we have TaskManager 1 doing 75% of work, while TaskManager 2 doing 25%. If this continues, we are not using our resources evenly. We call this Scheduling Skew, the 4th kind of skews. What can we do? <CLICK> We can turn on the Flink option cluster.evenly-spread-out-slots. <CLICK> Now the scheduler will take slots from the least used TM (when there aren’t any other preferences). This way the subtasks will be scheduled evenly among TaskManagers. <Note> Please note: This example assumes that both TM1 and TM2 are registered with the cluster before the jobs are submitted. And in clusters that dynamically add TMs as needed, the cluster.evenly-spread-out-slots option doesn't make sense. In addition, if each TM has only one slot (common in K8s env), then this configuration has no effect.
  25. That’s all for Scheduling Skew! <Pause> Next, we’re gonna look at a very different kind of skews.
  26. But before that, please allow me to refresh some background knowledge first. In the domain of stream processing, we have the notions of Event time and Processing time. In general, Event time is the time when the event actually occurred, determined by the timestamp on the data record, while Processing time is the time when the event is processed, determined by the clock of the system processing the record. Many use cases apply window algorithms and timers based on Event time, and Apache Flink uses watermarks to keep track of the progress in event time. In addition, a data operator may have multiple input channels in Flink.
  27. Let’s look at this scenario. We have an Op processing 2 channels. And we indicate the watermarks from the channels by numbers. On the right side, When the Op receives Watermark 5 from channel 2, the greatest watermark it has received from channel 1 is Watermark 1. In other words, channel 2 progresses faster than channel 1 at the moment. (The greatest available watermark in channel 2 is 9, while the greatest in channel 1 is 5.) [Ref] Content of channels may be from: different streams, network shuffle, or keyBy(). keyBy is a type of network shuffle
  28. If the Op does an event-time-based computation, like a window operation, data from channel 2 will pile up (in the Op) while the Op waits for data from channel 1. This is called Event Time Skew.
  29. Event Time Skew happens when the event distribution over time is skewed among the input channels of an operator. It could be because of the nature of the data sources (e.g. we dedup or join files of various sizes, and each source takes one file at a time as input), or because some upstream tasks progress faster than others, resulting in varied event times and watermarks from the different input channels of the operator in concern. <Pause> Event Time Skew may result in: backpressure, large state and long checkpoint durations, and job failures. <Pause> What to do? <NEXT> [Ref] How to detect Watermark Skew? How mitigating event-time skewness can reduce checkpoint failures and task manager crashes: watch the lag of the assigned Kafka partitions per Flink Kafka consumer. Unfortunately, these metrics are not available. You need to draw your own conclusions based on some indirect indicators, such as looking into whether the total checkpointing times increase faster than the state size, or whether there are differences in the checkpointing acknowledgement times between the various instances of the stateful operators. The latter may be partly because your data is not evenly distributed. Maybe the most reliable indicator is an irregular watermark progression of the Flink Kafka Consumer instances. Flink UI, click an operator: Subtask ID, which is mapped to a TaskSlot. An operator can have many instances (tasks/subtasks). Each subtask is scheduled to a TaskSlot, but it's not always a 1-to-1 mapping. E.g. a TaskSlot can run an operator chain of 4 Ops, i.e. 4 subtasks are scheduled to 1 TaskSlot. For the relationship of records, KeyGroups, and TaskSlots, see slide 23 <Scheduling Skew>. { After using keyBy(), we cannot see the watermark skews in the Flink UI, as it shows the smallest watermark of all input channels of each subtask. In this demo, each subtask processes its own Kafka partition. They don't interact with each other (no network shuffle involved). } Backpressure: at a high level, backpressure happens if some operator(s) in the Job Graph cannot process records at the same rate as they are received. This fills up the input buffers of the subtask that is running this slow operator. Once the input buffers are full, backpressure propagates to the output buffers of the upstream subtasks. Once those are filled up, the upstream subtasks are also forced to slow down their records' processing rate to match the processing rate of the operator causing this bottleneck down the stream. Backpressure further propagates up the stream until it reaches the source operators.
  30. We can use Watermark alignment in Flink 1.15. With Watermark alignment, Flink pauses consuming from sources/tasks which generated watermarks that are too far into the future. Meanwhile it continues reading records from other sources/tasks which can move the combined watermark forward and thus unblock the faster ones. Look at the diagram on the right. We can use the parameter maxAllowedWatermarkDrift to define the maximal watermark difference between the 2 channels for the operator. In this example, if we set maxAllowedWatermarkDrift to 1, then when the watermark from Channel 1 reaches 5, the op will consume no more events from Channel 2 after Channel 2's watermark reaches 6. It will continue consuming from Channel 2 after Channel 1's watermark increases. This way the op doesn't need to buffer excessive data, and the Event Time Skew is mitigated. The cost is more RPC messages between TMs and the JM. <NOTE> Please note that it only works with sources implemented with the new source interface (FLIP-27), and it does not work if timestamps and watermarks are assigned after the source via DataStream#assignTimestampsAndWatermarks. Let's look at some results. <NEXT> [Ref] https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/api/common/eventtime/WatermarkStrategy.html#withWatermarkAlignment-java.lang.String-java.time.Duration-java.time.Duration- @Experimental default WatermarkStrategy<T> withWatermarkAlignment(String watermarkGroup, java.time.Duration maxAllowedWatermarkDrift, java.time.Duration updateInterval) Creates a new WatermarkStrategy that configures the maximum watermark drift from other sources/tasks/partitions in the same watermark group. The group may contain completely independent sources (e.g. File and Kafka). Once configured Flink will "pause" consuming from a source/task/partition that is ahead of the emitted watermark in the group by more than the maxAllowedWatermarkDrift. Parameters: watermarkGroup - A group of sources to align watermarks maxAllowedWatermarkDrift - Maximal drift, before we pause consuming from the source/task/partition updateInterval - How often tasks should notify coordinator about the current watermark and how often the coordinator should announce the maximal aligned watermark. https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment-_beta_ Note: You can enable watermark alignment only for FLIP-27 sources. It does not work for legacy or if applied after the source via DataStream#assignTimestampsAndWatermarks. When enabling the alignment, you need to tell Flink, which group should the source belong. You do that by providing a label (e.g. alignment-group-1) which bind together all sources that share it. Moreover, you have to tell the maximal drift from the current minimal watermarks across all sources belonging to that group. The third parameter describes how often the current maximal watermark should be updated. The downside of frequent updates is that there will be more RPC messages travelling between TMs and the JM. In order to achieve the alignment Flink will pause consuming from the source/task, which generated watermark that is too far into the future. In the meantime it will continue reading records from other sources/tasks which can move the combined watermark forward and that way unblock the faster one. Note: As of 1.15, Flink supports aligning across tasks of the same source and/or different sources. 
It does not support aligning splits/partitions/shards in the same task. In a case where there are e.g. two Kafka partitions that produce watermarks at a different pace and get assigned to the same task, the watermark might not behave as expected. Fortunately, in the worst case it should not perform worse than without alignment. Given the limitation above, we suggest applying watermark alignment in two situations: You have two different sources (e.g. Kafka and File) that produce watermarks at different speeds. You run your source with parallelism equal to the number of splits/shards/partitions, which results in every subtask being assigned a single unit of work. "If timestamps and watermarks have been assigned to the source", what does it mean? Those are defined by the watermarkStrategy parameter of StreamExecutionEnvironment.fromSource(). [Skip] Does not work with: watermarkStrategy.withIdleness()
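  A minimal sketch of enabling watermark alignment on a FLIP-27 Kafka source (the topic, group, record type, deserialization schema, and the drift/update values are illustrative; the alignment call follows the withWatermarkAlignment signature quoted above):

  KafkaSource<Event> source = KafkaSource.<Event>builder()
      .setBootstrapServers("broker:9092")
      .setTopics("events")
      .setGroupId("my-group")
      .setValueOnlyDeserializer(new EventDeserializationSchema())
      .build();

  WatermarkStrategy<Event> strategy = WatermarkStrategy
      .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
      .withTimestampAssigner((event, ts) -> event.timestamp)
      // sources sharing "alignment-group-1" are aligned; pause a source/task once it is
      // more than 30s ahead of the group; refresh the group watermark every second
      .withWatermarkAlignment("alignment-group-1", Duration.ofSeconds(30), Duration.ofSeconds(1));

  DataStream<Event> stream = env.fromSource(source, strategy, "Kafka Source");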
  31. This is the data collected for a Flink job with Event Time skew. The job consumes a Kafka topic with 4 partitions, and the watermarks (event times) of 1 partition are much smaller than those of the other partitions. It processes the data in a window operator based on Event Time. The left graph shows the checkpoint size with and without watermark alignment, and the right graph shows the checkpoint duration with and without watermark alignment. After using watermark alignment, both the checkpoint size and duration were reduced significantly, as the Event Time skew is mitigated. [Skip] KQ5: how did you draw the data of 2 jobs in 1 diagram? We drew the data of the 2 jobs in Grafana.
  32. What if we have to use an earlier version of Flink? <Pause> We can use JobManagerWatermarkTracker. <Pause> For example, if you use Amazon Kinesis Data Streams, check out Event time alignment in the Amazon Kinesis Data Streams Connector. Otherwise, the talk <Streaming, Fast and Slow> from Flink Forward 2020 might help you. Both solutions use JobManagerWatermarkTracker. Essentially we use a global aggregate to synchronize per-subtask watermarks. Each subtask uses a per-shard queue to control the rate at which records are emitted downstream (based on how far ahead of the global watermark the next record in the queue is). [Ref] Event time alignment The Flink Kinesis Consumer optionally supports synchronization between parallel consumer subtasks (and their threads) to avoid the event time skew related problems described in Event time synchronization across sources. To enable synchronization, set the watermark tracker on the consumer: JobManagerWatermarkTracker watermarkTracker = new JobManagerWatermarkTracker("myKinesisSource"); consumer.setWatermarkTracker(watermarkTracker); The JobManagerWatermarkTracker uses a global aggregate to synchronize per subtask watermarks. Each subtask uses a per shard queue to control the rate at which records are emitted downstream based on how far ahead of the global watermark the next record in the queue is. The “emit ahead” limit is configured via ConsumerConfigConstants.WATERMARK_LOOKAHEAD_MILLIS. Smaller values reduce the skew but also the throughput. Larger values will allow the subtask to proceed further before waiting for the global watermark to advance. Another variable in the throughput equation is how frequently the watermark is propagated by the tracker. The interval can be configured via ConsumerConfigConstants.WATERMARK_SYNC_MILLIS. Smaller values reduce emitter waits and come at the cost of increased communication with the job manager. Since records accumulate in the queues when skew occurs, increased memory consumption needs to be expected. How much depends on the average record size. With larger sizes, it may be necessary to adjust the emitter queue capacity via ConsumerConfigConstants.WATERMARK_SYNC_QUEUE_CAPACITY. [Skip] Or you can implement a rate limiter. How do you rate-limit? Per volume? How to rate-limit per event time?
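  A minimal sketch of wiring the tracker into the Kinesis consumer, following the connector documentation quoted above (the region, stream name, deserialization schema, watermark assigner, and millisecond values are illustrative; a periodic watermark assigner on the consumer is also needed for the alignment to take effect):

  Properties consumerConfig = new Properties();
  consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "eu-west-1");
  // how far ahead of the global watermark a subtask may emit (smaller = less skew, lower throughput)
  consumerConfig.setProperty(ConsumerConfigConstants.WATERMARK_LOOKAHEAD_MILLIS, "10000");
  // how often per-subtask watermarks are synchronized via the JobManager
  consumerConfig.setProperty(ConsumerConfigConstants.WATERMARK_SYNC_MILLIS, "5000");

  FlinkKinesisConsumer<Event> consumer =
      new FlinkKinesisConsumer<>("my-stream", new EventDeserializationSchema(), consumerConfig);
  consumer.setPeriodicWatermarkAssigner(new MyPeriodicWatermarkAssigner());
  consumer.setWatermarkTracker(new JobManagerWatermarkTracker("myKinesisSource"));

  DataStream<Event> stream = env.addSource(consumer);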
  33. And that concludes our discussion of Event Time Skew.
  34. In this talk, we covered 5 kinds of skews, all of which may impact the performance and scalability of your systems significantly. For Data Skew, we may filter unnecessary data, re-partition data, or do local aggregation. For Hot Keys, we may use a different key, use local-global aggregation, use the two-stage keyBy, or use stream split. For Hot keyGroups, we may adjust maxParallelism. For Hot taskSlots, we may adjust parallelism or adjust maxParallelism. For State Skew, we may find the root cause and fix the underlying data skew and/or key skew. For Scheduling Skew, we may set cluster.evenly-spread-out-slots to true. For Event Time Skew, we may leverage watermark alignment or the JobManagerWatermarkTracker. <Pause> In principle, we want to even out the workload not only among task slots but also among TaskManagers, and look out for Event Time Skew; but pay attention to the cost of your solutions, <Pause> like network shuffle. <Pause> All of those were learned from years of experience with large-scale distributed systems, and we humbly hope that they can help you build performant and scalable systems! And happy streaming!
  35. And happy streaming!