Uneven distribution of input (or intermediate) data can often cause skew in joins. In Spark, this leads to very slow join stages where a few straggling tasks can take forever to finish. At Facebook, where Spark jobs shuffle hundreds of petabytes of aggregate data per day, data skew pushes runtime latencies out to the order of multiple hours and even days. Over the past year, we introduced several state-of-the-art skew mitigation techniques from traditional databases that reduced query runtimes by more than 40% and expanded Spark adoption for numerous latency-sensitive pipelines. In this talk, we'll take a deep dive into Spark's execution engine and share how we're gradually solving the data skew problem at scale. To this end, we'll discuss several Catalyst optimizations around implementing a hybrid skew join in Spark (one that broadcasts uncorrelated skewed keys and shuffles non-skewed keys), describe our approach of extending this idea to efficiently identify (and broadcast) skewed keys adaptively at runtime, and discuss CPU vs. IOPS trade-offs around how these techniques interact with Cosco, Facebook's petabyte-scale shuffle service (https://maxmind-databricks.pantheonsite.io/session/cosco-an-efficient-facebook-scale-shuffle-service).
2. Agenda
Skew Join Journey
▪ Skew Hint
▪ Runtime Skew Mitigation
▪ Customized AQE Skew Mitigation
Cosco + Skew Join
▪ Shuffle Recap
▪ Working with Original Cosco
▪ Splitting on File Boundaries
3. What Is Data Skew
▪ Uneven distribution of partitioned data
▪ Affects aggregate operations, e.g., Group By and Join
▪ E.g., counting people grouped by country
▪ Data skew at FB
▪ PB-sized tables joined with TB-sized skewed partitions
▪ Task latency
▪ Non-skewed partitions: minutes to a few hours
▪ Skewed partitions: hours to days
▪ Skewed pipelines
▪ Latency-sensitive daily jobs
▪ Complex DAGs with several upstream/downstream dependencies
▪ Can delay hundreds of downstream pipelines
[Chart: shuffle partition sizes by percentile, most partitions are around 0.01 GB, while partition 3 (skewed) reaches roughly 1 TB]
4. Data Skew In Join
▪ Join strategies in Spark
▪ SortMergeJoin / ShuffleHashJoin
▪ Not skew resistant: requires input data to be partitioned
▪ BroadcastHashJoin
▪ Skew resistant: requires no data partitioning
▪ Requires one side of the join to be small
▪ Skew Join
▪ Hybrid join that combines the above strategies
5. Mitigating Data Skew In Joins
1. Skew Hint: optimizer rule-based solution
2. Runtime skew mitigation: built on the Spark 2.0 adaptive framework
3. Customized AQE skew mitigation: based on the Spark 3.0 AQE framework
7. Skew Hint
▪ User provides skew information as hints
▪ /*+ SKEWED_ON(column=value1,..) */
▪ Split input into two parts
▪ Skewed keys
▪ Non-skewed keys
▪ Broadcast join skewed keys
▪ Shuffle/Sort join non-skewed keys
▪ Union both join outputs (see the sketch below)
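To make the split concrete, here is a minimal Spark (Scala) sketch of the rewrite the hint drives. The tables (orders, users), the join key (user_id), and the skewed key 12345 are hypothetical, not from the talk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("skew-hint-sketch").getOrCreate()
val orders = spark.table("orders") // large table, skewed on user_id = 12345
val users  = spark.table("users")
val skewedKeys = Seq(12345L)

// 1. Broadcast join the skewed keys: the matching slice of `users` is tiny.
val skewedJoin = orders
  .filter(orders("user_id").isin(skewedKeys: _*))
  .join(broadcast(users.filter(users("user_id").isin(skewedKeys: _*))), "user_id")

// 2. Shuffle/sort join the non-skewed keys.
val nonSkewedJoin = orders
  .filter(!orders("user_id").isin(skewedKeys: _*))
  .join(users.filter(!users("user_id").isin(skewedKeys: _*)), "user_id")

// 3. Union both join outputs.
val result = skewedJoin.union(nonSkewedJoin)
```

Note how each input is filtered twice; that is the double scan listed as a con on the next slide.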
10. Skew Hint
Pros
• Reduces runtime latency
Cons
• Requires prior knowledge of skewed keys
• Double scan of data
11. Mitigating Data Skew In Joins
1. Skew Hint: optimizer rule-based solution
2. Runtime skew mitigation: built on the Spark 2.0 adaptive framework
3. Customized AQE skew mitigation: based on the Spark 3.0 AQE framework
12. Runtime skew mitigation
▪ Spark 2.0 adaptive framework
▪ Adds an ExchangeCoordinator to all shuffle Exchange nodes
▪ The ExchangeCoordinator merges small partitions based on runtime stats
▪ Runtime skewed partition detection
▪ Collect MapOutputStatistics of a join's input stages
▪ A partition is skewed iff its size satisfies all of (sketched below):
▪ > min_threshold_config_value (e.g. 1 GB)
▪ > median_ratio × median_size(shuffle) (e.g. 10 times the median size)
▪ > pct_99_ratio × pct99_size(shuffle) (e.g. 3 times the 99th-percentile size)
Built on Spark 2.0 adaptive framework
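Read as a conjunction, the detection rule can be sketched as follows; the stats type, the nearest-rank percentile, and the threshold values are assumptions for illustration, not Facebook's actual code:

```scala
// Sketch of the skew-detection predicate described above.
case class ShuffleStats(partitionSizes: Array[Long]) {
  private val sorted = partitionSizes.sorted
  def median: Long = sorted(sorted.length / 2)
  def pct99: Long  = sorted(((sorted.length - 1) * 0.99).toInt) // nearest-rank approximation
}

val minThreshold = 1L << 30 // min_threshold_config_value, e.g. 1 GB
val medianRatio  = 10.0     // median_ratio
val pct99Ratio   = 3.0      // pct_99_ratio

def isSkewed(partitionSize: Long, stats: ShuffleStats): Boolean =
  partitionSize > minThreshold &&
    partitionSize > medianRatio * stats.median &&
    partitionSize > pct99Ratio * stats.pct99
```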
13. Runtime skew mitigation
Built on Spark 2.0 adaptive framework
▪ Split large partitions into smaller sub-partitions
[Diagram: table_A join table_B, with the skewed shuffle partition of table_A split into sub-partitions]
14. Runtime skew mitigation
Built on Spark 2.0 adaptive framework

Skew Hint
Pros
• Reduces runtime latency
Cons
• Requires prior knowledge of skewed keys
• Double scan of data

Runtime Skew Mitigation
Spark 2.0 adaptive framework
Pros
• Reduces runtime latency
• No prior skew key knowledge required
• No double scan of data
Cons
• Additional shuffles even when a join is not skewed
• Between joins
• Between joins and aggregates
15. Mitigating Data Skew In Joins
1. Skew Hint: optimizer rule-based solution
2. Runtime skew mitigation: built on the Spark 2.0 adaptive framework
3. Customized AQE skew mitigation: based on the Spark 3.0 AQE framework
16. AQE skew mitigation
▪ AQE recap
▪ New adaptive framework in OSS
▪ Features
▪ Switching join strategy
▪ Small shuffle partitions coalescing
▪ Optimizing skew join
▪ Runtime DAG change
Spark 3.0 adaptive framework
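For reference, these OSS features are toggled through the standard Spark 3.0 configs; the values below are illustrative, not recommendations from the talk:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("aqe-skew-sketch")
  .config("spark.sql.adaptive.enabled", "true")                    // runtime DAG change
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // coalesce small shuffle partitions
  .config("spark.sql.adaptive.skewJoin.enabled", "true")           // optimize skew join
  .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "10")
  .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "1g")
  .getOrCreate()
```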
17. AQE skew mitigation
Limitations
▪ Supports two-table joins only
▪ Does not handle a single-stage skewed join followed by a reducer operation
▪ Skewed join + aggregate
▪ Skewed multi-table joins
▪ Does not handle a single-stage skewed join followed by a mapper operation
▪ Skewed join + union
▪ Skewed join + broadcast join
18. Customized AQE skew mitigation
Customization 1: Runtime Shuffle Insertion
Lifecycle of AQE execution
[Diagram: re-plan → EnsureRequirements (no new Exchange beyond this point) → optimize skew → create sub-stages → execute, with stats from the previous stage feeding the next re-plan]
19. Customized AQE skew mitigation
Customization 1: Runtime Shuffle Insertion
Lifecycle of AQE execution
[Diagram: the same lifecycle with a "detect skew" step added; if a join is skewed, the outputPartitioning of the join is set to Unknown, so that EnsureRequirements inserts a new Exchange during re-planning (sketched below)]
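The core idea can be sketched as follows. This is a simplified illustration of the customization, not Facebook's actual rule, though UnknownPartitioning is the real Catalyst class:

```scala
import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, UnknownPartitioning}

// If runtime stats show the join output is skewed, report UnknownPartitioning:
// the next EnsureRequirements pass then no longer sees a satisfying
// partitioning and inserts a fresh Exchange above the join.
def adjustedOutputPartitioning(
    joinPartitioning: Partitioning,
    partitionSizes: Array[Long],
    isSkewed: Long => Boolean): Partitioning =
  if (partitionSizes.exists(isSkewed)) UnknownPartitioning(joinPartitioning.numPartitions)
  else joinPartitioning
```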
20. Customized AQE skew mitigation
[Limitation] Customization 1: Runtime Shuffle Insertion
[Diagram: table_A (petabyte) Left Outer Join table_B, then Left Outer Join table_C, then Left Outer Join table_D]
21. Customized AQE skew mitigation
[Limitation] Customization 1: Runtime Shuffle Insertion
[Diagram: the same join tree with an Exchange inserted above each skewed join; each Exchange re-shuffles PB-sized intermediate data]
22. Customized AQE skew mitigation
Customization 2: No Shuffle Multi Table Join
▪ Split the skewed partition in one table
▪ Replicate the corresponding partitions in the rest of the input tables (see the sketch below)
[Diagram: table_A's skewed shuffle partition is split into sub-partitions, while the corresponding shuffle partitions of table_B, table_C, and table_D are replicated for each split]
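A minimal sketch of the resulting read assignment, assuming for simplicity that every table has the same number of mappers (all names here are hypothetical):

```scala
// Which mapper outputs of which table each sub-join reads for the skewed partition.
case class ReadSpec(table: String, partition: Int, mappers: Range)

def planSubJoins(
    skewedTable: String,      // e.g. "table_A"
    otherTables: Seq[String], // e.g. Seq("table_B", "table_C", "table_D")
    partition: Int,
    numMappers: Int,
    numSplits: Int): Seq[Seq[ReadSpec]] = {
  val step = math.ceil(numMappers.toDouble / numSplits).toInt
  (0 until numMappers by step).map { start =>
    // The skewed table's partition is split by mapper range...
    val split = ReadSpec(skewedTable, partition, start until math.min(start + step, numMappers))
    // ...while every other table's matching partition is replicated (read in full).
    split +: otherTables.map(t => ReadSpec(t, partition, 0 until numMappers))
  }
}
```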
23. Comparison

Skew Hint
Pros
• Reduces runtime latency
Cons
• Requires prior knowledge of skewed keys
• Double scan of data

Runtime Skew Mitigation
Spark 2.0 adaptive framework
Pros
• Reduces runtime latency
• No prior skew key knowledge required
• No double scan of data
Cons
• Additional shuffles even when a join is not skewed
• Between joins
• Between joins and aggregates

Customized AQE Skew Mitigation
Spark 3.0 adaptive framework
Pros
• Reduces runtime latency
• No prior skew key knowledge required
• No double scan of data
• No unnecessary shuffles when a join is not skewed
25. Adaptive Skew Mitigation Splits on Mapper Boundaries
▪ Each sub-reducer is only required to read data from a subset of mappers (sketched below)
▪ Only partial data is required
▪ E.g., a skewed partition 0 is split so that sub-reducer 0 reads {mapper 1} and sub-reducer 1 reads {mapper 0, mapper 2}
The Problem
[Diagram: with Cosco, mapper outputs for partition 0 are merged into files (File 0, File 1, File 3), so sub-reducer assignments no longer line up with mapper boundaries]
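With vanilla Spark shuffle, where each mapper's output is addressable on its own, mapper-boundary splitting amounts to packing mappers into sub-reducers. A minimal greedy sketch; the split-size target and the greedy strategy are assumptions, not the talk's implementation:

```scala
import scala.collection.mutable.ArrayBuffer

// Assign each mapper's output for the skewed partition to the least-loaded
// sub-reducer, so sub-reducers read disjoint subsets of mappers.
def splitByMapper(mapperBytes: Array[Long], targetBytes: Long): Seq[Seq[Int]] = {
  val numSplits = math.max(1, math.ceil(mapperBytes.sum.toDouble / targetBytes).toInt)
  val buckets   = Array.fill(numSplits)(ArrayBuffer.empty[Int])
  val loads     = Array.fill(numSplits)(0L)
  // Largest mapper outputs first, greedily into the least-loaded bucket.
  mapperBytes.zipWithIndex.sortBy(-_._1).foreach { case (bytes, mapper) =>
    val i = loads.indexOf(loads.min)
    buckets(i) += mapper
    loads(i) += bytes
  }
  buckets.map(_.toSeq).toSeq
}
```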
26. Cosco Shuffle Recap
▪ Why Cosco?
▪ Shuffle as a service: IO efficiency
▪ Write-ahead buffer
▪ Groups shuffled output by partition id instead of mapper id
▪ Data deduplication is performed on the reducer side
▪ Efficiency wins:
▪ Solves the write amplification problem by avoiding spills: 3X -> 1X
▪ Solves the small-IO problem: average IO size goes from 200 KB to 2.5 MB
▪ Previous talk: Cosco: An Efficient Facebook-Scale Shuffle Service
28. Working with Mapper Boundaries
▪ Load all data, but filter out records from unwanted mappers
▪ Each sub-reducer still has to load all data from all file chunks of the partition
▪ Records from mappers outside its assigned subset are dropped
▪ IO inefficient (see the sketch below)
[Diagram: a sub-reducer assigned mappers {1, 4} scans File 0 record by record, reading records from mapper-1 and mapper-4 and dropping records from mapper-2 and mapper-3]
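A sketch of that read path; the record layout and names are assumptions:

```scala
// Each sub-reducer scans every record of the partition's files and keeps
// only records from its assigned mappers: correct, but IO-inefficient.
case class ShuffleRecord(mapperId: Int, payload: Array[Byte])

def readForSubReducer(
    fileRecords: Iterator[ShuffleRecord], // all records in the file chunk
    assignedMappers: Set[Int]             // e.g. Set(1, 4)
  ): Iterator[ShuffleRecord] =
  fileRecords.filter(r => assignedMappers.contains(r.mapperId))
```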
29. Skewed partition worst cases
▪ The most heavily skewed partitions can reach XXX TB and require XX k splits, which means the skewed partition data is read XX k times as well.
▪ Very inefficient: most of the time is spent on disk IO.
▪ Compared to vanilla Spark shuffle, Cosco no longer has any advantage in terms of IO.
31. Terminology
▪ Record
▪ Unit of a data row; multiple records form a package
▪ Package
▪ Unit sent from mapper to shuffle service; multiple packages form a chunk
▪ Chunk
▪ Unit to be flushed (see the type sketch below)
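In types, the hierarchy might look like this; the field choices are assumptions used in the sketches that follow:

```scala
case class Record(row: Array[Byte])                                  // unit of a data row
case class Package(id: Long, mapperId: String, records: Seq[Record]) // unit sent from mapper to shuffle service
case class Chunk(id: Long, partition: Int, packages: Seq[Package])   // unit flushed to the file system
```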
32. Restrictions
▪ Duplicated records
▪ A record can be duplicated across multiple files due to package resending, which can happen for various reasons.
▪ Cosco relies on the reducer client to perform deduplication, which requires the client to read the entire partition data from a mapper.
▪ Simply splitting on file boundaries requires that the file sets of different sub-reducers have no mapper overlap, which is not feasible.
[Diagram: sub-reducer 0 is assigned to read files 0 & 1, but has to read files 2 & 3 as well because they have mapper overlap with files 0 & 1]
33. Duplication
▪ Mapper failure
▪ Each mapper attempt is assigned a unique id.
▪ Each package is tagged with its mapper's unique id.
▪ The reducer consumes records only from the mappers it is interested in, based on those unique ids.
▪ A mapper output that is not fully read by a reducer therefore cannot cause duplicated records (see the sketch below).
[Diagram: Mapper 0 fails with id `abc` and is retried with id `xyz`; partition 0 holds package 3 under both ids. Sub-reducer 0 reads from `xyz`, so the `abc` copy is dropped and the `xyz` copy is accepted.]
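The acceptance rule for mapper failure reduces to an id check; a minimal sketch using the ids from the diagram:

```scala
// Accept a package only if it carries the unique id of a successful mapper attempt.
def acceptPackage(pkgMapperId: String, successfulAttemptIds: Set[String]): Boolean =
  successfulAttemptIds.contains(pkgMapperId)

// Mapper 0 failed as attempt "abc" and was retried as "xyz":
assert(!acceptPackage("abc", Set("xyz"))) // dropped
assert(acceptPackage("xyz", Set("xyz")))  // accepted
```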
34. Duplication
▪ Network issues
▪ If a mapper fails to receive an ack message before the timeout, it resends the package.
▪ We call such packages "suspected packages": a suspected package is not necessarily a duplicated package, but a duplicated package is always a suspected package.
[Diagram: Mapper-1 sends package-1/2/3 to shuffle service 1, whose records are flushed to chunk i; after an ack timeout, the packages still in the ack queue are resent to shuffle service 2, so their records are duplicated in chunk j]
36. Suspected Map
▪ Map<partition, Map<packageId, chunkId>>
▪ Each mapper keeps a map to track packages resent due to connection issues.
▪ Whenever a resend happens, all the packages in the ACK queue are added to the map.
[Diagram: Mapper-1 resends package-1/2/3 from its ACK queue to shuffle service 2 and adds them to the suspected package map; for partition 1, the map holds package-x -> chunk-xyz, package-1 -> null, package-2 -> null, package-3 -> null]
37. Suspected Map
▪ Map<partition, Map<packageId, chunkId>>
▪ The ACK message that eventually returns contains a unique identifier of the chunk containing the package.
▪ The mapper keeps the mapping from those suspected packages to their authorized chunks.
▪ Once the mapper is done, it reports the suspected map to the Spark driver.
▪ The Spark driver aggregates the suspected package info from the different mappers (a sketch of the mapper-side bookkeeping follows).
[Diagram: shuffle service 2 returns the ACK for package-1 with its chunk id; the mapper removes package-1 from the ACK queue and updates the suspected map for partition 1: package-1 -> chunk-abc, package-2 -> null, package-3 -> null]
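A minimal sketch of that mapper-side bookkeeping; the class and method names are assumptions:

```scala
import scala.collection.mutable

// Map<partition, Map<packageId, chunkId>>: chunkId is None until the ACK arrives.
class SuspectedMap {
  private val map = mutable.Map.empty[Int, mutable.Map[Long, Option[Long]]]

  // Resend happened: every package still in the ACK queue becomes suspected.
  def onResend(partition: Int, ackQueuePackageIds: Seq[Long]): Unit = {
    val perPartition = map.getOrElseUpdate(partition, mutable.Map.empty)
    ackQueuePackageIds.foreach(id => perPartition.getOrElseUpdate(id, None))
  }

  // ACK arrived: record the chunk authorized to serve this package.
  def onAck(partition: Int, packageId: Long, chunkId: Long): Unit =
    map.get(partition).foreach { m =>
      if (m.contains(packageId)) m(packageId) = Some(chunkId)
    }

  // Reported to the Spark driver when the mapper finishes.
  def report: Map[Int, Map[Long, Option[Long]]] =
    map.map { case (p, m) => p -> m.toMap }.toMap
}
```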
38. Reading a subset of files on the sub-reducer
▪ Each sub-reducer is assigned a subset of files along with the corresponding suspected map of the partition.
[Diagram: a skewed partition's chunks are divided into chunk set 1, chunk set 2, ...; sub-reducer 1 reads chunk set 1, sub-reducer 2 reads chunk set 2, and so on]
39. Reading a subset of files on the sub-reducer
▪ A record from a suspected package is accepted only if it is read from the authorized chunk (see the sketch below).
[Diagram: the suspected map entry {mapper-3, package-4} -> chunk-15 means the record copy read from chunk-15 by sub-reducer-2 is accepted, while the copy read from chunk-4 by sub-reducer-7 is dropped]
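The sub-reducer's acceptance rule then becomes a single lookup; a sketch using the example above:

```scala
// Accept a record from a suspected package only when read from its
// authorized chunk; non-suspected packages are always accepted.
def acceptRecord(
    mapperId: String,
    packageId: Long,
    readFromChunk: Long,
    suspected: Map[(String, Long), Long]): Boolean =
  suspected.get((mapperId, packageId)) match {
    case Some(authorizedChunk) => readFromChunk == authorizedChunk
    case None                  => true
  }

val suspected = Map(("mapper-3", 4L) -> 15L) // {mapper-3, package-4} -> chunk-15
assert(acceptRecord("mapper-3", 4L, 15L, suspected)) // read from chunk-15: accepted
assert(!acceptRecord("mapper-3", 4L, 4L, suspected)) // read from chunk-4: dropped
```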
40. Chunk Lost
▪ A lost chunk requires restarting all corresponding sub-reducers:
▪ The Spark driver restarts mappers to regenerate the data of that particular partition.
▪ Data in the new chunks is non-deterministic.
▪ The Spark driver broadcasts the failure to all corresponding sub-reducer tasks and restarts them all.
▪ This is rare, since chunks are RS-encoded.
41. Putting It All Together
▪ Motivation:
▪ Split a skewed partition into multiple non-overlapping file sets, so that each sub-reducer can be assigned one of them.
▪ Mapper:
▪ Keeps track of suspected packages along with their authorized chunks.
▪ Reports suspected package info to the Spark driver when it finishes.
▪ Driver:
▪ Aggregates suspected package info from all mappers and passes it along to sub-reducers when necessary.
▪ Detects skewed partitions and splits them on the file boundaries of each partition.
▪ Fails and restarts all corresponding sub-reducer tasks of a partition if a chunk is lost.
▪ Reducer:
▪ Fetches its file list from the driver, along with the suspected package info.
▪ Accepts a record of a suspected package only if it is read from the authorized chunk.
▪ Data correctness:
▪ End-to-end checksum
42. Summary
Skew Join Journey
▪ Skew Hint
▪ Runtime Skew Mitigation
▪ Customized AQE Skew Mitigation
Cosco + Skew Join
▪ Shuffle Recap
▪ Working with Original Cosco
▪ Splitting on File Boundaries