The document summarizes the SOS technique for optimizing shuffle I/O in distributed computing frameworks. SOS merges small intermediate data files from map tasks into larger files to reduce the number and fragmentation of shuffle fetch requests. When deployed at Facebook scale, SOS reduced shuffle I/O by 7.5x and disk service time by 2x, while increasing average I/O size by 2.5x. These I/O optimizations translated to an overall 10% reduction in reserved CPU time for jobs.
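As a rough illustration of the core idea, the toy Python sketch below (not the actual SOS implementation; the file layout and the merge_map_outputs helper are invented for the example) concatenates the per-reducer segments of many small map-output files into one larger file per reducer partition, so each reducer issues a single large read instead of one small read per mapper:

```python
import os
from typing import Dict, List

def merge_map_outputs(map_files: List[str], num_partitions: int, out_dir: str) -> Dict[int, str]:
    """Toy merge: gather each mapper's segment for reducer partition r
    into one larger file per partition.

    Assumed toy layout: every map-output file holds exactly `num_partitions`
    newline-delimited segments, segment r destined for reducer r (this is not
    Spark's real index/data file format).
    """
    os.makedirs(out_dir, exist_ok=True)
    merged_paths = {r: os.path.join(out_dir, f"partition_{r}.merged")
                    for r in range(num_partitions)}
    sinks = {r: open(path, "wb") for r, path in merged_paths.items()}
    try:
        for map_file in map_files:
            with open(map_file, "rb") as src:
                segments = src.read().split(b"\n")
            for r in range(num_partitions):
                sinks[r].write(segments[r] + b"\n")
    finally:
        for sink in sinks.values():
            sink.close()
    # Before merging: each reducer issues len(map_files) small fetches.
    # After merging:  each reducer issues one large read from its merged file.
    return merged_paths
```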
4. Large-scale shuffle
[Diagram: all-to-all shuffle between Stage 1 and Stage 2]
• Shuffle: all-to-all communication between stages
• Shuffle data is >10x larger than available memory; strong fault-tolerance requirements
→ on-disk shuffle files
5. Shuffle I/O grows quadratically with data
[Charts: Shuffle Time (sec) and I/O Request Count (millions) vs. Number of Tasks (0–10,000); Shuffle Fetch Size (KB) vs. Number of Tasks (0–10,000)]
• M * R (number of mappers * number of reducers) shuffle fetches
• Large number of small, fragmented I/O requests
• Adversarial workload for hard drives!
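A quick back-of-the-envelope calculation (the shuffle volume and task counts below are illustrative, not measured Facebook numbers) shows how the fetch count grows with M * R and how the average fetch shrinks as a job is split into more tasks:

```python
# Illustrative only: 1 TB of intermediate shuffle data, mappers == reducers.
shuffle_bytes = 1e12

for tasks in (1_000, 5_000, 10_000):
    mappers = reducers = tasks
    fetches = mappers * reducers              # every reducer fetches from every mapper
    avg_fetch_kb = shuffle_bytes / fetches / 1e3
    print(f"{tasks:>6} tasks -> {fetches:>13,} fetches, ~{avg_fetch_kb:7.0f} KB per fetch")
```

Splitting the same data across 10x more tasks multiplies the request count by 100x and divides the average fetch size by 100x, which is the quadratic blow-up the charts above show.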
6. Strawman: tune number of tasks in a job
• Tasks spill intermediate data to disk when their data splits exceed memory capacity
• Fewer, larger tasks reduce shuffle I/O, but increase spill I/O
7. Strawman: tune number of tasks in a job
• Need to retune when input data volume changes for each individual job
• Small tasks run into the quadratic I/O problem
• Bulky tasks can be detrimental [Dolly NSDI 13] [SparkPerf NSDI 15] [Monotask SOSP 17]
• straggler problems, imbalanced workload, garbage collection overhead
[Chart: Shuffle and Spill time (sec) vs. Number of Map Tasks (300–10,000)]
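The chart above captures the tension: adding map tasks shrinks each task until it fits in memory (less spill) but multiplies the number of shuffle fetches. The toy cost model below reproduces that shape only; every constant (per-request overhead, reducer count, memory per task, disk bandwidth) is invented and does not match the measured curves:

```python
REDUCERS = 10_000           # fixed reducer count (invented)
SEEK_SEC = 1e-5             # per-fetch overhead, ~0.01 ms (invented)
INPUT_BYTES = 1e12          # 1 TB of map input (invented)
MEM_PER_TASK = 5e8          # 500 MB usable memory per task (invented)
DISK_BW = 1e8               # 100 MB/s effective disk bandwidth (invented)

def shuffle_overhead_sec(map_tasks: int) -> float:
    # Total per-request overhead across all M * R shuffle fetches.
    return map_tasks * REDUCERS * SEEK_SEC

def spill_overhead_sec(map_tasks: int) -> float:
    # Bytes that no longer fit in memory get written to disk and re-read once.
    per_task = INPUT_BYTES / map_tasks
    spilled = max(0.0, per_task - MEM_PER_TASK) * map_tasks
    return 2 * spilled / DISK_BW

for m in (300, 1_000, 2_000, 10_000):
    print(f"{m:>6} map tasks: shuffle ~{shuffle_overhead_sec(m):7.0f} s, "
          f"spill ~{spill_overhead_sec(m):7.0f} s")
```

With few tasks the spill term dominates; with many tasks spilling disappears but shuffle request overhead keeps growing, and the sweet spot moves whenever the input volume changes, which is why per-job tuning does not hold up.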
18. SOS + zstd
• Rollout includes zstd compression with SOS
• Combined, they produce a net gain in I/O and compute efficiency

                    SOS          zstd              Net
Spill I/O           Regression   Gain              Small Gain
Shuffle I/O         Gain         Small Gain        Gain

                    SOS          zstd              Net
CPU time            No change    Small Regression  Small Regression
Reserved CPU time   Gain         No change         Gain
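Per the first table, zstd's compression gain is what offsets the spill-side regression from SOS. A minimal, standalone sketch of the idea using the Python zstandard package (the block contents and compression level are illustrative; in Spark 2.3+ the equivalent switch is spark.io.compression.codec=zstd):

```python
import zstandard as zstd

# Illustrative shuffle block: 4 MiB of fairly repetitive records.
block = b"user_id,page_id,timestamp,click\n" * 131072

cctx = zstd.ZstdCompressor(level=3)   # level 3 is a common speed/ratio trade-off
dctx = zstd.ZstdDecompressor()

compressed = cctx.compress(block)
assert dctx.decompress(compressed) == block   # lossless round-trip

print(f"raw:        {len(block):>9,} bytes")
print(f"compressed: {len(compressed):>9,} bytes "
      f"({len(block) / len(compressed):.1f}x smaller, i.e. that much less spill/shuffle I/O)")
```

The extra compression work is where the small CPU-time regression in the second table comes from; the bytes saved on spill and shuffle are where the I/O gain comes from.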
26. I/O Gains: Disk-level
• Disk service time (time spent on disks in the storage system): 2x more efficient
• Average I/O size (average size of an I/O request at the disks): 2.5x increase
28. Compute Gains
Reserved CPU time: resources allocated for Spark executors → Total 10% Gain
• CPU time: time spent using CPU → 5% Regression
• I/O time: time spent waiting (not using CPU) → 75% Gain
Currently working on increasing these gains
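The slides do not state how reserved CPU time splits between CPU time and I/O wait, but the three reported figures pin it down; the short calculation below solves for the split that makes them consistent (an inference, not a number from the deck):

```python
# Reported on slide 28: CPU time regresses ~5%, I/O wait improves ~75%,
# and reserved CPU time (the sum of both) improves ~10% overall.
cpu_factor = 1.05    # CPU time after SOS + zstd, relative to before
io_factor = 0.25     # I/O wait time after, relative to before
net_factor = 0.90    # reserved CPU time after, relative to before

# reserved_after = cpu_share * cpu_factor + (1 - cpu_share) * io_factor
cpu_share = (net_factor - io_factor) / (cpu_factor - io_factor)
print(f"implied CPU share of reserved time:      {cpu_share:.0%}")      # ~81%
print(f"implied I/O-wait share of reserved time: {1 - cpu_share:.0%}")  # ~19%
```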
30. Summary
1) Shuffle at large scale induces a large number of fragmented shuffle I/Os
2) SOS provides a solution to optimize these I/Os
3) SOS is deployed and running stably at Facebook scale
4) Observed gains of 2x more efficient I/O, which translates to 10% more efficient compute
5) Plan to contribute back to Apache Spark