The document summarizes the SOS technique for optimizing shuffle I/O in distributed computing frameworks. SOS merges small intermediate data files from map tasks into larger files to reduce the number and fragmentation of shuffle fetch requests. When deployed at Facebook scale, SOS reduced shuffle I/O by 7.5x and disk service time by 2x, while increasing average I/O size by 2.5x. These I/O optimizations translated to an overall 10% reduction in reserved CPU time for jobs.
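As a rough illustration of the core idea, the toy Python sketch below (not the actual SOS implementation; the file layout and the merge_map_outputs helper are invented for the example) concatenates the per-reducer segments of many small map-output files into one larger file per reducer partition, so each reducer issues a single large read instead of one small read per mapper:

```python
import os
from typing import Dict, List

def merge_map_outputs(map_files: List[str], num_partitions: int, out_dir: str) -> Dict[int, str]:
    """Toy merge: gather each mapper's segment for reducer partition r
    into one larger file per partition.

    Assumed toy layout: every map-output file holds exactly `num_partitions`
    newline-delimited segments, segment r destined for reducer r (this is not
    Spark's real index/data file format).
    """
    os.makedirs(out_dir, exist_ok=True)
    merged_paths = {r: os.path.join(out_dir, f"partition_{r}.merged")
                    for r in range(num_partitions)}
    sinks = {r: open(path, "wb") for r, path in merged_paths.items()}
    try:
        for map_file in map_files:
            with open(map_file, "rb") as src:
                segments = src.read().split(b"\n")
            for r in range(num_partitions):
                sinks[r].write(segments[r] + b"\n")
    finally:
        for sink in sinks.values():
            sink.close()
    # Before merging: each reducer issues len(map_files) small fetches.
    # After merging:  each reducer issues one large read from its merged file.
    return merged_paths
```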
4. Large-scale shuffle
[Diagram: all-to-all shuffle between Stage 1 and Stage 2]
• Shuffle: all-to-all communication between stages
• Shuffle data is >10x larger than available memory; strong fault-tolerance requirements
→ on-disk shuffle files
5. Shuffle I/O grows quadratically with data
[Charts: Shuffle Time (sec) and I/O Request Count (millions) vs. Number of Tasks (0–10,000); Shuffle Fetch Size (KB) vs. Number of Tasks (0–10,000)]
• M * R (number of mappers * number of reducers) shuffle fetches
• Large number of small, fragmented I/O requests
• Adversarial workload for hard drives!
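A quick back-of-the-envelope calculation (the shuffle volume and task counts below are illustrative, not measured Facebook numbers) shows how the fetch count grows with M * R and how the average fetch shrinks as a job is split into more tasks:

```python
# Illustrative only: 1 TB of intermediate shuffle data, mappers == reducers.
shuffle_bytes = 1e12

for tasks in (1_000, 5_000, 10_000):
    mappers = reducers = tasks
    fetches = mappers * reducers              # every reducer fetches from every mapper
    avg_fetch_kb = shuffle_bytes / fetches / 1e3
    print(f"{tasks:>6} tasks -> {fetches:>13,} fetches, ~{avg_fetch_kb:7.0f} KB per fetch")
```

Splitting the same data across 10x more tasks multiplies the request count by 100x and divides the average fetch size by 100x, which is the quadratic blow-up the charts above show.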
6. Strawman: tune number of tasks in a job
• Tasks spill intermediate data to disk when their data splits exceed memory capacity
• Fewer, larger tasks reduce shuffle I/O, but increase spill I/O
7. Strawman: tune number of tasks in a job
• Need to retune when input data volume changes for each individual job
• Small tasks run into the quadratic I/O problem
• Bulky tasks can be detrimental [Dolly NSDI 13] [SparkPerf NSDI 15] [Monotask SOSP 17]
• straggler problems, imbalanced workload, garbage collection overhead
[Chart: Shuffle and Spill time (sec) vs. Number of Map Tasks (300–10,000)]
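The chart above captures the tension: adding map tasks shrinks each task until it fits in memory (less spill) but multiplies the number of shuffle fetches. The toy cost model below reproduces that shape only; every constant (per-request overhead, reducer count, memory per task, disk bandwidth) is invented and does not match the measured curves:

```python
REDUCERS = 10_000           # fixed reducer count (invented)
SEEK_SEC = 1e-5             # per-fetch overhead, ~0.01 ms (invented)
INPUT_BYTES = 1e12          # 1 TB of map input (invented)
MEM_PER_TASK = 5e8          # 500 MB usable memory per task (invented)
DISK_BW = 1e8               # 100 MB/s effective disk bandwidth (invented)

def shuffle_overhead_sec(map_tasks: int) -> float:
    # Total per-request overhead across all M * R shuffle fetches.
    return map_tasks * REDUCERS * SEEK_SEC

def spill_overhead_sec(map_tasks: int) -> float:
    # Bytes that no longer fit in memory get written to disk and re-read once.
    per_task = INPUT_BYTES / map_tasks
    spilled = max(0.0, per_task - MEM_PER_TASK) * map_tasks
    return 2 * spilled / DISK_BW

for m in (300, 1_000, 2_000, 10_000):
    print(f"{m:>6} map tasks: shuffle ~{shuffle_overhead_sec(m):7.0f} s, "
          f"spill ~{spill_overhead_sec(m):7.0f} s")
```

With few tasks the spill term dominates; with many tasks spilling disappears but shuffle request overhead keeps growing, and the sweet spot moves whenever the input volume changes, which is why per-job tuning does not hold up.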
18. SOS + zstd
• Rollout includes zstd compression with SOS
• Combined, they produce a net gain in I/O and compute efficiency

                    SOS          zstd              Net
Spill I/O           Regression   Gain              Small Gain
Shuffle I/O         Gain         Small Gain        Gain

                    SOS          zstd              Net
CPU time            No change    Small Regression  Small Regression
Reserved CPU time   Gain         No change         Gain
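Per the first table, zstd's compression gain is what offsets the spill-side regression from SOS. A minimal, standalone sketch of the idea using the Python zstandard package (the block contents and compression level are illustrative; in Spark 2.3+ the equivalent switch is spark.io.compression.codec=zstd):

```python
import zstandard as zstd

# Illustrative shuffle block: 4 MiB of fairly repetitive records.
block = b"user_id,page_id,timestamp,click\n" * 131072

cctx = zstd.ZstdCompressor(level=3)   # level 3 is a common speed/ratio trade-off
dctx = zstd.ZstdDecompressor()

compressed = cctx.compress(block)
assert dctx.decompress(compressed) == block   # lossless round-trip

print(f"raw:        {len(block):>9,} bytes")
print(f"compressed: {len(compressed):>9,} bytes "
      f"({len(block) / len(compressed):.1f}x smaller, i.e. that much less spill/shuffle I/O)")
```

The extra compression work is where the small CPU-time regression in the second table comes from; the bytes saved on spill and shuffle are where the I/O gain comes from.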
26. I/O Gains: Disk-level
• Disk service time (time spent on disks in the storage system): 2x more efficient
• Average I/O size (average size of an I/O request at the disks): 2.5x increase
28. Compute Gains
Reserved CPU time: resources allocated for Spark executors → Total 10% Gain
• CPU time: time spent using CPU → 5% Regression
• I/O time: time spent waiting (not using CPU) → 75% Gain
Currently working on increasing these gains
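The slides do not state how reserved CPU time splits between CPU time and I/O wait, but the three reported figures pin it down; the short calculation below solves for the split that makes them consistent (an inference, not a number from the deck):

```python
# Reported on slide 28: CPU time regresses ~5%, I/O wait improves ~75%,
# and reserved CPU time (the sum of both) improves ~10% overall.
cpu_factor = 1.05    # CPU time after SOS + zstd, relative to before
io_factor = 0.25     # I/O wait time after, relative to before
net_factor = 0.90    # reserved CPU time after, relative to before

# reserved_after = cpu_share * cpu_factor + (1 - cpu_share) * io_factor
cpu_share = (net_factor - io_factor) / (cpu_factor - io_factor)
print(f"implied CPU share of reserved time:      {cpu_share:.0%}")      # ~81%
print(f"implied I/O-wait share of reserved time: {1 - cpu_share:.0%}")  # ~19%
```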
30. Summary
1) Shuffle at large scale induces a large number of fragmented shuffle I/Os
2) SOS provides a solution to optimize these I/Os
3) SOS is deployed and running stably at Facebook scale
4) Observed gains of 2x more efficient I/O, which translates to 10% more efficient compute
5) Plan to contribute back to Apache Spark