The number of daily Apache Spark applications at LinkedIn has increased by 3X in the past year. The shuffle process alone, one of the most costly operations in batch computation, processes PBs of data and billions of blocks daily in our clusters. With such a rapid increase in Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workload efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to random reads of small shuffle blocks on HDDs.
To tackle those challenges and optimize shuffle performance in Apache Spark, we have developed the Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet has been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency compared with the existing pull-based shuffle. In addition, by combining push-based shuffle and pull-based shuffle, we show how the Magnet shuffle service helps to harden shuffle infrastructure at LinkedIn scale by both reducing shuffle-related failures and removing scaling bottlenecks. Furthermore, we will share our experiences productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.
2. Agenda
§ Spark @ LinkedIn
§ Introducing Magnet
§ Production Results
§ Future Work
3. Spark @ LinkedIn
A massive-scale infrastructure:
§ 11K nodes
§ 40K+ daily Spark apps
§ ~70% of cluster compute resources
§ ~18 PB daily shuffle data
§ 3X+ YoY growth
4. Shuffle Basics – Spark Shuffle
§ Transfers intermediate data across stages with a mesh connection between mappers and reducers.
§ The shuffle operation is a scaling and performance bottleneck.
§ Issues are especially visible at large scale.
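The mesh connection above can be sketched in plain Python (a toy model, not Spark's actual code): each of M mappers hash-partitions its output into one block per reducer, and each of R reducers fetches its block from every mapper, giving M x R block transfers per shuffle.

```python
from collections import defaultdict

def map_side_shuffle(records, num_reducers):
    """Each mapper hash-partitions its records into one block per reducer."""
    blocks = defaultdict(list)
    for key, value in records:
        blocks[hash(key) % num_reducers].append((key, value))
    return blocks

def reduce_side_fetch(all_mapper_blocks, reducer_id):
    """A reducer fetches its block from every mapper: M fetches per reducer,
    M x R block transfers in total across the stage (the 'mesh')."""
    fetched = []
    for mapper_blocks in all_mapper_blocks:
        fetched.extend(mapper_blocks.get(reducer_id, []))
    return fetched

# Toy run: 3 mappers, 2 reducers.
mapper_outputs = [
    map_side_shuffle([("a", 1), ("b", 2)], 2),
    map_side_shuffle([("a", 3), ("c", 4)], 2),
    map_side_shuffle([("b", 5)], 2),
]
partition_0 = reduce_side_fetch(mapper_outputs, 0)
```

At production scale, M and R each reach into the thousands, so the number of individual block transfers grows multiplicatively, which is why this pattern becomes a bottleneck.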
5. Issues with Spark Shuffle at LinkedIn
• Reliability issue: the shuffle service can become unavailable under heavy load.
• Scalability issue: job runtime can be very inconsistent during peak cluster hours due to shuffle.
• Efficiency issue: small shuffle blocks hurt disk throughput, which prolongs shuffle wait time.
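The efficiency issue comes down to HDD mechanics: every small random block read pays a seek before any bytes flow. A back-of-the-envelope model (assumed numbers: 8 ms average seek, 100 MB/s sequential bandwidth; illustrative only, not LinkedIn's measured figures) shows why block size dominates:

```python
def hdd_read_seconds(total_bytes, block_bytes, seek_s=0.008, bandwidth_bps=100e6):
    """Toy HDD model: one seek per block read, then a sequential transfer."""
    num_blocks = total_bytes / block_bytes
    return num_blocks * seek_s + total_bytes / bandwidth_bps

GIB = 1 << 30
small_blocks = hdd_read_seconds(GIB, 100 * 1024)  # ~10k seeks: seek time dominates
large_blocks = hdd_read_seconds(GIB, 16 << 20)    # 64 seeks: bandwidth dominates
```

Merging many small reads into a few large sequential reads moves shuffle from the seek-dominated regime to the bandwidth-dominated one, which is exactly the lever Magnet pulls.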
6. Existing Optimizations Not Sufficient
• Broadcast join can reduce shuffle during join.
• Requires one of the tables to fit in memory.
• Caching RDD/DF after shuffling them can potentially reduce shuffle.
• Has performance implications and limited applicability.
• Bucketing is like caching, materializing DFs while preserving partitioning.
• Much more performant, but still has limited applicability and requires manual setup effort.
• Adaptive Query Execution in Spark 3.0 with its auto-coalescing feature can optimize shuffle.
• Slightly increases shuffle block size, but is not sufficient to address the three issues.
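As a reminder of why broadcast join avoids shuffle, and why it needs the small side in memory, here is a minimal pure-Python sketch of a broadcast hash join (illustrative only, not PySpark API code):

```python
def broadcast_hash_join(large_rows, small_rows):
    """Sketch of a broadcast join: the small table is materialized as an
    in-memory hash map and shipped to every task, so the large side needs
    no shuffle. Only viable when small_rows fits in executor memory."""
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    for key, left in large_rows:           # stream the large side, no repartition
        for right in lookup.get(key, []):
            yield (key, left, right)

# Toy run: events joined against a small dimension table.
joined = list(broadcast_hash_join(
    [("u1", "click"), ("u2", "view"), ("u1", "view")],
    [("u1", "US"), ("u2", "DE")],
))
```

When the small side does not fit in memory, the optimizer must fall back to a shuffle-based join, which is why broadcast join alone cannot eliminate the shuffle bottleneck.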
7. Magnet Shuffle Service
Magnet shuffle service adopts a push-based shuffle mechanism.
M. Shen, Y. Zhou, C. Singh. "Magnet: Push-based Shuffle Service for Large-scale Data Processing." Proceedings of the VLDB Endowment, 13(12), 2020.
Convert small random reads in shuffle to large sequential reads.
Shuffle intermediate data becomes 2-replicated, improving reliability.
Locality-aware scheduling of reducers, further improving performance.
9. Push-based Shuffle
§ Reducers fetch a hybrid of blocks.
▪ Fetch merged blocks to improve I/O efficiency.
▪ Fetch original blocks if not merged.
§ Improved shuffle data locality.
§ Merged shuffle files create a second replica of shuffle data.
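The merge step can be modeled in a few lines of Python (a toy model, not Magnet's implementation): mappers push their per-reducer blocks to a merger, which appends them into one merged block per reducer partition. Any block that fails to merge is simply fetched in its original form, which is the hybrid fetch described above.

```python
def push_and_merge(mapper_outputs, num_reducers):
    """Merger appends each pushed block onto the merged file for its reducer
    partition, turning M small random reads per reducer into one large
    sequential read."""
    merged = [[] for _ in range(num_reducers)]
    for blocks in mapper_outputs:              # blocks: {reducer_id: records}
        for reducer_id, records in blocks.items():
            merged[reducer_id].extend(records)
    return merged

# Toy run: 3 mappers, 2 reducer partitions.
mapper_outputs = [
    {0: ["a1"], 1: ["b1"]},
    {0: ["a2"]},
    {1: ["b2", "b3"]},
]
merged = push_and_merge(mapper_outputs, 2)
```

Because the merged copy exists alongside the original mapper output, shuffle data effectively becomes 2-replicated, which is where the reliability gain comes from.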
10. Magnet Results – Scalability and Reliability
Currently rolled out to 100% of offline Spark workloads at LinkedIn, which is about 15-18 PB of shuffle data per day.
We leveraged a ramp-based rollout to reduce risk and did not encounter any major issues during the rollout of Magnet.
11. Magnet Results – Improve Cluster Shuffle efficiency
• 30X increase in shuffle data fetched locally compared with 6 months ago.
[Chart: shuffle locality ratio over time]
13. Magnet Results – Improve Job Performance
Magnet brings performance improvements to Spark jobs at scale without any user intervention.
We analyzed 9628 Spark workflows that were onboarded to Magnet.
Overall compute resource consumption reduction is 16.3%.
Among flows previously heavily impacted by shuffle fetch delays (shuffle fetch delay time > 30% of total task time), overall compute resource consumption reduction is 44.8%. These flows are about 19.5% of the flows we analyzed.
14. Magnet Results – Improve Job Performance
• 50% of heavily impacted workflows have seen at least a 32% reduction in their job runtime.
• The percentile time-series graph represents thousands of Spark workflows.
[Chart: app duration reduction % at the 25th, 50th, and 75th percentiles]
15. Key Finding
• Benefits from Magnet increase as adoption of Magnet increases.
[Chart: app duration reduction % over time]
16. Magnet Results on NVMe
• Magnet still achieves significant benefits with NVMe disks for storing
shuffle data.
• Results of a benchmark job with and without Magnet on a 10-node cluster with HDD and NVMe disks, respectively (comparisons are against the HDD baseline):

                  Runtime with HDD (min)   Runtime with NVMe (min)
Magnet disabled   16 (baseline)            10 (-37.5%)
Magnet enabled    7.3 (-54.4%)             4.2 (-73.7%)
17. Future Work
§ Contribute Magnet back to OSS Spark: SPARK-30602
§ Cloud-native architecture for Magnet
§ Bring superior shuffle architecture to a broader set of use cases
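For readers who want to try the upstreamed version, the push-based shuffle work from SPARK-30602 shipped in OSS Spark 3.2, enabled through configuration roughly like the following (a sketch; exact property names and requirements vary by Spark version, so verify against the Spark docs for your release; `my_job.py` is a placeholder for your application):

```shell
# Push-based shuffle in OSS Spark 3.2+ requires YARN with the external
# shuffle service enabled on the node managers.
spark-submit \
  --master yarn \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.shuffle.push.enabled=true \
  my_job.py
```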
18. Optimize Shuffle on Disaggregated Storage
Existing shuffle data storage solutions and their drawbacks:
• Store shuffle data on limited local storage devices attached to VMs.
• Hampers VM elasticity and risks exhausting local storage capacity.
• Store shuffle data on remote disaggregated storage.
• Incurs non-negligible performance overhead.
19. Optimize Shuffle on Disaggregated Storage
§ Leverage both local and remote storage, such that the local storage acts as a caching layer for shuffle storage.
§ With elastic local storage, the system can tolerate running out of local storage space or compute VMs getting decommissioned.
§ Preserve Magnet's benefit of much-improved shuffle data locality while decoupling shuffle storage from compute VMs.
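The caching-layer idea can be sketched as a read-through cache (a hypothetical design sketch; `ShuffleBlockCache` and its methods are names invented here for illustration, not Magnet code):

```python
class ShuffleBlockCache:
    """Local storage as a best-effort cache in front of remote disaggregated
    storage: the remote copy is the source of truth, so losing the local
    copy only costs cache hits, never correctness."""

    def __init__(self, remote_store):
        self.remote = remote_store   # source of truth (disaggregated storage)
        self.local = {}              # stand-in for the local disk cache

    def read(self, block_id):
        if block_id in self.local:           # fast local read (good locality)
            return self.local[block_id]
        data = self.remote[block_id]         # fall back to remote storage
        self.local[block_id] = data          # populate cache for later reads
        return data

    def evict_all(self):
        """Local disk full or VM decommissioned: safe, remote copy survives."""
        self.local.clear()

# Toy run: remote store is authoritative; local cache can vanish safely.
cache = ShuffleBlockCache({"block-1": b"payload"})
first = cache.read("block-1")    # remote read, populates the local cache
cache.evict_all()                # e.g. the VM is decommissioned
second = cache.read("block-1")   # still served correctly from remote
```

The design choice here mirrors the bullets above: locality comes from the local layer, while durability and elasticity come from decoupling the authoritative copy onto remote storage.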
20. Optimize Python-centric Data Processing
The surge of AI usage in recent years has driven the adoption of Python for building AI data pipelines.
Magnet's optimization of shuffle is generic enough to benefit both SQL-centric analytics and Python-centric AI use cases.
Support Magnet in Kubernetes.
21. Resources
• SPARK-30602 (Spark SPIP ticket)
• Magnet: A scalable and performant shuffle architecture for Apache Spark (blog post)
• Magnet: Push-based Shuffle Service for Large-scale Data Processing (VLDB 2020)
• Bringing Next-Gen Shuffle Architecture To Data Infrastructure at LinkedIn Scale (blog post)