Shuffle in Apache Spark is an intermediate phase that redistributes data across computing units, and it rests on one important primitive: shuffle data is persisted on local disks. This architecture suffers from scalability and reliability issues. Moreover, the assumption of collocated storage does not always hold in today's data centers, where the hardware trend is moving toward a disaggregated storage and compute architecture for better cost efficiency and scalability. To address the issues of Spark shuffle and support a disaggregated storage and compute architecture, we implemented a new remote Spark shuffle manager. This new architecture writes shuffle data to a remote cluster with different Hadoop-compatible filesystem backends. Firstly, the failure of compute nodes no longer causes shuffle data recomputation, and Spark executors can be allocated and recycled dynamically, which results in better resource utilization. Secondly, for most customers currently running Spark with collocated storage, it is usually challenging to upgrade the disks on every node to the latest hardware, such as NVMe SSDs and persistent memory, because of cost considerations and system compatibility. With this new shuffle manager, they are free to build a separate cluster that stores and serves the shuffle data, leveraging the latest hardware to improve performance and reliability. Thirdly, in the HPC world, more customers are trying Spark as their high-performance data analytics tool, while storage and compute in HPC clusters are typically disaggregated; this work will make their lives easier. In this talk, we will present an overview of the issues in the current Spark shuffle implementation, the design of the new remote shuffle manager, and a performance study of the work.
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
1. Chenzhao Guo (Intel), Carson Wang (Intel)
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
#UnifiedDataAnalytics #SparkAISummit
2. About me
• Chenzhao Guo
• Big Data Software Engineer at Intel
• Contributor of Spark*, committer of OAP and HiBench
• Our group’s work: enabling and optimizing Spark* on an
exascale (10^18) High Performance Computer
*Other names and brands may be claimed as the property of others.
3. Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
5. Pain point 1: performance
Shuffle drags down the end-to-end performance:
• slow due to random disk I/Os
• may fail due to no space left or disk failure
The mechanical structure of an HDD performs slow seek operations.
What if I upgrade the HDDs to larger capacity or SSDs?
6. Pain point 2: easy management & cost efficiency
What if I upgrade the HDDs to larger capacity or SSDs?
• Still, local disks may run out of space or fail
• Scaling out/up the storage resources affects the compute resources
• Scaling the storage to accommodate shuffle's needs may leave storage resources under-utilized for other (e.g. CPU-intensive) workloads: cost inefficient
[Diagram: collocated compute & storage architecture]
7. Limitations on default Spark shuffle

| | Essence | Cluster utilization | Scale out/up | Software management |
| --- | --- | --- | --- | --- |
| Traditional collocated architecture | Tightly coupled compute & storage | One of the two resources is under-utilized | Hard to scale out or upgrade without affecting the other resource | Debugging an issue may be complicated |
| ? | | | | |
8. What brings us here?
• Challenges to resolve when enabling & optimizing Spark* on HPC:
– storage on Distributed Asynchronous Object Storage (DAOS)
– networking on high-performance fabrics
– scalability issues when there are ~10000 nodes
– intermediate (shuffle, etc.) storage on a disaggregated compute & storage architecture
• Limitations on default Spark* shuffle
9. Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
10. Disaggregated architecture
• Compute and storage resources are separated, interconnected by a (usually) high-speed network
• Common in the HPC world and in large companies that own thousands of servers, for higher efficiency and easier management
[Diagram: disaggregated architecture in HPC — a compute cabinet with CPUs, memory and accelerators, connected over a fabrics network to a storage cabinet with NVMe SSDs running DAOS]
11. Collocated vs disaggregated compute & storage architecture

| | Essence | Cluster utilization | Scale out/up | Software management | Overhead |
| --- | --- | --- | --- | --- | --- |
| Traditional collocated architecture | Tightly coupled compute & storage | One of the two resources is under-utilized | Hard to scale out or upgrade without affecting the other resource | Debugging an issue may be complicated | ? |
| Disaggregated compute & storage architecture | Decoupled compute & storage | Efficient for diverse workloads with different storage/compute needs | Independent scaling per compute & storage needs | Each team only needs to care about its own cluster | Extra network transfer |
12. Network transfer is less of a problem as time evolves
• Software:
– Efficient compression algorithms
– Columnar file formats
– Other optimizations toward less I/O
• Hardware:
– Network bandwidth is evolving fast
[Chart: I/O bottleneck vs. CPU bottleneck]
13. Remote shuffle targets Hadoop-compatible storage
Default Spark local shuffle:
[Diagram: SortShuffleManager — the Writer persists shuffle data to local disks via the Java File API; the Reader fetches blocks through BlockManager/Netty and reads them back via the Java File API]
Remote shuffle:
[Diagram: the compute cluster shuffles to a storage cluster that exposes a globally accessible Hadoop-compatible filesystem]
• The Hadoop* Filesystem interface is general
• Supports globally accessible storage
• Lets the distributed storage take care of replication, free space, etc.
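Since the shuffle readers/writers only need "create a stream" and "open a stream" from the storage layer, any Hadoop-compatible filesystem can back them. Below is a minimal sketch of that idea in plain Java; the `ShuffleFs` interface and the in-memory backend are illustrative assumptions that mimic a tiny slice of `org.apache.hadoop.fs.FileSystem`, not the plugin's actual classes.

```java
import java.io.*;
import java.util.*;

// Stand-in for a tiny slice of the Hadoop Filesystem API (illustrative names):
// the shuffle manager only needs "create a stream" and "open a stream".
interface ShuffleFs {
    OutputStream create(String path) throws IOException;
    InputStream open(String path) throws IOException;
}

// In-memory backend, for illustration only; a real deployment would back this
// with HDFS, Lustre or DAOS FS through org.apache.hadoop.fs.FileSystem.
class InMemoryFs implements ShuffleFs {
    private final Map<String, byte[]> files = new HashMap<>();

    public OutputStream create(String path) {
        // Buffer writes; publish the file content when the stream is closed.
        return new ByteArrayOutputStream() {
            @Override public void close() { files.put(path, toByteArray()); }
        };
    }

    public InputStream open(String path) throws IOException {
        byte[] data = files.get(path);
        if (data == null) throw new FileNotFoundException(path);
        return new ByteArrayInputStream(data);
    }
}
```

Swapping the backend (HDFS, Lustre, DAOS FS) then amounts to providing a different `ShuffleFs` implementation, which is what the general Hadoop Filesystem interface gives for free.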
14. A remote shuffle manager plugin: overview
[Diagram: RemoteShuffleManager — its Writer and Reader use Hadoop OutputStream/InputStream against a globally accessible Hadoop-compatible filesystem backed by HDFS, Lustre, DAOS FS…]
• A pluggable custom shuffle manager storing files in a globally accessible Hadoop-compatible filesystem
• The shuffle readers/writers talk to the storage system via the Hadoop* API
• The algorithms for partitioning, sorting, unsafe optimizations, etc. are based on the ones in the default SortShuffleManager
• Shuffle data/index files, spills and other temporary files are all kept in remote storage
15. More customization details
• Shuffle writers/readers talk to storage via the Hadoop* API: instead of fos = new FileOutputStream(file, true) against a local disk, an executor calls fsdos = fs.create(file) against the globally accessible Hadoop-compatible filesystem
• Shuffle readers derive the data/index file paths from the applicationId, mapId and reduceId, then directly load the files from remote storage without any metadata keeper's help (MapOutputTracker)
• Resubmitting certain map tasks when an executor fails is avoided, because the shuffle blocks are still there and can be retrieved
[Diagram: an executor that needs a shuffle block reads it from remote storage by the derived path]
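The "derived path" bullet above can be sketched as a pure function: a reader rebuilds the remote file name from IDs it already has, so no MapOutputTracker round-trip is needed (the reduce partition is then located via offsets in the index file). The root directory and naming scheme below are illustrative assumptions, not the plugin's actual layout.

```java
// Sketch: shuffle file paths derived purely from IDs the reader already knows.
// The directory layout and naming here are illustrative, not the actual
// scheme used by the remote shuffle manager.
class ShufflePaths {
    static String dataFile(String root, String appId, int shuffleId, int mapId) {
        return String.format("%s/%s/shuffle_%d_%d_0.data", root, appId, shuffleId, mapId);
    }
    static String indexFile(String root, String appId, int shuffleId, int mapId) {
        return String.format("%s/%s/shuffle_%d_%d_0.index", root, appId, shuffleId, mapId);
    }
}
```

Because every executor can compute these names independently, a surviving reducer can fetch a failed executor's map output directly from remote storage.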
16. A remote shuffle I/O plugin
• No need to maintain the full shuffle algorithms, including the sort, partition and aggregation logic; the plugin only cares about shuffle storage
• Based on the new shuffle abstraction layer introduced in SPARK-25299
[Diagram: before — a pluggable shuffle manager replaces the whole sort shuffle manager next to the non-shuffle facilities; after — the non-storage part of the sort shuffle manager stays, and only a pluggable shuffle I/O layer talks to the customized storage]
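The split that SPARK-25299 enables can be illustrated with a toy interface: the engine keeps its sort/partition logic, calls openPartition per reduce partition, and calls commit to learn the per-partition lengths for the index, while the implementation decides where the bytes actually go. The names below are simplified stand-ins, not Spark's actual ShuffleDataIO API.

```java
import java.io.*;

// Simplified stand-ins for the SPARK-25299 idea (not Spark's actual API):
// the engine keeps sort/partition logic; only the per-partition byte sink
// is pluggable.
interface MapOutputWriter {
    OutputStream openPartition(int reduceId);
    long[] commit();  // bytes written per partition, used to build the index
}

// Illustrative in-memory sink; a remote implementation would hand back
// output streams backed by a Hadoop-compatible filesystem instead.
class InMemoryMapOutputWriter implements MapOutputWriter {
    private final ByteArrayOutputStream[] parts;

    InMemoryMapOutputWriter(int numPartitions) {
        parts = new ByteArrayOutputStream[numPartitions];
        for (int i = 0; i < numPartitions; i++) parts[i] = new ByteArrayOutputStream();
    }
    public OutputStream openPartition(int reduceId) { return parts[reduceId]; }
    public long[] commit() {
        long[] lengths = new long[parts.length];
        for (int i = 0; i < parts.length; i++) lengths[i] = parts[i].size();
        return lengths;
    }
}
```

The design point is that the storage backend never needs to understand sorting or aggregation; it only sinks bytes and reports lengths.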
17. What have we offered so far?
A ShuffleManager (and shuffle I/O plugin) that
• shuffles to globally accessible Hadoop-compatible storage
• is pluggable and easy to deploy
• naturally survives executor failures, because shuffle metadata is not managed locally but derived globally
• survives disk failures, because the shuffle data in storage can be replicated: reliable
• is easy to scale and cost-efficient, by taking advantage of the disaggregated compute and storage architecture
18. Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
19. New challenges
• Repeatedly reading small index files from disks is inefficient
• Seek operations on data files in HDFS* are inefficient, reading more data than expected through the network
20. Optimization: index cache
Map stage: map tasks write their index & data files to storage.
[Diagram: map tasks in executor process 0 writing index and data files to an HDFS index file]
Index files are small; however, some Hadoop* storage systems like HDFS* are not designed for small-file I/O.
21. Optimization: index cache
Reduce stage:
[Diagram: executor process 0 reads the index file into its index cache; reduce tasks in executor process 1 fetch from the index cache, then directly read the data files]
• Saves small disk I/Os in the storage cluster
• Saves small network I/Os between the compute and storage clusters
• Utilizes the network inside the compute cluster
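The caching step above can be sketched as a memoized loader: the first request for an index file pays one remote read, and later requests are served from executor memory. The loader function stands in for the remote storage read; all names here are illustrative, not the plugin's actual classes.

```java
import java.util.concurrent.*;
import java.util.function.*;

// Sketch of the index cache: offsets from a remote index file are loaded once
// per executor and then served from memory. The loader stands in for the
// remote storage read; the names are illustrative.
class IndexCache {
    private final ConcurrentMap<String, long[]> cache = new ConcurrentHashMap<>();
    private final Function<String, long[]> remoteLoader;

    IndexCache(Function<String, long[]> remoteLoader) { this.remoteLoader = remoteLoader; }

    // Cache hit: served from executor memory. Miss: one remote read, then cached.
    long[] offsets(String indexPath) {
        return cache.computeIfAbsent(indexPath, remoteLoader);
    }
}
```

The fallback behavior described on the next slide drops out of the same shape: if the cache-holding executor is unreachable, a reader simply invokes the remote load itself.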
22. Fall back to reading from remote storage
Reduce stage:
[Diagram: reduce tasks either fetch from the index cache or directly read the index files from remote storage, then directly read the data files]
• When the index cache is not available in certain executors (because of failures or network issues), fall back to directly reading the index files from remote storage, just as if the index cache feature had never been enabled
23. Solutions for inefficient HDFS* seeks
• A hash-based shuffle manager without any seek operations
• New logic for merging blocks to eliminate seek operations
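The block-merging idea can be sketched as range coalescing: when a reducer needs several byte ranges from the same remote data file, contiguous ranges are merged so that one sequential read replaces a seek per block. This is a minimal illustrative version under that assumption, not the plugin's actual merge logic.

```java
import java.util.*;

// Sketch of block merging: coalesce contiguous [start, end) byte ranges so
// one sequential read replaces a seek per block. Illustrative only.
class BlockMerger {
    // ranges must be sorted by start offset
    static List<long[]> merge(List<long[]> ranges) {
        List<long[]> out = new ArrayList<>();
        for (long[] r : ranges) {
            if (!out.isEmpty() && out.get(out.size() - 1)[1] == r[0]) {
                out.get(out.size() - 1)[1] = r[1];  // contiguous: extend previous range
            } else {
                out.add(new long[]{r[0], r[1]});    // gap: start a new range
            }
        }
        return out;
    }
}
```

Fewer, larger reads suit HDFS* well, since each seek on a remote file otherwise risks pulling more data over the network than the reducer actually needs.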
24. Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
25. Shuffle over Lustre*

| Dataset | 4 GB dataset & 100 GB dataset (~50 GB shuffle) |
| --- | --- |
| Workload | SQL internal join |
| Software | Spark* 2.4.3; Hadoop* 2.7.3; baseline: 64 GB RAM disk; remote shuffle: Lustre* (remote) |
| Hardware | 1 master + 4 workers; 40 cores Intel(R) Xeon(R) Gold 6138 CPU @ 2.00 GHz; 128 GB RAM; 10 Gb Ethernet |
27. Agenda
• Limitations on default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
28. Where next
• Future work: a purely hash-based shuffle path that resembles the hash-based shuffle in Spark*'s history
• The disaggregated architecture can benefit more than just Spark* shuffle
• Towards a fully disaggregated world, where CPU, memory and disks are all disaggregated and connected through the network
32. Legal Disclaimer
Software and workloads used in performance tests may have been optimized
for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and
performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
For more information go to http://www.intel.com/performance.