Chenzhao Guo (Intel), Carson Wang (Intel)
Improving Apache Spark by Taking
Advantage of Disaggregated Architecture
#UnifiedDataAnalytics #SparkAISummit
About me
• Chenzhao Guo
• Big Data Software Engineer at Intel
• Contributor to Spark*, committer on OAP and HiBench
• Our group’s work: enabling and optimizing Spark* on an exascale (10^18) high-performance computer
*Other names and brands may be claimed as the property of others.
Agenda
• Limitations of default Spark* shuffle
• Enabling Spark* shuffle on disaggregated compute & storage
• New challenges & optimizations
• Performance benchmarks
• Where next
Pain point 1: performance
Shuffle drags down end-to-end performance
• Slow, due to random disk I/O
• May fail when a disk runs out of space or fails
[Figure: hard disk drive; its mechanical structure makes seek operations slow]
What if I upgrade the HDDs to larger-capacity disks or SSDs?
Pain point 2: easy management & cost efficiency
What if I upgrade the HDDs to larger-capacity disks or SSDs?
• Local disks may still run out of space or fail
• Scaling the storage resources out or up also affects the compute resources
• Scaling the storage to accommodate shuffle’s needs may leave storage resources under-utilized for other (e.g., CPU-intensive) workloads, which is cost-inefficient
Collocated compute & storage architecture
Limitations of default Spark shuffle
• Traditional collocated architecture
– Essence: tightly coupled compute & storage
– Cluster utilization: one of the two resources is under-utilized
– Scale out/up: hard to scale out or upgrade without affecting the other resource
– Software management: debugging an issue may be complicated
• ?
What brings us here?
• Challenges in enabling & optimizing Spark* on HPC:
– storage on Distributed Asynchronous Object Storage (DAOS)
– network on high-performance fabrics
– scalability issues when there are ~10,000 nodes
– intermediate (shuffle, etc.) storage on a disaggregated compute & storage architecture
• Limitations of default Spark* shuffle
Disaggregated architecture
• Compute and storage resources are separated and interconnected by a (usually high-speed) network
• Common in the HPC world and at large companies that own thousands of servers, for higher efficiency and easier management
[Figure: disaggregated architecture in HPC. A compute cabinet (CPU, memory, and accelerators) and a storage cabinet (NVMe SSDs, running DAOS) are connected by a fabric network]
Collocated vs. disaggregated compute & storage architecture
• Traditional collocated architecture
– Essence: tightly coupled compute & storage
– Cluster utilization: one of the two resources is under-utilized
– Scale out/up: hard to scale out or upgrade without affecting the other resource
– Software management: debugging an issue may be complicated
• Disaggregated compute & storage architecture
– Essence: decoupled compute & storage
– Cluster utilization: efficient for diverse workloads with different storage/compute needs
– Scale out/up: compute and storage scale independently, each according to its own needs
– Software management: each team only needs to care about its own cluster
– Overhead: extra network transfer
Network transfer becomes less of a concern over time
• Software:
– Efficient compression algorithms
– Columnar file formats
– Other optimizations toward less I/O
• Hardware:
– Network bandwidth is evolving fast
[Figure labels: I/O bottleneck, CPU bottleneck]
Targets Hadoop-compatible storage
• The Hadoop* FileSystem interface is general
• It supports globally accessible storage
• Let the distributed storage take care of replication, sufficient space, etc.
[Figure: in the compute cluster, default Spark local shuffle uses SortShuffleManager, whose writer writes to local disk through the Java File API and whose reader fetches through BlockManager (Netty) and reads through the Java File API. For remote shuffle, the component that would talk to the globally accessible Hadoop-compatible filesystem in the storage cluster is marked "?"]
A remote shuffle manager plugin: overview
• A pluggable custom shuffle manager that stores files in a globally accessible Hadoop-compatible filesystem
• The shuffle readers/writers talk to the storage system through the Hadoop* API
• The partitioning, sorting, and unsafe-optimization algorithms are based on those in the default SortShuffleManager
• Shuffle data/index files, spill files, and other temporary files are all kept in remote storage
[Figure: in each executor, the RemoteShuffleManager writer and reader use a Hadoop OutputStream and a Hadoop InputStream against a globally accessible Hadoop-compatible filesystem (HDFS, Lustre, DAOS FS, …)]
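A custom shuffle manager is selected through Spark's `spark.shuffle.manager` setting, and the jar that contains the class also has to be on the driver and executor classpath. The sketch below only assembles the `--conf` arguments one might pass to `spark-submit`; it does not start Spark. The class name and the `spark.example.shuffle.root` key are hypothetical placeholders, not names taken from the slides.

```python
# Sketch only: builds spark-submit arguments as plain strings.
# "spark.shuffle.manager" is a real Spark setting; the class name and the
# "spark.example.shuffle.root" key are hypothetical placeholders.

def shuffle_conf_args(manager_class, extra):
    """Return spark-submit --conf arguments that select a custom shuffle manager."""
    conf = {"spark.shuffle.manager": manager_class}
    conf.update(extra)
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args

ARGS = shuffle_conf_args(
    "com.example.shuffle.RemoteShuffleManager",              # hypothetical class
    {"spark.example.shuffle.root": "hdfs://namenode:9000"},  # hypothetical key
)
```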
More customization details
• Shuffle writers/readers talk to storage through the Hadoop* API
– Local disk: fos = new FileOutputStream(file, true)
– Globally accessible Hadoop-compatible filesystem: fsdos = fs.create(file)
• Shuffle readers derive the data/index file paths from applicationId, mapId, and reduceId, then load the files directly from remote storage without help from any metadata keeper (MapOutputTracker)
• Certain map tasks need not be resubmitted when an executor fails, because the shuffle blocks are still there and can be retrieved
[Figure: a reader that needs a shuffle block reads it by its derived path]
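The derived-path idea can be sketched as follows. The file names follow Spark's `shuffle_<shuffleId>_<mapId>_<reduceId>` block naming, and the index layout (a run of 8-byte big-endian cumulative offsets) follows the default sort shuffle. The root directory, the per-application directory layout, the explicit shuffle ID, and the helper names are assumptions for illustration, and a local temporary directory stands in for the remote filesystem.

```python
import os
import struct
import tempfile

def derive_paths(root, app_id, shuffle_id, map_id):
    """Derive data/index paths from IDs alone (hypothetical directory layout)."""
    base = os.path.join(root, app_id, f"shuffle_{shuffle_id}_{map_id}_0")
    return base + ".data", base + ".index"

def write_map_output(root, app_id, shuffle_id, map_id, partitions):
    """Write one data file plus an index of cumulative offsets."""
    data_path, index_path = derive_paths(root, app_id, shuffle_id, map_id)
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    offsets = [0]
    with open(data_path, "wb") as data:
        for block in partitions:
            data.write(block)
            offsets.append(offsets[-1] + len(block))
    with open(index_path, "wb") as index:
        index.write(struct.pack(f">{len(offsets)}q", *offsets))

def read_block(root, app_id, shuffle_id, map_id, reduce_id):
    """Reader side: no metadata service, just recompute the path and seek."""
    data_path, index_path = derive_paths(root, app_id, shuffle_id, map_id)
    with open(index_path, "rb") as index:
        index.seek(8 * reduce_id)
        start, end = struct.unpack(">2q", index.read(16))
    with open(data_path, "rb") as data:
        data.seek(start)
        return data.read(end - start)

with tempfile.TemporaryDirectory() as root:
    write_map_output(root, "app-0001", 0, 3, [b"aa", b"bbbb", b"c"])
    BLOCK = read_block(root, "app-0001", 0, 3, 1)  # partition 1 of map task 3
```

Because the reader recomputes the path from the IDs, a failed executor does not take the location of its map output with it.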
A remote shuffle I/O plugin
• No need to maintain the full shuffle algorithms (sort, partition, and aggregation logic); the plugin only cares about shuffle storage
• Based on the new shuffle abstraction layer introduced in SPARK-25299
[Figure: before, non-shuffle facilities plus a pluggable shuffle manager (the sort shuffle manager). After, non-shuffle facilities plus the sort shuffle manager (non-storage part) with a pluggable shuffle I/O layer for customized storage]
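The split between shuffle logic and shuffle storage can be illustrated with a toy model. This is not the SPARK-25299 API: the `ShuffleStorage` interface and both classes are hypothetical names that only show how the partitioning and sorting code can stay fixed while the storage backend is swapped.

```python
from abc import ABC, abstractmethod

class ShuffleStorage(ABC):
    """Hypothetical storage-only interface: the pluggable part."""

    @abstractmethod
    def put(self, map_id, partition_id, payload): ...

    @abstractmethod
    def get(self, map_id, partition_id): ...

class InMemoryStorage(ShuffleStorage):
    """Stand-in backend; a real one would write to a remote filesystem."""

    def __init__(self):
        self._blocks = {}

    def put(self, map_id, partition_id, payload):
        self._blocks[(map_id, partition_id)] = payload

    def get(self, map_id, partition_id):
        return self._blocks[(map_id, partition_id)]

def write_map_task(storage, map_id, records, num_partitions):
    """Shuffle logic (partition + sort) that does not know where bytes land."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[key % num_partitions].append((key, value))
    for pid, bucket in enumerate(buckets):
        storage.put(map_id, pid, sorted(bucket))

STORE = InMemoryStorage()
write_map_task(STORE, 0, [(5, "e"), (2, "b"), (3, "c"), (4, "d")], 2)
```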
What have we offered so far?
A ShuffleManager (or shuffle I/O plugin) that
• shuffles to globally accessible Hadoop-compatible storage
• is pluggable and easy to deploy
• is naturally unaffected by executor failures, because shuffle metadata is not managed locally but derived globally
• is unaffected by disk failures, because the shuffle data in storage can be replicated: reliable
• is easy to scale and cost-efficient, because it takes advantage of the disaggregated compute and storage architecture
New challenges
• Repeatedly reading small index files from disk is inefficient
• Seek operations on data files in HDFS* are inefficient: they read more data than expected over the network
Optimization: index cache
Map stage:
[Figure: map tasks in executor process 0 write index and data files to storage (HDFS index files)]
Index files are small; however, some Hadoop* storage systems, such as HDFS*, are not designed for small-file I/O
Optimization: index cache
Reduce stage:
[Figure: index files are read from HDFS into an index cache inside an executor; reduce tasks in executor processes 0 and 1 fetch index entries from the index cache and read the data files directly]
• Saves small disk I/Os in the storage cluster
• Saves small network I/Os between the compute and storage clusters
• Utilizes the network inside the compute cluster
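A minimal sketch of the index-cache idea: the first request for a map output's index reads it from (simulated) remote storage, and later requests are served from memory. The class name, the loader callback, and the LRU capacity are illustrative assumptions; the slides do not describe the cache's actual structure or eviction policy.

```python
from collections import OrderedDict

class IndexCache:
    """Keeps recently used index entries in memory (hypothetical design, LRU eviction)."""

    def __init__(self, load_remote, capacity=1024):
        self._load_remote = load_remote   # callback that reads an index file remotely
        self._capacity = capacity
        self._entries = OrderedDict()
        self.remote_reads = 0

    def offsets(self, shuffle_id, map_id):
        key = (shuffle_id, map_id)
        if key in self._entries:
            self._entries.move_to_end(key)       # mark as most recently used
            return self._entries[key]
        self.remote_reads += 1
        value = self._load_remote(shuffle_id, map_id)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)    # evict the least recently used entry
        return value

# Simulated remote index files: cumulative partition offsets per map output.
REMOTE = {(0, 0): [0, 10, 25], (0, 1): [0, 4, 9]}
CACHE = IndexCache(lambda s, m: REMOTE[(s, m)], capacity=8)

# Three reducers ask for the same map output's index; only one remote read happens.
RANGES = [CACHE.offsets(0, 0)[1:3] for _ in range(3)]
```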
Fall back to reading from remote storage
Reduce stage:
[Figure: reduce tasks in executor processes 0 and 1 fetch index entries from the index cache and read the data files directly; one reduce task instead reads the index files directly from storage]
• When the index cache is not available on certain executors (because of failure or network issues), fall back to reading index files directly from remote storage, just as if the index cache feature had never been enabled
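The fallback path can be sketched as a try/except around the cache fetch. The function names and the use of `ConnectionError` to signal an unreachable cache are assumptions for illustration.

```python
def fetch_index(fetch_from_cache, read_from_storage, shuffle_id, map_id):
    """Prefer the index cache; on failure behave as if the cache did not exist."""
    try:
        return fetch_from_cache(shuffle_id, map_id), "cache"
    except (ConnectionError, KeyError):
        return read_from_storage(shuffle_id, map_id), "storage"

def healthy_cache(shuffle_id, map_id):
    return [0, 10, 25]

def broken_cache(shuffle_id, map_id):
    raise ConnectionError("executor hosting the index cache is unreachable")

def remote_storage(shuffle_id, map_id):
    return [0, 10, 25]

OK = fetch_index(healthy_cache, remote_storage, 0, 0)
FALLBACK = fetch_index(broken_cache, remote_storage, 0, 0)
```

Both calls return the same offsets; only the source differs, so the reduce task itself does not need to know which path was taken.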
Solutions for inefficient HDFS* seek
• A hash-based shuffle manager without any seek operations
• New block-merging logic to eliminate seek operations
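One way to read "block merging" is to coalesce adjacent byte ranges of the same data file so that they are fetched with a single sequential read instead of one seek per block. The sketch below illustrates that interpretation only; the slides do not spell out the actual merging logic, and the function name is hypothetical.

```python
def merge_ranges(ranges):
    """Coalesce (offset, length) ranges that touch or overlap into larger reads."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_offset, last_length = merged[-1]
            end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, end - last_offset)
        else:
            merged.append((offset, length))
    return merged

# Bytes 0-10, 10-25, and 25-30 are contiguous; 100-120 is separate.
READS = merge_ranges([(10, 15), (0, 10), (25, 5), (100, 20)])
```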
Shuffle over Lustre*
• Dataset: 4 GB dataset & 100 GB dataset (~50 GB shuffle)
• Workload: SQL inner join
• Software: Spark* 2.4.3; Hadoop* 2.7.3; baseline: 64 GB RAM disk | remote shuffle: Lustre* (remote)
• Hardware: 1 master + 4 workers; 40 cores, Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 128 GB RAM; 10Gb Ethernet
Comparison: end-to-end time
[Bar chart: end-to-end time, axis from 0 to 100, unit not given on the slide]
• local shuffle to RAM disk: 54
• remote shuffle: 87
• remote shuffle + index cache: 55
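The chart's three values can be turned into relative figures with simple arithmetic. The numbers are those on the slide; the slide gives no unit, so only ratios are computed, and the variable names are chosen here for illustration.

```python
# End-to-end times read off the chart (unit not stated on the slide).
local_ram_disk = 54
remote = 87
remote_with_cache = 55

# Slowdown of each remote variant relative to the local RAM-disk baseline.
remote_slowdown = remote / local_ram_disk
cached_slowdown = remote_with_cache / local_ram_disk

# Share of the remote shuffle time removed by the index cache.
cache_saving = (remote - remote_with_cache) / remote

print(f"remote: {remote_slowdown:.2f}x, with cache: {cached_slowdown:.2f}x, "
      f"saving: {cache_saving:.0%}")
```

This works out to roughly a 1.61x slowdown for plain remote shuffle, about 1.02x with the index cache, and about 37% of the remote shuffle time removed by the cache.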
Where next
• Future work: a purely hash-based shuffle path that resembles the hash-based shuffle of earlier Spark* versions
• Disaggregated architecture can benefit more than just Spark* shuffle
• Towards a fully disaggregated world: CPU, memory, and disks are all disaggregated and connected through the network
Q&A
Legal Disclaimer
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular
purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change
without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications.
Current characterized errata are available on request. No product or component can be absolutely secure.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by
visiting www.intel.com/design/literature.htm.
Intel, the Intel logo, [List the Intel trademarks in your document] are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other
countries.
*Other names and brands may be claimed as the property of others
© Intel Corporation.
Legal Disclaimer
Software and workloads used in performance tests may have been optimized
for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and
performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
For more information go to http://www.intel.com/performance.

Más contenido relacionado

La actualidad más candente

How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkIlya Ganelin
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)SolarWinds
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLScyllaDB
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introductionAlexey Grigorev
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergWalaa Eldin Moustafa
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMariaDB plc
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdfAmit Raj
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Databricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsDatabricks
 

La actualidad más candente (20)

How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and Iceberg
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at Facebook
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 

Similar a Improving Apache Spark by Taking Advantage of Disaggregated Architecture

Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...inside-BigData.com
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAlluxio, Inc.
 
Breaking IO Performance Barriers: Scalable Parallel File System for AWS
Breaking IO Performance Barriers: Scalable Parallel File System for AWSBreaking IO Performance Barriers: Scalable Parallel File System for AWS
Breaking IO Performance Barriers: Scalable Parallel File System for AWSAmazon Web Services
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
 
Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Ontico
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.MaharajothiP
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentationAmrut Patil
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFSUSE Italy
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideDouglas Bernardini
 
What's new in hadoop 3.0
What's new in hadoop 3.0What's new in hadoop 3.0
What's new in hadoop 3.0Heiko Loewe
 

Similar a Improving Apache Spark by Taking Advantage of Disaggregated Architecture (20)

Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
 
Breaking IO Performance Barriers: Scalable Parallel File System for AWS
Breaking IO Performance Barriers: Scalable Parallel File System for AWSBreaking IO Performance Barriers: Scalable Parallel File System for AWS
Breaking IO Performance Barriers: Scalable Parallel File System for AWS
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
 
Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentation
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config Guide
 
What's new in hadoop 3.0
What's new in hadoop 3.0What's new in hadoop 3.0
What's new in hadoop 3.0
 

Improving Apache Spark by Taking Advantage of Disaggregated Architecture

  • 1. Chenzhao Guo(Intel) Carson Wang(Intel) Improving Apache Spark by Taking Advantage of Disaggregated Architecture #UnifiedDataAnalytics #SparkAISummit
  • 2. About me • Chenzhao Guo • Big Data Software Engineer at Intel • Contributor to Spark*, committer of OAP and HiBench • Our group’s work: enabling and optimizing Spark* on an exascale (10^18) high-performance computer *Other names and brands may be claimed as the property of others.
  • 3. Agenda • Limitations on default Spark* shuffle • Enabling Spark* shuffle on disaggregated compute & storage • New challenges & optimizations • Performance benchmarks • Where next
  • 5. Pain point 1: performance. Shuffle drags down end-to-end performance • slow due to random disk I/O • may fail when no space is left or a disk fails. The mechanical structure of an HDD makes seek operations slow. What if I upgrade the HDDs to larger capacity or SSDs?
  • 6. Pain point 2: easy management & cost efficiency • Still, local disks may run out of space or fail • Scaling out/up the storage resources affects the compute resources • Scaling the storage to accommodate shuffle’s needs may leave storage resources under-utilized for other (e.g. CPU-intensive) workloads: cost inefficient. What if I upgrade the HDDs to larger capacity or SSDs? Collocated compute & storage architecture
  • 7. Limitations on default Spark shuffle. Traditional collocated architecture • Essence: tightly coupled compute & storage • Cluster utilization: one of the 2 resources is under-utilized • Scale out/up: hard to scale out/upgrade without affecting the other resource • Software management: debugging an issue may be complicated ?
  • 8. What brings us here? • Challenges to resolve when enabling & optimizing Spark* on HPC: – storage on Distributed Asynchronous Object Storage (DAOS) – network on high-performance fabrics – scalability issues when there are ~10000 nodes – intermediate (shuffle, etc.) storage on disaggregated compute & storage architecture • Limitations on default Spark* shuffle
  • 9. Agenda • Limitations on default Spark* shuffle • Enabling Spark* shuffle on disaggregated compute & storage • New challenges & optimizations • Performance benchmarks • Where next
  • 10. Disaggregated architecture • Compute and storage resources are separated and interconnected by a (usually high-speed) network • Common in the HPC world and at large companies that own thousands of servers, for higher efficiency and easier management. Diagram (disaggregated architecture in HPC): compute cabinet with CPU, memory and accelerators; fabrics network; storage cabinet with NVMe SSDs, running DAOS
  • 11. Collocated vs disaggregated compute & storage architecture. Traditional collocated architecture • Essence: tightly coupled compute & storage • Cluster utilization: one of the 2 resources is under-utilized • Scale out/up: hard to scale out/upgrade without affecting the other resource • Software management: debugging an issue may be complicated ? Disaggregated compute & storage architecture • Essence: decoupled compute & storage • Cluster utilization: efficient for diverse workloads with different storage/compute needs • Scale out/up: compute and storage scale independently, each according to its own needs • Software management: each team only needs to care about its own cluster • Overhead: extra network transfer
  • 12. Network transfer is not a big deal as time evolves • Software: – Efficient compression algorithms – Columnar file formats – Other optimizations toward less I/O • Hardware: – Network bandwidth is evolving fast. Diagram labels: I/O bottleneck, CPU bottleneck
  • 13. Targets • The Hadoop* Filesystem interface is general • Support globally accessible storage • Let the distributed storage take care of replication, sufficient space, etc. Diagram, default Spark local shuffle (compute cluster): SortShuffleManager writer and reader; Java File API write to local disk; BlockManager Netty fetch; Java File API read from local disk. Diagram, Hadoop-compatible storage remote shuffle (storage cluster): globally accessible Hadoop-compatible filesystem ?
  • 14. A remote shuffle manager plugin: overview • A pluggable custom shuffle manager that stores files in a globally accessible Hadoop-compatible filesystem • The shuffle readers/writers talk to the storage system through the Hadoop* API • The algorithms for partitioning, sorting, unsafe optimizations, etc. are based on the ones in the default SortShuffleManager • Shuffle data/index files, spill files and other temporary files are all kept in remote storage. Diagram: RemoteShuffleManager writer and reader; Hadoop OutputStream and Hadoop InputStream; globally accessible Hadoop-compatible filesystem (HDFS, Lustre, DAOS FS, …)
  • 15. More customization details • Shuffle writers/readers talk to storage through the Hadoop* API: fos = new FileOutputStream(file, true) for local disk, fsdos = fs.create(file) for the globally accessible Hadoop-compatible filesystem • Shuffle readers derive the data/index file path from applicationId, mapId and reduceId, then load the files directly from remote storage without the help of any metadata keeper (MapOutputTracker) • Avoids resubmitting certain map tasks when an executor fails, because the shuffle blocks are still there and can be retrieved. Diagram: an executor that needs a shuffle block reads it by the derived path
  • 16. A remote shuffle I/O plugin • No need to maintain the full shuffle algorithms, including the sort, partition and aggregation logic; the plugin only cares about shuffle storage • Based on the new shuffle abstraction layer introduced in SPARK-25299. Diagram labels, pluggable shuffle manager: non-shuffle facilities; sort shuffle manager. Diagram labels, pluggable shuffle I/O: non-shuffle facilities; sort shuffle manager (non-storage part); customized storage
  • 17. What have we offered until now? A ShuffleManager (or shuffle I/O plugin) that • shuffles to globally accessible Hadoop-compatible storage • is pluggable and easy to deploy • is naturally unaffected by executor failures, because shuffle metadata is not managed locally but derived globally • is unaffected by disk failures, because the shuffle data in storage can be replicated: reliable • is easy to scale and cost-efficient, because it takes advantage of the disaggregated compute and storage architecture
  • 18. Agenda • Limitations on default Spark* shuffle • Enabling Spark* shuffle on disaggregated compute & storage • New challenges & optimizations • Performance benchmarks • Where next
  • 19. New challenges • Repeatedly reading small index files from disk is inefficient • Seek operations on data files in HDFS* are inefficient and read more data than expected over the network
  • 20. Optimization: index cache. Map stage: map tasks in an executor process write index & data files to storage (HDFS). Index files are small; however, some Hadoop* storage systems such as HDFS* are not designed for small-file I/O
  • 21. Optimization: index cache. Reduce stage: index files are read from storage (HDFS) into an index cache; reduce tasks in the executor processes fetch index entries from the index cache and read the data files directly • Saves small disk I/Os in the storage cluster • Saves small network I/Os between the compute and storage clusters • Utilizes the network inside the compute cluster
  • 22. Fall back to reading from remote storage • When the index cache is not available to certain executors (failure or network issues), they fall back to reading index files directly from remote storage, just as if the index cache feature had never been enabled. Diagram (reduce stage): reduce tasks either fetch from the index cache or read index files directly from HDFS, and read data files directly in both cases
  • 23. Solutions for inefficient HDFS* seek • A hash-based shuffle manager without any seek operations • New block-merging logic to eliminate seek operations
  • 24. Agenda • Limitations on default Spark* shuffle • Enabling Spark* shuffle on disaggregated compute & storage • New challenges & optimizations • Performance benchmarks • Where next
  • 25. Shuffle over Lustre* • Dataset: 4 GB dataset & 100 GB dataset (~50 GB shuffle) • Workload: SQL internal join • Software: Spark* 2.4.3, Hadoop* 2.7.3; baseline: 64 GB RAM disk; remote shuffle: Lustre* (remote) • Hardware: 1 master + 4 workers; 40 cores; Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz; 128 GB RAM; 10 Gb Ethernet
  • 26. Comparison: end-to-end time (bar chart, unit not given in the transcript) • local shuffle to RAM disk: 54 • remote shuffle: 87 • remote shuffle + index cache: 55
  • 27. Agenda • Limitations on default Spark* shuffle • Enabling Spark* shuffle on disaggregated compute & storage • New challenges & optimizations • Performance benchmarks • Where next
  • 28. Where next • Future work: a purely hash-based shuffle path that resembles the hash-based shuffle from Spark* history • Disaggregated architecture can benefit more than just Spark* shuffle • Towards a fully disaggregated world – CPU, memory and disks are all disaggregated and connected through the network
  • 31. Legal Disclaimer No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request. No product or component can be absolutely secure. Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm. Intel, the Intel logo, [List the Intel trademarks in your document] are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others © Intel Corporation.
  • 32. Legal Disclaimer Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
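
Slide 15 says that shuffle readers derive the data/index file path from the application, map and reduce identifiers and then read the block directly, without asking a metadata keeper. The sketch below illustrates that idea in plain Python. The directory layout, the `SHUFFLE_ROOT` constant, the `shuffle_id` parameter and both function names are hypothetical and not taken from the slides. The index layout (one big-endian 8-byte cumulative offset per partition boundary) is assumed here to mirror the one used by Spark's default sort shuffle; the talk does not confirm it for the remote plugin.

```python
import struct

# Hypothetical root directory on the globally accessible filesystem.
SHUFFLE_ROOT = "/shuffle"


def shuffle_file_paths(app_id: str, shuffle_id: int, map_id: int):
    """Derive the data and index file paths of one map task's output.

    The naming scheme is an assumption for illustration; the point is that
    any reader can compute it from identifiers it already knows.
    """
    base = f"{SHUFFLE_ROOT}/{app_id}/shuffle_{shuffle_id}_{map_id}"
    return base + ".data", base + ".index"


def block_range(index_bytes: bytes, reduce_id: int):
    """Return (offset, length) of one reduce partition in the data file.

    Assumes the index holds num_partitions + 1 big-endian 8-byte
    cumulative offsets, so partition i spans offsets[i]..offsets[i + 1].
    """
    start, end = struct.unpack_from(">qq", index_bytes, reduce_id * 8)
    return start, end - start


# Example: an index describing three partitions of 10, 0 and 25 bytes.
example_index = struct.pack(">4q", 0, 10, 10, 35)
data_path, index_path = shuffle_file_paths("app-1", 0, 3)
offset, length = block_range(example_index, 2)
```

With paths that can be computed this way, a reducer only needs the shared filesystem to locate a block, which is why a failed executor does not by itself make the map output unreachable.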