Best Practices for Using Alluxio with Spark
Haoyuan Li, Ancil McBarnett
Strata New York, Sept 2017
Confidential © Alluxio, Inc. All Rights Reserved.
Outline
1. Alluxio Overview
2. Alluxio + Spark Use Cases
3. Using Spark with Alluxio
4. Performance Evaluation
5. Demo
Data Ecosystem Yesterday
•  One Compute Framework
•  Single Storage System
•  Co-located
Data Ecosystem Today
•  Many Compute Frameworks
•  Multiple Storage Systems
•  Most not co-located
Data Ecosystem Issues
•  Each application manages multiple data sources
•  Adding/removing data sources requires application changes
•  Storage optimizations require application changes
•  Lower performance due to lack of locality
Data Ecosystem Challenges
1 Speed & Complexity
•  Numerous storage & compute systems
•  Integration and interoperability issues (on prem, hybrid, cloud)
•  Many departments & groups
2 Data Freshness
•  Cross-network movement is slow
•  Each ETL creates more lag
3 Cost
•  Data duplication
•  Data and App explosion driving cost up
4 Security & Governance
•  Data security & governance is increasingly complex
Heavy integrations create painful organizational drag
Data Ecosystem with Alluxio
•  Apps only talk to Alluxio
•  Simple Add/Remove
•  No App Changes
•  Highest performance in Memory
•  No Lock-in
Client-side interfaces: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System
Native storage interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface
Alluxio Design Principles
1 Big Data & Machine Learning
•  Interoperability with leading projects
•  Large scale data sets
•  High IO
2 Optimize Data Access
•  Remote data
•  Service-oriented & microservices
•  Hot/warm/cold data
•  Temporary data
3 Application Data Sharing
•  Multiple compute frameworks within a node or cluster
•  Shared storage
•  Read/write support
4 Enterprise Class
•  Distributed Architecture
•  Commodity Hardware
•  High Availability
•  Security
Alluxio Innovation: Server-side API Translation
Convert from Client-side Interface to Native Storage Interface
Client-side interface: HDFS Interface
Native storage interfaces: HDFS Interface, S3A Interface, Swift Interface, Google Cloud Interface
Alluxio Innovation: Server-side API Translation
Convert between different versions of HDFS
Client-side interface: HDFS 2.7 Interface
Native storage interfaces: HDP 2.4 Interface, CDH 5.6 Interface, MAPR 5.2 Interface
Alluxio Innovation: Unified Namespace
Enables effective data management across different Under Stores
Uses Mounting with Transparent Naming
Alluxio Innovation: Unified Namespace
Create a catalog of available data sources for Data Scientists
alluxio://
/finance/customer-transactions/
/finance/vendor-transactions/
/operations/device-logs/
/operations/phone-call-recordings/
/operations/check-images/
/research/us-economic-data/
/research/intl-economic-data/
/marketing/advertising-dataset/
/marketing/marketing-funnel-dataset/
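The mounting idea behind the unified namespace can be sketched as a small mount table that maps a namespace prefix to an under-store location and resolves each path by its longest matching prefix. This is only an illustrative sketch, not Alluxio's implementation: the `MountTable` class, the under-store URIs (`hdfs://namenode:8020/...`, `s3a://example-bucket/...`), and the choice of which catalog directory lives in which store are all assumptions made up for the example.

```python
from typing import Dict


class MountTable:
    """Hypothetical mount table: namespace prefix -> under-store URI."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, namespace_path: str, under_store_uri: str) -> None:
        # Normalize so that every prefix ends with exactly one slash.
        prefix = namespace_path.rstrip("/") + "/"
        self._mounts[prefix] = under_store_uri.rstrip("/") + "/"

    def resolve(self, path: str) -> str:
        # Pick the longest mounted prefix that contains the path.
        probe = path.rstrip("/") + "/"
        candidates = [p for p in self._mounts if probe.startswith(p)]
        if not candidates:
            raise KeyError(f"no mount point covers {path}")
        prefix = max(candidates, key=len)
        return self._mounts[prefix] + path[len(prefix):]


# Assumed layout: finance data in HDFS, research data in S3.
table = MountTable()
table.mount("/finance", "hdfs://namenode:8020/warehouse/finance")
table.mount("/research", "s3a://example-bucket/research")
resolved = table.resolve("/finance/customer-transactions/2017-09.csv")
```

The application only ever sees the `/finance/...` path; where the bytes actually live is a property of the mount, which is what lets a store be added or swapped without application changes.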
Alluxio Innovation: Intelligent Cache
Local performance from remote data using native multi-tier storage
RAM (hot), SSD (warm), HDD (cold)
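One way to picture multi-tier storage is a stack of fixed-size tiers in which a read promotes a block to the hottest tier and overflow pushes the least recently used block one tier down. The sketch below is a toy model under those assumptions; the `TieredCache` class, the LRU policy, and the capacities counted in blocks are hypothetical and are not taken from the slides or from Alluxio's actual eviction logic.

```python
from collections import OrderedDict
from typing import List, Optional


class TieredCache:
    """Hypothetical multi-tier block cache with LRU demotion."""

    def __init__(self, tier_names: List[str], capacities: List[int]) -> None:
        self.tier_names = tier_names
        self.capacities = capacities
        # One LRU-ordered dict per tier; the oldest entry comes first.
        self.tiers: List[OrderedDict] = [OrderedDict() for _ in tier_names]

    def _insert(self, level: int, block_id: str, data: bytes) -> None:
        if level >= len(self.tiers):
            return  # Fell off the coldest tier; the under store still has it.
        tier = self.tiers[level]
        tier[block_id] = data
        if len(tier) > self.capacities[level]:
            old_id, old_data = tier.popitem(last=False)
            self._insert(level + 1, old_id, old_data)  # demote the LRU block

    def _remove(self, block_id: str) -> Optional[bytes]:
        for tier in self.tiers:
            if block_id in tier:
                return tier.pop(block_id)
        return None

    def put(self, block_id: str, data: bytes) -> None:
        self._remove(block_id)
        self._insert(0, block_id, data)

    def get(self, block_id: str) -> Optional[bytes]:
        data = self._remove(block_id)
        if data is not None:
            self._insert(0, block_id, data)  # promote a hit to the hottest tier
        return data

    def tier_of(self, block_id: str) -> Optional[str]:
        for name, tier in zip(self.tier_names, self.tiers):
            if block_id in tier:
                return name
        return None


cache = TieredCache(["RAM", "SSD", "HDD"], [1, 1, 1])
cache.put("block 1", b"a")
cache.put("block 2", b"b")  # block 1 is demoted to SSD
cache.put("block 3", b"c")  # block 2 moves to SSD, block 1 to HDD
```

Reading "block 1" again would move it back to RAM and push the other two blocks down one tier each.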
Where to use Alluxio
Finding high-fit Alluxio use-cases
Compute Zone: standalone or managed with Mesos or YARN (Spark, TensorFlow, Presto)
Storage in Different Availability Zone: either on-prem or cloud (HDFS)
Alluxio is installed with or near compute to unify data stores, stage remote data, and improve system performance.
Guidelines
✓  Compute separated from storage
✓  Distributed compute
✓  I/O or network latency exists
✓  Unification of many storage systems
✓  Applications sharing long-lived data
More checks result in higher-fit applications
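The checklist above can be read as a simple count: the more boxes a workload ticks, the better the fit. A minimal sketch of that reading follows; the `Workload` dataclass, its field names, and the `fit_score` function are hypothetical names introduced for illustration, and the slides do not define any scoring formula beyond "more checks".

```python
from dataclasses import dataclass, fields


@dataclass
class Workload:
    """One boolean per guideline on the slide (field names are made up)."""
    compute_separated_from_storage: bool
    distributed_compute: bool
    io_or_network_latency: bool
    many_storage_systems: bool
    shares_long_lived_data: bool


def fit_score(workload: Workload) -> int:
    """Count how many of the five guideline checks the workload satisfies."""
    return sum(bool(getattr(workload, f.name)) for f in fields(workload))


# Example: a Spark cluster reading from a remote object store, single data source.
spark_on_remote_store = Workload(
    compute_separated_from_storage=True,
    distributed_compute=True,
    io_or_network_latency=True,
    many_storage_systems=False,
    shares_long_lived_data=False,
)
score = fit_score(spark_on_remote_store)
```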
Fastest Growing Big Data Open Source Projects
Fastest growing open-source project in the big data ecosystem
Running in large production clusters
500+ contributors from 100+ organizations
[Chart: GitHub Open Source Contributors by Month. Y axis: number of contributors (0 to 500); X axis: month (0 to 45). Series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive.]
2. Alluxio + Spark Use Cases
Big Data Case Study –
Challenge –
Gain end-to-end view of business with large volume of data.
Queries were slow / not interactive, resulting in operational inefficiency.
Solution –
ETL data from Teradata to Alluxio
Impact –
Faster time to market – “Now we don’t have to work Sundays”
http://bit.ly/2oMx95W
[Diagram: SPARK and TERADATA]
Big Data Case Study –
Challenge –
Gain end-to-end view of business with large volume of data.
Queries were slow / not interactive, resulting in operational inefficiency.
Solution –
With Alluxio, data queries are 30X faster
Impact –
Higher operational efficiency
http://bit.ly/2pDHS3O
[Diagram: SPARK and Baidu File System]
Big Data Case Study –
Challenge –
Gain end-to-end view of business with large volume of data for $5B Travel Site.
Queries were slow / not interactive, resulting in operational inefficiency.
Solution –
With Alluxio, 300x improvement in performance
Impact –
Increased revenue from immediate response to user behavior
Use case: http://bit.ly/2pDJdrq
[Diagram: SPARK and FLINK over HDFS and CEPH]
Machine Learning Case Study –
Challenge –
Disparate data both on-prem and in the cloud. Heterogeneous types of data.
Scaling to exabyte-size data.
Slow due to disk-based approach.
Solution –
Using Alluxio to prevent I/O bottlenecks
Impact –
Orders of magnitude higher performance than before.
http://bit.ly/2p18ds3
[Diagram: SPARK, MESOS, HDFS, MINIO]
3. Using Spark with Alluxio
Consolidating Memory
Storage Engine & Execution Engine: Same Process
•  Two copies of data in memory – double the memory used
•  Inter-process Sharing Slowed Down by Network / Disk I/O
[Diagram: two Spark processes (Spark Compute + Spark Storage), each holding its own copy of block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4.]
Consolidating Memory
Storage Engine & Execution Engine: Different Process
•  Half the memory used
•  Inter-process Sharing Happens at Memory Speed
[Diagram: two Spark processes (Spark Compute + Spark Storage) share blocks held in Alluxio (block 1, block 3, block 4); HDFS / Amazon S3 and HDFS disk hold blocks 1 to 4.]
Data Resilience During Crash
Storage Engine & Execution Engine: Same Process
•  Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram sequence: Spark Compute and Spark Storage (block 1, block 3) run in one process over HDFS / Amazon S3 (blocks 1 to 4); when the process crashes, Spark Storage and its cached blocks are lost along with Spark Compute.]
Data Resilience During Crash
Storage Engine & Execution Engine: Different Process
•  Process Crash – Data is Re-read at Memory Speed
[Diagram sequence: Spark Compute and Spark Storage run over Alluxio (block 1, block 3, block 4), backed by HDFS / Amazon S3 and HDFS disk (blocks 1 to 4); after the Spark process crashes, the blocks remain in Alluxio.]
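The difference between the two crash scenarios can be illustrated with a toy model that counts how often the slow under store is read. Everything in this sketch is hypothetical (`SlowStore`, `ExternalCache`, `ComputeProcess`, and the block names); it models only the argument on the slides and not how Spark or Alluxio manage blocks internally.

```python
from typing import Dict


class SlowStore:
    """Stands in for HDFS / Amazon S3; counts every read it serves."""

    def __init__(self, blocks: Dict[str, bytes]) -> None:
        self.blocks = dict(blocks)
        self.reads = 0

    def read(self, block_id: str) -> bytes:
        self.reads += 1
        return self.blocks[block_id]


class ExternalCache:
    """Stands in for a cache that lives outside the compute process."""

    def __init__(self, store: SlowStore) -> None:
        self.store = store
        self.blocks: Dict[str, bytes] = {}

    def read(self, block_id: str) -> bytes:
        if block_id not in self.blocks:
            self.blocks[block_id] = self.store.read(block_id)
        return self.blocks[block_id]


class ComputeProcess:
    """A compute process whose in-process cache dies with it."""

    def __init__(self, source) -> None:
        self.source = source
        self.local_cache: Dict[str, bytes] = {}

    def read(self, block_id: str) -> bytes:
        if block_id not in self.local_cache:
            self.local_cache[block_id] = self.source.read(block_id)
        return self.local_cache[block_id]


def run_with_crash(source) -> None:
    """Read two blocks, 'crash' (drop the process), restart, and read again."""
    process = ComputeProcess(source)
    for block_id in ("block 1", "block 3"):
        process.read(block_id)
    process = ComputeProcess(source)  # the crash discards the old local cache
    for block_id in ("block 1", "block 3"):
        process.read(block_id)


data = {"block 1": b"a", "block 3": b"c"}

same_process_store = SlowStore(data)
run_with_crash(same_process_store)                  # cache inside the process

separate_store = SlowStore(data)
run_with_crash(ExternalCache(separate_store))       # cache outside the process
```

In this model the same-process setup reads the slow store four times (twice before and twice after the crash), while the separate-cache setup reads it only twice.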
Accessing Alluxio Data From Spark
Writing Data: write to an Alluxio file
Reading Data: read from an Alluxio file
Code Example for Spark RDDs
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
Code Example for Spark DataFrames
Writing to Alluxio
df.write.parquet(alluxioPath)
Reading from Alluxio
df = spark.read.parquet(alluxioPath)
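For readers who want to see what `alluxioPath` might look like in practice, here is a small PySpark-flavored sketch. It assumes an Alluxio master reachable at `localhost` on port 19998 and the Alluxio client jar already on the Spark classpath; the helper names `alluxio_uri` and `parquet_round_trip` and the `/data/events` path are hypothetical. The functions take the session and DataFrame as arguments, so nothing here starts Spark on its own.

```python
def alluxio_uri(path: str, host: str = "localhost", port: int = 19998) -> str:
    """Build an alluxio:// URI for a path in the Alluxio namespace.

    The host and port defaults are assumptions for a local test setup.
    """
    return f"alluxio://{host}:{port}/{path.lstrip('/')}"


def parquet_round_trip(spark, df, uri: str):
    """Write a DataFrame to Alluxio as Parquet and read it back.

    `spark` is expected to be a pyspark.sql.SparkSession and `df` a
    pyspark.sql.DataFrame; both are supplied by the caller.
    """
    df.write.parquet(uri)
    return spark.read.parquet(uri)


# Hypothetical location inside the Alluxio namespace.
events_uri = alluxio_uri("/data/events")
```

With a running session, `parquet_round_trip(spark, df, events_uri)` would perform the same write and read shown on the slide, only with a concrete path.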
4. Performance Evaluation
Experiments
Spark 2.0.0 + Alluxio 1.2.0
Single worker: Amazon r3.2xlarge
Comparisons:
•  Alluxio
•  Spark Storage Level: MEMORY_ONLY
•  Spark Storage Level: MEMORY_ONLY_SER
•  Spark Storage Level: DISK_ONLY
Reading Cached RDD
[Chart: time [seconds] (0 to 250) vs. RDD size [GB] (0 to 50). Series: Alluxio (textFile), Alluxio (objectFile), DISK_ONLY, MEMORY_ONLY_SER, MEMORY_ONLY.]
New Context: Read 50 GB RDD (S3)
[Chart: time [seconds] (0 to 800) for Alluxio (textFile), Alluxio (objectFile), and No Alluxio. Annotations: 7x speedup, 16x speedup.]
Reading Cached DataFrame (parquet)
[Chart: time [seconds] (0 to 250) vs. DataFrame size [GB] (0 to 50). Series: Alluxio (textFile), MEMORY_ONLY_SER, MEMORY_ONLY.]
New Context: Read 50 GB DataFrame (S3)
[Chart: time [seconds] (0 to 1750) for Alluxio and No Alluxio.]
10x average speedup, 17x peak speedup
5. Demo
Demo Environment
Spark
Alluxio
Conclusion
Easy to use Alluxio with Spark
Predictable and improved performance
Easily connect to various storages
Website: www.alluxio.com
E-mail: info@alluxio.com
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio
Thank you!
Haoyuan Li: haoyuan@alluxio.com, Twitter: @haoyuan
Ancil McBarnett: ancil@alluxio.com, Twitter: @

Más contenido relacionado

La actualidad más candente

ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...Alluxio, Inc.
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkAlluxio, Inc.
 
Spark Pipelines in the Cloud with Alluxio
Spark Pipelines in the Cloud with AlluxioSpark Pipelines in the Cloud with Alluxio
Spark Pipelines in the Cloud with AlluxioAlluxio, Inc.
 
Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit
 
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio, Inc.
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio, Inc.
 
Open Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed StorageOpen Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed StorageAlluxio, Inc.
 
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed StorageAlluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed StorageAlluxio, Inc.
 
Running Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with AlluxioRunning Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with Alluxiothelabdude
 
The Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand ClustersThe Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand ClustersAlluxio, Inc.
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkAlluxio, Inc.
 
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016Alluxio, Inc.
 
Alluxio-FUSE as a data access layer for Dask
Alluxio-FUSE as a data access layer for DaskAlluxio-FUSE as a data access layer for Dask
Alluxio-FUSE as a data access layer for DaskAlluxio, Inc.
 
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioSecurely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioAlluxio, Inc.
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Alluxio, Inc.
 
Alluxio (formerly Tachyon): The Journey thus far and the Road Ahead
Alluxio (formerly Tachyon): The Journey thus far and the Road AheadAlluxio (formerly Tachyon): The Journey thus far and the Road Ahead
Alluxio (formerly Tachyon): The Journey thus far and the Road AheadAlluxio, Inc.
 
Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18Alluxio, Inc.
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...Alluxio, Inc.
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Jiří Šimša
 

La actualidad más candente (20)

ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
Spark Pipelines in the Cloud with Alluxio
Spark Pipelines in the Cloud with AlluxioSpark Pipelines in the Cloud with Alluxio
Spark Pipelines in the Cloud with Alluxio
 
Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri Simsa
 
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
 
Open Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed StorageOpen Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed Storage
 
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed StorageAlluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
 
Running Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with AlluxioRunning Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with Alluxio
 
The Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand ClustersThe Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand Clusters
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
 
Alluxio-FUSE as a data access layer for Dask
Alluxio-FUSE as a data access layer for DaskAlluxio-FUSE as a data access layer for Dask
Alluxio-FUSE as a data access layer for Dask
 
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioSecurely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio
 
Alluxio (formerly Tachyon): The Journey thus far and the Road Ahead
Alluxio (formerly Tachyon): The Journey thus far and the Road AheadAlluxio (formerly Tachyon): The Journey thus far and the Road Ahead
Alluxio (formerly Tachyon): The Journey thus far and the Road Ahead
 
Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016
 

Similar a Best Practices for Using Alluxio with Spark

The Architecture of Decoupling Compute and Storage with Alluxio
The Architecture of Decoupling Compute and Storage with AlluxioThe Architecture of Decoupling Compute and Storage with Alluxio
The Architecture of Decoupling Compute and Storage with AlluxioAlluxio, Inc.
 
Spark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin FanSpark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin FanData Con LA
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangSpark Summit
 
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Databricks
 
Unify Data at Memory Speed
Unify Data at Memory SpeedUnify Data at Memory Speed
Unify Data at Memory SpeedAlluxio, Inc.
 
Accelerating Spark Workloads in an Apache Mesos Environment with Alluxio
Accelerating Spark Workloads in an Apache Mesos Environment with AlluxioAccelerating Spark Workloads in an Apache Mesos Environment with Alluxio
Accelerating Spark Workloads in an Apache Mesos Environment with AlluxioAlluxio, Inc.
 
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsSimplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsAlluxio, Inc.
 
Alluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory SpeedAlluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory SpeedAlluxio, Inc.
 
Alluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio, Inc.
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsAlluxio, Inc.
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAlluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
 
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreMeetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreAlluxio, Inc.
 
Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaAlluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
 
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio, Inc.
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudAlluxio, Inc.
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAlluxio, Inc.
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsAlluxio, Inc.
 

Similar a Best Practices for Using Alluxio with Spark (20)

The Architecture of Decoupling Compute and Storage with Alluxio
The Architecture of Decoupling Compute and Storage with AlluxioThe Architecture of Decoupling Compute and Storage with Alluxio
The Architecture of Decoupling Compute and Storage with Alluxio
 
Spark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin FanSpark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin Fan
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
 
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
 
Data EcoSystem 2.0
Data EcoSystem 2.0Data EcoSystem 2.0
Data EcoSystem 2.0
 
Unify Data at Memory Speed
Unify Data at Memory SpeedUnify Data at Memory Speed
Unify Data at Memory Speed
 
Accelerating Spark Workloads in an Apache Mesos Environment with Alluxio
Accelerating Spark Workloads in an Apache Mesos Environment with AlluxioAccelerating Spark Workloads in an Apache Mesos Environment with Alluxio
Accelerating Spark Workloads in an Apache Mesos Environment with Alluxio
 
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsSimplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
 
Alluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory SpeedAlluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory Speed
 
Alluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle Meetup
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreMeetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
 
Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri Simsa
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and Cloud
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsArchitecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and Clouds
 

Best Practices for Using Alluxio with Spark

  • 1. Best Practices for Using Alluxio with Spark. Haoyuan Li, Ancil McBarnett. Strata New York, Sept 2017.
  • 2. Outline: (1) Alluxio Overview; (2) Alluxio + Spark Use Cases; (3) Using Spark with Alluxio; (4) Performance Evaluation; (5) Demo.
  • 3. Data Ecosystem Yesterday: one compute framework; a single storage system; compute and storage co-located.
  • 4. Data Ecosystem Today: many compute frameworks; multiple storage systems; most not co-located.
  • 5. Data Ecosystem Issues: each application manages multiple data sources; adding or removing data sources requires application changes; storage optimizations require application changes; lower performance due to lack of locality.
  • 6. Data Ecosystem Challenges: (1) Speed & Complexity: numerous storage and compute systems; integration and interoperability issues (on-prem, hybrid, cloud); many departments and groups. (2) Data Freshness: cross-network movement is slow; each ETL creates more lag. (3) Cost: data duplication; data and app explosion driving cost up. (4) Security & Governance: data security and governance are increasingly complex. Heavy integrations create painful organizational drag.
  • 7. Data Ecosystem with Alluxio: apps only talk to Alluxio; simple add/remove; no app changes; highest performance in memory; no lock-in. Interfaces shown: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System; HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface.
  • 8. Alluxio Design Principles: (1) Big Data & Machine Learning: interoperability with leading projects; large-scale data sets; high I/O. (2) Optimize Data Access: remote data; service-oriented and microservices; hot/warm/cold data; temporary data. (3) Application Data Sharing: multiple compute frameworks within a node or cluster; shared storage; read/write support. (4) Enterprise Class: distributed architecture; commodity hardware; high availability; security.
  • 9. Alluxio Innovation: Server-side API Translation. Convert from the client-side interface to the native storage interface. Interfaces shown: HDFS, HDFS, S3A, Swift, Google Cloud.
  • 10. Alluxio Innovation: Server-side API Translation. Convert between different versions of HDFS. Interfaces shown: HDFS 2.7, HDP 2.4, CDH 5.6, MapR 5.2.
  • 11. Alluxio Innovation: Unified Namespace. Enables effective data management across different under stores. Uses mounting with transparent naming.
  • 12. Alluxio Innovation: Unified Namespace. Create a catalog of available data sources for data scientists under alluxio://: /finance/customer-transactions/, /finance/vendor-transactions/, /operations/device-logs/, /operations/phone-call-recordings/, /operations/check-images/, /research/us-economic-data/, /research/intl-economic-data/, /marketing/advertising-dataset/, /marketing/marketing-funnel-dataset/.
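The mounting idea on slides 11 and 12 can be sketched as a small lookup table that maps namespace prefixes to under-store locations. This is only an illustration of prefix-based path resolution, not Alluxio's implementation; the class name `MountTable` and the under-store URIs (`hdfs://namenode:8020/...`, `s3a://example-bucket/...`) are hypothetical.

```python
from typing import Dict, Optional


class MountTable:
    """Toy mount table: maps namespace prefixes to under-store URIs (illustrative only)."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, alluxio_path: str, under_store_uri: str) -> None:
        # Normalise both sides so each ends with exactly one slash.
        key = alluxio_path.rstrip("/") + "/"
        self._mounts[key] = under_store_uri.rstrip("/") + "/"

    def resolve(self, alluxio_path: str) -> Optional[str]:
        # Longest-prefix match, so a nested mount wins over its parent.
        candidate = alluxio_path if alluxio_path.endswith("/") else alluxio_path + "/"
        matches = [m for m in self._mounts if candidate.startswith(m)]
        if not matches:
            return None
        best = max(matches, key=len)
        return self._mounts[best] + alluxio_path[len(best):]


# Hypothetical under stores behind two of the catalog paths from the slide.
table = MountTable()
table.mount("/finance", "hdfs://namenode:8020/warehouse/finance")
table.mount("/operations/device-logs", "s3a://example-bucket/device-logs")

print(table.resolve("/finance/customer-transactions/2017.csv"))
print(table.resolve("/operations/device-logs/2017/09/01.log"))
```

Applications keep using the same namespace path even if the location behind a mount point changes, which is the sense in which the naming is transparent.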
  • 13. Alluxio Innovation: Intelligent Cache. Local performance from remote data using native multi-tier storage: RAM (hot), SSD (warm), HDD (cold).
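To make the hot/warm/cold idea concrete, here is a minimal sketch of a multi-tier cache in which a block moves to the fastest tier when it is accessed and the least recently used block is pushed down one tier when a tier overflows. The policy, the class name `TieredCache`, and the tier capacities are assumptions chosen for illustration; the slides do not describe Alluxio's actual eviction or promotion policy.

```python
from collections import OrderedDict
from typing import Hashable, List, Optional, Tuple


class TieredCache:
    """Toy multi-tier cache: promote on access, demote LRU on overflow (illustrative only)."""

    def __init__(self, tiers: List[Tuple[str, int]]) -> None:
        # Tiers are listed fastest first, e.g. [("RAM", 2), ("SSD", 2), ("HDD", 4)].
        self._names = [name for name, _ in tiers]
        self._caps = [cap for _, cap in tiers]
        self._tiers = [OrderedDict() for _ in tiers]

    def tier_of(self, block_id: Hashable) -> Optional[str]:
        for name, tier in zip(self._names, self._tiers):
            if block_id in tier:
                return name
        return None  # not cached; it would have to come from the under store

    def access(self, block_id: Hashable) -> None:
        # Drop the block from whichever tier holds it, then place it in the top tier.
        for tier in self._tiers:
            tier.pop(block_id, None)
        self._insert(0, block_id)

    def _insert(self, level: int, block_id: Hashable) -> None:
        if level >= len(self._tiers):
            return  # fell out of the last tier: no longer cached
        tier = self._tiers[level]
        tier[block_id] = True
        if len(tier) > self._caps[level]:
            victim, _ = tier.popitem(last=False)  # least recently used entry
            self._insert(level + 1, victim)


cache = TieredCache([("RAM", 2), ("SSD", 2), ("HDD", 1)])
for block in (1, 2, 3):
    cache.access(block)
print(cache.tier_of(1), cache.tier_of(3))
```

After the three accesses, block 1 has been pushed from RAM to SSD while blocks 2 and 3 stay in RAM; accessing block 1 again would bring it back to RAM and push block 2 down.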
  • 14. Where to Use Alluxio: finding high-fit Alluxio use cases. Compute zone: standalone or managed with Mesos or YARN. Storage in a different availability zone: either on-prem or cloud. Alluxio is installed with or near compute to unify data stores, stage remote data, and improve system performance. Diagram labels: Spark, TensorFlow, Presto, HDFS. Guidelines (checklist): compute separated from storage; distributed compute; I/O or network latency exists; unification of many storage systems; applications sharing long-lived data. More checks result in higher-fit applications.
  • 15. Fastest Growing Big Data Open Source Projects: fastest-growing open-source project in the big data ecosystem; running in large production clusters; 500+ contributors from 100+ organizations. [Chart: GitHub open-source contributors by month; y-axis: number of contributors; series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive.]
  • 16. Outline: (1) Alluxio Overview; (2) Alluxio + Spark Use Cases; (3) Using Spark with Alluxio; (4) Performance Evaluation; (5) Demo.
  • 17. Big Data Case Study. Challenge: gain an end-to-end view of the business with a large volume of data; queries were slow and not interactive, resulting in operational inefficiency. Solution: ETL data from Teradata to Alluxio. Impact: faster time to market ("Now we don't have to work Sundays"). Diagram labels: Spark, Teradata. http://bit.ly/2oMx95W
  • 18. Big Data Case Study. Challenge: gain an end-to-end view of the business with a large volume of data; queries were slow and not interactive, resulting in operational inefficiency. Solution: with Alluxio, data queries are 30x faster. Impact: higher operational efficiency. Diagram labels: Spark, Baidu File System. http://bit.ly/2pDHS3O
  • 19. Big Data Case Study. Challenge: gain an end-to-end view of the business with a large volume of data for a $5B travel site; queries were slow and not interactive, resulting in operational inefficiency. Solution: with Alluxio, a 300x improvement in performance. Impact: increased revenue from immediate response to user behavior. Diagram labels: Spark, Flink, HDFS, Ceph. Use case: http://bit.ly/2pDJdrq
  • 20. Machine Learning Case Study. Challenge: disparate data both on-prem and in the cloud; heterogeneous types of data; scaling to exabyte-size data; slow due to a disk-based approach. Solution: using Alluxio to prevent I/O bottlenecks. Impact: orders of magnitude higher performance than before. Diagram labels: Spark, HDFS, Minio, Mesos. http://bit.ly/2p18ds3
  • 21. Outline: (1) Alluxio Overview; (2) Alluxio + Spark Use Cases; (3) Using Spark with Alluxio; (4) Performance Evaluation; (5) Demo.
  • 22. Consolidating Memory: storage engine and execution engine in the same process. Two copies of data in memory, doubling the memory used; inter-process sharing is slowed down by network/disk I/O. Diagram: two Spark processes (compute + storage), each holding blocks 1 and 3; HDFS / Amazon S3 holding blocks 1 to 4.
  • 23. Consolidating Memory: storage engine and execution engine in different processes. Half the memory used; inter-process sharing happens at memory speed. Diagram: two Spark processes (compute + storage); Alluxio holding blocks 1, 3, and 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1 to 4.
  • 24. Data Resilience During Crash: storage engine and execution engine in the same process. Diagram: Spark (compute + storage) holding blocks 1 and 3; HDFS / Amazon S3 holding blocks 1 to 4.
  • 25. Data Resilience During Crash: storage engine and execution engine in the same process. A process crash requires network and/or disk I/O to re-read data. Diagram: CRASH; Spark storage holding blocks 1 and 3; HDFS / Amazon S3 holding blocks 1 to 4.
  • 26. Data Resilience During Crash: storage engine and execution engine in the same process. A process crash requires network and/or disk I/O to re-read data. Diagram: CRASH; HDFS / Amazon S3 holding blocks 1 to 4.
  • 27. Data Resilience During Crash: storage engine and execution engine in different processes. Diagram: Spark (compute + storage); Alluxio holding blocks 1, 3, and 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1 to 4.
  • 28. Data Resilience During Crash: storage engine and execution engine in different processes. After a process crash, data is re-read at memory speed. Diagram: CRASH; Alluxio holding blocks 1, 3, and 4; HDFS / Amazon S3 (HDFS disk) holding blocks 1 to 4.
  • 29. Accessing Alluxio Data from Spark. Writing data: write to an Alluxio file. Reading data: read from an Alluxio file.
  • 30. Code Example for Spark RDDs. Writing an RDD to Alluxio: rdd.saveAsTextFile(alluxioPath); rdd.saveAsObjectFile(alluxioPath). Reading an RDD from Alluxio: rdd = sc.textFile(alluxioPath); rdd = sc.objectFile(alluxioPath).
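A slightly fuller version of the RDD calls on slide 30 might look like the PySpark sketch below. It assumes an existing SparkContext passed in as `sc`, an Alluxio master reachable at `localhost` on port 19998 (Alluxio's default master RPC port), and the Alluxio client jar on the Spark classpath. The helper names are hypothetical. The object-file calls on the slide (`saveAsObjectFile`, `objectFile`) belong to the Scala/Java API; the closest PySpark counterparts, used here, are `saveAsPickleFile` and `pickleFile`.

```python
def alluxio_uri(path: str, host: str = "localhost", port: int = 19998) -> str:
    """Build an alluxio:// URI; host and port are assumptions for a local setup."""
    return "alluxio://{}:{}/{}".format(host, port, path.lstrip("/"))


def write_then_read_text(sc, lines, uri):
    """Save an RDD of strings to Alluxio as text, then read it back as a new RDD."""
    # saveAsTextFile fails if the target directory already exists.
    sc.parallelize(lines).saveAsTextFile(uri)
    return sc.textFile(uri)


def write_then_read_pickle(sc, items, uri):
    """Save an RDD of Python objects in serialized form, then read it back."""
    sc.parallelize(items).saveAsPickleFile(uri)
    return sc.pickleFile(uri)


# Example call, given a live SparkContext `sc` and a running Alluxio cluster:
#   rdd = write_then_read_text(sc, ["a", "b"], alluxio_uri("/tmp/lines"))
print(alluxio_uri("/tmp/lines"))
```

Because the data is addressed by URI, switching between Alluxio and another Hadoop-compatible store only changes the path string, not the Spark calls.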
  • 31. Code Example for Spark DataFrames. Writing to Alluxio: df.write.parquet(alluxioPath). Reading from Alluxio: df = spark.read.parquet(alluxioPath).
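The DataFrame calls on slide 31 can be wrapped in a small round-trip helper, sketched here in PySpark. It assumes a Spark 2.x `SparkSession` passed in as `spark` and an `alluxio://` URI that the cluster can reach; the function name `parquet_round_trip` and the default `overwrite` save mode are choices made for this illustration, not something the slides specify.

```python
def parquet_round_trip(spark, df, uri, mode="overwrite"):
    """Write a DataFrame to a Parquet location (e.g. on Alluxio) and read it back."""
    # DataFrameWriter.mode() controls what happens when the target already exists.
    df.write.mode(mode).parquet(uri)
    # Reading through the session returns a new DataFrame backed by the stored files.
    return spark.read.parquet(uri)


# Example call, given a live SparkSession `spark` and a running Alluxio cluster:
#   df = spark.range(1000)
#   cached = parquet_round_trip(spark, df, "alluxio://localhost:19998/tmp/numbers")
```

A later job, even one running in a new Spark context, can read the same URI without recomputing the DataFrame, which is the scenario the "New Context" measurements in the performance section compare.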
  • 32. Outline: (1) Alluxio Overview; (2) Alluxio + Spark Use Cases; (3) Using Spark with Alluxio; (4) Performance Evaluation; (5) Demo.
  • 33. Experiments: Spark 2.0.0 + Alluxio 1.2.0; single worker: Amazon r3.2xlarge. Comparisons: Alluxio; Spark storage level MEMORY_ONLY; Spark storage level MEMORY_ONLY_SER; Spark storage level DISK_ONLY.
  • 34. Reading Cached RDD. [Chart: time in seconds (0 to 250) versus RDD size in GB (0 to 50); series: Alluxio (textFile), Alluxio (objectFile), DISK_ONLY, MEMORY_ONLY_SER, MEMORY_ONLY.]
  • 35. New Context: Read 50 GB RDD (S3). [Bar chart: time in seconds (0 to 800) for Alluxio (textFile), Alluxio (objectFile), and No Alluxio; annotations: 7x speedup, 16x speedup.]
  • 36. Reading Cached DataFrame (parquet). [Chart: time in seconds (0 to 250) versus DataFrame size in GB (0 to 50); series: Alluxio (textFile), MEMORY_ONLY_SER, MEMORY_ONLY.]
  • 37. New Context: Read 50 GB DataFrame (S3). [Bar chart: time in seconds (0 to 1750) for Alluxio and No Alluxio; annotation: 10x average speedup, 17x peak speedup.]
  • 38. Outline: (1) Alluxio Overview; (2) Alluxio + Spark Use Cases; (3) Using Spark with Alluxio; (4) Performance Evaluation; (5) Demo.
  • 39. Demo Environment: Spark, Alluxio.
  • 40. Conclusion: easy to use Alluxio with Spark; predictable and improved performance; easily connect to various storage systems.
  • 41. Thank you! Haoyuan Li (haoyuan@alluxio.com, Twitter: @haoyuan); Ancil McBarnett (ancil@alluxio.com). Twitter.com/alluxio; Linkedin.com/alluxio; Website: www.alluxio.com; E-mail: info@alluxio.com.