Best Practices for Using Alluxio with Spark

Haoyuan Li, Ancil McBarnett
Strata New York, Sept 2017

Outline
1  Alluxio Overview
2  Alluxio + Spark Use Cases
3  Using Spark with Alluxio
4  Performance Evaluation
5  Demo

Data Ecosystem Yesterday
•  One Compute Framework
•  Single Storage System
•  Co-located

Data Ecosystem Today
•  Many Compute Frameworks
•  Multiple Storage Systems
•  Most not co-located

Data Ecosystem Issues
•  Each application manages multiple data sources
•  Adding/removing data sources requires application changes
•  Storage optimizations require application changes
•  Lower performance due to lack of locality

Data Ecosystem Challenges
1 Speed & Complexity
•  Numerous storage & compute systems
•  Integration and interoperability issues (on prem, hybrid, cloud)
•  Many departments & groups
2 Data Freshness
•  Cross-network movement is slow
•  Each ETL creates more lag
3 Cost
•  Data duplication
•  Data and App explosion driving cost up
4 Security & Governance
•  Data security & governance is increasingly complex
Heavy integrations create painful organizational drag

Data Ecosystem with Alluxio
•  Apps only talk to Alluxio
•  Simple Add/Remove
•  No App Changes
•  Highest performance in Memory
•  No Lock in
Interfaces to applications: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System
Interfaces to storage: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface

Alluxio Design Principles
1 Big Data & Machine Learning
•  Interoperability with leading projects
•  Large scale data sets
•  High IO
2 Optimize Data Access
•  Remote data
•  Service-oriented & microservices
•  Hot/warm/cold data
•  Temporary data
3 Application Data Sharing
•  Multiple compute frameworks within a node or cluster
•  Shared storage
•  Read/write support
4 Enterprise Class
•  Distributed Architecture
•  Commodity Hardware
•  High Availability
•  Security

Alluxio Innovation:
Server-side API Translation
Convert from Client-side Interface to Native Storage Interface
Client-side interface: HDFS Interface
Native storage interfaces: HDFS Interface, S3A Interface, Swift Interface, Google Cloud Interface

Alluxio Innovation:
Server-side API Translation
Convert between different versions of HDFS
Client-side interface: HDFS 2.7 Interface
Storage-side interfaces: HDP 2.4 Interface, CDH 5.6 Interface, MAPR 5.2 Interface

Alluxio Innovation:
Unified Namespace
Enables effective data management across different Under Stores
Uses Mounting with Transparent Naming

Alluxio Innovation:
Unified Namespace
Create a catalog of available data sources for Data Scientists
/finance/customer-transactions/
/finance/vendor-transactions/
/operations/device-logs/
/operations/phone-call-recordings/
/operations/check-images/
/research/us-economic-data/
/research/intl-economic-data/
/marketing/advertising-dataset/
/marketing/marketing-funnel-dataset/

alluxio://
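
One way to picture the catalog above is as a mount table that maps paths in the Alluxio namespace onto locations in different under stores. The following is a minimal sketch in plain Python, not Alluxio's implementation: the top-level directories come from the slide, while the under-store URIs and the `resolve` helper are hypothetical names chosen only for illustration.

```python
# Toy model of a unified namespace: Alluxio paths map onto under stores.
# The under-store URIs below are invented for illustration only.
MOUNTS = {
    "/finance": "hdfs://finance-nn:8020/warehouse",
    "/operations": "s3a://ops-bucket/raw",
    "/research": "swift://research-container/data",
}


def resolve(alluxio_path: str) -> str:
    """Translate a path in the unified namespace to its under-store URI.

    Uses longest-prefix matching so that a nested mount point would win
    over its parent.
    """
    best = None
    for mount_point in MOUNTS:
        if alluxio_path == mount_point or alluxio_path.startswith(mount_point + "/"):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount point covers {alluxio_path}")
    return MOUNTS[best] + alluxio_path[len(best):]


print(resolve("/finance/customer-transactions/"))
print(resolve("/operations/device-logs/2017/09"))
```

An application only ever sees the left-hand paths, so in this model swapping an under store means changing one table entry rather than the application.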

Alluxio Innovation:
Intelligent Cache
Local performance from remote data using native multi-tier storage
RAM (Hot)
SSD (Warm)
HDD (Cold)
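
The tiering idea can be sketched as a small cache in which recently used blocks sit in the fastest tier and colder blocks are demoted downward. This is a toy model under stated assumptions: the tier names come from the slide, but the capacities, the least-recently-used demotion rule and the `TieredCache` class are hypothetical, and the deck does not describe which policies Alluxio itself applies.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier block cache.

    New and re-read blocks go to the fastest tier. When a tier overflows,
    its least recently used block is demoted one tier down. Capacities are
    counted in blocks and are made up for this example.
    """

    def __init__(self, capacities):
        # capacities: list of (tier name, max blocks), fastest tier first
        self.capacities = dict(capacities)
        self.order = list(self.capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def _insert(self, level, block_id, data):
        if level >= len(self.order):
            return  # dropped from the slowest tier; the under store still has it
        name = self.order[level]
        tier = self.tiers[name]
        tier[block_id] = data
        tier.move_to_end(block_id)
        if len(tier) > self.capacities[name]:
            victim_id, victim = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, victim_id, victim)

    def _remove(self, block_id):
        for tier in self.tiers.values():
            if block_id in tier:
                return tier.pop(block_id)
        return None

    def put(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, data)

    def get(self, block_id):
        """Return the block and promote it to the fastest tier, or None on a miss."""
        data = self._remove(block_id)
        if data is not None:
            self._insert(0, block_id, data)
        return data

    def location(self, block_id):
        for name in self.order:
            if block_id in self.tiers[name]:
                return name
        return None


cache = TieredCache([("RAM", 2), ("SSD", 2), ("HDD", 2)])
for i in range(1, 5):
    cache.put(f"block {i}", b"...")
print({b: cache.location(b) for b in ("block 1", "block 2", "block 3", "block 4")})
```

Reading a block that has drifted to a lower tier moves it back to RAM in this sketch, which is the "hot" behaviour the slide labels.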

Where to use Alluxio
Finding high-fit Alluxio use-cases
Compute Zone: Spark, TensorFlow, Presto (standalone or managed with Mesos or YARN)
Storage in Different Availability Zone: HDFS (either on-prem or cloud)
Alluxio is installed with or near compute to unify data stores, stage remote data, and improve system performance.
Guidelines
✓  Compute separated from storage
✓  Distributed compute
✓  I/O or network latency exists
✓  Unification of many storage systems
✓  Applications sharing long-lived data
More checks result in higher-fit applications

Fastest Growing Big Data Open Source Projects
•  Fastest growing open-source project in the big data ecosystem
•  Running in large production clusters
•  500+ contributors from 100+ organizations
Figure: GitHub open source contributors by month. Number of contributors (0 to 500) over months (0 to 45) for Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, and Hive.

Alluxio + Spark Use Cases

Big Data Case Study –
Challenge – Gain end-to-end view of business with large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.
Solution – ETL data from Teradata to Alluxio
Impact – Faster time to market: “Now we don’t have to work Sundays”
Diagram labels: SPARK, TERADATA
http://bit.ly/2oMx95W

Challenge –
data
Solution – With Alluxio, data queries are 30X faster
Impact – Higher operational efficiency
Diagram labels: SPARK, Baidu File System
http://bit.ly/2pDHS3O

Challenge –
data for $5B Travel Site
Solution – With Alluxio, 300x improvement in performance
Impact – Increased revenue from immediate response to user behavior
Diagram labels: SPARK, FLINK, HDFS, CEPH
Use case: http://bit.ly/2pDJdrq

Machine Learning Case Study –
Challenge – Disparate data both on-prem and in the cloud. Heterogeneous types of data. Scaling of exabyte-size data. Slow due to disk-based approach.
Solution – Using Alluxio to prevent I/O bottlenecks
Impact – Orders of magnitude higher performance than before.
Diagram labels: SPARK, HDFS, MINIO, MESOS
http://bit.ly/2p18ds3

Using Spark with Alluxio

Consolidating Memory
Storage Engine & Execution Engine: Same Process
•  Two copies of data in memory – double the memory used
•  Inter-process Sharing Slowed Down by Network / Disk I/O
Diagram labels: Spark Compute and Spark Storage (twice), each Spark Storage with block 1 and block 3; HDFS / Amazon S3 with blocks 1 to 4.

Consolidating Memory
Storage Engine & Execution Engine: Different Process
•  Half the memory used
•  Inter-process Sharing Happens at Memory Speed
Diagram labels: Spark Compute and Spark Storage (twice); Alluxio with block 1, block 3, block 4; HDFS disk and HDFS / Amazon S3 with blocks 1 to 4.
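
The "double the memory" versus "half the memory" comparison is simple arithmetic, sketched below. The function name and the 50 GB figure are assumptions made for this example, and the model deliberately ignores serialization overhead and partial caching.

```python
def cached_memory_gb(dataset_gb: float, num_jobs: int, shared: bool) -> float:
    """Memory needed to keep one dataset cached for several Spark jobs.

    shared=False: every job caches its own copy inside its own processes.
    shared=True: one copy lives in a separate storage process that all
    jobs read from.
    """
    return dataset_gb if shared else dataset_gb * num_jobs


# Two jobs reading the same 50 GB dataset (illustrative numbers).
print(cached_memory_gb(50, 2, shared=False))  # one copy per job
print(cached_memory_gb(50, 2, shared=True))   # one shared copy
```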

Data Resilience During Crash
Storage Engine & Execution Engine: Same Process
Diagram labels: Spark Compute; Spark Storage with block 1 and block 3; HDFS / Amazon S3 with blocks 1 to 4.
CRASH (Storage Engine & Execution Engine: Same Process)
•  Process Crash Requires Network and/or Disk I/O to Re-read Data
Diagram labels: CRASH; Spark Storage with block 1 and block 3 (absent in the following frame); HDFS / Amazon S3 with blocks 1 to 4.
Storage Engine & Execution Engine: Different Process
•  Process Crash – Data is Re-read at Memory Speed
Diagram labels: Spark Compute and Spark Storage (replaced by CRASH in the following frame); Alluxio with block 1, block 3, block 4; HDFS disk and HDFS / Amazon S3 with blocks 1 to 4.

Accessing Alluxio Data From Spark
Writing Data: Write to an Alluxio file
Reading Data: Read from an Alluxio file

Code Example for Spark RDDs
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
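
The calls above can be wrapped into a small PySpark sketch. It assumes an existing `SparkContext` named `sc` whose classpath already includes the Alluxio client jar, and an Alluxio master reachable at `localhost` on port 19998; the host, the example path and both helper names are placeholders rather than anything from the talk. The object-file methods on the slide belong to the Scala/Java RDD API; in PySpark the closest counterparts are `saveAsPickleFile` and `pickleFile`, so this sketch sticks to text files.

```python
def alluxio_uri(path: str, master_host: str = "localhost", port: int = 19998) -> str:
    """Build an alluxio:// URI. Host and port are placeholders for your master."""
    return f"alluxio://{master_host}:{port}/{path.lstrip('/')}"


def rdd_round_trip(sc, rdd, path: str):
    """Write an RDD to Alluxio as text, then read it back.

    `sc` is an existing pyspark.SparkContext; this function does not
    configure the Alluxio client for it.
    """
    target = alluxio_uri(path)
    rdd.saveAsTextFile(target)   # fails if the target directory already exists
    return sc.textFile(target)   # RDD of strings, one element per line


print(alluxio_uri("/demo/rdd-text"))
```

A later Spark application can call `sc.textFile` with the same URI, which is what lets the data outlive the context that wrote it.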

Code Example for Spark DataFrames
Writing to Alluxio
df.write.parquet(alluxioPath)
Reading from Alluxio
df = spark.read.parquet(alluxioPath)
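
A slightly fuller PySpark sketch of the same DataFrame round trip follows. It assumes an existing Spark 2.x `SparkSession` passed in as `spark` and an Alluxio client jar on the driver and executor classpaths; the helper name and the `overwrite` flag are additions for this example, not part of the talk.

```python
def parquet_round_trip(spark, df, alluxio_path: str, overwrite: bool = True):
    """Write a DataFrame to Alluxio as Parquet and read it back.

    `spark` is an existing pyspark.sql.SparkSession and `alluxio_path` is a
    full alluxio:// URI. With overwrite=False the write fails if the path
    already exists, which is Spark's default save mode.
    """
    writer = df.write.mode("overwrite") if overwrite else df.write
    writer.parquet(alluxio_path)
    return spark.read.parquet(alluxio_path)
```

Because Parquet files carry their schema, the DataFrame read back needs no extra schema definition.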

Performance Evaluation

Experiments
Spark 2.0.0 + Alluxio 1.2.0
Single worker: Amazon r3.2xlarge
Comparisons:
•  Alluxio
•  Spark Storage Level: MEMORY_ONLY
•  Spark Storage Level: MEMORY_ONLY_SER
•  Spark Storage Level: DISK_ONLY

Reading Cached RDD
Figure: Time [seconds] (0 to 250) against RDD Size [GB] (0 to 50) for Alluxio (textFile), Alluxio (objectFile), DISK_ONLY, MEMORY_ONLY_SER, and MEMORY_ONLY.

New Context: Read 50 GB RDD (S3)
Figure: Time [seconds] (0 to 800) for Alluxio (textFile), Alluxio (objectFile), and No Alluxio. Annotated speedups: 7x and 16x.

Reading Cached DataFrame (parquet)
Figure: Time [seconds] (0 to 250) against DataFrame Size [GB] (0 to 50) for Alluxio (textFile), MEMORY_ONLY_SER, and MEMORY_ONLY.

New Context: Read 50 GB DataFrame (S3)
Figure: Time [seconds] (0 to 1750) for Alluxio and No Alluxio.
10x average speedup, 17x peak speedup

Demo

Demo Environment
Spark
Alluxio

Conclusion
Easy to use Alluxio with Spark
Predictable and improved performance
Easily connect to various storage systems

Website: www.alluxio.com
E-mail: info@alluxio.com
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio
Thank you!
Haoyuan Li, haoyuan@alluxio.com, Twitter: @haoyuan
Ancil McBarnett, ancil@alluxio.com, Twitter: @
