Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Data Orchestration for Analytics and AI in the Cloud Era
Calvin Jia, Founding Engineer (Alluxio)
Bin Fan, Founding Engineer, VP of Open Source (Alluxio)
About Alluxio: alluxio.io
Engage with the open source community on Slack: alluxio.io/slack
6. DATA ORCHESTRATION SUMMIT
Single Cloud & On-Prem Use Cases
USE CASE 01: CLOUD
Consistent SLAs, Performance, and Cost Savings on cloud storage
[Diagram labels: PUBLIC CLOUD, TensorFlow, Alluxio]
USE CASE 02: ON PREM
Speed-up analytics on on-prem object stores
[Diagram labels: ON PREMISE, Spark, Alluxio]
7. CHALLENGES WITH CLOUD STORAGE
USE CASE 01: CLOUD
Inefficient access to cloud storage
• Performance is variable and consistent SLAs are hard to achieve
• Metadata operations are expensive & slow down workloads
• Embedded caching solutions are ineffective for ephemeral workloads & clusters
[Diagram labels: TensorFlow, Alluxio]
8. SOLUTION
Consistent SLAs, Performance & Cost Savings on cloud storage
USE CASE 01: CLOUD
• 40%+ reduction in AI training time & cost
• 2-8x performance with Analytics engines
• Eliminate storage access cost to cut total cost by up to 50%
• Reduce latency spikes by up to 6x using data pre-loading & consistent performance guarantees
• Optional off-cluster caching for ephemeral workloads
[Diagram labels: TensorFlow, Alluxio]
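The pre-loading and caching idea in the bullets above can be illustrated with a toy model. This is only a sketch, not Alluxio's implementation: `ReadThroughCache`, `preload`, and `run_epochs` are hypothetical names, and a Python dict stands in for cloud storage. It counts how many reads reach the remote store when a small dataset is pre-loaded and then read for several epochs.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy LRU read-through cache in front of a 'remote' store (a plain dict)."""

    def __init__(self, remote, capacity):
        self.remote = remote          # stand-in for cloud storage (assumption)
        self.capacity = capacity      # maximum number of cached objects
        self.cache = OrderedDict()    # insertion order doubles as LRU order
        self.remote_reads = 0         # how many reads reached the remote store

    def _fetch(self, key):
        self.remote_reads += 1
        return self.remote[key]

    def _put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # cache hit: no remote access
            return self.cache[key]
        value = self._fetch(key)            # cache miss: go to the remote store
        self._put(key, value)
        return value

    def preload(self, keys):
        """Fetch keys ahead of time so later reads are served from the cache."""
        for key in keys:
            if key not in self.cache:
                self._put(key, self._fetch(key))


def run_epochs(cache, keys, epochs):
    """Read every key once per epoch and return the total remote read count."""
    for _ in range(epochs):
        for key in keys:
            cache.read(key)
    return cache.remote_reads


remote = {f"part-{i}": b"x" * 10 for i in range(4)}
cache = ReadThroughCache(remote, capacity=4)
cache.preload(list(remote))
reads = run_epochs(cache, list(remote), epochs=3)
print(reads)
```

In this toy setup, three passes over four pre-loaded objects cost four remote reads in total, where reading directly from the remote store would cost twelve.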
9. CHALLENGES WITH ON-PREM OBJECT STORES
USE CASE 02: ON PREM
Slow transition to object storage
• Performance for analytics & AI workloads can be very poor
• No native support for popular frameworks
• Expensive metadata operations further reduce performance
[Diagram labels: Spark, Alluxio]
10. SOLUTION
Speed-up analytics & AI on on-prem object stores
USE CASE 02: ON PREM
• Improved performance over co-located HDFS with the flexibility of segregated storage
• Support for multiple APIs
• No changes to the end-user experience
• Enable cheap storage at a fraction of the cost
[Diagram labels: Spark, Alluxio, SAME REGION]
11. Hybrid Cloud & Multi-Datacenter
Burst compute to a public cloud
and gradually migrate
USE CASE 03: HYBRID
Hive
Alluxio
PUBLIC CLOUD
ON PREMISE
Hybrid Cloud Gateway to utilize
on-prem compute for data in the cloud
USE CASE 04: HYBRID
Alluxio
PyTorch
PUBLIC CLOUD
ON PREMISE
Cross Datacenter Access without
changing Ingest Pipeline across regions
USE CASE 05: MULTI-DATACENTER
Presto
Alluxio
DATACENTER 1
DATACENTER 2
INGESTION
12. CHALLENGES WITH HYBRID CLOUD BURSTING
USE CASE 03: HYBRID
Migrating Analytics or AI to the
Cloud is Hard
• Repeated data access across the corporate network to a public
cloud is not feasible
• Copying data to cloud storage is time consuming and complex
• Using a cloud storage system like S3 means expensive
application changes and low performance
[Diagram labels: Hive, Alluxio]
13. SOLUTION
Burst Compute to a Public Cloud and Gradually Migrate
USE CASE 03: HYBRID
• Performance as if data is on the cloud compute cluster
• 100% of I/O is offloaded from on-premises
• No changes to end-user experience and security model
• Common data fabric with only logical data copies
• Utilization of elastic cloud compute for up to 4x cost savings
[Diagram labels: Hive, Alluxio, SAME REGION]
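One way to picture "only logical data copies" is a mount table that maps logical paths to the storage systems that already hold the data. The sketch below is a toy illustration under that assumption, not Alluxio's API: `MountTable`, the `/onprem` and `/cloud` prefixes, and the example URIs are all hypothetical.

```python
from typing import Dict, Optional


class MountTable:
    """Toy namespace: logical path prefixes map to existing storage locations."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, logical_prefix: str, storage_uri: str) -> None:
        # Normalize trailing slashes so lookups compare like with like.
        prefix = logical_prefix.rstrip("/") or "/"
        self._mounts[prefix] = storage_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        """Translate a logical path to a storage URI via longest-prefix match."""
        best: Optional[str] = None
        for prefix in self._mounts:
            base = prefix.rstrip("/")
            if logical_path == prefix or logical_path.startswith(base + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no mount covers {logical_path}")
        suffix = logical_path[len(best.rstrip("/")):]
        return self._mounts[best] + suffix


table = MountTable()
# Hypothetical mounts: data stays where it is; only the path mapping is recorded.
table.mount("/onprem", "hdfs://namenode:8020/warehouse")
table.mount("/cloud", "s3://example-bucket/data")

print(table.resolve("/onprem/sales/2020.parquet"))
print(table.resolve("/cloud/logs/a.json"))
```

Applications address both stores through one path space, and no bytes are copied when a mount is added; only the mapping changes.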
14. Alluxio @ Walmart
• Zero-Copy
○ No new copies of data in the cloud
• High Performance
○ Data caching accelerates queries
• Lower Costs
○ One source of truth for data avoids
additional storage
15. CHALLENGES WITH HYBRID CLOUD STORAGE
USE CASE 04: HYBRID
Accessing Cloud Storage from a
Private Datacenter
• No unified view for cloud and on-prem storage
• Prohibitively high network egress costs
• Inability to utilize compute on-premises for data generated
in the cloud
• Inadequate performance for analytics and AI
PyTorch
ON PREMISE
PUBLIC CLOUD
16. SOLUTION
Hybrid Cloud Storage Gateway for data in the cloud
USE CASE 04: HYBRID
• Performance as if data is on the on-prem compute cluster
• Intelligent distributed caching for reads & writes
• Network cost savings of up to 80% by eliminating replication
• No changes to the end-user experience with flexible APIs and security model on cloud storage
[Diagram labels: Alluxio, PyTorch, ON PREMISE, PUBLIC CLOUD]
17. CHALLENGES WITH SUPPORTING SATELLITE CLUSTERS ACROSS DATA CENTERS
USE CASE 05: MULTI DATACENTER
Utilization of compute resources
across datacenters
• Orchestrating data to compute clusters in another data center is
manual and time consuming
• Storing and managing multiple copies of the data is expensive
with unnecessary network traffic for replication
• Running replication frameworks on an overloaded storage
cluster dramatically impacts performance of existing workloads
[Diagram labels: Presto, Alluxio, DATACENTER 1, DATACENTER 2, Hive]
18. SOLUTION
Cross Datacenter Access without changing Ingest Pipeline
USE CASE 05: MULTI DATACENTER
• No redundant data copies across datacenters
• Elimination of complex data synchronization
• 3-6x performance compared to remote data access across regions
• Self-service data infrastructure across business units
[Diagram labels: Presto, Alluxio, DATACENTER 1, DATACENTER 2, Hive]
19. Alluxio @ Adobe
PROBLEM
Primary DC with a large Hadoop cluster is out of space; ad hoc SQL workloads are growing exponentially as analyst headcount has reached 1,800 people.
SOLUTION
Leverage compute resources outside of the primary on-prem DC for multiple analytical frameworks.
RESULTS
● 80% less network usage
● More stable infrastructure
● Lower costs
● Results come in faster
● Easier to scale
● Ability to handle new analysts with no impact and improve response times
● Self-service for end-users
[Diagram label: REMOTE DATA]
20. Alluxio & Data Analytics
• Data Analytics runs on Data Lakes
• Data Lakes are designed for data storage, not access
• Alluxio is the Data Orchestration layer which bridges the
compute and data layers
○ If the Data Lake is remote
○ If the Data Lake is overloaded
○ If the Data Lake has variable latency
○ If the Data Lake has low performance
○ If the Data Lake doesn’t support the same semantics
○ ...
22. Alluxio & AI w/ K8s
• Machine Learning & AI runs on Data Lakes
• Compared to Data Analytics, AI workloads have different
characteristics, but a similar mismatch between compute
and storage
23. Alluxio & AI - Better Together
• Access Pattern - Repeated access on a dataset
• Dataset - Many small files
• Preferred API - POSIX Filesystem
• Workload Regularity - Predictable, bulk access
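The access pattern listed above (many small files, read repeatedly through a POSIX filesystem path) can be sketched with nothing but standard file calls. This is a minimal illustration, not a training pipeline: `read_dataset` and `train` are hypothetical names, and a temporary directory stands in for whatever mount point would expose the dataset in a real deployment.

```python
import os
import tempfile


def read_dataset(root):
    """Walk a directory tree and read every file; return (file_count, total_bytes)."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), "rb") as f:
                total += len(f.read())
            count += 1
    return count, total


def train(root, epochs):
    """Each epoch re-reads the whole dataset: predictable, bulk, repeated access."""
    return [read_dataset(root) for _ in range(epochs)]


# A temporary directory stands in for the dataset mount point (assumption).
with tempfile.TemporaryDirectory() as mount_point:
    for i in range(100):
        # Many small files, as in typical image or sample-per-file datasets.
        with open(os.path.join(mount_point, f"sample-{i:04d}.bin"), "wb") as f:
            f.write(b"\0" * 64)
    per_epoch = train(mount_point, epochs=3)

print(per_epoch)
```

Because the code uses only `open` and `os.walk`, it does not change if the directory is backed by local disk or by a filesystem mount in front of remote storage; every epoch touches the same 100 files again, which is the repeated access a cache can serve.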
26. Alluxio Open Source Project Stats
Latest stable release: 2.4.1
Total number of contributors: 1092
+1013 more commits since v2.1.0 (Nov 2019, 1st Summit)
5100+ Slack users (alluxio.io/slack)
28. Production Deployments at Scale
● Top-tier cell phone provider
○ 3000+ Alluxio servers in a single cluster
● Top-tier social network company
○ 10,000+ concurrent Alluxio clients
○ 10+PB data managed
29. Special Interest Groups in Ecosystem
● SIG in Machine Learning/K8s on Alluxio
■ Regular Community R&D meetings
■ Re-implemented JNI-based FUSE integration
■ Performance optimizations for small files, RPCs
● A new SIG kicked off for Presto on Alluxio
30. Experimental Two-week Release Cycle
● Previous release cadence: quarterly
● New experimental release schedule:
○ every two weeks
○ starting early December!
● What does it bring to the Alluxio community?
○ deliver features and bug fixes faster