5. Open Source Started From UC Berkeley AMPLab in 2014
Join the conversation on Slack: alluxio.io/slack
1,000+ contributors & growing
5,000+ Slack Community Members
Top 10 Most Critical Java-Based Open Source Project
GitHub’s Top 100 Most Valuable Repositories Out of 96 Million
8. BRING DATA CLOSER TO COMPUTE ACROSS SILOS
Access-based data movement for compute and storage spread across environments
[Diagram labels: REGION A, REGION B, PRIVATE DATA CENTERS (DATACENTER 1, DATACENTER 2); Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive]
9. COMMON USE CASES
CASE 01: CLOUD
Consistent SLAs, Performance, and Cost Savings on cloud storage
[Diagram labels: PUBLIC CLOUD, Tensorflow, Alluxio]
CASE 02: HYBRID
Hybrid Cloud Gateway to utilize on-prem compute for data in the cloud
[Diagram labels: Alluxio, Spark, PUBLIC CLOUD, ON PREMISE]
CASE 03: MULTI-DATACENTER
Cross Datacenter Access without changing Ingest Pipeline across regions
[Diagram labels: Presto, Alluxio, DATACENTER 1, DATACENTER 2, INGESTION]
10. Alluxio - Key Innovations
EFFICIENT ACCESS & EASY DATA MANAGEMENT
Acceleration, efficient representation and movement of data based on policies
ENVIRONMENT AGNOSTIC & MULTI-CLOUD READY
Orchestrate a data platform with agility across regions for private, hybrid or multi-cloud
UNIFY DATA LAKES
Support multiple APIs for analytics and AI with storage abstraction and streamlined data movement across the pipeline
11. EXAMPLE JOURNEY
On-premises storage as the source of truth
[Diagram labels: REGION A, REGION B, PRIVATE DATA CENTERS, Amazon EMR, DATACENTER 2, INGESTION ETL, Hive]
13. Why use Alluxio with Iceberg?
Improve I/O performance and efficiency for data analytics through better data locality.
Simplify managing Iceberg files alongside the compute engine.
Avoid having Iceberg talk directly to an eventually consistent file system.
15. Alluxio Write Type
Write Type        Description
MUST_CACHE        Writes directly to Alluxio
*THROUGH          Writes directly to under storage
*CACHE_THROUGH    Writes to Alluxio and under storage synchronously
ASYNC_THROUGH     Writes to Alluxio first, then asynchronously writes to the under storage
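To make the difference between the four write types concrete, here is a toy Python model. It is only a sketch of the semantics described in the table: the dictionaries standing in for the Alluxio cache and the under storage, and the helper names `write` and `flush`, are hypothetical and are not part of any Alluxio API.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "MUST_CACHE"
    THROUGH = "THROUGH"
    CACHE_THROUGH = "CACHE_THROUGH"
    ASYNC_THROUGH = "ASYNC_THROUGH"


def write(path, data, write_type, cache, under_storage, pending):
    """Toy model of a single write (hypothetical helper, not an Alluxio call).

    cache and under_storage are plain dicts; pending is a list of paths
    that still need to be persisted asynchronously.
    """
    # Every type except THROUGH places the data in the Alluxio cache.
    if write_type is not WriteType.THROUGH:
        cache[path] = data
    # THROUGH and CACHE_THROUGH persist to under storage before returning.
    if write_type in (WriteType.THROUGH, WriteType.CACHE_THROUGH):
        under_storage[path] = data
    # ASYNC_THROUGH returns first and leaves persistence for later.
    if write_type is WriteType.ASYNC_THROUGH:
        pending.append(path)


def flush(cache, under_storage, pending):
    """Simulate the background persist step for ASYNC_THROUGH writes."""
    while pending:
        path = pending.pop(0)
        under_storage[path] = cache[path]
```

In this model, only THROUGH and CACHE_THROUGH have the data in under storage by the time the write returns, which is presumably why those two rows carry an asterisk on the slide.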
16. Alluxio + Iceberg Architecture: Option 1
When all accesses go through Alluxio (S3 is mounted as under storage, where the Iceberg tables are stored)
Spark can read Iceberg tables from Alluxio.
Spark can write Iceberg tables to Alluxio.
Alluxio reads and writes Iceberg tables from/to S3.
[Diagram labels: Spark, Alluxio, Data in S3]
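One way Option 1 could be wired up from Spark is to point an Iceberg catalog's warehouse at an `alluxio://` path. The sketch below only builds the Spark properties as a dictionary. The catalog name, host name, and warehouse path are hypothetical, and the choice of a Hadoop-type Iceberg catalog is an assumption, since the slides do not say which catalog was used. It also assumes the Alluxio client jar is already on the Spark classpath.

```python
def iceberg_on_alluxio_conf(catalog="alluxio_cat",
                            master="alluxio-master",
                            port=19998,
                            warehouse="/iceberg/warehouse"):
    """Build Spark properties for an Iceberg catalog stored under Alluxio.

    All argument defaults are placeholder values for illustration only.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Iceberg's Spark catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Assumption: a Hadoop (file-system based) catalog.
        f"{prefix}.type": "hadoop",
        # All table reads and writes go through the Alluxio namespace.
        f"{prefix}.warehouse": f"alluxio://{master}:{port}{warehouse}",
    }
```

Each key/value pair could then be passed to `SparkSession.builder.config(key, value)` before the session is created.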
17. Alluxio + Iceberg Architecture: Option 2
When Iceberg tables stored on under storage (e.g. S3 here) can be updated outside Alluxio, how do we avoid reading a broken table?
On read: Spark queries the Iceberg table with “metadata sync interval = 0” ⇒ retrieves the latest Iceberg table.
On read: Alluxio always checks the metadata and gets the latest Iceberg file and data file from S3.
On write: Spark writes the Iceberg file and data file to S3 with CACHE_THROUGH/THROUGH ⇒ strong consistency achieved for the Iceberg table commit.
On write: Alluxio writes to S3 with CACHE_THROUGH/THROUGH, which guarantees strong consistency for the Iceberg table commit.
[Diagram labels: Spark, Alluxio, Data in S3]
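The two settings named in Option 2 could be passed to Spark as JVM system properties for the Alluxio client. The sketch below assumes that the slide's "metadata sync interval" corresponds to the client property `alluxio.user.file.metadata.sync.interval` and that the write type is set through `alluxio.user.file.writetype.default`; check both names against the Alluxio version in use. The helper function names are hypothetical.

```python
def alluxio_client_java_opts(write_type="CACHE_THROUGH", sync_interval="0"):
    """Render Alluxio client properties as -D JVM options.

    Only the two synchronous write types from the slide are accepted,
    because the commit must reach S3 before the write returns.
    """
    if write_type not in ("CACHE_THROUGH", "THROUGH"):
        raise ValueError("Option 2 expects CACHE_THROUGH or THROUGH")
    props = {
        "alluxio.user.file.writetype.default": write_type,
        # 0 means: check the under storage for changes on every access.
        "alluxio.user.file.metadata.sync.interval": sync_interval,
    }
    return " ".join(f"-D{key}={value}" for key, value in props.items())


def spark_conf_for_option2(write_type="CACHE_THROUGH"):
    """Apply the same client options to the Spark driver and executors."""
    opts = alluxio_client_java_opts(write_type)
    return {
        "spark.driver.extraJavaOptions": opts,
        "spark.executor.extraJavaOptions": opts,
    }
```

Syncing metadata on every access trades some read latency for freshness, which is the point of this option: tables changed outside Alluxio are picked up on the next read.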
24. New Features
Native folder for metadata storage (Jack Ye, AWS)
Enable Iceberg Local Cache (Baolong, Tencent)
Upgrade to Iceberg 0.12.0 and Parquet 1.12.0 (Xinli Shang, Uber and Beinan, Alluxio)
Predicate pushdown to Iceberg (Beinan Wang, Alluxio)