Iceberg + Alluxio for Fast Data Analytics
Beinan Wang & Shouwei Chen @ Alluxio
2021/12/14
Introduction
Beinan Wang
● PrestoDB Committer
● PhD in CE @ Syracuse
● Email: beinan@alluxio.com
● Interactive Query / Compute Engine / Caching
Shouwei Chen
● Core Maintainer @ Alluxio
● PhD in ECE @ Rutgers
● Email: shouwei@alluxio.com
● Data lake / Structured data / Community
Find us on Alluxio community slack!
https://alluxio.io/slack
ALLUXIO 2
Outline
● Alluxio Overview
● Running Iceberg with Alluxio
● Querying your Iceberg Table with Presto
● Presto Iceberg connector updates
● Q & A
What is Alluxio?
Open source, started at UC Berkeley AMPLab in 2014
● 1,000+ contributors and growing
● 5,000+ Slack community members
● Top 10 most critical Java-based open source project
● GitHub's Top 100 most valuable repositories, out of 96 million
Join the conversation on Slack: alluxio.io/slack
Data Orchestration for Analytics & AI in the Cloud
Available:
DATA ACCESSIBILITY
Access any storage using any compute
BRING DATA CLOSER TO COMPUTE ACROSS SILOS
Access-based data movement for compute and storage spread across environments
[Diagram labels: Region A, Region B, private data centers (Datacenter 1, Datacenter 2), Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive]
COMMON USE CASES
● CASE 01: CLOUD. Consistent SLAs, performance, and cost savings on cloud storage. [Diagram labels: TensorFlow, Alluxio, public cloud]
● CASE 02: HYBRID. Hybrid cloud gateway to utilize on-prem compute for data in the cloud. [Diagram labels: Spark, Alluxio, public cloud, on premise]
● CASE 03: MULTI-DATACENTER. Cross-datacenter access without changing the ingest pipeline across regions. [Diagram labels: Presto, Alluxio, Datacenter 1, Datacenter 2, ingestion]
Alluxio - Key Innovations
● EFFICIENT ACCESS & EASY DATA MANAGEMENT: Acceleration, efficient representation and movement of data based on policies
● ENVIRONMENT AGNOSTIC & MULTI-CLOUD READY: Orchestrate a data platform with agility across regions for private, hybrid or multi-cloud
● UNIFY DATA LAKES: Support multiple APIs for analytics and AI with storage abstraction and streamlined data movement across the pipeline
EXAMPLE JOURNEY
On-premises storage as the source of truth
[Diagram labels: Region A, Region B, private data centers, Datacenter 2, Amazon EMR, Hive, ingestion, ETL]
Why use Alluxio with Iceberg?
● Improve I/O performance and efficiency for data analytics with better data locality.
● Simplify the management of Iceberg files together with the compute engine.
● Avoid having Iceberg talk directly to an eventually consistent file system.
How to integrate Alluxio with Iceberg?
Alluxio Write Type
Write Type        Description
MUST_CACHE        Writes directly to Alluxio
*THROUGH          Writes directly to under storage
*CACHE_THROUGH    Writes to Alluxio and under storage synchronously
ASYNC_THROUGH     Writes to Alluxio first, then asynchronously writes to the under storage
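The four write types in the table differ only in where a write lands and when. A minimal Python sketch of that behavior follows. It is a toy model, not the Alluxio client API: the `ToyStore` class, its fields, and `flush_async` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Toy model of Alluxio write types (hypothetical, not the real client)."""
    alluxio: dict = field(default_factory=dict)        # data held in Alluxio
    under_storage: dict = field(default_factory=dict)  # data persisted in e.g. S3
    pending: list = field(default_factory=list)        # paths queued for async persist

    def write(self, path: str, data: bytes, write_type: str) -> None:
        if write_type == "MUST_CACHE":
            # Alluxio only; nothing reaches the under storage.
            self.alluxio[path] = data
        elif write_type == "THROUGH":
            # Under storage only; nothing is cached in Alluxio.
            self.under_storage[path] = data
        elif write_type == "CACHE_THROUGH":
            # Both copies exist before the write returns.
            self.alluxio[path] = data
            self.under_storage[path] = data
        elif write_type == "ASYNC_THROUGH":
            # Alluxio first; the under storage copy is written later.
            self.alluxio[path] = data
            self.pending.append(path)
        else:
            raise ValueError(f"unknown write type: {write_type}")

    def flush_async(self) -> None:
        """Persist everything queued by ASYNC_THROUGH writes."""
        while self.pending:
            path = self.pending.pop(0)
            self.under_storage[path] = self.alluxio[path]
```

In this model, a reader that goes straight to the under storage sees a THROUGH or CACHE_THROUGH write as soon as `write` returns, but sees an ASYNC_THROUGH write only after `flush_async` has run.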
Alluxio + Iceberg Architecture: Option 1
When all accesses go through Alluxio (S3 is mounted as the under storage where the Iceberg tables are stored):
● Spark can read Iceberg tables from Alluxio.
● Spark can write Iceberg tables to Alluxio.
● Alluxio reads and writes Iceberg tables from/to S3, where the data lives.
Alluxio + Iceberg Architecture: Option 2
When Iceberg tables stored in the under storage (e.g. S3 here) can be updated outside Alluxio, how do we avoid reading a broken table?
● On read (Spark): Spark queries the Iceberg table with "metadata sync interval = 0" ⇒ retrieves the latest Iceberg table.
● On read (Alluxio): Alluxio always checks the metadata and gets the latest Iceberg file and data file from S3.
● On write (Spark): Spark writes the Iceberg file and data file to S3 with CACHE_THROUGH/THROUGH ⇒ strong consistency achieved for the Iceberg table commit.
● On write (Alluxio): Alluxio writes to S3 with CACHE_THROUGH/THROUGH, which guarantees strong consistency for the Iceberg table commit.
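To see why a metadata sync interval of 0 matters when the under storage can change outside Alluxio, here is a minimal Python sketch. It is a toy model and not Alluxio's implementation: the `ToyMetadataCache` class, the explicit `now` argument, and the path used in the example are assumptions made for illustration.

```python
class ToyMetadataCache:
    """Toy metadata cache in front of an under storage (hypothetical)."""

    def __init__(self, under_storage: dict, sync_interval: float):
        self.under_storage = under_storage  # path -> current value in e.g. S3
        self.sync_interval = sync_interval  # seconds; 0 re-checks on every read
        self._cached = {}                   # path -> (value, time of last sync)

    def read(self, path: str, now: float):
        entry = self._cached.get(path)
        # Re-sync when nothing is cached or the cached entry is too old.
        # With sync_interval == 0 the age check is always true.
        if entry is None or now - entry[1] >= self.sync_interval:
            entry = (self.under_storage[path], now)
            self._cached[path] = entry
        return entry[0]


# The under storage holds a pointer to the table's current metadata file.
s3 = {"/warehouse/test1/current": "v1.metadata.json"}
always_sync = ToyMetadataCache(s3, sync_interval=0)
lazy_sync = ToyMetadataCache(s3, sync_interval=60)
always_sync.read("/warehouse/test1/current", now=0)
lazy_sync.read("/warehouse/test1/current", now=0)

# A writer outside Alluxio commits a new table version.
s3["/warehouse/test1/current"] = "v2.metadata.json"
latest = always_sync.read("/warehouse/test1/current", now=1)
stale = lazy_sync.read("/warehouse/test1/current", now=1)
```

Here `latest` points at the new commit, while `stale` still points at the old one until the 60-second interval has passed. That stale window is what the slide's "metadata sync interval = 0" setting is meant to remove.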
Query your Iceberg Table
Create Table
CREATE TABLE iceberg.test.test1
WITH (format = 'PARQUET', partitioning = ARRAY['c_birth_month']) AS
SELECT
  c_customer_sk,
  c_birth_day,
  c_birth_month
FROM
  tpcds.sf100.customer
Insert
INSERT INTO iceberg.test.test1
VALUES (1000, 40, 13);
Query
Screenshot from Chunxu’s talk earlier.
Schema Evolution
Screenshot from Chunxu’s talk earlier.
Iceberg Connector Updates
New Features
● Native folder for metadata storage (Jack Ye, AWS)
● Enable Iceberg Local Cache (Baolong, Tencent)
● Upgrade to Iceberg 0.12.0 and Parquet 1.12.0 (Xinli Shang, Uber and Beinan, Alluxio)
● Predicate pushdown to Iceberg (Beinan Wang, Alluxio)
Iceberg Native Catalog
Native folder for metadata storage (Jack Ye, AWS)
Iceberg Local Cache
Enable Iceberg Local Cache (Baolong, Tencent)
Diagram is from: https://prestodb.io/blog/2021/02/04/raptorx
Predicate Pushdown
Reduce the number of partitions scanned by Presto
Predicate Pushdown Resource Usage
Reduce the number of partitions scanned by Presto
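The effect of pushing a predicate down to a partitioned table can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Presto or Iceberg implementation: `build_partitions` and `query_by_month` are hypothetical helpers, and the sample rows are made up to match the columns of the example table.

```python
from collections import defaultdict


def build_partitions(rows, key="c_birth_month"):
    """Group rows by partition value, as a table partitioned on `key` would."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def query_by_month(partitions, month, pushdown):
    """Return (matching rows, number of partitions scanned)."""
    if pushdown:
        # The predicate is applied to partition values first,
        # so only the matching partition is opened.
        candidates = {month: partitions[month]} if month in partitions else {}
    else:
        # Every partition is read and rows are filtered afterwards.
        candidates = partitions
    rows = [r for part in candidates.values() for r in part
            if r["c_birth_month"] == month]
    return rows, len(candidates)


# Made-up sample rows with the columns of the example table.
rows = [
    {"c_customer_sk": 1, "c_birth_day": 3, "c_birth_month": 1},
    {"c_customer_sk": 2, "c_birth_day": 9, "c_birth_month": 1},
    {"c_customer_sk": 3, "c_birth_day": 14, "c_birth_month": 2},
    {"c_customer_sk": 4, "c_birth_day": 25, "c_birth_month": 12},
]
partitions = build_partitions(rows)
full_scan = query_by_month(partitions, 1, pushdown=False)
pruned = query_by_month(partitions, 1, pushdown=True)
```

Both calls return the same two rows, but `full_scan` touches all three partitions while `pruned` touches one. Fewer partitions scanned means less data read from storage or cache, which is the resource saving these slides refer to.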
Ongoing Work
● Native Iceberg IO (Jack Ye, AWS)
● Materialized view (Chunxu Tang, Twitter)
● Iceberg v2 support and row-level delete (Beinan Wang, Alluxio)
Q & A
