The Hudi Platform
Lake Storage
(Cloud Object Stores, HDFS, …)
Open File/Data Formats
(Parquet, HFile, Avro, Orc, …)
Concurrency Control
(OCC, MVCC, Non-blocking, Lock providers,
Scheduling...)
Table Services
(cleaning, compaction, clustering, indexing,
file sizing,...)
Indexes
(Bloom filter, HBase, Bucket index, Hash based,
Lucene..)
Table Format
(Schema, File listings, Stats, Evolution, …)
Lake Cache*
(Columnar, transactional, mutable, WIP,...)
Metaserver*
(Stats, table service coordination,...)
Transactional
Database Layer
Query Engines
(Spark, Flink, Hive, Presto, Trino, Impala,
Redshift, BigQuery, Snowflake,..)
Platform Services
(Streaming/Batch ingest, various sources,
Catalog sync, Admin CLI, Data Quality,...)
User Interface
Readers
(Snapshot, Time Travel, Incremental, etc)
Writers
(Inserts, Updates, Deletes, Smart Layout
Management, etc)
Programming API
In Industry Today
Trading transactions - Near
real-time CDC from 4000+
Postgres tables at 5 mins!
Minute level analytics with 70%
CPU savings @ Exabyte scale TikTok
recommendations
Package deliveries -
real-time event analytics at
PB scale
Streaming log ingestion and
efficient GDPR deletes
using Apache Hudi
150 source systems, ETL
processing for 10,000+
tables
Faster data access @ 75%
less storage costs
Near real-time grocery
delivery tracking
Streaming data lake for
device data
Feature Store using Hudi
Building faster analytics for
automotive data
Uber rides - 250+PB from
24h+ to minutes latency on
8000+ tables
Real time analytics that
power financial decisions
Real-time advertising for 20M+
concurrent viewers
Lakehouse at Fortune 1 Scale
Lake House
Architecture @
Halodoc
Faster SLAs with low
cost data pipelines
cost optimized fast analytics
for sports solutions
3800+
members
The Community
7000+
Commits
431+
Contributors
6000+
GH Engagers
36
Committers
Pre-installed on 5 cloud providers
Diverse PMC/Committers
19
PMCs
800B+
Records/Day
(from even just 1 user!)
A vibrant OSS Community
4700+
questions
answered
(in just last 2 years!)
22800+
responses
(in just last 2 years!)
Opportunities
- Query engines prefer separate integrations.
- Need to maintain specific Hudi connectors.
- Improved query planning & execution with
Hudi’s advanced capabilities, such as multi-modal
indexing
Deeper Query Engine
Integrations
- Mature SQL support made possible
from advancements in engines like
Apache Spark & Apache Flink
- Generalized data model for
supporting keys in Hudi tables
Generalized Data Model
- Migrate to hybrid architecture:
Serverless for data and serverful for
table metadata.
- Scales well for metadata.
- Addresses evolving concurrency
control needs.
Serverful & Serverless
- Support for complex, unstructured,
large blobs with indexing, mutation
and change capture.
- Expand to ML/AI modeling, image
and video processing applications.
Beyond Structured Data
- Reverse streaming data
- Snapshot management
- Diagnostic reporters
- Cross Region Replication
- TTL management
Enhanced self management
Database
experience on
the Lake
The Database building blocks
Main components of a DBMS.
Courtesy: The seminal database paper: Architecture of a Database System
Reference diagram highlighting existing (green) and new (yellow) Hudi
components, along with external components (blue). Check out RFC-69
LSM Tree Style Timeline
Can we support commits every
minute for 10 years?
Can we organize the timeline in a
better way so that it scales
linearly?
Unlocks infinite time travel,
time-travel writes, NB Concurrency
LSM Trees FTW!
https://github.com/google/leveldb
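The idea of an LSM-tree-style timeline can be illustrated with a toy model. The sketch below is not Hudi's implementation: the class name `LsmTimeline`, the `fanout` threshold, and the use of integers as commit instants are all assumptions made for illustration. It shows how merging small files into larger sorted files keeps the file count low even after many commits, while range lookups for time travel stay cheap.

```python
from dataclasses import dataclass, field


@dataclass
class LsmTimeline:
    """Toy LSM-style archive of commit instants (illustrative only)."""
    fanout: int = 4  # hypothetical: merge a level once it holds this many files
    levels: dict = field(default_factory=dict)  # level -> list of sorted instant files

    def append(self, instants):
        """Flush a batch of completed instants as a new level-0 file."""
        self.levels.setdefault(0, []).append(sorted(instants))
        self._compact(0)

    def _compact(self, level):
        """Merge a full level into one sorted file on the next level up."""
        files = self.levels.get(level, [])
        if len(files) < self.fanout:
            return
        merged = sorted(t for f in files for t in f)
        self.levels[level] = []
        self.levels.setdefault(level + 1, []).append(merged)
        self._compact(level + 1)

    def file_count(self):
        return sum(len(files) for files in self.levels.values())

    def instants_between(self, start, end):
        """Time-travel lookup: read only files whose range overlaps [start, end]."""
        hits = []
        for files in self.levels.values():
            for f in files:
                if f and f[0] <= end and f[-1] >= start:
                    hits.extend(t for t in f if start <= t <= end)
        return sorted(hits)
```

With a fanout of 4, appending 64 single-instant batches leaves one merged file instead of 64 small ones, which is the property that makes a very long timeline manageable.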
Non-Blocking Concurrency Control
Are we being too optimistic?
Three generally agreed upon approaches :
Pessimistic, Optimistic and Multi Version
Architecture of a Database System (Sec 6.2)
Non-Blocking Concurrency Control
Can we avoid the performance and
cost penalties due to OCC?
One way is to enhance OCC with
sophisticated techniques for early
conflict detection
How about a general-purpose
non-blocking MVCC-based
concurrency control
Spanner’s TrueTime-like global
monotonically increasing timestamps
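The contrast between OCC and a non-blocking, MVCC-style scheme can be sketched in a few lines. This is a simplified model under stated assumptions, not Hudi's code: `OccTable`, `NonBlockingTable`, and the in-process counter standing in for a global monotonically increasing timestamp source are all hypothetical. Under OCC the second of two overlapping writers loses its work; in the non-blocking model both writes land as timestamped log entries and the ordering is resolved when a snapshot is read.

```python
import itertools

# Stand-in for a global, monotonically increasing timestamp source (assumption).
_clock = itertools.count(1)


def next_ts():
    return next(_clock)


class OccTable:
    """Optimistic control: a writer aborts if its file group changed after it began."""

    def __init__(self):
        self.last_commit = {}  # file_group -> timestamp of the last committed write

    def begin(self):
        return next_ts()

    def commit(self, start_ts, file_group):
        if self.last_commit.get(file_group, 0) > start_ts:
            return False  # conflict detected at commit time: work must be retried
        self.last_commit[file_group] = next_ts()
        return True


class NonBlockingTable:
    """MVCC-style: writers only append timestamped entries; merging is deferred."""

    def __init__(self):
        self.log = []  # list of (timestamp, key, value)

    def write(self, key, value):
        self.log.append((next_ts(), key, value))  # never aborts

    def snapshot(self, as_of=None):
        """Resolve each key to its value with the highest timestamp <= as_of."""
        state = {}
        for ts, key, value in sorted(self.log, key=lambda entry: entry[0]):
            if as_of is None or ts <= as_of:
                state[key] = value
        return state
```

The `as_of` parameter also hints at why ordered timestamps make time-travel reads straightforward in this model.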
New Filegroup Reader and Writer
Can we do better?
Positional merging instead of
key-based merging
- Improve performance when > 50% base
records are changed
First class support for partial
updates
- Reduce write amplification, read
amplification
Engine agnostic abstractions
is_partial
schema (can be partial)
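To make the difference between key-based and positional merging concrete, here is a minimal sketch. It assumes a base file modeled as a list of `(key, row)` pairs and uses `None` as a delete marker; the function names and data shapes are hypothetical and do not reflect Hudi's file group reader API. Key-based merging has to look up every base record's key, whereas positional merging touches only the row positions recorded in the log.

```python
def key_based_merge(base_records, log_updates):
    """base_records: list of (key, row); log_updates: key -> new row, or None to delete."""
    merged = []
    for key, row in base_records:  # every base record requires a key lookup
        if key in log_updates:
            new_row = log_updates[key]
            if new_row is not None:
                merged.append((key, new_row))
        else:
            merged.append((key, row))
    return merged


def positional_merge(base_records, log_positions):
    """log_positions: row position in the base file -> new row, or None to delete."""
    merged = list(base_records)
    for pos, new_row in log_positions.items():  # only changed positions are visited
        merged[pos] = None if new_row is None else (base_records[pos][0], new_row)
    return [record for record in merged if record is not None]
```

Both functions produce the same result for equivalent inputs; the positional variant simply avoids per-record key comparisons, which is where the gains on large updates would come from.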
Position-based Merge Benchmark
Good gains on large updates, but still on paper
- Existing implementations, such as Iceberg's, perform poorly because they scan
the entire base file.
- Hudi PR#10167 is open to make it a reality with filter
pushdown for positional merging
Data: MOR tables, 500GB and 1TB with 1000
partitions. 50% records deleted after initial
load.
Data Size | Key-based Query Latency (ms) | Position-based Query Latency (ms) | Gains
500GB     | 9407                         | 8686                              | 12%
1TB       | 15030                        | 12534                             | 20%
Setup: AWS EMR cluster, 1 driver
(m5.8xlarge) and 20 executors
(m5.4xlarge), Apache Spark 3.3.3
Partial Update Benchmark
Game changing performance improvements!
Data: 1TB MOR table, with 1000 partitions. 80% random updates in
subsequent commit after bulk loading the data. Total 100 fields in schema,
but updates are done only for 3 fields.
Metric                   | Full Update | Partial Update | Gains
Update latency (s)       | 2072        | 1429           | 1.4x
Total Bytes Written (GB) | 891.7       | 12.7           | 70.2x
Query latency (s)        | 164         | 29             | 5.7x
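The reason partial updates write so much less data can be seen in a small sketch. It assumes rows are plain dictionaries and that a partial log record carries only the changed columns; `merge_full`, `merge_partial`, and `gain` are hypothetical helper names, not Hudi APIs. The `gain` helper recomputes the ratios in the table from the raw numbers on the slide.

```python
def merge_full(base_row, update_row):
    """Full update: the log record carries every column and replaces the base row."""
    return dict(update_row)


def merge_partial(base_row, partial_row):
    """Partial update: the log record carries only changed columns."""
    merged = dict(base_row)
    merged.update(partial_row)  # untouched columns are taken from the base row
    return merged


def gain(full, partial):
    """Ratio of the full-update metric to the partial-update metric, to one decimal."""
    return round(full / partial, 1)


# A 100-column row where only 3 columns change, mirroring the benchmark setup.
base = {f"col{i}": 0 for i in range(100)}
partial = {"col1": 7, "col2": 8, "col3": 9}
merged = merge_partial(base, partial)
```

In this model the partial log record holds 3 values instead of 100, which matches the direction of the bytes-written result; the actual savings depend on column sizes and encoding.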
Functional Index
Relational databases allow building
indexes on functions or expressions
Accelerate queries based on results
of computations.
Hide how data is partitioned from
how data is queried.
Absorb partitioning into indexes. No
more hide-and-evolving partitions!
RFC-63
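One way to picture how an index on a function result speeds up queries is file pruning with per-file statistics. The sketch below is an illustration under assumptions, not the RFC-63 design: it models files as lists of ISO-8601 timestamp strings, and `build_hour_stats` and `prune_hour_greater_than` are hypothetical names. It records the minimum and maximum of `hour(ts)` per file, then skips files that cannot match a predicate such as `hour(ts) > 12`, regardless of how the table is partitioned.

```python
from datetime import datetime


def build_hour_stats(files):
    """files: file name -> list of ISO-8601 timestamps. Returns name -> (min_hour, max_hour)."""
    stats = {}
    for name, ts_values in files.items():
        hours = [datetime.fromisoformat(ts).hour for ts in ts_values]
        stats[name] = (min(hours), max(hours))
    return stats


def prune_hour_greater_than(stats, threshold):
    """Keep only the files that may contain rows with hour(ts) > threshold."""
    return sorted(name for name, (_, max_hour) in stats.items() if max_hour > threshold)


files = {
    "f1": ["2023-09-20T03:15:00", "2023-09-20T09:00:00"],
    "f2": ["2023-09-20T11:00:00", "2023-09-20T18:30:00"],
    "f3": ["2023-09-20T13:00:00"],
}
stats = build_hour_stats(files)
candidates = prune_hour_greater_than(stats, 12)  # f1 is skipped without being read
```

Because the statistics are computed on the function's output rather than on the raw column, the query does not need to know or care that the table is partitioned by a different column.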
Functional Index In Action
SQL Script
CREATE TABLE hudi_table_func_index (
ts STRING,
uuid STRING,
rider STRING,
driver STRING,
fare DOUBLE,
city STRING
) USING HUDI
tblproperties (primaryKey = 'uuid')
PARTITIONED BY (city);
INSERT INTO hudi_table_func_index VALUES (...);
CREATE INDEX ts_hour ON hudi_table_func_index USING
column_stats(ts) options(func='hour');
SELECT city, fare, rider, driver FROM
hudi_table_func_index WHERE city NOT IN ('chennai')
AND hour(ts) > 12;
Come Build With The Community!
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : Apache Hudi Slack Group
LinkedIn: company/apache-hudi
Twitter : https://twitter.com/apachehudi
GitHub: https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack
Thanks!
Questions?

A Hudi Live Event: Shaping a Database Experience within the Data Lake with Apache Hudi 1.0
