A Hudi Live Event: Shaping a Database Experience within the Data Lake with Apache Hudi 1.0

The Hudi Platform
Lake Storage (Cloud Object Stores, HDFS, …)
Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
Transactional Database Layer
- Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Scheduling, …)
- Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
- Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, …)
- Table Format (Schema, File listings, Stats, Evolution, …)
- Lake Cache* (Columnar, transactional, mutable, WIP, …)
- Metaserver* (Stats, table service coordination, …)
Programming API
- Readers (Snapshot, Time Travel, Incremental, etc.)
- Writers (Inserts, Updates, Deletes, Smart Layout Management, etc.)
User Interface
- Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
- Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, …)

In Industry Today
- Trading transactions: near real-time CDC from 4000+ Postgres tables at 5 mins!
- Minute-level analytics with 70% CPU savings @ exabyte scale (TikTok recommendations)
- Package deliveries: real-time event analytics at PB scale
- Streaming log ingestion and efficient GDPR deletes using Apache Hudi
- 150 source systems, ETL processing for 10,000+ tables
- Faster data access @ 75% less storage cost
- Near real-time grocery delivery tracking
- Streaming data lake for device data
- Feature Store using Hudi
- Building faster analytics for automotive data
- Uber rides: 250+ PB, from 24h+ to minutes latency on 8000+ tables
- Real-time analytics that power financial decisions
- Real-time advertising for 20M+ concurrent viewers
- Lakehouse at Fortune 1 scale
- Lakehouse architecture @ Halodoc
- Faster SLAs with low-cost data pipelines
- Cost-optimized fast analytics for sports solutions

The Community
A vibrant OSS Community
- 3800+ members
- 7000+ commits
- 431+ contributors
- 6000+ GH engagers
- 36 committers
- 19 PMCs
- Diverse PMC/Committers
- Pre-installed on 5 cloud providers
- 800B+ records/day (from even just 1 user!)
- 4700+ questions answered (in just the last 2 years!)
- 22800+ responses (in just the last 2 years!)

Opportunities
Database experience on the Lake
Deeper Query Engine Integrations
- Query engines prefer separate integrations.
- Need to maintain specific Hudi connectors.
- Improved query planning & execution with Hudi's advanced capabilities, such as multi-modal indexing.
Generalized Data Model
- Mature SQL support made possible by advancements in engines like Apache Spark & Apache Flink.
- Generalized data model for supporting keys in Hudi tables.
Serverful & Serverless
- Migrate to a hybrid architecture: serverless for data and serverful for table metadata.
- Scales well for metadata.
- Addresses evolving concurrency control needs.
Beyond Structured Data
- Support for complex, unstructured, large blobs with indexing, mutation and change capture.
- Expand to ML/AI modeling, image and video processing applications.
Enhanced self management
- Reverse streaming data
- Snapshot management
- Diagnostic reporters
- Cross-region replication
- TTL management

The Database building blocks
Main components of a DBMS. Courtesy of the seminal database paper "Architecture of a Database System".
Reference diagram highlighting existing (green) and new (yellow) Hudi components, along with external components (blue). Check out RFC-69.

LSM Tree Style Timeline
Can we support commits every minute for 10 years?
Can we organize the timeline in a better way so that it scales linearly?
Unlocks infinite time travel, time-travel writes, and non-blocking (NB) concurrency control
LSM Trees FTW!
https://github.com/google/leveldb
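As a rough illustration of why an LSM-tree layout helps a long-lived timeline, here is a toy sketch in Python. It is not Hudi's implementation: the class name `LsmTimeline`, the `fanout` parameter, and the use of plain integers as commit instants are all assumptions made for the example. Each flushed batch of instants becomes a sorted run in level 0, and whenever a level holds `fanout` runs they are merged into one run in the next level, so the number of runs to consult stays small while the history keeps growing.

```python
from dataclasses import dataclass, field


@dataclass
class LsmTimeline:
    """Toy LSM-style store of commit instants (illustrative only)."""
    fanout: int = 4                               # hypothetical: runs merged per compaction
    levels: list = field(default_factory=list)    # levels[i] is a list of sorted runs

    def append(self, instants):
        """Flush a batch of instants as a new sorted run in level 0."""
        self._add_run(0, sorted(instants))

    def _add_run(self, level, run):
        while len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].append(run)
        # When a level fills up, merge its runs into one run in the next level.
        if len(self.levels[level]) >= self.fanout:
            merged = sorted(x for r in self.levels[level] for x in r)
            self.levels[level] = []
            self._add_run(level + 1, merged)

    def run_count(self):
        """Number of runs a reader may need to open."""
        return sum(len(runs) for runs in self.levels)

    def instants_between(self, start, end):
        """Time-travel style lookup: skip runs whose range cannot overlap."""
        out = []
        for runs in self.levels:
            for run in runs:
                if run and run[0] <= end and run[-1] >= start:
                    out.extend(x for x in run if start <= x <= end)
        return sorted(out)


timeline = LsmTimeline(fanout=4)
for batch in range(100):                          # 100 flushes of 10 instants each
    timeline.append(range(batch * 10, (batch + 1) * 10))
```

With these settings, 1000 instants written in 100 flushes end up in only a handful of runs, and a range lookup touches only the runs whose first and last instants overlap the requested window.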

Non-Blocking Concurrency Control
Are we being too optimistic?
Three generally agreed-upon approaches: pessimistic, optimistic, and multi-version
Architecture of a Database System (Sec 6.2)

Non-Blocking Concurrency Control
Can we avoid the performance and cost penalties due to OCC?
One way is to enhance OCC with sophisticated techniques for early conflict detection.
How about a general-purpose, non-blocking, MVCC-based concurrency control?
Spanner's TrueTime-like global, monotonically increasing timestamps
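The contrast between OCC and a non-blocking approach can be sketched with a toy model. Everything here is an assumption for illustration, not Hudi's API: `Timeline`, `occ_commit`, and `nonblocking_commit` are hypothetical names, a simple counter stands in for a global monotonically increasing timestamp source, and file groups are represented as sets of strings. Under OCC, a writer that overlaps with a commit completed after it started loses its work; in the non-blocking variant both writers commit and the completion timestamps give readers an order in which to merge the overlapping writes.

```python
import itertools


class Timeline:
    """Toy commit timeline with monotonically increasing instants."""

    def __init__(self):
        self._clock = itertools.count(1)   # stand-in for a global monotonic timestamp source
        self.completed = []                # list of (completion_time, file_groups_touched)

    def now(self):
        return next(self._clock)


def occ_commit(timeline, start_time, file_groups):
    """Optimistic: abort if a commit completed after we started touched the same file group."""
    for done_at, touched in timeline.completed:
        if done_at > start_time and touched & file_groups:
            return None                    # conflict: the writer's work must be retried
    done = timeline.now()
    timeline.completed.append((done, set(file_groups)))
    return done


def nonblocking_commit(timeline, file_groups):
    """Non-blocking: always commit; overlapping writes are ordered by completion time."""
    done = timeline.now()
    timeline.completed.append((done, set(file_groups)))
    return done


# Two writers start concurrently and both update file group "fg-1".
occ = Timeline()
start_a, start_b = occ.now(), occ.now()
first = occ_commit(occ, start_a, {"fg-1"})    # succeeds
second = occ_commit(occ, start_b, {"fg-1"})   # None: aborted on conflict

nb = Timeline()
a = nonblocking_commit(nb, {"fg-1"})
b = nonblocking_commit(nb, {"fg-1"})          # both succeed, with a < b
```

The sketch leaves out how a reader reconciles the overlapping writes; it only shows that no writer is forced to abort.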

New Filegroup Reader and Writer
Can we do better?
Positional merging instead of key-based merging
- Improves performance when > 50% of base records are changed
First-class support for partial updates
- Reduces write amplification and read amplification
Engine-agnostic abstractions
is_partial
schema (can be partial)
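To make the two ideas on this slide concrete, here is a minimal Python sketch. It assumes records are plain dictionaries, a base file is a list of records, and the log is either a list of full records (key-based) or a mapping from row position to a new record (positional). The function names are hypothetical and do not correspond to Hudi's file group reader API.

```python
def merge_by_key(base_records, log_records, key="uuid"):
    """Key-based merge: hash every log record by key, then probe per base record."""
    updates = {rec[key]: rec for rec in log_records}   # later log records win
    return [updates.get(rec[key], rec) for rec in base_records]


def merge_by_position(base_records, positional_log):
    """Positional merge: the log maps a row position in the base file to a
    new record, or to None for a delete. No key comparison is needed."""
    merged = []
    for pos, rec in enumerate(base_records):
        if pos not in positional_log:
            merged.append(rec)                         # untouched row
        elif positional_log[pos] is not None:
            merged.append(positional_log[pos])         # updated row
        # a None entry means the row was deleted, so it is skipped
    return merged


def apply_partial_update(base_record, partial_record):
    """Partial update: the log record carries only the changed fields."""
    return {**base_record, **partial_record}


base = [
    {"uuid": "a", "fare": 10.0, "city": "sf"},
    {"uuid": "b", "fare": 20.0, "city": "nyc"},
    {"uuid": "c", "fare": 30.0, "city": "sf"},
]
by_key = merge_by_key(base, [{"uuid": "b", "fare": 25.0, "city": "nyc"}])
by_pos = merge_by_position(base, {1: {"uuid": "b", "fare": 25.0, "city": "nyc"}, 2: None})
patched = apply_partial_update(base[0], {"fare": 12.5})
```

In the partial-update case only the `fare` field would need to be written to the log, which is the intuition behind the reduced write amplification.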

Position-based Merge Benchmark
Good gains on large updates, but still on paper
- Existing implementations, such as Iceberg's, are poor: they scan the entire base file.
- Hudi PR #10167 is open to make this a reality, with filter pushdown for positional merging.
Data: MOR tables, 500GB and 1TB, with 1000 partitions. 50% of records deleted after the initial load.
Data size   Key-based query latency (ms)   Position-based query latency (ms)   Gains
500GB       9407                           8686                                12%
1TB         15030                          12534                               20%
Setup: AWS EMR cluster, 1 driver (m5.8xlarge) and 20 executors (m5.4xlarge), Apache Spark 3.3.3

Partial Update Benchmark
Game changing performance improvements!
Data: 1TB MOR table with 1000 partitions. 80% random updates in a subsequent commit after bulk loading the data. 100 fields in total in the schema, but updates touch only 3 fields.
Metric                     Full Update   Partial Update   Gains
Update latency (s)         2072          1429             1.4x
Total bytes written (GB)   891.7         12.7             70.2x
Query latency (s)          164           29               5.7x
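The "Gains" column reads as the ratio of the full-update value to the partial-update value. A short check, using only the numbers from the table above (the helper name `gain` is just for this example), reproduces the three figures:

```python
def gain(full_update, partial_update):
    """Ratio of the full-update measurement to the partial-update measurement."""
    return full_update / partial_update


rows = {
    "Update latency (s)": (2072, 1429),
    "Total bytes written (GB)": (891.7, 12.7),
    "Query latency (s)": (164, 29),
}

# Round to one decimal place, as on the slide.
gains = {metric: round(gain(full, partial), 1) for metric, (full, partial) in rows.items()}
```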

Functional Index
Relational databases allow building indexes on functions or expressions.
Accelerate queries based on the results of computations.
Hide how data is partitioned from how data is queried.
Absorb partitioning into indexes. No more hide-and-evolving partitions!
RFC-63

Functional Index In Action
SQL Script
-- Create a Hudi table keyed on uuid and partitioned by city.
CREATE TABLE hudi_table_func_index (
    ts STRING,
    uuid STRING,
    rider STRING,
    driver STRING,
    fare DOUBLE,
    city STRING
) USING HUDI
TBLPROPERTIES (primaryKey = 'uuid')
PARTITIONED BY (city);
INSERT INTO hudi_table_func_index VALUES (...);
-- Build a functional index: column stats over hour(ts).
CREATE INDEX ts_hour ON hudi_table_func_index
USING column_stats(ts) OPTIONS (func = 'hour');
-- The hour(ts) predicate below is the one the ts_hour index targets.
SELECT city, fare, rider, driver
FROM hudi_table_func_index
WHERE city NOT IN ('chennai')
  AND hour(ts) > 12;
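One way to picture what a column-stats index on `hour(ts)` buys the query above is file skipping: if the index records the minimum and maximum of `hour(ts)` per file, any file whose maximum is not greater than 12 can be skipped without being read. The Python sketch below illustrates that idea only. The file names, the `yyyy-MM-dd HH:mm:ss` timestamp format, and the helper functions are assumptions for the example and say nothing about how Hudi stores or consults the index internally.

```python
from datetime import datetime


def hour(ts):
    """Extract the hour from a timestamp string (format assumed for this example)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour


def build_index(files, func):
    """Per file, record (min, max) of func applied to every ts value in that file."""
    return {
        name: (min(func(ts) for ts in values), max(func(ts) for ts in values))
        for name, values in files.items()
    }


def files_to_scan(index, lower_exclusive):
    """Keep only files that may contain rows with func(ts) > lower_exclusive."""
    return sorted(name for name, (_, hi) in index.items() if hi > lower_exclusive)


# Hypothetical data files and the ts values they contain.
files = {
    "file-1": ["2023-09-20 03:15:00", "2023-09-20 09:40:00"],
    "file-2": ["2023-09-20 11:00:00", "2023-09-20 18:30:00"],
    "file-3": ["2023-09-20 14:05:00"],
}
index = build_index(files, hour)
candidates = files_to_scan(index, 12)   # file-1 is pruned: its max hour is 9
```

The remaining files still have to be read and filtered row by row; the index only narrows down which files are candidates.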

Come Build With The Community!
Docs: https://hudi.apache.org
Blogs: https://hudi.apache.org/blog
Slack: Apache Hudi Slack Group
LinkedIn: company/apache-hudi
Twitter: https://twitter.com/apachehudi
GitHub: https://github.com/apache/hudi/ (give us a star ⭐!)
Mailing list(s): dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack

Thanks!
Questions?
Join Hudi Slack
