Apache Hudi is an open data lake platform designed around the streaming data model. At its core, Hudi provides transactions, upserts, and deletes on data lake storage, while also enabling CDC capabilities. Hudi also provides a coherent set of table services, which can clean, compact, cluster and optimize storage layout for better query performance. Finally, Hudi's data services provide out-of-the-box support for streaming data from event systems into lake storage in near real-time.
In this talk, we will walk through an end-to-end use case for change data capture from a relational database, starting with capturing changes using the Pulsar CDC connector, and then demonstrate how you can use the Hudi deltastreamer tool to apply these changes to a table on the data lake. We will discuss various tips for operationalizing and monitoring such pipelines. We will conclude with some guidance on future integrations between the two projects, including a native Hudi/Pulsar connector and Hudi tiered storage.
2. Speaker Bio
PMC Chair/Creator of Hudi
Sr. Staff Eng @ Uber (Data Infra/Platforms, Networking)
Principal Eng @ Confluent (ksqlDB, Kafka/Streams)
Staff Eng @ LinkedIn (Voldemort, DDS)
Sr Eng @ Oracle (CDC/GoldenGate/XStream)
5. Change Data Capture
Design Pattern for Data Integration
- Not tied to any particular technology
- Deliver low-latency
System for tracking, fetching new data
- Not concerned with how to use such data
- Ideally, incremental update downstream
- Minimizing number of bits read/written per change
Change is the ONLY Constant
- Even in Computer science
- Data is immutable = Myth (well, kinda)
6. Examples of CDC
Polling an external API for new events
- Timestamps, status indicators, versions
- Simple, works for small-scale data changes
- E.g: Polling GitHub events API
Emit Events directly from Application
- Data model to encode deltas
- Scales for high-volume data changes
- E.g: Emitting sensor state changes to Pulsar
Scanning a database’s redo log
- SCN and other watermarks to extract data/metadata changes
- Operationally heavy, very high fidelity
- E.g: Using Debezium to obtain changelogs from MySQL
7. CDC vs ETL?
CDC is merely Incremental Extraction
- Not really competing concepts
- ETL needs one-time full bootstrap
CDC changes T and L significantly
- T on change streams, not just table state
- L incrementally, not just bulk reloads
8. CDC vs Streaming Processing
CDC enables Streaming ETL
- Why bulk T & L anymore?
- Process change streams
- Mutable Sinks
Reliable Stream Processing needs distributed logs
- Rewind/Replay CDC logs
- Absorb spikes/batch writes to sinks
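To make "L incrementally" and "mutable sinks" concrete, here is a small sketch of applying a change stream to a keyed sink instead of bulk reloading it. The `(op, key, row)` change shape and the dict used as a sink are assumptions for illustration only.

```python
from typing import Dict, Iterable, Optional, Tuple

Change = Tuple[str, str, Optional[dict]]  # (op, key, row); hypothetical shape


def apply_changes(sink: Dict[str, dict], changes: Iterable[Change]) -> Dict[str, dict]:
    """Apply a change stream, in order, to a mutable keyed sink."""
    for op, key, row in changes:
        if op == "delete":
            sink.pop(key, None)
        else:
            # "insert" and "update" are both upserts on a mutable sink.
            sink[key] = row
    return sink


table = {"u1": {"city": "SF"}}
apply_changes(
    table,
    [
        ("update", "u1", {"city": "NYC"}),
        ("insert", "u2", {"city": "LA"}),
        ("delete", "u1", None),
    ],
)
```

Only the three changed records are touched; the rest of the sink is never rewritten.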
9. Ideal CDC Source
Support reliable incremental consumption
Support rewinding/replay
Support ordering of changes
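The three properties above (incremental consumption, rewind/replay, ordering) are what an offset-addressed log gives you. The toy `ChangeLog` class below is a hypothetical in-memory stand-in, not the API of Pulsar or any other system.

```python
from typing import List, Tuple


class ChangeLog:
    """Append-only log; offsets give ordering, incremental reads and replay."""

    def __init__(self) -> None:
        self._entries: List[str] = []

    def append(self, change: str) -> int:
        self._entries.append(change)
        return len(self._entries) - 1  # offset of the appended change

    def read_from(self, offset: int, max_items: int = 100) -> Tuple[List[str], int]:
        """Return changes starting at `offset` and the offset to resume from."""
        batch = self._entries[offset : offset + max_items]
        return batch, offset + len(batch)


log = ChangeLog()
for change in ["insert k1", "update k1", "delete k1"]:
    log.append(change)

batch, next_offset = log.read_from(0, max_items=2)  # incremental consumption
rest, end_offset = log.read_from(next_offset)       # resume where we left off
replayed, _ = log.read_from(0)                      # rewind and replay
```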
21. Hudi Data Lake
Original pioneer of the transactional data lake movement
Embeddable, Serverless, Distributed Database abstraction layer over DFS
- We invented this!
Hadoop Upserts, Deletes & Incrementals
Provide transactional updates/deletes
First class support for record level CDC streams
23. What If: Streaming Model on Batch Data?
The Incremental Stack
+ Intelligent, Incremental
+ Fast, Efficient
+ Scans, Columnar formats
+ Scalable Compute
https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/; 2016
24. Hudi : Open Sourcing & Evolution..
2015 : Published core ideas/principles for incremental processing (O’Reilly article)
2016 : Project created at Uber & powers all database/business critical feeds @ Uber
2017 : Project open sourced by Uber & work begun on Merge-On-Read, Cloud support
2018 : Picked up adopters, hardening, async compaction..
2019 : Incubated into ASF, community growth, added more platform components.
2020 : Top level Apache project, Over 10x growth in community, downloads, adoption
2021 : SQL DMLs, Flink Continuous Queries, More indexing schemes, Metaserver, Caching
32. Delta Logs at File Level over Global
Each file group is its own self-contained log
- Constant metadata size, controlled by “retention” parameters
- Leverage append() when available; lower metadata overhead
Merges are local to each file group
- UUID keys throw off any range pruning
33. Record Indexes over Just File/Column Stats
Index maps key to a file group
- During upsert/deletes
- Much like streaming state store
Workloads have different shapes
- Late arriving updates; Totally random
- Trickle down to derived tables
Many pluggable options
- Bloom Filters + Key ranges
- HBase, Join based
- Global vs Local
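The "index maps key to a file group" idea can be sketched as a lookup that tags each incoming record as an update or an insert. This is a simplified illustration with a plain dict as the index; the function name `tag_records` and the file group ids are hypothetical, and Hudi's pluggable index implementations are not shown here.

```python
from typing import Dict, List, Tuple


def tag_records(
    index: Dict[str, str], keys: List[str], new_file_group: str
) -> List[Tuple[str, str, str]]:
    """Tag each key as an update (known file group) or an insert (new one)."""
    tagged = []
    for key in keys:
        if key in index:
            tagged.append((key, index[key], "update"))
        else:
            # Remember the location so future upserts route to the same group.
            index[key] = new_file_group
            tagged.append((key, new_file_group, "insert"))
    return tagged


index = {"k1": "fg-001", "k2": "fg-002"}
tagged = tag_records(index, ["k2", "k9"], new_file_group="fg-003")
```

Like a streaming state store, the index is consulted and updated on every write, so its lookup cost matters more for random update patterns than for late-arriving ones concentrated in recent data.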
34. MVCC Concurrency Control over Only OCC
Frequent commits => More frequent clustering/compaction => More contention
Differentiate writers vs table services
- Much like what databases do
- Table services don’t contend with writers
- Async compaction/clustering
Don’t be so “Optimistic”
- OCC b/w writers; works, until it doesn’t
- Retries, split txns, wastes resources
- MVCC/Log based between writers/table services
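Why OCC between writers "works, until it doesn't" can be seen in a toy conflict check: a writer may commit only if nothing committed since it started touched the same file groups. The `Timeline` class below is a hypothetical sketch of generic optimistic validation, not Hudi's concurrency control code.

```python
from typing import List, Set


class Timeline:
    """Toy commit timeline used to validate optimistic writers."""

    def __init__(self) -> None:
        self.commits: List[Set[str]] = []  # file groups touched, in commit order

    def begin(self) -> int:
        # A writer remembers how many commits existed when it started.
        return len(self.commits)

    def try_commit(self, start: int, touched: Set[str]) -> bool:
        # Conflict if any commit after `start` overlaps the touched file groups.
        for other in self.commits[start:]:
            if other & touched:
                return False  # the writer must retry, wasting its work
        self.commits.append(set(touched))
        return True


timeline = Timeline()
a, b, c = timeline.begin(), timeline.begin(), timeline.begin()
ok_a = timeline.try_commit(a, {"fg-1"})
ok_b = timeline.try_commit(b, {"fg-1", "fg-2"})  # overlaps with writer a
ok_c = timeline.try_commit(c, {"fg-3"})          # disjoint, so it succeeds
```

The more often long-running table services overlap with writers, the more often a check like this fails, which is the motivation for handling them with MVCC instead.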
35. Record Level Merge API over Only Overwrites
More generalized approach
- Default: overwrite w/ latest writer wins
- Support business-specific resolution
Log partial updates
- Log just the changed columns
- Drastic reduction in write amplification
Log based reconciliation
- Delete, Undelete based on business logic
- CRDT, Operational Transform like delayed conflict resolution
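A record-level merge hook can be pictured as a function from (stored record, incoming record) to the merged record. The sketch below shows three policies from this slide; the function names and the `_deleted` flag are hypothetical, and this is not Hudi's actual payload interface.

```python
from typing import Callable, Dict, Optional

Record = Dict[str, object]
Merger = Callable[[Optional[Record], Record], Optional[Record]]


def overwrite_latest(old: Optional[Record], new: Record) -> Optional[Record]:
    """Default-style policy: the incoming record replaces the stored one."""
    return new


def partial_update(old: Optional[Record], new: Record) -> Optional[Record]:
    """Only columns present in the incoming record are changed."""
    return {**(old or {}), **new}


def delete_if_flagged(old: Optional[Record], new: Record) -> Optional[Record]:
    """Business-specific policy: a hypothetical `_deleted` flag drops the record."""
    return None if new.get("_deleted") else partial_update(old, new)


def merge(table: Dict[str, Record], key: str, incoming: Record, merger: Merger) -> None:
    result = merger(table.get(key), incoming)
    if result is None:
        table.pop(key, None)
    else:
        table[key] = result


table = {"k1": {"name": "a", "fare": 10}}
merge(table, "k1", {"fare": 12}, partial_update)
after_partial = dict(table["k1"])
merge(table, "k1", {"name": "b"}, overwrite_latest)
after_overwrite = dict(table["k1"])
merge(table, "k1", {"_deleted": True}, delete_if_flagged)
```

With `partial_update`, the writer logs one column instead of the full row, which is where the reduction in write amplification comes from.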
36. Specialized Database over Generalized Format
Approach it more like a shared-nothing database
- Daemons aware of each other
- E.g: Compaction, Cleaning in RocksDB
E.g: Clustering & Compaction know each other
- Reconcile metadata based on time order
- Compactions avoid redundant scheduling
Self Managing
- Sorting, Time-order preservation, File-sizing
37. Record level CDC over File/Snapshot Diffing
Per record metadata
- _hoodie_commit_time : Kafka style compacted change streams in commit order
- _hoodie_commit_seqno: Consume large commits in chunks, ala Kafka offsets
File group design => CDC friendly
- Efficient retrieval of old, new values
- Efficient retrieval of all values for key
Infinite Retention/Lookback coming later in 2021
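How the two per-record fields support incremental consumption can be sketched in plain Python: filter by commit time past a checkpoint, order by commit time and sequence number, and hand records out in chunks. The field names come from the slide, but the row values, the integer sequence numbers and the `changes_since` helper are made up for illustration and do not reflect Hudi's actual value formats or query API.

```python
from typing import Dict, List

Row = Dict[str, object]

# Made-up rows; real commit times and sequence numbers look different.
rows: List[Row] = [
    {"_hoodie_commit_time": "20210101", "_hoodie_commit_seqno": 0, "key": "k1"},
    {"_hoodie_commit_time": "20210102", "_hoodie_commit_seqno": 0, "key": "k1"},
    {"_hoodie_commit_time": "20210102", "_hoodie_commit_seqno": 1, "key": "k2"},
    {"_hoodie_commit_time": "20210103", "_hoodie_commit_seqno": 0, "key": "k3"},
]


def changes_since(rows: List[Row], checkpoint: str, chunk_size: int) -> List[List[Row]]:
    """Return records committed after `checkpoint`, in commit order, in chunks."""
    newer = [r for r in rows if r["_hoodie_commit_time"] > checkpoint]
    newer.sort(key=lambda r: (r["_hoodie_commit_time"], r["_hoodie_commit_seqno"]))
    return [newer[i : i + chunk_size] for i in range(0, len(newer), chunk_size)]


chunks = changes_since(rows, checkpoint="20210101", chunk_size=2)
chunk_keys = [[r["key"] for r in chunk] for chunk in chunks]
```

The checkpoint plays the role of a consumer offset: a downstream job stores the last commit time it processed and resumes from there.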
39. Scalable, Multi Model Indexes
Partitions are very coarse file-level indexes
Finer grained indexes as new partitions to metadata table
- Bloom Filter, Bitmaps
- Column ranges (RFC-27)
- HFile/Hash indexes
- Search?
External indexes
- DynamoDB, Spanner + other cloud stores
- C*, Mongo and other
40. Caching
LRU Cache ala DB Buffer Pool
Frequent Commits => Small objects/blocks
- Today : Aggressive table services
- Tomorrow : File Group/Hudi file model aware caching
- Mutable data => FileSystem/Block level caches are not that effective.
Benefits
- Great performance for CDC tables
- Avoid open/close costs for small objects
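An LRU cache keyed by file group, in the spirit of a database buffer pool, might look like the sketch below. The `FileGroupCache` class, its `loader` callback and the file group ids are hypothetical; this only illustrates the eviction policy, not a planned Hudi design.

```python
from collections import OrderedDict
from typing import Callable


class FileGroupCache:
    """LRU cache keyed by file group id, in the spirit of a DB buffer pool."""

    def __init__(self, capacity: int, loader: Callable[[str], str]) -> None:
        self.capacity = capacity
        self.loader = loader  # hypothetical: reads and merges a file group
        self._items: "OrderedDict[str, str]" = OrderedDict()
        self.loads = 0

    def get(self, file_group_id: str) -> str:
        if file_group_id in self._items:
            self._items.move_to_end(file_group_id)  # mark as most recently used
            return self._items[file_group_id]
        self.loads += 1  # cache miss: pay the open/read cost once
        value = self.loader(file_group_id)
        self._items[file_group_id] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
        return value


cache = FileGroupCache(capacity=2, loader=lambda fg: f"merged view of {fg}")
for fg in ["fg-1", "fg-2", "fg-1", "fg-3", "fg-2"]:
    cache.get(fg)
cached_ids = list(cache._items)
```

Caching at the file group level means a new commit can invalidate or refresh one entry, rather than leaving stale blocks scattered across a block-level cache.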
41. Timeline Metaserver
Interesting fact : Hudi has a metaserver already
- Runs on Spark driver; Serves FileSystem RPCs + queries on timeline
- Backed by RocksDB, updated incrementally on every timeline action
- Very useful in streaming jobs
- But, still standalone
Data lakes need a new metaserver
- Flat file metastores are cool? (really?)
- Sometimes I miss HMS (sometimes..)
- Let’s learn from Cloud warehouses
47. Hudi powers one of the largest transactional data lakes on the planet @ Uber
Operated 150PB+ Data Lake platform for 4+ years
Multi engine environment with Presto, Spark, Hive, Vertica & more
Architected several data services for deletion/GDPR across 15K+ data users
Mission critical to all of Uber w/ data monitoring/schemas/quality enforcement
Hudi @ Uber
- ~8000 Tables
- 150+ PB
- 3-30 Mins Fresh
- ~1.5 PB/day
- ~850 million vcore-secs
- ~4 Engines