Pulsar Summit
San Francisco
Hotel Nikko
August 18 2022
Ecosystem
Unlocking the power of Lakehouse architectures with Apache Pulsar and Apache Hudi
Alexey Kudinkin
Founding Engineer • Onehouse
Addison Higham
Chief Architect • StreamNative
Alexey Kudinkin
Founding Engineer, Onehouse
● Founding Engineer at Onehouse
● Spent the prior 5 years at Uber (re)building its Fulfillment Platform from the ground up
@alexeykudinkin
Addison Higham
Chief Architect, StreamNative
● Member of the StreamNative team for 2 years
● Last 8 years in big data / streaming data
● Apache Pulsar Committer and member of the Pulsar community for 5.5 years
● Previously at Instructure as Data and Platform Architect
@addisonjh @addisonj
<DRAFT PLAN>
TBR
1. What are Lakehouses?
2. Overview of Apache Hudi
3. Why Pulsar for Lakehouse?
4. Apache Pulsar integration
5. Demo
Unlocking the Power of Lakehouses using Pulsar and Hudi
What are Lakehouses?
What is a Lakehouse?
On-prem data warehouses (traditional BI/reporting)
Data lake lineage, leading to the Lakehouse:
2000s - Hadoop data lakes (search/social)
2014 - Apache Spark (data science)
2016 - Apache Hudi (txns, streams)
2017 - Databricks Delta*
Cloud warehouse lineage:
2012 - BigQuery (serverless)
2013 - Redshift (cloud)
2014 - Snowflake (decoupling/UX)
*Databricks coined the term "Lakehouse"
What is a Lakehouse?
Lakehouse Architecture
Layers: Query Engine(s), Transactional Layer, Storage
Traditional Data Lakes
Query engines: Nodes A, B, C (each with Optimizer, SQL Exec, Local Cache)
Storage: Cloud Storage (S3/GCS/ABS/…), Parquet/ORC/JSON/CSV/…
Lakehouses
Query engines: Nodes A, B, C (each with Optimizer, SQL Exec, Local Cache)
Transactional layer: Table Format, Table Services, Txn Manager, Indexes
Storage: Cloud Storage (S3/GCS/ABS/…), Parquet/ORC/JSON/CSV/…, Metadata
Overview of Apache Hudi
Apache Hudi Overview
User Interface
Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, …)
Programming API
Writers (Inserts, Updates, Deletes, Smart Layout Management, etc.)
Readers (Snapshot, Time Travel, Incremental, etc.)
Transactional Database Layer
Table Format (Schema, File listings, Stats, Evolution, …)
Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, …)
Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Scheduling, …)
Lake Cache* (Columnar, transactional, mutable, WIP, …)
Metaserver* (Stats, table service coordination, …)
Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
Lake Storage (Cloud Object Stores, HDFS, …)
Apache Hudi Overview
COW, MOR, WTH?
Copy-on-Write (COW Table)
Write: Incoming Data → Versioned Base files (v1, v2)
Read: Snapshot read fetches the latest version (v2)
Merge-on-Read (MOR Table)
Write: Incoming Data → Versioned Base files + Change logs (v1 plus change logs, v1*)
Compaction: base files and change logs are merged into the next version (v2)
Read: Snapshot read fetches the latest version and merges the change log
Apache Hudi Overview
Comparing COW and MOR
Writing Cost
COW: High (one updated record → rewrites whole file)
MOR: Low (updated records persisted in change-logs)
Ingestion Latency
COW: High (see above)
MOR: Low (see above)
Querying Speed
COW: Fast (data read as is)
MOR: Slow(er) w/o compaction (updated records from change-logs have to be applied to the original ones when reading); Fast after compaction (data read as is, identical to COW)
Overall
COW: Fast querying at the expense of write amplification and slower ingestion
MOR: Fast writing, which amortizes the cost of updates across many writes
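The write-cost trade-off in the comparison above can be sketched as a toy model. This is only an illustration, not how Hudi is implemented: the object `CowVsMor`, its function names, the file size, and the batch sizes are all assumptions made up for the example. It counts records physically written when updates land on a single file.

```scala
// Toy model (not Hudi's implementation): counts records physically written
// when batches of updates land on one file holding `fileSize` records.
object CowVsMor {
  // Copy-on-Write: any update rewrites the whole base file.
  def cowRecordsWritten(fileSize: Int, updates: Int): Int =
    if (updates > 0) fileSize else 0

  // Merge-on-Read: updates are only appended to a change log.
  def morRecordsWritten(updates: Int): Int = updates

  // Merge-on-Read snapshot read: base records plus every un-compacted log record.
  def morRecordsRead(fileSize: Int, pendingLogRecords: Int): Int =
    fileSize + pendingLogRecords

  def main(args: Array[String]): Unit = {
    val fileSize = 1000000        // hypothetical records per base file
    val batches = Seq(10, 25, 5)  // hypothetical update batches

    val cowTotal = batches.map(cowRecordsWritten(fileSize, _)).sum
    val morTotal = batches.map(morRecordsWritten).sum
    // One compaction rewrites the base file once, folding in all logged updates.
    val morWithCompaction = morTotal + fileSize

    println(s"COW records written: $cowTotal")
    println(s"MOR records written before compaction: $morTotal")
    println(s"MOR records written including one compaction: $morWithCompaction")
    println(s"MOR snapshot read before compaction: ${morRecordsRead(fileSize, morTotal)}")
  }
}
```

In this model, three small update batches cost COW three full file rewrites, while MOR pays for one rewrite at compaction time; until then, every MOR snapshot read also has to process the pending log records.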
Apache Hudi Overview
Who’s using?
Uber rides - 250+ Petabytes from 24h+ to minutes latency
https://eng.uber.com/uber-big-data-platform/
Package deliveries - real-time event analytics at Petabyte scale
https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
TikTok/Bytedance recommendation system at *Exabyte* scale
http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
Trading transactions - Near real-time CDC from 4000+ postgres tables
https://s.apache.org/hudi-robinhood-talk
150 source systems, ETL processing for 10,000+ tables
https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
Real-time advertising for 20M+ concurrent viewers
https://www.youtube.com/watch?v=mFpqrVxxwKc
Store transactions - CDC & Warehousing
https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar
Why Pulsar for Lakehouse?
Why Pulsar for Lakehouse
Different domains, analogous problems
Microservices → EDA requires a comprehensive platform. Legacy messaging tech can't keep up. Streams alone are not enough for the app tier.
Moving from batch → streams and from hours → minutes of latency breaks data lakes. Legacy meta-stores and inconsistent metadata aren't sufficient.
Why Pulsar for Lakehouse
Different domains, similar goals
Pulsar is the multi-tenant real-time data platform that scales across orgs to simplify building event-driven apps with messages and streams
Lakehouse is the modern solution for providing consistent access to minute-latency data and batch data across a rich ecosystem of tools
Apache Pulsar + Apache Hudi
Apache Pulsar + Apache Hudi
Across the data ecosystem
Apache Pulsar + Apache Hudi together provide teams with a powerful solution for data across app, data, and time domains
Time scale (Real-time to Batch): milliseconds, seconds, minutes, hours, days, months, forever
Workloads along the scale: app-tier, real-time analytics, BI / batch
App-tier to Data-tier: offload topics to Hudi tables
Data-tier to App-tier: load Hudi tables to topics
Apache Pulsar + Apache Hudi = Lakehouse
Integration Options
There are currently a few ways to ingest data from Pulsar into Hudi:
1. Using Pulsar’s Apache Spark connector
2. Using DeltaStreamer utility from Hudi
3. Using StreamNative Lakehouse Sink (Beta)
Apache Pulsar + Apache Hudi = Lakehouse
Using Pulsar’s Apache Spark connector
// Assumes a spark-shell session where `spark` is already in scope
import org.apache.spark.sql.SaveMode

val topicName = "realtime-impressions"
val tableName = "rt_impressions"

// Offsets bounding the batch read. These two values are assumptions;
// the original slide leaves both undefined.
val startingOffsets = "earliest"
val endingOffsets = "latest"

// Fetch the data from Pulsar
val df =
  spark.read.format("pulsar").
    option("service.url", "pulsar://localhost:6650").
    option("topics", topicName).
    option("startingOffsets", startingOffsets).
    option("endingOffsets", endingOffsets).
    load()

// And write it into a Hudi table
df.write.format("hudi").
  option("hoodie.datasource.write.table.name", tableName).
  option("hoodie.datasource.write.operation", "bulk_insert").
  // Record keys are necessary for Hudi to efficiently perform delete/update operations
  option("hoodie.datasource.write.recordkey.field", "event_id").
  // We're creating a non-partitioned table
  option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  mode(SaveMode.Append).
  save(s"s3a://hudi-tables/$tableName")
Apache Pulsar + Apache Hudi = Lakehouse
Using Hudi’s DeltaStreamer
export TOPIC_NAME=stonks
./bin/spark-submit \
  --master 'local[2]' \
  --deploy-mode client \
  --packages io.streamnative.connectors:pulsar-spark-connector_2.12:3.1.1.4 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer <hudi.jar> \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.PulsarSource \
  --source-ordering-field ts \
  --target-base-path file:///data/tables/$TOPIC_NAME \
  --target-table $TOPIC_NAME \
  --hoodie-conf hoodie.datasource.write.recordkey.field=key \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
  --hoodie-conf hoodie.deltastreamer.source.pulsar.topic=$TOPIC_NAME \
  --hoodie-conf hoodie.deltastreamer.source.pulsar.offset.autoResetStrategy=EARLIEST \
  --hoodie-conf hoodie.deltastreamer.source.pulsar.endpoint.service.url=pulsar://localhost:6650 \
  --hoodie-conf hoodie.deltastreamer.source.pulsar.endpoint.admin.url=http://localhost:8080
Apache Pulsar + Apache Hudi = Lakehouse
Using StreamNative Lakehouse Sink
github.com/streamnative/pulsar-io-lakehouse
Hudi Sink (diagram labels): Topic w/ Schema → Hudi Sink (New Record, New Schema) → Updated Local Buffer → 1. Flush or 2. New → Table Commit (Parquet File, Metadata Change)
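One way to read the sink diagram is that records accumulate in a local buffer and a table commit happens either on a flush or when a new schema arrives. The sketch below illustrates that reading only. It is not code from pulsar-io-lakehouse: `Record`, `BufferingSink`, `maxBuffered`, and the `commit` callback are hypothetical names, and a size-based flush is an assumed trigger.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of a buffering sink; names do not come from pulsar-io-lakehouse.
final case class Record(schemaVersion: Int, payload: String)

final class BufferingSink(maxBuffered: Int, commit: Seq[Record] => Unit) {
  private val buffer = ArrayBuffer.empty[Record]
  private var currentSchema: Option[Int] = None

  def write(record: Record): Unit = {
    // Trigger 2: a schema change commits everything buffered under the old schema.
    if (currentSchema.exists(_ != record.schemaVersion)) flush()
    currentSchema = Some(record.schemaVersion)
    buffer += record
    // Trigger 1: flush once the buffer reaches its (assumed) size limit.
    if (buffer.size >= maxBuffered) flush()
  }

  def flush(): Unit =
    if (buffer.nonEmpty) {
      commit(buffer.toList) // one table commit per flushed batch
      buffer.clear()
    }
}

object BufferingSinkDemo {
  def main(args: Array[String]): Unit = {
    val commits = ArrayBuffer.empty[Seq[Record]]
    val sink = new BufferingSink(3, batch => { commits += batch; () })

    Seq(Record(1, "a"), Record(1, "b"), Record(2, "c"), Record(2, "d"), Record(2, "e"))
      .foreach(sink.write)
    sink.flush()

    commits.zipWithIndex.foreach { case (batch, i) =>
      println(s"commit $i: ${batch.map(_.payload).mkString(",")}")
    }
  }
}
```

Committing per batch instead of per record keeps the number of table commits low, and splitting batches at schema boundaries means each commit carries a single schema.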
Apache Pulsar + Apache Hudi = Lakehouse
Using StreamNative Lakehouse Sink
Current Status: Beta
Ideal for:
● Append-only tables / low volume tables
○ CoW and MoR supported, but CoW is expensive and MoR requires external compaction
● Low concurrency workloads
○ Conflicts more likely with higher concurrency
Future Work:
● Improved coordination / higher concurrency
● Read Hudi tables into topics
● Integrate Lakehouse into tiered storage
Demo
Alexey Kudinkin
Thank you!
alexey@onehouse.ai
@alexeykudinkin
Apache Pulsar + Apache Hudi = Lakehouse
Using DeltaStreamer from Hudi
Demo Script
Step #1
1. Ingest the 1st batch into Pulsar (stock ticks dataset)
2. Run DeltaStreamer (DS) to ingest the 1st batch into Hudi
3. Show the dataset (schema, data itself, counts)
4. Run DS again (no new data, so nothing to ingest)
Step #2
1. Ingest the 2nd batch into Pulsar (stock ticks dataset)
2. Run DS to ingest the 2nd batch into Hudi
3. Show the dataset (schema, data itself, counts)
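The demo steps rely on each run resuming where the previous one stopped, which is why the repeated run finds nothing to ingest. A minimal sketch of that behaviour follows, assuming each run resumes from a checkpoint recorded by the previous run. `ToyPipeline` and its members are hypothetical stand-ins rather than DeltaStreamer code, with in-memory buffers playing the roles of the Pulsar topic and the Hudi table.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of checkpointed ingestion; not DeltaStreamer's actual code.
final class ToyPipeline {
  // Stand-in for a Pulsar topic: an append-only log of messages.
  val topic: ArrayBuffer[String] = ArrayBuffer.empty[String]
  // Stand-in for the Hudi table.
  val table: ArrayBuffer[String] = ArrayBuffer.empty[String]
  // Position in the topic up to which data has already been ingested.
  private var checkpoint = 0

  // One run: read everything past the checkpoint, write it, advance the checkpoint.
  def runIngest(): Int = {
    val newMessages = topic.drop(checkpoint).toList
    table ++= newMessages
    checkpoint = topic.size
    newMessages.size
  }
}

object ToyPipelineDemo {
  def main(args: Array[String]): Unit = {
    val pipeline = new ToyPipeline
    pipeline.topic ++= Seq("tick-1", "tick-2", "tick-3") // 1st batch
    println(s"run 1 ingested ${pipeline.runIngest()} records")
    println(s"run 2 ingested ${pipeline.runIngest()} records") // no new data
    pipeline.topic ++= Seq("tick-4", "tick-5")           // 2nd batch
    println(s"run 3 ingested ${pipeline.runIngest()} records")
    println(s"table holds ${pipeline.table.size} records")
  }
}
```

Because the checkpoint only advances after a write, re-running the job is safe in this model: a run with no new messages leaves the table unchanged.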

Más contenido relacionado

La actualidad más candente

Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKTaposh Roy
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®confluent
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureKai Wähner
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoHigh Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoAlluxio, Inc.
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...Edureka!
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Apache Flink and Apache Hudi.pdf
Apache Flink and Apache Hudi.pdfApache Flink and Apache Hudi.pdf
Apache Flink and Apache Hudi.pdfdogma28
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsCloudera, Inc.
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 

La actualidad más candente (20)

Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
Session 14 - Hive
Session 14 - HiveSession 14 - Hive
Session 14 - Hive
 
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoHigh Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Apache Flink and Apache Hudi.pdf
Apache Flink and Apache Hudi.pdfApache Flink and Apache Hudi.pdf
Apache Flink and Apache Hudi.pdf
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table Snapshots
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 

Similar a Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache Hudi - Pulsar Summit SF 2022

Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/HudiVinoth Chandar
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Lambda Architecture Using SQL
Lambda Architecture Using SQLLambda Architecture Using SQL
Lambda Architecture Using SQLSATOSHI TAGOMORI
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Hadoop and OpenStack - Hadoop Summit San Jose 2014
Hadoop and OpenStack - Hadoop Summit San Jose 2014Hadoop and OpenStack - Hadoop Summit San Jose 2014
Hadoop and OpenStack - Hadoop Summit San Jose 2014spinningmatt
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsgagravarr
 

Similar a Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache Hudi - Pulsar Summit SF 2022 (20)

Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Lambda Architecture Using SQL
Lambda Architecture Using SQLLambda Architecture Using SQL
Lambda Architecture Using SQL
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Hackathon bonn
Hackathon bonnHackathon bonn
Hackathon bonn
 
Hadoop and OpenStack
Hadoop and OpenStackHadoop and OpenStack
Hadoop and OpenStack
 
Hadoop and OpenStack - Hadoop Summit San Jose 2014
Hadoop and OpenStack - Hadoop Summit San Jose 2014Hadoop and OpenStack - Hadoop Summit San Jose 2014
Hadoop and OpenStack - Hadoop Summit San Jose 2014
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 

Más de StreamNative

Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022
Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022
Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022StreamNative
 
Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...StreamNative
 
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022StreamNative
 
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022StreamNative
 
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...StreamNative
 
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...StreamNative
 
Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022
Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022
Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022StreamNative
 
Understanding Broker Load Balancing - Pulsar Summit SF 2022
Understanding Broker Load Balancing - Pulsar Summit SF 2022Understanding Broker Load Balancing - Pulsar Summit SF 2022
Understanding Broker Load Balancing - Pulsar Summit SF 2022StreamNative
 
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...StreamNative
 
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022StreamNative
 
Event-Driven Applications Done Right - Pulsar Summit SF 2022
Event-Driven Applications Done Right - Pulsar Summit SF 2022Event-Driven Applications Done Right - Pulsar Summit SF 2022
Event-Driven Applications Done Right - Pulsar Summit SF 2022StreamNative
 
Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022
Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022
Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022StreamNative
 
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022StreamNative
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache Hudi - Pulsar Summit SF 2022

  • 1. Pulsar Summit San Francisco Hotel Nikko August 18 2022 Ecosystem Unlocking the power of Lakehouse architectures with Apache Pulsar and Apache Hudi Alexey Kudinkin Founding Engineer • Onehouse Addison Higham Chief Architect • StreamNative
  • 2. Alexey Kudinkin, Founding Engineer, Onehouse ● Founding Engineer at Onehouse ● Prior 5 years spent at Uber (re)building out its Fulfillment Platform from the ground up ● @alexeykudinkin
  • 3. Addison Higham, Chief Architect, StreamNative ● Member of StreamNative team for 2 years ● Last 8 years in big-data/streaming data ● Apache Pulsar Committer and member of Pulsar community for 5.5 years ● Previously at Instructure as Data and Platform Architect ● @addisonjh @addisonj
  • 4. <DRAFT PLAN> TBR 1. What are Lakehouses? 2. Overview of Apache Hudi 3. Why Pulsar for Lakehouse? 4. Apache Pulsar integration 5. Demo
  • 5. Unlocking the Power of Lakehouses using Pulsar and Hudi: What are Lakehouses?
  • 6. What is a Lakehouse?
      Timeline: on-prem data warehouses (traditional BI/reporting); 2000s - Hadoop data lakes (search/social); 2012 - BigQuery (serverless); 2013 - Redshift (cloud); 2014 - Snowflake (decoupling/UX); 2014 - Apache Spark (data science); 2016 - Apache Hudi (txns, streams); 2017 - Databricks Delta*
      Categories shown on the slide: Cloud Warehouse, Lakehouse
      *Databricks was the one to coin the term “Lakehouse”
  • 7. What is a Lakehouse? Lakehouse architecture, by layer (query engines, transactional layer, storage)
      Traditional data lakes: SQL execution nodes (A, B, C), each with its own optimizer and local cache, over cloud storage (S3/GCS/ABS/…) holding Parquet/ORC/JSON/CSV/… files
      Lakehouses: the same query nodes and cloud storage (S3/GCS/ABS/…, Parquet/ORC/JSON/CSV/…), plus metadata and a transactional layer (table format, table services, txn manager, indexes)
  • 8. Overview of Apache Hudi
  • 9. Apache Hudi Overview
      Lake Storage (cloud object stores, HDFS, …)
      Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
      Concurrency Control (OCC, MVCC, non-blocking, lock providers, scheduling, …)
      Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
      Indexes (Bloom filter, HBase, bucket index, hash based, Lucene, …)
      Table Format (schema, file listings, stats, evolution, …)
      Lake Cache* (columnar, transactional, mutable, WIP, …)
      Metaserver* (stats, table service coordination, …)
      Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
      Platform Services (streaming/batch ingest, various sources, catalog sync, admin CLI, data quality, …)
      Layer labels: Transactional Database Layer, User Interface, Programming API
      Readers (snapshot, time travel, incremental, etc.)
      Writers (inserts, updates, deletes, smart layout management, etc.)
  • 10. Apache Hudi Overview: COW, MOR, WTH?
      Copy-on-Write (COW) table: incoming data is written as versioned base files (v1, v2). A snapshot read fetches the latest version.
      Merge-on-Read (MOR) table: incoming data is written as versioned base files plus change logs; compaction produces the next base file version (v1 → v2). A snapshot read fetches the latest version and merges the change log.
  • 11. Apache Hudi Overview: Comparing COW and MOR
      Writing cost. COW: high (one updated record → rewrites the whole file). MOR: low (updated records are persisted in change logs).
      Ingestion latency. COW: high (see above). MOR: low (see above).
      Querying speed. COW: fast (data is read as is). MOR: slow(er) without compaction (updated records from the change logs have to be applied to the original ones when reading); fast after compaction (data is read as is, identical to COW).
      Overall. COW: fast querying at the expense of write amplification and slower ingestion. MOR: fast writing, which amortizes the cost of updates across many writes.
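
The trade-off on slides 10 and 11 can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not Hudi's storage code: `CowFile`, `MorFile`, and the `(key, value)` record shape are hypothetical names made up for illustration. A COW write copies every row into a new file version, while a MOR write only appends to a change log that is merged at read time or folded in by compaction.

```scala
// Toy in-memory model of the two table types; NOT Hudi's actual implementation.
object CowVsMor {
  type Record = (String, Int) // (record key, value); hypothetical record shape

  // Copy-on-Write: every update rewrites the whole base file as a new version.
  final case class CowFile(version: Int, rows: Map[String, Int]) {
    def write(updates: Seq[Record]): CowFile =
      CowFile(version + 1, rows ++ updates) // all rows are copied, even untouched ones
    def snapshotRead: Map[String, Int] = rows // data is read as is
  }

  // Merge-on-Read: updates are appended to a change log next to the base file.
  final case class MorFile(version: Int, base: Map[String, Int], log: Vector[Record]) {
    def write(updates: Seq[Record]): MorFile =
      copy(log = log ++ updates) // cheap append, the base file is untouched
    def snapshotRead: Map[String, Int] =
      base ++ log // the change log is merged at read time; later entries win
    def compact: MorFile =
      MorFile(version + 1, snapshotRead, Vector.empty) // fold the log into a new base
  }
}
```

In this model both table types return the same snapshot after the same updates; they differ only in where the merge cost is paid (at write time for COW, at read or compaction time for MOR).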
  • 12. Apache Hudi Overview: Who’s using?
      Uber rides - 250+ petabytes, from 24h+ to minutes latency: https://eng.uber.com/uber-big-data-platform/
      Package deliveries - real-time event analytics at petabyte scale: https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
      TikTok/Bytedance recommendation system at *exabyte* scale: http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
      Trading transactions - near real-time CDC from 4000+ Postgres tables: https://s.apache.org/hudi-robinhood-talk
      150 source systems, ETL processing for 10,000+ tables: https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
      Real-time advertising for 20M+ concurrent viewers: https://www.youtube.com/watch?v=mFpqrVxxwKc
      Store transactions - CDC & warehousing: https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar
  • 13. Why Pulsar for Lakehouse?
  • 14. Why Pulsar for Lakehouse: Different domains, analogous problems. Moving from microservices to event-driven architecture (EDA) requires a comprehensive platform. Legacy messaging tech can’t keep up, and streams alone are not enough for the app tier. Moving from batch to streams, and from hours to minutes of latency, breaks data lakes. Legacy metastores and inconsistent metadata aren’t sufficient.
  • 15. Why Pulsar for Lakehouse: Different domains, similar goals. Pulsar is the multi-tenant real-time data platform that scales across orgs to simplify building event-driven apps with messages and streams. Lakehouse is the modern solution to providing consistent access to minute-latency data and batch data across a rich ecosystem of tools.
  • 16. Apache Pulsar + Apache Hudi
  • 17. Apache Pulsar + Apache Hudi: Across the data ecosystem. Apache Pulsar + Apache Hudi together provide teams with a powerful solution for data across app, data, and time domains.
      Time axis, from real-time to batch: milliseconds, seconds, minutes, hours, days, months, forever
      Workloads along the axis: app-tier, real-time analytics, BI / batch
      Data flows between the app tier and the data tier: offload topic to Hudi tables; load Hudi tables to topic
  • 18. Apache Pulsar + Apache Hudi = Lakehouse: Integration Options. There are currently a few ways to ingest data from Pulsar using Spark and Hudi:
      1. Using Pulsar’s Apache Spark connector
      2. Using the DeltaStreamer utility from Hudi
      3. Using StreamNative Lakehouse Sink (Beta)
  • 19. Apache Pulsar + Apache Hudi = Lakehouse: Using Pulsar’s Apache Spark connector
      import org.apache.spark.sql.SaveMode

      val topicName = "realtime-impressions"
      val tableName = "rt_impressions"

      // Fetching the data from Pulsar
      // (startingOffsets and endingOffsets are not defined on the slide; the caller has to supply them)
      val df = spark.read.format("pulsar").
        option("service.url", "pulsar://localhost:6650").
        option("topics", topicName).
        option("startingOffsets", startingOffsets).
        option("endingOffsets", endingOffsets).
        load()

      // And writing it into Hudi table
      df.write.format("hudi").
        option("hoodie.datasource.write.table.name", tableName).
        option("hoodie.datasource.write.operation", "bulk_insert").
        // Record keys are necessary for Hudi to efficiently perform delete/update operations
        option("hoodie.datasource.write.recordkey.field", "event_id").
        // We're creating a non-partitioned table
        option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
        mode(SaveMode.Append).
        save(s"s3a://hudi-tables/$tableName")
  • 20. Apache Pulsar + Apache Hudi = Lakehouse: Using Hudi’s DeltaStreamer
      export TOPIC_NAME=stonks

      ./bin/spark-submit \
        --master 'local[2]' \
        --deploy-mode client \
        --packages io.streamnative.connectors:pulsar-spark-connector_2.12:3.1.1.4 \
        --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer <hudi.jar> \
        --table-type COPY_ON_WRITE \
        --source-class org.apache.hudi.utilities.sources.PulsarSource \
        --source-ordering-field ts \
        --target-base-path file:///data/tables/$TOPIC_NAME \
        --target-table $TOPIC_NAME \
        --hoodie-conf hoodie.datasource.write.recordkey.field=key \
        --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
        --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
        --hoodie-conf hoodie.deltastreamer.source.pulsar.topic=$TOPIC_NAME \
        --hoodie-conf hoodie.deltastreamer.source.pulsar.offset.autoResetStrategy=EARLIEST \
        --hoodie-conf hoodie.deltastreamer.source.pulsar.endpoint.service.url=pulsar://localhost:6650 \
        --hoodie-conf hoodie.deltastreamer.source.pulsar.endpoint.admin.url=http://localhost:8080
  • 21. Apache Pulsar + Apache Hudi = Lakehouse: Using StreamNative Lakehouse Sink (github.com/streamnative/pulsar-io-lakehouse)
      Diagram labels: Hudi Sink; Topic w/ Schema; Metadata Change; Parquet File; New Schema; New Record; Updated Local Buffer; 1. Flush or 2. New Table Commit
  • 22. Apache Pulsar + Apache Hudi = Lakehouse: Using StreamNative Lakehouse Sink. Current Status: Beta
      Ideal for:
      ● Append-only tables / low volume tables
        ○ CoW and MoR supported, but CoW is expensive, MoR requires external compaction
      ● Low concurrency workloads
        ○ Conflicts more likely with higher concurrency
      Future Work:
      ● Improved coordination / higher concurrency
      ● Read Hudi tables into topics
      ● Integrate Lakehouse into tiered storage
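
The diagram on slide 21 suggests that the sink buffers incoming records locally and turns the buffer into table commits. The sketch below is one possible reading of that flow, not the pulsar-io-lakehouse implementation: `SinkState`, `Message`, `Commit`, the `maxBuffered` threshold, and the two commit triggers (a full buffer, or a record arriving with a different schema) are all assumptions made for illustration.

```scala
// Toy model of a buffering sink; every name and trigger here is hypothetical.
object BufferingSink {
  final case class Message(schemaVersion: Int, payload: String)
  final case class Commit(schemaVersion: Int, records: Vector[String])

  final case class SinkState(
      maxBuffered: Int,                       // assumed flush threshold
      schemaVersion: Int,                     // schema of the records currently buffered
      buffer: Vector[String] = Vector.empty,  // the local buffer
      commits: Vector[Commit] = Vector.empty  // table commits produced so far
  ) {
    // Turn the buffered records into one table commit (no-op on an empty buffer).
    def flush: SinkState =
      if (buffer.isEmpty) this
      else copy(buffer = Vector.empty, commits = commits :+ Commit(schemaVersion, buffer))

    def accept(msg: Message): SinkState = {
      // Assumed trigger: a new schema first commits what was buffered under the old one.
      val current =
        if (msg.schemaVersion != schemaVersion) flush.copy(schemaVersion = msg.schemaVersion)
        else this
      val buffered = current.copy(buffer = current.buffer :+ msg.payload)
      // Assumed trigger: a regular flush once the buffer is full.
      if (buffered.buffer.size >= maxBuffered) buffered.flush else buffered
    }
  }
}
```

Feeding messages through `accept` with `foldLeft` shows why such a design fits append-only, low-volume tables: every flush is a separate table commit, so small buffers or frequent schema changes translate directly into many small commits.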
  • 23. Demo
  • 24. Thank you! Alexey Kudinkin, alexey@onehouse.ai, @alexeykudinkin
  • 25. Apache Pulsar + Apache Hudi = Lakehouse: Using DeltaStreamer from Hudi. Demo Script
      Step #1
      1. Ingest 1st batch to Pulsar (stock ticks dataset)
      2. Run DeltaStreamer (DS), which ingests the 1st batch into Hudi
      3. Show the dataset (schema, data itself, counts)
      4. Run DS again (no new data, nothing to ingest)
      Step #2
      1. Ingest 2nd batch to Pulsar (stock ticks dataset)
      2. Run DS, which ingests the 2nd batch into Hudi
      3. Show the dataset (schema, data itself, counts)
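
The demo script relies on each DeltaStreamer run picking up only what arrived since the previous run. A minimal sketch of that checkpointing idea follows, assuming a topic can be modeled as an append-only log and the table remembers the offset it has ingested up to. `Topic`, `Table`, and `run` are hypothetical stand-ins, not DeltaStreamer or Pulsar APIs.

```scala
// Toy model of checkpointed incremental ingestion; NOT DeltaStreamer's real code.
object CheckpointedIngest {
  // Hypothetical stand-in for a Pulsar topic: an append-only log of messages.
  final case class Topic(messages: Vector[String]) {
    def publish(batch: Seq[String]): Topic = Topic(messages ++ batch)
  }

  // Hypothetical stand-in for the target table plus its stored checkpoint,
  // i.e. the offset up to which the topic has already been ingested.
  final case class Table(rows: Vector[String] = Vector.empty, checkpoint: Int = 0)

  // One run: read everything after the checkpoint, append it, advance the checkpoint.
  // Returns the updated table and the number of newly ingested messages.
  def run(topic: Topic, table: Table): (Table, Int) = {
    val fresh = topic.messages.drop(table.checkpoint)
    (Table(table.rows ++ fresh, topic.messages.size), fresh.size)
  }
}
```

Mirroring the script: the first `run` after publishing a batch ingests it, an immediate second `run` ingests zero messages and leaves the table unchanged, and a `run` after the second batch ingests only that batch.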