[Live] Incremental data processing with Hudi & Spark + dbt
December 06, 2023
Shiyan Xu
Apache Hudi PMC
Speaker Bio
Shiyan Xu
❏ PMC member @ Apache Hudi
❏ Open Source Engineer @ Onehouse
❏ ex Tech Lead Manager @ Zendesk
in/xushiyan
@rshiyanxu
blog.datumagic.com
The medallion architecture
Medallion Architecture Overview
So, what does it take to build a medallion architecture?
Challenges in the Medallion Architecture
But … what if you can simplify the medallion architecture?
Simplified architecture with Apache Hudi
Apache Hudi Overview
[Diagram labels: Lakehouse Platform; Apache Kafka; Raw, Cleaned, Derived; Open Formats; CDC Incremental Change Feed; Transactions + Concurrency; Managed Perf Tuning; Auto Catalog Sync (AWS Glue Data Catalog, Metastore, BigQuery, + many more catalogs); Merge-On-Read; Stream Writers; +++ More]
Incremental processing with Spark + dbt
dbt overview
[Diagram labels: Apache Kafka; Raw, Cleaned, Derived; Lakehouse storage; Extract & Load; Transform]
dbt (data build tool)
● handles the T in ELT
● compiles and runs SQL with engines like Spark
Read more: What, exactly, is dbt?
dbt project structure
● tells dbt the project context
● lets dbt know how to build a specific data set
● defines transformations between data sets
● defines data set schemas
● contains compiled/runtime SQL
dbt case study: update user profiles
Profile update events → Raw updates → Profiles → Profile changes → Downstream jobs
dbt case study: update user profiles
-- raw_updates.sql
{{
  config(
    materialized='incremental',
    file_format='hudi',
    incremental_strategy='insert_overwrite'
  )
}}
with source_data as (
  select '101' as user_id, 'A' as city, unix_timestamp() as updated_at
  union all
  select '102' as user_id, 'B' as city, unix_timestamp() as updated_at
  union all
  select '103' as user_id, 'C' as city, unix_timestamp() as updated_at
)
select *
from source_data

-- query the resulting table
select user_id, city, updated_at from raw_updates
+-------+----+----------+
|user_id|city|updated_at|
+-------+----+----------+
| 101| A|1701083620|
| 103| C|1701083620|
| 102| B|1701083620|
| 101| D|1701084137|
| 102| E|1701084365|
| 103| F|1701084369|
+-------+----+----------+
dbt case study: update user profiles
-- profiles.sql
{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    merge_update_columns=['city', 'updated_at'],
    unique_key='user_id',
    file_format='hudi',
    options={
      'type': 'cow',
      'primaryKey': 'user_id',
      'preCombineField': 'updated_at',
      'hoodie.table.cdc.enabled': 'true'
    }
  )
}}
with new_updates as (
  select user_id, city, updated_at from {{ ref('raw_updates') }}
  {% if is_incremental() %}
  where updated_at > (select max(updated_at) from {{ this }})
  {% endif %}
)
select user_id, city, updated_at from new_updates

-- query the resulting table
select user_id, city, updated_at from profiles
+-------+----+----------+
|user_id|city|updated_at|
+-------+----+----------+
| 101| D|1701084137|
| 102| E|1701084365|
| 103| F|1701084369|
+-------+----+----------+
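The combined effect of the incremental filter and the merge on user_id can be sketched in plain Python. This is only a simulation of the semantics, not how dbt or Hudi implement them: `incremental_merge` is a hypothetical helper, and splitting the six raw_updates rows into two runs is an assumption inferred from their timestamps.

```python
# Plain-Python simulation of the profiles model; not dbt or Hudi internals.
# `incremental_merge` is a hypothetical helper named for this sketch only.

def incremental_merge(target, source, key="user_id", ordering="updated_at"):
    """Return the target rows after one incremental run over `source`."""
    if target:
        # Mirrors: where updated_at > (select max(updated_at) from {{ this }})
        high_water_mark = max(row[ordering] for row in target)
        source = [row for row in source if row[ordering] > high_water_mark]
    # Mirrors the merge on unique_key: one row per key, newest ordering value wins.
    merged = {row[key]: row for row in target}
    for row in source:
        current = merged.get(row[key])
        if current is None or row[ordering] >= current[ordering]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda row: row[key])


def make_row(user_id, city, updated_at):
    return {"user_id": user_id, "city": city, "updated_at": updated_at}


# Assumption: raw_updates was built in two runs (cities A/B/C, then D/E/F).
raw_updates_run_1 = [
    make_row("101", "A", 1701083620),
    make_row("103", "C", 1701083620),
    make_row("102", "B", 1701083620),
]
raw_updates_run_2 = raw_updates_run_1 + [
    make_row("101", "D", 1701084137),
    make_row("102", "E", 1701084365),
    make_row("103", "F", 1701084369),
]

profiles = incremental_merge([], raw_updates_run_1)        # first (full) run
profiles = incremental_merge(profiles, raw_updates_run_2)  # incremental run

for row in profiles:
    print(row["user_id"], row["city"], row["updated_at"])
# 101 D 1701084137
# 102 E 1701084365
# 103 F 1701084369
```

Running the second step again leaves the rows unchanged, because no source row is newer than the current maximum updated_at.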
dbt case study: update user profiles
-- profile_changes.sql
{{
  config(
    materialized='incremental',
    file_format='hudi'
  )
}}
with new_changes as (
  select
    GET_JSON_OBJECT(after, '$.user_id') AS user_id,
    GET_JSON_OBJECT(after, '$.city') AS new_city,
    ts_ms as process_ts
  from hudi_table_changes('dbt_example_cdc.profiles', 'cdc',
    from_unixtime(unix_timestamp() - 3600 * 24, 'yyyyMMddHHmmss'))
  {% if is_incremental() %}
  where ts_ms > (select max(process_ts) from {{ this }})
  {% endif %}
)
select user_id, new_city, process_ts
from new_changes

-- query the resulting table
select user_id, new_city from profile_changes
+-------+--------+
|user_id|new_city|
+-------+--------+
| 102| E|
| 103| F|
| 101| D|
+-------+--------+
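The two moving parts of profile_changes.sql, the 24-hour lookback instant and the filter on process_ts, can be sketched in plain Python. This is a simulation only: `cdc_start_instant` and `new_profile_changes` are hypothetical helpers, the CDC rows and their ts_ms values are illustrative, and UTC is assumed where Spark would use the session time zone.

```python
# Plain-Python simulation of profile_changes.sql; not Hudi or dbt internals.
import json
from datetime import datetime, timezone


def cdc_start_instant(now_epoch_s, lookback_s=3600 * 24):
    """Mirror from_unixtime(unix_timestamp() - 3600 * 24, 'yyyyMMddHHmmss').

    Assumption: UTC; Spark formats in the session time zone instead.
    """
    start = datetime.fromtimestamp(now_epoch_s - lookback_s, tz=timezone.utc)
    return start.strftime("%Y%m%d%H%M%S")


def new_profile_changes(cdc_rows, processed):
    """Return changes not yet in `processed`, reading fields from `after`."""
    # Mirrors: where ts_ms > (select max(process_ts) from {{ this }})
    last_ts = max((row["process_ts"] for row in processed), default=None)
    changes = []
    for row in cdc_rows:
        if last_ts is not None and row["ts_ms"] <= last_ts:
            continue
        # Mirrors GET_JSON_OBJECT(after, '$.field'); a missing image gives None.
        after = json.loads(row["after"]) if row["after"] else {}
        changes.append({
            "user_id": after.get("user_id"),
            "new_city": after.get("city"),
            "process_ts": row["ts_ms"],
        })
    return changes


# Illustrative CDC rows; the ts_ms values are made up for this sketch.
cdc_rows = [
    {"ts_ms": 100, "after": json.dumps({"user_id": "101", "city": "D"})},
    {"ts_ms": 200, "after": json.dumps({"user_id": "102", "city": "E"})},
]
profile_changes = new_profile_changes(cdc_rows, [])  # first run takes everything

cdc_rows.append({"ts_ms": 300, "after": json.dumps({"user_id": "103", "city": "F"})})
latest = new_profile_changes(cdc_rows, profile_changes)  # incremental run
profile_changes += latest

print(cdc_start_instant(1701084369))
print([(c["user_id"], c["new_city"]) for c in latest])
```

The incremental run picks up only the row for user 103, since the first two rows are not newer than the largest process_ts already stored.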
dbt docs UI
dbt x Hudi recap
● dbt supports incremental & merge semantics
● Hudi's CDC feature exposes row-level changes (such as the after image of each record) and fits dbt's incremental model
● Incremental processing brings efficiency & cost savings
● Sample code @ https://github.com/apache/hudi/tree/master/hudi-examples/hudi-examples-dbt
Come Build With The Community!
Check out the Hudi docs 🔖
Give us a star on GitHub ⭐
Join Hudi Slack 👥
Follow us on LinkedIn!
Join our Twitter Community!
Subscribe to our mailing list (send an empty email to subscribe) 📩
Subscribe to the Apache Hudi YouTube channel
Thanks!
Questions?
Join Hudi Slack
in/xushiyan
@rshiyanxu
blog.datumagic.com
