A High Performance Mutable Engagement Activity Delta Lake
Zhidong Ke, Heng Zhang
● Engagement Delta Lake
● Pipeline Requirements
● Pipeline Design
● Performance Benchmarking Results
● Q & A
Agenda
● At Salesforce, our customers use High Velocity Sales (HVS) to intelligently convert
leads and create new opportunities
● We built the Engagement Activity Platform (EAP) to capture and store user engagement
activities
● The engagement activity delta lake is the key component of EAP
● This volume of data can only be served at scale by the engagement activity delta lake,
built on top of Delta Lake
What Is the Engagement Activity Delta Lake
Key Use Case of Engagement Delta Lake
● Use engagement metrics/rates to identify which cadences and templates are most
effective
● Use engagement signals such as open/reply rates to identify which customers are most
engaged
● Leverage the engagement dashboard to drive intelligence into sales productivity
Delta Lake Requirements
● Independent Stream Process to Support Engagement Data Lifecycle
● Downstream Batch/Incremental Read
● High Throughput Transactions in Engagement ID Mutation
● High Data Consistency and Integrity
Independent Stream Process to Support Engagement Data Lifecycle
● We created a separate Notification Table, partitioned by organization ID and ingestion
timestamp
● Downstream consumers can use streaming mode:
○ Pull from the Notification Table to get delta-change metadata (table
name/orgId/timestamp)
○ Use that metadata to pull the engagement data from the Data Table
● Consumers can also read directly from the Data Table in batch mode (a minimal sketch
follows)
Downstream Batch/Incremental Read
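The streaming read path described above can be sketched in PySpark roughly as follows; the table paths and the column names (table_name, org_id, ingest_ts, organizationId, ingestTimestamp) are illustrative assumptions, not the production schema.

```python
# A minimal sketch of the consumer pattern: stream delta-change metadata from
# the Notification Table, then use it to pull the matching engagement rows
# from the Data Table. Paths and column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("engagement-consumer").getOrCreate()

notifications = (
    spark.readStream.format("delta")
    .load("/delta/engagement/notification_table")  # assumed path
)

def read_delta_slice(batch_df, batch_id):
    # Each notification row carries delta-change metadata:
    # table name, orgId, and ingestion timestamp.
    for row in batch_df.select("table_name", "org_id", "ingest_ts").collect():
        slice_df = (
            spark.read.format("delta")
            .load("/delta/engagement/data_table")  # assumed path
            .where(f"organizationId = '{row.org_id}'")
            .where(f"ingestTimestamp = '{row.ingest_ts}'")
        )
        # downstream processing of slice_df goes here

(
    notifications.writeStream
    .foreachBatch(read_delta_slice)
    .option("checkpointLocation", "/checkpoints/engagement-consumer")
    .start()
)
```

A batch consumer would simply skip the Notification Table and call spark.read on the Data Table directly.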
Downstream Batch/Incremental Read
● We extended this design pattern to the mutation/TTL/GDPR jobs
● We keep insert/update/delete counters per batch for auditing (see the sketch below)
Notification Table
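A small illustrative sketch of the per-batch audit counters, assuming a hypothetical op column that tags each row as an insert, update, or delete; the production Notification Table schema is not shown in the deck.

```python
# Illustrative only: each micro-batch appends one Notification Table row per
# org carrying the table name, ingestion timestamp, batch id, and the
# insert/update/delete counters. The "op" column and schema are assumptions.
from pyspark.sql import functions as F

def write_notification(batch_df, batch_id, table_name):
    counters = (
        batch_df.groupBy("organizationId")
        .agg(
            F.sum(F.when(F.col("op") == "insert", 1).otherwise(0)).alias("inserts"),
            F.sum(F.when(F.col("op") == "update", 1).otherwise(0)).alias("updates"),
            F.sum(F.when(F.col("op") == "delete", 1).otherwise(0)).alias("deletes"),
        )
        .withColumn("table_name", F.lit(table_name))
        .withColumn("batch_id", F.lit(batch_id))
        .withColumn("ingest_ts", F.current_timestamp())
    )
    counters.write.format("delta").mode("append").save(
        "/delta/engagement/notification_table"  # assumed path
    )
```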
High Throughput Transactions in Engagement ID Mutation
Engagement ID Mutation
● Supported operations: Convert, Merge, and Delete
● Convert: a lead L can become a contact C with a new ID, and all
engagements that belong to L receive a new Engagement ID (see the merge sketch below)
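As an illustration of the Convert case, here is a hedged Delta Lake MERGE sketch that rewrites engagements from the old lead ID to the new contact ID; the table paths and the relatedRecordId/oldId/newId column names are assumptions, not the production job.

```python
# Hedged sketch of applying a Convert mutation with a Delta Lake MERGE:
# every engagement that points at lead L is rewritten to point at contact C.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

engagements = DeltaTable.forPath(spark, "/delta/engagement/data_table")
mutations = spark.read.format("delta").load("/delta/engagement/mutation_request_table")

(
    engagements.alias("e")
    .merge(
        mutations.alias("m"),
        "e.organizationId = m.organizationId AND e.relatedRecordId = m.oldId",
    )
    .whenMatchedUpdate(set={"relatedRecordId": "m.newId"})
    .execute()
)
```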
High Throughput Transactions in Engagement ID Mutation
Engagement ID Mutation
(Diagram: Mutation Request Table and EngagementData Table schemas; Id: string)
High Throughput Transactions in Engagement ID Mutation
Use Graph To Detect Cascading Mutation
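The cascading case arises when mutations chain (for example, lead A converts to contact B, which is later merged into C). Below is a plain-Python sketch of collapsing such chains; it stands in for the graph-based detection on this slide and is not the production implementation.

```python
# Collapse chained ID mappings (A -> B, B -> C) so every original ID resolves
# directly to its final ID; revisiting an already-seen node stops the walk so
# cycles cannot loop forever.
def resolve_cascading(mappings: dict) -> dict:
    resolved = {}
    for start in mappings:
        target, seen = start, set()
        while target in mappings and target not in seen:
            seen.add(target)
            target = mappings[target]
        resolved[start] = target
    return resolved

# Example: lead A converts to contact B, which is later merged into C.
print(resolve_cascading({"A": "B", "B": "C"}))  # {'A': 'C', 'B': 'C'}
```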
High Throughput Transactions in Engagement ID Mutation
Partitioned by OrgId and Z-Ordered by Engagement Date
● Goal: keep the Engagement table data evenly distributed across
reasonably sized files
● Data is written to a per-org partition directory and clustered by the Z-Order
column (engagement date)
● Benefits:
○ Manages granularity: small files per org/date are compacted into
larger ones, which reduces the number of small files
○ We can tune the target file size with spark.databricks.delta.optimize.maxFileSize
(see the sketch below)
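A minimal sketch of this layout, reusing the hypothetical paths and column names from the earlier sketches: write partitioned by org, then compact and Z-Order by engagement date using the config key quoted above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# 128 MB target file size for OPTIMIZE, matching the benchmark setting later in the deck.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 128 * 1024 * 1024)

def write_and_optimize(engagement_df):
    # Write into a per-org partition directory...
    (
        engagement_df.write.format("delta")
        .partitionBy("organizationId")
        .mode("append")
        .save("/delta/engagement/data_table")
    )
    # ...then compact small files and cluster them by the Z-Order column.
    spark.sql(
        "OPTIMIZE delta.`/delta/engagement/data_table` ZORDER BY (engagementDate)"
    )
```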
High Throughput Transactions in Engagement ID Mutation
Query by I/O Pruning -- Data Skipping and Z-Order
● Data Skipping
○ Delta Lake automatically maintains the min and max values for up
to 32 fields in a Delta table and stores those values as part of the
metadata
○ By leveraging these min-max ranges, Delta Lake can skip files whose
ranges fall outside the query’s filter values
● Z-Order
○ To make data skipping effective, data can be clustered by the Z-Order
columns so that min-max ranges are narrow and, ideally,
non-overlapping (an example query follows)
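For example, a query shaped like the one below can exploit both partition pruning on organizationId and min/max data skipping on the Z-Ordered engagement date; the path, column names, and org ID value are assumptions carried over from the earlier sketches.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

january = (
    spark.read.format("delta")
    .load("/delta/engagement/data_table")
    # Partition pruning: only the matching org directory is scanned.
    .where("organizationId = '00D000000000001'")
    # Data skipping: files whose engagementDate min/max range falls outside
    # January are skipped entirely.
    .where("engagementDate >= '2021-01-01' AND engagementDate < '2021-02-01'")
)
january.count()
```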
High Throughput Transactions in Engagement ID Mutation
Query by I/O Pruning -- Data Skipping and Z-Order
High Data Consistency and Integrity
Exactly-Once Write Across Tables
● A Checkpoint Store holds the start offset, end offset, Kafka metadata, and last job state
for a given checkpoint
● We created a Batch Metadata Store that records the job name, batch ID (the last
successful batch ID provided by the Spark foreachBatch API), process name, and
last-modified timestamp (a minimal sketch of the idempotency check follows)
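A minimal sketch of how such a batch metadata check could make the two table writes idempotent inside foreachBatch; the store path, schema, and the simplification of the metadata update to a plain append are assumptions, not the production design.

```python
# Each write process compares the incoming batch_id with the last one it
# recorded and skips work it has already committed, so a replayed micro-batch
# does not double-write either table.
import time

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
METADATA_PATH = "/delta/engagement/batch_metadata"  # assumed path

def last_committed(job_name, process_name):
    row = (
        spark.read.format("delta").load(METADATA_PATH)
        .where(f"job_name = '{job_name}' AND process_name = '{process_name}'")
        .agg(F.max("batch_id").alias("max_id"))
        .first()
    )
    return row.max_id if row.max_id is not None else -1

def write_once(df, batch_id, job_name, process_name, target_path):
    if batch_id <= last_committed(job_name, process_name):
        return  # this process already committed this micro-batch; skip the replay
    df.write.format("delta").mode("append").save(target_path)
    record = spark.createDataFrame(
        [(job_name, batch_id, process_name, int(time.time()))],
        "job_name string, batch_id long, process_name string, last_modified long",
    )
    record.write.format("delta").mode("append").save(METADATA_PATH)

def process_batch(df, batch_id):
    # One micro-batch writes to both tables; each write is guarded separately,
    # which yields the happy/unhappy rows shown on the next two slides.
    write_once(df, batch_id, "Ingestion", "data_ingestion", "/delta/engagement/data_table")
    write_once(df, batch_id, "Ingestion", "data_notification", "/delta/engagement/notification_table")
```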
High Data Consistency and Integrity
Exactly-Once Write Across Tables
Happy Path Flow
job_name batch_id process_name last_modified
Ingestion 10 data_ingestion 1611683367
Ingestion 10 data_notification 1611673367
High Data Consistency and Integrity
Exactly-Once Write Across Tables
Unhappy Path Flow
job_name batch_id process_name last_modified
Ingestion 10 data_ingestion 1611683367
Ingestion 9 data_notification 1611673367
High Data Consistency and Integrity
Global Synchronization and Ordering
● Avoid Conflicting Commits
● Ensure Engagement Lifecycle Order:
○ Ingestion -> Mutation -> Deletion
● Applied per Micro-Batch
High Data Consistency and Integrity
Global Synchronization and Ordering
● Global Synchronization
○ ZK Distributed Lock
● Ordering
○ Compare & Swap
High Data Consistency and Integrity
Global Synchronization and Ordering
1. The streaming job starts and the Job Coordinator is
initialized with Zookeeper.
2. The streaming job pulls data from Kafka periodically
and starts a micro-batch process when messages
arrive in Kafka.
3. Within a micro-batch process, the Job Coordinator
first tries to obtain a distributed lock with the
resource name set in
job.coordinator.lock.name. If it cannot obtain the
lock within a given time, it gives up; the next pull
will start from the last checkpoint.
4. Once it obtains the lock, it reads the
Predecessor field and compares it with the
expected value set in job.coordinator.predecessor.
(4.1) If the predecessor is not the expected one, it
gives up this turn, releases the lock, and the next
pull will start from the last checkpoint. (4.2) If the
predecessor is the expected one, it registers its name,
set in job.coordinator.name.
5. The micro-batch process starts.
6. The checkpoint is saved. (A sketch of this flow follows.)
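A hedged sketch of this flow using ZooKeeper via the kazoo client; the znode paths, the config values, and the reading of "registers its name" as updating the Predecessor field are assumptions rather than the talk's actual Job Coordinator implementation.

```python
# Take a distributed lock, check that the expected predecessor ran last,
# register our own name, then run the micro-batch work.
from kazoo.client import KazooClient
from kazoo.exceptions import LockTimeout

LOCK_NAME = "engagement-lifecycle"   # job.coordinator.lock.name (assumed value)
MY_NAME = "mutation"                 # job.coordinator.name (assumed value)
EXPECTED_PREDECESSOR = "ingestion"   # job.coordinator.predecessor (assumed value)

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
zk.ensure_path("/eap/coordinator/predecessor")

def coordinated_micro_batch(batch_df, batch_id):
    lock = zk.Lock(f"/eap/locks/{LOCK_NAME}", identifier=MY_NAME)
    try:
        # Step 3: try to obtain the distributed lock within a bounded time.
        lock.acquire(timeout=30)
    except LockTimeout:
        return  # give up this turn; the next pull restarts from the checkpoint
    try:
        # Step 4: compare-and-swap on the Predecessor field.
        data, _ = zk.get("/eap/coordinator/predecessor")
        if data.decode("utf-8") != EXPECTED_PREDECESSOR:
            return  # 4.1: unexpected predecessor; release the lock and skip
        zk.set("/eap/coordinator/predecessor", MY_NAME.encode("utf-8"))  # 4.2
        # Step 5: the actual micro-batch work runs here.
        batch_df.write.format("delta").mode("append").save("/delta/engagement/data_table")
        # Step 6: Spark persists the streaming checkpoint after foreachBatch returns.
    finally:
        lock.release()
```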
Engagement ID Mutation Performance Benchmarking Result
● 28 million update/delete requests processed within 8 minutes
● Cluster: 32 x i3.8xlarge
● spark.databricks.delta.optimize.maxFileSize = 128 MB
Resources
● Engagement Activity Delta Lake
○ Blog, Video
● Boost Delta Lake Performance with Data Skipping and Z-Order
○ Blog, Video
● Global Synchronization and Ordering in Delta Lake
○ Blog, Video
Thank You