Harikrishnan Kunhumveettil & Mathan Pillai
Operating and Supporting Delta
Lake in Production
Who are we?

Harikrishnan Kunhumveettil
Currently: Sr. TSE, Databricks. Areas: Spark SQL, Delta, Structured Streaming
Previously: Sr. TSE, MapR; Hadoop Tech Lead, Nielsen

Mathan Pillai
Currently: Sr. TSE, Databricks. Areas: Spark SQL, Delta, Structured Streaming
Previously: Tech Lead, Intersys Consulting; Sr. Big Data Consultant, Saama Technologies
Agenda
■ Delta Lake in Production - Data
○ Optimize and Auto-Optimize - Overview
○ Choosing the right strategy - The What
○ Choosing the right strategy - The When
○ Choosing the right strategy - The Where
■ Delta Lake in Production - Metadata
○ Sneak Peek Into Delta Log
○ Delta Log Configs
○ Delta Log Misconception
○ Delta Log Exceptions
○ Tips & Tricks
Delta Lake in Production - Data
Optimize and Auto-Optimize - In a nutshell

OPTIMIZE
▪ Bin-packing/compaction
▪ Handles the small-file problem
▪ Idempotent
▪ Incremental
▪ Creates files of 1 GB or 10M records
▪ Controlled by optimize.maxFileSize

OPTIMIZE + ZORDER
▪ Helps in data skipping
▪ Uses range partitioning (Hilbert curve in preview)
▪ Partially incremental
▪ Supports all/new/minCubeSize
▪ Controlled by optimize.zorder.mergeStrategy.minCubeSize.threshold

Optimize Write
▪ Unintentionally referred to as "auto-optimize"
▪ Introduces an extra shuffle phase
▪ Creates row-compressed data of 512 MB (binSize); output files of ~128 MB
▪ Controlled by optimizeWrite.binSize

Auto-Compaction
▪ Mini-optimize: creates files as big as 128 MB
▪ Post-commit action
▪ Triggered when there are more than 50 files per directory
▪ Controlled by autoCompact.minNumFiles and autoCompact.maxFileSize
Note: All configurations with a prefix “spark.databricks.delta”. eg: spark.databricks.delta.optimizeWrite.binSize
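As a sketch of how these features are switched on in practice (the table name is a placeholder; the property and config names are the standard Databricks ones), Optimize Write and Auto-Compaction can be enabled per table or per session:

```sql
-- Enable Optimize Write and Auto-Compaction for one table
-- (my_delta_table is a placeholder name):
ALTER TABLE my_delta_table SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);

-- Or for every write in the current session:
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;
```

Table properties travel with the table, so they apply to all writers; the session configs only affect the current cluster's writes.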
Choosing the right strategy - The What?
● Optimize Writes:
○ Misconception: that it does not work with streaming workloads (it does)
○ Makes life easy for OPTIMIZE and VACUUM
○ In terms of number of files, map-only writes can be very expensive; Optimize Writes can do magic!
● Case study: 3.2 PB table, ~700 TB input data, ~400 TB new writes
○ Without Optimize Writes: OPTIMIZE takes ~6-8 hours, run 3 times/day
○ With Optimize Writes: OPTIMIZE takes 2-3 hours, run 4 times/day, more than 40% resource saved on OPTIMIZE
Choosing the right strategy - The What?
● Z-Order vs PARTITION BY
○ Z-ordering is better than creating a large number of small files.
○ More effective use of the DBIO cache through the handling of less metadata
● Case study: a 326 TB table with 3 partition columns produced 25 million files; the same table with 2 partition columns produced 650k files.
Choosing the right strategy - The What?
Counting the current files per partition from the Delta log:
import com.databricks.sql.transaction.tahoe.DeltaLog
import org.apache.hadoop.fs.Path
val deltaPath = "<table_path>"
val deltaLog = DeltaLog(spark, new Path(deltaPath + "/_delta_log"))
val currentFiles = deltaLog.snapshot.allFiles
display(currentFiles.groupBy("partitionValues.col").count().orderBy($"count".desc))
Choosing the right strategy - The When?
● Auto-Optimize runs on the same cluster during/after a write.
● OPTIMIZE is a trade-off between read performance and cost.
● Delay Z-ordering if you are continuously adding data to an active partition:
○ If active reads are not on the latest partition
○ optimize.zorder.mergeStrategy.minCubeSize.threshold is 100 GB by default
○ Reducing the value makes the Z-order run more time-efficient, but degrades read performance
● Should I always run OPTIMIZE + VACUUM?
○ VACUUM happens on the Spark driver.
○ Roughly 200k files/hour in ADLS
○ Roughly 300k files/hour in AWS S3
○ DRY RUN gives an estimate
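Putting the guidance above together, a minimal sketch (table, partition, and column names are placeholders):

```sql
-- Scope OPTIMIZE to recent partitions and Z-order on a frequent filter column
OPTIMIZE events
WHERE date >= '2021-01-01'
ZORDER BY (eventType);

-- Preview what VACUUM would delete before paying for the real run
VACUUM events RETAIN 168 HOURS DRY RUN;
```

The WHERE clause keeps each OPTIMIZE run incremental, and DRY RUN lets you size the VACUUM job (at the ~200k-300k files/hour rates above) before committing driver time to it.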
Choosing the right strategy - The Where?
● Auto-Optimize runs on the same cluster during/after a write.
● Z-ordering is CPU intensive:
○ It involves Parquet decoding and encoding
○ Prefer compute-optimized instance types over general-purpose ones
● Always have a WHERE clause in OPTIMIZE queries
● Use auto-scaling clusters for VACUUM-only workloads
Delta Lake in Production - Metadata
Delta Lake Transaction Log
■ Sneak Peek Into Delta Log
■ Delta Log Configs
■ Delta Exceptions
■ Tips & Tricks
Sneak Peek Into Delta Log
Every commit answers: Who? What? When? Where?
▪ Version N
▪ Version N-1
▪ Version N-2
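The who/what/when/where of each version can be pulled with DESCRIBE HISTORY (the table name is a placeholder):

```sql
-- One row per commit: userName (who), operation and operationParameters (what),
-- timestamp (when), notebook/cluster info (where)
DESCRIBE HISTORY delta_table_name;
```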
Sneak Peek Into Delta Log
The log directory contains: .json files, .crc files, .checkpoint files, and _last_checkpoint
Delta Log Configs

LogRetentionDuration - how long the JSON log files are kept:
%sql
ALTER TABLE delta_table_name
SET TBLPROPERTIES ('delta.logRetentionDuration' = '7 days')

CheckpointRetentionDuration - how long the Parquet checkpoint files are kept:
%sql
ALTER TABLE delta_table_name
SET TBLPROPERTIES ('delta.checkpointRetentionDuration' = '7 days')
Delta Exceptions

ConcurrentModificationException - analogy: you can drive in parallel on a freeway, but not in a tunnel.

▪ ConcurrentModificationException: verify whether concurrent updates happened to the same partition
▪ ConcurrentAppendException: a concurrent operation adds files to the same partition from which your operation reads
▪ ConcurrentDeleteReadException: a concurrent operation deleted a file that your operation read
▪ ConcurrentDeleteDeleteException: a concurrent operation deleted a file that your operation also deletes
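One common mitigation for ConcurrentAppendException, sketched with placeholder table and column names: make each concurrent operation's condition explicitly disjoint on the partition column, so Delta can prove the transactions do not conflict:

```sql
-- Each concurrent MERGE pins the single date partition it touches,
-- instead of matching on id alone across the whole table
MERGE INTO target t
USING source s
ON t.id = s.id AND t.date = '2021-01-01'
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Without the partition predicate, a MERGE logically reads the whole table, so any concurrent append anywhere conflicts with it.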
Tips & Tricks

How to find what records were added between 2 versions of a Delta table?
%sql
SELECT * FROM delta_table_name@v2 EXCEPT ALL SELECT * FROM delta_table_name@v0

How to find what files were added in a specific version of a Delta table?
%scala
display(spark.read.json("//path-to-delta-table/_delta_log/0000000000000000000x.json")
  .where("add is not null")
  .select("add.path"))

How to find which Delta commit removed a specific file?
%scala
import org.apache.spark.sql.functions.col

val oldestVersionAvailable = 0L  // placeholder: oldest version still in _delta_log
val newestVersionAvailable = 0L  // placeholder: newest version in _delta_log
val pathToDeltaTable = ""
val pathToFileName = ""
(oldestVersionAvailable to newestVersionAvailable).foreach { version =>
  val df1 = spark.read.json(f"$pathToDeltaTable/_delta_log/$version%020d.json")
  if (df1.columns.toSeq.contains("remove")) {
    val df2 = df1.where("remove is not null").select("remove.path")
    val df3 = df2.filter(col("path").contains(pathToFileName))
    if (df3.count > 0)
      print(s"Commit version $version removed the file $pathToFileName\n")
  }
}
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
 

Último

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 

Último (17)

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 

Operating and Supporting Delta Lake in Production

  • 1. Harikrishnan Kunhumveettil & Mathan Pillai Operating and Supporting Delta Lake in Production
  • 2. Who we are? Mathan Pillai: currently Sr. TSE, Databricks (areas: Spark SQL, Delta, Structured Streaming); previously Sr. TSE, MapR and Hadoop Tech Lead, Nielsen. Harikrishnan Kunhumveettil: currently Sr. TSE, Databricks (areas: Spark SQL, Delta, Structured Streaming); previously Tech Lead, Intersys Consulting and Sr. Big Data Consultant, Saama Technologies.
  • 3. Agenda ■ Delta Lake in Production - Data ○ Optimize and Auto-Optimize - Overview ○ Choosing the right strategy - The What ○ Choosing the right strategy - The When ○ Choosing the right strategy - The Where ■ Delta Lake in Production - Metadata ○ Sneak Peek Into Delta Log ○ Delta Log Configs ○ Delta Log Misconception ○ Delta Log Exceptions ○ Tips & Tricks
  • 4. Delta Lake in Production - Data
  • 5. Optimize and Auto-Optimize - In a nutshell. OPTIMIZE: bin-packing/compaction; handles the small-file problem; idempotent; incremental; creates files of 1 GB or 10M records; controlled by optimize.maxFileSize. OPTIMIZE + ZORDER: helps data skipping; uses range partitioning; Hilbert curve in preview; partially incremental; supports all/new/minCubeSize merge strategies; controlled by optimize.zorder.mergeStrategy.minCubeSize.threshold. Optimize Write: often informally referred to as Auto-Optimize; introduces an extra shuffle phase; creates row-compressed bins of 512 MB (binSize); output files are ~128 MB; controlled by optimizeWrite.binSize. Auto-Compaction: a mini-OPTIMIZE; creates files up to 128 MB; runs as a post-commit action; triggered when a directory holds more than 50 files; controlled by autoCompact.minNumFiles and autoCompact.maxFileSize. Note: all configurations carry the prefix "spark.databricks.delta", e.g. spark.databricks.delta.optimizeWrite.binSize
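The write-side features above are opt-in. A minimal sketch of enabling them, assuming a Databricks runtime and an illustrative table name `events` (verify the property names against your runtime version):

```sql
-- Per table: enable optimized writes and auto-compaction
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);

-- Per session: apply to every Delta write from this cluster
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;
```

Table properties travel with the table, so every writer benefits; the session flags only affect writes from the current cluster.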
  • 6. Choosing the right strategy - The What? Optimize writes: a common misconception is that it does not work with streaming workloads (it does); it makes life easy for OPTIMIZE and VACUUM; in terms of number of files, map-only writes can be very expensive, and optimized writes can do magic. Case study: a 3.2 PB table with ~700 TB input data and ~400 TB new writes. Without optimized writes, OPTIMIZE takes ~6-8 hours and runs 3 times/day; with OPTIMIZE WRITE, the OPTIMIZE job takes 2-3 hours and runs 4 times/day, saving more than 40% of the resources spent on OPTIMIZE.
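The scheduled compaction job in the case study boils down to a single command; a hedged sketch (the table name `events` and the explicit file-size override are illustrative):

```sql
-- Tune the target file size if needed (default ~1 GB), then compact
SET spark.databricks.delta.optimize.maxFileSize = 1073741824;
OPTIMIZE events;
```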
  • 7. Choosing the right strategy - The What? Z-Order vs PARTITION BY: Z-ordering is better than producing a large number of small files through over-partitioning, and it makes more effective use of the DBIO cache by handling less metadata. Example: a 326 TB table with 3 partition columns produced 25 million files; the same 326 TB with 2 partition columns produced 650k files.
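When Z-ordering instead of adding another partition column, the clustering column goes into the OPTIMIZE call; a sketch with illustrative table and column names:

```sql
-- Co-locate rows by a high-cardinality column that is queried with selective filters
OPTIMIZE events ZORDER BY (user_id);
```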
  • 8. Choosing the right strategy - The What?
import com.databricks.sql.transaction.tahoe.DeltaLog
import org.apache.hadoop.fs.Path

val deltaPath = "<table_path>"
val deltaLog = DeltaLog(spark, new Path(deltaPath + "/_delta_log"))
val currentFiles = deltaLog.snapshot.allFiles
display(currentFiles.groupBy("partitionValues.col").count().orderBy($"count".desc))
  • 9. Choosing the right strategy - The When? Auto-Optimize runs on the same cluster during/after a write. OPTIMIZE is a trade-off between read performance and cost. Delay Z-ordering if you are continuously adding data to an active partition and active reads are not on the latest partition: optimize.zorder.mergeStrategy.minCubeSize.threshold is 100 GB by default, and reducing it makes Z-order runs cheaper but degrades read performance. Should I always run OPTIMIZE + VACUUM? VACUUM happens on the Spark driver; it deletes roughly 200k files/hour on ADLS and roughly 300k files/hour on AWS S3; DRY RUN gives the estimate.
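The DRY RUN estimate mentioned above can be obtained without deleting anything; a sketch with an illustrative table name and the default 7-day retention spelled out:

```sql
-- Lists the files that WOULD be deleted; nothing is actually removed
VACUUM events RETAIN 168 HOURS DRY RUN;
```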
  • 10. Choosing the right strategy - The Where? Auto-Optimize runs on the same cluster during/after a write. Z-ordering is CPU intensive: it involves Parquet decoding and encoding, so weigh general-purpose instances against compute-optimized clusters. Always have a WHERE clause in OPTIMIZE queries. Use auto-scaling clusters for VACUUM-only workloads.
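Scoping OPTIMIZE with a WHERE clause on partition columns limits the work to partitions that actually changed; a sketch assuming `date` is a partition column of an illustrative `events` table:

```sql
-- Only compact recent partitions instead of re-scanning the whole table
OPTIMIZE events WHERE date >= '2021-04-01' ZORDER BY (user_id);
```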
  • 11. Delta Lake in Production - Metadata
  • 12. Delta Lake Transaction Log ■ Sneak Peek Into Delta Log ■ Delta Log Configs ■ Delta Exceptions ■ Tips & Tricks
  • 13. Sneak Peek Into Delta Log: each version (N, N-1, N-2, ...) records the who, what, when, and where of a commit.
  • 14-17. Sneak Peek Into Delta Log: Who? What? When? Where?
  • 18-25. Sneak Peek Into Delta Log: the log directory contains .json commit files, .crc files, .checkpoint files, and the _last_checkpoint file.
  • 26. Delta Log Configs. LogRetentionDuration: how long log files are kept. %sql ALTER TABLE delta-table-name SET TBLPROPERTIES ('delta.logRetentionDuration' = '7 days')
  • 27. Delta Log Configs. LogRetentionDuration: how long log files are kept. %sql ALTER TABLE delta-table-name SET TBLPROPERTIES ('delta.logRetentionDuration' = '7 days'). CheckpointRetentionDuration: how long checkpoint files are kept. %sql ALTER TABLE delta-table-name SET TBLPROPERTIES ('delta.checkpointRetentionDuration' = '7 days')
  • 28-29. Delta Exceptions: ConcurrentModificationException. Analogy: you can drive in parallel on a freeway, but not in a tunnel.
  • 30. Delta Exceptions: ConcurrentModificationException. Verify whether concurrent updates happened to the same partition.
  • 31-33. Delta Exceptions. ConcurrentAppendException: a concurrent operation adds files to the same partition from which your operation reads. ConcurrentDeleteReadException: a concurrent operation deleted a file that your operation read. ConcurrentDeleteDeleteException: a concurrent operation deleted a file that your operation also deletes.
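These conflicts are usually avoided by making concurrent writers touch disjoint partitions and saying so explicitly in their predicates; a sketch assuming `date` is a partition column of an illustrative `events` table:

```sql
-- Can run concurrently without conflict: each statement names a distinct partition
DELETE FROM events WHERE date = '2021-04-01';  -- job A
DELETE FROM events WHERE date = '2021-04-02';  -- job B
```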
  • 34. Tips & Tricks. How to find what records were added between two versions of a Delta table?
SELECT * FROM delta_table_name@v2
EXCEPT ALL
SELECT * FROM delta_table_name@v0
  • 35. Tips & Tricks. How to find what files were added in a specific version of a Delta table?
%scala
display(spark.read.json("//path-to-delta-table/_delta_log/0000000000000000000x.json")
  .where("add is not null")
  .select("add.path"))
  • 36. Tips & Tricks. How to find which Delta commit removed a specific file?
val oldestVersionAvailable = 0L   // set to the oldest version present in _delta_log
val newestVersionAvailable = 10L  // set to the newest version present in _delta_log
val pathToDeltaTable = ""         // path to the Delta table
val pathToFileName = ""           // file name to search for
(oldestVersionAvailable to newestVersionAvailable).foreach { version =>
  val df1 = spark.read.json(f"$pathToDeltaTable/_delta_log/$version%020d.json")
  if (df1.columns.toSeq.contains("remove")) {
    val df2 = df1.where("remove is not null").select("remove.path")
    val df3 = df2.filter('path.contains(pathToFileName))
    if (df3.count > 0) print(s"Commit Version $version removed the file $pathToFileName\n")
  }
}
  • 37. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.