This deck covers strategies for optimizing data and managing metadata in Delta Lake. It gives an overview of OPTIMIZE, Auto Optimize, and Optimize Write, and how to choose among them based on factors like workload, data size, and cluster resources. It also walks through the Delta Lake transaction log, configurations such as log retention duration, and tips for working with Delta Lake metadata.
2. Who we are

Mathan Pillai
Currently: Sr. TSE, Databricks. Areas: Spark SQL, Delta, SS
Previously: Sr. TSE, MapR; Hadoop Tech Lead, Nielsen

Harikrishnan Kunhumveettil
Currently: Sr. TSE, Databricks. Areas: Spark SQL, Delta, SS
Previously: Tech Lead, Intersys Consulting; Sr. Big Data Consultant, Saama Technologies
3. Agenda
■ Delta Lake in Production - Data
○ Optimize and Auto-Optimize - Overview
○ Choosing the right strategy - The What
○ Choosing the right strategy - The When
○ Choosing the right strategy - The Where
■ Delta Lake in Production - Metadata
○ Sneak Peek Into Delta Log
○ Delta Log Configs
○ Delta Log Misconception
○ Delta Log Exceptions
○ Tips & Tricks
5. Optimize and Auto-Optimize - In a nutshell

OPTIMIZE
▪ Bin-packing/compaction
▪ Handles the small-file problem
▪ Idempotent
▪ Incremental
▪ Creates files of ~1 GB or 10M records
▪ Controlled by optimize.maxFileSize

OPTIMIZE + ZORDER
▪ Helps in data skipping
▪ Uses range partitioning
▪ Hilbert curve in preview
▪ Partially incremental
▪ Supports all/new/minCubeSize
▪ Controlled by optimize.zorder.mergeStrategy.minCubeSize.threshold

Optimize Write
▪ Often unintentionally referred to as Auto-optimize
▪ Introduces an extra shuffle phase
▪ Creates row-compressed data of 512 MB (binSize)
▪ Output files ~128 MB
▪ Controlled by optimizeWrite.binSize

Auto-Compaction
▪ Mini-OPTIMIZE
▪ Creates files as big as 128 MB
▪ Post-commit action
▪ Triggered when more than 50 files/directory
▪ Controlled by autoCompact.minNumFiles and autoCompact.maxFileSize

Note: All configurations take the prefix "spark.databricks.delta", e.g. spark.databricks.delta.optimizeWrite.binSize
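As a sketch of how these knobs are typically set, assuming a Databricks runtime and a hypothetical table name my_delta_table (the delta.autoOptimize.* table properties and the session configs above are the levers being pulled; values are illustrative):

// Enable Optimize Write and Auto-Compaction on a single table.
spark.sql("""
  ALTER TABLE my_delta_table SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")

// Or tune the session-level knobs named on the slide above.
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "512")   // MB per shuffle bin, per the slide
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "50")  // files/directory before compaction triggers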
6. Choosing the right strategy - The What?
● Optimize Write:
○ Misconception: it does not work with streaming workloads (it does)
○ Makes life easy for OPTIMIZE and VACUUM
○ In terms of number of files, map-only writes can be very expensive; Optimize Write can do magic here
Case study: a 3.2 PB table, ~700 TB input data, ~400 TB new writes

Without Optimize Write:
○ OPTIMIZE takes ~6-8 hours
○ OPTIMIZE job runs 3 times/day

With Optimize Write:
○ OPTIMIZE job takes 2-3 hours
○ OPTIMIZE runs 4 times/day
○ More than 40% of the OPTIMIZE resources saved (roughly 18-24 hours/day of OPTIMIZE work drops to 8-12 hours/day)
7. Choosing the right strategy - The What?
● Z-Order vs PARTITION BY
○ Z-ordering is better than creating a large number of small files through over-partitioning
○ More effective use of the DBIO cache, since less metadata has to be handled

Example: the same 326 TB table
○ Partitioned by 3 columns: 25 million files
○ Partitioned by 2 columns: 650k files
8. Choosing the right strategy - The What?
// Inspect file counts per partition via the Delta snapshot (Databricks-internal API)
import com.databricks.sql.transaction.tahoe.DeltaLog
import org.apache.hadoop.fs.Path
import spark.implicits._

val deltaPath = "<table_path>"
val deltaLog = DeltaLog(spark, new Path(deltaPath + "/_delta_log"))

// allFiles lists every data file in the current table snapshot
val currentFiles = deltaLog.snapshot.allFiles

// Replace "col" with the table's actual partition column name
display(currentFiles.groupBy("partitionValues.col").count().orderBy($"count".desc))
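On open-source Delta Lake, a similar entry point is assumed to live at org.apache.spark.sql.delta.DeltaLog (verify against your Delta version); a minimal sketch:

// OSS Delta Lake variant; the same per-partition aggregation applies
import org.apache.spark.sql.delta.DeltaLog

val log = DeltaLog.forTable(spark, "<table_path>")
val files = log.snapshot.allFiles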
9. Choosing the right strategy - The When?
● Auto-Optimize runs on the same cluster during/after a write.
● OPTIMIZE is a trade-off between read performance and cost.
● Delay Z-ordering if you are continuously adding data to an active partition:
○ Applies when active reads are not on the latest partition
○ optimize.zorder.mergeStrategy.minCubeSize.threshold is 100 GB by default
○ Reducing that value makes the Z-order run more time-efficient, but degrades read performance
● Should I always run OPTIMIZE + VACUUM? (see the DRY RUN sketch below)
○ VACUUM happens on the Spark driver
○ Roughly 200k files/hour in ADLS
○ Roughly 300k files/hour in AWS S3
○ DRY RUN gives an estimate of the number of files to be deleted
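A minimal sketch of sizing VACUUM work before committing to it, assuming a hypothetical table path (DRY RUN only reports what would be deleted; nothing is removed):

// Estimate VACUUM work; 168 hours corresponds to the default 7-day retention.
spark.sql("VACUUM delta.`/mnt/data/my_delta_table` RETAIN 168 HOURS DRY RUN")
  .show(truncate = false)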
10. Choosing the right strategy - The Where?
● Auto-Optimize runs on the same cluster during/after a write.
● Z-ordering is CPU intensive:
○ It involves Parquet decoding and encoding
○ Prefer compute-optimized instances over general-purpose ones
● Always include a WHERE clause in OPTIMIZE queries (see the sketch below)
● Use auto-scaling clusters for VACUUM-only workloads
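A minimal sketch of a scoped OPTIMIZE, assuming a hypothetical events table partitioned by date with a frequently filtered eventType column:

// Compact and Z-order only the partitions that changed, not the whole table.
spark.sql("""
  OPTIMIZE events
  WHERE date >= '2021-06-01'
  ZORDER BY (eventType)
""")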
26. Delta Log Configs

LogRetentionDuration - how long the JSON log files are kept:

%sql
ALTER TABLE delta_table_name
SET TBLPROPERTIES ('delta.logRetentionDuration' = '7 days')

27. Delta Log Configs

CheckpointRetentionDuration - how long the Parquet checkpoint files are kept:

%sql
ALTER TABLE delta_table_name
SET TBLPROPERTIES ('delta.checkpointRetentionDuration' = '7 days')
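To verify what a table is currently configured with, a quick check (table name hypothetical):

// Show the table properties, including any retention overrides set above.
spark.sql("SHOW TBLPROPERTIES delta_table_name").show(truncate = false)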
28. Delta Exceptions

ConcurrentModificationException analogy: you can drive in parallel on a freeway, but not in a tunnel.
33. Delta Exceptions

ConcurrentAppendException
A concurrent operation adds files to the same partition that your operation reads from.

ConcurrentDeleteReadException
A concurrent operation deleted a file that your operation read.

ConcurrentDeleteDeleteException
A concurrent operation deleted a file that your operation also deletes.
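These conflicts are usually mitigated by scoping predicates to disjoint partitions and retrying. A minimal retry sketch, assuming OSS Delta's io.delta.exceptions.ConcurrentAppendException (the class location differs on Databricks runtimes) and a hypothetical date-partitioned events table:

import io.delta.exceptions.ConcurrentAppendException

// Retry a partition-scoped DELETE when a concurrent transaction appends to
// the partition we read; disjoint predicates avoid most such conflicts.
def deleteWithRetry(maxAttempts: Int = 3): Unit = {
  var attempt = 0
  var done = false
  while (!done && attempt < maxAttempts) {
    try {
      spark.sql("DELETE FROM events WHERE date = '2021-06-01'")
      done = true
    } catch {
      case _: ConcurrentAppendException => attempt += 1 // conflict: retry
    }
  }
}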
34. Tips & Tricks

How to find which records were added between two versions of a Delta table?

%sql
SELECT * FROM delta_table_name@v2
EXCEPT ALL
SELECT * FROM delta_table_name@v0
36. Tips & Tricks

How to find which Delta commit removed a specific file?

import org.apache.spark.sql.functions.col

val pathToDeltaTable = ""        // e.g. "dbfs:/mnt/data/my_delta_table"
val pathToFileName = ""          // e.g. a "part-0000...parquet" file name
val oldestVersionAvailable = 0   // set to the oldest commit JSON still in _delta_log
val newestVersionAvailable = 10  // set to the latest commit version of the table

// Scan each commit JSON for a "remove" action referencing the file
(oldestVersionAvailable to newestVersionAvailable).foreach { version =>
  val df1 = spark.read.json(f"$pathToDeltaTable/_delta_log/$version%020d.json")
  if (df1.columns.toSeq.contains("remove")) {
    val df2 = df1.where("remove is not null").select("remove.path")
    val df3 = df2.filter(col("path").contains(pathToFileName))
    if (df3.count > 0)
      print(s"Commit Version $version removed the file $pathToFileName\n")
  }
}