This document summarizes a presentation on time travel capabilities in Delta Lake for data warehousing and analytics. It introduces key time travel use cases (data archiving, rollbacks, reproducing machine learning experiments, and governance), explains how Delta Lake enables time travel through its transaction log and its ability to retrieve past versions or snapshots of tables, and concludes with a demonstration of these features.
Data Time Travel by Delta Time Machine
1. Data Time Travel by
Delta Time Machine
Burak Yavuz | Software Engineer
Vini Jaiswal | Customer Success Engineer
2. Who are we?
● Software Engineer @ Databricks
“We make your streams come true”
● Apache Spark Committer
● MS in Management Science & Engineering - Stanford University
● BS in Mechanical Engineering - Bogazici University, Turkey
● Customer Success Engineer @ Databricks
“Making Customers Successful with their data and ML/AI use cases”
● “Brickstein’s Briefcase” Host
● Data Science Lead - Citi | Data Intern - Southwest Airlines
● MS in Information Technology & Management - UTDallas
● BS in Electrical Engineering - Rajiv Gandhi Technology University, India
Vini Jaiswal
Burak Yavuz
3. Agenda
Intro to Time Travel
Time Travel Use Cases
▪ Data Archiving
▪ Rollbacks
▪ Reproducing ML experiments
▪ Governance
Solving with Delta
Demo - Riding the time machine
5. What might time travel look like?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '2018-12-31'
0.82
7. Data Archiving
● Changes to data need to be stored and remain retrievable for regulatory reasons
● May need to store data for many years (7+)
8. Governance
● What if records need to be forgotten in response to a Data Subject request?
● And, at the same time, how do you stay in compliance with international regulations?
[Pipeline diagram: Flights (JSON, events per second, via Kinesis), Planes (CSV, slowly changing, on S3), and Weather (JSON, a new dump on S3 every 5 minutes) feeding a "Delays per airplane" table]
10. Reproduce Experiments
● Reproducibility is the cornerstone of all scientific inquiry
● To improve a machine learning model, a data scientist must first reproduce
its results.
15. Update Metadata – changes the table's name, schema, partitioning, etc.
Add File – adds a file (with optional statistics)
Remove File – removes a file
Set Transaction – records an idempotent txn id
Change Protocol – upgrades the version of the txn protocol
Result: Current Metadata, List of Files, List of Txns, Version
Table = Result of a set of actions
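The "table = result of a set of actions" idea can be sketched in plain Python. This is a toy model of log replay for illustration, not Delta's actual implementation; the action and field names below are hypothetical stand-ins for the actions listed above.

```python
# Toy model of a Delta-style log replay: the table state is the result
# of folding a sequence of log actions, version by version.

def replay(actions):
    """Fold log actions into a snapshot: metadata, file list, txns, version."""
    state = {"metadata": None, "files": set(), "txns": {}, "version": -1}
    for version, action in enumerate(actions):
        kind = action["action"]
        if kind == "updateMetadata":
            state["metadata"] = action["metadata"]
        elif kind == "addFile":
            state["files"].add(action["path"])
        elif kind == "removeFile":
            state["files"].discard(action["path"])
        elif kind == "setTransaction":
            state["txns"][action["appId"]] = action["txnVersion"]
        state["version"] = version
    return state

log = [
    {"action": "updateMetadata", "metadata": {"name": "events"}},
    {"action": "addFile", "path": "part-000.parquet"},
    {"action": "addFile", "path": "part-001.parquet"},
    {"action": "removeFile", "path": "part-000.parquet"},
]
snapshot = replay(log)
print(snapshot["files"])    # {'part-001.parquet'}
print(snapshot["version"])  # 3
```

Time travel then falls out naturally: replaying only the first N actions yields the table as of version N-1.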
18. Time Travelling by version
SELECT * FROM my_table VERSION AS OF 1071;
SELECT * FROM my_table@v1071 -- @vNNN syntax, no backticks needed
spark.read.option("versionAsOf", 1071).load("/some/path")
spark.read.load("/some/path@v1071")
deltaLog.getSnapshotAt(1071)
19. Time Travelling by timestamp
SELECT * FROM my_table TIMESTAMP AS OF '1492-10-28';
SELECT * FROM my_table@14921028000000000 -- yyyyMMddHHmmssSSS
spark.read.option("timestampAsOf", "1492-10-28").load("/some/path")
spark.read.load("/some/path@14921028000000000")
deltaLog.getSnapshotAt(1071)
20. Time Travelling by timestamp
001070.json → 375-01-01
001071.json → 1453-05-29
001072.json → 1923-10-29
001073.json → 1920-04-23
Commit timestamps come from storage system modification timestamps.
21. Time Travelling by timestamp
001070.json → 375-01-01
001071.json → 1453-05-29
001072.json → 1923-10-29
001073.json → 1920-04-23
Timestamps can be out of order. We adjust by adding 1 millisecond to the
previous commit's timestamp:
001070.json → 375-01-01
001071.json → 1453-05-29
001072.json → 1923-10-29
001073.json → 1923-10-29 00:00:00.001
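The adjustment rule can be sketched in plain Python. This is a toy model of the rule as described on the slide, not Delta's implementation; the dates are the slide's own examples.

```python
from datetime import datetime, timedelta

def adjust_timestamps(commit_ts):
    """Make commit timestamps strictly increasing: a commit whose timestamp
    does not exceed its predecessor's gets the predecessor's timestamp
    plus one millisecond."""
    adjusted = []
    for ts in commit_ts:
        if adjusted and ts <= adjusted[-1]:
            ts = adjusted[-1] + timedelta(milliseconds=1)
        adjusted.append(ts)
    return adjusted

# The slide's example: the last commit's file-modification timestamp
# (1920-04-23) is older than the previous commit's (1923-10-29).
raw = [datetime(375, 1, 1), datetime(1453, 5, 29),
       datetime(1923, 10, 29), datetime(1920, 4, 23)]
fixed = adjust_timestamps(raw)
print(fixed[-1])  # 1923-10-29 00:00:00.001000
```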
22. Time Travelling by timestamp
001070.json → 375-01-01
001071.json → 1453-05-29 ← closest commit not exceeding 1492-10-28
001072.json → 1923-10-29
001073.json → 1923-10-29 00:00:00.001
Price is Right rules: pick the closest commit with a timestamp that doesn't
exceed the user's timestamp. A query for 1492-10-28 therefore resolves to
version 1071:
deltaLog.getSnapshotAt(1071)
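With the adjusted timestamps strictly increasing, the lookup is a binary search for the latest commit at or before the requested time. A minimal sketch in plain Python, using the slide's example values; not Delta's actual code.

```python
import bisect
from datetime import datetime, timedelta

def version_for_timestamp(versions, timestamps, user_ts):
    """Return the latest version whose adjusted (strictly increasing)
    commit timestamp does not exceed the user's timestamp."""
    i = bisect.bisect_right(timestamps, user_ts) - 1
    if i < 0:
        raise ValueError("timestamp precedes the earliest commit")
    return versions[i]

versions = [1070, 1071, 1072, 1073]
timestamps = [datetime(375, 1, 1), datetime(1453, 5, 29),
              datetime(1923, 10, 29),
              datetime(1923, 10, 29) + timedelta(milliseconds=1)]

print(version_for_timestamp(versions, timestamps, datetime(1492, 10, 28)))  # 1071
```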
23. Data Archiving
● Changes to data need to be stored and remain retrievable for regulatory reasons
○ Should you be storing changes (CDC) or the latest snapshot?
● May need to store data for many years (7+)
○ How do you make it cost efficient?
24. What might time travel look like?
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '2018-12-31'
0.82
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
25. Is this really a Time Travel problem?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '2018-12-31'
0.82
26. Is this really a Time Travel problem?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1926'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1972'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1880'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '2018'
0.82
It is better to store the data partitioned by year and query it with a predicate than to use time travel.
27. Slowly Changing Dimensions (SCD)
- Type 1: Only keep latest data
First Name | Last Name | Date of Birth      | City        | Last Updated
Henrik     | Larsson   | September 20, 1971 | Helsingborg | 2012

After the update:
First Name | Last Name | Date of Birth      | City        | Last Updated
Henrik     | Larsson   | September 20, 1971 | Barcelona   | 2020
To access older data, you need to perform Time Travel. Is this the ideal way to store data for my use case?
28. Problems with SCD Type 1 + Time Travel
● Trade-off between data recency, query performance, and storage
costs
○ Data recency requires many frequent updates
○ Better query performance requires regular compaction of the data
○ The two above lead to many copies of the data
○ Many copies of the data lead to prohibitive storage costs
● Time Travel requires older copies of the data to exist
29. Slowly Changing Dimensions (SCD)
- Type 2: Insert row for each change
First Name | Last Name | Date of Birth      | City        | Last Updated | Latest
Henrik     | Larsson   | September 20, 1971 | Helsingborg | 2012         | Y

After the update:
First Name | Last Name | Date of Birth      | City        | Last Updated | Latest
Henrik     | Larsson   | September 20, 1971 | Helsingborg | 2012         | N
Henrik     | Larsson   | September 20, 1971 | Barcelona   | 2020         | Y
To access older data, you simply write a WHERE query. A VIEW can help show only the latest state of the data at any given point.
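An SCD Type 2 update can be sketched in plain Python. This is a toy in-memory version for illustration, not the MERGE you would actually run in Delta; the column and flag names follow the slide's table.

```python
def scd2_upsert(rows, key, new_row):
    """Toy SCD Type 2 upsert: mark the current row for `key` as no longer
    latest, then append the new row flagged as latest (Y/N flag as on
    the slide)."""
    for row in rows:
        if row["first_name"] == key and row["latest"] == "Y":
            row["latest"] = "N"
    rows.append(dict(new_row, latest="Y"))
    return rows

table = [{"first_name": "Henrik", "last_name": "Larsson",
          "city": "Helsingborg", "last_updated": 2012, "latest": "Y"}]
scd2_upsert(table, "Henrik",
            {"first_name": "Henrik", "last_name": "Larsson",
             "city": "Barcelona", "last_updated": 2020})

# Old data stays queryable with a plain filter, no time travel needed:
history = [r["city"] for r in table]
print(history)  # ['Helsingborg', 'Barcelona']
current = [r for r in table if r["latest"] == "Y"]
```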
31. Rollbacks
● Undoing work (restoring an old version of the table)
RESTORE TABLE my_table TO TIMESTAMP AS OF '2020-11-10'
● Replaying Structured Streaming Pipelines
RESTORE TABLE target_table TO TIMESTAMP AS OF '2020-11-10'
spark.readStream.format("delta")
  .option("startingTimestamp", "2020-11-10")
  .load(path)
  // fix logic
  .writeStream
  .toTable("target_table")
32. Reproduce Experiments
● Use Time Travel to ensure all experiments run on the same snapshot
of the table
○ SELECT * FROM my_table VERSION AS OF 1071;
○ SELECT * FROM my_table@v1071
● Archive a blessed snapshot using CLONE
○ CREATE TABLE my_table_xmas CLONE my_table VERSION AS OF 1071
34. Time Series Analytics
If you want to find out how many new customers were added
over the last week
SELECT
  count(DISTINCT userId) - (
    SELECT count(DISTINCT userId)
    FROM my_table
    TIMESTAMP AS OF date_sub(current_date(), 7))
FROM my_table
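The query compares distinct-user counts between the current table and its snapshot from seven days ago. A toy illustration with Python sets; the snapshot contents below are invented for the example. Note that, like the SQL, it subtracts counts, so it reports net growth (deletions would offset additions).

```python
# Two snapshots of the same table's distinct user IDs (made-up data):
snapshot_week_ago = {"u1", "u2", "u3"}          # my_table TIMESTAMP AS OF (today - 7)
snapshot_now = {"u1", "u2", "u3", "u4", "u5"}   # my_table today

new_customers = len(snapshot_now) - len(snapshot_week_ago)
print(new_customers)  # 2
```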