Data Time Travel by
Delta Time Machine
Burak Yavuz | Software Engineer
Vini Jaiswal | Customer Success Engineer
Who are we?
Burak Yavuz
● Software Engineer @ Databricks
“We make your streams come true”
● Apache Spark Committer
● MS in Management Science & Engineering - Stanford University
● BS in Mechanical Engineering - Bogazici University, Turkey
Vini Jaiswal
● Customer Success Engineer @ Databricks
“Making Customers Successful with their data and ML/AI use cases”
● Data Science Lead - Citi | Data Intern - Southwest Airlines
● MS in Information Technology & Management - UTDallas
● BS in Electrical Engineering - Rajiv Gandhi Technology University, India
Agenda
Intro to Time Travel
Time Travel Use Cases
▪ Data Archiving
▪ Rollbacks
▪ Governance
▪ Reproducing ML experiments
Solving with Delta
Demo - Riding the time machine
Introduction to Time Travel
What might time travel look like?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '2018-12-31'
0.82
Time Travel Use Cases
● Data Archiving
● Governance
● Rollbacks
● Reproduce Experiments
Data Archiving
● Changes to data need to be stored and be retrievable for regulatory reasons
● May need to store data for many years (7+)
Governance
● What if records need to be forgotten in response to a data subject request?
● And, at the same time, how do you stay in compliance with international regulations?
[Diagram: Flights (JSON, events per second, from Kinesis), Planes (CSV, slowly changing, on S3), and Weather (JSON, a new dump on S3 every 5 minutes) feed "Delays per airplane".]
Rollbacks
What if a new job is deployed that accidentally specifies .mode("overwrite")?
[Diagram: Flights (JSON, events per second, from Event Hubs), Planes (CSV, slowly changing, on Blob), and Weather (JSON, a new dump on Blob every 5 minutes) feed "Delays per airplane". After the new job runs with .mode("overwrite"), all historic data is gone.]
Reproduce Experiments
● Reproducibility is the cornerstone of all scientific inquiry
● In order for a machine learning model to be improved, a data scientist
must first reproduce the results of the model.
Solving with Delta
For more info check out
Diving Into Delta Lake:
Unpacking the Transaction Log
Wednesday (Nov 11) 15:00 GMT
Transaction Protocol
▪ Serializable ACID Writes
▪ Snapshot Isolation
▪ Scalability to billions of partitions or files
▪ Incremental processing
Computing Delta’s State
000000.json
000001.json
000002.json
000003.json
000004.json
000005.json
000006.json
000007.json
listFrom version 0
Cache version 7
Update Metadata – name, schema, partitioning, etc
Add File – adds a file (with optional statistics)
Remove File – removes a file
Set Transaction – records an idempotent txn id
Change Protocol – upgrades the version of the txn protocol
Result: Current Metadata, List of Files, List of Txns, Version
Table = Result of a set of actions
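To make "Table = Result of a set of actions" concrete, here is a minimal sketch of replaying commits in order. The action dictionaries, the `TableState` class, and the `replay` function are hypothetical simplifications for illustration; they are not the on-disk Delta log format or the Delta Lake API.

```python
from dataclasses import dataclass, field


@dataclass
class TableState:
    """Result of replaying the log: metadata, live files, txn ids, version."""
    version: int = -1
    metadata: dict = field(default_factory=dict)
    protocol: dict = field(default_factory=dict)
    files: dict = field(default_factory=dict)  # path -> optional statistics
    txns: dict = field(default_factory=dict)   # app id -> last txn version


def replay(commits):
    """Apply commits in version order; each commit is a list of action dicts."""
    state = TableState()
    for version, actions in enumerate(commits):
        for action in actions:
            kind = action["type"]  # hypothetical field name
            if kind == "metadata":
                state.metadata = action["value"]
            elif kind == "add":
                state.files[action["path"]] = action.get("stats")
            elif kind == "remove":
                state.files.pop(action["path"], None)
            elif kind == "txn":
                state.txns[action["app_id"]] = action["version"]
            elif kind == "protocol":
                state.protocol = action["value"]
        state.version = version
    return state


# Three illustrative commits (versions 0, 1, 2).
commits = [
    [{"type": "protocol", "value": {"reader": 1, "writer": 2}},
     {"type": "metadata", "value": {"name": "my_table"}},
     {"type": "add", "path": "part-0.parquet", "stats": {"rows": 10}}],
    [{"type": "add", "path": "part-1.parquet"},
     {"type": "txn", "app_id": "stream-1", "version": 0}],
    [{"type": "remove", "path": "part-0.parquet"}],
]
state = replay(commits)
print(state.version, sorted(state.files))
```

Replaying only a prefix of the commits gives the table as of that earlier version, which is the idea time travel builds on.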
Computing Delta’s State
000000.json
...
000007.json
000008.json
000009.json
000010.json
000010.checkpoint.parquet
000011.json
000012.json
listFrom version 0
Cache version 12
Computing Delta’s State
000010.checkpoint.parquet
000011.json
000012.json
000013.json
000014.json
listFrom version 10
Cache version 14
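The checkpoint shortcut can be sketched as follows: given a directory listing, start from the newest checkpoint and read only the JSON commits after it. The `files_to_load` function and the shortened file names are assumptions made for illustration (they follow the abbreviated names on the slides, not the real log naming scheme).

```python
import re

# Matches the abbreviated names used on the slides, e.g. "000011.json".
LOG_FILE = re.compile(r"^(\d+)\.(json|checkpoint\.parquet)$")


def files_to_load(listing, start_version=0):
    """Return (checkpoint file or None, JSON commits after it, latest version)."""
    parsed = []
    for name in listing:
        match = LOG_FILE.match(name)
        if match and int(match.group(1)) >= start_version:
            parsed.append((int(match.group(1)), match.group(2), name))
    checkpoints = [p for p in parsed if p[1] == "checkpoint.parquet"]
    checkpoint = max(checkpoints) if checkpoints else None
    base = checkpoint[0] if checkpoint else -1
    # Only commits newer than the checkpoint still need to be replayed.
    deltas = sorted(p for p in parsed if p[1] == "json" and p[0] > base)
    latest = deltas[-1][0] if deltas else base
    return (checkpoint[2] if checkpoint else None, [d[2] for d in deltas], latest)


listing = ["000010.checkpoint.parquet", "000010.json", "000011.json",
           "000012.json", "000013.json", "000014.json"]
checkpoint, deltas, latest = files_to_load(listing, start_version=10)
print(checkpoint, deltas, latest)
```

Listing from the last known checkpoint version keeps the cost of refreshing the state proportional to the number of new commits rather than to the full history.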
Time Travelling by version
SELECT * FROM my_table VERSION AS OF 1071;
SELECT * FROM my_table@v1071 -- no backticks to specify @
spark.read.option("versionAsOf", 1071).load("/some/path")
spark.read.load("/some/path@v1071")
deltaLog.getSnapshotAt(1071)
Time Travelling by timestamp
SELECT * FROM my_table TIMESTAMP AS OF '1492-10-28';
SELECT * FROM my_table@14921028000000000 -- yyyyMMddHHmmssSSS
spark.read.option("timestampAsOf", "1492-10-28").load("/some/path")
spark.read.load("/some/path@14921028000000000")
deltaLog.getSnapshotAt(1071)
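As a small aid for the `@` shorthand, the sketch below formats a timestamp into the 17-digit `yyyyMMddHHmmssSSS` suffix shown above. The helper name `time_travel_suffix` is hypothetical and not part of any Delta Lake API.

```python
from datetime import datetime


def time_travel_suffix(ts: datetime) -> str:
    """Format a timestamp as yyyyMMddHHmmssSSS (milliseconds in the last 3 digits)."""
    return (
        f"{ts.year:04d}{ts.month:02d}{ts.day:02d}"
        f"{ts.hour:02d}{ts.minute:02d}{ts.second:02d}"
        f"{ts.microsecond // 1000:03d}"
    )


suffix = time_travel_suffix(datetime(1492, 10, 28))
print(f"my_table@{suffix}")
```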
Time Travelling by timestamp
Commit timestamps come from storage system modification timestamps
Commit file | Storage timestamp
001070.json | 375-01-01
001071.json | 1453-05-29
001072.json | 1923-10-29
001073.json | 1920-04-23
Time Travelling by timestamp
Timestamps can be out of order. We adjust by adding 1 millisecond to the previous commit’s timestamp.
Commit file | Storage timestamp | Adjusted timestamp
001070.json | 375-01-01 | 375-01-01
001071.json | 1453-05-29 | 1453-05-29
001072.json | 1923-10-29 | 1923-10-29
001073.json | 1920-04-23 | 1923-10-29 00:00:00.001
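The adjustment rule can be sketched in a few lines. The `monotonize` function is a hypothetical illustration, and treating equal timestamps the same way as out-of-order ones is an assumption, since the slide only mentions the out-of-order case.

```python
from datetime import datetime, timedelta


def monotonize(commits):
    """Return (version, timestamp) pairs with strictly increasing timestamps."""
    adjusted = []
    for version, ts in commits:
        # Assumption: ties are bumped too, so timestamps end up strictly increasing.
        if adjusted and ts <= adjusted[-1][1]:
            ts = adjusted[-1][1] + timedelta(milliseconds=1)
        adjusted.append((version, ts))
    return adjusted


raw = [
    (1070, datetime(375, 1, 1)),
    (1071, datetime(1453, 5, 29)),
    (1072, datetime(1923, 10, 29)),
    (1073, datetime(1920, 4, 23)),  # out of order
]
adjusted = monotonize(raw)
print(adjusted[-1])
```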
Time Travelling by timestamp
Price is right rules: Pick closest commit with timestamp that doesn’t exceed the user’s timestamp.
Commit file | Timestamp
001070.json | 375-01-01
001071.json | 1453-05-29
001072.json | 1923-10-29
001073.json | 1923-10-29 00:00:00.001
User’s timestamp: 1492-10-28
deltaLog.getSnapshotAt(1071)
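A minimal sketch of the "price is right" lookup, assuming the commit timestamps have already been made strictly increasing. The `version_as_of` function is hypothetical and only illustrates the selection rule with a binary search.

```python
from bisect import bisect_right
from datetime import datetime


def version_as_of(commits, user_ts):
    """Pick the latest commit whose timestamp does not exceed user_ts."""
    timestamps = [ts for _, ts in commits]
    idx = bisect_right(timestamps, user_ts) - 1
    if idx < 0:
        raise ValueError("timestamp is before the earliest available commit")
    return commits[idx][0]


commits = [
    (1070, datetime(375, 1, 1)),
    (1071, datetime(1453, 5, 29)),
    (1072, datetime(1923, 10, 29)),
    (1073, datetime(1923, 10, 29, 0, 0, 0, 1000)),
]
resolved = version_as_of(commits, datetime(1492, 10, 28))
print(resolved)
```

Because the adjusted timestamps are sorted, the lookup is logarithmic in the number of commits.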
Back to the Use Cases
Data Archiving
● Changes to data need to be stored and be retrievable for regulatory reasons
○ Should you be storing changes (CDC) or the latest snapshot?
● May need to store data for many years (7+)
○ How do you make it cost efficient?
What might time travel look like?
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '2018-12-31'
0.82
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
Is this really a Time Travel problem?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1926'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1972'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1880'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '2018'
0.82
Better to save data by year and query with a predicate instead of using time travel.
Slowly Changing Dimensions (SCD)
- Type 1: Only keep latest data
Before the update:
First Name | Last Name | Date of Birth | City | Last Updated
Henrik | Larsson | September 20, 1971 | Helsingborg | 2012
After the update:
First Name | Last Name | Date of Birth | City | Last Updated
Henrik | Larsson | September 20, 1971 | Barcelona | 2020
To access older data, you need to perform Time Travel. Is this the ideal way to store data for my use case?
Problems with SCD Type 1 + Time Travel
● Trade-off between data recency, query performance, and storage costs
○ Data recency requires many frequent updates
○ Better query performance requires regular compaction of the data
○ The two above lead to many copies of the data
○ Many copies of the data lead to prohibitive storage costs
● Time Travel requires older copies of the data to exist
Slowly Changing Dimensions (SCD)
- Type 2: Insert row for each change
Before the change:
First Name | Last Name | Date of Birth | City | Last Updated | Latest
Henrik | Larsson | September 20, 1971 | Helsingborg | 2012 | Y
After the change:
First Name | Last Name | Date of Birth | City | Last Updated | Latest
Henrik | Larsson | September 20, 1971 | Helsingborg | 2012 | N
Henrik | Larsson | September 20, 1971 | Barcelona | 2020 | Y
To access older data, you simply write a WHERE query. A VIEW can help show only the latest state of the data at any given point.
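The SCD Type 2 pattern can be tried end to end with an in-memory SQLite database standing in for a Delta table. The table name `players`, the view name `players_latest`, and the `record_move` helper are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE players (
           first_name TEXT, last_name TEXT, date_of_birth TEXT,
           city TEXT, last_updated INTEGER, latest TEXT)"""
)
conn.execute(
    "INSERT INTO players VALUES "
    "('Henrik', 'Larsson', '1971-09-20', 'Helsingborg', 2012, 'Y')"
)


def record_move(conn, first_name, last_name, city, year):
    """SCD Type 2 change: retire the current row, then insert the new one."""
    with conn:  # commit both statements together
        dob = conn.execute(
            "SELECT date_of_birth FROM players "
            "WHERE first_name = ? AND last_name = ? AND latest = 'Y'",
            (first_name, last_name),
        ).fetchone()[0]
        conn.execute(
            "UPDATE players SET latest = 'N' "
            "WHERE first_name = ? AND last_name = ? AND latest = 'Y'",
            (first_name, last_name),
        )
        conn.execute(
            "INSERT INTO players VALUES (?, ?, ?, ?, ?, 'Y')",
            (first_name, last_name, dob, city, year),
        )


record_move(conn, "Henrik", "Larsson", "Barcelona", 2020)

# The view exposes only the latest state of each record.
conn.execute("CREATE VIEW players_latest AS SELECT * FROM players WHERE latest = 'Y'")

current_city = conn.execute("SELECT city FROM players_latest").fetchone()[0]
# Older data is reachable with a plain WHERE clause, with no time travel involved.
city_in_2015 = conn.execute(
    "SELECT city FROM players WHERE last_updated <= 2015 "
    "ORDER BY last_updated DESC LIMIT 1"
).fetchone()[0]
print(current_city, city_in_2015)
```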
Governance
DESCRIBE HISTORY my_table
Rollbacks
● Undoing work (restoring an old version of the table)
RESTORE my_table TO TIMESTAMP AS OF '2020-11-10'
● Replaying Structured Streaming Pipelines
RESTORE target_table TO TIMESTAMP AS OF '2020-11-10'
spark.readStream.format("delta")
  .option("startingTimestamp", "2020-11-10")
  .load(path)
  // fix logic
  .writeStream
  .option("checkpointLocation", "<new_location>")
  .table("target_table")
Rollbacks
Roll back accidental bad writes, for example by re-inserting rows from yesterday's snapshot:
INSERT INTO my_table
  SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
Fix incorrect updates as follows:
MERGE INTO my_table target
  USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source
  ON source.userId = target.userId
  WHEN MATCHED THEN UPDATE SET *
Reproduce Experiments
● Use Time Travel to ensure all experiments run on the same snapshot of the table
○ SELECT * FROM my_table VERSION AS OF 1071;
○ SELECT * FROM my_table@v1071
● Archive a blessed snapshot using CLONE
○ CREATE TABLE my_table_xmas CLONE my_table VERSION AS OF 1071
Reproduce Experiments & reports with MLflow
Reproduce Experiments & reports
SELECT count(*) FROM events TIMESTAMP AS OF timestamp
SELECT count(*) FROM events VERSION AS OF version
spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
Time Series Analytics
If you want to find out how many new customers were added over the last week:
SELECT count(distinct userId) - (
  SELECT count(distinct userId)
  FROM my_table TIMESTAMP AS OF date_sub(current_date(), 7))
FROM my_table
DEMO - Riding the time machine
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

Más contenido relacionado

La actualidad más candente

stackconf 2021 | Weaviate Vector Search Engine – Introduction
stackconf 2021 | Weaviate Vector Search Engine – Introductionstackconf 2021 | Weaviate Vector Search Engine – Introduction
stackconf 2021 | Weaviate Vector Search Engine – Introduction
NETWAYS
 

La actualidad más candente (20)

Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Fiware IoT_IDAS_intro_ul20_v2
Fiware IoT_IDAS_intro_ul20_v2Fiware IoT_IDAS_intro_ul20_v2
Fiware IoT_IDAS_intro_ul20_v2
 
ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
stackconf 2021 | Weaviate Vector Search Engine – Introduction
stackconf 2021 | Weaviate Vector Search Engine – Introductionstackconf 2021 | Weaviate Vector Search Engine – Introduction
stackconf 2021 | Weaviate Vector Search Engine – Introduction
 
Delta Lake Cheat Sheet.pdf
Delta Lake Cheat Sheet.pdfDelta Lake Cheat Sheet.pdf
Delta Lake Cheat Sheet.pdf
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Functions in python
Functions in python Functions in python
Functions in python
 
Kafka timestamp offset
Kafka timestamp offsetKafka timestamp offset
Kafka timestamp offset
 

Similar a Data Time Travel by Delta Time Machine

MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
MongoDB
 

Similar a Data Time Travel by Delta Time Machine (20)

Data Time Travel by Delta Time Machine
Data Time Travel by Delta Time MachineData Time Travel by Delta Time Machine
Data Time Travel by Delta Time Machine
 
Air Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and PredictionsAir Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and Predictions
 
Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta Lake
 
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGate
 
Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Wait! What’s going on inside my database?
Wait! What’s going on inside my database?
 
Dataframes in Spark - Data Analysts' perspective
Dataframes in Spark - Data Analysts' perspectiveDataframes in Spark - Data Analysts' perspective
Dataframes in Spark - Data Analysts' perspective
 
Spark3
Spark3Spark3
Spark3
 
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub Implementation
 
On Relevant Query Answering over Streaming and Distributed Data
On Relevant Query Answering over Streaming and Distributed DataOn Relevant Query Answering over Streaming and Distributed Data
On Relevant Query Answering over Streaming and Distributed Data
 
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Urban flood prediction digital ocean august edition
Urban flood prediction   digital ocean august editionUrban flood prediction   digital ocean august edition
Urban flood prediction digital ocean august edition
 
Big Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use CaseBig Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use Case
 
CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...
CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...
CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...
 
IBM Informix dynamic server 11 10 Cheetah Sql Features
IBM Informix dynamic server 11 10 Cheetah Sql FeaturesIBM Informix dynamic server 11 10 Cheetah Sql Features
IBM Informix dynamic server 11 10 Cheetah Sql Features
 
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
 

Más de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Último (20)

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons

Data Time Travel by Delta Time Machine

  • 1. Data Time Travel by Delta Time Machine Burak Yavuz | Software Engineer Vini Jaiswal | Customer Success Engineer
  • 2. Who are we?
      Burak Yavuz ● Software Engineer @ Databricks, "We make your streams come true" ● Apache Spark Committer ● MS in Management Science & Engineering - Stanford University ● BS in Mechanical Engineering - Bogazici University, Turkey
      Vini Jaiswal ● Customer Success Engineer @ Databricks, "Making Customers Successful with their data and ML/AI use cases" ● Data Science Lead - Citi | Data Intern - Southwest Airlines ● MS in Information Technology & Management - UTDallas ● BS in Electrical Engineering - Rajiv Gandhi Technology University, India
  • 3. Agenda Intro to Time Travel Time Travel Use Cases ▪ Data Archiving ▪ Rollbacks ▪ Governance ▪ Reproducing ML experiments Solving with Delta Demo - Riding the time machine
  • 5. What might time travel look like? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1926-12-31' Result: -0.09
      1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1972-12-31' Result: 0.02
      1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' Result: -0.18
      2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '2018-12-31' Result: 0.82
  • 7. Data Archiving ● Changes to data need to be stored and be retrievable for regulatory reasons ● May need to store data for many years (7+)
  • 8. Governance ● What if records need to be forgotten in response to a data subject request? ● And, at the same time, how do you stay in compliance with international regulations?
      Pipeline diagram: Flights (JSON, events per second, from Kinesis), Planes (CSV, slowly changing, on S3), and Weather (JSON, a new dump every 5 minutes on S3) feed the "Delays per airplane" table.
  • 9. Rollbacks ● What if a new job is deployed that accidentally specifies .mode("overwrite")? All historic data is gone.
      Pipeline diagram: Flights (JSON, events per second, from Event Hubs), Planes (CSV, slowly changing, on Blob), and Weather (JSON, a new dump every 5 minutes on Blob) feed the "Delays per airplane" table, which the new job with .mode("overwrite") replaces.
  • 10. Reproduce Experiments ● Reproducibility is the cornerstone of all scientific inquiry ● In order for a machine learning model to be improved, a data scientist must first reproduce the results of the model.
  • 12. For more info, check out "Diving Into Delta Lake: Unpacking the Transaction Log", Wednesday (Nov 11), 15:00 GMT
  • 13. Transaction Protocol ▪ Serializable ACID Writes ▪ Snapshot Isolation ▪ Scalability to billions of partitions or files ▪ Incremental processing
  • 15. Table = Result of a set of actions
      Update Metadata – name, schema, partitioning, etc.
      Add File – adds a file (with optional statistics)
      Remove File – removes a file
      Set Transaction – records an idempotent txn id
      Change Protocol – upgrades the version of the txn protocol
      Result: Current Metadata, List of Files, List of Txns, Version
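The idea that a table is the result of replaying a set of actions can be sketched in a few lines. This is a toy model, not Delta Lake's actual implementation: the `TableState` class, the `replay` and `snapshot_at` functions, and the dictionary layout of each action are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TableState:
    """Result of replaying the log: metadata, live files, txns, version."""
    metadata: Optional[dict] = None
    files: set = field(default_factory=set)
    txns: dict = field(default_factory=dict)
    protocol: Optional[int] = None
    version: int = -1  # -1 means no commit has been replayed yet


def replay(commits):
    """Fold an ordered list of commits (each a list of action dicts) into a state."""
    state = TableState()
    for version, actions in enumerate(commits):
        for action in actions:
            kind = action["type"]
            if kind == "metadata":      # Update Metadata
                state.metadata = action["value"]
            elif kind == "add":         # Add File
                state.files.add(action["path"])
            elif kind == "remove":      # Remove File
                state.files.discard(action["path"])
            elif kind == "txn":         # Set Transaction
                state.txns[action["app_id"]] = action["version"]
            elif kind == "protocol":    # Change Protocol
                state.protocol = action["version"]
            else:
                raise ValueError(f"unknown action type: {kind}")
        state.version = version
    return state


def snapshot_at(commits, version):
    """Time travel in this toy model: replay only commits 0..version."""
    return replay(commits[: version + 1])


# Hypothetical three-commit log.
log = [
    [{"type": "metadata", "value": {"name": "my_table"}},
     {"type": "add", "path": "a.parquet"}],
    [{"type": "add", "path": "b.parquet"}],
    [{"type": "remove", "path": "a.parquet"},
     {"type": "add", "path": "c.parquet"}],
]
latest = replay(log)
older = snapshot_at(log, 1)
```

In this sketch, an older version is simply a shorter replay, which is why time travel only works while the files referenced by older commits still exist.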
  • 18. Time Travelling by version
      SELECT * FROM my_table VERSION AS OF 1071;
      SELECT * FROM my_table@v1071 -- no backticks to specify @
      spark.read.option("versionAsOf", 1071).load("/some/path")
      spark.read.load("/some/path@v1071")
      deltaLog.getSnapshotAt(1071)
  • 19. Time Travelling by timestamp
      SELECT * FROM my_table TIMESTAMP AS OF '1492-10-28';
      SELECT * FROM my_table@14921028000000000 -- yyyyMMddHHmmssSSS
      spark.read.option("timestampAsOf", "1492-10-28").load("/some/path")
      spark.read.load("/some/path@14921028000000000")
      deltaLog.getSnapshotAt(1071)
  • 20. Time Travelling by timestamp: Commit timestamps come from storage system modification timestamps.
      001070.json: 375-01-01
      001071.json: 1453-05-29
      001072.json: 1923-10-29
      001073.json: 1920-04-23
  • 21. Time Travelling by timestamp: Timestamps can be out of order. We adjust by adding 1 millisecond to the previous commit's timestamp.
      001070.json: 375-01-01
      001071.json: 1453-05-29
      001072.json: 1923-10-29
      001073.json: 1920-04-23, adjusted to 1923-10-29 00:00:00.001
  • 22. Time Travelling by timestamp: "Price is right" rules: pick the closest commit whose timestamp doesn't exceed the user's timestamp.
      001070.json: 375-01-01
      001071.json: 1453-05-29
      001072.json: 1923-10-29
      001073.json: 1923-10-29 00:00:00.001
      User timestamp 1492-10-28 resolves to deltaLog.getSnapshotAt(1071)
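The two rules from these slides (bump out-of-order timestamps by 1 millisecond, then pick the latest commit that does not exceed the requested timestamp) can be sketched as follows. This is an illustrative reconstruction from the slide text only; `adjust_timestamps` and `version_as_of` are hypothetical names, and treating an equal timestamp as out of order is an assumption.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def adjust_timestamps(commits):
    """Make commit timestamps strictly increasing.

    `commits` is a list of (version, timestamp) pairs in version order.
    A timestamp that does not exceed its predecessor is replaced by the
    predecessor's (already adjusted) timestamp plus 1 millisecond.
    """
    adjusted = []
    for version, ts in commits:
        if adjusted and ts <= adjusted[-1][1]:
            ts = adjusted[-1][1] + timedelta(milliseconds=1)
        adjusted.append((version, ts))
    return adjusted


def version_as_of(commits, user_ts):
    """'Price is right': latest commit whose timestamp is <= user_ts."""
    adjusted = adjust_timestamps(commits)
    timestamps = [ts for _, ts in adjusted]
    idx = bisect_right(timestamps, user_ts) - 1
    if idx < 0:
        raise ValueError("timestamp is before the earliest commit")
    return adjusted[idx][0]


# The commits and timestamps shown on the slides.
commits = [
    (1070, datetime(375, 1, 1)),
    (1071, datetime(1453, 5, 29)),
    (1072, datetime(1923, 10, 29)),
    (1073, datetime(1920, 4, 23)),  # out of order
]
adjusted = adjust_timestamps(commits)
resolved = version_as_of(commits, datetime(1492, 10, 28))
```

Because the adjusted timestamps are strictly increasing, a binary search is enough to resolve any requested timestamp to a single version.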
  • 23. Back to the Use Cases
  • 24. Data Archiving ● Changes to data need to be stored and be retrievable for regulatory reasons ○ Should you be storing changes (CDC) or the latest snapshot? ● May need to store data for many years (7+) ○ How do you make it cost efficient?
  • 25. What might time travel look like? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1926-12-31' Result: -0.09
      1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1972-12-31' Result: 0.02
      1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' Result: -0.18
      2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '2018-12-31' Result: 0.82
  • 26. Is this really a Time Travel problem? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1926-12-31' Result: -0.09
      1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1972-12-31' Result: 0.02
      1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' Result: -0.18
      2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '2018-12-31' Result: 0.82
  • 27. Is this really a Time Travel problem? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1926' Result: -0.09
      1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1972' Result: 0.02
      1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1880' Result: -0.18
      2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '2018' Result: 0.82
      Better to save data by year and query with a predicate instead of using time travel.
  • 28. Slowly Changing Dimensions (SCD) - Type 1: Only keep latest data
      Before:
      First Name | Last Name | Date of Birth | City | Last Updated
      Henrik | Larsson | September 20, 1971 | Helsingborg | 2012
      After:
      First Name | Last Name | Date of Birth | City | Last Updated
      Henrik | Larsson | September 20, 1971 | Barcelona | 2020
      To access older data, you need to perform Time Travel. Is this the ideal way to store data for my use case?
  • 29. Problems with SCD Type 1 + Time Travel ● Trade-off between data recency, query performance, and storage costs ○ Data recency requires many frequent updates ○ Better query performance requires regular compaction of the data ○ The two above lead to many copies of the data ○ Many copies of the data lead to prohibitive storage costs ● Time Travel requires older copies of the data to exist
  • 30. Slowly Changing Dimensions (SCD) - Type 2: Insert row for each change
      Before:
      First Name | Last Name | Date of Birth | City | Last Updated | Latest
      Henrik | Larsson | September 20, 1971 | Helsingborg | 2012 | Y
      After:
      First Name | Last Name | Date of Birth | City | Last Updated | Latest
      Henrik | Larsson | September 20, 1971 | Helsingborg | 2012 | N
      Henrik | Larsson | September 20, 1971 | Barcelona | 2020 | Y
      To access older data, you simply write a WHERE query. A VIEW can help show only the latest state of the data at any given point.
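A small sketch of the SCD Type 2 layout, using Python's built-in sqlite3 module as a stand-in for a Delta table so the example runs anywhere. The table name `players`, the view name `players_latest`, and the snake_case column names are assumptions made for illustration; the rows are the ones from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (
    first_name TEXT, last_name TEXT, date_of_birth TEXT,
    city TEXT, last_updated INTEGER, latest TEXT
);
-- One row per change; only the newest row is flagged latest = 'Y'.
INSERT INTO players VALUES
    ('Henrik', 'Larsson', '1971-09-20', 'Helsingborg', 2012, 'N'),
    ('Henrik', 'Larsson', '1971-09-20', 'Barcelona',   2020, 'Y');
-- The view hides history and exposes only the current state.
CREATE VIEW players_latest AS
    SELECT first_name, last_name, date_of_birth, city, last_updated
    FROM players
    WHERE latest = 'Y';
""")

# Older data: a plain WHERE predicate, no time travel required.
old_city = conn.execute(
    "SELECT city FROM players "
    "WHERE last_name = 'Larsson' AND last_updated <= 2015 "
    "ORDER BY last_updated DESC LIMIT 1"
).fetchone()[0]

# Current data: read through the view.
latest_city = conn.execute(
    "SELECT city FROM players_latest WHERE last_name = 'Larsson'"
).fetchone()[0]
```

Since history lives in ordinary rows, it survives any cleanup of old table versions, at the cost of a slightly more involved write path that has to flip the `latest` flag on each change.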
  • 32. Rollbacks
      ● Undoing work (restoring an old version of the table):
        RESTORE my_table TO TIMESTAMP AS OF '2020-11-10'
      ● Replaying Structured Streaming pipelines:
        RESTORE target_table TO TIMESTAMP AS OF '2020-11-10'
        spark.readStream.format("delta")
          .option("startingTimestamp", "2020-11-10")
          .load(path)
          // fix logic
          .writeStream
          .option("checkpointLocation", "<new_location>")
          .table("target_table")
  • 33. Rollbacks
      Roll back accidental bad writes:
        INSERT INTO my_table
        SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
      Fix incorrect updates as follows:
        MERGE INTO my_table target
        USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source
        ON source.userId = target.userId
        WHEN MATCHED THEN UPDATE SET *
  • 34. Reproduce Experiments ● Use Time Travel to ensure all experiments run on the same snapshot of the table ○ SELECT * FROM my_table VERSION AS OF 1071; ○ SELECT * FROM my_table@v1071 ● Archive a blessed snapshot using CLONE ○ CREATE TABLE my_table_xmas CLONE my_table VERSION AS OF 1071
  • 35. Reproduce Experiments & reports with MLflow
  • 36. Reproduce Experiments & reports
      SELECT count(*) FROM events TIMESTAMP AS OF timestamp
      SELECT count(*) FROM events VERSION AS OF version
      spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
  • 37. Time Series Analytics: If you want to find out how many new customers were added over the last week:
      SELECT count(distinct userId) - (
        SELECT count(distinct userId)
        FROM my_table TIMESTAMP AS OF date_sub(current_date(), 7))
      FROM my_table
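The arithmetic behind this query can be illustrated with two plain Python collections standing in for the current table and the snapshot from seven days ago. The function names and the sample user IDs are hypothetical. The sketch also shows one caveat: subtracting distinct counts gives the net change, which equals the number of new customers only if no user was removed in between.

```python
def net_change(current_ids, week_ago_ids):
    """Mirror of the SQL: count(distinct) now minus count(distinct) a week ago."""
    return len(set(current_ids)) - len(set(week_ago_ids))


def new_users(current_ids, week_ago_ids):
    """Users present now that were absent from the older snapshot."""
    return set(current_ids) - set(week_ago_ids)


# Hypothetical snapshots of the userId column.
week_ago = [1, 2, 3, 3]
today = [1, 2, 3, 4, 5, 5]

added = net_change(today, week_ago)
who = new_users(today, week_ago)

# If user 3 was deleted during the week, the two measures diverge.
today_with_deletion = [1, 2, 4, 5]
net_after_deletion = net_change(today_with_deletion, week_ago)
new_after_deletion = new_users(today_with_deletion, week_ago)
```

When deletions are possible, comparing the two snapshots by key, as `new_users` does, gives the count the slide describes.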
  • 38. DEMO - Riding the time machine
  • 39. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.