SlideShare una empresa de Scribd logo
1 de 36
Descargar para leer sin conexión
Data Time Travel by
Delta Time Machine
Burak Yavuz | Software Engineer
Vini Jaiswal | Customer Success Engineer
Who are we?
● Software Engineer @ Databricks
“We make your streams come true”
● Apache Spark Committer
● MS in Management Science & Engineering - Stanford University
● BS in Mechanical Engineering - Bogazici University, Turkey
● Customer Success Engineer @ Databricks
“Making Customers Successful with their data and ML/AI use cases”
● “Brickstein’s Briefcase” Host
● Data Science Lead - Citi | Data Intern - Southwest Airlines
● MS in Information Technology & Management - UTDallas
● BS in Electrical Engineering - Rajiv Gandhi Technology University, IndiaVini Jaiswal
Burak Yavuz
Agenda
Intro to Time Travel
Time Travel Use Cases
▪ Data Archiving
▪ Rollbacks
▪ Reproducing ML experiments
▪ Governance
Solving with Delta
Demo - Riding the time machine
Introduction to Time Travel
What might time travel look like?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF ‘1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF ‘1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF ‘2018-12-31'
0.82
Data
Archiving
Governance Rollbacks Reproduce
Experiments
Time Travel Use Cases
Data Archiving
● Changes to data need to be stored and be retrievable for regulatory reasons
● May need to store data for many years (7+)
Governance
8
Flights
Delays per
airplane
Planes
Weather
● What if records need to be forgotten with respect to Data Subject request
● And, at the same time, how do you stay in compliance with international
regulations?
Flights
(JSON)
events per
second
Kinesis
Planes
(CSV)
slow
changing
S3
Weather
(JSON)
every 5
minutes a
new dump
on S3
Rollbacks
9
Flights
Planes
Weather
Flights
(JSON)
events per
second
Event
Hubs
Planes
(CSV)
slow
changing
Blob
Weather
(JSON)
every 5
minutes a new
dump on Blob
What if a new job is deployed that
accidentally specifies
.mode(“overwrite”)
New job with .mode(“overwrite”)
Delays per
airplane
All
historic
data gone
Reproduce Experiments
● Reproducibility is the cornerstone of all scientific inquiry
● In order for a machine learning model to be improved, a data scientist
must first reproduce the results of the model.
Reproduce
Experiments
Solving with Delta
For more info check out
Diving Into Delta Lake:
Unpacking the Transaction Log
Wednesday (Nov 11) 15:00 GMT
Transaction Protocol
▪ Serializable ACID Writes
▪ Snapshot Isolation
▪ Scalability to billions of partitions or files
▪ Incremental processing
Computing Delta’s State
000000.json
000001.json
000002.json
000003.json
000004.json
000005.json
000005.json
000006.json
000007.json
listFrom
version 0
Cache
version 7
Update Metadata – name, schema, partitioning, etc
Add File – adds a file (with optional statistics)
Remove File – removes a file
Set Transaction – records an idempotent txn id
Change Protocol – upgrades the version of the txn protocol
Result: Current Metadata, List of Files, List of Txns, Version
Table = Result of a set of actions
Computing Delta’s State
000000.json
...
000007.json
000008.json
000009.json
0000010.json
0000010.checkpoint.parquet
0000011.json
0000012.json
Cache
version 12
listFrom
version 0
Computing Delta’s State
0000010.checkpoint.parquet
0000011.json
0000012.json
0000013.json
0000014.json
Cache
version 14
listFrom
version 10
Time Travelling by version
SELECT * FROM my_table VERSION AS OF 1071;
SELECT * FROM my_table@v1071 -- no backticks to specify @
spark.read.option("versionAsOf", 1071).load("/some/path")
spark.read.load("/some/path@v1071")
deltaLog.getSnapshotAt(1071)
Time Travelling by timestamp
SELECT * FROM my_table TIMESTAMP AS OF '1492-10-28';
SELECT * FROM my_table@14921028000000000 -- yyyyMMddHHmmssSSS
spark.read.option("timestampAsOf", "1492-10-28").load("/some/path")
spark.read.load("/some/path@14921028000000000")
deltaLog.getSnapshotAt(1071)
Time Travelling by timestamp
001070.json
001071.json
001072.json
001073.json
Commit timestamps come from storage system modification
timestamps
375-01-01
1453-05-29
1923-10-29
1920-04-23
Time Travelling by timestamp
001070.json
001071.json
001072.json
001073.json
Timestamps can be out of order. We adjust by adding 1 millisecond to the
previous commit’s timestamp.
375-01-01
1453-05-29
1923-10-29
1920-04-23
375-01-01
1453-05-29
1923-10-29
1923-10-29 00:00:00.001
Time Travelling by timestamp
001070.json
001071.json
001072.json
001073.json
Price is right rules: Pick closest commit with timestamp that doesn’t exceed
the user’s timestamp.
375-01-01
1453-05-29
1923-10-29
1923-10-29 00:00:00.001
1492-10-28
deltaLog.getSnapshotAt(1071)
Data Archiving
● Changes to data need to be stored and be retrievable for regulatory reasons
○ Should you be storing changes (CDC) or the latest snapshot?
● May need to store data for many years (7+)
○ How do you make it cost efficient?
What might time travel look like?
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF ‘1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF ‘1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF ‘2018-12-31'
0.82
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
Is this really a Time Travel problem?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF ‘1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF ‘1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF ‘2018-12-31'
0.82
Is this really a Time Travel problem?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1926'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1972'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1880'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '2018'
0.82
Better to save data by year and query with a predicate instead of using time travel.
Slowly Changing Dimensions (SCD)
- Type 1: Only keep latest data
First Name Last Name Date of Birth City Last Updated
Henrik Larsson September 20, 1971 Helsingborg 2012
First Name Last Name Date of Birth City Last Updated
Henrik Larsson September 20, 1971 Barcelona 2020
To access older data, you need to perform Time Travel. Is this the ideal way to store data for my use case?
Problems with SCD Type 1 + Time Travel
● Trade-off between data recency, query performance, and storage
costs
○ Data recency requires many frequent updates
○ Better query performance requires regular compaction of the data
○ The two above lead to many copies of the data
○ Many copies of the data lead to prohibitive storage costs
● Time Travel requires older copies of the data to exist
Slowly Changing Dimensions (SCD)
- Type 2: Insert row for each change
First Name Last Name Date of Birth City Last Updated Latest
Henrik Larsson September 20, 1971 Helsingborg 2012 Y
First Name Last Name Date of Birth City Last Updated Latest
Henrik Larsson September 20, 1971 Helsingborg 2012 N
Henrik Larsson September 20, 1971 Barcelona 2020 Y
To access older data, you simply write a WHERE query. A VIEW can help show only the latest state of the data at any given point.
Governance
31
DESCRIBE HISTORY my_table
Rollbacks
● Undoing work (restoring an old version of the table)
RESTORE my_table TO TIMESTAMP AS OF '2020-11-10'
● Replaying Structured Streaming Pipelines
RESTORE target_table TO TIMESTAMP AS OF '2020-11-10'
spark.readStream.format("delta")
.option("startingTimestamp", "2020-11-10")
.load(path)
// fix logic
.writeStream
.table("target_table")
Reproduce Experiments
● Use Time Travel to ensure all experiments run on the same snapshot
of the table
○ SELECT * FROM my_table VERSION AS OF 1071;
○ SELECT * FROM my_table@v1071
● Archive a blessed snapshot using CLONE
○ CREATE TABLE my_table_xmas
○ CLONE my_table VERSION AS OF 1071
Reproduce Experiments & reports with MLflow
Time Series Analytics
If you want to find out how many new customers were added
over the last week
SELECT
count(distinct userId) - (
SELECT count(distinct userId)
FROM my_table
TIMESTAMP AS OF date_sub(current_date(), 7))
FROM my_table
DEMO - Riding the time machine
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

Más contenido relacionado

La actualidad más candente

Talend ETL Tutorial | Talend Tutorial For Beginners | Talend Online Training ...
Talend ETL Tutorial | Talend Tutorial For Beginners | Talend Online Training ...Talend ETL Tutorial | Talend Tutorial For Beginners | Talend Online Training ...
Talend ETL Tutorial | Talend Tutorial For Beginners | Talend Online Training ...Edureka!
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
Hackolade Tutorial - part 13 - Leverage a Polyglot data model
Hackolade Tutorial - part 13 - Leverage a Polyglot data modelHackolade Tutorial - part 13 - Leverage a Polyglot data model
Hackolade Tutorial - part 13 - Leverage a Polyglot data modelPascalDesmarets1
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureDataWorks Summit
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...Spark Summit
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...Andrew Lamb
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Spark Summit
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on ReadDatabricks
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Unity遊戲程式設計 - 2D運動與碰撞處理I
Unity遊戲程式設計 - 2D運動與碰撞處理IUnity遊戲程式設計 - 2D運動與碰撞處理I
Unity遊戲程式設計 - 2D運動與碰撞處理I吳錫修 (ShyiShiou Wu)
 

La actualidad más candente (20)

Talend ETL Tutorial | Talend Tutorial For Beginners | Talend Online Training ...
Talend ETL Tutorial | Talend Tutorial For Beginners | Talend Online Training ...Talend ETL Tutorial | Talend Tutorial For Beginners | Talend Online Training ...
Talend ETL Tutorial | Talend Tutorial For Beginners | Talend Online Training ...
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Hackolade Tutorial - part 13 - Leverage a Polyglot data model
Hackolade Tutorial - part 13 - Leverage a Polyglot data modelHackolade Tutorial - part 13 - Leverage a Polyglot data model
Hackolade Tutorial - part 13 - Leverage a Polyglot data model
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and future
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
 
Presto overview
Presto overviewPresto overview
Presto overview
 
emc++ chapter32
emc++ chapter32emc++ chapter32
emc++ chapter32
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Unity遊戲程式設計 - 2D運動與碰撞處理I
Unity遊戲程式設計 - 2D運動與碰撞處理IUnity遊戲程式設計 - 2D運動與碰撞處理I
Unity遊戲程式設計 - 2D運動與碰撞處理I
 

Similar a Data Time Travel by Delta Time Machine

Data Time Travel by Delta Time Machine
Data Time Travel by Delta Time MachineData Time Travel by Delta Time Machine
Data Time Travel by Delta Time MachineDatabricks
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangDatabricks
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeDatabricks
 
Urban flood prediction digital ocean august edition
Urban flood prediction   digital ocean august editionUrban flood prediction   digital ocean august edition
Urban flood prediction digital ocean august editiontransight
 
Air Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and PredictionsAir Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and PredictionsCarlo Carandang
 
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...ScyllaDB
 
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Jason L Brugger
 
IOT with PostgreSQL
IOT with PostgreSQLIOT with PostgreSQL
IOT with PostgreSQLEDB
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateRajit Saha
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaYaroslav Tkachenko
 
Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Jeremy Schneider
 
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...StampedeCon
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationVengata Guruswamy
 
Rapid analytic development on near real time data
Rapid analytic development on near real time dataRapid analytic development on near real time data
Rapid analytic development on near real time dataAustin Heyne
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Robbie Strickland
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseAll Things Open
 
Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for EveryoneGiovanna Roda
 
Sistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta LakeSistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta LakeGlobant
 

Similar a Data Time Travel by Delta Time Machine (20)

Data Time Travel by Delta Time Machine
Data Time Travel by Delta Time MachineData Time Travel by Delta Time Machine
Data Time Travel by Delta Time Machine
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan Wang
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta Lake
 
Big Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use CaseBig Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use Case
 
Urban flood prediction digital ocean august edition
Urban flood prediction   digital ocean august editionUrban flood prediction   digital ocean august edition
Urban flood prediction digital ocean august edition
 
Air Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and PredictionsAir Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and Predictions
 
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
 
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
 
IOT with PostgreSQL
IOT with PostgreSQLIOT with PostgreSQL
IOT with PostgreSQL
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGate
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
 
Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Wait! What’s going on inside my database?
Wait! What’s going on inside my database?
 
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub Implementation
 
Rapid analytic development on near real time data
Rapid analytic development on near real time dataRapid analytic development on near real time data
Rapid analytic development on near real time data
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series Database
 
Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for Everyone
 
Sistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta LakeSistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta Lake
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 

Último (20)

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 

Data Time Travel by Delta Time Machine

  • 1. Data Time Travel by Delta Time Machine Burak Yavuz | Software Engineer Vini Jaiswal | Customer Success Engineer
  • 2. Who are we? ● Software Engineer @ Databricks “We make your streams come true” ● Apache Spark Committer ● MS in Management Science & Engineering - Stanford University ● BS in Mechanical Engineering - Bogazici University, Turkey ● Customer Success Engineer @ Databricks “Making Customers Successful with their data and ML/AI use cases” ● “Brickstein’s Briefcase” Host ● Data Science Lead - Citi | Data Intern - Southwest Airlines ● MS in Information Technology & Management - UTDallas ● BS in Electrical Engineering - Rajiv Gandhi Technology University, IndiaVini Jaiswal Burak Yavuz
  • 3. Agenda Intro to Time Travel Time Travel Use Cases ▪ Data Archiving ▪ Rollbacks ▪ Reproducing ML experiments ▪ Governance Solving with Delta Demo - Riding the time machine
  • 5. What might time travel look like? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/) 1926 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF ‘1926-12-31' -0.09 1972 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF ‘1972-12-31' 0.02 1880 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' -0.18 2018 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF ‘2018-12-31' 0.82
  • 7. Data Archiving ● Changes to data need to be stored and be retrievable for regulatory reasons ● May need to store data for many years (7+)
  • 8. Governance 8 Flights Delays per airplane Planes Weather ● What if records need to be forgotten with respect to Data Subject request ● And, at the same time, how do you stay in compliance with international regulations? Flights (JSON) events per second Kinesis Planes (CSV) slow changing S3 Weather (JSON) every 5 minutes a new dump on S3
  • 9. Rollbacks 9 Flights Planes Weather Flights (JSON) events per second Event Hubs Planes (CSV) slow changing Blob Weather (JSON) every 5 minutes a new dump on Blob What if a new job is deployed that accidentally specifies .mode(“overwrite”) New job with .mode(“overwrite”) Delays per airplane All historic data gone
  • 10. Reproduce Experiments ● Reproducibility is the cornerstone of all scientific inquiry ● In order for a machine learning model to be improved, a data scientist must first reproduce the results of the model. Reproduce Experiments
  • 12. For more info check out Diving Into Delta Lake: Unpacking the Transaction Log Wednesday (Nov 11) 15:00 GMT
  • 13. Transaction Protocol ▪ Serializable ACID Writes ▪ Snapshot Isolation ▪ Scalability to billions of partitions or files ▪ Incremental processing
  • 15. Update Metadata – name, schema, partitioning, etc Add File – adds a file (with optional statistics) Remove File – removes a file Set Transaction – records an idempotent txn id Change Protocol – upgrades the version of the txn protocol Result: Current Metadata, List of Files, List of Txns, Version Table = Result of a set of actions
  • 18. Time Travelling by version SELECT * FROM my_table VERSION AS OF 1071; SELECT * FROM my_table@v1071 -- no backticks to specify @ spark.read.option("versionAsOf", 1071).load("/some/path") spark.read.load("/some/path@v1071") deltaLog.getSnapshotAt(1071)
  • 19. Time Travelling by timestamp SELECT * FROM my_table TIMESTAMP AS OF '1492-10-28'; SELECT * FROM my_table@14921028000000000 -- yyyyMMddHHmmssSSS spark.read.option("timestampAsOf", "1492-10-28").load("/some/path") spark.read.load("/some/path@14921028000000000") deltaLog.getSnapshotAt(1071)
  • 20. Time Travelling by timestamp 001070.json 001071.json 001072.json 001073.json Commit timestamps come from storage system modification timestamps 375-01-01 1453-05-29 1923-10-29 1920-04-23
  • 21. Time Travelling by timestamp 001070.json 001071.json 001072.json 001073.json Timestamps can be out of order. We adjust by adding 1 millisecond to the previous commit’s timestamp. 375-01-01 1453-05-29 1923-10-29 1920-04-23 375-01-01 1453-05-29 1923-10-29 1923-10-29 00:00:00.001
  • 22. Time Travelling by timestamp 001070.json 001071.json 001072.json 001073.json Price is right rules: Pick closest commit with timestamp that doesn’t exceed the user’s timestamp. 375-01-01 1453-05-29 1923-10-29 1923-10-29 00:00:00.001 1492-10-28 deltaLog.getSnapshotAt(1071)
  • 23. Data Archiving ● Changes to data need to be stored and be retrievable for regulatory reasons ○ Should you be storing changes (CDC) or the latest snapshot? ● May need to store data for many years (7+) ○ How do you make it cost efficient?
  • 24. What might time travel look like? 1926 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF ‘1926-12-31' -0.09 1972 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF ‘1972-12-31' 0.02 1880 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' -0.18 2018 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF ‘2018-12-31' 0.82 Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
  • 25. Is this really a Time Travel problem? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/) 1926 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF ‘1926-12-31' -0.09 1972 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF ‘1972-12-31' 0.02 1880 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' -0.18 2018 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF ‘2018-12-31' 0.82
  • 26. Is this really a Time Travel problem? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/) 1926 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1926' -0.09 1972 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1972' 0.02 1880 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1880' -0.18 2018 SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '2018' 0.82 Better to save data by year and query with a predicate instead of using time travel.
  • 27. Slowly Changing Dimensions (SCD) - Type 1: Only keep latest data First Name Last Name Date of Birth City Last Updated Henrik Larsson September 20, 1971 Helsingborg 2012 First Name Last Name Date of Birth City Last Updated Henrik Larsson September 20, 1971 Barcelona 2020 To access older data, you need to perform Time Travel. Is this the ideal way to store data for my use case?
  • 28. Problems with SCD Type 1 + Time Travel ● Trade-off between data recency, query performance, and storage costs ○ Data recency requires many frequent updates ○ Better query performance requires regular compaction of the data ○ The two above lead to many copies of the data ○ Many copies of the data lead to prohibitive storage costs ● Time Travel requires older copies of the data to exist
  • 29. Slowly Changing Dimensions (SCD) - Type 2: Insert row for each change First Name Last Name Date of Birth City Last Updated Latest Henrik Larsson September 20, 1971 Helsingborg 2012 Y First Name Last Name Date of Birth City Last Updated Latest Henrik Larsson September 20, 1971 Helsingborg 2012 N Henrik Larsson September 20, 1971 Barcelona 2020 Y To access older data, you simply write a WHERE query. A VIEW can help show only the latest state of the data at any given point.
  • 31. Rollbacks ● Undoing work (restoring an old version of the table) RESTORE my_table TO TIMESTAMP AS OF '2020-11-10' ● Replaying Structured Streaming Pipelines RESTORE target_table TO TIMESTAMP AS OF '2020-11-10' spark.readStream.format("delta") .option("startingTimestamp", "2020-11-10") .load(path) // fix logic .writeStream .table("target_table")
  • 32. Reproduce Experiments ● Use Time Travel to ensure all experiments run on the same snapshot of the table ○ SELECT * FROM my_table VERSION AS OF 1071; ○ SELECT * FROM my_table@v1071 ● Archive a blessed snapshot using CLONE ○ CREATE TABLE my_table_xmas ○ CLONE my_table VERSION AS OF 1071
  • 33. Reproduce Experiments & reports with MLflow
  • 34. Time Series Analytics If you want to find out how many new customers were added over the last week SELECT count(distinct userId) - ( SELECT count(distinct userId) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 7)) FROM my_table
  • 35. DEMO - Riding the time machine
  • 36. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.