SlideShare una empresa de Scribd logo
1 de 45
Descargar para leer sin conexión
SPARK SUMMIT
EUROPE2016
Distributed Time Series Analysis
Framework For Spark
Larisa Sawyer
Two Sigma
Larisa Sawyer
November1, 2016 2
$0.0
$500.0
$1,000.0
$1,500.0
$2,000.0
$2,500.0
1/3/1950
1/3/1953
1/3/1956
1/3/1959
1/3/1962
1/3/1965
1/3/1968
1/3/1971
1/3/1974
1/3/1977
1/3/1980
1/3/1983
1/3/1986
1/3/1989
1/3/1992
1/3/1995
1/3/1998
1/3/2001
1/3/2004
1/3/2007
1/3/2010
1/3/2013
1/3/2016
S&P 500
Time seriesexamples
November1, 2016
w Stock market prices
w Temperatures
w Height
w …
18°C
20°C
22°C
24°C
26°C
28°C
30°C
32°C
34°C
New York
Brussels
100cm
110cm
120cm
130cm
140cm
150cm
160cm
170cm
180cm
5 6 7 8 9 10 11 12 13 14 15
Age (years)
Avg US female
100cm
110cm
120cm
130cm
140cm
150cm
160cm
170cm
180cm
5 6 7 8 9 10 11 12 13 14 15
Age (years)
Avg US female
Larisa
3
What do we do with time series data?
November1, 2016
w Forecast future values given past observations
$8.90
$8.95
$8.90
$9.06
$9.10
10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10
corn price
?
?
?
4
November1, 2016
Univariate time series
5
Multivariate time series
November1, 2016
w We can forecast better by joining multiple time series
w Our framework enables fast distributed temporal join of large scale unalignedtime series
w Temporal join is a fundamental operation for time series analysis
$8.90
$8.95
$8.90
$9.06
$9.10
10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10
corn price
75°F
72°F
71°F
72°F
68°F
67°F
65°F
temperature
6
Multivariate time series
November1, 2016
w We can forecast better by joining multiple time series
w Our framework enables fast distributed temporal join of large scale unalignedtime series
w Temporal join is a fundamental operation for time series analysis
€7.94
€7.98
€7.94
€8.08
€8.12
10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10
corn price
23°C
22°C
21°C
22°C
20°C
19°C
18°C
temperature
7
What is a left join?
November1, 2016
time series 1 time series 2
8
What is temporal join?
November1, 2016
w A particular joinfunction defined by a matching criteriaover time
w Examples of criteria
w look-backward
w look-forward
time series 1 time series 2
look-forward
time series 1 time series 2
look-backward
observation
9
Temporaljoin with look-backwardcriteria
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
10
ImportantLegal Information
November1, 2016 11
The information presented here is offered for informational purposes only and should
not be used for any other purpose (including, without limitation, the making of
investment decisions). Examples provided herein are for illustrative purposes only and
are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the
solicitation of any offer to buy any security or other interest. We consider this
information to be confidential and not for redistribution or dissemination.
Temporaljoin with look-backwardcriteria
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
time tweets BRK.A
08:00 AM
10:00 AM
12:00 PM
12
Temporaljoin with look-backwardcriteria
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
time tweets BRK.A
08:00 AM
10:00 AM
12:00 PM
13
Temporaljoin with look-backwardcriteria
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
time tweets BRK.A
08:00 AM
10:00 AM
12:00 PM
14
Temporaljoin with look-backwardcriteria
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
time tweets BRK.A
08:00 AM
10:00 AM
12:00 PM
15
Temporaljoins in practice
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
16
Time Series Scale
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AMWe need fast and scalable
distributed temporal join
17
Existing solutions
November1, 2016
w Existing packages don’t support temporal join or can’t handle large time series
w Pandas / R / Matlab
w Limited to single machine
w Spark
w Does scale, but all data is unordered
w spark-ts
w Expects univariate time series to fit on single machine
w Splits by col
w Supports only snapshot data
18
Flint: A new time series library for Spark
November1, 2016
w Goal
w Provide a collection of functions to manipulate and analyze time series at scale
w Group, temporal join, summarize, aggregate …
w How
w Build a time series aware data structure
w TimeSeriesRDD extends RDD
w Optimize using temporal locality
w Reduce shuffling
w Reduce memory pressure by streaming
19
What is a TimeSeriesRDD?
November1, 2016
w TimeSeriesRDD vs RDD
w Associate time range on each partition
w Track partition time-ranges
w Preserve temporal order
20
RDD
November1, 2016
time temperature
6:00AM 60°F
6:01AM 61°F
… …
7:00AM 70°F
7:01AM 71°F
… …
8:00AM 80°F
8:01AM 81°F
… …
RDD
(6:00 AM, 60°F)
(6:01 AM, 61°F)
…
(7:00 AM, 70°F)
(7:01 AM, 71°F)
…
(8:00 AM, 80°F)
(8:01 AM, 81°F)
…
(6:58 AM, 64°F)
(6:59 AM, 65°F)
…
(7:34 AM, 74°F)
(7:35 AM, 74°F)
…
(7:58 AM, 76°F)
(7:59 AM, 77°F)
…
Raw Data
21
TimeSeriesRDD
November1, 2016
time temperature
6:00AM 60°F
6:01AM 61°F
… …
7:00AM 70°F
7:01AM 71°F
… …
8:00AM 80°F
8:01AM 81°F
… …
RDD
(6:00 AM, 60°F)
(6:01 AM, 61°F)
…
(7:00 AM, 70°F)
(7:01 AM, 71°F)
…
(8:00 AM, 80°F)
(8:01 AM, 81°F)
…
(6:58 AM, 64°F)
(6:59 AM, 65°F)
…
(7:34 AM, 74°F)
(7:35 AM, 74°F)
…
(7:58 AM, 76°F)
(7:59 AM, 77°F)
…
(6:00 AM, 60°F)
(6:01 AM, 61°F)
…
(8:00 AM, 80°F)
(8:01 AM, 81°F)
…
(7:00 AM, 70°F)
(7:01 AM, 71°F)
…
TSRDD
[06:00AM,
07:00AM)
[07:00AM,
8:00AM)
[8:00AM,
∞)
Raw Data
22
Group function
November1, 2016
w A group function groups rows with exactly the same timestamps
time city temperature
1:00 PM New York 70°F
1:00 PM Brussels 60°F
2:00 PM New York 71°F
2:00 PM Brussels 61°F
3:00 PM New York 72°F
3:00 PM Brussels 62°F
4:00 PM New York 73°F
4:00 PM Brussels 63°F
group 1
group 2
group 3
group 4
23
Group function
November1, 2016
w A group function groups rows with nearby timestamps
time city temperature
1:00 PM New York 70°F
1:00 PM Brussels 60°F
2:00 PM New York 71°F
2:00 PM Brussels 61°F
3:00 PM New York 72°F
3:00 PM Brussels 62°F
4:00 PM New York 73°F
4:00 PM Brussels 63°F
group 1
group 2
24
Group in Spark
November1, 2016
w Groups rows with exactly the same timestamps
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
25
w Data is shuffled and materialized on the workers
Group in Spark
November1, 2016
RDD
groupBy
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
sortBy
RDD
w Back to Temporal Order
w Temporal order is not preserved
26
Group in TimeSeriesRDD
November1, 2016
w Data is grouped per partition locally as streams
TimeSeriesRDD
2:00PM
1:00PM
3:00PM
4:00PM
1:00PM
1:00PM
2:00PM
2:00PM
3:00PM
3:00PM
4:00PM
4:00PM
27
• Running time of count after group
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS
Benchmark for group + count
November1, 2016
0s 20s 40s 60s 80s 100s
20M
40M
60M
80M
100M TimeseriesRDD
DataFrame
RDD
50 - 100X5 - 10X
28
Temporaljoin
November1, 2016
w A temporal join function is defined by a matching criteria over time
w A typical matching criteria has two parameters
w direction– look-backward or look-forward
w window – how much to look-backward or look-forward
look-backward temporal join
window
29
Temporaljoin
November1, 2016
w A temporal join function is defined by a matching criteria over time
w A typical matching criteria has two parameters
w direction– look-backward or look-forward
w window – how much to look-backward or look-forward
look-backward temporal join
window
30
Temporaljoin
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
time series 1
31
time series 2
Temporaljoin
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w How do we do temporal join in TimeSeriesRDD?
TimeSeriesRDD TimeSeriesRDD
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
32
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w partition time space into disjoint intervals
TimeSeriesRDD TimeSeriesRDDjoined
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
33
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Build dependencygraph for the joinedTimeSeriesRDD
TimeSeriesRDD TimeSeriesRDDjoined
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
[1:00AM, 4:00AM)
[4:00AM, 6:00AM)
[1:00AM, 4:00AM)
[4:00AM, 6:00AM)
34
1:00AM1:00AM
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams per partition
1:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
2:00AM
4:00AM
5:00AM
3:00AM
5:00AM
35
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM 1:00AM1:00AM
2:00AM
36
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams
2:00AM
1:00AM
5:00AM
1:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
1:00AM
1:00AM
2:00AM
4:00AM
3:00AM
4:00AM
3:00AM
37
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
1:00AM
1:00AM
2:00AM
4:00AM 3:00AM
5:00AM 5:00AM
38
Benchmark for temporaljoin + count
November1, 2016
0s 10s 20s 30s 40s 50s 60s 70s 80s 90s 100s
20M
40M
60M
80M
100M TimeseriesRDD
DataFrame
RDD
20 - 50X5 - 10X
39
• Running time of count after temporal join
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS
Functions over TimeSeriesRDD
November1, 2016
w Grouping functions
w Temporal joins such as look-forward, look-backward etc.
w Summarizers such as average, variance, z-score etc. over grouping functions
40
Open Source
November1, 2016
w True!
w https://github.com/twosigma/flint
41
What’snext?
November1, 2016
w TimeSeriesDataframe / TimeSeriesDataset
w Speed up
w RicherAPIs
w Python bindings
w Additional summarizers
42
Key contributors
November1, 2016
w ChristopherAycock
w Yuri Bogomolov
w Jonathan Coveney
w Li Jin
w David Medina
w Julia Meinwald
w David Palaitis
w Larisa Sawyer
w Leif Walsh
w Wenbo Zhao
43
Flint: Time Series For Spark
November1, 2016
A library to solve for general time series analysis operations at massive scale
Anne Hathaway has nothing to do with Berkshire Hathaway
Check it out in open source, and contribute
https://github.com/twosigma/flint
44
SPARK SUMMIT
EUROPE2016
THANK YOU.
Larisa.Sawyer@twosigma.com

Más contenido relacionado

Destacado

Spark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman FarahatSpark Summit
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Spark Summit
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
Spark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit
 
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf SigmundSpark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf SigmundSpark Summit
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 

Destacado (20)

Spark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross Lawley
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til Piffl
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean Wampler
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat Patterson
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van Hovell
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim Hunter
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Spark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey Stella
 
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf SigmundSpark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 

Más de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Más de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 

Último (20)

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 

Spark Summit EU talk by Larisa Sawyer

  • 1. SPARK SUMMIT EUROPE2016 Distributed Time Series Analysis Framework For Spark Larisa Sawyer Two Sigma
  • 3. $0.0 $500.0 $1,000.0 $1,500.0 $2,000.0 $2,500.0 1/3/1950 1/3/1953 1/3/1956 1/3/1959 1/3/1962 1/3/1965 1/3/1968 1/3/1971 1/3/1974 1/3/1977 1/3/1980 1/3/1983 1/3/1986 1/3/1989 1/3/1992 1/3/1995 1/3/1998 1/3/2001 1/3/2004 1/3/2007 1/3/2010 1/3/2013 1/3/2016 S&P 500 Time seriesexamples November1, 2016 w Stock market prices w Temperatures w Height w … 18°C 20°C 22°C 24°C 26°C 28°C 30°C 32°C 34°C New York Brussels 100cm 110cm 120cm 130cm 140cm 150cm 160cm 170cm 180cm 5 6 7 8 9 10 11 12 13 14 15 Age (years) Avg US female 100cm 110cm 120cm 130cm 140cm 150cm 160cm 170cm 180cm 5 6 7 8 9 10 11 12 13 14 15 Age (years) Avg US female Larisa 3
  • 4. What do we do with time series data? November1, 2016 w Forecast future values given past observations $8.90 $8.95 $8.90 $9.06 $9.10 10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10 corn price ? ? ? 4
  • 6. Multivariate time series November1, 2016 w We can forecast better by joining multiple time series w Our framework enables fast distributed temporal join of large scale unalignedtime series w Temporal join is a fundamental operation for time series analysis $8.90 $8.95 $8.90 $9.06 $9.10 10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10 corn price 75°F 72°F 71°F 72°F 68°F 67°F 65°F temperature 6
  • 7. Multivariate time series November1, 2016 w We can forecast better by joining multiple time series w Our framework enables fast distributed temporal join of large scale unalignedtime series w Temporal join is a fundamental operation for time series analysis €7.94 €7.98 €7.94 €8.08 €8.12 10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10 corn price 23°C 22°C 21°C 22°C 20°C 19°C 18°C temperature 7
  • 8. What is a left join? November1, 2016 time series 1 time series 2 8
  • 9. What is temporal join? November1, 2016 w A particular joinfunction defined by a matching criteriaover time w Examples of criteria w look-backward w look-forward time series 1 time series 2 look-forward time series 1 time series 2 look-backward observation 9
  • 10. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM 10
  • 11. ImportantLegal Information November1, 2016 11 The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the solicitation of any offer to buy any security or other interest. We consider this information to be confidential and not for redistribution or dissemination.
  • 12. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 12
  • 13. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 13
  • 14. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 14
  • 15. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 15
  • 16. Temporaljoins in practice November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM 16
  • 17. Time Series Scale November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AMWe need fast and scalable distributed temporal join 17
  • 18. Existing solutions November1, 2016 w Existing packages don’t support temporal join or can’t handle large time series w Pandas / R / Matlab w Limited to single machine w Spark w Does scale, but all data is unordered w spark-ts w Expects univariate time series to fit on single machine w Splits by col w Supports only snapshot data 18
  • 19. Flint: A new time series library for Spark November1, 2016 w Goal w Provide a collection of functions to manipulate and analyze time series at scale w Group, temporal join, summarize, aggregate … w How w Build a time series aware data structure w TimeSeriesRDD extends RDD w Optimize using temporal locality w Reduce shuffling w Reduce memory pressure by streaming 19
  • 20. What is a TimeSeriesRDD? November1, 2016 w TimeSeriesRDD vs RDD w Associate time range on each partition w Track partition time-ranges w Preserve temporal order 20
  • 21. RDD November1, 2016 time temperature 6:00AM 60°F 6:01AM 61°F … … 7:00AM 70°F 7:01AM 71°F … … 8:00AM 80°F 8:01AM 81°F … … RDD (6:00 AM, 60°F) (6:01 AM, 61°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) … (6:58 AM, 64°F) (6:59 AM, 65°F) … (7:34 AM, 74°F) (7:35 AM, 74°F) … (7:58 AM, 76°F) (7:59 AM, 77°F) … Raw Data 21
  • 22. TimeSeriesRDD November1, 2016 time temperature 6:00AM 60°F 6:01AM 61°F … … 7:00AM 70°F 7:01AM 71°F … … 8:00AM 80°F 8:01AM 81°F … … RDD (6:00 AM, 60°F) (6:01 AM, 61°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) … (6:58 AM, 64°F) (6:59 AM, 65°F) … (7:34 AM, 74°F) (7:35 AM, 74°F) … (7:58 AM, 76°F) (7:59 AM, 77°F) … (6:00 AM, 60°F) (6:01 AM, 61°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … TSRDD [06:00AM, 07:00AM) [07:00AM, 8:00AM) [8:00AM, ∞) Raw Data 22
  • 23. Group function November1, 2016 w A group function groups rows with exactly the same timestamps time city temperature 1:00 PM New York 70°F 1:00 PM Brussels 60°F 2:00 PM New York 71°F 2:00 PM Brussels 61°F 3:00 PM New York 72°F 3:00 PM Brussels 62°F 4:00 PM New York 73°F 4:00 PM Brussels 63°F group 1 group 2 group 3 group 4 23
  • 24. Group function November1, 2016 w A group function groups rows with nearby timestamps time city temperature 1:00 PM New York 70°F 1:00 PM Brussels 60°F 2:00 PM New York 71°F 2:00 PM Brussels 61°F 3:00 PM New York 72°F 3:00 PM Brussels 62°F 4:00 PM New York 73°F 4:00 PM Brussels 63°F group 1 group 2 24
  • 25. Group in Spark November1, 2016 w Groups rows with exactly the same timestamps RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 25
  • 26. w Data is shuffled and materialized on the workers Group in Spark November1, 2016 RDD groupBy RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM sortBy RDD w Back to Temporal Order w Temporal order is not preserved 26
  • 27. Group in TimeSeriesRDD November1, 2016 w Data is grouped per partition locally as streams TimeSeriesRDD 2:00PM 1:00PM 3:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM 27
  • 28. • Running time of count after group • 16 executors (10G memory and 4 cores per executor) • Data read from HDFS Benchmark for group + count November1, 2016 0s 20s 40s 60s 80s 100s 20M 40M 60M 80M 100M TimeseriesRDD DataFrame RDD 50 - 100X5 - 10X 28
  • 29. Temporaljoin November1, 2016 w A temporal join function is defined by a matching criteria over time w A typical matching criteria has two parameters w direction– look-backward or look-forward w window – how much to look-backward or look-forward look-backward temporal join window 29
  • 30. Temporaljoin November1, 2016 w A temporal join function is defined by a matching criteria over time w A typical matching criteria has two parameters w direction– look-backward or look-forward w window – how much to look-backward or look-forward look-backward temporal join window 30
  • 31. Temporaljoin November1, 2016 w Temporal join with criteria look-back and window of 1 hour 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM time series 1 31 time series 2
  • 32. Temporaljoin November1, 2016 w Temporal join with criteria look-back and window of 1 hour w How do we do temporal join in TimeSeriesRDD? TimeSeriesRDD TimeSeriesRDD 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM 32
  • 33. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w partition time space into disjoint intervals TimeSeriesRDD TimeSeriesRDDjoined 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM 33
  • 34. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Build dependencygraph for the joinedTimeSeriesRDD TimeSeriesRDD TimeSeriesRDDjoined 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM [1:00AM, 4:00AM) [4:00AM, 6:00AM) [1:00AM, 4:00AM) [4:00AM, 6:00AM) 34
  • 35. 1:00AM1:00AM Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams per partition 1:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 2:00AM 4:00AM 5:00AM 3:00AM 5:00AM 35
  • 36. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM1:00AM 2:00AM 36
  • 37. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams 2:00AM 1:00AM 5:00AM 1:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM 1:00AM 2:00AM 4:00AM 3:00AM 4:00AM 3:00AM 37
  • 38. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM 1:00AM 2:00AM 4:00AM 3:00AM 5:00AM 5:00AM 38
  • 39. Benchmark for temporaljoin + count November1, 2016 0s 10s 20s 30s 40s 50s 60s 70s 80s 90s 100s 20M 40M 60M 80M 100M TimeseriesRDD DataFrame RDD 20 - 50X5 - 10X 39 • Running time of count after temporal join • 16 executors (10G memory and 4 cores per executor) • Data read from HDFS
  • 40. Functions over TimeSeriesRDD November1, 2016 w Grouping functions w Temporal joins such as look-forward, look-backward etc. w Summarizers such as average, variance, z-score etc. over grouping functions 40
  • 41. Open Source November1, 2016 w True! w https://github.com/twosigma/flint 41
  • 42. What’snext? November1, 2016 w TimeSeriesDataframe / TimeSeriesDataset w Speed up w RicherAPIs w Python bindings w Additional summarizers 42
  • 43. Key contributors November1, 2016 w ChristopherAycock w Yuri Bogomolov w Jonathan Coveney w Li Jin w David Medina w Julia Meinwald w David Palaitis w Larisa Sawyer w Leif Walsh w Wenbo Zhao 43
  • 44. Flint: Time Series For Spark November1, 2016 A library to solve for general time series analysis operations at massive scale Anne Hathaway has nothing to do with Berkshire Hathaway Check it out in open source, and contribute https://github.com/twosigma/flint 44