Machine Learning for Time Series, Strata London 2018
MACHINE LEARNING FOR TIME SERIES
WHAT WORKS AND WHAT DOESN'T
DR. MIKIO L. BRAUN, AI ARCHITECT AT ZALANDO
@mikiobraun
STRATA DATA LONDON, MARCH 23, 2018
TIME SERIES ANALYSIS
MIKIO BRAUN, MACHINE LEARNING FOR TIME SERIES: WHAT WORKS AND WHAT DOESN'T, STRATA DATA 2018 LONDON
MACHINE LEARNING FOR TIME SERIES
CLASSICAL METHODS
Strong assumptions of stationarity; predictions are formed as linear combinations of past values and i.i.d. noise terms.
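To make the "linear combination of past values plus i.i.d. noise" idea concrete, here is a minimal AR(2) sketch (toy coefficients chosen so the process is stationary; not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(2): x_t = a1*x_{t-1} + a2*x_{t-2} + eps_t, with i.i.d. Gaussian noise.
# These coefficients keep the roots inside the unit circle, i.e. stationary.
a1, a2 = 0.6, 0.3
x = np.zeros(500)
for t in range(2, len(x)):
    x[t] = a1 * x[t - 1] + a2 * x[t - 2] + rng.normal(scale=0.1)

# The one-step-ahead prediction is literally a linear combination of the past.
pred = a1 * x[-1] + a2 * x[-2]
```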
ESTIMATING WITH THE BOX-JENKINS PROGRAM
WHAT WORKS
• Solid theoretical background.
• Very explicit modeling.
• A lot of control, as it is a manual process.
• Bayesian versions available to provide uncertainty estimates.
CHALLENGES: SEASONALITY & NON-STATIONARITY
In reality, data is seldom stationary; it shows trends, seasonality, cycles, and so on.
In the classical approach, these components are removed manually before modeling.
DIFFERENCING AND SCALING
• Running means.
• De-trending by differencing.
• Variance stabilization by log, square root, or Box-Cox transformation.
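A short sketch of these transformations on a toy series with an exponential trend (series and variable names are mine):

```python
import numpy as np

# Toy series: exponential trend, so the variance grows with the level.
t = np.arange(1, 101)
y = np.exp(0.05 * t) * (1 + 0.1 * np.sin(t))

log_y = np.log(y)            # variance stabilization
diff_y = np.diff(log_y)      # de-trending by differencing; now roughly stationary
smooth = np.convolve(y, np.ones(5) / 5, mode="valid")  # 5-point running mean
```

After the log-then-difference step, the series fluctuates around the constant drift 0.05 instead of growing exponentially.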
CLASSICAL METHODS: WHAT DOESN'T WORK SO WELL
• What if the assumptions do not hold?
• Stationarity is a rather strong requirement.
• Linear autoregressive models are somewhat "boring."
MORE GENERAL MACHINE LEARNING APPROACH
By explicitly collecting the past values of a point, we can construct a supervised learning setting.
It is still different from standard ML, as the points are highly correlated.
Any number of methods can be used (linear models, SVMs, neural networks, …).
Easily extends to other areas as well:
• Multiple input variables.
• Multiple output variables.
• Additional variables to feed into the model.
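The windowing construction behind this setting can be sketched in a few lines (the helper name `make_supervised` is mine, not a library function):

```python
import numpy as np

def make_supervised(series, n_lags):
    """Turn a 1-D series into a supervised (X, y) data set:
    each row of X holds n_lags past values, y holds the next value."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

series = np.arange(10.0)
X, y = make_supervised(series, n_lags=3)
# X[0] is [0, 1, 2] and y[0] is 3: predict each point from its three predecessors
```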
EVALUATION AND CROSS-VALIDATION WITH TIME SERIES DATA
In ML, one often uses cross-validation to estimate performance on future data.
Since time series data is highly correlated, one cannot sample test data at random but should sample block-wise.
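A minimal sketch of block-wise splitting, where each test block lies strictly after its training data (similar in spirit to scikit-learn's `TimeSeriesSplit`; the function here is my own):

```python
import numpy as np

def blocked_splits(n, n_folds):
    """Yield (train_idx, test_idx) pairs: the test set is always a
    contiguous block that comes strictly after the training block."""
    fold = n // (n_folds + 1)
    for k in range(1, n_folds + 1):
        yield np.arange(0, k * fold), np.arange(k * fold, (k + 1) * fold)

for train, test in blocked_splits(100, n_folds=4):
    assert train[-1] < test[0]   # no information from the future leaks into training
```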
CHALLENGES: EXTENDED PREDICTION
Prediction can be done one point at a time, using test data as past values as they become available.
Alternatively, one can feed the predictions themselves back in as past values, which leads to much less stable predictions.
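The two strategies differ only in what is fed back in as "past values." A sketch of the recursive variant, which reuses its own predictions (toy AR weights; the helper is my own):

```python
import numpy as np

def recursive_forecast(history, coeffs, horizon):
    """Multi-step forecast that feeds each prediction back in as a past value.
    coeffs[0] weights the most recent value. Errors compound over the horizon,
    which is why this variant tends to be less stable."""
    buf = list(history)
    preds = []
    for _ in range(horizon):
        nxt = float(np.dot(coeffs, buf[-len(coeffs):][::-1]))
        preds.append(nxt)
        buf.append(nxt)
    return preds

preds = recursive_forecast([1.0, 1.0, 1.0], coeffs=[0.5, 0.5], horizon=3)
# with weights summing to 1 and a constant history, the forecast stays at 1.0
```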
CLICK DATA & BEYOND SIMPLE TIME SERIES MODELS
Another interesting data source is event data (click data, customer actions, …).
It shows very similar properties: strong dependence, predictions that depend on the past, etc.
Often, the data needs to be summarized and transformed to get good predictions.
FEATURE ENGINEERING FOR TIME SERIES
• Aggregate histograms over time scales.
• Transform into Fourier space.
• Apply band-pass / low-pass / high-pass filters.
• Intelligent filtering: independent component analysis, canonical correlation analysis.
• Downside: quite costly to retrain on each iteration.
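A small example of the Fourier-space route: transform, zero out high frequencies (a crude low-pass filter), transform back. The signal and the 10 Hz cutoff are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t)              # slow 3 Hz component
noisy = signal + 0.5 * rng.normal(size=t.size)  # plus broadband noise

spectrum = np.fft.rfft(noisy)
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
spectrum[freqs > 10] = 0.0                      # low-pass: drop everything above 10 Hz
filtered = np.fft.irfft(spectrum, n=t.size)
```

Since the noise is spread across all frequencies but the signal lives below the cutoff, the filtered series is much closer to the clean signal than the noisy one.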
DEEP LEARNING: LONG SHORT-TERM MEMORY
Recurrent neural networks base their predictions on the past data points and a hidden state.
The hidden state can aggregate features automatically.
The LSTM is a particularly flexible variant with (learnable) gates and transformations that control how the hidden state is updated.
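A single LSTM step can be sketched in plain numpy to show the gate structure (an untrained cell with random weights; real implementations keep separate weight matrices per gate and learn them by backpropagation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM update. W maps the concatenated [x; h] to the four stacked
    gate pre-activations: forget, input, output, and candidate update."""
    d = h.size
    z = W @ np.concatenate([x, h]) + b
    f = sigmoid(z[:d])              # forget gate: what to keep of the old cell state
    i = sigmoid(z[d:2 * d])         # input gate: how much of the candidate to write
    o = sigmoid(z[2 * d:3 * d])     # output gate: what to expose as hidden state
    g = np.tanh(z[3 * d:])          # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
d_in, d_hid = 3, 4
W = 0.1 * rng.normal(size=(4 * d_hid, d_in + d_hid))
b = np.zeros(4 * d_hid)
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):    # roll the cell over a short sequence
    h, c = lstm_step(x, h, c, W, b)
```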
APPLICATION: ANALYZING USER ACTIONS @ ZALANDO
• The goal is to predict buy probability based on user histories.
• Before: many handcrafted features + logistic regression.
• Drawback: all the features had to be retuned again and again.
• With DL: an embedding of user histories in an RNN, plus user-specific features.
• Already performs quite well.
Lang, Rettenmeier: "Understanding Customer Behavior with Recurrent Neural Networks", MLRec 2017
DEEP LEARNING FOR CUSTOMER ACTIONS @ ZALANDO
APPLICATION: DEMAND PREDICTION FOR RARE EVENTS @ UBER
Uber is interested in having reliable models also during extreme events like Thanksgiving or New Year's Day, which have little coverage in the usual data.
https://eng.uber.com/neural-networks/
DEMAND PREDICTION AT UBER: THE DATA
The available data includes a number of exogenous features such as weather and app views.
ARCHITECTURE: TIME SERIES AUTOENCODERS
A stacked LSTM autoencoder captures the general dynamics and produces informative features.
These are then concatenated with the actual input and fed into a second LSTM forecast network.
APPLICATION: DEMAND PREDICTION @ AMAZON, MANY TIME SERIES & PROBABILITIES
https://arxiv.org/abs/1704.04110
Challenges of predicting article demand over thousands of articles:
• Numbers on many scales.
• The amount of available data varies per article.
• We want probability distributions as predictions.
• Predictions ahead in time.
DEEP AR @ AMAZON
• Use an LSTM to learn interactions in the time series.
• The LSTM also propagates knowledge about the dynamics to series with few data points.
• The LSTM predicts the parameters of a distribution at each point.
• Pre- & post-scale the time series.
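The pre-/post-scaling step can be illustrated with a toy sketch (the numbers and the simple `1 + mean` scale rule are illustrative; DeepAR derives a per-series scale factor from the series average):

```python
import numpy as np

# Two article-demand series on very different scales.
series = [np.array([1.0, 2.0, 3.0]), np.array([1000.0, 1100.0, 900.0])]

scales = [1.0 + s.mean() for s in series]           # one scale factor per series
scaled = [s / nu for s, nu in zip(series, scales)]  # network sees comparable magnitudes

# The network outputs distribution parameters per step, e.g. a Gaussian (mu, sigma)
# on the scaled series; post-scaling recovers the original magnitude.
mu_scaled, sigma_scaled = 0.9, 0.1                  # hypothetical network output
mu = mu_scaled * scales[1]
sigma = sigma_scaled * scales[1]
```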
DEEP AR: TRAINING
• Procedure:
1. Predict the distribution parameters.
2. Compute the likelihood.
3. Sample the next point.
• Train by maximizing the likelihood.
• Train directly on the requested prediction horizon.
• Sample points to go further into the future.
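The predict-parameters / sample / feed-back loop at prediction time can be sketched as ancestral sampling (the `predict_params` rule below is a toy stand-in for the trained LSTM):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_params(history):
    """Stand-in for the trained LSTM: map the history to the parameters
    of the next point's distribution (here a made-up Gaussian rule)."""
    mu = 0.8 * history[-1]
    sigma = 0.1 + 0.05 * abs(history[-1])
    return mu, sigma

def sample_paths(history, horizon, n_paths):
    """Predict parameters, sample the next point, feed it back in, repeat.
    Many sampled paths form an empirical predictive distribution."""
    paths = np.empty((n_paths, horizon))
    for p in range(n_paths):
        buf = list(history)
        for t in range(horizon):
            mu, sigma = predict_params(buf)
            buf.append(rng.normal(mu, sigma))
            paths[p, t] = buf[-1]
    return paths

paths = sample_paths([1.0, 1.2], horizon=5, n_paths=200)
bands = np.quantile(paths, [0.1, 0.5, 0.9], axis=0)   # uncertainty bands per step
```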
SUMMARY
Classical Time Series Models
• Use to get started.
• Use if explicit modeling is good.

General Machine Learning
• If you are unsure about modeling assumptions.
• But: use proper validation to ensure good performance.

Feature Engineering
• For more complex data.
• If you have a priori knowledge about the domain.

Deep Learning
• If you have a lot of data.
• If you frequently want to iterate & experiment.
• If explicit modeling & feature engineering is too costly.