This talk describes an experimental approach to time series modeling using 1D convolution filter layers in a neural network architecture. This approach was developed at System1 for forecasting marketplace value of online advertising categories.
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jeff Roach
1. 1D Convolutional Neural Networks for Time Series Modeling
PyData LA 2018
Nathan Janos and Jeff Roach
2. Who are we?
● Nathan Janos
○ Chief Data Officer @ System1 (4.5 years)
○ 15 years in ad-tech optimization
● Jeff Roach
○ Data Scientist @ System1 (2+ years)
○ Background in epidemiology
4. From 2D to 1D
???
Graphics attributed to Mathworks and Wikipedia Creative Commons
5. Motivation
● Deep learning and the new wave of neural networks are increasingly popular
● The focus is mostly on the visual space, for classification
● We are interested in time series forecasting
● Couldn’t find as much modern work in this area
○ Sequence classification in language, text, audio
○ LSTM (long short-term memory), GRU (gated recurrent unit), RNN (recurrent NN)
Graphic attributed to Wikipedia Creative Commons
6. Discrete Time Signal Processing
● What about combining DSP with NNs?
○ Used in domains such as speech processing, sonar, radar, biomedical engineering, seismology
● Why not treat our hourly data as samples like one of these signals?
● 2D convolution works well for image classification
● Can 1D convolution work for time series forecasting?
● Had the idea to apply classic discrete time convolution techniques to 1D data...
7. Convolution
● Inspired by the convolution used in visual NNs (cross correlation)
● But instead use the definition of convolution used in signal processing
● It’s the integral of the product of two functions after one is reversed and shifted
Graphics attributed to Wikipedia Creative Commons
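A minimal sketch of the distinction, assuming NumPy (the signal and filter values are made up): np.convolve reverses the filter before sliding it across the signal, matching the signal-processing definition, while np.correlate slides it as-is, which is what the "convolution" layers in visual NNs actually compute.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # example signal (made up)
h = np.array([1.0, 0.0, -1.0])       # example filter (made up)

conv = np.convolve(x, h, mode="valid")    # filter reversed, then shifted: [ 2.  2.]
xcorr = np.correlate(x, h, mode="valid")  # filter not reversed:           [-2. -2.]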
9. Parameterized
[Diagram: input window y(t) → filter layer 1 (relu, pool) → filter layer 2 (relu, pool) → regression layer → forecast y(t+1)]
● T is the length of the time series (T = 1512)
● t is the size of the intermediate time series (t = 24 → 12 → 6 after successive pooling)
● W is the size of the window (W = 24)
● F is the number of filters (F = 6)
● D is the depth of filters (D = 2)
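A rough PyTorch sketch of this shape, not the original Matlab prototype (the kernel size and other wiring details are assumptions; the layer sizes follow the example values above): each filter layer is a 1D convolution followed by relu and a pooling step that halves the intermediate series, and a regression layer maps the pooled filter outputs to the forecast y(t+1).

import torch
import torch.nn as nn

W, F, D = 24, 6, 2                    # window, number of filters, depth of filter layers

class FilterNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(D):
            layers += [nn.Conv1d(in_ch, F, kernel_size=3, padding=1),  # filter layer
                       nn.ReLU(),
                       nn.MaxPool1d(2)]                                # t: 24 -> 12 -> 6
            in_ch = F
        self.filters = nn.Sequential(*layers)
        self.regression = nn.Linear(F * (W // 2 ** D), 1)              # 6 filters * 6 values -> y(t+1)

    def forward(self, y):                         # y: (batch, 1, W) window of the series
        return self.regression(self.filters(y).flatten(1))

model = FilterNetSketch()
print(model(torch.randn(8, 1, W)).shape)          # torch.Size([8, 1])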
10. Parameter Space Example
● One layer of filters has n(n-1)/2 = 6(6-1)/2 = 15 weights
● Two layers = 15 * 2 = 30 filter weight parameters
● Two layers deep and a window of 24 hours = each bottom filter output has 24/2/2 = 6 values
● 6 filters * 6 output values = 36 regression weight parameters
● 66 total parameters
● A network with 24 filters, 3 deep, running on a week of hourly data = 1404 parameters
11. Learning About Learning Rate
[Plot: learning rate vs. iterations over time]
This is a type of stochastic gradient descent with restarts (SGDR)
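A tiny sketch of the restart behavior, following the schedule described on the prototype training detail slide later (decay factor 0.95; the threshold value is assumed): the learning rate decays each iteration and jumps back up, slightly lower than last time, once it falls below the threshold.

def restart_schedule(lr0=1.0, decay=0.95, threshold=0.05, iters=200):
    """Yield a decaying learning rate that restarts when it gets too small."""
    lr, restart_lr = lr0, lr0
    for _ in range(iters):
        yield lr
        lr *= decay
        if lr < threshold:            # restart at the last initial learning rate * 0.95
            restart_lr *= decay
            lr = restart_lr

rates = list(restart_schedule())      # sawtooth shape: repeated decay-and-restart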
12. Data and Testing
● Revenue per click data on mobile devices in automotive category
● Hourly data from 4/1/2018 to 6/2/2018 (63 days, 9 weeks of data)
● Train on first 8 weeks of hourly data
● Test on last week of data
● Compared MASE (mean absolute scaled error) of model to MASE of “simple” 1-hour lagged data model
○ MASE < 1.0 means we are beating the simple model
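A minimal sketch of that comparison, assuming NumPy: the model's mean absolute error on the test week is scaled by the mean absolute error of the 1-hour-lag forecast, so values below 1.0 beat the simple model.

import numpy as np

def mase(actual, forecast, lag=1):
    """Mean absolute scaled error vs. a naive lagged forecast (1 hour here)."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    model_mae = np.mean(np.abs(actual[lag:] - forecast[lag:]))
    naive_mae = np.mean(np.abs(actual[lag:] - actual[:-lag]))   # 1-hour lagged model
    return model_mae / naive_mae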
15. Best Network Results
● Using networks with about 6 filters, 3 deep, window of 24 hours of data
● Training took ~20 minutes
● Training on 8 weeks of data
● Best MASE compared to simple 1-hour lag model was ~0.86
16. Prototype Conclusion
● Probably should use a GPU framework to make it faster
● Lots of time spent on hyperparameter tuning
● Should consider other network architectures
● Build out in an established NN framework to leverage backpropagation
18. Goals
● Port to Python
○ PyTorch
○ fastai
● Find architecture improvements
● Beat current best production model (TBATS)
○ Linear time series model that captures complex seasonal trends
○ Exponential Smoothing State Space Model with Box-Cox Transformation, ARMA Errors, Trend and Seasonal Components
○ TBATS R package to fit the model as described in De Livera, Hyndman & Snyder (2011)
20. Architecture: WaveNet
[Diagram: input window y(t) → filter layer 1 (relu, pool; t = 24 in, 12 out) → filter layer 2 (relu, pool; t = 12 in, 6 out) → regression layer (t = 6 in) → y(t+1)]
● T is the length of the time series (T = 1512)
● t is the size of the intermediate time series
● W is the size of the window (W = 24)
● F is the number of filters (F = 1)
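For reference, a minimal sketch of the WaveNet idea being borrowed, not the exact network in the diagram (kernel size, channel count, and the causal trimming are assumptions): a stack of dilated causal 1D convolutions whose receptive field doubles at each layer, followed by a small regression head.

import torch
import torch.nn as nn

class WaveNetSketch(nn.Module):
    def __init__(self, layers=2, channels=1, kernel_size=2):
        super().__init__()
        self.convs = nn.ModuleList()
        for i in range(layers):
            dilation = 2 ** i                               # 1, 2, 4, ...
            self.convs.append(nn.Conv1d(channels, channels, kernel_size,
                                        dilation=dilation,
                                        padding=(kernel_size - 1) * dilation))
        self.head = nn.Conv1d(channels, 1, kernel_size=1)   # regression head

    def forward(self, y):                                   # y: (batch, 1, W)
        for conv in self.convs:
            y = torch.relu(conv(y)[..., : y.shape[-1]])     # trim padding to stay causal
        return self.head(y)[..., -1]                        # forecast for t+1

print(WaveNetSketch()(torch.randn(8, 1, 24)).shape)         # torch.Size([8, 1])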
21. Architecture: WaveNet Expansion
[Diagram: same structure as the WaveNet, with F = 6 parallel filters per layer: filter layer 1 (relu, pool; t = 24 in, 12 out), filter layer 2 (relu, pool; t = 12 in, 6 out), regression layer (t = 6 in) → y(t+1)]
● T is the length of the time series (T = 1512)
● t is the size of the intermediate time series
● W is the size of the window (W = 24)
● F is the number of filters (F = 6)
22. Architecture: WaveNet Expansion
[Diagram: a single WaveNet unit (1x): filter layer 1 (relu, pool; t = 24 in, 12 out) → filter layer 2 (relu, pool; t = 12 in, 6 out) → regression layer (t = 6 in) → y(t+1)]
● T is the length of the time series (T = 1512)
● t is the size of the intermediate time series
● W is the size of the window (W = 24)
● F is the number of filters (F = 1)
23. Architecture: WaveNet Expansion
[Diagram: the WaveNet unit replicated across the network (annotated 24x, 12x, and 6x at the successive stages): filter layer 1 (relu, pool; t = 24 in, 12 out), filter layer 2 (relu, pool; t = 12 in, 6 out), regression layer (t = 6 in) → y(t+1)]
● T is the length of the time series (T = 1512)
● t is the size of the intermediate time series
● W is the size of the window (W = 24)
● F is the number of filters (F = 6)
24. Architecture: Ensemble
[Diagram: WaveNet branch (filter layer 1 → relu, pool, t = 24 in, 12 out; filter layer 2 → relu, pool, t = 12 in, 6 out) combined with fastai's MixedInputModel branch, both feeding the regression layer → y(t+1); weight multiplicities annotated 108x and 500x]
● T is the length of the time series (T = 1512)
● t is the size of the intermediate time series
● W is the size of the window (W = 24)
● MixedInputModel branch: Embedding (hour as category), BatchNorm (continuous variables), Linear, relu, dropout
● Fully Connected: 2x layers, 1000 and 500 neurons, 0.001 and 0.01 dropout (0.04 embedding dropout)
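A rough sketch of the ensemble idea (class name, embedding size, and exact wiring are assumptions; fastai's real MixedInputModel has more options): a convolutional branch over the raw window is concatenated with an embedding-plus-batchnorm branch over the categorical and continuous features, and fully connected layers produce y(t+1).

import torch
import torch.nn as nn

class EnsembleSketch(nn.Module):
    def __init__(self, window=24, n_hours=24, emb_dim=8, n_cont=1):
        super().__init__()
        # convolutional (WaveNet-style) branch over the raw window
        self.conv = nn.Sequential(
            nn.Conv1d(1, 6, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(6, 6, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2))
        # tabular branch: hour of day as a category, plus continuous variables
        self.hour_emb = nn.Embedding(n_hours, emb_dim)
        self.emb_drop = nn.Dropout(0.04)
        self.bn_cont = nn.BatchNorm1d(n_cont)
        # fully connected: 2x layers, 1000 and 500 neurons, 0.001 and 0.01 dropout
        self.fc = nn.Sequential(
            nn.Linear(6 * (window // 4) + emb_dim + n_cont, 1000), nn.ReLU(), nn.Dropout(0.001),
            nn.Linear(1000, 500), nn.ReLU(), nn.Dropout(0.01),
            nn.Linear(500, 1))

    def forward(self, window, hour, cont):
        c = self.conv(window).flatten(1)              # conv branch features
        e = self.emb_drop(self.hour_emb(hour))        # hour-of-day embedding
        x = torch.cat([c, e, self.bn_cont(cont)], dim=1)
        return self.fc(x)                             # y(t+1)

m = EnsembleSketch()
print(m(torch.randn(8, 1, 24), torch.randint(0, 24, (8,)), torch.randn(8, 1)).shape)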
26. Model Comparison t-1
Model (Language, Processing Unit) MASE Time
TBATS (R, CPU) 0.90 30s
WaveNet Expansion (Matlab, CPU) 0.86 1200s
WaveNet Expansion (PyTorch, GPU) 0.86 16s
FilterNet (PyTorch, GPU) 0.82 27s
27. Model Comparison t-1 Different Category
● Previously trained Automotive category
● Forecasted on Finance category
Model (trained on Automotive data) Automotive MASE Finance MASE
TBATS (R, CPU) 0.90 0.96
FilterNet (PyTorch, GPU) 0.82 0.90
28. Model Comparison t-1 Missing Data
Replaced every nth step with the n-1 past data point (values are MASE)
n TBATS FilterNet % diff
2 (every other) 1.23 1.29 +5%
6 1.03 0.98 -5%
12 0.98 0.91 -7%
24 0.96 0.87 -9%
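A tiny sketch of how such a degraded series could be produced, under the assumed interpretation that every nth point is overwritten with the most recent preceding value:

import numpy as np

def degrade(series, n):
    """Simulate missing data: replace every nth point with the previous observation."""
    s = np.asarray(series, dtype=float).copy()
    for i in range(n, len(s), n):
        s[i] = s[i - 1]
    return s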
29. Jagged Dataset
● Jagged
○ Categories use different features
○ Long/short time periods
○ Few/many missing data points
● ~1300 advertising categories
● Hourly data
● Training = 37 days or 888 hours
● Test = 7 days or 168 hours
30. Model Comparison t-1 Jagged
MASE Time
TBATS, Single Category 0.84 17s
FilterNet, Single Category 0.83 4s
FilterNet, Full training set (~1300 Categories) 0.78 60s
FilterNet, Full training set, test Category removed from training set 0.78 60s
31. Model Comparison t-1 Jagged
# of training set days TBATS FilterNet % diff (values are MASE)
14 1.30 0.82 -37%
21 1.05 0.86 -18%
28 0.82 0.83 0%
37 0.84 0.83 0%
32. Conclusion
● FilterNet perks
○ Performance (7%)
○ Training speed (10%-300%)
○ Context
○ Less sensitive to data quantity
■ But, more sensitive to data quality
● Convolution and context models are complementary
34. Prototype Training Detail
● Initialize
○ Seed filter weights with random values from [-1, -0.5, 0, 0.5, 1]
○ Seed regression weights with random values in range -0.2 to 0.2
○ Set learning rate to 1.0
● Iterate 100s to 100,000s of times
○ Forward propagate current network and store MSE
○ Randomly select a subset of weights (usually 10%) to move a small amount one at a time
■ Filter weights are moved in random increments of 0.1 or 0.01
■ Regression weights are moved by a different small amount
■ Store resulting MSE from moving each weight independently
○ Update parameters
■ If the MSE for a filter weight delta is lower, update that weight by its random increment
■ If the MSE for a regression weight delta is lower, update it by the gradient
○ Update learning rate: multiply by 0.95
○ If learning rate < threshold, set it back up to the last initial learning rate * 0.95
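A compressed Python sketch of that loop (the original prototype was Matlab; forward_mse, the step sizes, and the 10% subset handling are simplifications): perturb a random subset of weights one at a time, keep only the moves that lower the MSE, then decay the learning rate and restart it when it gets too small.

import random

def train(weights, forward_mse, iters=1000, lr=1.0, decay=0.95, threshold=0.05):
    """Random-coordinate search with learning-rate restarts (prototype-style sketch)."""
    restart_lr = lr
    for _ in range(iters):
        base_mse = forward_mse(weights)                          # forward propagate, store MSE
        subset = random.sample(range(len(weights)), max(1, len(weights) // 10))
        for i in subset:                                         # move one weight at a time
            step = lr * random.choice([-0.1, -0.01, 0.01, 0.1])
            trial = list(weights)
            trial[i] += step
            if forward_mse(trial) < base_mse:                    # keep moves that lower MSE
                weights[i] += step
        lr *= decay                                              # multiply learning rate by 0.95
        if lr < threshold:                                       # ...and restart when too small
            restart_lr *= decay
            lr = restart_lr
    return weights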
38. Model Comparison t-1 Facebook
MASE Time
TBATS, Single Category 0.84 17s
FilterNet, Single Category, w/ imputed, Batch Size = 1 1.41 150s
FilterNet, Single Category, Batch Size = 1 0.84 150s
FilterNet, Single Category, Batch Size = 512 0.83 4s
FilterNet, Full training set (~1300 Categories) 0.78 60s
FilterNet, Full training set, w/o CNN 0.79 60s
FilterNet, Full training set, w/o Mixed Input 0.80 60s
FilterNet, Full training set, test Category removed from training set 0.78 60s
39. Production
● Inspired by how fastai loads pretrained models
● Save trained model to dictionary
○ State, structure, tuning parameters, etc.
● Model framework in common location
● Rebuild model using framework and dictionary
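A minimal PyTorch sketch of that pattern (the dictionary keys, the build_model helper, and the structure fields are assumptions): everything needed to rebuild the network goes into one saved dictionary, and shared framework code reconstructs it elsewhere.

import torch
import torch.nn as nn

def build_model(filters=6, depth=2, window=24):
    """Shared framework code (hypothetical): rebuild the network from its structure."""
    layers, in_ch = [], 1
    for _ in range(depth):
        layers += [nn.Conv1d(in_ch, filters, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2)]
        in_ch = filters
    layers += [nn.Flatten(), nn.Linear(filters * (window // 2 ** depth), 1)]
    return nn.Sequential(*layers)

# Save: state, structure, and tuning parameters in one dictionary
model = build_model()
torch.save({"state_dict": model.state_dict(),
            "structure": {"filters": 6, "depth": 2, "window": 24},
            "tuning": {"lr": 1e-3}}, "filternet.pt")

# Rebuild: framework code in a common location re-creates the model from the dictionary
saved = torch.load("filternet.pt")
model = build_model(**saved["structure"])
model.load_state_dict(saved["state_dict"])
model.eval()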