Dmitry Larko, H2O.ai - Time Series in H2O Driverless AI - #H2OWorld 2019 NYC

This session was recorded in NYC on October 22nd, 2019 and can be viewed here: https://www.youtube.com/watch?v=eF4Oa0ZzXdQ&list=PLNtMya54qvOE3AvWRCNF2tybxNobUbAYp&index=6&t=3s

Time Series in H2O Driverless AI
Time series is a unique field in predictive modelling where standard feature engineering techniques and models are employed to get the most accurate results. In this session we will examine some of the most important features of Driverless AI’s newest recipe regarding Time Series. It will cover validation strategies, feature engineering, feature selection and modelling. The capabilities will be showcased through several cases.

Bio: Dmitry has more than 10 years of experience in IT. He started in data warehousing and BI and now works in big data and data science. He has extensive experience developing predictive analytics software for different domains and tasks.

He is also a Kaggle Grandmaster who loves to use his machine learning and data science skills on Kaggle competitions.

Published in: Technology

  1. Time series in Driverless AI. Dmitry Larko, Sr. Data Scientist, H2O.ai
  2. Background
  3. • Some input data
     • A target variable
     • An objective (or a success metric) like RMSE or MAE
     • Some allocated resources (time and hardware)
     Example data (y is the target, e.g. sales):
     x1    x2    x3    x4    y
     0.14  0.69  0.01  0.71  300
     0.22  0.44  0.45  0.69  100
     0.12  0.35  0.51  0.23  40
     0.22  0.42  0.79  0.60  23
     0.93  0.82  0.72  0.50  1900
     0.32  0.58  0.28  0.22  231
     0.95  0.59  0.68  0.09  700
     0.34  0.58  0.35  0.81  423
     0.05  0.80  0.28  0.86  222
     0.23  0.49  0.63  0.03  190
     0.05  0.34  0.53  0.73  890
     0.74  0.02  0.33  0.56  1000
     Driverless AI Process
     - Data visualization (AutoViz)
     - Feature engineering & selection
     - Automated Modeling
     - Model interpretability (MLI)
     - Scoring pipeline (predictions)
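Slide 3 names RMSE and MAE as example success metrics. As a minimal sketch (the function names `rmse` and `mae` and the predicted values are illustrative assumptions, not part of Driverless AI), both can be computed from paired actual and predicted values:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: large errors are penalized more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: average error size, in the target's own units."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

actual = [300, 100, 40]      # first three y values from the slide's table
predicted = [280, 130, 40]   # made-up predictions, for illustration only
print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same data.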
  4. What is a Time Series Problem?
     [Figure: two "Sales over time" line charts with daily dates on the x-axis, one labeled "Linear relationship" and one labeled "Nonlinear (seasonal) relationship".]
  5. Time Groups
     [Figure: two line charts over the same date axis, "sales per day (all groups)" and "sales by group" (group 1, group 2, group 3).]
     time        groups  sales
     01/01/2018  group1  30
     01/01/2018  group2  100
     01/01/2018  group3  10
     02/01/2018  group1  60.2
     02/01/2018  group2  200.2
     02/01/2018  group3  20.2
     03/01/2018  group1  90.3
     03/01/2018  group2  300.3
     03/01/2018  group3  30.3
     04/01/2018  group1  120.4
     04/01/2018  group2  400.4
     04/01/2018  group3  40.4
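The Time Groups table on slide 5 stacks several series in one dataset, distinguished by a group column. A minimal sketch of why that column matters, assuming rows are already sorted by time (the helper names `split_by_group` and `lag_within_group` are hypothetical): history-based features such as lags should be computed inside each group so that one group's sales never leak into another's.

```python
from collections import defaultdict

rows = [
    ("01/01/2018", "group1", 30), ("01/01/2018", "group2", 100), ("01/01/2018", "group3", 10),
    ("02/01/2018", "group1", 60.2), ("02/01/2018", "group2", 200.2), ("02/01/2018", "group3", 20.2),
    ("03/01/2018", "group1", 90.3), ("03/01/2018", "group2", 300.3), ("03/01/2018", "group3", 30.3),
    ("04/01/2018", "group1", 120.4), ("04/01/2018", "group2", 400.4), ("04/01/2018", "group3", 40.4),
]

def split_by_group(rows):
    """Return {group: [sales, ...]}, keeping the time order of the input rows."""
    series = defaultdict(list)
    for _time, group, sales in rows:
        series[group].append(sales)
    return dict(series)

def lag_within_group(values, lag):
    """Shift one group's series by `lag` steps; unknown history becomes None."""
    return ([None] * lag + list(values))[:len(values)]

by_group = split_by_group(rows)
for group, sales in by_group.items():
    print(group, sales, lag_within_group(sales, 1))
```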
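  6. Modeling Foundation
     [Figure: two 12-period timelines along a time axis. The first is split into train and test with a [Gap] between them; the second is split into train, valid, and test with a [Gap] before valid and before test. Labels: "tvs", "Gap | Forecast Horizon", "invalid lag size", "valid lag size".]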
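The Modeling Foundation figure on slide 6 contrasts an invalid lag size with a valid one. The slide does not spell out the rule, so the sketch below assumes the common one: a lag is usable only if its value is already known for every row in the forecast window, which requires lag >= gap + forecast horizon. The function names are hypothetical.

```python
def min_valid_lag(gap, horizon):
    """Smallest lag whose value is known for the last row of the forecast window.

    Assumption: training data ends at period T, the next `gap` periods are
    unavailable, and the forecast covers the `horizon` periods after the gap.
    """
    return gap + horizon

def valid_lags(candidate_lags, gap, horizon):
    """Keep only the lags that never reach into the gap or the forecast window."""
    threshold = min_valid_lag(gap, horizon)
    return [lag for lag in candidate_lags if lag >= threshold]

# With a 1-period gap and a 3-period horizon, lags 1 to 3 would need data
# that does not exist yet at prediction time.
print(valid_lags(range(1, 7), gap=1, horizon=3))
```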
  7. Feature Engineering
     Date       Day  Month  Year  Weekday  Weeknum  IsHoliday
     1/1/2018   1    1      2018  2        1        1
     2/1/2018   2    1      2018  3        1        0
     3/1/2018   3    1      2018  4        1        0
     4/1/2018   4    1      2018  5        1        0
     5/1/2018   5    1      2018  6        1        0
     6/1/2018   6    1      2018  7        1        0
     7/1/2018   7    1      2018  1        2        0
     8/1/2018   8    1      2018  2        2        0
     9/1/2018   9    1      2018  3        2        0
     10/1/2018  10   1      2018  4        2        0
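The date features on slide 7 can be reproduced with the standard library. This is a sketch under assumptions read off the table rather than stated in the talk: dates are day-first, weekdays are numbered with Sunday = 1, weeks start on Sunday with the week containing 1 January counted as week 1, and the holiday calendar (`HOLIDAYS`) is a hypothetical one-entry set.

```python
from datetime import date

HOLIDAYS = {date(2018, 1, 1)}  # hypothetical holiday calendar, for illustration

def date_features(d, holidays=HOLIDAYS):
    """Derive calendar features from a single date."""
    sunday_based = (d.weekday() + 1) % 7                # Sunday=0 ... Saturday=6
    jan1_offset = (date(d.year, 1, 1).weekday() + 1) % 7
    day_of_year = d.timetuple().tm_yday
    return {
        "Day": d.day,
        "Month": d.month,
        "Year": d.year,
        "Weekday": sunday_based + 1,                    # Sunday=1 ... Saturday=7
        "Weeknum": (day_of_year - 1 + jan1_offset) // 7 + 1,
        "IsHoliday": int(d in holidays),
    }

for day in range(1, 11):
    print(date_features(date(2018, 1, day)))
```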
  8. Feature Engineering (cont.)
     Date       Sales  Lag1  Lag2  Moving Average
     1/1/2018   100    -     -     -
     2/1/2018   150    100   -     100
     3/1/2018   160    150   100   125
     4/1/2018   200    160   150   155
     5/1/2018   210    200   160   180
     6/1/2018   150    210   200   205
     7/1/2018   160    150   210   180
     8/1/2018   120    160   150   155
     9/1/2018   80     120   160   140
     10/1/2018  70     80    120   100
     • Lags on subsets of the specified group columns (e.g. {Store, Department} vs. {Department} vs. {Store})
     • Exponentially Weighted Moving Averages (EWMA) of n-th order differenced lags
     • Aggregation of lags (mean, std, sums, etc.)
     • Interactions of lags (e.g. Lag2 - Lag1)
     • Linear regression on lags (taking slope and/or intercept as new features)
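The lag table on slide 8 can be rebuilt in a few lines. In this sketch the moving average is taken as the mean of whichever of Lag1 and Lag2 are available, which is an inference from the numbers in the table; the helper names are hypothetical.

```python
def lag(values, k):
    """Shift a series by k steps; positions with no history become None."""
    return ([None] * k + list(values))[:len(values)]

def mean_of_available(*columns):
    """Row-wise mean over the lag columns, ignoring missing (None) entries."""
    out = []
    for row in zip(*columns):
        known = [v for v in row if v is not None]
        out.append(sum(known) / len(known) if known else None)
    return out

sales = [100, 150, 160, 200, 210, 150, 160, 120, 80, 70]
lag1 = lag(sales, 1)
lag2 = lag(sales, 2)
moving_average = mean_of_available(lag1, lag2)

# Interaction of lags, as in the slide's "Lag2 - Lag1" example.
lag_interaction = [
    None if a is None or b is None else b - a
    for a, b in zip(lag1, lag2)
]

for row in zip(sales, lag1, lag2, moving_average, lag_interaction):
    print(row)
```

Only lagged values feed the moving average, so the feature for a given date never includes that date's own sales.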
  9. What’s new? Training Holdout Predictions / Backtesting
     • Final pipeline will be refitted on various train/valid splits to generate holdout predictions:
     Training & Validation/Holdout Data
     Split  Periods (time runs left to right)
     1      1 2 3 4 5 6 7 8 9 10 11 12 13 14
     2      1 2 3 4 5 6 7 8 9 10 11 12
     3      1 2 3 4 5 6 7 8 9 10
     4      1 2 3 4 5 6 7 8
     5      1 2 3 4 5 6 x x x
     [Figure legend: training data, validation/holdout data, optional training data]
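The backtesting splits on slide 9 can be generated programmatically. This is a sketch, not the Driverless AI implementation: it assumes each split holds out the last two periods of its window (inferred from each row of the figure being two periods shorter than the one above it, since the figure's colors are lost) and trains on everything earlier. The function name `backtest_splits` is hypothetical.

```python
def backtest_splits(n_periods, n_splits, holdout_size):
    """Return (train_periods, holdout_periods) pairs, latest split first.

    Periods are numbered from 1. Each later split moves the holdout window
    back in time by `holdout_size` periods.
    """
    splits = []
    for k in range(n_splits):
        holdout_end = n_periods - k * holdout_size
        holdout_start = holdout_end - holdout_size  # last training period
        if holdout_start < 1:
            break  # no training data left for this split
        train = list(range(1, holdout_start + 1))
        holdout = list(range(holdout_start + 1, holdout_end + 1))
        splits.append((train, holdout))
    return splits

splits = backtest_splits(n_periods=14, n_splits=5, holdout_size=2)
for number, (train, holdout) in enumerate(splits, start=1):
    print(f"split {number}: train={train} holdout={holdout}")
```

Every holdout window lies strictly after its training window, so each refit is scored only on data it has not seen.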
  10. Test Time Augmentation & Rolling Predictions
      • If the test set is longer than the forecast horizon, predictions beyond the horizon are inferior
      • The lookup tables used to create features (lags, aggregates, etc.) lack the necessary data, so models operate with many missing values
      [Figure legend: missing data, present data]
  11. 1st Solution: Test Time Augmentation (TTA)
      The model stays the same; only the memory of the fitted transformers is updated (TTA). Keep rolling the prediction window over the whole test set to get valid predictions.
      Pros: no model changes, fast
      Cons: model degradation over time
  12. 2nd Solution: Extend Train and Refit
      To get valid predictions for the whole test set, we extend the train set with the latest data and refit the original model to generate precise predictions. Roll the prediction window step by step over the whole test period and keep retraining models.
      Pros: most precise
      Cons: time consuming
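The two rolling strategies on slides 11 and 12 differ only in whether the model is re-estimated as the window moves. The sketch below uses a deliberately tiny stand-in model, y[t] ≈ a * y[t - horizon], fitted by least squares; it is not how Driverless AI models time series, and the names `fit_ratio` and `rolling_predict` are hypothetical. With `refit=False` only the lag history is updated (TTA); with `refit=True` the coefficient is also re-estimated on the extended training data.

```python
def fit_ratio(history, lag):
    """Least-squares estimate of a in y[t] = a * y[t - lag] (toy model)."""
    num = sum(history[t] * history[t - lag] for t in range(lag, len(history)))
    den = sum(history[t - lag] ** 2 for t in range(lag, len(history)))
    return num / den

def rolling_predict(train, test, horizon, refit):
    """Roll a `horizon`-wide prediction window over the whole test set.

    Assumes len(train) > horizon so the model can be fitted at all.
    """
    history = list(train)
    a = fit_ratio(history, horizon)
    predictions = []
    for start in range(0, len(test), horizon):
        window = test[start:start + horizon]
        if refit and start > 0:
            a = fit_ratio(history, horizon)  # 2nd solution: retrain on extended data
        base = len(history) - horizon
        for i in range(len(window)):
            # The lagged value sits `horizon` steps back, so it is always known.
            predictions.append(a * history[base + i])
        history.extend(window)  # actuals become available: update the lag memory
    return predictions

train = [1, 1, 1, 1]
test = [2, 2, 4, 4]
print("TTA:  ", rolling_predict(train, test, horizon=2, refit=False))
print("Refit:", rolling_predict(train, test, horizon=2, refit=True))
```

In this toy run both strategies see the same updated lags in the second window, but only the refitted model adjusts its coefficient to the change in the data.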
  13. What’s new? Bring Your Own Recipe (BYOR)
      • Custom time series transformers or models to be used within Driverless AI
      • Interface to bring in domain-specific (or just additional) feature transformers
      • Interface to bring in popular algorithms like ARIMA, LSTM, Prophet, etc.
        • Either as custom models or as feature transformations (i.e. using their predictions as input features for DAI)
      • Example implementations available
        • FBProphetModel
        • ExponentialSmoothingModel
        • AutoArimaTransformer
        • ProphetTransformer
        • …
      https://github.com/h2oai/driverlessai-recipes/tree/master/transformers/timeseries
      https://github.com/h2oai/driverlessai-recipes/tree/master/models/timeseries
  14. Will be released soon: Unknown Features at Prediction Time
      • Some features might not be known at the time a prediction is made
      • Driverless AI will make sure that only historical information for these features is used
  15. Will be released soon: Time Aware Target Transformations
      • Detrending
        • Fast linear (least squares)
        • Robust linear (RANSAC regression)
        • Logistic growth
      • Centering
        • y'(t) = y(t) - c
      • Differencing
        • y'(t) = y(t) - y(t - k)
      • Ratio
        • y'(t) = y(t) / y(t - k)
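The target transformations on slide 15 are simple enough to write out. The sketch below implements centering, differencing, the ratio transform, and the "fast linear (least squares)" detrend in plain Python; the function names are hypothetical, and the robust (RANSAC) and logistic-growth variants are not covered.

```python
def center(y, c):
    """y'(t) = y(t) - c"""
    return [v - c for v in y]

def difference(y, k):
    """y'(t) = y(t) - y(t - k); the first k points have no transformed value."""
    return [y[t] - y[t - k] for t in range(k, len(y))]

def ratio(y, k):
    """y'(t) = y(t) / y(t - k); assumes no zeros in the series."""
    return [y[t] / y[t - k] for t in range(k, len(y))]

def linear_detrend(y):
    """Fit y = intercept + slope * t by least squares and return the residuals.

    Needs at least two points. Returns (residuals, slope, intercept).
    """
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    cov = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    intercept = y_mean - slope * t_mean
    residuals = [v - (intercept + slope * t) for t, v in enumerate(y)]
    return residuals, slope, intercept

y = [1, 3, 5, 7]
print(center(y, 4))
print(difference(y, 1))
print(ratio([1, 2, 4, 8], 1))
print(linear_detrend(y))
```

A model trained on a transformed target predicts y'(t), so each forecast has to be mapped back, for example by adding y(t - k) after differencing or adding the fitted trend after detrending.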
  16. Time Aware Target Transformations (cont.)
      • Example: Capture trends with tree-based models
      [Figure: two forecast plots, "Without detrending" and "With detrending".]
  17. Will be released soon: Prediction Intervals
      • Based on the method from Williams & Goodman (1971)
      • Very general approach:
        • Makes no assumptions about the distribution of forecast errors
        • Makes no assumptions about the model used to create forecasts
      • General idea:
        • Using time-based holdout predictions to determine real forecast errors
        • Constructing empirical prediction intervals based on forecast error quantiles
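The general idea on slide 17 can be sketched as follows: collect the errors (actual minus predicted) from time-based holdout predictions, take their empirical quantiles, and add those to a new point forecast. This is a simplified illustration, not the Driverless AI implementation; the function names are hypothetical, the errors are made up, and a fuller version would estimate separate quantiles for each forecast step.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

def empirical_interval(point_forecast, holdout_errors, coverage=0.9):
    """Interval built from forecast-error quantiles.

    No distribution is assumed for the errors, and nothing is assumed about
    the model that produced the forecasts.
    """
    alpha = (1 - coverage) / 2
    lower = point_forecast + quantile(holdout_errors, alpha)
    upper = point_forecast + quantile(holdout_errors, 1 - alpha)
    return lower, upper

holdout_errors = [-2, -1, 0, 1, 2]  # made-up holdout errors (actual - predicted)
print(empirical_interval(100, holdout_errors, coverage=0.5))
```

Because the interval comes straight from observed errors, it can be asymmetric around the point forecast when the model tends to over- or under-predict.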
  18. Masterminds behind DAI time series
      • Data Scientists
      • Former #1 & #4
  19. Thank You. Twitter: @DmitryLarko
