This document summarizes a talk on model selection and tuning at scale. Using progressively larger slices of a 1TB Criteo click-through dataset, it tests and tunes gradient boosted trees (GBTs) and other models. Grid search on small slices found that GBT performed best; tuning GBT on slices of up to 10% of the data showed that tree depth should increase logarithmically with data size. Online learning with Vowpal Wabbit (VW) was also efficient, needing minimal tuning. The document cautions that true model selection and tuning at scale must start from samples larger than a few GBs, or conclusions are merely extrapolated from small data.
2. About us
Owen Zhang
Chief Product Officer @ DataRobot
Former #1 ranked Data Scientist on Kaggle
Former VP, Science @ AIG
Peter Prettenhofer
Software Engineer @ DataRobot
Scikit-learn core developer
4. Model Selection
● Estimating the performance of different models in order to choose the best one.
● K-Fold Cross-validation
● The devil is in the detail:
○ Partitioning
○ Leakage
○ Sample size
○ Stacked models require nested layers
[Diagram: data partitioned into 5 blocks, assigned to train, validation, and holdout]
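A minimal sketch of this partitioning with scikit-learn (sizes are illustrative; stacked models would repeat the K-fold split one level down on each training fold):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)

# Set the holdout aside before any model sees it, to avoid leakage.
X_dev, X_hold, y_dev, y_hold = train_test_split(X, y, test_size=0.2,
                                                random_state=0)

# 5-fold CV on the remainder: each fold serves once as validation.
for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                random_state=0).split(X_dev):
    X_tr, X_val = X_dev[train_idx], X_dev[val_idx]
    y_tr, y_val = y_dev[train_idx], y_dev[val_idx]
    # fit on (X_tr, y_tr), evaluate on (X_val, y_val) ...
```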
8. Model Tuning
● Optimizing the performance of a model
● Example: Gradient Boosted Trees
○ Nr of trees
○ Learning rate
○ Tree depth / Nr of leaf nodes
○ Min leaf size
○ Example subsampling rate
○ Feature subsampling rate
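For concreteness, these knobs map onto scikit-learn's GradientBoostingClassifier roughly as follows (a sketch; the values are illustrative, not the settings used in the experiments):

```python
from sklearn.ensemble import GradientBoostingClassifier

# One argument per knob on the list above.
model = GradientBoostingClassifier(
    n_estimators=500,     # nr of trees
    learning_rate=0.05,   # learning rate
    max_depth=5,          # tree depth (max_leaf_nodes is the alternative)
    min_samples_leaf=50,  # min leaf size
    subsample=0.8,        # example subsampling rate
    max_features=0.5,     # feature subsampling rate
)
```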
9. Search Space
Hyperparameter           GBRT (naive)   GBRT   RandomForest
Nr of trees              5              1      1
Learning rate            5              5      -
Tree depth               5              5      1
Min leaf size            3              3      3
Example subsample rate   3              1      1
Feature subsample rate   2              2      5
Total (# of model fits)  2250           150    15
(Entries are the number of grid points searched per hyperparameter; each total is the product of its column.)
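One reason the non-naive GBRT column gets away with a single grid point for the number of trees: staged predictions let a single fit score every intermediate tree count. A sketch with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Fit once with the maximum tree budget ...
gbrt = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05)
gbrt.fit(X_tr, y_tr)

# ... then score every intermediate number of trees from that single fit,
# instead of refitting once per grid point.
scores = [roc_auc_score(y_val, proba[:, 1])
          for proba in gbrt.staged_predict_proba(X_val)]
print("best nr of trees:", int(np.argmax(scores)) + 1)
```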
11. Challenges at Scale
● Why is learning with more data harder?
○ Paradox: more data would let us fit more complex models, but computational constraints prevent it*
○ => we need more efficient ways of creating complex models!
● Need to account for the combined cost: model fitting + model selection / tuning
○ Smart hyperparameter tuning tries to decrease the # of model fits
○ … we can accomplish this with fewer hyperparameters too**
* Pedro Domingos, A few useful things to know about machine learning, 2012.
** Practitioners often favor algorithms with few hyperparameters, such as RandomForest or AveragedPerceptron (see http://nlpers.blogspot.co.at/2014/10/hyperparameter-search-bayesian.html)
12. A case study -- binary classification on 1TB of data
● Criteo click-through data
● Downsampled ad impression data over 24 days
● Fully anonymized dataset:
○ 1 target
○ 13 integer features
○ 26 hashed categorical features
● Experiment setup:
○ Using day 0 - day 22 data for training, day 23 data for testing
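A minimal sketch of reading one day of the data, assuming the tab-separated layout of the public Criteo terabyte dataset (label, 13 integers, 26 hex-hashed categoricals); the file name is an assumption:

```python
import pandas as pd

int_cols = [f"i{j}" for j in range(1, 14)]   # 13 integer features
cat_cols = [f"c{j}" for j in range(1, 27)]   # 26 hashed categorical features
cols = ["label"] + int_cols + cat_cols

# Read a manageable slice of one day; a full day is ~46GB uncompressed.
day0 = pd.read_csv("day_0", sep="\t", names=cols, nrows=1_000_000)
print(day0["label"].mean())  # event rate of this slice
```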
13. Big Data?
Data size:
● ~46GB/day
● ~180,000,000 rows/day
However, it is very imbalanced (even after downsampling non-events)
● ~3.5% event rate
Further downsampling non-events to a balanced dataset (keeping all events plus an equal number of non-events, i.e. ~7% of the rows) will reduce the size of the data to ~70GB
● Will fit into a single node under “optimal” conditions
● Loss of model accuracy is negligible in most situations
Assuming a 0.1% raw event (click-through) rate:
Raw data: 35TB @ 0.1% events → downsampled data: 1TB @ 3.5% → balanced data: 70GB @ 50%
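The funnel follows from keeping all events at each step; a quick arithmetic check (all sizes approximate):

```python
# 35TB of raw logs at a 0.1% event rate.
raw_tb, raw_rate = 35, 0.001

# Thin non-events until events make up 3.5% of the data.
down_tb = raw_tb * raw_rate / 0.035          # -> ~1.0 TB
# Keep all events plus an equal number of non-events (50/50 balance).
balanced_gb = down_tb * 1000 * (2 * 0.035)   # -> ~70 GB
print(round(down_tb, 2), round(balanced_gb))
```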
14. Where to start?
● 70GB (~260,000,000 data points) is still a lot of data
● Let’s take a tiny slice of that to experiment
○ Take 0.25%, then 0.5%, then 1%, and do a grid search on each (see the sketch after the chart)
[Chart: model accuracy vs. time (seconds) for RF, ASVM, regularized regression, GBM with count features, and GBM without count features; an arrow marks the "better" direction. GBM comes out ahead.]
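A sketch of the slicing experiment (synthetic stand-in data so it runs; the grid values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the ~260M-row balanced table, synthetic so the sketch runs.
rng = np.random.default_rng(0)
balanced = pd.DataFrame(rng.random((200_000, 13)),
                        columns=[f"i{j}" for j in range(1, 14)])
balanced["label"] = rng.integers(0, 2, len(balanced))

param_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]}
for frac in (0.0025, 0.005, 0.01):           # 0.25%, 0.5%, 1% slices
    s = balanced.sample(frac=frac, random_state=0)
    X, y = s.drop(columns="label"), s["label"]
    search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                          scoring="roc_auc", cv=3).fit(X, y)
    print(frac, search.best_params_, round(search.best_score_, 4))
```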
15. GBM is the way to go; let's go up to 10% of the data
[Chart: performance vs. # of trees, one curve per sample size / tree depth / time-to-finish combination]
16. A “Fairer” Way of Comparing Models
[Chart: accuracy vs. training time; the annotation marks the better model when time is the constraint]
18. Tree Depth vs Data Size
● A natural heuristic -- increment tree depth by 1 every time data size doubles
[Chart: optimal tree depth at 1%, 2%, 4%, and 10% sample sizes]
Optimal Depth = a + b * log(DataSize)
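"Add one level per doubling" is exactly a log2 dependence, i.e. b = 1/log(2) in the formula above. A sketch with illustrative constants (a and b would be fit on the small-sample grid searches):

```python
import math

def suggested_depth(n_rows, a=5, b=1.0):
    # Depth grows by b each time the data doubles; a and b are illustrative.
    return round(a + b * math.log2(n_rows / 2_600_000))

# ~260M balanced rows in total, so a 1% slice is ~2.6M rows.
for frac, n in [("1%", 2_600_000), ("2%", 5_200_000),
                ("4%", 10_400_000), ("10%", 26_000_000)]:
    print(frac, suggested_depth(n))   # 5, 6, 7, 8
```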
19. What about VW?
● Highly efficient online learning algorithm
● Supports adaptive learning rates
● Inherently linear; the user needs to specify non-linear features or interactions explicitly
● 2-way and 3-way interactions can be generated on the fly
● Supports "every k" progressive validation
● The only "tuning" REQUIRED is the specification of interactions
○ Thanks to progressive validation, bad interactions can be detected immediately, so no time is wasted (see the sketch below)
20. Data pipeline for VW
[Diagram: the training data is split randomly into chunks T1 … Tm; each chunk is shuffled into T1s … Tms, and the shuffled chunks are concatenated and interleaved into a single stream, followed by the test set.]
It takes longer to prep the data than to run the model!
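A minimal sketch of the shuffle-and-interleave step, assuming the training data has already been split into m line-oriented chunk files small enough to shuffle in memory (file names are hypothetical):

```python
import random

def shuffled_lines(path, seed):
    # Each chunk fits in RAM, so it can be shuffled in memory.
    rng = random.Random(seed)
    with open(path) as f:
        lines = f.readlines()
    rng.shuffle(lines)
    return lines

def interleave(chunk_paths, out_path):
    # Round-robin merge so no single time period dominates any region of
    # the stream that the online learner consumes sequentially.
    iters = [iter(shuffled_lines(p, seed=i))
             for i, p in enumerate(chunk_paths)]
    with open(out_path, "w") as out:
        while iters:
            for it in list(iters):
                line = next(it, None)
                if line is None:
                    iters.remove(it)
                else:
                    out.write(line)

interleave([f"chunk_{i}.vw" for i in range(8)], "train_shuffled.vw")
```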
23. Do We Really “Tune/Select Model @ Scale”?
● What we claim we do:
○ Model tuning and selection on big data
● What we actually do:
○ Model tuning and selection on small data
○ Re-run the model and expect/hope that performance and hyperparameters extrapolate as expected
● If you start the model tuning/selection process with GBs (or even 100s of MBs) of data, you are doing it wrong!
24. Some Interesting Observations
● At least for some datasets, it is very hard for a "pure linear" model to outperform (accuracy-wise) a non-linear model, even with much more data
● There is meaningful structure in the hyperparameter space
● When we have limited time (relative to data size), running "deeper" models on a smaller data sample may actually yield better results
● To fully exploit the data, model estimation time is usually at least proportional to n*log(n), and we need models whose number of parameters can scale with the number of data points
○ GBM can have as many parameters as we want
○ So can factorization machines
● For any data and any model, we run into a "diminishing returns" issue as the data gets bigger and bigger (a sketch of measuring this follows)
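A minimal sketch of measuring those diminishing returns with a learning curve (synthetic data here; the real curves would come from the Criteo slices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
sizes, _, val_scores = learning_curve(
    GradientBoostingClassifier(n_estimators=50), X, y,
    train_sizes=np.linspace(0.05, 1.0, 5), cv=3, scoring="roc_auc",
)
# The AUC gain per doubling of n shrinks as n grows.
for n, s in zip(sizes, val_scores.mean(axis=1)):
    print(f"n={n:>5d}  AUC={s:.4f}")
```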