Disciplined approach to neural network hyperparameters

LESLIE SMITH’S PAPERS
FOR DL JOURNAL CLUB

DISCIPLINED APPROACH PAPER
• A disciplined approach to neural network hyperparameters: Part 1 – Learning Rate, Batch Size,
Momentum, and Weight Decay
• There is no Part 2
• https://arxiv.org/abs/1803.09820
• Collection of empirical observations spread out through the paper

CONVERGENCE / TEST-VAL LOSS
• Observe box in top-left corner of Figure 1(a)
• Shows training loss (red) decreasing and validation loss
(blue) decreasing then increasing.
• Plot to left of validation loss minimum indicates
underfitting.
• Plot to right of validation loss minimum indicates
overfitting.
• Reaching the horizontal part of the test/validation loss
curve (the minimum) is the goal of hyperparameter tuning.
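The reading of the validation curve described above can be sketched in code. This is only an illustration, not something from the paper: the function name `diagnose` and the tolerance `tol` are hypothetical, and real curves are noisy enough that some smoothing would probably be needed first.

```python
def diagnose(val_losses, tol=1e-3):
    """Classify the end of a validation-loss curve (needs at least 2 points).

    Hypothetical helper: the labels follow the slide's reading of Figure 1(a).
    """
    last = len(val_losses) - 1
    # Index of the lowest validation loss seen so far (first one on ties).
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    if val_losses[last] - val_losses[best] > tol:
        return "overfitting"    # loss has risen again after its minimum
    if val_losses[last - 1] - val_losses[last] > tol:
        return "underfitting"   # loss is still clearly decreasing
    return "plateau"            # loss is flat at (or near) its minimum
```

For example, a curve that is still falling at its last point is labelled as underfitting, while one that has turned upward is labelled as overfitting.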

UNDERFITTING
• Underfitting is indicated by continuously decreasing
test loss rather than horizontal plateau (Fig 3(a)).
• Steepness of test loss curve indicates how well the
model is learning (Fig 3(b)).

OVERFITTING
• Increasing Learning Rate moves the model from underfitting
to overfitting.
• Blue curve (Fig 4a) shows steepest fall – indication that this
will produce better final accuracy.
• Yellow curve (Fig 4a) shows overfitting with LR > 0.006.
• More overfitting examples – blue curves in bottom figs.
• Blue curve (Fig 4b) shows underfitting.
• Red curve (Fig 4b) shows overfitting.

CYCLIC LEARNING RATE (CLR)
• Motivation: Underfitting if LR too low, overfitting if too high; requires grid search
• CLR
• Specify upper and lower bound for LR
• Specify step size == number of iterations or epochs used for each step
• Cycle consists of 2 steps – first step LR increases linearly from min to max, second step LR decreases linearly
from max to min.
• Other variants tried but no significant benefit observed.
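The two-step cycle described above can be written as a small function. This is a minimal sketch assuming a 0-based iteration counter; the function and parameter names are illustrative and not taken from the paper or any library.

```python
def triangular_lr(iteration, step_size, min_lr, max_lr):
    """Learning rate at a 0-based iteration under a triangular cyclic schedule."""
    cycle_len = 2 * step_size                  # one cycle = two steps
    pos = iteration % cycle_len                # position inside the current cycle
    if pos <= step_size:
        frac = pos / step_size                 # first step: rise from min to max
    else:
        frac = (cycle_len - pos) / step_size   # second step: fall from max to min
    return min_lr + (max_lr - min_lr) * frac
```

With `step_size=100`, the LR starts at `min_lr`, reaches `max_lr` at iteration 100, and is back at `min_lr` at iteration 200, after which the cycle repeats.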

CLR – CHOOSE MAX AND MIN LR
• LR upper bound == min value of LR that causes test / validation loss to increase (and accuracy to
decrease)
• LR lower bound, one of:
• Factor of 3 or 4 less than upper bound.
• Factor of 10 or 20 less than upper bound if only 1 cycle is used.
• Find experimentally using short test of ~1000 iterations, pick largest that allows convergence.
• Step size – if the LR is too high and training becomes unstable, increase the step size to increase the
difference between the max and min LR bounds.
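The bound-selection rules above can be sketched as two helpers. This assumes the validation loss has already been measured at a list of increasing LR values; the function names and the example numbers are hypothetical, and on noisy measurements the simple "first increase" test below would likely need smoothing.

```python
def upper_bound_lr(lrs, val_losses):
    """Smallest tested LR whose validation loss is higher than at the previous LR.

    Assumes `lrs` is sorted in increasing order and paired with `val_losses`.
    """
    for prev, cur, lr in zip(val_losses, val_losses[1:], lrs[1:]):
        if cur > prev:
            return lr
    return None  # loss never increased: the tested range was too narrow


def lower_bound_candidates(upper, single_cycle=False):
    """Candidate lower bounds: upper/3 and upper/4, or upper/10 and upper/20 for one cycle."""
    factors = (10, 20) if single_cycle else (3, 4)
    return [upper / f for f in factors]


# Made-up measurements: the loss first rises at LR = 6e-3.
lrs = [1e-4, 1e-3, 3e-3, 6e-3, 1e-2]
val_losses = [2.3, 1.9, 1.5, 1.6, 2.5]
upper = upper_bound_lr(lrs, val_losses)
lows = lower_bound_candidates(upper)
```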

SUPER CONVERGENCE
• Super convergence – some networks remain stable under
high LR, so can be trained very quickly with CLR with high
upper bound.
• Fig 5a shows super convergence (orange curve) training
faster to higher accuracy using large LR than blue curve.
• 1-cycle policy – one cycle that is shorter than the total number
of iterations/epochs, then remaining iterations with LR
lowered by several orders of magnitude.
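A rough sketch of a 1-cycle schedule follows. The slide only says the LR is lowered by several orders of magnitude after the cycle; the linear decay used for that final phase, the `final_lr` parameter, and all names here are assumptions made for illustration.

```python
def one_cycle_lr(iteration, total_iters, step_size, min_lr, max_lr, final_lr):
    """LR at a 0-based iteration: one triangular cycle, then a decay phase."""
    cycle_len = 2 * step_size
    if cycle_len >= total_iters:
        raise ValueError("the cycle must be shorter than the whole training run")
    if iteration < step_size:
        # First step: linear rise from min_lr to max_lr.
        return min_lr + (max_lr - min_lr) * iteration / step_size
    if iteration < cycle_len:
        # Second step: linear fall from max_lr back to min_lr.
        return min_lr + (max_lr - min_lr) * (cycle_len - iteration) / step_size
    # Remaining iterations: assumed linear decay from min_lr towards final_lr.
    frac = (iteration - cycle_len) / (total_iters - cycle_len)
    return min_lr + (final_lr - min_lr) * frac
```

For instance, with `total_iters=1000`, `step_size=400`, `min_lr=0.01`, `max_lr=0.1` and `final_lr=0.0001`, the cycle occupies the first 800 iterations and the last 200 decay the LR towards a value two orders of magnitude below `min_lr`.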

REGULARIZATION
• Many forms of regularization
• Large Learning Rate
• Small batch size
• Weight decay (aka L2 regularization)
• Dropout
• Need to balance different regularizers for each dataset and architecture.
• Fig 5b (previous slide) shows tradeoff between weight decay (WD) and LR. Large LR for faster learning
needs to be balanced with lower WD.
• General guidance: reducing other forms of regularization and training with a high LR makes training efficient.

BATCH SIZE
• Larger batch sizes permit larger LR using the 1-cycle schedule.
• Larger batch size may increase training time, so tradeoff
required.
• Tradeoff – use batch size so number of epochs is optimum
for data/model.
• Batch size limited by GPU memory.
• Fig 6a shows validation accuracy for different batch sizes.
Larger batch sizes better but effect tapers off (BS=1024
blue curve very close to BS=512 red curve).

(CYCLIC) MOMENTUM
• Set momentum as large as possible without causing instability.
• Constant LR => use large constant momentum (0.9 – 0.99)
• Cyclic LR => decrease cyclic momentum as cyclic LR increases
during early to middle part of training (0.95 – 0.85).
• Fig 8a – blue curve is constant momentum, red curve is
decreasing CM and yellow curve is increasing CM (with
increasing CLR).
• These observations also carry over to deep networks (Fig 8b).
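Cyclic momentum can be sketched as the mirror image of the triangular LR schedule: momentum falls while the LR rises and rises while the LR falls. The defaults 0.95 and 0.85 come from the range quoted above; the function name and the triangular shape are assumptions for illustration.

```python
def cyclic_momentum(iteration, step_size, max_mom=0.95, min_mom=0.85):
    """Momentum at a 0-based iteration, moving opposite to a triangular LR cycle."""
    cycle_len = 2 * step_size
    pos = iteration % cycle_len
    if pos <= step_size:
        frac = pos / step_size                 # LR rising, so momentum falls
    else:
        frac = (cycle_len - pos) / step_size   # LR falling, so momentum rises
    return max_mom - (max_mom - min_mom) * frac
```

Momentum is at its maximum when the LR is at its lower bound and at its minimum when the LR peaks at the end of the first step.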

WEIGHT DECAY
• Cyclical WD not useful, should remain constant throughout
training.
• Value should be found by grid search (ok with early
termination).
• Fig 9a shows loss plots for different values of WD (with LR=5e-3,
mom=0.95).
• Fig 9b shows equivalent accuracy plots.

CYCLIC LEARNING RATE PAPER
• Cyclical Learning Rates for Training Neural Networks
• https://arxiv.org/abs/1506.01186
• Describes CLR in depth and describes results of training common networks with CLR.

CYCLIC LEARNING RATE
• Successor to
• Learning rate schedules – varying LR exponentially over training.
• Adaptive Learning Rates (RMSProp, ADAM, etc) – change LR
based on values of gradients.
• Based on observation that increasing LR has short-term
negative effect but long-term positive effect.
• Let LR vary between range of values.
• Triangular LR (Fig 2) is usually good enough but other variants
also possible.
• Accuracy plot (Fig 1) shows CLR (red curve) performs better than
exponential LR.

ESTIMATING CLR PARAMETERS
• Step size
• Step size = 2 to 10 × number of iterations per epoch
• Number of training iterations per epoch = number of training records /
batch size
• Upper and lower bounds for LR
• Run model for few epochs with some bounds (1e-4 to 2e-1 for
example)
• Upper bound == where accuracy stops increasing, becomes ragged, or
falls (~ 6e-3).
• Lower bound
• Either 1/3 or 1/4 of upper bound (~ 2e-3)
• Point at which accuracy starts to increase (~ 1e-3)
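The step-size arithmetic above can be worked through in a few lines. The dataset size and batch size in the example are made up, and rounding the iteration count up (so a final partial batch counts as one iteration) is an assumption; the slide just divides records by batch size.

```python
import math


def iterations_per_epoch(num_records, batch_size):
    """Training iterations in one epoch; a final partial batch counts as one."""
    return math.ceil(num_records / batch_size)


def clr_step_size(num_records, batch_size, epochs_per_step=2):
    """Step size in iterations; the guideline is 2 to 10 epochs per step."""
    return epochs_per_step * iterations_per_epoch(num_records, batch_size)


# Hypothetical setup: 50,000 training records with batch size 100.
smallest_step = clr_step_size(50_000, 100, epochs_per_step=2)
largest_step = clr_step_size(50_000, 100, epochs_per_step=10)
```

Here one epoch is 500 iterations, so the guideline gives a step size between 1,000 and 5,000 iterations.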

LR FINDER USAGE
• LR Finder – first available in Fast.AI library.
• Upper bound – between 1e-3 and 1e-2 (10^-3 and 10^-2) where loss is
decreasing fastest.
• Can also be found using lr.plot_loss_change() – minimum point (here 1e-2).
• Lower bound is about 1-2 orders of magnitude lower.
• LR Finder (Keras) – https://github.com/surmenok/keras_lr_finder
• LR Finder (PyTorch) – https://github.com/davidtvs/pytorch-lr-finder
• Keras example – https://github.com/sujitpal/keras-tutorial-odsc2020/blob/master/02_03_exercise_2_solved.ipynb
• Fast.AI example – https://colab.research.google.com/github/fastai/fastbook/blob/master/16_accel_sgd.ipynb
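The idea behind reading an LR finder plot can be sketched without any library: sweep the LR geometrically, record the loss at each step, and look for the LR where the loss falls fastest. This is not the code of the packages linked above; the function names and the loss values are invented for illustration, and real finders usually smooth the loss before differencing.

```python
def lr_sweep(start_lr, end_lr, num_steps):
    """Geometrically spaced learning rates from start_lr to end_lr (inclusive)."""
    ratio = (end_lr / start_lr) ** (1 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]


def fastest_drop_lr(lrs, losses):
    """LR after which the recorded loss fell the most in a single step.

    Assumption: the change between losses[i] and losses[i + 1] is attributed
    to lrs[i], the rate used for the update between the two measurements.
    """
    changes = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    i_min = min(range(len(changes)), key=changes.__getitem__)
    return lrs[i_min]


# Made-up loss values for a sweep from 1e-5 to 1.
lrs = lr_sweep(1e-5, 1.0, 6)
losses = [2.30, 2.28, 2.10, 1.40, 1.35, 4.00]
suggested_upper = fastest_drop_lr(lrs, losses)
```

In this invented sweep the steepest fall happens at roughly 1e-3, which would be taken as the upper bound, with the lower bound one to two orders of magnitude below it.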
