Layer normalization [University of Toronto, G. Hinton]
Problem: BatchNorm depends on the batch statistics, and it is not obvious how to apply it to RNNs
Sequences fed to an RNN have varying lengths
Hard to apply to online learning
Solution: compute the normalization statistics over the layer instead of the batch, and apply it before the non-linearity
Layer normalization
Compute the layer normalization statistics over all the hidden units in the same layer.
For the $i$-th hidden unit:

$$h_i^{l+1} = f\big(w_i^{lT} h^l + b_i^l\big) \;\Rightarrow\; a_i^l = w_i^{lT} h^l, \quad h_i^{l+1} = f\big(a_i^l + b_i^l\big)$$

$$\mu^l = \frac{1}{H} \sum_{i=1}^{H} a_i^l, \qquad \sigma^l = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \big(a_i^l - \mu^l\big)^2}$$

where $H$ is the number of hidden units in the layer.
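A minimal NumPy sketch of these statistics (the gain/bias parameters and the `eps` constant are illustrative assumptions in the spirit of the paper):

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize pre-activations `a` (batch, hidden) over the hidden
    dimension, so each sample uses its own mu^l and sigma^l."""
    mu = a.mean(axis=-1, keepdims=True)    # mu^l, one per sample
    sigma = a.std(axis=-1, keepdims=True)  # sigma^l, one per sample
    return gain * (a - mu) / (sigma + eps) + bias

# Toy usage: one layer's pre-activations for a batch of 4 samples.
rng = np.random.default_rng(0)
a = rng.normal(loc=3.0, scale=2.0, size=(4, 16))
out = layer_norm(a, gain=np.ones(16), bias=np.zeros(16))
print(out.mean(axis=-1), out.std(axis=-1))  # ~0 and ~1 per sample
```

Unlike BatchNorm, the statistics are computed per sample, so they do not depend on the batch size or the sequence length.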
Recurrent batch normalization
$$\mathrm{BN}_{\gamma,\beta}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

$$f_t = \mathrm{sigm}\big(\mathrm{BN}_{\gamma_h,\beta_h}(W_h h_{t-1}) + \mathrm{BN}_{\gamma_x,\beta_x}(W_x x_t) + b\big)$$

$$i_t = \mathrm{sigm}\big(\mathrm{BN}_{\gamma_h,\beta_h}(W_h h_{t-1}) + \mathrm{BN}_{\gamma_x,\beta_x}(W_x x_t) + b\big)$$

$$o_t = \mathrm{sigm}\big(\mathrm{BN}_{\gamma_h,\beta_h}(W_h h_{t-1}) + \mathrm{BN}_{\gamma_x,\beta_x}(W_x x_t) + b\big)$$

$$g_t = \tanh\big(\mathrm{BN}_{\gamma_h,\beta_h}(W_h h_{t-1}) + \mathrm{BN}_{\gamma_x,\beta_x}(W_x x_t) + b\big)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

$$h_t = o_t \odot \tanh\big(\mathrm{BN}_{\gamma_c,\beta_c}(c_t)\big)$$
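A compact NumPy sketch of one BN-LSTM step under these equations; the weight shapes, the joint gate pre-activation with slicing, and the use of current-batch statistics are simplifying assumptions (the paper additionally keeps separate statistics per time step):

```python
import numpy as np

def bn(x, gamma, beta, eps=1e-3):
    """BN_{gamma,beta}: normalize with statistics over the batch axis."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def bn_lstm_step(x_t, h_prev, c_prev, Wx, Wh, b, gx, bx, gh, bh, gc, bc):
    """One step. Wx: (d_x, 4n), Wh: (n, 4n); gates share one pre-activation."""
    n = h_prev.shape[1]
    # Normalize the hidden-to-hidden and input-to-hidden terms separately.
    z = bn(h_prev @ Wh, gh, bh) + bn(x_t @ Wx, gx, bx) + b
    f, i, o, g = z[:, :n], z[:, n:2*n], z[:, 2*n:3*n], z[:, 3*n:]
    c_t = sigm(f) * c_prev + sigm(i) * np.tanh(g)
    h_t = sigm(o) * np.tanh(bn(c_t, gc, bc))
    return h_t, c_t
```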
Group normalization [Facebook AI Research]
Problem: BatchNorm's error increases rapidly when the batch size decreases
Computer vision tasks often require small batches, constrained by memory consumption
Solution: divide the channels into groups and compute the mean and variance for normalization within each group (see the sketch below)
[Figure: error (%) vs. batch size (2, 4, 8, 16, 32 images per worker), comparing Batch Norm and Group Norm.]
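A minimal NumPy sketch of the grouping (the NCHW layout and the group count are illustrative assumptions):

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """x: (N, C, H, W). Normalize within each group of C // num_groups
    channels, per sample, so statistics do not depend on the batch size."""
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mu) / np.sqrt(var + eps)
    # Per-channel affine parameters gamma, beta.
    return (xg.reshape(n, c, h, w) * gamma.reshape(1, c, 1, 1)
            + beta.reshape(1, c, 1, 1))

x = np.random.default_rng(0).normal(size=(2, 8, 4, 4))
y = group_norm(x, num_groups=4, gamma=np.ones(8), beta=np.zeros(8))
```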
How does batch normalization help optimization? [MIT]
No, it is not about internal covariate shift!
It makes the optimization landscape significantly smoother.
Investigate the connection between ICS and BatchNorm
VGG on CIFAR-10 w/ and w/o BatchNorm
Dramatic improvement in both optimization and generalization
Examine the difference in distributional stability
[Figures: training and test accuracy (%) over 15k steps for Standard and Standard + BatchNorm at LR = 0.1 and LR = 0.5; activation distributions at Layer #3 and Layer #11 for Standard vs. Standard + BatchNorm.]
Does BatchNorm's performance stem from controlling ICS?
We train the network with random noise injected after BatchNorm layers.
The noise perturbs each activation of each sample in the batch, using i.i.d. noise with non-zero mean and non-unit variance.
The noise distribution changes at each time step.
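A minimal sketch of the noise injection (the mean/variance ranges are illustrative assumptions; the point is only that the noise is i.i.d. per activation and its distribution is redrawn every step):

```python
import numpy as np

def noisy_batchnorm_output(bn_out, rng):
    """Perturb each activation of each sample with i.i.d. noise whose
    mean != 0 and variance != 1, redrawn at every training step."""
    mean = rng.uniform(-1.0, 1.0)  # non-zero mean, new each step
    std = rng.uniform(0.5, 2.0)    # non-unit std, new each step
    return bn_out + rng.normal(mean, std, size=bn_out.shape)
```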
[Figure: training accuracy over 15k steps for Standard, Standard + BatchNorm, and Standard + "Noisy" BatchNorm; activation distributions at Layers #2, #9, and #13 for the three settings.]
Is BatchNorm reducing ICS?
Is there a broader notion of ICS that has such a direct link to training
performance?
Attempt to capture ICS from a perspective that is more tied to the underlying
optimization phenomenon.
Measure the difference between the gradients of each layer before and after updates to all the previous layers.
Is BatchNorm reducing ICS?
Def. Internal covariate shift (ICS) is the quantity $\|G_{t,i} - G'_{t,i}\|_2$, where

$$G_{t,i} = \nabla_{W_i^{(t)}} L\big(W_1^{(t)}, \ldots, W_i^{(t)}, \ldots, W_k^{(t)};\, x^{(t)}, y^{(t)}\big)$$

$$G'_{t,i} = \nabla_{W_i^{(t)}} L\big(W_1^{(t+1)}, \ldots, W_{i-1}^{(t+1)}, W_i^{(t)}, \ldots, W_k^{(t)};\, x^{(t)}, y^{(t)}\big)$$

$G_{t,i}$ corresponds to the gradient of the parameters of layer $i$; $G'_{t,i}$ is the same gradient after all the previous layers have been updated. Their difference reflects the change in the optimization landscape of $W_i$ caused by the changes to its input.
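A toy NumPy sketch of this measurement on a hypothetical two-layer linear network with squared loss (all shapes, the learning rate, and the loss are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))        # batch of inputs x^(t)
y = rng.normal(size=(16, 4))        # targets y^(t)
W1 = rng.normal(size=(8, 8)) * 0.1  # the "previous" layer
W2 = rng.normal(size=(8, 4)) * 0.1  # layer i under study
lr = 0.1

def grads(W1, W2):
    """Gradients of L = ||x@W1@W2 - y||^2 / N wrt W1 and W2."""
    h = x @ W1
    d_out = 2 * (h @ W2 - y) / len(x)
    g2 = h.T @ d_out            # gradient for layer i (W2)
    g1 = x.T @ (d_out @ W2.T)   # gradient for the previous layer (W1)
    return g1, g2

g1, G = grads(W1, W2)               # G_{t,i}: gradient of W2 now
W1_new = W1 - lr * g1               # update only the previous layer
_, G_prime = grads(W1_new, W2)      # G'_{t,i}: same gradient after update
print(np.linalg.norm(G - G_prime))  # ||G_{t,i} - G'_{t,i}||_2
```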
Is BatchNorm the best (only?) way to smoothen the landscape?
Is this smoothening effect a unique feature of BatchNorm?
Study schemes that fix the first-order moment of the activations, as BatchNorm does,
and normalize them by the average of their $L_p$ norm:
the $L_1$, $L_2$, and $L_\infty$ norms
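A minimal NumPy sketch of such an $L_p$ scheme (per-sample activations in a matrix `a`, one row per example, is an illustrative assumption):

```python
import numpy as np

def lp_normalize(a, p, eps=1e-5):
    """Center the activations (fix the first-order moment), then scale
    by the batch-average L_p norm instead of a per-feature variance."""
    a = a - a.mean(axis=0)                    # first-order moment -> 0
    norms = np.linalg.norm(a, ord=p, axis=1)  # per-sample L_p norm
    return a / (norms.mean() + eps)

a = np.random.default_rng(0).normal(size=(32, 64))
for p in (1, 2, np.inf):
    print(p, float(lp_normalize(a, p).std()))
```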
Is BatchNorm the best (only?) way to smoothen the landscape?
[Figure: training accuracy (%) and training loss over steps for Standard, Standard + BatchNorm, and Standard + $L_1$/$L_2$/$L_\infty$ normalization; (a) VGG, (b) Deep Linear Model.]
Is BatchNorm the best (only?) way to smoothen the landscape?
All the normalization strategies offer performance comparable to BatchNorm
For the deep linear network, $L_1$ normalization performs even better than BatchNorm
$L_p$ normalization leads to larger distributional covariate shift than the vanilla network, yet still yields improved optimization performance
Conclusion
BatchNorm might not even be reducing internal covariate shift.
BatchNorm makes the landscape of the corresponding optimization problem significantly smoother.
The paper provides empirical demonstration and theoretical justification (Lipschitzness).
The smoothening effect is not uniquely tied to BatchNorm.
Q & A
Extra papers
Understanding Batch Normalization (https://arxiv.org/abs/1806.02375)
Norm matters: efficient and accurate normalization schemes in deep networks
(https://arxiv.org/abs/1803.01814)
Batch-normalized Recurrent Highway Networks
(https://arxiv.org/abs/1809.10271)
Differentiable Learning-to-Normalize via Switchable Normalization
(https://arxiv.org/abs/1806.10779)