In this talk, I explain several major stochastic optimizers from the perspective of the metric, that is, how the parameter space of the model is defined.
2017-07, Research Seminar at Keio University, Metric Perspective of Stochastic Optimizers
1. Metric Perspective of Stochastic Optimizers
Asahi Ushio
E-mail: ushio@ykw.elec.keio.ac.jp
Dept. Electronics and Electrical Engineering, Keio University, Japan
Internal Research Seminar
July 19, 2017
Note: the author has moved to Cogent Labs as a researcher from April 2018.
5. Preliminaries
Problem Setting
Model: For an input xt ∈ R^n, the output ŷt ∈ R is derived by
    Activation : ŷt = M(zt)            (1)
    Output     : zt = Nw(xt) ∈ R.      (2)
Loss: With instantaneous loss function lt(w) of parameter w,
    L(w) = E[lt(w)]t.                  (3)
Problem Setting       | M(z)                        | Nw(x)   | lt(ŷt)                                   | yt
Regression            | z                           | w^T x   | ||ŷt − yt||₂²                            | yt ∈ R
Classification        | 1 / (1 + e^−z)              | w^T x   | −[yt log(ŷt) + (1 − yt) log(1 − ŷt)]     | yt ∈ {0, 1}
Multi-Classification  | e^{zi} / ∑_{i=1}^{k} e^{zi} | wi^T x  | −∑_{i=1}^{k} yt,i log(ŷi)                | yt,i ∈ {0, 1}
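As a concrete illustration, the three rows of the table can be written as short NumPy functions (a minimal sketch; the function and variable names are mine, not from the slides):

import numpy as np

def regression_loss(w, x, y):
    # Nw(x) = w^T x, M(z) = z, squared loss
    y_hat = w @ x
    return (y_hat - y) ** 2

def classification_loss(w, x, y):
    # M(z) = sigmoid(z), cross-entropy loss, y in {0, 1}
    z = w @ x
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multi_classification_loss(W, x, y):
    # M(z)_i = softmax(z)_i with z_i = w_i^T x, y is a one-hot vector
    z = W @ x
    y_hat = np.exp(z - z.max())
    y_hat /= y_hat.sum()
    return -np.sum(y * np.log(y_hat))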
6. Preliminaries
Brief Illustration of Model
Figure: Neural Network (input → hidden layer → output activation)
Figure: Linear Model (input → output activation)
7. Preliminaries
Stochastic Optimization
Purpose: Find ŵ = arg min_w L(w) by stochastic approximation.
SGD (Stochastic Gradient Descent) and Variants
    wt = wt−1 − ηt gt,   ηt ∈ R  s.t.  lim_{t→∞} ηt = 0  and  lim_{t→∞} ∑_{i=1}^{t} ηi = ∞    (4)

    Vanilla SGD : gt = (d/dw) lt(wt−1)
    Momentum    : gt = γ gt−1 + (1 − γ)(d/dw) lt(wt−1),   γ ∈ R
    NAG         : gt = γ gt−1 + (1 − γ)(d/dw) lt(wt−1 − γ gt−1)
    Minibatch   : gt = (1/I) ∑_{i=0}^{I−1} (d/dw) lt−i(wt−1),   I ∈ N.
Suppose lt(w) is differentiable.
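The four variants can be sketched in NumPy as follows (a minimal sketch with my own naming; grad(w, t) stands for (d/dw) lt(w) and eta is a step-size schedule, both assumed to be given):

import numpy as np

def sgd_step(w, grad, t, eta):
    # Vanilla SGD: w_t = w_{t-1} - eta_t * g_t
    return w - eta(t) * grad(w, t)

def momentum_step(w, g_prev, grad, t, eta, gamma=0.9):
    # Momentum: g_t = gamma * g_{t-1} + (1 - gamma) * dl_t/dw(w_{t-1})
    g = gamma * g_prev + (1 - gamma) * grad(w, t)
    return w - eta(t) * g, g

def nag_step(w, g_prev, grad, t, eta, gamma=0.9):
    # NAG: gradient is evaluated at the look-ahead point w_{t-1} - gamma * g_{t-1}
    g = gamma * g_prev + (1 - gamma) * grad(w - gamma * g_prev, t)
    return w - eta(t) * g, g

def minibatch_step(w, grad, t, eta, I=32):
    # Minibatch: average the last I instantaneous gradients at w_{t-1}
    g = np.mean([grad(w, t - i) for i in range(I)], axis=0)
    return w - eta(t) * g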
8. Hessian Approximation
Quasi-Newton Method
The Newton method employs the Hessian matrix:
    wt = wt−1 − Bt gt                                    (5)
    Bt = Ht^−1,   Ht = d²L(wt−1)/dw².
Quasi-Newton methods employ a Hessian approximation Ĥt instead of Ht:
1 FDM (Finite Difference Method): SGD-QN [1], AdaDelta [2], VSGD [3, 4]
2 Extended Gauss-Newton Approximation
3 L-BFGS
Diagonal approximation is often used in stochastic optimization:
Ĥt = diag (h1,t . . . hn,t) (6)
9. Hessian Approximation Finite Difference Method
SGD-QN
SGD-QN [1] employs an instantaneous estimator of the Hessian:
    1/hi,t := (η/t) E[1/h^FDM_{i,τ}]_{τ=1}^{t}                                        (7)
    E[1/h^FDM_{i,τ}]_{τ=1}^{t} ≈ 1/h̄i,t := α (1/h^FDM_{i,t}) + (1 − α)(1/h̄i,t−1)      (8)
    h^FDM_{i,t} := (gi,t − gi,t−1) / (wi,t−1 − wi,t−2)                                (9)
The FDM approximation (9) is called the secant condition in the quasi-Newton context.
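A rough sketch of the diagonal estimate (7)-(9) (my own variable names; the full SGD-QN schedule in [1] has additional machinery such as skipping updates, which I omit):

import numpy as np

def sgdqn_inverse_hessian(g, g_prev, w_prev, w_prev2, inv_h_bar, alpha=0.1, eps=1e-8):
    # FDM / secant estimate (9): h_i = (g_{i,t} - g_{i,t-1}) / (w_{i,t-1} - w_{i,t-2})
    h_fdm = (g - g_prev) / (w_prev - w_prev2 + eps)
    h_fdm = np.maximum(h_fdm, eps)                 # keep the curvature estimate positive (my addition)
    # Moving average of the inverse curvature, eq. (8)
    return alpha / h_fdm + (1 - alpha) * inv_h_bar

# The update then uses the averaged inverse curvature as an element-wise step:
# w_t = w_{t-1} - (eta / t) * inv_h_bar * g_t, cf. eq. (7).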
10. Hessian Approximation Finite Difference Method
AdaDelta
The update (5) can be reformulated as
    Bt = −(wt − wt−1) gt^−T.                                                          (10)
AdaDelta [2] approximates the Hessian by (10):
    hi,t := E[ −gi,τ / (wi,τ − wi,τ−1) ]_{τ=1}^{t} ≈ RMS[gi,t] / RMS[wi,t−1 − wi,t−2].  (11)
Here wi,t − wi,t−1 is not known at time t, so it is approximated by wi,t−1 − wi,t−2.
With numerical stability parameter ϵ (a sensitive parameter in practice):
    RMS[gt] := √( E[gτ²]_{τ=1}^{t} + ϵ )                                              (12)
    E[gτ²]_{τ=1}^{t} ≈ ḡt := α gt² + (1 − α) ḡt−1                                      (13)
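A minimal AdaDelta sketch following (5) and (11)-(13) (the hyperparameter values alpha and eps are my own choices, not from the slides):

import numpy as np

class AdaDelta:
    def __init__(self, dim, alpha=0.05, eps=1e-6):
        self.alpha, self.eps = alpha, eps
        self.g2 = np.zeros(dim)    # running average of g^2, eq. (13)
        self.dw2 = np.zeros(dim)   # running average of (w_t - w_{t-1})^2

    def step(self, w, g):
        self.g2 = self.alpha * g**2 + (1 - self.alpha) * self.g2
        # 1 / h_{i,t}  ~  RMS[w_{t-1} - w_{t-2}] / RMS[g], eq. (11)
        dw = -np.sqrt(self.dw2 + self.eps) / np.sqrt(self.g2 + self.eps) * g
        self.dw2 = self.alpha * dw**2 + (1 - self.alpha) * self.dw2
        return w + dw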
11. Hessian Approximation Finite Difference Method
VSGD: Quadratic Loss
Quadratic Approximation of Loss
Taylor expansion gives
    lt(w) ≈ lt(a) + (dlt(a)/dw)^T (w − a) + (1/2)(w − a)^T (d²lt(a)/dw²)(w − a),   ∀a ∈ R^n.
For ŵt = arg min_{w ∈ R^n} lt(w),
    lt(ŵt) = 0,   dlt(ŵt)/dw = 0.                                                     (14)
Then lt(w) can be locally approximated by the quadratic function
    lt(w) ≈ (1/2)(w − ŵt)^T (d²lt(ŵt)/dw²)(w − ŵt).                                   (15)
12. Hessian Approximation Finite Difference Method
VSGD: Noisy Quadratic Loss
Noisy Quadratic Approximation of Loss with Diagonal Hessian Approximation
Suppose ŵt ∼ N(ŵ, diag(σ1², . . . , σn²)):
    l^q_t(w) := (1/2)(w − ŵt)^T Ĥt (w − ŵt) = (1/2) ∑_{i=1}^{n} hi,t (wi − ŵi,t)²      (16)
SGD for l^q_t with element-wise learning rate ηi,t:
    wi,t = wi,t−1 − ηi,t dl^q_t(wt−1)/dwi
         = wi,t−1 − ηi,t hi,t (wi,t−1 − ŵi,t)
         = wi,t−1 − ηi,t hi,t (wi,t−1 − ŵi + ui,t),   ui,t ∼ N(0, σi²).               (17)
13. Hessian Approximation Finite Difference Method
VSGD: Adaptive Learning Rate
Greedy Optimal Learning Rate
VSGD (variance SGD) [3, 4] chooses the learning rate that minimizes the
conditional expectation of the loss function:
    ηi,t := arg min_η E[ l^q_t(wi,t) | wi,t−1 ]t
          = arg min_η E[ ( wi,t−1 − η hi,t (wi,t−1 − ŵi + ui) − ŵi,t )² ]t
          = (1/hi,t) · (wi,t−1 − ŵi)² / ( (wi,t−1 − ŵi)² + σi² )                      (18)
14. Hessian Approximation Finite Difference Method
VSGD: Variance Approximation
In practice, ŵ is unknown, so it is approximated by
    (wi,t−1 − ŵi)² = ( E[wi,τ − ŵi,τ]_{τ=1}^{t−1} )² = ( E[gi,τ]_{τ=1}^{t−1} )²         (19)
    (wi,t−1 − ŵi)² + σi² = E[ (wi,τ − ŵi,τ)² ]_{τ=1}^{t−1} = E[ gi,τ² ]_{τ=1}^{t−1}     (20)
where
    E[gi,τ]_{τ=1}^{t} ≈ ḡi,t := αi,t gi,t + (1 − αi,t) ḡi,t−1                          (21)
    E[gi,τ²]_{τ=1}^{t} ≈ v̄i,t := αi,t gi,t² + (1 − αi,t) v̄i,t−1.                       (22)
15. Hessian Approximation Finite Difference Method
VSGD: Hessian Approximation
The Hessian is approximated by hi,t := E[hi,τ]_{τ=1}^{t}.
Two approximations of E[hi,τ]_{τ=1}^{t} based on FDM:
    (Scheme 1)  h̄i,t := αi,t ĥi,t + (1 − αi,t) h̄i,t−1                                 (23)
    (Scheme 2)  h̄i,t := E[(ĥi,t)²] / E[ĥi,t]                                           (24)
                E[(ĥi,t)²] ≈ vi,t := αi,t (ĥi,t)² + (1 − αi,t) vi,t−1
                E[ĥi,t]    ≈ mi,t := αi,t ĥi,t + (1 − αi,t) mi,t−1
where
    ĥi,t := ( gi,t − dlt(wi,t + ḡi,t)/dw ) / ḡi,t.                                     (25)
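Putting (18), (21), (22) and Scheme 1 (23) together gives roughly the following element-wise update (a sketch; the published VSGD also adapts the memory αi,t, which I fix to a constant here, and h_hat is an instantaneous curvature estimate such as (25), assumed to be supplied):

import numpy as np

class VSGD:
    def __init__(self, dim, alpha=0.1, eps=1e-8):
        self.alpha, self.eps = alpha, eps
        self.g_bar = np.zeros(dim)   # E[g],   eq. (21)
        self.v_bar = np.ones(dim)    # E[g^2], eq. (22)
        self.h_bar = np.ones(dim)    # E[h],   Scheme 1, eq. (23)

    def step(self, w, g, h_hat):
        self.g_bar = self.alpha * g + (1 - self.alpha) * self.g_bar
        self.v_bar = self.alpha * g**2 + (1 - self.alpha) * self.v_bar
        self.h_bar = self.alpha * h_hat + (1 - self.alpha) * self.h_bar
        # Greedy optimal learning rate, eq. (18): eta = E[g]^2 / (h * E[g^2])
        eta = self.g_bar**2 / (self.h_bar * self.v_bar + self.eps)
        return w - eta * g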
17. Hessian Approximation Root Mean Square
RMS: AdaGrad, RMSprop, Adam
AdaGrad [14], RMSProp [15], and Adam [16] can be summarized as
    Bt = ηt Rt                                                                        (27)
    Rt := diag(1/r1,t, . . . , 1/rn,t)                                                (28)
    ri,t := RMS[gi,t] = √( E[gi,t²] )                                                 (29)
The approximation of the expectation and the learning rate differ:
    (AdaGrad)  E[gi,t²] ≈ (1/t) ∑_{τ=1}^{t} gi,τ²,        ηt = η/√t                   (30)
    (RMSProp)  E[gi,t²] ≈ v̄i,t,                           ηt = η                      (31)
    (Adam)     E[gi,t²] ≈ v̂i,t := v̄i,t / (1 − α^t),       ηt = η / (1 − γ^t)          (32)
where v̄i,t := α gi,t² + (1 − α) v̄i,t−1.
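A sketch unifying (27)-(32) (the constants and the state dictionary are mine; I keep the moving averages with the decay rate on the previous value, the common Adam convention, so the bias factors come out as (1 − γ^t) and (1 − α^t)):

import numpy as np

def adagrad_step(w, g, state, eta=0.01, eps=1e-8):
    # Running sum of squared gradients; with the average in (30) and eta_t = eta/sqrt(t)
    # this is equivalent to dividing by the square root of the running sum.
    state["sum_g2"] = state.get("sum_g2", 0.0) + g**2
    return w - eta * g / (np.sqrt(state["sum_g2"]) + eps)

def rmsprop_step(w, g, state, eta=0.001, alpha=0.1, eps=1e-8):
    state["v"] = alpha * g**2 + (1 - alpha) * state.get("v", 0.0)   # eq. (31)
    return w - eta * g / (np.sqrt(state["v"]) + eps)

def adam_step(w, g, state, eta=0.001, gamma=0.9, alpha=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = gamma * state.get("m", 0.0) + (1 - gamma) * g      # first moment (momentum)
    state["v"] = alpha * state.get("v", 0.0) + (1 - alpha) * g**2   # second moment
    m_hat = state["m"] / (1 - gamma**t)   # moment bias correction, cf. (32)
    v_hat = state["v"] / (1 - alpha**t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps)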
18. Hessian Approximation Root Mean Square
Adam
Moment Bias
For the momentum sequence
    gt := γ gt−1 + (1 − γ)(d/dw) lt(wt−1) = (1 − γ) ∑_{i=1}^{t} γ^{t−i} (d/dw) li(wi−1),
the expectation of gt includes the moment bias (1 − γ^t):
    E[gt]t = E[ (1 − γ) ∑_{i=1}^{t} γ^{t−i} (d/dw) li(wi−1) ]t
           = E[ (d/dw) lt(wt−1) ]t (1 − γ) ∑_{i=1}^{t} γ^{t−i}
           = E[ (d/dw) lt(wt−1) ]t (1 − γ^t).
Adam's learning rate is designed to correct this moment bias.
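A quick numerical check of the bias factor (a toy snippet; I simply pretend the true gradient is the constant 1 at every step):

gamma, t_max = 0.9, 20
g_bar = 0.0
for t in range(1, t_max + 1):
    grad = 1.0                          # pretend E[(d/dw) lt(wt-1)] = 1 at every step
    g_bar = gamma * g_bar + (1 - gamma) * grad
    # g_bar equals (1 - gamma**t): the momentum underestimates the gradient early on,
    # and dividing by (1 - gamma**t) removes the bias exactly.
    assert abs(g_bar - (1 - gamma**t)) < 1e-12
print("bias factor at t = 1, 5, 20:", 1 - gamma**1, 1 - gamma**5, 1 - gamma**20)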
19. Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton
Relation to the Gauss-Newton
Gauss-Newton is an approximation of the Hessian that is limited to the squared loss function.
Extended Gauss-Newton [6] is an extension of Gauss-Newton:
1 Applicable to any loss function.
Multilayer Perceptron Model
The loss function (3) can be seen as
    L(ŷ) = E[lt(ŷt)]t                                                                 (33)
where
    Activation : ŷt = M(zt)                                                           (1)
    Output     : zt = Nw(xt).                                                         (2)
20. Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton: Hessian Derivation
The Hessian of (33) is
    H = d²L(ŷ)/dw²
      = (d/dw){ (dz/dw)( (dŷ/dz)(dL(ŷ)/dŷ) ) }
      = (dz/dw)(d/dw)( (dŷ/dz)(dL(ŷ)/dŷ) )^T + (d²z/dw²)( (dŷ/dz)(dL(ŷ)/dŷ) )
      = (dz/dw)(dz/dw)^T (d/dz)( (dŷ/dz)(dL(ŷ)/dŷ) ) + (d²z/dw²)( (dŷ/dz)(dL(ŷ)/dŷ) )
      = JN JN^T (d/dz)( (dŷ/dz)(dL(ŷ)/dŷ) ) + (d²z/dw²)( (dŷ/dz)(dL(ŷ)/dŷ) )          (34)
where JN = dz/dw is the Jacobian.
21. Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton: General Case and Regression
Extended Gauss-Newton
Ignoring the second-order derivative term in (34),
    HGN = JN JN^T (d/dz)( (dŷ/dz)(dL(ŷ)/dŷ) ).                                        (35)
Regression: Since dŷ/dz = 1,
    (d/dz)( (dŷ/dz)(dL(ŷ)/dŷ) ) = Et[ (d/dz)(d/dŷ) lt(ŷ) ] = Et[ (d/dz)(ŷ − yt) ] = 1.  (36)
The extended Gauss-Newton approximation is
    HGN = JN JN^T.                                                                    (37)
22. Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton: Classification
    (d/dz)( (dŷ/dz)(dL(ŷ)/dŷ) ) = Et[ (d/dz)( (dŷ/dz)(d/dŷ) lt(ŷ) ) ]
                                 = −Et[ (d/dz)( (ŷ − ŷ²)( yt/ŷ − (1 − yt)/(1 − ŷ) ) ) ]
                                 = Et[ (d/dz)(ŷ − yt) ]
                                 = ŷ − ŷ²                                             (38)
where
    dŷ/dz = e^−z / (1 + e^−z)² = 1/(1 + e^−z) − 1/(1 + e^−z)² = ŷ − ŷ²
    (d/dŷ) lt(ŷ) = −( yt/ŷ − (1 − yt)/(1 − ŷ) ).
The extended Gauss-Newton approximation is
    HGN = (ŷ − ŷ²) JN JN^T.                                                           (39)
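For the linear classification model (z = w^T x, so JN = x), (39) gives a simple diagonal curvature that can be plugged into the Newton-type update (5); a minimal sketch (the variable names, the step size eta, and eps are my own additions):

import numpy as np

def gauss_newton_step(w, x, y, eta=0.1, eps=1e-8):
    # Logistic model: y_hat = sigmoid(w^T x), JN = dz/dw = x
    y_hat = 1.0 / (1.0 + np.exp(-(w @ x)))
    g = (y_hat - y) * x                      # gradient of the cross-entropy loss
    h_gn = (y_hat - y_hat**2) * x**2         # diagonal of (y_hat - y_hat^2) JN JN^T, eq. (39)
    return w - eta * g / (h_gn + eps)        # diagonal Newton-type step as in (5)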
24. Natural Gradient
Natural Gradient
Fisher Information Matrix
Suppose the observation y is sampled via
    y ∼ p(ŷ).                                                                         (42)
Then the Fisher information matrix becomes
    F = (d log p(y|ŷ)/dw)(d log p(y|ŷ)/dw)^T.                                         (43)
Natural Gradient
    wt = wt−1 − Bt gt,   Bt = F^−1                                                    (44)
25. Natural Gradient
Natural Gradient: Regression
Regression: Suppose a Gaussian distribution,
    p(y|ŷ, σ) = (1/√(2πσ²)) e^{−(ŷ−y)²/(2σ²)}.                                        (45)
Fisher information matrix:
    F = ( (ŷ − y)²/σ⁴ ) JN JN^T                                                       (46)
where
    d log p(y|ŷ, σ)/dw = (d/dw){ −(ŷ − y)²/(2σ²) }
                       = −((ŷ − y)/σ²)(dŷ/dw)
                       = −((ŷ − y)/σ²)(dz/dw)(dŷ/dz)
                       = −((ŷ − y)/σ²) JN.                                            (47)
26. Natural Gradient
Natural Gradient: Classification
Classification: Suppose a binomial distribution,
    p(y|ŷ) = ŷ^y (1 − ŷ)^{1−y}.                                                       (48)
Fisher information matrix:
    F = (y − ŷ)² JN JN^T                                                              (49)
where
    d log p(y|ŷ)/dw = (dz/dw)(dŷ/dz)(d/dŷ){ y log ŷ + (1 − y) log(1 − ŷ) }
                    = JN (dŷ/dz)( y/ŷ − (1 − y)/(1 − ŷ) )
                    = JN (1 − ŷ)ŷ ( y/ŷ − (1 − y)/(1 − ŷ) )
                    = JN (y − ŷ).                                                     (50)
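For the same linear model (JN = x), the Fisher matrix (49) is rank one, so a damped natural-gradient step can be sketched as follows (the damping lam and the step size eta are my additions for numerical stability, not part of the slides):

import numpy as np

def natural_gradient_step(w, x, y, eta=0.1, lam=1e-3):
    # Logistic model: y_hat = sigmoid(w^T x), JN = x
    y_hat = 1.0 / (1.0 + np.exp(-(w @ x)))
    g = (y_hat - y) * x                          # instantaneous gradient
    F = (y - y_hat)**2 * np.outer(x, x)          # Fisher matrix, eq. (49)
    F += lam * np.eye(len(w))                    # damping so F is invertible
    return w - eta * np.linalg.solve(F, g)       # w_t = w_{t-1} - eta * F^{-1} g_t, cf. (44)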
27. Natural Gradient
Natural Gradient: Multi-class Classification
Multi-class Classification: Suppose a multinomial distribution,
    p(y1, . . . , yk | ŷ1, . . . , ŷk) = ∏_{i=1}^{k} ŷi^{yi}.                         (51)
Fisher information matrix:
    Fi = { (1 − ŷi) yi }² JN,i JN,i^T                                                 (52)
where
    d log p(y1, . . . , yk | ŷ1, . . . , ŷk)/dwi = (dzi/dwi)(dŷi/dzi)(d/dŷi) ∑_{i=1}^{k} yi log(ŷi)   (53)
                                                 = JN,i { (ŷi − ŷi²)(yi/ŷi) } = JN,i (1 − ŷi) yi.     (54)
28. Experiment
Experiment: Settings
Data
Regression: Synthetic data, 1000 features.
Classification: MNIST (handwritten digits), 784 features, digit labels 0–9.
Hyperparameters
Grid search: employ the best-performing hyperparameters found over
1 Learning rate.
2 Weight of the RMS moving average.
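The search can be run with a simple loop like the following (a hypothetical sketch; train_and_evaluate and the grid values are placeholders, not taken from the slides):

import itertools

learning_rates = [1e-3, 1e-2, 1e-1]
rms_weights = [0.01, 0.05, 0.1]              # weight alpha of the RMS moving average

best = None
for eta, alpha in itertools.product(learning_rates, rms_weights):
    # train_and_evaluate is a placeholder for running one optimizer configuration
    # and returning the evaluation metric (MSE or misclassification rate).
    score = train_and_evaluate(eta, alpha)
    if best is None or score < best[0]:
        best = (score, eta, alpha)
print("best hyperparameters:", best)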
Evaluation
Regression: Mean square error.
Classification: Misclassification rate.
29. Experiment
Experiment: Regression
Figure: Regression error (mean square error) vs. iteration (0 to 25000) for SGD, AdaGrad, RMSprop, Adam, AdaDelta, and VSGD.
VSGD performs best even though it is tuning-free.
30. Experiment
Experiment: Classification
Figure: Classification error (misclassification rate) vs. iteration (0 to 60000) for SGD, VSGD, AdaGrad, RMSprop, Adam, and AdaDelta.
AdaDelta performs best even though it is tuning-free.
31. Conclusion
Conclusion & Future Work
Conclusion
Summarized stochastic optimizers from the viewpoint of their metric:
quasi-Newton type, RMS type, and natural gradient type.
Conducted brief experiments (classification and regression).
Observed the efficacy of tuning-free algorithms.
32. References
[1] Antoine Bordes, Léon Bottou, and Patrick Gallinari,
    "SGD-QN: Careful quasi-Newton stochastic gradient descent,"
    Journal of Machine Learning Research, vol. 10, pp. 1737–1754, 2009.
[2] Matthew D. Zeiler,
    "ADADELTA: An adaptive learning rate method,"
    arXiv preprint arXiv:1212.5701, 2012.
[3] Tom Schaul, Sixin Zhang, and Yann LeCun,
    "No more pesky learning rates,"
    in International Conference on Machine Learning, 2013, pp. 343–351.
[4] Tom Schaul and Yann LeCun,
    "Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients,"
    in International Conference on Learning Representations, Scottsdale, AZ, 2013.
[5] Oriol Vinyals and Daniel Povey,
    "Krylov subspace descent for deep learning,"
    in Artificial Intelligence and Statistics, 2012, pp. 1261–1268.
33. References
[6] Nicol N. Schraudolph,
    "Fast curvature matrix-vector products for second-order gradient descent,"
    Neural Computation, vol. 14, no. 7, pp. 1723–1738, 2002.
[7] James Martens,
    "Deep learning via Hessian-free optimization,"
    in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 735–742.
[8] Richard H. Byrd, Samantha L. Hansen, Jorge Nocedal, and Yoram Singer,
    "A stochastic quasi-Newton method for large-scale optimization,"
    SIAM Journal on Optimization, vol. 26, no. 2, pp. 1008–1031, 2016.
[9] Nicol N. Schraudolph, Jin Yu, and Simon Günter,
    "A stochastic quasi-Newton method for online convex optimization,"
    in Artificial Intelligence and Statistics, 2007, pp. 436–443.
[10] Aryan Mokhtari and Alejandro Ribeiro,
    "RES: Regularized stochastic BFGS algorithm,"
    IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6089–6104, 2014.
34. References
[11] Shun-Ichi Amari,
    "Natural gradient works efficiently in learning,"
    Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
[12] Nicolas L. Roux, Pierre-Antoine Manzagol, and Yoshua Bengio,
    "Topmoumoute online natural gradient algorithm,"
    in Advances in Neural Information Processing Systems, 2008, pp. 849–856.
[13] Nicolas L. Roux and Andrew W. Fitzgibbon,
    "A fast natural Newton method,"
    in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 623–630.
[14] John Duchi, Elad Hazan, and Yoram Singer,
    "Adaptive subgradient methods for online learning and stochastic optimization,"
    Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
35. References
[15] Tijmen Tieleman and Geoffrey Hinton,
    "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,"
    COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
[16] Diederik Kingma and Jimmy Ba,
    "Adam: A method for stochastic optimization,"
    in International Conference on Learning Representations, 2015, pp. 1–13.