In this talk, I explain several major stochastic optimizers from the perspective of the metric, that is, how the parameter space of the model is defined.
2017-07, Research Seminar at Keio University, Metric Perspective of Stochastic Optimizers
1. Metric Perspective of Stochastic Optimizers
Asahi Ushio
E-mail: ushio@ykw.elec.keio.ac.jp
Dept. Electronics and Electrical Engineering, Keio University, Japan
Internal Research Seminar
July 19, 2017
Note: the author has moved to Cogent Labs as a researcher from April 2018.
5. Preliminaries
Problem Setting
Model: For an input xt ∈ R^n, the output ŷt ∈ R is derived by
    Activation : ŷt = M(zt)            (1)
    Output     : zt = Nw(xt) ∈ R.      (2)
Loss: With instantaneous loss function lt(w) of parameter w,
    L(w) = E[lt(w)]t.                  (3)
Problem Setting       | M(z)                        | Nw(x)   | lt(ŷt)                                   | yt
Regression            | z                           | w^T x   | ||ŷt − yt||₂²                            | yt ∈ R
Classification        | 1 / (1 + e^−z)              | w^T x   | −[yt log(ŷt) + (1 − yt) log(1 − ŷt)]     | yt ∈ {0, 1}
Multi-Classification  | e^{zi} / ∑_{i=1}^{k} e^{zi} | wi^T x  | −∑_{i=1}^{k} yt,i log(ŷi)                | yt,i ∈ {0, 1}
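As a concrete illustration, the three rows of the table can be written as short NumPy functions (a minimal sketch; the function and variable names are mine, not from the slides):

import numpy as np

def regression_loss(w, x, y):
    # Nw(x) = w^T x, M(z) = z, squared loss
    y_hat = w @ x
    return (y_hat - y) ** 2

def classification_loss(w, x, y):
    # M(z) = sigmoid(z), cross-entropy loss, y in {0, 1}
    z = w @ x
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multi_classification_loss(W, x, y):
    # M(z)_i = softmax(z)_i with z_i = w_i^T x, y is a one-hot vector
    z = W @ x
    y_hat = np.exp(z - z.max())
    y_hat /= y_hat.sum()
    return -np.sum(y * np.log(y_hat))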
6. Preliminaries
Brief Illustration of Model
Figure: Neural Network (input → hidden layer → output activation)
Figure: Linear Model (input → output activation)
7. Preliminaries
Stochastic Optimization
Purpose: Find ŵ = arg min_w L(w) by stochastic approximation.
SGD (Stochastic Gradient Descent) and Variants
    wt = wt−1 − ηt gt,   ηt ∈ R  s.t.  lim_{t→∞} ηt = 0  and  lim_{t→∞} ∑_{i=1}^{t} ηi = ∞    (4)

    Vanilla SGD : gt = (d/dw) lt(wt−1)
    Momentum    : gt = γ gt−1 + (1 − γ)(d/dw) lt(wt−1),   γ ∈ R
    NAG         : gt = γ gt−1 + (1 − γ)(d/dw) lt(wt−1 − γ gt−1)
    Minibatch   : gt = (1/I) ∑_{i=0}^{I−1} (d/dw) lt−i(wt−1),   I ∈ N.
Suppose lt(w) is differentiable.
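The four variants can be sketched in NumPy as follows (a minimal sketch with my own naming; grad(w, t) stands for (d/dw) lt(w) and eta is a step-size schedule, both assumed to be given):

import numpy as np

def sgd_step(w, grad, t, eta):
    # Vanilla SGD: w_t = w_{t-1} - eta_t * g_t
    return w - eta(t) * grad(w, t)

def momentum_step(w, g_prev, grad, t, eta, gamma=0.9):
    # Momentum: g_t = gamma * g_{t-1} + (1 - gamma) * dl_t/dw(w_{t-1})
    g = gamma * g_prev + (1 - gamma) * grad(w, t)
    return w - eta(t) * g, g

def nag_step(w, g_prev, grad, t, eta, gamma=0.9):
    # NAG: gradient is evaluated at the look-ahead point w_{t-1} - gamma * g_{t-1}
    g = gamma * g_prev + (1 - gamma) * grad(w - gamma * g_prev, t)
    return w - eta(t) * g, g

def minibatch_step(w, grad, t, eta, I=32):
    # Minibatch: average the last I instantaneous gradients at w_{t-1}
    g = np.mean([grad(w, t - i) for i in range(I)], axis=0)
    return w - eta(t) * g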
8. Hessian Approximation
Quasi-Newton Method
The Newton method employs the Hessian matrix:
    wt = wt−1 − Bt gt                                    (5)
    Bt = Ht^−1,   Ht = d²L(wt−1)/dw².
Quasi-Newton methods employ a Hessian approximation Ĥt instead of Ht:
1 FDM (Finite Difference Method): SGD-QN [1], AdaDelta [2], VSGD [3, 4]
2 Extended Gauss-Newton Approximation
3 L-BFGS
Diagonal approximation is often used in stochastic optimization:
Ĥt = diag (h1,t . . . hn,t) (6)
9. Hessian Approximation Finite Difference Method
SGD-QN
SGD-QN [1] employs an instantaneous estimator of the Hessian:
    1/hi,t := (η/t) E[1/h^FDM_{i,τ}]_{τ=1}^{t}                                        (7)
    E[1/h^FDM_{i,τ}]_{τ=1}^{t} ≈ 1/h̄i,t := α (1/h^FDM_{i,t}) + (1 − α)(1/h̄i,t−1)      (8)
    h^FDM_{i,t} := (gi,t − gi,t−1) / (wi,t−1 − wi,t−2)                                (9)
The FDM approximation (9) is called the secant condition in the quasi-Newton context.
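A rough sketch of the diagonal estimate (7)-(9) (my own variable names; the full SGD-QN schedule in [1] has additional machinery such as skipping updates, which I omit):

import numpy as np

def sgdqn_inverse_hessian(g, g_prev, w_prev, w_prev2, inv_h_bar, alpha=0.1, eps=1e-8):
    # FDM / secant estimate (9): h_i = (g_{i,t} - g_{i,t-1}) / (w_{i,t-1} - w_{i,t-2})
    h_fdm = (g - g_prev) / (w_prev - w_prev2 + eps)
    h_fdm = np.maximum(h_fdm, eps)                 # keep the curvature estimate positive (my addition)
    # Moving average of the inverse curvature, eq. (8)
    return alpha / h_fdm + (1 - alpha) * inv_h_bar

# The update then uses the averaged inverse curvature as an element-wise step:
# w_t = w_{t-1} - (eta / t) * inv_h_bar * g_t, cf. eq. (7).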
10. Hessian Approximation Finite Difference Method
AdaDelta
The update (5) can be reformulated as
    Bt = −(wt − wt−1) gt^−T.                                                          (10)
AdaDelta [2] approximates the Hessian by (10):
    hi,t := E[ −gi,τ / (wi,τ − wi,τ−1) ]_{τ=1}^{t} ≈ RMS[gi,t] / RMS[wi,t−1 − wi,t−2].  (11)
Here wi,t − wi,t−1 is not known at time t, so it is approximated by wi,t−1 − wi,t−2.
With numerical stability parameter ϵ (a sensitive parameter in practice):
    RMS[gt] := √( E[gτ²]_{τ=1}^{t} + ϵ )                                              (12)
    E[gτ²]_{τ=1}^{t} ≈ ḡt := α gt² + (1 − α) ḡt−1                                      (13)
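A minimal AdaDelta sketch following (5) and (11)-(13) (the hyperparameter values alpha and eps are my own choices, not from the slides):

import numpy as np

class AdaDelta:
    def __init__(self, dim, alpha=0.05, eps=1e-6):
        self.alpha, self.eps = alpha, eps
        self.g2 = np.zeros(dim)    # running average of g^2, eq. (13)
        self.dw2 = np.zeros(dim)   # running average of (w_t - w_{t-1})^2

    def step(self, w, g):
        self.g2 = self.alpha * g**2 + (1 - self.alpha) * self.g2
        # 1 / h_{i,t}  ~  RMS[w_{t-1} - w_{t-2}] / RMS[g], eq. (11)
        dw = -np.sqrt(self.dw2 + self.eps) / np.sqrt(self.g2 + self.eps) * g
        self.dw2 = self.alpha * dw**2 + (1 - self.alpha) * self.dw2
        return w + dw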
11. Hessian Approximation Finite Difference Method
VSGD: Quadratic Loss
Quadratic Approximation of Loss
Taylor expansion gives
    lt(w) ≈ lt(a) + (dlt(a)/dw)^T (w − a) + (1/2)(w − a)^T (d²lt(a)/dw²)(w − a),   ∀a ∈ R^n.
For ŵt = arg min_{w ∈ R^n} lt(w),
    lt(ŵt) = 0,   dlt(ŵt)/dw = 0.                                                     (14)
Then lt(w) can be locally approximated by the quadratic function
    lt(w) ≈ (1/2)(w − ŵt)^T (d²lt(ŵt)/dw²)(w − ŵt).                                   (15)
12. Hessian Approximation Finite Difference Method
VSGD: Noisy Quadratic Loss
Noisy Quadratic Approximation of Loss with Diagonal Hessian Approximation
Suppose ŵt ∼ N(ŵ, diag(σ1², . . . , σn²)):
    l^q_t(w) := (1/2)(w − ŵt)^T Ĥt (w − ŵt) = (1/2) ∑_{i=1}^{n} hi,t (wi − ŵi,t)²      (16)
SGD for l^q_t with element-wise learning rate ηi,t:
    wi,t = wi,t−1 − ηi,t dl^q_t(wt−1)/dwi
         = wi,t−1 − ηi,t hi,t (wi,t−1 − ŵi,t)
         = wi,t−1 − ηi,t hi,t (wi,t−1 − ŵi + ui,t),   ui,t ∼ N(0, σi²).               (17)
13. Hessian Approximation Finite Difference Method
VSGD: Adaptive Learning Rate
Greedy Optimal Learning Rate
VSGD (variance SGD) [3, 4] chooses the learning rate that minimizes the
conditional expectation of the loss function:
    ηi,t := arg min_η E[ l^q_t(wi,t) | wi,t−1 ]t
          = arg min_η E[ ( wi,t−1 − η hi,t (wi,t−1 − ŵi + ui) − ŵi,t )² ]t
          = (1/hi,t) · (wi,t−1 − ŵi)² / ( (wi,t−1 − ŵi)² + σi² )                      (18)
14. Hessian Approximation Finite Difference Method
VSGD: Variance Approximation
In practice, ŵ is unknown, so it is approximated by
    (wi,t−1 − ŵi)² = ( E[wi,τ − ŵi,τ]_{τ=1}^{t−1} )² = ( E[gi,τ]_{τ=1}^{t−1} )²         (19)
    (wi,t−1 − ŵi)² + σi² = E[ (wi,τ − ŵi,τ)² ]_{τ=1}^{t−1} = E[ gi,τ² ]_{τ=1}^{t−1}     (20)
where
    E[gi,τ]_{τ=1}^{t} ≈ ḡi,t := αi,t gi,t + (1 − αi,t) ḡi,t−1                          (21)
    E[gi,τ²]_{τ=1}^{t} ≈ v̄i,t := αi,t gi,t² + (1 − αi,t) v̄i,t−1.                       (22)
15. Hessian Approximation Finite Difference Method
VSGD: Hessian Approximation
The Hessian is approximated by hi,t := E[hi,τ]_{τ=1}^{t}.
Two approximations of E[hi,τ]_{τ=1}^{t} based on FDM:
    (Scheme 1)  h̄i,t := αi,t ĥi,t + (1 − αi,t) h̄i,t−1                                 (23)
    (Scheme 2)  h̄i,t := E[(ĥi,t)²] / E[ĥi,t]                                           (24)
                E[(ĥi,t)²] ≈ vi,t := αi,t (ĥi,t)² + (1 − αi,t) vi,t−1
                E[ĥi,t]    ≈ mi,t := αi,t ĥi,t + (1 − αi,t) mi,t−1
where
    ĥi,t := ( gi,t − dlt(wi,t + ḡi,t)/dw ) / ḡi,t.                                     (25)
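Putting (18), (21), (22) and Scheme 1 (23) together gives roughly the following element-wise update (a sketch; the published VSGD also adapts the memory αi,t, which I fix to a constant here, and h_hat is an instantaneous curvature estimate such as (25), assumed to be supplied):

import numpy as np

class VSGD:
    def __init__(self, dim, alpha=0.1, eps=1e-8):
        self.alpha, self.eps = alpha, eps
        self.g_bar = np.zeros(dim)   # E[g],   eq. (21)
        self.v_bar = np.ones(dim)    # E[g^2], eq. (22)
        self.h_bar = np.ones(dim)    # E[h],   Scheme 1, eq. (23)

    def step(self, w, g, h_hat):
        self.g_bar = self.alpha * g + (1 - self.alpha) * self.g_bar
        self.v_bar = self.alpha * g**2 + (1 - self.alpha) * self.v_bar
        self.h_bar = self.alpha * h_hat + (1 - self.alpha) * self.h_bar
        # Greedy optimal learning rate, eq. (18): eta = E[g]^2 / (h * E[g^2])
        eta = self.g_bar**2 / (self.h_bar * self.v_bar + self.eps)
        return w - eta * g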
17. Hessian Approximation Root Mean Square
RMS: AdaGrad, RMSprop, Adam
AdaGrad [14], RMSProp [15], and Adam [16] can be summarized as
    Bt = ηt Rt                                                                        (27)
    Rt := diag(1/r1,t, . . . , 1/rn,t)                                                (28)
    ri,t := RMS[gi,t] = √( E[gi,t²] )                                                 (29)
The approximation of the expectation and the learning rate differ:
    (AdaGrad)  E[gi,t²] ≈ (1/t) ∑_{τ=1}^{t} gi,τ²,        ηt = η/√t                   (30)
    (RMSProp)  E[gi,t²] ≈ v̄i,t,                           ηt = η                      (31)
    (Adam)     E[gi,t²] ≈ v̂i,t := v̄i,t / (1 − α^t),       ηt = η / (1 − γ^t)          (32)
where v̄i,t := α gi,t² + (1 − α) v̄i,t−1.
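A sketch unifying (27)-(32) (the constants and the state dictionary are mine; I keep the moving averages with the decay rate on the previous value, the common Adam convention, so the bias factors come out as (1 − γ^t) and (1 − α^t)):

import numpy as np

def adagrad_step(w, g, state, eta=0.01, eps=1e-8):
    # Running sum of squared gradients; with the average in (30) and eta_t = eta/sqrt(t)
    # this is equivalent to dividing by the square root of the running sum.
    state["sum_g2"] = state.get("sum_g2", 0.0) + g**2
    return w - eta * g / (np.sqrt(state["sum_g2"]) + eps)

def rmsprop_step(w, g, state, eta=0.001, alpha=0.1, eps=1e-8):
    state["v"] = alpha * g**2 + (1 - alpha) * state.get("v", 0.0)   # eq. (31)
    return w - eta * g / (np.sqrt(state["v"]) + eps)

def adam_step(w, g, state, eta=0.001, gamma=0.9, alpha=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = gamma * state.get("m", 0.0) + (1 - gamma) * g      # first moment (momentum)
    state["v"] = alpha * state.get("v", 0.0) + (1 - alpha) * g**2   # second moment
    m_hat = state["m"] / (1 - gamma**t)   # moment bias correction, cf. (32)
    v_hat = state["v"] / (1 - alpha**t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps)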
18. Hessian Approximation Root Mean Square
Adam
Moment Bias
For the momentum sequence
    gt := γ gt−1 + (1 − γ)(d/dw) lt(wt−1) = (1 − γ) ∑_{i=1}^{t} γ^{t−i} (d/dw) li(wi−1),
the expectation of gt includes the moment bias (1 − γ^t):
    E[gt]t = E[ (1 − γ) ∑_{i=1}^{t} γ^{t−i} (d/dw) li(wi−1) ]t
           = E[ (d/dw) lt(wt−1) ]t (1 − γ) ∑_{i=1}^{t} γ^{t−i}
           = E[ (d/dw) lt(wt−1) ]t (1 − γ^t).
Adam's learning rate is designed to correct this moment bias.
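A quick numerical check of the bias factor (a toy snippet; I simply pretend the true gradient is the constant 1 at every step):

gamma, t_max = 0.9, 20
g_bar = 0.0
for t in range(1, t_max + 1):
    grad = 1.0                          # pretend E[(d/dw) lt(wt-1)] = 1 at every step
    g_bar = gamma * g_bar + (1 - gamma) * grad
    # g_bar equals (1 - gamma**t): the momentum underestimates the gradient early on,
    # and dividing by (1 - gamma**t) removes the bias exactly.
    assert abs(g_bar - (1 - gamma**t)) < 1e-12
print("bias factor at t = 1, 5, 20:", 1 - gamma**1, 1 - gamma**5, 1 - gamma**20)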
19. Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton
Relation to the Gauss-Newton
Gauss-Newton is an approximation of the Hessian that is limited to the squared loss function.
Extended Gauss-Newton [6] is an extension of Gauss-Newton:
1 Applicable to any loss function.
Multilayer Perceptron Model
The loss function (3) can be seen as
    L(ŷ) = E[lt(ŷt)]t                                                                 (33)
where
    Activation : ŷt = M(zt)                                                           (1)
    Output     : zt = Nw(xt).                                                         (2)
20. Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton: Hessian Derivation
The Hessian of (33) is
    H = d²L(ŷ)/dw²
      = (d/dw){ (dz/dw)( (dŷ/dz)(dL(ŷ)/dŷ) ) }
      = (dz/dw)(d/dw)( (dŷ/dz)(dL(ŷ)/dŷ) )^T + (d²z/dw²)( (dŷ/dz)(dL(ŷ)/dŷ) )
      = (dz/dw)(dz/dw)^T (d/dz)( (dŷ/dz)(dL(ŷ)/dŷ) ) + (d²z/dw²)( (dŷ/dz)(dL(ŷ)/dŷ) )
      = JN JN^T (d/dz)( (dŷ/dz)(dL(ŷ)/dŷ) ) + (d²z/dw²)( (dŷ/dz)(dL(ŷ)/dŷ) )          (34)
where JN = dz/dw is the Jacobian.
21. Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton: General Case and Regression
Extended Gauss-Newton
Ignoring the second-order derivative term in (34),
    HGN = JN JN^T (d/dz)( (dŷ/dz)(dL(ŷ)/dŷ) ).                                        (35)
Regression: Since dŷ/dz = 1,
    (d/dz)( (dŷ/dz)(dL(ŷ)/dŷ) ) = Et[ (d/dz)(d/dŷ) lt(ŷ) ] = Et[ (d/dz)(ŷ − yt) ] = 1.  (36)
The extended Gauss-Newton approximation is
    HGN = JN JN^T.                                                                    (37)
22. Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton: Classification
    (d/dz)( (dŷ/dz)(dL(ŷ)/dŷ) ) = Et[ (d/dz)( (dŷ/dz)(d/dŷ) lt(ŷ) ) ]
                                 = −Et[ (d/dz)( (ŷ − ŷ²)( yt/ŷ − (1 − yt)/(1 − ŷ) ) ) ]
                                 = Et[ (d/dz)(ŷ − yt) ]
                                 = ŷ − ŷ²                                             (38)
where
    dŷ/dz = e^−z / (1 + e^−z)² = 1/(1 + e^−z) − 1/(1 + e^−z)² = ŷ − ŷ²
    (d/dŷ) lt(ŷ) = −( yt/ŷ − (1 − yt)/(1 − ŷ) ).
The extended Gauss-Newton approximation is
    HGN = (ŷ − ŷ²) JN JN^T.                                                           (39)
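For the linear classification model (z = w^T x, so JN = x), (39) gives a simple diagonal curvature that can be plugged into the Newton-type update (5); a minimal sketch (the variable names, the step size eta, and eps are my own additions):

import numpy as np

def gauss_newton_step(w, x, y, eta=0.1, eps=1e-8):
    # Logistic model: y_hat = sigmoid(w^T x), JN = dz/dw = x
    y_hat = 1.0 / (1.0 + np.exp(-(w @ x)))
    g = (y_hat - y) * x                      # gradient of the cross-entropy loss
    h_gn = (y_hat - y_hat**2) * x**2         # diagonal of (y_hat - y_hat^2) JN JN^T, eq. (39)
    return w - eta * g / (h_gn + eps)        # diagonal Newton-type step as in (5)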
24. Natural Gradient
Natural Gradient
Fisher Information Matrix
Suppose the observation y is sampled via
    y ∼ p(ŷ).                                                                         (42)
Then the Fisher information matrix becomes
    F = (d log p(y|ŷ)/dw)(d log p(y|ŷ)/dw)^T.                                         (43)
Natural Gradient
    wt = wt−1 − Bt gt,   Bt = F^−1                                                    (44)
25. Natural Gradient
Natural Gradient: Regression
Regression: Suppose a Gaussian distribution,
    p(y|ŷ, σ) = (1/√(2πσ²)) e^{−(ŷ−y)²/(2σ²)}.                                        (45)
Fisher information matrix:
    F = ( (ŷ − y)²/σ⁴ ) JN JN^T                                                       (46)
where
    d log p(y|ŷ, σ)/dw = (d/dw){ −(ŷ − y)²/(2σ²) }
                       = −((ŷ − y)/σ²)(dŷ/dw)
                       = −((ŷ − y)/σ²)(dz/dw)(dŷ/dz)
                       = −((ŷ − y)/σ²) JN.                                            (47)
26. Natural Gradient
Natural Gradient: Classification
Classification: Suppose a binomial distribution,
    p(y|ŷ) = ŷ^y (1 − ŷ)^{1−y}.                                                       (48)
Fisher information matrix:
    F = (y − ŷ)² JN JN^T                                                              (49)
where
    d log p(y|ŷ)/dw = (dz/dw)(dŷ/dz)(d/dŷ){ y log ŷ + (1 − y) log(1 − ŷ) }
                    = JN (dŷ/dz)( y/ŷ − (1 − y)/(1 − ŷ) )
                    = JN (1 − ŷ)ŷ ( y/ŷ − (1 − y)/(1 − ŷ) )
                    = JN (y − ŷ).                                                     (50)
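For the same linear model (JN = x), the Fisher matrix (49) is rank one, so a damped natural-gradient step can be sketched as follows (the damping lam and the step size eta are my additions for numerical stability, not part of the slides):

import numpy as np

def natural_gradient_step(w, x, y, eta=0.1, lam=1e-3):
    # Logistic model: y_hat = sigmoid(w^T x), JN = x
    y_hat = 1.0 / (1.0 + np.exp(-(w @ x)))
    g = (y_hat - y) * x                          # instantaneous gradient
    F = (y - y_hat)**2 * np.outer(x, x)          # Fisher matrix, eq. (49)
    F += lam * np.eye(len(w))                    # damping so F is invertible
    return w - eta * np.linalg.solve(F, g)       # w_t = w_{t-1} - eta * F^{-1} g_t, cf. (44)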
27. Natural Gradient
Natural Gradient: Multi-class Classification
Multi-class Classification: Suppose a multinomial distribution,
    p(y1, . . . , yk | ŷ1, . . . , ŷk) = ∏_{i=1}^{k} ŷi^{yi}.                         (51)
Fisher information matrix:
    Fi = { (1 − ŷi) yi }² JN,i JN,i^T                                                 (52)
where
    d log p(y1, . . . , yk | ŷ1, . . . , ŷk)/dwi = (dzi/dwi)(dŷi/dzi)(d/dŷi) ∑_{i=1}^{k} yi log(ŷi)   (53)
                                                 = JN,i { (ŷi − ŷi²)(yi/ŷi) } = JN,i (1 − ŷi) yi.     (54)
28. Experiment
Experiment: Settings
Data
Regression: Synthetic data, 1000 features.
Classification: MNIST (handwritten digits), 784 features, digit labels 0–9.
Hyperparameters
Grid search: employ the best-performing hyperparameters found over
1 Learning rate.
2 Weight of the RMS moving average.
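The search can be run with a simple loop like the following (a hypothetical sketch; train_and_evaluate and the grid values are placeholders, not taken from the slides):

import itertools

learning_rates = [1e-3, 1e-2, 1e-1]
rms_weights = [0.01, 0.05, 0.1]              # weight alpha of the RMS moving average

best = None
for eta, alpha in itertools.product(learning_rates, rms_weights):
    # train_and_evaluate is a placeholder for running one optimizer configuration
    # and returning the evaluation metric (MSE or misclassification rate).
    score = train_and_evaluate(eta, alpha)
    if best is None or score < best[0]:
        best = (score, eta, alpha)
print("best hyperparameters:", best)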
Evaluation
Regression: Mean square error.
Classification: Misclassification rate.
29. Experiment
Experiment: Regression
Figure: Regression error (mean square error) vs. iteration (0 to 25000) for SGD, AdaGrad, RMSprop, Adam, AdaDelta, and VSGD.
VSGD performs best even though it is tuning-free.
30. Experiment
Experiment: Classification
Figure: Classification error (misclassification rate) vs. iteration (0 to 60000) for SGD, VSGD, AdaGrad, RMSprop, Adam, and AdaDelta.
AdaDelta performs best even though it is tuning-free.
31. Conclusion
Conclusion & Future Work
Conclusion
Summarized stochastic optimizers from the viewpoint of their metric:
quasi-Newton type, RMS type, and natural gradient type.
Conducted brief experiments (classification and regression).
Observed the efficacy of tuning-free algorithms.
32. References
[1] Antoine Bordes, Léon Bottou, and Patrick Gallinari,
    "SGD-QN: Careful quasi-Newton stochastic gradient descent,"
    Journal of Machine Learning Research, vol. 10, pp. 1737–1754, 2009.
[2] Matthew D. Zeiler,
    "ADADELTA: An adaptive learning rate method,"
    arXiv preprint arXiv:1212.5701, 2012.
[3] Tom Schaul, Sixin Zhang, and Yann LeCun,
    "No more pesky learning rates,"
    in International Conference on Machine Learning, 2013, pp. 343–351.
[4] Tom Schaul and Yann LeCun,
    "Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients,"
    in International Conference on Learning Representations, Scottsdale, AZ, 2013.
[5] Oriol Vinyals and Daniel Povey,
    "Krylov subspace descent for deep learning,"
    in Artificial Intelligence and Statistics, 2012, pp. 1261–1268.
33. References
[6] Nicol N. Schraudolph,
    "Fast curvature matrix-vector products for second-order gradient descent,"
    Neural Computation, vol. 14, no. 7, pp. 1723–1738, 2002.
[7] James Martens,
    "Deep learning via Hessian-free optimization,"
    in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 735–742.
[8] Richard H. Byrd, Samantha L. Hansen, Jorge Nocedal, and Yoram Singer,
    "A stochastic quasi-Newton method for large-scale optimization,"
    SIAM Journal on Optimization, vol. 26, no. 2, pp. 1008–1031, 2016.
[9] Nicol N. Schraudolph, Jin Yu, and Simon Günter,
    "A stochastic quasi-Newton method for online convex optimization,"
    in Artificial Intelligence and Statistics, 2007, pp. 436–443.
[10] Aryan Mokhtari and Alejandro Ribeiro,
    "RES: Regularized stochastic BFGS algorithm,"
    IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6089–6104, 2014.
34. References
[11] Shun-Ichi Amari,
    "Natural gradient works efficiently in learning,"
    Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
[12] Nicolas L. Roux, Pierre-Antoine Manzagol, and Yoshua Bengio,
    "Topmoumoute online natural gradient algorithm,"
    in Advances in Neural Information Processing Systems, 2008, pp. 849–856.
[13] Nicolas L. Roux and Andrew W. Fitzgibbon,
    "A fast natural Newton method,"
    in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 623–630.
[14] John Duchi, Elad Hazan, and Yoram Singer,
    "Adaptive subgradient methods for online learning and stochastic optimization,"
    Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
35. References
[15] Tijmen Tieleman and Geoffrey Hinton,
    "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,"
    COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
[16] Diederik Kingma and Jimmy Ba,
    "Adam: A method for stochastic optimization,"
    in International Conference on Learning Representations, 2015, pp. 1–13.