Metric Perspective of Stochastic Optimizers
Asahi Ushio
E-mail: ushio@ykw.elec.keio.ac.jp
Dept. Electronics and Electrical Engineering, Keio University, Japan
Internal Research Seminar
July 19, 2017
Note: the author moved to Cogent Labs as a researcher in April 2018.
Overview
1 Preliminaries
2 Hessian Approximation
Finite Difference Method
Root Mean Square
Extended Gauss-Newton: Preliminaries
3 Natural Gradient
4 Experiment
5 Conclusion
6 References
Preliminaries
Outline
This talk explains stochastic optimizers in machine learning from the perspective of the metric each one uses.
Most stochastic optimizers can be divided into three metric types.
1 Quasi-Newton Method Type
1 Finite Difference Method (FDM): SGD-QN [1], AdaDelta [2], VSGD [3, 4]
2 Extended Gauss-Newton: KSD [5] , SMD [6], HF [7]
3 LBFGS: stochastic LBFGS [8, 9], RES [10]
2 Natural Gradient Type: Natural Gradient [11], TONGA [12, 13]
3 Root Mean Square (RMS) Type: AdaGrad [14], RMSprop [15], Adam [16]
Preliminaries
Overview of Stochastic Algorithms

Quasi-Newton (Hessian) Approximation
- FDM-based: SGD-QN, vSGD, AdaDelta
- BFGS-based: Online LBFGS, RES, Stochastic LBFGS
- Extended Gauss-Newton: Hessian-Free, KSD (Krylov Subspace Descent), SMD (Stochastic Meta Descent)

Natural Gradient (Fisher Information Matrix) Approximation
- TONGA
- Fast Natural Gradient

RMS
- AdaGrad
- RMSProp
- Adam

Proportionate Metric
- PNLMS
- IPNLMS
Preliminaries
Problem Setting
Model: For an input $x_t \in \mathbb{R}^n$, the output $\hat{y}_t \in \mathbb{R}$ is derived by

$$\text{Activation}: \; \hat{y}_t = M(z_t) \qquad (1)$$
$$\text{Output}: \; z_t = N_w(x_t) \in \mathbb{R}. \qquad (2)$$

Loss: With instantaneous loss function $l_t(w)$ of parameter $w$,

$$L(w) = \mathbb{E}\left[ l_t(w) \right]_t. \qquad (3)$$

| Problem Setting | $M(z)$ | $N_w(x)$ | $l_t(\hat{y}_t)$ | $y_t$ |
|---|---|---|---|---|
| Regression | $z$ | $w^T x$ | $\|\hat{y}_t - y_t\|_2^2$ | $y_t \in \mathbb{R}$ |
| Classification | $\frac{1}{1+e^{-z}}$ | $w^T x$ | $-\left[ y_t \log(\hat{y}_t) + (1-y_t)\log(1-\hat{y}_t) \right]$ | $y_t \in \{0,1\}$ |
| Multi-class Classification | $\frac{e^{z_i}}{\sum_{i=1}^{k} e^{z_i}}$ | $w_i^T x$ | $-\sum_{i=1}^{k} y_{t,i} \log(\hat{y}_i)$ | $y_{t,i} \in \{0,1\}$ |
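As a concrete sketch of the three settings in the table (illustrative code, not from the slides; the function names are assumptions), the models, losses, and their gradients can be written directly in NumPy:

```python
import numpy as np

def regression(w, x, y):
    """Linear regression: y_hat = w^T x, squared loss, gradient w.r.t. w."""
    y_hat = w @ x
    loss = (y_hat - y) ** 2
    grad = 2.0 * (y_hat - y) * x
    return y_hat, loss, grad

def classification(w, x, y):
    """Logistic regression: y_hat = sigmoid(w^T x), cross-entropy loss."""
    y_hat = 1.0 / (1.0 + np.exp(-(w @ x)))
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    grad = (y_hat - y) * x              # d loss / dw
    return y_hat, loss, grad

def multi_classification(W, x, y):
    """Softmax regression: y_hat = softmax(W x), cross-entropy loss."""
    z = W @ x
    z = z - z.max()                     # numerical stability
    y_hat = np.exp(z) / np.exp(z).sum()
    loss = -(y * np.log(y_hat)).sum()
    grad = np.outer(y_hat - y, x)       # d loss / dW, one row per class
    return y_hat, loss, grad
```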
Preliminaries
Brief Illustration of Model
Figure: Neural Network (input, hidden layer, output, activation).
Figure: Linear Model (input, output, activation).
Preliminaries
Stochastic Optimization
Purpose: Find $\hat{w} = \arg\min_{w} L(w)$ by stochastic approximation.

SGD (Stochastic Gradient Descent) and Variants

$$w_t = w_{t-1} - \eta_t g_t, \quad \eta_t \in \mathbb{R} \;\;\text{s.t.}\;\; \lim_{t\to\infty} \eta_t = 0 \;\;\text{and}\;\; \lim_{t\to\infty} \sum_{i=1}^{t} \eta_i = \infty \qquad (4)$$

$$\text{Vanilla SGD}: \; g_t = \frac{d}{dw} l_t(w_{t-1})$$
$$\text{Momentum}: \; g_t = \gamma g_{t-1} + (1-\gamma)\frac{d}{dw} l_t(w_{t-1}), \quad \gamma \in \mathbb{R}$$
$$\text{NAG}: \; g_t = \gamma g_{t-1} + (1-\gamma)\frac{d}{dw} l_t(w_{t-1} - \gamma g_{t-1})$$
$$\text{Minibatch}: \; g_t = \frac{1}{I}\sum_{i=0}^{I-1} \frac{d}{dw} l_{t-i}(w_{t-1}), \quad I \in \mathbb{N}.$$

Suppose $l_t(w)$ is differentiable.
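A minimal sketch of update (4) and the variants above, assuming a user-supplied stochastic-gradient callable `grad(w)` (a hypothetical helper, not defined in the slides):

```python
import numpy as np

def sgd(w0, grad, n_steps, eta0=0.1, gamma=0.9, variant="vanilla"):
    """Run vanilla SGD / momentum / NAG with eta_t = eta0 / t (one schedule satisfying (4))."""
    w = w0.copy()
    g = np.zeros_like(w0)                          # momentum buffer
    for t in range(1, n_steps + 1):
        eta = eta0 / t
        if variant == "vanilla":
            g = grad(w)
        elif variant == "momentum":
            g = gamma * g + (1 - gamma) * grad(w)
        elif variant == "nag":                     # gradient at the look-ahead point
            g = gamma * g + (1 - gamma) * grad(w - gamma * g)
        w = w - eta * g
    return w
```

The minibatch variant is obtained by letting `grad(w)` average the instantaneous gradient over $I$ samples.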
Hessian Approximation
Quasi-Newton Method
The Newton method employs the Hessian matrix:

$$w_t = w_{t-1} - B_t g_t \qquad (5)$$
$$B_t = H_t^{-1}, \quad H_t = \frac{d^2 L(w_{t-1})}{dw^2}.$$

Quasi-Newton methods employ a Hessian approximation $\hat{H}_t$ instead of $H_t$:
1 FDM (Finite Difference Method): SGD-QN [1], AdaDelta [2], VSGD [3, 4]
2 Extended Gauss-Newton Approximation
3 LBFGS

A diagonal approximation is often used in stochastic optimization:

$$\hat{H}_t = \mathrm{diag}\left( h_{1,t} \ldots h_{n,t} \right) \qquad (6)$$
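A one-line sketch of the preconditioned step (5) under the diagonal approximation (6); the curvature vector `h` and the guard `eps` are assumptions here, with `h` produced by any of the schemes on the following slides:

```python
import numpy as np

def diagonal_newton_step(w, g, h, eps=1e-8):
    """One step of w_t = w_{t-1} - B_t g_t with B_t = diag(1/h_1, ..., 1/h_n)."""
    return w - g / (h + eps)   # eps guards against division by tiny curvature
```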
Hessian Approximation Finite Difference Method
SGD-QN
SGD-QN [1] employs an instantaneous estimator of the Hessian:

$$\frac{1}{h_{i,t}} := \frac{\eta}{t}\, \mathbb{E}\left[ \frac{1}{h^{FDM}_{i,\tau}} \right]_{\tau=1}^{t} \qquad (7)$$
$$\mathbb{E}\left[ \frac{1}{h^{FDM}_{i,\tau}} \right]_{\tau=1}^{t} \approx \frac{1}{\bar{h}_{i,t}} := \alpha \frac{1}{h^{FDM}_{i,t}} + (1-\alpha)\frac{1}{\bar{h}_{i,t-1}} \qquad (8)$$
$$h^{FDM}_{i,t} := \frac{g_{i,t} - g_{i,t-1}}{w_{i,t-1} - w_{i,t-2}} \qquad (9)$$

The FDM approximation (9) is called the secant condition in the quasi-Newton method context.
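A sketch of the running secant-style estimate (8)-(9); the variable names and the `eps` guard against division by zero are assumptions, not the authors' implementation:

```python
import numpy as np

def fdm_inverse_curvature(g, g_prev, w_prev, w_prev2, inv_h_bar, alpha=0.1, eps=1e-8):
    """Element-wise running average of 1 / h^FDM, eqs. (8)-(9)."""
    dw = w_prev - w_prev2
    dg = g - g_prev
    inv_h = dw / (dg + eps)                     # 1 / h^FDM_{i,t}, the secant estimate
    return alpha * inv_h + (1 - alpha) * inv_h_bar
```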
Hessian Approximation Finite Difference Method
AdaDelta
The SGD update (5) can be reformulated as

$$B_t = -(w_t - w_{t-1})\, g_t^{-T}. \qquad (10)$$

AdaDelta [2] approximates the Hessian via (10):

$$h_{i,t} := \mathbb{E}\left[ -\frac{g_{i,\tau}}{w_{i,\tau} - w_{i,\tau-1}} \right]_{\tau=1}^{t} \approx \frac{\mathrm{RMS}\left[ g_{i,t} \right]}{\mathrm{RMS}\left[ w_{i,t-1} - w_{i,t-2} \right]}. \qquad (11)$$

Here $w_{i,t} - w_{i,t-1}$ is not known at time $t$, so it is approximated by $w_{i,t-1} - w_{i,t-2}$.

With numerical stability parameter $\epsilon$ (a sensitive parameter in practice):

$$\mathrm{RMS}\left[ g_t \right] := \sqrt{\mathbb{E}\left[ g_\tau^2 \right]_{\tau=1}^{t} + \epsilon} \qquad (12)$$
$$\mathbb{E}\left[ g_\tau^2 \right]_{\tau=1}^{t} \approx \bar{g}_t := \alpha g_t^2 + (1-\alpha)\bar{g}_{t-1} \qquad (13)$$
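A minimal AdaDelta sketch following (11)-(13), with `rho` playing the role of $1-\alpha$; the class structure and default constants are assumptions rather than the reference implementation:

```python
import numpy as np

class AdaDelta:
    def __init__(self, dim, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.avg_g2 = np.zeros(dim)      # running E[g^2]
        self.avg_dw2 = np.zeros(dim)     # running E[(delta w)^2]

    def step(self, w, g):
        self.avg_g2 = self.rho * self.avg_g2 + (1 - self.rho) * g ** 2
        # step size = RMS[previous updates] / RMS[gradients], cf. eq. (11)
        dw = -np.sqrt(self.avg_dw2 + self.eps) / np.sqrt(self.avg_g2 + self.eps) * g
        self.avg_dw2 = self.rho * self.avg_dw2 + (1 - self.rho) * dw ** 2
        return w + dw
```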
Hessian Approximation Finite Difference Method
VSGD: Quadratic Loss
Quadratic Approximation of Loss

Taylor expansion gives

$$l_t(w) \approx l_t(a) + \frac{d l_t(a)}{dw}^T (w-a) + \frac{1}{2}(w-a)^T \frac{d^2 l_t(a)}{dw^2}(w-a), \quad \forall a \in \mathbb{R}^n$$

For $\hat{w}_t = \arg\min_{w \in \mathbb{R}^n} l_t(w)$,

$$l_t(\hat{w}_t) = 0, \quad \frac{d l_t(\hat{w}_t)}{dw} = 0. \qquad (14)$$

Then $l_t(w)$ can be locally approximated by the quadratic function

$$l_t(w) \approx \frac{1}{2}(w - \hat{w}_t)^T \frac{d^2 l_t(\hat{w}_t)}{dw^2}(w - \hat{w}_t). \qquad (15)$$
Hessian Approximation Finite Difference Method
VSGD: Noisy Quadratic Loss
Noisy Quadratic Approximation of Loss with Diagonal Hessian Approximation

Suppose $\hat{w}_t \sim N\!\left(\hat{w}, \mathrm{diag}\left(\sigma_1^2, \ldots, \sigma_n^2\right)\right)$:

$$l^q_t(w) := \frac{1}{2}(w - \hat{w}_t)^T \hat{H}_t (w - \hat{w}_t) = \frac{1}{2}\sum_{i=1}^{n} h_{i,t}(w_i - \hat{w}_{i,t})^2 \qquad (16)$$

SGD for $l^q_t$ with element-wise learning rate $\eta_{i,t}$:

$$w_{i,t} = w_{i,t-1} - \eta_{i,t}\frac{d l^q_t(w_{t-1})}{d w_i} = w_{i,t-1} - \eta_{i,t} h_{i,t}(w_{i,t-1} - \hat{w}_{i,t}) = w_{i,t-1} - \eta_{i,t} h_{i,t}(w_{i,t-1} - \hat{w}_i + u_i), \quad u_{i,t} \sim N(0, \sigma_i^2). \qquad (17)$$
Hessian Approximation Finite Difference Method
VSGD: Adaptive Learning Rate
Greedy Optimal Learning Rate

VSGD (variance SGD) [3, 4] chooses the learning rate that minimizes the conditional expectation of the loss:

$$\eta_{i,t} := \arg\min_{\eta}\left\{ \mathbb{E}\left[ l^q_t(w_{i,t}) \mid w_{i,t-1} \right]_t \right\} = \arg\min_{\eta}\left\{ \mathbb{E}\left[ \left( w_{i,t-1} - \eta\, h_{i,t}(w_{i,t-1} - \hat{w}_i + u_i) - \hat{w}_{i,t} \right)^2 \right]_t \right\} = \frac{1}{h_{i,t}}\,\frac{(w_{i,t-1} - \hat{w}_i)^2}{(w_{i,t-1} - \hat{w}_i)^2 + \sigma_i^2} \qquad (18)$$
Hessian Approximation Finite Difference Method
VSGD: Variance Approximation
In practice, $\hat{w}$ is unknown, so it is approximated by

$$(w_{i,t-1} - \hat{w}_i)^2 = \left( \mathbb{E}\left[ w_{i,\tau} - \hat{w}_{i,\tau} \right]_{\tau=1}^{t-1} \right)^2 = \left( \mathbb{E}\left[ g_{i,\tau} \right]_{\tau=1}^{t-1} \right)^2 \qquad (19)$$
$$(w_{i,t-1} - \hat{w}_i)^2 + \sigma_i^2 = \mathbb{E}\left[ (w_{i,\tau} - \hat{w}_{i,\tau})^2 \right]_{\tau=1}^{t-1} = \mathbb{E}\left[ g_{i,\tau}^2 \right]_{\tau=1}^{t-1} \qquad (20)$$

where

$$\mathbb{E}\left[ g_{i,\tau} \right]_{\tau=1}^{t} \approx \bar{g}_{i,t} := \alpha_{i,t}\, g_{i,t} + (1-\alpha_{i,t})\,\bar{g}_{i,t-1} \qquad (21)$$
$$\mathbb{E}\left[ g_{i,\tau}^2 \right]_{\tau=1}^{t} \approx \bar{v}_{i,t} := \alpha_{i,t}\, g_{i,t}^2 + (1-\alpha_{i,t})\,\bar{v}_{i,t-1}. \qquad (22)$$
Hessian Approximation Finite Difference Method
VSGD: Hessian Approximation
The Hessian is approximated by $h_{i,t} := \mathbb{E}\left[ h_{i,\tau} \right]_{\tau=1}^{t}$.

Two approximations of $\mathbb{E}\left[ h_{i,\tau} \right]_{\tau=1}^{t}$ based on FDM:

$$\text{(Scheme 1)} \quad \bar{h}_{i,t} := \alpha_{i,t}\,\hat{h}_{i,t} + (1-\alpha_{i,t})\,\bar{h}_{i,t-1} \qquad (23)$$
$$\text{(Scheme 2)} \quad \bar{h}_{i,t} := \frac{\mathbb{E}\left[ \left(\hat{h}_{i,t}\right)^2 \right]}{\mathbb{E}\left[ \hat{h}_{i,t} \right]} \qquad (24)$$
$$\mathbb{E}\left[ \left(\hat{h}_{i,t}\right)^2 \right] \approx v_{i,t} := \alpha_{i,t}\left(\hat{h}_{i,t}\right)^2 + (1-\alpha_{i,t})\, v_{i,t-1}$$
$$\mathbb{E}\left[ \hat{h}_{i,t} \right] \approx m_{i,t} := \alpha_{i,t}\,\hat{h}_{i,t} + (1-\alpha_{i,t})\, m_{i,t-1}$$

where

$$\hat{h}_{i,t} := \frac{g_{i,t} - d l_t(w_{i,t} + \bar{g}_{i,t})/dw}{\bar{g}_{i,t}}. \qquad (25)$$
Hessian Approximation Finite Difference Method
VSGD: Weight Decay
Weight Sequence: The weight $\alpha_{i,t}$ is updated by the following heuristic rule:

$$\alpha_{i,t} := \left( 1 - \frac{\bar{g}_{i,t}^2}{\bar{v}_{i,t}} \right)\alpha_{i,t-1} + 1. \qquad (26)$$
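Putting the VSGD pieces together, here is a sketch combining the running moments (21)-(22), a finite-difference curvature estimate in the spirit of (25), the greedy rate (18) with (19)-(20), and the memory rule (26). Interpreting the sequence in (26) as a memory size whose inverse is the smoothing weight follows [3, 4] and is an assumption beyond what the slide states, as are the initialization constants and the `grad(w)` callable:

```python
import numpy as np

def vsgd(w0, grad, n_steps, delta=1e-4, eps=1e-12):
    """Element-wise VSGD sketch; grad(w) returns a stochastic gradient."""
    w = w0.copy()
    g = grad(w)
    g_bar = g.copy()                                    # running E[g],   eq. (21)
    v_bar = g ** 2 + eps                                # running E[g^2], eq. (22)
    h_bar = np.abs(grad(w + delta) - g) / delta + eps   # curvature, in the spirit of (25)
    mem = np.full_like(w0, 2.0)                         # memory size; slides' alpha ~ 1/mem
    for _ in range(n_steps):
        g = grad(w)
        a = 1.0 / mem                                   # smoothing weight alpha_{i,t}
        g_bar = a * g + (1 - a) * g_bar
        v_bar = a * g ** 2 + (1 - a) * v_bar
        h_hat = np.abs(grad(w + delta) - g) / delta     # finite-difference curvature
        h_bar = a * h_hat + (1 - a) * h_bar
        eta = g_bar ** 2 / ((h_bar + eps) * (v_bar + eps))   # eq. (18) with (19)-(20)
        w = w - eta * g
        mem = (1 - g_bar ** 2 / (v_bar + eps)) * mem + 1     # memory update, cf. eq. (26)
    return w
```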
Hessian Approximation Root Mean Square
RMS: AdaGrad, RMSprop, Adam
AdaGrad [14], RMSProp [15], and Adam [16] can be summarized as

$$B_t = \eta_t R_t \qquad (27)$$
$$R_t := \mathrm{diag}\left( 1/r_{1,t} \ldots 1/r_{n,t} \right) \qquad (28)$$
$$r_{i,t} := \mathrm{RMS}\left[ g_{i,t} \right] = \sqrt{\mathbb{E}\left[ g_{i,t}^2 \right]} \qquad (29)$$

They differ in how the expectation is approximated and in the learning rate:

$$\text{(AdaGrad)} \quad \mathbb{E}\left[ g_{i,t}^2 \right] \approx \frac{1}{t}\sum_{\tau=1}^{t} g_{i,\tau}^2, \quad \eta_t = \eta/\sqrt{t} \qquad (30)$$
$$\text{(RMSProp)} \quad \mathbb{E}\left[ g_{i,t}^2 \right] \approx \bar{v}_{i,t}, \quad \eta_t = \eta \qquad (31)$$
$$\text{(Adam)} \quad \mathbb{E}\left[ g_{i,t}^2 \right] \approx \hat{v}_{i,t} := \frac{\bar{v}_{i,t}}{1-\alpha^t}, \quad \eta_t = \eta\,\frac{1}{1-\gamma^t} \qquad (32)$$

where $\bar{v}_{i,t} := \alpha g_{i,t}^2 + (1-\alpha)\bar{v}_{i,t-1}$.
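A sketch of the three RMS-type updates (30)-(32) in one routine; `eps` and the exact form of the bias-correction denominators (written to match the running-average recursions used here, the convention of the Adam paper) are assumptions not spelled out on the slide:

```python
import numpy as np

def rms_optimizer(w0, grad, n_steps, eta=0.01, alpha=0.1, gamma=0.1,
                  method="adam", eps=1e-8):
    """AdaGrad / RMSProp / Adam in the notation of eqs. (27)-(32)."""
    w = w0.copy()
    v = np.zeros_like(w0)                        # running E[g^2]
    m = np.zeros_like(w0)                        # momentum (Adam only)
    for t in range(1, n_steps + 1):
        g = grad(w)
        if method == "adagrad":
            v = v + (g ** 2 - v) / t             # (1/t) * sum of g^2, eq. (30)
            step = (eta / np.sqrt(t)) * g / (np.sqrt(v) + eps)
        elif method == "rmsprop":
            v = alpha * g ** 2 + (1 - alpha) * v # eq. (31)
            step = eta * g / (np.sqrt(v) + eps)
        elif method == "adam":
            m = gamma * m + (1 - gamma) * g      # biased first moment (see next slide)
            v = alpha * g ** 2 + (1 - alpha) * v # biased second moment
            m_hat = m / (1 - gamma ** t)         # bias corrections matching the recursions
            v_hat = v / (1 - (1 - alpha) ** t)
            step = eta * m_hat / (np.sqrt(v_hat) + eps)
        w = w - step
    return w
```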
Hessian Approximation Root Mean Square
Adam
Moment Bias

For the momentum sequence

$$g_t := \gamma g_{t-1} + (1-\gamma)\frac{d}{dw} l_t(w_{t-1}) = (1-\gamma)\sum_{i=1}^{t}\gamma^{t-i}\frac{d}{dw} l_i(w_{i-1}),$$

the expectation of $g_t$ includes the moment bias $(1-\gamma^t)$:

$$\mathbb{E}\left[ g_t \right]_t = \mathbb{E}\left[ (1-\gamma)\sum_{i=1}^{t}\gamma^{t-i}\frac{d}{dw} l_i(w_{i-1}) \right]_t = \mathbb{E}\left[ \frac{d}{dw} l_t(w_{t-1}) \right]_t (1-\gamma)\sum_{i=1}^{t}\gamma^{t-i} = \mathbb{E}\left[ \frac{d}{dw} l_t(w_{t-1}) \right]_t \left(1-\gamma^t\right).$$

Adam's learning rate aims to remove this moment bias.
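A small numerical check of this bias (an illustrative sketch; the constant gradient is an assumption made only for the demonstration): with a fixed gradient $g$, the momentum sequence equals $(1-\gamma^t)g$, so dividing by $(1-\gamma^t)$ recovers $g$ exactly.

```python
gamma, g_true = 0.9, 1.0
m = 0.0
for t in range(1, 11):
    m = gamma * m + (1 - gamma) * g_true                    # momentum sequence from above
    print(t, round(m, 4), round(m / (1 - gamma ** t), 4))   # biased vs. bias-corrected
# The corrected value equals g_true = 1.0 at every t; the raw value only as t grows.
```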
Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton
Relation to the Gauss-Newton Method

The Gauss-Newton method is an approximation of the Hessian, but it is limited to the squared loss function.
The extended Gauss-Newton method [6] generalizes it:
1 Applicable to any loss function.

Multilayer Perceptron Model

The loss function (3) can be seen as

$$L(\hat{y}) = \mathbb{E}\left[ l_t(\hat{y}_t) \right]_t \qquad (33)$$

where

$$\text{Activation}: \; \hat{y}_t = M(z_t) \qquad (1)$$
$$\text{Output}: \; z_t = N_w(x_t). \qquad (2)$$
Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton: Hessian Derivation
The Hessian of (33) is

$$H = \frac{d^2 L(\hat{y})}{dw^2} = \frac{d}{dw}\left\{ \frac{dz}{dw}\left( \frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}} \right) \right\} = \frac{dz}{dw}\frac{d}{dw}\left( \frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}} \right)^T + \frac{d^2 z}{dw^2}\left( \frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}} \right)$$
$$= \frac{dz}{dw}\frac{dz}{dw}^T \frac{d}{dz}\left( \frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}} \right) + \frac{d^2 z}{dw^2}\left( \frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}} \right) = J_N J_N^T \frac{d}{dz}\left( \frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}} \right) + \frac{d^2 z}{dw^2}\left( \frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}} \right) \qquad (34)$$

where $J_N = \frac{dz}{dw}$ is the Jacobian.
Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton: General Case and Regression
Extended Gauss-Newton

Ignoring the second-order derivative term in (34),

$$H_{GN} = J_N J_N^T \frac{d}{dz}\left( \frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}} \right). \qquad (35)$$

Regression: Since $\frac{d\hat{y}}{dz} = 1$,

$$\frac{d}{dz}\left( \frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}} \right) = \mathbb{E}_t\left[ \frac{d}{dz}\frac{d}{d\hat{y}} l_t(\hat{y}) \right] = \mathbb{E}_t\left[ \frac{d}{dz}(\hat{y} - y_t) \right] = 1. \qquad (36)$$

The extended Gauss-Newton approximation is

$$H_{GN} = J_N J_N^T \qquad (37)$$
Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton: Classification
$$\frac{d}{dz}\left( \frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}} \right) = \mathbb{E}_t\left[ \frac{d}{dz}\left( \frac{d\hat{y}}{dz}\frac{d}{d\hat{y}} l_t(\hat{y}) \right) \right] = -\mathbb{E}_t\left[ \frac{d}{dz}\left( (\hat{y} - \hat{y}^2)\left( \frac{y_t}{\hat{y}} - \frac{1-y_t}{1-\hat{y}} \right) \right) \right] = \mathbb{E}_t\left[ \frac{d}{dz}(\hat{y} - y_t) \right] = \hat{y} - \hat{y}^2 \qquad (38)$$

where

$$\frac{d\hat{y}}{dz} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} - \frac{1}{(1+e^{-z})^2} = \hat{y} - \hat{y}^2$$
$$\frac{d}{d\hat{y}} l_t(\hat{y}) = -\left( \frac{y_t}{\hat{y}} - \frac{1-y_t}{1-\hat{y}} \right)$$

The extended Gauss-Newton approximation is

$$H_{GN} = (\hat{y} - \hat{y}^2)\, J_N J_N^T \qquad (39)$$
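A sketch of the per-sample extended Gauss-Newton matrices (37) and (39) for the linear models in the problem-setting table; using $J_N = x$ for the linear output $z = w^T x$ is an observation assumed here for illustration:

```python
import numpy as np

def gn_regression(x):
    """H_GN = J_N J_N^T, eq. (37); for z = w^T x the Jacobian J_N is x."""
    return np.outer(x, x)

def gn_classification(w, x):
    """H_GN = (y_hat - y_hat^2) J_N J_N^T, eq. (39), for logistic regression."""
    y_hat = 1.0 / (1.0 + np.exp(-(w @ x)))
    return (y_hat - y_hat ** 2) * np.outer(x, x)
```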
Hessian Approximation Extended Gauss-Newton: Preliminaries
Extended Gauss-Newton: Multi-class Classification
$$\frac{d}{dz_i}\left( \frac{d\hat{y}_i}{dz_i}\frac{d}{d\hat{y}_i} l_t(\hat{y}) \right) = \frac{d}{dz_i}\left[ \left\{ \frac{e^{z_i}}{\sum_{i=1}^{k} e^{z_i}} - \left( \frac{e^{z_i}}{\sum_{i=1}^{k} e^{z_i}} \right)^2 \right\}\left( -\frac{\sum_{i=1}^{k} e^{z_i}}{e^{z_i}} \right) \right] = \frac{d}{dz_i}\left[ \frac{e^{z_i}}{\sum_{i=1}^{k} e^{z_i}} - 1 \right] = \frac{e^{z_i}}{\sum_{i=1}^{k} e^{z_i}} - \left( \frac{e^{z_i}}{\sum_{i=1}^{k} e^{z_i}} \right)^2 = \hat{y}_i - \hat{y}_i^2 \qquad (40)$$

The extended Gauss-Newton approximation is

$$H_{GN,i} = (\hat{y}_i - \hat{y}_i^2)\, J_{N,i} J_{N,i}^T \qquad (41)$$
Natural Gradient
Natural Gradient
Fisher Information Matrix

Suppose the observation $y$ is sampled via

$$y \sim p(\hat{y}). \qquad (42)$$

Then the Fisher information matrix is

$$F = \frac{d \log p(y \mid \hat{y})}{dw}\,\frac{d \log p(y \mid \hat{y})}{dw}^T. \qquad (43)$$

Natural Gradient

$$w_t = w_{t-1} - B_t g_t \qquad (44)$$
$$B_t = F^{-1}$$
Natural Gradient
Natural Gradient: Regression
Regression: Suppose a Gaussian distribution,

$$p(y \mid \hat{y}, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\hat{y}-y)^2}{2\sigma^2}}. \qquad (45)$$

Fisher information matrix:

$$F = \frac{(\hat{y}-y)^2}{\sigma^4}\, J_N J_N^T \qquad (46)$$

where

$$\frac{d \log p(y \mid \hat{y}, \sigma)}{dw} = \frac{d}{dw}\left\{ -\frac{(\hat{y}-y)^2}{2\sigma^2} \right\} = -\frac{\hat{y}-y}{\sigma^2}\frac{d\hat{y}}{dw} = -\frac{\hat{y}-y}{\sigma^2}\frac{dz}{dw}\frac{d\hat{y}}{dz} = -\frac{\hat{y}-y}{\sigma^2}\, J_N \qquad (47)$$
Natural Gradient
Natural Gradient: Classification
Classification: Suppose a Bernoulli distribution,

$$p(y \mid \hat{y}) = \hat{y}^{y}(1-\hat{y})^{1-y}. \qquad (48)$$

Fisher information matrix:

$$F = (y - \hat{y})^2\, J_N J_N^T \qquad (49)$$

where

$$\frac{d \log p(y \mid \hat{y})}{dw} = \frac{dz}{dw}\frac{d\hat{y}}{dz}\frac{d}{d\hat{y}}\left\{ y\log\hat{y} + (1-y)\log(1-\hat{y}) \right\} = J_N \frac{d\hat{y}}{dz}\left( \frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}} \right) = J_N (1-\hat{y})\hat{y}\left( \frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}} \right) = J_N (y - \hat{y}) \qquad (50)$$
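A sketch of a natural-gradient step built from (44) and (49)-(50) for the binary-classification model; the step size `eta` and the small ridge term that keeps the rank-one Fisher matrix invertible are additions not present in the slides:

```python
import numpy as np

def natural_gradient_step(w, x, y, eta=0.1, ridge=1e-3):
    """One natural-gradient step for logistic regression, cf. eqs. (44), (49)-(50)."""
    y_hat = 1.0 / (1.0 + np.exp(-(w @ x)))
    score = (y - y_hat) * x                                     # d log p(y|y_hat) / dw, eq. (50)
    fisher = np.outer(score, score) + ridge * np.eye(w.size)    # eq. (49) plus a ridge term
    grad = (y_hat - y) * x                                      # gradient of the cross-entropy loss
    return w - eta * np.linalg.solve(fisher, grad)              # preconditioning by F^{-1}
```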
Natural Gradient
Natural Gradient: Multi-class Classification
Multi-class Classification: Suppose a multinomial distribution,

$$p(y_1, \ldots, y_k \mid \hat{y}_1, \ldots, \hat{y}_k) = \prod_{i=1}^{k} \hat{y}_i^{\,y_i} \qquad (51)$$

Fisher information matrix:

$$F_i = \left\{ (1-\hat{y}_i)\, y_i \right\}^2 J_{N,i} J_{N,i}^T \qquad (52)$$

where

$$\frac{d \log p(y_1, \ldots, y_k \mid \hat{y}_1, \ldots, \hat{y}_k)}{dw_i} = \frac{dz_i}{dw_i}\frac{d\hat{y}_i}{dz_i}\frac{d \sum_{i=1}^{k} y_i \log(\hat{y}_i)}{d\hat{y}_i} \qquad (53)$$
$$= J_{N,i}\left\{ (\hat{y}_i - \hat{y}_i^2)\frac{y_i}{\hat{y}_i} \right\} = J_{N,i}(1-\hat{y}_i)\, y_i \qquad (54)$$
Experiment
Experiment: Settings
Data
Regression: Synthetic data, 1000 features.
Classification: MNIST (handwritten digits), 784 features, labels 1 ∼ 9.
Hyperparameters
Grid search: employ the best-performing values of
1 Learning rate.
2 Weight of RMS.
Evaluation
Regression: Mean squared error.
Classification: Misclassification rate.
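A sketch of the grid search described above, assuming a hypothetical `run_optimizer(lr, rms_weight)` helper that trains once and returns the final evaluation metric:

```python
import itertools

def grid_search(run_optimizer, lrs=(1e-3, 1e-2, 1e-1), rms_weights=(0.9, 0.99, 0.999)):
    """Return the (learning rate, RMS weight) pair with the best final metric."""
    results = {(lr, w): run_optimizer(lr, w)
               for lr, w in itertools.product(lrs, rms_weights)}
    return min(results, key=results.get)   # lower error is better
```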
Experiment
Experiment: Regression
Figure: Regression experiment, error versus iteration (0 to 25,000) for SGD, AdaGrad, RMSprop, Adam, AdaDelta, and VSGD.
VSGD performs best even though it is tuning-free.
Experiment
Experiment: Classification
Figure: Classification experiment, error rate versus iteration (0 to 60,000) for SGD, VSGD, AdaGrad, RMSprop, Adam, and AdaDelta.
AdaDelta performs best even though it is tuning-free.
Conclusion
Conclusion & Future Work
Conclusion
Summarized stochastic optimizers from the viewpoint of their metrics:
Quasi-Newton type, RMS type, and Natural Gradient type.
Conducted brief experiments (classification and regression).
Observed the efficacy of the tuning-free algorithms.
References
[1] Antoine Bordes, Léon Bottou, and Patrick Gallinari, “SGD-QN: Careful quasi-Newton stochastic gradient descent,” Journal of Machine Learning Research, vol. 10, pp. 1737–1754, 2009.
[2] Matthew D. Zeiler, “ADADELTA: An adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
[3] Tom Schaul, Sixin Zhang, and Yann LeCun, “No more pesky learning rates,” in International Conference on Machine Learning, 2013, pp. 343–351.
[4] Tom Schaul and Yann LeCun, “Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients,” in International Conference on Learning Representations, Scottsdale, AZ, 2013.
[5] Oriol Vinyals and Daniel Povey, “Krylov subspace descent for deep learning,” in Artificial Intelligence and Statistics, 2012, pp. 1261–1268.
References (continued)
[6] Nicol N. Schraudolph, “Fast curvature matrix-vector products for second-order gradient descent,” Neural Computation, vol. 14, no. 7, pp. 1723–1738, 2002.
[7] James Martens, “Deep learning via Hessian-free optimization,” in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 735–742.
[8] Richard H. Byrd, Samantha L. Hansen, Jorge Nocedal, and Yoram Singer, “A stochastic quasi-Newton method for large-scale optimization,” SIAM Journal on Optimization, vol. 26, no. 2, pp. 1008–1031, 2016.
[9] Nicol N. Schraudolph, Jin Yu, and Simon Günter, “A stochastic quasi-Newton method for online convex optimization,” in Artificial Intelligence and Statistics, 2007, pp. 436–443.
[10] Aryan Mokhtari and Alejandro Ribeiro, “RES: Regularized stochastic BFGS algorithm,” IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6089–6104, 2014.
References (continued)
[11] Shun-Ichi Amari, “Natural gradient works efficiently in learning,” Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
[12] Nicolas L. Roux, Pierre-Antoine Manzagol, and Yoshua Bengio, “Topmoumoute online natural gradient algorithm,” in Advances in Neural Information Processing Systems, 2008, pp. 849–856.
[13] Nicolas L. Roux and Andrew W. Fitzgibbon, “A fast natural Newton method,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 623–630.
[14] John Duchi, Elad Hazan, and Yoram Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
References (continued)
[15] Tijmen Tieleman and Geoffrey Hinton, “Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
[16] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015, pp. 1–13.
