Using Gradient Descent for Optimization and Learning
Nicolas Le Roux
15 May 2009
Why have a good optimizer?
• plenty of data available everywhere
• extract information efficiently
Why gradient descent?
• cheap
• suitable for large models
• it works
Optimization vs Learning
Optimization
• function f to minimize
• time T(ρ) to reach error level ρ
Learning
• measure of quality f (cost function)
• get training samples x1, . . . , xn from p
• choose a model F with parameters θ
• minimize Ep[f(θ, x)]
Outline
1 Optimization
   • Basics
   • Approximations to Newton method
   • Stochastic Optimization
2 Learning (Bottou)
3 TONGA
   • Natural Gradient
   • Online Natural Gradient
   • Results
The basics
Taylor expansion of f to the first order:
f(θ + ε) = f(θ) + εᵀ∇θf + o(‖ε‖)
Best improvement obtained with
ε = −η∇θf,   η > 0
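A minimal sketch of this update rule (NumPy assumed; the quadratic objective and step count are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_steps=100):
    """Repeatedly apply theta <- theta - eta * grad(theta)."""
    theta = theta0.copy()
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
    return theta

# Quadratic bowl f(theta) = 0.5 * theta^T A theta, whose gradient is A theta
A = np.diag([1.0, 10.0])
theta = gradient_descent(lambda t: A @ t, np.array([1.0, 1.0]), eta=0.1)
```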
Quadratic bowl
[Figure: gradient descent trajectories on a quadratic bowl for η = 0.1 and η = 0.3]
Optimal learning rate
ηmin,opt = 1/λmin   ηmin,div = 2/λmin
ηmax,opt = 1/λmax   ηmax,div = 2/λmax
With η = ηmax,opt:
κ = ηmin,opt / ηmax,opt = λmax / λmin
T(ρ) = O(dκ log(1/ρ))
Limitations of gradient descent
• Speed of convergence highly dependent on κ
• Not invariant to linear transformations:
θ′ = kθ
f(θ′ + ε) = f(θ′) + εᵀ∇θ′f + o(‖ε‖)
ε = −η∇θ′f
But ∇θ′f = ∇θf / k !
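A quick numeric check of the non-invariance (a hypothetical 1-D example, not from the slides): under θ′ = kθ the gradient shrinks by a factor k, so the same learning rate produces a very different trajectory.

```python
# f(theta) = 0.5 * theta^2; reparameterize as theta' = k * theta
k = 10.0
grad_theta = lambda theta: theta                # df/dtheta
grad_theta_p = lambda theta_p: theta_p / k**2   # df/dtheta' = (df/dtheta) / k

theta, eta = 1.0, 0.5
step = -eta * grad_theta(theta)                     # step in theta space: -0.5
step_mapped = -eta * grad_theta_p(k * theta) / k    # theta'-space step, mapped back: -0.005
print(step, step_mapped)                            # the two trajectories differ
```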
Newton method
Second-order Taylor expansion of f around θ:
f(θ + ε) = f(θ) + εᵀ∇θf + ½ εᵀHε + o(‖ε‖²)
Best improvement obtained with
ε = −ηH⁻¹∇θf
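A sketch of the corresponding update (NumPy assumed; the quadratic example is a made-up illustration):

```python
import numpy as np

def newton_step(grad, hess, theta, eta=1.0):
    """theta <- theta - eta * H^{-1} grad(theta), via a linear solve."""
    return theta - eta * np.linalg.solve(hess(theta), grad(theta))

# On a quadratic bowl f(theta) = 0.5 * theta^T A theta, one full step
# (eta = 1) lands exactly on the minimum, whatever the conditioning:
A = np.diag([1.0, 10.0])
theta = newton_step(lambda t: A @ t, lambda t: A, np.array([1.0, 1.0]))
print(theta)  # [0. 0.]
```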
Quadratic bowl
[Figure: Newton method on the quadratic bowl]
Rosenbrock function
[Figure: Newton method on the Rosenbrock function]
Properties of Newton method
• Newton method assumes the function is locally quadratic (Beyond Newton method, Minka)
• H must be positive definite
• storing H and finding ε are in O(d²)
T(ρ) = O(d² log log(1/ρ))
Limitations of Newton method
• Newton method looks powerful, but...
• H may be hard to compute
• it is expensive
Gauss-Newton
Only works when f is a sum of squared residuals!
f(θ) = ½ Σᵢ rᵢ²(θ)
∂f/∂θ = Σᵢ rᵢ ∂rᵢ/∂θ
∂²f/∂θ² = Σᵢ [ rᵢ ∂²rᵢ/∂θ² + (∂rᵢ/∂θ)(∂rᵢ/∂θ)ᵀ ]
θ ← θ − η [ Σᵢ (∂rᵢ/∂θ)(∂rᵢ/∂θ)ᵀ ]⁻¹ ∂f/∂θ
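A sketch of one such step (hypothetical NumPy helper; `residuals` and `jacobian` are assumed callables returning r and J with J[i] = ∂rᵢ/∂θ):

```python
import numpy as np

def gauss_newton_step(theta, residuals, jacobian, eta=1.0):
    """One step of theta <- theta - eta * (J^T J)^{-1} J^T r,
    where J^T J approximates the Hessian and J^T r is the gradient."""
    r = residuals(theta)
    J = jacobian(theta)
    return theta - eta * np.linalg.solve(J.T @ J, J.T @ r)
```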
Properties of Gauss-Newton
θ ← θ − η [ Σᵢ (∂rᵢ/∂θ)(∂rᵢ/∂θ)ᵀ ]⁻¹ ∂f/∂θ
Discarded term: rᵢ ∂²rᵢ/∂θ²
• Does not require the computation of H
• Only valid close to the optimum, where rᵢ = 0
• Computation cost in O(d²)
Levenberg-Marquardt
θ ← θ − η [ Σᵢ (∂rᵢ/∂θ)(∂rᵢ/∂θ)ᵀ + λI ]⁻¹ ∂f/∂θ
• “Damped” Gauss-Newton
• Intermediate between Gauss-Newton and steepest descent
• Slower optimization but more robust
• Cost in O(d²)
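The damping only changes the linear system being solved; a sketch under the same assumptions as the Gauss-Newton snippet above:

```python
import numpy as np

def levenberg_marquardt_step(theta, residuals, jacobian, lam=1e-2, eta=1.0):
    """Gauss-Newton step with lam * I added to the normal equations.
    Large lam behaves like steepest descent, small lam like Gauss-Newton."""
    r = residuals(theta)
    J = jacobian(theta)
    return theta - eta * np.linalg.solve(J.T @ J + lam * np.eye(theta.size), J.T @ r)
```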
Quasi-Newton methods
• Gauss-Newton and Levenberg-Marquardt can only be used in special cases
• What about the general case?
• H characterizes the change in gradient when moving in parameter space
• Let’s find a matrix which does the same!
BFGS
We look for a matrix B such that
∇θf(θ + ε) − ∇θf(θ) = Bt⁻¹ε   (secant equation)
• Problem: this is underconstrained
• Solution: set additional constraints: make ‖Bt+1 − Bt‖W small
BFGS - (2)
• B0 = I
• while not converged:
   • pt = −Bt ∇f(θt)
   • ηt = linemin(f, θt, pt)
   • st = ηt pt (change in parameter space)
   • θt+1 = θt + st
   • yt = ∇f(θt+1) − ∇f(θt) (change in gradient space)
   • ρt = (stᵀyt)⁻¹
   • Bt+1 = (I − ρt st ytᵀ) Bt (I − ρt yt stᵀ) + ρt st stᵀ
(stems from the Sherman-Morrison formula)
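A sketch of this recursion (NumPy assumed; the crude backtracking loop stands in for `linemin`, which the slide leaves unspecified):

```python
import numpy as np

def bfgs(f, grad, theta0, n_steps=100):
    d = theta0.size
    B = np.eye(d)                     # B_0 = I, estimate of H^{-1}
    theta = theta0.copy()
    g = grad(theta)
    for _ in range(n_steps):
        p = -B @ g                    # p_t = -B_t grad f(theta_t)
        eta = 1.0                     # crude backtracking line search
        while f(theta + eta * p) > f(theta) and eta > 1e-10:
            eta *= 0.5
        s = eta * p                   # change in parameter space
        theta = theta + s
        g_new = grad(theta)
        y = g_new - g                 # change in gradient space
        rho = 1.0 / (s @ y)
        I = np.eye(d)
        B = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)    # the B_{t+1} update above
        g = g_new
    return theta
```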
BFGS - (3)
• Requires a line search
• No matrix inversion required
• Update in O(d²)
• Can we do better than O(d²)?
L-BFGS
• Low-rank estimate of B
• Based on the last m moves in parameter and gradient spaces
• Cost O(md) per update
• Same ballpark as steepest descent!
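A sketch of the standard two-loop recursion usually used here (an assumption about the implementation, not spelled out on the slide): it applies the implicit B to a gradient using only the last m (s, y) pairs.

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Returns p = -B g from the m stored (s, y) pairs, in O(md)."""
    q = g.copy()
    alphas = []
    for s, y in reversed(list(zip(s_list, y_list))):   # newest to oldest
        alpha = (s @ q) / (s @ y)
        alphas.append(alpha)
        q -= alpha * y
    if s_list:                                         # scale the initial B_0
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):
        beta = (y @ q) / (s @ y)                       # oldest to newest
        q += (alpha - beta) * s
    return -q                                          # descent direction
```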
Conjugate Gradient
We want to solve
Ax = b
• Relies on conjugate directions
• u and v are conjugate if uᵀAv = 0
• d mutually conjugate directions form a basis of ℝᵈ
• Goal: move along conjugate directions close to steepest descent directions
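A sketch of the classical algorithm for this quadratic case (NumPy assumed; A symmetric positive definite):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solves Ax = b; each new direction p is A-conjugate to the others."""
    x = np.zeros_like(b)
    r = b - A @ x                      # residual (negative gradient)
    p = r.copy()
    while r @ r > tol:
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)     # exact line search along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p           # next conjugate direction
        r = r_new
    return x
```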
Nonlinear Conjugate Gradient
• Extension to non-quadratic functions
• Requires a line search in every direction (important for conjugacy!)
• Various direction updates (Polak-Ribière)
Going from batch to stochastic
• dataset composed of n samples
• f(θ) = (1/n) Σᵢ fᵢ(θ, xᵢ)
Do we really need to see all the examples before making a parameter update?
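A sketch of the resulting procedure (hypothetical NumPy helper; `grad_i(theta, i)` is an assumed callable returning ∇θfᵢ):

```python
import numpy as np

def sgd(grad_i, theta0, n, eta=0.01, n_epochs=10, seed=0):
    """Update after every sample instead of after a full pass."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(n_epochs):
        for i in rng.permutation(n):   # visit samples in random order
            theta = theta - eta * grad_i(theta, i)
    return theta
```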
Stochastic optimization
• Information is redundant amongst samples
• We can afford more frequent, noisier updates
• But problems arise...
Stochastic (Bottou)
Advantage
• much faster convergence on large redundant datasets
Disadvantages
• Keeps bouncing around unless η is reduced
• Extremely hard to reach high accuracy
• Theoretical convergence guarantees are less well defined
• Most second-order methods will not work
Batch (Bottou)
Advantages
• Guaranteed convergence to a local minimum under simple conditions
• Lots of tricks to speed up the learning
Disadvantage
• Painfully slow on large problems
Problems arising in the stochastic setting
• First-order descent: O(dn) =⇒ O(dn)
• Second-order methods: O(d² + dn) =⇒ O(d²n)
• Special cases: algorithms requiring a line search
   • BFGS: not critical, may be replaced by a one-step update
   • Conjugate Gradient: critical, no stochastic version
Successful stochastic methods
• Stochastic gradient descent
• Online BFGS (Schraudolph, 2007)
• Online L-BFGS (Schraudolph, 2007)
Conclusions of the tutorial
Batch methods
• Second-order methods have much faster convergence
• They are too expensive when d is large
• Except for L-BFGS and Nonlinear Conjugate Gradient
Conclusions of the tutorial
Stochastic methods
• Much faster updates
• Terrible convergence rates
• Stochastic Gradient Descent: T(ρ) = O(d/ρ)
• Second-order Stochastic Descent: T(ρ) = O(d²/ρ)
Outline
1 Optimization
   • Basics
   • Approximations to Newton method
   • Stochastic Optimization
2 Learning (Bottou)
3 TONGA
   • Natural Gradient
   • Online Natural Gradient
   • Results
Learning
• Measure of quality f (“cost function”)
• Get training samples x1, . . . , xn from p
• Choose a model F with parameters θ
• Minimize Ep[f(θ, x)]
• Time budget T
Small-scale vs. Large-scale Learning
• Small-scale learning problem: the active budget constraint is the number of examples n.
• Large-scale learning problem: the active budget constraint is the computing time T.
Which algorithm should we use?
T(ρ) for various algorithms:
• Gradient Descent: T(ρ) = O(ndκ log(1/ρ))
• Second Order Gradient Descent: T(ρ) = O(d² log log(1/ρ))
• Stochastic Gradient Descent: T(ρ) = O(d/ρ)
• Second-order Stochastic Descent: T(ρ) = O(d²/ρ)
Second Order Gradient Descent seems a good choice!
Large-scale learning
• We are limited by the time T
• We can choose between ρ and n
• Better optimization means fewer examples
Generalization error
E(f̃n) − E(f∗) = E(f∗F) − E(f∗)   (approximation error)
              + E(fn) − E(f∗F)   (estimation error)
              + E(f̃n) − E(fn)   (optimization error)
There is no need to optimize thoroughly if we cannot process enough data points!
Which algorithm should we use?
Time to reach E(f̃n) − E(f∗) < ε:
• Gradient Descent: O( (d²κ / ε^(1/α)) log²(1/ε) )
• Second Order Gradient Descent: O( (d²κ / ε^(1/α)) log(1/ε) log log(1/ε) )
• Stochastic Gradient Descent: O(dκ²/ε)
• Second-order Stochastic Descent: O(d²/ε)
with 1/2 ≤ α ≤ 1 (statistical estimation rate).
In a nutshell
• Simple stochastic gradient descent is extremely efficient
• Fast second-order stochastic gradient descent can win us a constant factor
• Are there other possible factors of improvement?
What we did not talk about (yet)
• we have access to x1, . . . , xn ∼ p
• we wish to minimize Ex∼p[f(θ, x)]
• can we use the uncertainty in the dataset to get information about p?
Outline
1 Optimization
   • Basics
   • Approximations to Newton method
   • Stochastic Optimization
2 Learning (Bottou)
3 TONGA
   • Natural Gradient
   • Online Natural Gradient
   • Results
Cost functions
Empirical cost function:
f(θ) = (1/n) Σᵢ f(θ, xᵢ)
True cost function:
f∗(θ) = Ex∼p[f(θ, x)]
Gradients
Empirical gradient:
g = (1/n) Σᵢ ∇θf(θ, xᵢ)
True gradient:
g∗ = Ex∼p[∇θf(θ, x)]
Central-limit theorem:
g | g∗ ∼ N(g∗, C/n)
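A sketch of how g and C could be estimated from a batch of per-sample gradients (a hypothetical helper, assuming the gradients are stacked in an (n, d) array):

```python
import numpy as np

def gradient_statistics(grads):
    """grads[i] = gradient of f(theta, x_i). Returns the empirical gradient g
    and the per-sample covariance C, so that g | g* ~ N(g*, C / n)."""
    g = grads.mean(axis=0)
    centered = grads - g
    C = centered.T @ centered / grads.shape[0]
    return g, C
```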
Posterior over g∗
Using
g∗ ∼ N(0, σ²I)
we have
g∗ | g ∼ N( (I + C/(nσ²))⁻¹ g ,  (I/σ² + nC⁻¹)⁻¹ )
and then
εᵀg∗ | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )
Aggressive strategy
εᵀg∗ | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )
• we want to minimize Eg∗[εᵀg∗]
• Solution:
ε = −η (I + C/(nσ²))⁻¹ g
This is the regularized natural gradient.
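A sketch of this update (NumPy assumed; σ² is the prior variance, and the covariance estimate reuses the gradient-statistics sketch above):

```python
import numpy as np

def regularized_natural_gradient_step(theta, grads, eta=0.1, sigma2=1.0):
    """Aggressive strategy: eps = -eta * (I + C / (n sigma^2))^{-1} g."""
    n, d = grads.shape
    g = grads.mean(axis=0)
    centered = grads - g
    C = centered.T @ centered / n
    return theta - eta * np.linalg.solve(np.eye(d) + C / (n * sigma2), g)
```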
Conservative strategy
εᵀg∗ | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )
• we want to minimize Pr(εᵀg∗ > 0)
• Solution:
ε = −η (C/n)⁻¹ g
This is the natural gradient.
Online Natural Gradient
• Natural gradient has nice properties
• Its cost per iteration is in O(d²)
• Can we make it faster?
Goals
ε = −η (I + C/(nσ²))⁻¹ g
We must be able to update and invert C whenever a new gradient arrives.
Plan of action
Updating (uncentered) C:
Ct ∝ γCt−1 + (1 − γ) gt gtᵀ
Inverting C:
• maintain low-rank estimates of C using its eigendecomposition
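A sketch of the discounted update (hypothetical helper; γ is the assumed discount factor):

```python
import numpy as np

def update_covariance(C, g, gamma=0.995):
    """C_t proportional to gamma * C_{t-1} + (1 - gamma) * g g^T."""
    return gamma * C + (1.0 - gamma) * np.outer(g, g)
```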
Eigendecomposition of C
• Computing k eigenvectors of C = GGT
is
O(kd2
)
• But computing k eigenvectors of GT
G is O(kp2
)!
• Still too expensive to compute for each new
sample (and p must not grow)
• Done only every b steps
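A sketch of the Gram trick behind the second bullet (NumPy assumed; G holds p gradients as columns): eigenvectors of the small p × p matrix GᵀG yield eigenvectors of C = GGᵀ.

```python
import numpy as np

def top_eigenvectors_via_gram(G, k):
    """If G^T G v = lam v, then C (G v) = lam (G v), so we only ever
    factor the p x p Gram matrix instead of the d x d covariance."""
    lam, V = np.linalg.eigh(G.T @ G)                  # ascending eigenvalues
    lam, V = lam[::-1][:k], V[:, ::-1][:, :k]         # keep the top k
    U = G @ V / np.sqrt(np.maximum(lam, 1e-12))       # unit-norm eigenvectors of C
    return lam, U
```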
To prove I’m not cheating
• eigendecomposition every b steps
Between two eigendecompositions (t = 1, . . . , b):
Ct = γ^t C + Σ_{k=1}^{t} γ^(t−k) gk gkᵀ + λI
vt = Ct⁻¹ gt
Xt = [ γ^(t/2) U,  γ^((t−1)/2) g1, . . . , γ^(1/2) g_{t−1},  gt ]
Ct = Xt Xtᵀ + λI,   vt = Xt αt,   gt = Xt yt
αt = (Xtᵀ Xt + λI)⁻¹ yt
vt = Xt (Xtᵀ Xt + λI)⁻¹ yt
Computation complexity
• d dimensions
• k eigenvectors
• b steps between two updates
• Computing the eigendecomposition (every b steps): O(k(k + b)²)
• Computing the natural gradient: O(d(k + b) + (k + b)²)
• if (k + b) ≪ d, the cost per example is O(d(k + b))
What next
• Complexity in O(d(k + b))
• We need a small k
• If d > 10⁶, how large should k be?
Decomposing C
C is almost block-diagonal!
Quality of approximation
[Figure: ratio of the squared Frobenius norms vs. the number k of eigenvectors kept, for the full-matrix and block-diagonal approximations; the block-diagonal approximation reaches its minimum at k = 5]
Evolution of C
[Figure: evolution of the covariance matrix C during training]
Results - MNIST
[Figure: negative log-likelihood and classification error on the test set vs. CPU time (seconds); block-diagonal TONGA compared with stochastic gradient at batch sizes 1, 400, 1000 and 2000]
Results - Rectangles
[Figure: negative log-likelihood and classification error on the validation set vs. CPU time (seconds); stochastic gradient compared with block-diagonal TONGA]
Results - USPS
[Figure: negative log-likelihood and classification error vs. CPU time (seconds); block-diagonal TONGA compared with stochastic gradient]
TONGA - Conclusion
• Introducing uncertainty speeds up training
• There exists a fast implementation of the online natural gradient
Difference between C and H
• H accounts for small changes in the parameter space
• C accounts for small changes in the input space
• C is not just a “cheap cousin” of H
Future research
Much more to do!
• Can we combine the effects of H and C?
• Are there better approximations of C and H?
• Anything you can think of!
Thank you!
Questions?

Más contenido relacionado

La actualidad más candente

SOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
SOCG: Linear-Size Approximations to the Vietoris-Rips FiltrationSOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
SOCG: Linear-Size Approximations to the Vietoris-Rips FiltrationDon Sheehy
 
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...Lake Como School of Advanced Studies
 
Introduction to Date and Time API 3
Introduction to Date and Time API 3Introduction to Date and Time API 3
Introduction to Date and Time API 3Kenji HASUNUMA
 
Nurbs (1)
Nurbs (1)Nurbs (1)
Nurbs (1)MAKAUT
 
Lec5 advanced-policy-gradient-methods
Lec5 advanced-policy-gradient-methodsLec5 advanced-policy-gradient-methods
Lec5 advanced-policy-gradient-methodsRonald Teo
 
Noise from stray light in interferometric GWs detectors
Noise from stray light in interferometric GWs detectorsNoise from stray light in interferometric GWs detectors
Noise from stray light in interferometric GWs detectorsJose Gonzalez
 
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...Lake Como School of Advanced Studies
 
Short-time homomorphic wavelet estimation
Short-time homomorphic wavelet estimation Short-time homomorphic wavelet estimation
Short-time homomorphic wavelet estimation UT Technology
 

La actualidad más candente (9)

SOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
SOCG: Linear-Size Approximations to the Vietoris-Rips FiltrationSOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
SOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
 
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...
 
Bode lect
Bode lectBode lect
Bode lect
 
Introduction to Date and Time API 3
Introduction to Date and Time API 3Introduction to Date and Time API 3
Introduction to Date and Time API 3
 
Nurbs (1)
Nurbs (1)Nurbs (1)
Nurbs (1)
 
Lec5 advanced-policy-gradient-methods
Lec5 advanced-policy-gradient-methodsLec5 advanced-policy-gradient-methods
Lec5 advanced-policy-gradient-methods
 
Noise from stray light in interferometric GWs detectors
Noise from stray light in interferometric GWs detectorsNoise from stray light in interferometric GWs detectors
Noise from stray light in interferometric GWs detectors
 
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...
The Analytical/Numerical Relativity Interface behind Gravitational Waves: Lec...
 
Short-time homomorphic wavelet estimation
Short-time homomorphic wavelet estimation Short-time homomorphic wavelet estimation
Short-time homomorphic wavelet estimation
 

Destacado

[AI07] Revolutionizing Image Processing with Cognitive Toolkit
[AI07] Revolutionizing Image Processing with Cognitive Toolkit[AI07] Revolutionizing Image Processing with Cognitive Toolkit
[AI07] Revolutionizing Image Processing with Cognitive Toolkitde:code 2017
 
Face Recognition Based on Deep Learning (Yurii Pashchenko Technology Stream)
Face Recognition Based on Deep Learning (Yurii Pashchenko Technology Stream) Face Recognition Based on Deep Learning (Yurii Pashchenko Technology Stream)
Face Recognition Based on Deep Learning (Yurii Pashchenko Technology Stream) IT Arena
 
Pattern Recognition and Machine Learning : Graphical Models
Pattern Recognition and Machine Learning : Graphical ModelsPattern Recognition and Machine Learning : Graphical Models
Pattern Recognition and Machine Learning : Graphical Modelsbutest
 
Muzammil Abdulrahman PPT On Gabor Wavelet Transform (GWT) Based Facial Expres...
Muzammil Abdulrahman PPT On Gabor Wavelet Transform (GWT) Based Facial Expres...Muzammil Abdulrahman PPT On Gabor Wavelet Transform (GWT) Based Facial Expres...
Muzammil Abdulrahman PPT On Gabor Wavelet Transform (GWT) Based Facial Expres...Petroleum Training Institute
 
DIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe WorkshopDIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe Workshopodsc
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learningJeremy Nixon
 
Processor, Compiler and Python Programming Language
Processor, Compiler and Python Programming LanguageProcessor, Compiler and Python Programming Language
Processor, Compiler and Python Programming Languagearumdapta98
 
Caffe framework tutorial2
Caffe framework tutorial2Caffe framework tutorial2
Caffe framework tutorial2Park Chunduck
 
Caffe framework tutorial
Caffe framework tutorialCaffe framework tutorial
Caffe framework tutorialPark Chunduck
 
Computer vision, machine, and deep learning
Computer vision, machine, and deep learningComputer vision, machine, and deep learning
Computer vision, machine, and deep learningIgi Ardiyanto
 
Face recognition and deep learning โดย ดร. สรรพฤทธิ์ มฤคทัต NECTEC
Face recognition and deep learning  โดย ดร. สรรพฤทธิ์ มฤคทัต NECTECFace recognition and deep learning  โดย ดร. สรรพฤทธิ์ มฤคทัต NECTEC
Face recognition and deep learning โดย ดร. สรรพฤทธิ์ มฤคทัต NECTECBAINIDA
 
Semi fragile watermarking
Semi fragile watermarkingSemi fragile watermarking
Semi fragile watermarkingYash Diwakar
 
Structure Learning of Bayesian Networks with p Nodes from n Samples when n&lt...
Structure Learning of Bayesian Networks with p Nodes from n Samples when n&lt...Structure Learning of Bayesian Networks with p Nodes from n Samples when n&lt...
Structure Learning of Bayesian Networks with p Nodes from n Samples when n&lt...Joe Suzuki
 
Center loss for Face Recognition
Center loss for Face RecognitionCenter loss for Face Recognition
Center loss for Face RecognitionJisung Kim
 
Caffe - A deep learning framework (Ramin Fahimi)
Caffe - A deep learning framework (Ramin Fahimi)Caffe - A deep learning framework (Ramin Fahimi)
Caffe - A deep learning framework (Ramin Fahimi)irpycon
 
Rattani - Ph.D. Defense Slides
Rattani - Ph.D. Defense SlidesRattani - Ph.D. Defense Slides
Rattani - Ph.D. Defense SlidesPluribus One
 
怖くない誤差逆伝播法 Chainerを添えて
怖くない誤差逆伝播法 Chainerを添えて怖くない誤差逆伝播法 Chainerを添えて
怖くない誤差逆伝播法 Chainerを添えてmarujirou
 
Pattern Recognition and Machine Learning: Section 3.3
Pattern Recognition and Machine Learning: Section 3.3Pattern Recognition and Machine Learning: Section 3.3
Pattern Recognition and Machine Learning: Section 3.3Yusuke Oda
 

Destacado (20)

[AI07] Revolutionizing Image Processing with Cognitive Toolkit
[AI07] Revolutionizing Image Processing with Cognitive Toolkit[AI07] Revolutionizing Image Processing with Cognitive Toolkit
[AI07] Revolutionizing Image Processing with Cognitive Toolkit
 
Face Recognition Based on Deep Learning (Yurii Pashchenko Technology Stream)
Face Recognition Based on Deep Learning (Yurii Pashchenko Technology Stream) Face Recognition Based on Deep Learning (Yurii Pashchenko Technology Stream)
Face Recognition Based on Deep Learning (Yurii Pashchenko Technology Stream)
 
Pattern Recognition and Machine Learning : Graphical Models
Pattern Recognition and Machine Learning : Graphical ModelsPattern Recognition and Machine Learning : Graphical Models
Pattern Recognition and Machine Learning : Graphical Models
 
Muzammil Abdulrahman PPT On Gabor Wavelet Transform (GWT) Based Facial Expres...
Muzammil Abdulrahman PPT On Gabor Wavelet Transform (GWT) Based Facial Expres...Muzammil Abdulrahman PPT On Gabor Wavelet Transform (GWT) Based Facial Expres...
Muzammil Abdulrahman PPT On Gabor Wavelet Transform (GWT) Based Facial Expres...
 
DIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe WorkshopDIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe Workshop
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learning
 
Processor, Compiler and Python Programming Language
Processor, Compiler and Python Programming LanguageProcessor, Compiler and Python Programming Language
Processor, Compiler and Python Programming Language
 
Caffe framework tutorial2
Caffe framework tutorial2Caffe framework tutorial2
Caffe framework tutorial2
 
Facebook Deep face
Facebook Deep faceFacebook Deep face
Facebook Deep face
 
Caffe framework tutorial
Caffe framework tutorialCaffe framework tutorial
Caffe framework tutorial
 
Computer vision, machine, and deep learning
Computer vision, machine, and deep learningComputer vision, machine, and deep learning
Computer vision, machine, and deep learning
 
Face recognition and deep learning โดย ดร. สรรพฤทธิ์ มฤคทัต NECTEC
Face recognition and deep learning  โดย ดร. สรรพฤทธิ์ มฤคทัต NECTECFace recognition and deep learning  โดย ดร. สรรพฤทธิ์ มฤคทัต NECTEC
Face recognition and deep learning โดย ดร. สรรพฤทธิ์ มฤคทัต NECTEC
 
Semi fragile watermarking
Semi fragile watermarkingSemi fragile watermarking
Semi fragile watermarking
 
Structure Learning of Bayesian Networks with p Nodes from n Samples when n&lt...
Structure Learning of Bayesian Networks with p Nodes from n Samples when n&lt...Structure Learning of Bayesian Networks with p Nodes from n Samples when n&lt...
Structure Learning of Bayesian Networks with p Nodes from n Samples when n&lt...
 
Center loss for Face Recognition
Center loss for Face RecognitionCenter loss for Face Recognition
Center loss for Face Recognition
 
портфоліо Бабич О.А.
портфоліо Бабич О.А.портфоліо Бабич О.А.
портфоліо Бабич О.А.
 
Caffe - A deep learning framework (Ramin Fahimi)
Caffe - A deep learning framework (Ramin Fahimi)Caffe - A deep learning framework (Ramin Fahimi)
Caffe - A deep learning framework (Ramin Fahimi)
 
Rattani - Ph.D. Defense Slides
Rattani - Ph.D. Defense SlidesRattani - Ph.D. Defense Slides
Rattani - Ph.D. Defense Slides
 
怖くない誤差逆伝播法 Chainerを添えて
怖くない誤差逆伝播法 Chainerを添えて怖くない誤差逆伝播法 Chainerを添えて
怖くない誤差逆伝播法 Chainerを添えて
 
Pattern Recognition and Machine Learning: Section 3.3
Pattern Recognition and Machine Learning: Section 3.3Pattern Recognition and Machine Learning: Section 3.3
Pattern Recognition and Machine Learning: Section 3.3
 

Similar a Using Gradient Descent for Optimization and Learning

A Fast Implicit Gaussian Curvature Filter
A Fast Implicit Gaussian Curvature FilterA Fast Implicit Gaussian Curvature Filter
A Fast Implicit Gaussian Curvature FilterYuanhao Gong
 
Computer Organization1CS1400Feng JiangBoolean al.docx
Computer Organization1CS1400Feng JiangBoolean al.docxComputer Organization1CS1400Feng JiangBoolean al.docx
Computer Organization1CS1400Feng JiangBoolean al.docxladonnacamplin
 
Sparsenet
SparsenetSparsenet
Sparsenetndronen
 
Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Ted Dunning
 
Generalised 2nd Order Active Filter Prototype_B
Generalised 2nd Order Active Filter Prototype_BGeneralised 2nd Order Active Filter Prototype_B
Generalised 2nd Order Active Filter Prototype_BAdrian Mostert
 
Asymptotic Notations.pptx
Asymptotic Notations.pptxAsymptotic Notations.pptx
Asymptotic Notations.pptxSunilWork1
 
Focused Exploration of Geospatial Context on Linked Open Data
Focused Exploration of Geospatial Context on Linked Open DataFocused Exploration of Geospatial Context on Linked Open Data
Focused Exploration of Geospatial Context on Linked Open DataThomas Gottron
 
Swift for tensorflow
Swift for tensorflowSwift for tensorflow
Swift for tensorflow규영 허
 
Dsoop (co 221) 1
Dsoop (co 221) 1Dsoop (co 221) 1
Dsoop (co 221) 1Puja Koch
 
Evaluate and quantify the drift of a measuring
Evaluate and quantify the drift of a measuringEvaluate and quantify the drift of a measuring
Evaluate and quantify the drift of a measuringJean-Michel POU
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsAlbert Bifet
 
Asymptotics 140510003721-phpapp02
Asymptotics 140510003721-phpapp02Asymptotics 140510003721-phpapp02
Asymptotics 140510003721-phpapp02mansab MIRZA
 
Control of Uncertain Hybrid Nonlinear Systems Using Particle Filters
Control of Uncertain Hybrid Nonlinear Systems Using Particle FiltersControl of Uncertain Hybrid Nonlinear Systems Using Particle Filters
Control of Uncertain Hybrid Nonlinear Systems Using Particle FiltersLeo Asselborn
 

Similar a Using Gradient Descent for Optimization and Learning (19)

A Fast Implicit Gaussian Curvature Filter
A Fast Implicit Gaussian Curvature FilterA Fast Implicit Gaussian Curvature Filter
A Fast Implicit Gaussian Curvature Filter
 
DETR ECCV20
DETR ECCV20DETR ECCV20
DETR ECCV20
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Optim_methods.pdf
Optim_methods.pdfOptim_methods.pdf
Optim_methods.pdf
 
Computer Organization1CS1400Feng JiangBoolean al.docx
Computer Organization1CS1400Feng JiangBoolean al.docxComputer Organization1CS1400Feng JiangBoolean al.docx
Computer Organization1CS1400Feng JiangBoolean al.docx
 
Sparsenet
SparsenetSparsenet
Sparsenet
 
Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28
 
Paris Data Geeks
Paris Data GeeksParis Data Geeks
Paris Data Geeks
 
PS
PSPS
PS
 
Generalised 2nd Order Active Filter Prototype_B
Generalised 2nd Order Active Filter Prototype_BGeneralised 2nd Order Active Filter Prototype_B
Generalised 2nd Order Active Filter Prototype_B
 
Asymptotic Notations.pptx
Asymptotic Notations.pptxAsymptotic Notations.pptx
Asymptotic Notations.pptx
 
Focused Exploration of Geospatial Context on Linked Open Data
Focused Exploration of Geospatial Context on Linked Open DataFocused Exploration of Geospatial Context on Linked Open Data
Focused Exploration of Geospatial Context on Linked Open Data
 
Focused Exploration of Geospatial Context on Linked Open Data
Focused Exploration of Geospatial Context on Linked Open DataFocused Exploration of Geospatial Context on Linked Open Data
Focused Exploration of Geospatial Context on Linked Open Data
 
Swift for tensorflow
Swift for tensorflowSwift for tensorflow
Swift for tensorflow
 
Dsoop (co 221) 1
Dsoop (co 221) 1Dsoop (co 221) 1
Dsoop (co 221) 1
 
Evaluate and quantify the drift of a measuring
Evaluate and quantify the drift of a measuringEvaluate and quantify the drift of a measuring
Evaluate and quantify the drift of a measuring
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
 
Asymptotics 140510003721-phpapp02
Asymptotics 140510003721-phpapp02Asymptotics 140510003721-phpapp02
Asymptotics 140510003721-phpapp02
 
Control of Uncertain Hybrid Nonlinear Systems Using Particle Filters
Control of Uncertain Hybrid Nonlinear Systems Using Particle FiltersControl of Uncertain Hybrid Nonlinear Systems Using Particle Filters
Control of Uncertain Hybrid Nonlinear Systems Using Particle Filters
 

Más de Dr. Volkan OBAN

Conference Paper:IMAGE PROCESSING AND OBJECT DETECTION APPLICATION: INSURANCE...
Conference Paper:IMAGE PROCESSING AND OBJECT DETECTION APPLICATION: INSURANCE...Conference Paper:IMAGE PROCESSING AND OBJECT DETECTION APPLICATION: INSURANCE...
Conference Paper:IMAGE PROCESSING AND OBJECT DETECTION APPLICATION: INSURANCE...Dr. Volkan OBAN
 
Covid19py Python Package - Example
Covid19py  Python Package - ExampleCovid19py  Python Package - Example
Covid19py Python Package - ExampleDr. Volkan OBAN
 
Object detection with Python
Object detection with Python Object detection with Python
Object detection with Python Dr. Volkan OBAN
 
Python - Rastgele Orman(Random Forest) Parametreleri
Python - Rastgele Orman(Random Forest) ParametreleriPython - Rastgele Orman(Random Forest) Parametreleri
Python - Rastgele Orman(Random Forest) ParametreleriDr. Volkan OBAN
 
Linear Programming wi̇th R - Examples
Linear Programming wi̇th R - ExamplesLinear Programming wi̇th R - Examples
Linear Programming wi̇th R - ExamplesDr. Volkan OBAN
 
"optrees" package in R and examples.(optrees:finds optimal trees in weighted ...
"optrees" package in R and examples.(optrees:finds optimal trees in weighted ..."optrees" package in R and examples.(optrees:finds optimal trees in weighted ...
"optrees" package in R and examples.(optrees:finds optimal trees in weighted ...Dr. Volkan OBAN
 
k-means Clustering in Python
k-means Clustering in Pythonk-means Clustering in Python
k-means Clustering in PythonDr. Volkan OBAN
 
Naive Bayes Example using R
Naive Bayes Example using  R Naive Bayes Example using  R
Naive Bayes Example using R Dr. Volkan OBAN
 
k-means Clustering and Custergram with R
k-means Clustering and Custergram with Rk-means Clustering and Custergram with R
k-means Clustering and Custergram with RDr. Volkan OBAN
 
Data Science and its Relationship to Big Data and Data-Driven Decision Making
Data Science and its Relationship to Big Data and Data-Driven Decision MakingData Science and its Relationship to Big Data and Data-Driven Decision Making
Data Science and its Relationship to Big Data and Data-Driven Decision MakingDr. Volkan OBAN
 
Data Visualization with R.ggplot2 and its extensions examples.
Data Visualization with R.ggplot2 and its extensions examples.Data Visualization with R.ggplot2 and its extensions examples.
Data Visualization with R.ggplot2 and its extensions examples.Dr. Volkan OBAN
 
Scikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-PythonScikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-PythonDr. Volkan OBAN
 
Python Pandas for Data Science cheatsheet
Python Pandas for Data Science cheatsheet Python Pandas for Data Science cheatsheet
Python Pandas for Data Science cheatsheet Dr. Volkan OBAN
 
Pandas,scipy,numpy cheatsheet
Pandas,scipy,numpy cheatsheetPandas,scipy,numpy cheatsheet
Pandas,scipy,numpy cheatsheetDr. Volkan OBAN
 
ReporteRs package in R. forming powerpoint documents-an example
ReporteRs package in R. forming powerpoint documents-an exampleReporteRs package in R. forming powerpoint documents-an example
ReporteRs package in R. forming powerpoint documents-an exampleDr. Volkan OBAN
 
ReporteRs package in R. forming powerpoint documents-an example
ReporteRs package in R. forming powerpoint documents-an exampleReporteRs package in R. forming powerpoint documents-an example
ReporteRs package in R. forming powerpoint documents-an exampleDr. Volkan OBAN
 
R-ggplot2 package Examples
R-ggplot2 package ExamplesR-ggplot2 package Examples
R-ggplot2 package ExamplesDr. Volkan OBAN
 
R Machine Learning packages( generally used)
R Machine Learning packages( generally used)R Machine Learning packages( generally used)
R Machine Learning packages( generally used)Dr. Volkan OBAN
 
treemap package in R and examples.
treemap package in R and examples.treemap package in R and examples.
treemap package in R and examples.Dr. Volkan OBAN
 

Más de Dr. Volkan OBAN (20)

Conference Paper:IMAGE PROCESSING AND OBJECT DETECTION APPLICATION: INSURANCE...
Conference Paper:IMAGE PROCESSING AND OBJECT DETECTION APPLICATION: INSURANCE...Conference Paper:IMAGE PROCESSING AND OBJECT DETECTION APPLICATION: INSURANCE...
Conference Paper:IMAGE PROCESSING AND OBJECT DETECTION APPLICATION: INSURANCE...
 
Covid19py Python Package - Example
Covid19py  Python Package - ExampleCovid19py  Python Package - Example
Covid19py Python Package - Example
 
Object detection with Python
Object detection with Python Object detection with Python
Object detection with Python
 
Python - Rastgele Orman(Random Forest) Parametreleri
Python - Rastgele Orman(Random Forest) ParametreleriPython - Rastgele Orman(Random Forest) Parametreleri
Python - Rastgele Orman(Random Forest) Parametreleri
 
Linear Programming wi̇th R - Examples
Linear Programming wi̇th R - ExamplesLinear Programming wi̇th R - Examples
Linear Programming wi̇th R - Examples
 
"optrees" package in R and examples.(optrees:finds optimal trees in weighted ...
"optrees" package in R and examples.(optrees:finds optimal trees in weighted ..."optrees" package in R and examples.(optrees:finds optimal trees in weighted ...
"optrees" package in R and examples.(optrees:finds optimal trees in weighted ...
 
k-means Clustering in Python
k-means Clustering in Pythonk-means Clustering in Python
k-means Clustering in Python
 
Naive Bayes Example using R
Naive Bayes Example using  R Naive Bayes Example using  R
Naive Bayes Example using R
 
R forecasting Example
R forecasting ExampleR forecasting Example
R forecasting Example
 
k-means Clustering and Custergram with R
k-means Clustering and Custergram with Rk-means Clustering and Custergram with R
k-means Clustering and Custergram with R
 
Data Science and its Relationship to Big Data and Data-Driven Decision Making
Data Science and its Relationship to Big Data and Data-Driven Decision MakingData Science and its Relationship to Big Data and Data-Driven Decision Making
Data Science and its Relationship to Big Data and Data-Driven Decision Making
 
Data Visualization with R.ggplot2 and its extensions examples.
Data Visualization with R.ggplot2 and its extensions examples.Data Visualization with R.ggplot2 and its extensions examples.
Data Visualization with R.ggplot2 and its extensions examples.
 
Scikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-PythonScikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-Python
 
Python Pandas for Data Science cheatsheet
Python Pandas for Data Science cheatsheet Python Pandas for Data Science cheatsheet
Python Pandas for Data Science cheatsheet
 
Pandas,scipy,numpy cheatsheet
Pandas,scipy,numpy cheatsheetPandas,scipy,numpy cheatsheet
Pandas,scipy,numpy cheatsheet
 
ReporteRs package in R. forming powerpoint documents-an example
ReporteRs package in R. forming powerpoint documents-an exampleReporteRs package in R. forming powerpoint documents-an example
ReporteRs package in R. forming powerpoint documents-an example
 
ReporteRs package in R. forming powerpoint documents-an example
ReporteRs package in R. forming powerpoint documents-an exampleReporteRs package in R. forming powerpoint documents-an example
ReporteRs package in R. forming powerpoint documents-an example
 
R-ggplot2 package Examples
R-ggplot2 package ExamplesR-ggplot2 package Examples
R-ggplot2 package Examples
 
R Machine Learning packages( generally used)
R Machine Learning packages( generally used)R Machine Learning packages( generally used)
R Machine Learning packages( generally used)
 
treemap package in R and examples.
treemap package in R and examples.treemap package in R and examples.
treemap package in R and examples.
 

Último

Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Using Gradient Descent for Optimization and Learning

  • 12. Quadratic bowl (figure)
  • 13. Rosenbrock function (figure)
  • 14. Properties of Newton method • Newton method assumes the function is locally quadratic ("Beyond Newton method", Minka) • H must be positive definite • storing H and solving for ε are in O(d^2) • T(ρ) = O(d^2 log log(1/ρ))
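A minimal NumPy sketch of the Newton update ε = -η H^{-1} ∇_θ f; the quadratic test function, the fixed η = 1 and the use of a linear solve instead of an explicit inverse are illustrative assumptions, not from the slides:

```python
import numpy as np

def newton_step(grad, hess, theta, eta=1.0):
    """One damped Newton update: theta <- theta - eta * H^{-1} grad(theta)."""
    g = grad(theta)
    H = hess(theta)
    # Solve H * eps = g rather than forming H^{-1}; same O(d^3) order,
    # but cheaper in practice and numerically more stable.
    return theta - eta * np.linalg.solve(H, g)

# Illustrative quadratic f(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda th: A @ th - b
hess = lambda th: A

theta = newton_step(grad, hess, np.zeros(2))
# On an exactly quadratic function, a single full step (eta = 1) lands on the optimum.
```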
  • 15-16. Limitations of Newton method • Newton method looks powerful but... • H may be hard to compute • it is expensive
  • 17. Gauss-Newton • Only works when f is a sum of squared residuals! • f(θ) = (1/2) Σ_i r_i^2(θ) • ∂f/∂θ = Σ_i r_i ∂r_i/∂θ • ∂^2f/∂θ^2 = Σ_i [ r_i ∂^2r_i/∂θ^2 + (∂r_i/∂θ)(∂r_i/∂θ)^T ] • update: θ ← θ - η [ Σ_i (∂r_i/∂θ)(∂r_i/∂θ)^T ]^{-1} ∂f/∂θ
  • 18. Properties of Gauss-Newton • θ ← θ - η [ Σ_i (∂r_i/∂θ)(∂r_i/∂θ)^T ]^{-1} ∂f/∂θ • Discarded term: Σ_i r_i ∂^2r_i/∂θ^2 • Does not require the computation of H • Only valid close to the optimum, where the r_i are near 0 • Computation cost in O(d^2)
  • 19. Levenberg-Marquardt • θ ← θ - η [ Σ_i (∂r_i/∂θ)(∂r_i/∂θ)^T + λI ]^{-1} ∂f/∂θ • "Damped" Gauss-Newton • Intermediate between Gauss-Newton and steepest descent • Slower optimization but more robust • Cost in O(d^2)
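A sketch covering slides 17-19: one least-squares update in which λ = 0 recovers Gauss-Newton and λ > 0 gives Levenberg-Marquardt. The exponential-fit example, step size and damping value are assumptions chosen for illustration:

```python
import numpy as np

def gauss_newton_lm_step(theta, residuals, jacobian, eta=1.0, lam=0.0):
    """One Gauss-Newton (lam = 0) or Levenberg-Marquardt (lam > 0) step.

    residuals(theta) -> r of shape (n,); jacobian(theta) -> J of shape (n, d)
    with J[i, j] = d r_i / d theta_j.
    """
    r = residuals(theta)
    J = jacobian(theta)
    # J^T J approximates the Hessian: the r_i * d^2 r_i / d theta^2 term is dropped.
    approx_hess = J.T @ J + lam * np.eye(theta.size)
    grad = J.T @ r  # gradient of 0.5 * sum_i r_i^2
    return theta - eta * np.linalg.solve(approx_hess, grad)

# Illustrative fit of y ~ a * exp(b * x) to data generated with (a, b) = (2, -1.5)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
res = lambda th: th[0] * np.exp(th[1] * x) - y
jac = lambda th: np.stack([np.exp(th[1] * x),
                           th[0] * x * np.exp(th[1] * x)], axis=1)

theta = np.array([1.0, -1.0])
for _ in range(50):
    theta = gauss_newton_lm_step(theta, res, jac, lam=1e-3)
# theta is now close to (2, -1.5)
```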
  • 20-21. Quasi-Newton methods • Gauss-Newton and Levenberg-Marquardt can only be used in special cases • What about the general case? • H characterizes the change in gradient when moving in parameter space • Let's find a matrix which does the same!
  • 22-23. BFGS • We look for a matrix B such that ∇_θf(θ + ε) - ∇_θf(θ) = B_t^{-1} ε (secant equation) • Problem: this is underconstrained • Solution: set the additional constraint that ‖B_{t+1} - B_t‖_W be small
  • 24. BFGS - (2) • B_0 = I • while not converged: • p_t = -B_t ∇f(θ_t) • η_t = linemin(f, θ_t, p_t) • s_t = η_t p_t (change in parameter space) • θ_{t+1} = θ_t + s_t • y_t = ∇f(θ_{t+1}) - ∇f(θ_t) (change in gradient space) • ρ_t = (s_t^T y_t)^{-1} • B_{t+1} = (I - ρ_t s_t y_t^T) B_t (I - ρ_t y_t s_t^T) + ρ_t s_t s_t^T (stems from the Sherman-Morrison formula)
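A runnable sketch of this loop; the backtracking Armijo search stands in for linemin, which the slides leave unspecified, and the tolerances are arbitrary:

```python
import numpy as np

def bfgs(f, grad, theta0, iters=100, tol=1e-8):
    """BFGS with the inverse-Hessian update from the slide."""
    d = theta0.size
    B = np.eye(d)                        # B_t approximates H^{-1}
    theta = theta0.astype(float)
    g = grad(theta)
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        p = -B @ g                       # search direction
        eta, fx = 1.0, f(theta)
        # Crude line search: halve eta until the Armijo condition holds.
        while eta > 1e-12 and f(theta + eta * p) > fx + 1e-4 * eta * (g @ p):
            eta *= 0.5
        s = eta * p                      # change in parameter space
        theta_new = theta + s
        g_new = grad(theta_new)
        y = g_new - g                    # change in gradient space
        # Assumes s @ y > 0 (curvature condition); a Wolfe search would guarantee it.
        rho = 1.0 / (s @ y)
        I = np.eye(d)
        B = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)
        theta, g = theta_new, g_new
    return theta

# e.g. bfgs(lambda th: (th ** 2).sum(), lambda th: 2 * th, np.ones(3))
```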
  • 25. BFGS - (3) • Requires a line search • No matrix inversion required • Update in O(d^2) • Can we do better than O(d^2)?
  • 26. L-BFGS • Low-rank estimate of B • Based on the last m moves in parameter and gradient spaces • Cost O(md) per update • Same ballpark as steepest descent!
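In practice one rarely re-implements L-BFGS; here is a usage sketch with SciPy's standard implementation (the Rosenbrock test function is purely illustrative, and maxcor is SciPy's name for the history size m):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS keeps only the last m (s_t, y_t) pairs instead of a dense d x d matrix B.
result = minimize(rosen, x0=np.zeros(5), jac=rosen_der,
                  method="L-BFGS-B", options={"maxcor": 10})
print(result.x)  # close to the minimizer (1, 1, 1, 1, 1)
```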
  • 27. Conjugate Gradient • We want to solve Ax = b • Relies on conjugate directions • u and v are conjugate if u^T A v = 0 • d mutually conjugate directions form a basis of R^d • Goal: to move along conjugate directions close to the steepest-descent directions
  • 28. Nonlinear Conjugate Gradient • Extension to non-quadratic functions • Requires a line search in every direction (important for conjugacy!) • Various direction updates (Polak-Ribière)
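A sketch with the Polak-Ribière update (in its non-negative "PR+" variant, an assumption); the crude backtracking search trades away the accurate line search the slides insist on for brevity:

```python
import numpy as np

def nonlinear_cg(f, grad, theta0, iters=200, tol=1e-8):
    """Nonlinear conjugate gradient with a Polak-Ribiere direction update."""
    theta = theta0.astype(float)
    g = grad(theta)
    p = -g
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        eta, fx = 1.0, f(theta)
        while eta > 1e-12 and f(theta + eta * p) > fx + 1e-4 * eta * (g @ p):
            eta *= 0.5                     # backtrack until Armijo holds
        theta = theta + eta * p
        g_new = grad(theta)
        # Polak-Ribiere coefficient, clipped at 0 (acts as an automatic restart).
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        p = -g_new + beta * p
        g = g_new
    return theta
```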
  • 29. Going from batch to stochastic • dataset composed of n samples • f(θ) = (1/n) Σ_i f_i(θ, x_i) • Do we really need to see all the examples before making a parameter update?
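The per-sample alternative this question points to, as a minimal sketch; the fixed learning rate and epoch loop are illustrative choices:

```python
import numpy as np

def sgd(grad_i, theta0, n, epochs=10, eta=0.01):
    """Plain stochastic gradient descent: one update per sample instead of one
    update per full pass. grad_i(theta, i) returns the gradient of f_i at theta."""
    theta = theta0.astype(float)
    for _ in range(epochs):
        for i in np.random.permutation(n):   # visit the samples in random order
            theta -= eta * grad_i(theta, i)  # noisy but very cheap update
    return theta
```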
  • 30. Stochastic optimization • Information is redundant amongst samples • We can afford more frequent, noisier updates • But problems arise...
  • 31. Stochastic (Bottou) • Advantage: much faster convergence on large redundant datasets • Disadvantages: keeps bouncing around unless η is reduced • extremely hard to reach high accuracy • theoretical convergence conditions are less clearly established • most second-order methods will not work
  • 32. Batch (Bottou) • Advantages: guaranteed convergence to a local minimum under simple conditions • lots of tricks to speed up the learning • Disadvantage: painfully slow on large problems
  • 33. Problems arising in the stochastic setting • First-order descent: O(dn) ⟹ O(dn) • Second-order methods: O(d^2 + dn) ⟹ O(d^2 n) • Special cases: algorithms requiring a line search • BFGS: not critical, may be replaced by a one-step update • Conjugate Gradient: critical, no stochastic version
  • 34. Successful stochastic methods • Stochastic gradient descent • Online BFGS (Schraudolph, 2007) • Online L-BFGS (Schraudolph, 2007)
  • 35. Conclusions of the tutorial: batch methods • Second-order methods have much faster convergence • They are too expensive when d is large • Except for L-BFGS and nonlinear conjugate gradient
  • 36. Conclusions of the tutorial: stochastic methods • Much faster updates • Terrible convergence rates • Stochastic gradient descent: T(ρ) = O(d/ρ) • Second-order stochastic descent: T(ρ) = O(d^2/ρ)
  • 37. Outline • 1 Optimization: Basics, Approximations to Newton method, Stochastic Optimization • 2 Learning (Bottou) • 3 TONGA: Natural Gradient, Online Natural Gradient, Results
  • 38. Learning • Measure of quality f ("cost function") • Get training samples x_1, . . . , x_n from p • Choose a model F with parameters θ • Minimize E_p[f(θ, x)] • Time budget T
  • 39. Small-scale vs. large-scale learning • Small-scale learning problem: the active budget constraint is the number of examples n • Large-scale learning problem: the active budget constraint is the computing time T
  • 40. Which algorithm should we use? T(ρ) for various algorithms: • Gradient descent: T(ρ) = O(ndκ log(1/ρ)) • Second-order gradient descent: T(ρ) = O(d^2 log log(1/ρ)) • Stochastic gradient descent: T(ρ) = O(d/ρ) • Second-order stochastic descent: T(ρ) = O(d^2/ρ) • Second-order gradient descent seems a good choice!
  • 41. Large-scale learning • We are limited by the time T • We can choose between ρ and n • Better optimization means fewer examples
  • 42. Generalization error • E(f̃_n) - E(f*) = [E(f*_F) - E(f*)] (approximation error) + [E(f_n) - E(f*_F)] (estimation error) + [E(f̃_n) - E(f_n)] (optimization error) • There is no need to optimize thoroughly if we cannot process enough data points!
  • 43. Which algorithm should we use? Time to reach E(f̃_n) - E(f*) < ϵ: • Gradient descent: O((d^2 κ / ϵ^{1/α}) log^2(1/ϵ)) • Second-order gradient descent: O((d^2 κ / ϵ^{1/α}) log(1/ϵ) log log(1/ϵ)) • Stochastic gradient descent: O(d κ^2 / ϵ) • Second-order stochastic descent: O(d^2 / ϵ) • with 1/2 ≤ α ≤ 1 (statistical estimation rate)
  • 44. In a nutshell • Simple stochastic gradient descent is extremely efficient • Fast second-order stochastic gradient descent can win us a constant factor • Are there other possible factors of improvement?
  • 45. What we did not talk about (yet) • we have access to x_1, . . . , x_n ∼ p • we wish to minimize E_{x∼p}[f(θ, x)] • can we use the uncertainty in the dataset to get information about p?
  • 46. Outline • 1 Optimization: Basics, Approximations to Newton method, Stochastic Optimization • 2 Learning (Bottou) • 3 TONGA: Natural Gradient, Online Natural Gradient, Results
  • 47. Cost functions • Empirical cost function: f(θ) = (1/n) Σ_i f(θ, x_i) • True cost function: f*(θ) = E_{x∼p}[f(θ, x)]
  • 48. Gradients • Empirical gradient: g = (1/n) Σ_i ∇_θ f(θ, x_i) • True gradient: g* = E_{x∼p}[∇_θ f(θ, x)] • Central-limit theorem: g | g* ∼ N(g*, C/n)
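A small helper making these quantities concrete; it assumes the per-sample gradients have been stacked into an (n, d) array, which is an implementation choice rather than anything the slides prescribe:

```python
import numpy as np

def empirical_gradient_stats(grads):
    """grads: (n, d) array whose row i is grad_theta f(theta, x_i).
    Returns the empirical gradient g and sample covariance C, so that by the
    central-limit argument g | g* is approximately N(g*, C / n)."""
    g = grads.mean(axis=0)
    centered = grads - g
    C = centered.T @ centered / (grads.shape[0] - 1)
    return g, C
```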
  • 49. Posterior over g* • Using the prior g* ∼ N(0, σ^2 I), we have g* | g ∼ N( (I + C/(nσ^2))^{-1} g, (I/σ^2 + nC^{-1})^{-1} ) • and then ε^T g* | g ∼ N( ε^T (I + C/(nσ^2))^{-1} g, ε^T (I/σ^2 + nC^{-1})^{-1} ε )
  • 50-51. Aggressive strategy • ε^T g* | g ∼ N( ε^T (I + C/(nσ^2))^{-1} g, ε^T (I/σ^2 + nC^{-1})^{-1} ε ) • We want to minimize E_{g*}[ε^T g*] • Solution: ε = -η (I + C/(nσ^2))^{-1} g • This is the regularized natural gradient.
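A direct, dense sketch of this update; the O(d^3) solve is precisely what the online machinery below is designed to avoid, and η and σ^2 are illustrative values:

```python
import numpy as np

def regularized_natural_gradient_step(theta, g, C, n, eta=0.1, sigma2=1.0):
    """Aggressive strategy: eps = -eta * (I + C / (n * sigma^2))^{-1} g."""
    d = theta.size
    M = np.eye(d) + C / (n * sigma2)
    return theta - eta * np.linalg.solve(M, g)  # dense solve: O(d^3)
```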
  • 52-53. Conservative strategy • ε^T g* | g ∼ N( ε^T (I + C/(nσ^2))^{-1} g, ε^T (I/σ^2 + nC^{-1})^{-1} ε ) • We want to minimize Pr(ε^T g* > 0) • Solution: ε = -η (C/n)^{-1} g • This is the natural gradient.
  • 54. Online Natural Gradient • Natural gradient has nice properties • Its cost per iteration is in O(d^2) • Can we make it faster?
  • 55. Goals • ε = -η (I + C/(nσ^2))^{-1} g • We must be able to update and invert C whenever a new gradient arrives.
  • 56-57. Plan of action • Updating the (uncentered) covariance: C_t ∝ γ C_{t-1} + (1 - γ) g_t g_t^T • Inverting C: maintain low-rank estimates of C using its eigendecomposition.
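The covariance update is one line of NumPy; the discount γ is a tuning choice, and the slides' ∝ leaves the normalization implicit:

```python
import numpy as np

def update_covariance(C, g, gamma=0.995):
    """Exponentially weighted uncentered covariance:
    C_t = gamma * C_{t-1} + (1 - gamma) * g_t g_t^T."""
    return gamma * C + (1.0 - gamma) * np.outer(g, g)
```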
  • 58-59. Eigendecomposition of C • Computing k eigenvectors of C = GG^T is O(kd^2) • But computing k eigenvectors of G^T G is O(kp^2) (with G the d × p matrix of stored gradients)! • Still too expensive to compute for each new sample (and p must not grow) • Done only every b steps
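A sketch of this Gram-matrix trick: eigenpairs of the small p × p matrix G^T G map back to eigenpairs of C = GG^T, since G^T G v = w v implies C (Gv) = w (Gv):

```python
import numpy as np

def top_k_eigen(G, k):
    """Top-k eigenpairs of C = G G^T (d x d) via the p x p Gram matrix G^T G,
    for G of shape (d, p) with p << d."""
    w, V = np.linalg.eigh(G.T @ G)          # p x p problem, ascending eigenvalues
    w, V = w[::-1][:k], V[:, ::-1][:, :k]   # keep the k largest
    # Map back and normalize: ||G v||^2 = v^T G^T G v = w.
    U = G @ V / np.sqrt(np.maximum(w, 1e-12))
    return w, U
```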
  • 60. To prove I'm not cheating • Eigendecomposition every b steps; between two updates, for t = 1, . . . , b: • C_t = γ^t C + Σ_{k=1}^{t} γ^{t-k} g_k g_k^T + λI, and v_t = C_t^{-1} g_t • With X_t = [ γ^{t/2} U, γ^{(t-1)/2} g_1, . . . , γ^{1/2} g_{t-1}, g_t ], we get C_t = X_t X_t^T + λI • Writing v_t = X_t α_t and g_t = X_t y_t gives α_t = (X_t^T X_t + λI)^{-1} y_t, hence v_t = X_t (X_t^T X_t + λI)^{-1} y_t
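The final identity as code: since g_t is the last column of X_t (so g_t = X_t y_t with y_t a unit vector), v_t = C_t^{-1} g_t can be computed entirely in the small (k + t)-dimensional space. A sketch under that assumption:

```python
import numpy as np

def natural_gradient_between_updates(X, lam=1e-4):
    """With C_t = X X^T + lam * I and g_t the last column of X, compute
    v_t = C_t^{-1} g_t = X (X^T X + lam * I)^{-1} y_t, where y_t = e_last."""
    p = X.shape[1]
    y = np.zeros(p)
    y[-1] = 1.0                              # g_t = X @ y
    alpha = np.linalg.solve(X.T @ X + lam * np.eye(p), y)
    return X @ alpha                         # never forms the d x d matrix C_t
```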
  • 61. Computation complexity • d dimensions, k eigenvectors, b steps between two updates • Computing the eigendecomposition (every b steps): O(k(k + b)^2) • Computing the natural gradient: O(d(k + b) + (k + b)^2) • If (k + b) ≪ d, the cost per example is O(d(k + b))
  • 62. What next • Complexity in O(d(k + b)) • We need a small k • If d > 10^6, how large should k be?
  • 63. Decomposing C • C is almost block-diagonal!
  • 64. Quality of approximation (figures: ratio of the squared Frobenius norms vs. number k of eigenvectors kept, for the full-matrix and block-diagonal approximations; the block-diagonal curve reaches its minimum with k = 5)
  • 65. Evolution of C (figure)
  • 66. Results - MNIST (figures: negative log-likelihood and classification error on the test set vs. CPU time in seconds, for block-diagonal TONGA and stochastic gradient with batch sizes 1, 400, 1000 and 2000)
  • 67. Results - Rectangles (figures: negative log-likelihood and classification error on the validation set vs. CPU time in seconds, for stochastic gradient and block-diagonal TONGA)
  • 68. Results - USPS (figures: negative log-likelihood and classification error vs. CPU time in seconds, for block-diagonal TONGA and stochastic gradient)
  • 69. TONGA - Conclusion • Introducing uncertainty speeds up training • There exists a fast implementation of the online natural gradient
  • 70. Difference between C and H • H accounts for small changes in the parameter space • C accounts for small changes in the input space • C is not just a "cheap cousin" of H
  • 71. Future research • Much more to do! • Can we combine the effects of H and C? • Are there better approximations of C and H? • Anything you can think of!
  • 72. Thank you! Questions?