Using Gradient Descent for Optimization and Learning
Nicolas Le Roux
15 May 2009
Why have a good optimizer?
• plenty of data is available everywhere
• we must extract information from it efficiently
Why gradient descent?
• cheap
• suitable for large models
• it works
Optimization vs Learning
Optimization
• function f to minimize
• time T(ρ) to reach error level ρ
Learning
• measure of quality f (cost function)
• get training samples x1, . . . , xn from p
• choose a model F with parameters θ
• minimize Ep[f(θ, x)]
Outline
1 Optimization
Basics
Approximations to Newton method
Stochastic Optimization
2 Learning (Bottou)
3 TONGA
Natural Gradient
Online Natural Gradient
Results
The basics
First-order Taylor expansion of f:
f(θ + ε) = f(θ) + εᵀ∇θf + o(‖ε‖)
The best improvement is obtained with
ε = −η∇θf,  η > 0
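In practice this update is a one-line loop. A minimal numpy sketch on a hypothetical quadratic bowl f(θ) = ½ θᵀAθ (the matrix A, the step size and the iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_steps=100):
    """Repeatedly apply the update theta <- theta - eta * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
    return theta

# Quadratic bowl f(theta) = 0.5 * theta^T A theta, so grad f(theta) = A theta.
A = np.diag([1.0, 10.0])
theta = gradient_descent(lambda th: A @ th, theta0=[1.0, 1.0], eta=0.09, n_steps=500)
```

Any η below 2/λmax = 0.2 makes every coordinate contract toward the minimum here; larger steps would diverge, which is exactly what the optimal-learning-rate analysis quantifies.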
Quadratic bowl
[Figure: gradient descent trajectories on a quadratic bowl for η = 0.1 and η = 0.3]
Optimal learning rate
For an eigenvalue λ of the Hessian, the optimal step size is 1/λ and divergence occurs beyond 2/λ:
ηmin,opt = 1/λmin   ηmin,div = 2/λmin
ηmax,opt = 1/λmax   ηmax,div = 2/λmax
With η = ηmax,opt, the condition number is
κ = ηmin,opt / η = λmax / λmin
T(ρ) = O(dκ log(1/ρ))
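These thresholds are easy to check numerically. A small sketch on a one-dimensional quadratic with curvature λ (the specific λ and the step counts are illustrative):

```python
def gd_1d(lam, eta, theta0=1.0, n_steps=50):
    """Gradient descent on f(theta) = lam * theta^2 / 2, so grad f = lam * theta.
    Each step multiplies theta by (1 - eta * lam)."""
    theta = theta0
    for _ in range(n_steps):
        theta -= eta * lam * theta
    return theta

lam = 4.0
converged = abs(gd_1d(lam, eta=1.9 / lam))  # |1 - 1.9| = 0.9, oscillates but shrinks
one_step  = abs(gd_1d(lam, eta=1.0 / lam))  # optimal rate: reaches 0 in one step
diverged  = abs(gd_1d(lam, eta=2.1 / lam))  # |1 - 2.1| = 1.1, grows without bound
```

The contraction factor per step is |1 − ηλ|, which is 0 at η = 1/λ and exceeds 1 beyond η = 2/λ.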
Limitations of gradient descent
• Speed of convergence depends heavily on κ
• Not invariant to linear transformations:
θ′ = kθ
f(θ′ + ε) = f(θ′) + εᵀ∇θ′f + o(‖ε‖)
ε = −η∇θ′f
But ∇θ′f = ∇θf / k !
Newton method
Second-order Taylor expansion of f around θ:
f(θ + ε) = f(θ) + εᵀ∇θf + εᵀHε/2 + o(‖ε‖²)
The best improvement is obtained with
ε = −ηH⁻¹∇θf
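A sketch of why this is attractive: on an exactly quadratic function, the full Newton step (η = 1) lands on the minimum in a single iteration, regardless of conditioning. The matrix A below is an arbitrary illustrative choice:

```python
import numpy as np

def newton_step(grad, hess, theta, eta=1.0):
    """eps = -eta * H^{-1} grad(theta); solve the linear system instead of
    forming the inverse explicitly."""
    return theta - eta * np.linalg.solve(hess(theta), grad(theta))

# On f(theta) = 0.5 * theta^T A theta, grad f = A theta and H = A, so one
# full Newton step jumps straight to the minimum, whatever kappa is.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
theta = newton_step(lambda th: A @ th, lambda th: A, np.array([5.0, -7.0]))
```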
Quadratic bowl
[Figure: Newton method trajectory on the quadratic bowl]
Rosenbrock function
[Figure: Newton method trajectory on the Rosenbrock function]
Properties of Newton method
• Newton method assumes the function is locally quadratic (Beyond Newton's method, Minka)
• H must be positive definite
• Storing H and finding ε are in O(d²)
T(ρ) = O(d² log log(1/ρ))
Limitations of Newton method
• Newton method looks powerful, but...
• H may be hard to compute
• it is expensive
Gauss-Newton
Only works when f is a sum of squared residuals!
f(θ) = (1/2) Σᵢ rᵢ²(θ)
∂f/∂θ = Σᵢ rᵢ ∂rᵢ/∂θ
∂²f/∂θ² = Σᵢ [ rᵢ ∂²rᵢ/∂θ² + (∂rᵢ/∂θ)(∂rᵢ/∂θ)ᵀ ]
θ ← θ − η [ Σᵢ (∂rᵢ/∂θ)(∂rᵢ/∂θ)ᵀ ]⁻¹ ∂f/∂θ
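A minimal sketch of the resulting iteration, on a hypothetical one-parameter least-squares problem (fitting y = exp(a·x); the data, starting point and step count are illustrative):

```python
import numpy as np

def gauss_newton(residual, jac, theta, eta=1.0, n_steps=20):
    """theta <- theta - eta * (J^T J)^{-1} J^T r, where J_ij = d r_i / d theta_j.
    Note J^T r is exactly the gradient of f = 0.5 * sum_i r_i^2."""
    for _ in range(n_steps):
        r, J = residual(theta), jac(theta)
        theta = theta - eta * np.linalg.solve(J.T @ J, J.T @ r)
    return theta

# Toy problem: recover a = 0.7 from noiseless samples of y = exp(a * x).
x = np.array([0.0, 0.5, 1.0, 1.5])
y = np.exp(0.7 * x)
res = lambda th: np.exp(th[0] * x) - y              # residuals r_i(theta)
jac = lambda th: (x * np.exp(th[0] * x))[:, None]   # dr_i/da, shape (n, 1)
a = gauss_newton(res, jac, np.array([0.0]))
```

Near the solution the residuals vanish, so the discarded curvature term is negligible and convergence is fast, matching the property listed on the next slide.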
Properties of Gauss-Newton
θ ← θ − η [ Σᵢ (∂rᵢ/∂θ)(∂rᵢ/∂θ)ᵀ ]⁻¹ ∂f/∂θ
Discarded term: Σᵢ rᵢ ∂²rᵢ/∂θ²
• Does not require the computation of H
• Only valid close to the optimum, where rᵢ ≈ 0
• Computational cost in O(d²)
Levenberg-Marquardt
θ ← θ − η [ Σᵢ (∂rᵢ/∂θ)(∂rᵢ/∂θ)ᵀ + λI ]⁻¹ ∂f/∂θ
• “Damped” Gauss-Newton
• Intermediate between Gauss-Newton and steepest descent
• Slower optimization, but more robust
• Cost in O(d²)
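The interpolation between the two regimes is easy to see numerically. A sketch with an arbitrary illustrative Jacobian and residual vector:

```python
import numpy as np

def lm_step(r, J, theta, lam, eta=1.0):
    """Damped Gauss-Newton: theta - eta * (J^T J + lam * I)^{-1} J^T r."""
    d = J.shape[1]
    return theta - eta * np.linalg.solve(J.T @ J + lam * np.eye(d), J.T @ r)

# With lam -> 0 this reduces to the Gauss-Newton step; with large lam it
# approaches a small steepest-descent step, -(J^T r) / lam.
J = np.array([[1.0, 0.0], [0.0, 0.1]])
r = np.array([1.0, 1.0])
theta = np.zeros(2)
gn_like = lm_step(r, J, theta, lam=1e-12)
sd_like = lm_step(r, J, theta, lam=1e6)
```

Adapting λ during optimization (raise it when a step fails, lower it when it succeeds) is what makes the method robust far from the optimum.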
Quasi-Newton methods
• Gauss-Newton and Levenberg-Marquardt can only be used in special cases
• What about the general case?
• H characterizes the change in gradient when moving in parameter space
• Let's find a matrix which does the same!
BFGS
We look for a matrix B such that
∇θf(θ + ε) − ∇θf(θ) = Bt⁻¹ε  (secant equation)
• Problem: this is underconstrained
• Solution: add the constraint that ‖Bt+1 − Bt‖W be small
BFGS - (2)
• B0 = I
• while not converged:
  • pt = −Bt ∇f(θt)
  • ηt = linemin(f, θt, pt)
  • st = ηt pt (change in parameter space)
  • θt+1 = θt + st
  • yt = ∇f(θt+1) − ∇f(θt) (change in gradient space)
  • ρt = (stᵀyt)⁻¹
  • Bt+1 = (I − ρt st ytᵀ) Bt (I − ρt yt stᵀ) + ρt st stᵀ
(stems from the Sherman-Morrison formula)
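The loop above transcribes almost directly into numpy. In this sketch a crude backtracking search stands in for linemin (an assumption; the slides do not specify the line search), and the update is skipped when the curvature condition sᵀy > 0 fails:

```python
import numpy as np

def bfgs(f, grad, theta0, n_iters=50):
    """BFGS with B approximating the *inverse* Hessian, as in the slide."""
    theta = np.asarray(theta0, dtype=float)
    B = np.eye(theta.size)
    g = grad(theta)
    for _ in range(n_iters):
        p = -B @ g
        eta = 1.0                      # crude backtracking line search
        while f(theta + eta * p) > f(theta) and eta > 1e-12:
            eta *= 0.5
        s = eta * p                    # change in parameter space
        theta_next = theta + s
        g_next = grad(theta_next)
        y = g_next - g                 # change in gradient space
        sy = s @ y
        if sy > 1e-12:                 # skip the update if curvature is bad
            rho = 1.0 / sy
            I = np.eye(theta.size)
            B = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        theta, g = theta_next, g_next
    return theta

A = np.diag([1.0, 10.0])               # ill-conditioned quadratic, kappa = 10
theta = bfgs(lambda th: 0.5 * th @ A @ th, lambda th: A @ th, [1.0, 1.0])
```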
BFGS - (3)
• Requires a line search
• No matrix inversion required
• Update in O(d²)
• Can we do better than O(d²)?
L-BFGS
• Low-rank estimate of B
• Based on the last m moves in parameter and gradient spaces
• Cost O(md) per update
• Same ballpark as steepest descent!
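The low-rank estimate is usually applied through the standard two-loop recursion, which multiplies the implicit inverse-Hessian estimate by a gradient in O(md) without ever forming a matrix. A sketch (the initial scaling choice and the toy check are illustrative):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: apply the inverse-Hessian estimate built from the
    last m (s, y) pairs to gradient g, in O(m d) time and memory."""
    q = g.copy()
    alphas = []
    for s, y in reversed(list(zip(s_list, y_list))):
        alpha = (s @ q) / (s @ y)
        q -= alpha * y
        alphas.append(alpha)
    if s_list:                         # common initial scaling: s^T y / y^T y
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):
        beta = (y @ q) / (s @ y)
        q += (alpha - beta) * s
    return -q                          # descent direction -B g

# With d independent (s, y) pairs from f(x) = 0.5 x^T A x, the recursion
# reproduces the exact Newton direction -A^{-1} g.
A = np.diag([2.0, 5.0])
s_list = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
y_list = [A @ s for s in s_list]
direction = lbfgs_direction(np.array([4.0, 10.0]), s_list, y_list)
```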
Conjugate Gradient
We want to solve Ax = b
• Relies on conjugate directions
• u and v are conjugate if uᵀAv = 0
• d mutually conjugate directions form a basis of Rᵈ
• Goal: move along conjugate directions close to the steepest-descent directions
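A minimal sketch of linear conjugate gradient for Ax = b (the 2×2 system is an illustrative example; in exact arithmetic at most d iterations are needed):

```python
import numpy as np

def conjugate_gradient(A, b, n_iters=None):
    """Solve A x = b for symmetric positive definite A. Successive search
    directions p are A-conjugate: p_i^T A p_j = 0 for i != j."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual = -gradient of 0.5 x^T A x - b^T x
    p = r.copy()
    for _ in range(n_iters or len(b)):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)         # exact line minimization along p
        x += alpha * p
        r_next = r - alpha * Ap
        beta = (r_next @ r_next) / (r @ r) # makes the next direction conjugate
        p = r_next + beta * p
        r = r_next
        if np.linalg.norm(r) < 1e-12:
            break
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```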
Nonlinear Conjugate Gradient
• Extension to non-quadratic functions
• Requires a line search in every direction (important for conjugacy!)
• Various direction updates (e.g. Polak-Ribière)
Going from batch to stochastic
• dataset composed of n samples
• f(θ) = (1/n) Σᵢ fᵢ(θ, xᵢ)
Do we really need to see all the examples before making a parameter update?
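A sketch of the stochastic alternative: update after every sample rather than after a full pass. The decaying step-size schedule and the toy mean-estimation problem are illustrative assumptions, not from the slides:

```python
import numpy as np

def sgd(grad_i, theta0, n_samples, eta0=0.1, n_epochs=50, seed=0):
    """One update per sample, with a decaying step size eta_t = eta0 / (1 + eta0 t),
    a common Robbins-Monro-style schedule."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n_samples):
            theta -= eta0 / (1.0 + eta0 * t) * grad_i(theta, i)
            t += 1
    return theta

# Toy problem: estimate the mean of noisy observations,
# f_i(theta) = 0.5 * (theta - x_i)^2, so grad f_i = theta - x_i.
rng = np.random.default_rng(1)
xs = 3.0 + 0.1 * rng.standard_normal(200)
theta = sgd(lambda th, i: th - xs[i], theta0=0.0, n_samples=200)
```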
Stochastic optimization
• Information is redundant amongst samples
• We can afford more frequent, noisier updates
• But problems arise...
Stochastic (Bottou)
Advantage
• much faster convergence on large, redundant datasets
Disadvantages
• keeps bouncing around unless η is reduced
• extremely hard to reach high accuracy
• convergence guarantees are less well established
• most second-order methods will not work
Batch (Bottou)
Advantages
• guaranteed convergence to a local minimum under simple conditions
• lots of tricks to speed up the learning
Disadvantage
• painfully slow on large problems
Problems arising in the stochastic setting
• First-order descent: O(dn) =⇒ O(dn)
• Second-order methods: O(d² + dn) =⇒ O(d²n)
• Special cases: algorithms requiring a line search
  • BFGS: not critical, the line search may be replaced by a one-step update
  • Conjugate Gradient: critical, no stochastic version
Successful stochastic methods
• Stochastic gradient descent
• Online BFGS (Schraudolph, 2007)
• Online L-BFGS (Schraudolph, 2007)
Conclusions of the tutorial
Batch methods
• Second-order methods have much faster convergence
• They are too expensive when d is large, except for L-BFGS and Nonlinear Conjugate Gradient
Conclusions of the tutorial
Stochastic methods
• Much faster updates
• Terrible convergence rates
• Stochastic Gradient Descent: T(ρ) = O(d/ρ)
• Second-Order Stochastic Descent: T(ρ) = O(d²/ρ)
Outline
1 Optimization
Basics
Approximations to Newton method
Stochastic Optimization
2 Learning (Bottou)
3 TONGA
Natural Gradient
Online Natural Gradient
Results
Learning
• Measure of quality f (“cost function”)
• Get training samples x1, . . . , xn from p
• Choose a model F with parameters θ
• Minimize Ep[f(θ, x)]
• Time budget T
Small-scale vs. Large-scale Learning
• Small-scale learning problem: the active budget constraint is the number of examples n.
• Large-scale learning problem: the active budget constraint is the computing time T.
Which algorithm should we use?
T(ρ) for various algorithms:
• Gradient Descent: T(ρ) = O(ndκ log(1/ρ))
• Second-Order Gradient Descent: T(ρ) = O(d² log log(1/ρ))
• Stochastic Gradient Descent: T(ρ) = O(d/ρ)
• Second-Order Stochastic Descent: T(ρ) = O(d²/ρ)
Second-Order Gradient Descent seems a good choice!
Large-scale learning
• We are limited by the time T
• We can choose between ρ and n
• Better optimization means fewer examples
Generalization error
E(f̃n) − E(f∗) = E(f∗F) − E(f∗)  (approximation error)
             + E(fn) − E(f∗F)  (estimation error)
             + E(f̃n) − E(fn)  (optimization error)
There is no need to optimize thoroughly if we cannot process enough data points!
Which algorithm should we use?
Time to reach E(f̃n) − E(f∗) < ε:
• Gradient Descent: O( (d²κ/ε^(1/α)) log²(1/ε) )
• Second-Order Gradient Descent: O( (d²κ/ε^(1/α)) log(1/ε) log log(1/ε) )
• Stochastic Gradient Descent: O(dκ²/ε)
• Second-Order Stochastic Descent: O(d²/ε)
with 1/2 ≤ α ≤ 1 (statistical estimation rate).
In a nutshell
• Simple stochastic gradient descent is extremely efficient
• Fast second-order stochastic gradient descent can win us a constant factor
• Are there other possible factors of improvement?
What we did not talk about (yet)
• we have access to x1, . . . , xn ∼ p
• we wish to minimize E_{x∼p}[f(θ, x)]
• can we use the uncertainty in the dataset to get information about p?
Outline
1 Optimization
Basics
Approximations to Newton method
Stochastic Optimization
2 Learning (Bottou)
3 TONGA
Natural Gradient
Online Natural Gradient
Results
Cost functions
Empirical cost function:
f(θ) = (1/n) Σᵢ f(θ, xᵢ)
True cost function:
f∗(θ) = E_{x∼p}[f(θ, x)]
Gradients
Empirical gradient:
g = (1/n) Σᵢ ∇θf(θ, xᵢ)
True gradient:
g∗ = E_{x∼p}[∇θf(θ, x)]
Central limit theorem:
g | g∗ ∼ N(g∗, C/n)
Posterior over g∗
Using the prior
g∗ ∼ N(0, σ²I)
we have
g∗ | g ∼ N( (I + C/(nσ²))⁻¹ g, (I/σ² + nC⁻¹)⁻¹ )
and then
εᵀg∗ | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g, εᵀ(I/σ² + nC⁻¹)⁻¹ ε )
Aggressive strategy
εᵀg∗ | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g, εᵀ(I/σ² + nC⁻¹)⁻¹ ε )
• we want to minimize E_{g∗}[εᵀg∗]
• Solution:
ε = −η (I + C/(nσ²))⁻¹ g
This is the regularized natural gradient.
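The aggressive step translates directly into numpy. In this sketch the toy covariance, σ² and η are illustrative; solving the linear system avoids forming the inverse:

```python
import numpy as np

def regularized_natural_gradient(g, C, n, sigma2, eta):
    """eps = -eta * (I + C / (n * sigma^2))^{-1} g."""
    d = g.shape[0]
    return -eta * np.linalg.solve(np.eye(d) + C / (n * sigma2), g)

# Toy example: the second coordinate of the gradient is very noisy
# (large covariance), so its step is shrunk relative to the quiet one.
g = np.array([1.0, 1.0])
C = np.diag([0.0, 99.0])
eps = regularized_natural_gradient(g, C, n=1, sigma2=1.0, eta=0.1)
```

Directions in which the per-sample gradients disagree get small steps; directions in which they agree keep the plain gradient step.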
Conservative strategy
εᵀg∗ | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g, εᵀ(I/σ² + nC⁻¹)⁻¹ ε )
• we want to minimize Pr(εᵀg∗ > 0)
• Solution:
ε = −η (C/n)⁻¹ g
This is the natural gradient.
Online Natural Gradient
• Natural gradient has nice properties
• Its cost per iteration is in O(d²)
• Can we make it faster?
Goals
ε = −η (I + C/(nσ²))⁻¹ g
We must be able to update and invert C whenever a new gradient arrives.
Plan of action
Updating the (uncentered) covariance C:
Ct ∝ γCt−1 + (1 − γ) gt gtᵀ
Inverting C:
• maintain a low-rank estimate of C through its eigendecomposition.
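The covariance update is a rank-one exponential moving average. A sketch (γ and the toy gradient stream are illustrative):

```python
import numpy as np

def update_covariance(C, g, gamma=0.995):
    """Exponentially decayed (uncentered) covariance of per-sample gradients:
    C_t = gamma * C_{t-1} + (1 - gamma) * g_t g_t^T."""
    return gamma * C + (1.0 - gamma) * np.outer(g, g)

# Feeding in gradients drawn from a fixed distribution, C converges to
# E[g g^T] (here the identity, since g ~ N(0, I)).
rng = np.random.default_rng(0)
C = np.zeros((3, 3))
for _ in range(5000):
    C = update_covariance(C, rng.standard_normal(3))
```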
Eigendecomposition of C
• Computing k eigenvectors of C = GGᵀ is O(kd²)
• But computing k eigenvectors of GᵀG is O(kp²)!
• Still too expensive to compute for each new sample (and p must not grow)
• Done only every b steps
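The trick relies on the fact that if GᵀGv = λv then Gv is an eigenvector of GGᵀ with the same eigenvalue, and ‖Gv‖² = λ. A sketch with illustrative sizes d and p (the dense check against C is for verification only; it would be skipped in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 500, 10                     # d parameters, p stored gradients
G = rng.standard_normal((d, p))    # columns of G are gradients

# Eigendecompose the small p x p Gram matrix instead of the d x d C.
lam, V = np.linalg.eigh(G.T @ G)
# Lift: each column G v, normalized by sqrt(lambda), is a unit eigenvector
# of C = G G^T with the same eigenvalue.
U = G @ V / np.sqrt(np.maximum(lam, 1e-12))

C = G @ G.T                        # verification only
errors = np.linalg.norm(C @ U - U * lam, axis=0)
```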
To prove I'm not cheating
• eigendecomposition every b steps
Ct = γᵗC + Σ_{k=1..t} γ^(t−k) gk gkᵀ + λI,  t = 1, . . . , b
vt = Ct⁻¹ gt
Xt = [ γ^(t/2)U  γ^((t−1)/2)g1  . . .  γ^(1/2)gt−1  gt ]
Ct = XtXtᵀ + λI,  vt = Xtαt,  gt = Xtyt
αt = (XtᵀXt + λI)⁻¹ yt
vt = Xt(XtᵀXt + λI)⁻¹ yt
Computational complexity
• d dimensions
• k eigenvectors
• b steps between two updates
• Computing the eigendecomposition (every b steps): O(k(k + b)²)
• Computing the natural gradient: O(d(k + b) + (k + b)²)
• If k + b ≪ d, the cost per example is O(d(k + b))
What next
• Complexity in O(d(k + b))
• We need a small k
• If d > 10⁶, how large should k be?
Decomposing C
C is almost block-diagonal!
Quality of approximation
[Figure: ratio of the squared Frobenius norms versus the number k of eigenvectors kept, for the full-matrix and the block-diagonal approximations of C; the right panel zooms in on small k. Minimum with k = 5.]
Evolution of C
[Figure: evolution of the covariance matrix C during training]
Results - MNIST
[Figure: negative log-likelihood and classification error on the test set versus CPU time (in seconds): block-diagonal TONGA against stochastic gradient with batch sizes 1, 400, 1000 and 2000.]
Results - Rectangles
[Figure: negative log-likelihood and classification error on the validation set versus CPU time (in seconds): block-diagonal TONGA against stochastic gradient.]
Results - USPS
[Figure: negative log-likelihood and classification error versus CPU time (in seconds): block-diagonal TONGA against stochastic gradient.]
TONGA - Conclusion
• Introducing uncertainty speeds up training
• There exists a fast implementation of the online natural gradient
Difference between C and H
• H accounts for small changes in the parameter space
• C accounts for small changes in the input space
• C is not just a "cheap cousin" of H
Future research
Much more to do!
• Can we combine the effects of H and C?
• Are there better approximations of C and H?
• Anything you can think of!
Thank you!
Questions?