Optimization
General optimization problem
$\min_{x \in \mathbb{R}^d} f(x)$ with $x \in X \subseteq \mathbb{R}^d$
candidate solutions, variables, parameters $x \in \mathbb{R}^d$
objective function $f : \mathbb{R}^d \to \mathbb{R}$
typical technical assumption: f is continuous and differentiable
Q: Is this problem easy? OR: When is this easy?
Q: How to find the best solution (an optimal solution)?
Why? And How?
Optimization is everywhere
machine learning, big data, statistics, data analysis of all kinds, finance, logistics, planning, control theory, mathematics, search engines, simulations, and many other applications ...
Mathematical Modeling:
defining & modeling the optimization problem
Computational Optimization:
running an (appropriate) optimization algorithm
Optimization for Machine Learning
Mathematical Modeling:
defining & measuring the machine learning model
Computational Optimization:
learning the model parameters
Theory vs. practice:
libraries are available; algorithms are treated as a “black box” by most practitioners
“… just use Adam …”
Not here: we look inside the algorithms and try to understand why and how fast they work!
Optimization Algorithms
Optimization at large scale: simplicity rules!
In special cases (f is smooth, $X = \mathbb{R}^d$) we have “basic”/“simple” algorithms; a minimal sketch of the first one follows the list:
Gradient Descent
Stochastic Gradient Descent (SGD)
Coordinate Descent
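To make “basic” concrete, here is a minimal gradient-descent sketch in Python; the step size gamma, the iteration count, and the quadratic test objective are illustrative assumptions, not fixed by the slides:

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma=0.1, num_iters=100):
    """Plain gradient descent: x_{t+1} = x_t - gamma * grad_f(x_t)."""
    x = x0.copy()
    for _ in range(num_iters):
        x = x - gamma * grad_f(x)
    return x

# Example: f(x) = 1/2 ||x||^2, so grad_f(x) = x and the minimizer is 0.
print(gradient_descent(lambda x: x, x0=np.ones(2)))  # close to [0, 0]
```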
History:
1847: Cauchy proposes gradient descent
1950s: Linear Programs, soon followed by non-linear programming and SGD
1980s: General optimization, convergence theory
2005-today: Large scale optimization, convergence of SGD
Example: Coordinate Descent
Goal: Find $x^\star \in \mathbb{R}^d$ minimizing $f(x)$. (Example: $d = 2$)
[Figure: level sets of f in the $(x_1, x_2)$ plane, with the coordinate-descent path moving toward $x^\star$ one axis at a time]
Idea: Update one coordinate at a time, while keeping the others fixed.
Q: How to pick the coordinate direction? How to find out how far to go? Does it always converge?
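One possible answer to the “how” is sketched below: a minimal cyclic coordinate descent in Python, where exact one-dimensional minimization is replaced by a coordinate-wise gradient step; the objective, step size gamma, and epoch count are illustrative assumptions:

```python
import numpy as np

def coordinate_descent(grad_f, x0, gamma=0.1, num_epochs=50):
    """Cyclic coordinate descent: update one coordinate at a time,
    keeping all others fixed."""
    x = x0.copy()
    d = len(x)
    for _ in range(num_epochs):
        for i in range(d):                 # cycle through coordinates
            x[i] -= gamma * grad_f(x)[i]   # move only along coordinate i
    return x

# Example: f(x) = 1/2 (x1^2 + 10 x2^2), so grad_f(x) = (x1, 10 x2).
grad = lambda x: np.array([x[0], 10.0 * x[1]])
print(coordinate_descent(grad, np.array([1.0, 1.0])))  # approaches (0, 0)
```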
Definitions
$\min_{x \in \mathbb{R}^d} f(x)$ with $x \in X \subseteq \mathbb{R}^d$
P is an optimization problem (from a class of problems, $P \in \mathcal{P}$)
Oracle O answers questions for some optimization method M
Q: What kind of questions would we need answered?
A: What is f(x) for a given x? Is x ∈ X? Can we compute ∇f(x), and what is it?
The performance
The performance of M on P is the total amount of computational effort required by method M to solve the problem P.
Questions
“… to solve the problem …” Q: What does it mean?
Example: Let P be $\min_x \frac{1}{2}x^2$ and let M be such that, given x, it returns $x - \frac{x}{2}$. Q: Will we ever solve the problem?
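As a quick check, a few iterations of this method in Python (a worked illustration of the slide's example; the starting point is an arbitrary choice):

```python
x = 1.0  # starting point for min_x (1/2) x^2
for t in range(10):
    x = x - x / 2  # the method M: each step halves the iterate
    print(t, x)
# x_t = x_0 / 2^t: converges to the minimizer 0 but never equals it.
```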
Approximate solution to P
in many areas of numerical analysis, it is impossible to find an exact solution
relaxed goal: find an approximate solution to P with some accuracy $\epsilon > 0$!
let T be some termination criterion
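For instance (illustrative examples, not specified on the slide), T could be $\|\nabla f(x_k)\| \le \epsilon$, $f(x_k) - f(x^\star) \le \epsilon$, or a fixed budget of iterations.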
Complexity of General Iterative Scheme [N+18]
Analytical complexity
the number of calls of the oracle necessary to solve problem P to accuracy $\epsilon$
Arithmetical complexity
the total number of arithmetic operations (including the work of the oracle and the work of the method) necessary for solving problem P up to accuracy $\epsilon$
Standard Oracles
Zero-order oracle
returns the function value f(x)
First-order oracle
returns the function value f(x) and the gradient ∇f(x)
Second-order oracle
returns the function value f(x), the gradient ∇f(x), and the Hessian ∇²f(x)
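A small Python sketch of these three oracle interfaces, assuming the quadratic $f(x) = \frac{1}{2}\|x\|^2$ as the test function (the function names and structure are illustrative, not from the slides):

```python
import numpy as np

def zero_order_oracle(x):
    """Returns only f(x); here f(x) = 1/2 ||x||^2."""
    return 0.5 * np.dot(x, x)

def first_order_oracle(x):
    """Returns f(x) and the gradient; for this f, grad f(x) = x."""
    return 0.5 * np.dot(x, x), x

def second_order_oracle(x):
    """Returns f(x), grad f(x), and the Hessian (the identity here)."""
    return 0.5 * np.dot(x, x), x, np.eye(len(x))

f_val, grad, hess = second_order_oracle(np.array([1.0, 2.0]))
```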
Complexity Bounds for Global Optimization
Assume a simple problem
$\min_{x \in B_d} f(x)$, where $B_d = \{x \in \mathbb{R}^d : 0 \le x_i \le 1 \ \forall i\}$
Q: Can we find ε-solutions? How many times do we need to call the zero-order oracle O?
We need some assumptions on f to derive complexity bounds, and we need an algorithm!
Lipschitz Continuity of f
The function $f : \mathbb{R}^d \to \mathbb{R}$ is Lipschitz continuous on $B_d$: $|f(x) - f(y)| \le L \|x - y\|_\infty \ \forall x, y \in B_d$
Q: How can it help us?
A: Assume we split $B_d$ into a grid of points with spacing ∆. If we return the “best” grid point, what must ∆ be to guarantee an ε-solution?
Uniform Grid Method [N+18]
note that Nesterov uses n for the dimension of the problem
any two neighboring points x, y in the grid have $\|x - y\|_\infty \le \frac{1}{p}$
for $x^\star$, there is a grid point $\bar{x}$ such that $\|x^\star - \bar{x}\|_\infty \le \frac{1}{2p}$
hence $|f(\bar{x}) - f(x^\star)| \le L \|\bar{x} - x^\star\|_\infty \le \frac{L}{2p}$
Q: How many oracle calls does the method need?
Q: How to pick p to guarantee an ε-solution?
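A minimal sketch of the uniform grid method in Python, assuming a grid with p points per dimension at coordinates $(2i-1)/(2p)$ (as in Nesterov's construction) and a zero-order oracle; the test function is an illustrative assumption:

```python
import itertools
import numpy as np

def uniform_grid_search(f, d, p):
    """Evaluate f at all p^d grid points of [0,1]^d and return the best.

    Grid coordinates are (2i - 1) / (2p), i = 1..p, so every point of
    the box is within 1/(2p) of some grid point in the infinity norm.
    """
    coords = [(2 * i - 1) / (2 * p) for i in range(1, p + 1)]
    best_x, best_val = None, float("inf")
    for point in itertools.product(coords, repeat=d):  # p^d oracle calls
        val = f(np.array(point))
        if val < best_val:
            best_x, best_val = np.array(point), val
    return best_x, best_val

# Example: f(x) = ||x - 0.3||_1 on [0,1]^2, minimized at (0.3, 0.3).
f = lambda x: np.sum(np.abs(x - 0.3))
print(uniform_grid_search(f, d=2, p=10))
```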
Final Complexity
to find an ε-solution, we need $\frac{L}{2p} \le \epsilon \;\Rightarrow\; p = \left\lfloor \frac{L}{2\epsilon} \right\rfloor + 1$
Analytical Complexity
Q: How many calls of the zero-order oracle do we need?
A: We need $\left( \left\lfloor \frac{L}{2\epsilon} \right\rfloor + 1 \right)^d$ zero-order oracle calls
Q: Is this also the worst-case behaviour (a lower bound), or are we just using a very naïve algorithm?
Lower-Bound and Computational Need for a Tiny Problem
Lower-Bound
We can build an L-Lipschitz function that requires any method to explore $\left\lfloor \frac{L}{2\epsilon} \right\rfloor^d$ points before it can identify an ε-solution.
Example
Assume L = 2, d = 10 and ε = 0.01
if we change d to d + 1, then the estimate is multiplied by one hundred
if we multiply ε by two, we reduce the complexity by a factor of a thousand
if ε = 8%, then we need only two weeks
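To see the scale, a few lines of Python reproducing the arithmetic behind these claims; the relative factors are robust, while the assumed rate of 10^6 oracle calls per second used to express the total as wall-clock time is illustrative:

```python
# Lower bound on zero-order oracle calls: floor(L / (2*eps)) ** d
L, d, eps = 2.0, 10, 0.01
m = round(L / (2 * eps))   # = floor(L / (2*eps)) = 100 for these values
calls = m ** d
print(calls)               # 100**10 = 10**20 oracle calls

# Adding one dimension multiplies the count by m = 100.
print(m ** (d + 1) // calls)                 # 100

# Doubling eps divides the count by 2**d = 1024, about a thousand.
print(calls // round(L / (4 * eps)) ** d)    # 1024

# Wall-clock time at an assumed 10**6 oracle calls per second:
print(calls / 1e6 / (3600 * 24 * 365), "years")  # ~3 million years
```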
Conclusion
the simple example above shows that optimization is hard!
Q: What can save us?
we can assume some special properties of the problems
use a different oracle (e.g., use gradients)
Bibliography
Yurii Nesterov et al. Lectures on Convex Optimization, volume 137. Springer, 2018.
Thanks also to Prof. Martin Jaggi and Prof. Mark Schmidt for their slides and lectures and
[N+18].