# Introduction to optimization

March 26, 2023

1. MTH702 Optimization: Nonlinear Optimization
5. Optimization. General optimization problem: $\min_{x \in \mathbb{R}^d} f(x)$ with $x \in X \subseteq \mathbb{R}^d$.
    - Candidate solutions (variables, parameters): $x \in \mathbb{R}^d$.
    - Objective function: $f : \mathbb{R}^d \to \mathbb{R}$.
    - Typical technical assumption: $f$ is continuous and differentiable.
    - Q: Is this problem easy? When is it easy?
    - Q: How do we find the best (optimal) solution?
6. Questions, for $\min_{x \in \mathbb{R}^d} f(x)$ with $x \in X \subseteq \mathbb{R}^d$:
    - Q: Why do we study optimization?
    - Q: What are you hoping to learn?
7. Why? And how? Optimization is everywhere: machine learning, big data, statistics, data analysis of all kinds, finance, logistics, planning, control theory, mathematics, search engines, simulations, and many other applications.
    - Mathematical modeling: defining and modeling the optimization problem.
    - Computational optimization: running an (appropriate) optimization algorithm.
10. Optimization for Machine Learning
    - Mathematical modeling: defining and measuring the machine learning model.
    - Computational optimization: learning the model parameters.
    - Theory vs. practice: libraries are available, and most practitioners treat the algorithms as a "black box" ("... just use Adam ...").
    - Not here: we look inside the algorithms and try to understand why they work and how fast.
12. Optimization Algorithms. Optimization at large scale: simplicity rules! In special cases ($f$ is smooth, $X = \mathbb{R}^d$) we have "basic"/"simple" algorithms:
    - Gradient Descent
    - Stochastic Gradient Descent (SGD)
    - Coordinate Descent

    History:
    - 1847: Cauchy proposes gradient descent
    - 1950s: linear programs, soon followed by non-linear programs and SGD
    - 1980s: general optimization, convergence theory
    - 2005-today: large-scale optimization, convergence of SGD
15. Example: Coordinate Descent. Goal: find $x^* \in \mathbb{R}^d$ minimizing $f(x)$ (example: $d = 2$).
    - (Figure: the minimizer $x^*$ in the $(x_1, x_2)$ plane.)
    - Idea: update one coordinate at a time, while keeping the others fixed.
    - Q: How to pick the coordinate direction? How to find out how far to go? Does it always ...
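
A minimal sketch of the coordinate descent idea, under assumptions that are not in the slides: a hypothetical convex quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$ with $d = 2$, coordinates visited cyclically, and exact minimization along each coordinate. The slides leave both the choice of coordinate and the step length as open questions, so this is only one possible answer.

```python
def coordinate_descent(A, b, x0, n_sweeps=50):
    """Cyclic coordinate descent on f(x) = 0.5 * x^T A x - b^T x.

    Assumption (not from the slides): A is symmetric positive definite,
    so each one-dimensional subproblem has a closed-form minimizer.
    """
    x = list(x0)
    d = len(x)
    for _ in range(n_sweeps):
        for i in range(d):
            # Partial derivative of f along coordinate i: (A x - b)_i
            grad_i = sum(A[i][j] * x[j] for j in range(d)) - b[i]
            # Exact minimization along coordinate i, all others fixed
            x[i] -= grad_i / A[i][i]
    return x


# Hypothetical 2-D example; the exact minimizer solves A x = b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
x_cd = coordinate_descent(A, b, [0.0, 0.0])
```

For this particular `A` and `b` the linear system $Ax = b$ has the solution $(0.2, 0.4)$, which the iterates approach.
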
16. Oracle
18. Definitions, for $\min_{x \in \mathbb{R}^d} f(x)$ with $x \in X \subseteq \mathbb{R}^d$:
    - $P$ is an optimization problem from a class of problems $\mathcal{P}$ ($P \in \mathcal{P}$).
    - An oracle $O$ answers questions for some optimization method $M$.
    - Q: What kind of questions would we need answered?
    - A: What is $f(x)$ for a given $x$? Is $x \in X$? Can we compute $\nabla f(x)$, and what is it?
    - Performance: the performance of $M$ on $P$ is the total amount of computational effort required by method $M$ to solve the problem $P$.
23. Questions. The performance of $M$ on $P$ is the total amount of computational effort required by method $M$ to solve the problem $P$.
    - "... to solve the problem ...": Q: What does that mean?
    - Example: $\min_x \frac{1}{2} x^2$, and let $M$ be such that, given $x$, it returns $x - \frac{x}{2}$. Q: Will we even solve the problem?
    - Approximate solution to $P$: in many areas of numerical analysis, it is impossible to find an exact solution.
    - Relaxed goal: find an approximate solution to $P$ with some accuracy $\epsilon > 0$.
    - Let $T$ be some termination criterion.
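
The example can be traced in a few lines. The sketch below uses exact rational arithmetic so that rounding plays no role, and it assumes one particular termination criterion, $f(x) \le \epsilon$, plus a hypothetical starting point $x_0 = 1$; neither is fixed by the slides. Starting from any $x_0 \ne 0$, the iterates are $x_k = x_0 / 2^k$, so the method never reaches the exact minimizer $0$, yet it meets the relaxed goal after finitely many steps.

```python
from fractions import Fraction


def f(x):
    # Objective from the example: f(x) = x^2 / 2
    return x * x / 2


def method_step(x):
    # The method M from the example: given x, return x - x/2
    return x - x / 2


def run_until(x0, eps, max_iters=10_000):
    """Iterate M until the assumed criterion f(x) <= eps holds."""
    x, k = Fraction(x0), 0
    while f(x) > eps and k < max_iters:
        x = method_step(x)
        k += 1
    return x, k


# Hypothetical starting point and accuracy
x_final, n_steps = run_until(x0=1, eps=Fraction(1, 1000))
```

With these values the loop stops after 5 steps at $x = 1/32$, which is still nonzero.
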
24. Complexity of General Iterative Scheme [N+18]
    - Analytical complexity: the number of oracle calls necessary to solve problem $P$ to accuracy $\epsilon$.
    - Arithmetical complexity: the total number of arithmetic operations (including the work of the oracle and the work of the method) necessary to solve problem $P$ to accuracy $\epsilon$.
25. Standard Oracles
    - Zero-order oracle: returns the function value $f(x)$.
    - First-order oracle: returns $f(x)$ and the gradient $\nabla f(x)$.
    - Second-order oracle: returns $f(x)$, $\nabla f(x)$, and the Hessian $\nabla^2 f(x)$.
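
As a rough illustration of the three oracle types, the sketch below wraps a hypothetical separable quadratic $f(x) = \frac{1}{2} \sum_i c_i x_i^2$ (chosen only because its gradient and Hessian are easy to write down) and counts queries, which is the quantity that analytical complexity measures. The class and method names are made up for this example.

```python
class QuadraticOracle:
    """Oracles for the hypothetical f(x) = 0.5 * sum(c_i * x_i**2)."""

    def __init__(self, c):
        self.c = list(c)
        self.calls = 0  # number of oracle queries answered so far

    def _value(self, x):
        return 0.5 * sum(ci * xi * xi for ci, xi in zip(self.c, x))

    def zero_order(self, x):
        # Returns f(x) only
        self.calls += 1
        return self._value(x)

    def first_order(self, x):
        # Returns f(x) and the gradient, which here is (c_i * x_i)_i
        self.calls += 1
        grad = [ci * xi for ci, xi in zip(self.c, x)]
        return self._value(x), grad

    def second_order(self, x):
        # Returns f(x), the gradient, and the Hessian diag(c)
        self.calls += 1
        d = len(self.c)
        grad = [ci * xi for ci, xi in zip(self.c, x)]
        hess = [[self.c[i] if i == j else 0.0 for j in range(d)]
                for i in range(d)]
        return self._value(x), grad, hess


oracle = QuadraticOracle([1.0, 4.0])
value, grad, hess = oracle.second_order([1.0, 0.5])
```
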
29. Complexity Bounds for Global Optimization. Assume a simple problem $\min_{x \in B_d} f(x)$, where $B_d = \{x \in \mathbb{R}^d : 0 \le x_i \le 1 \ \forall i\}$.
    - Q: Can we find an $\epsilon$-solution for $\epsilon > 0$? How many times do we need to call the zero-order oracle $O$?
    - We need some assumptions on $f$ to derive complexity bounds, and we need an algorithm!
    - Lipschitz continuity of $f$: the function $f : \mathbb{R}^d \to \mathbb{R}$ is Lipschitz continuous on $B_d$ if $|f(x) - f(y)| \le L \|x - y\|_\infty$ for all $x, y \in B_d$.
    - Q: How can this help us?
    - A: Assume we cover $B_d$ with a grid of points, and let $\Delta$ be the grid spacing. If we return the "best" grid point, how small must $\Delta$ be to guarantee an $\epsilon$-solution?
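31. Uniform Grid Method [N+18], with $p$ grid points along each coordinate (note that Nesterov uses $n$ for the dimension of the problem)
    - Any two neighboring points $x, y$ in the grid satisfy $\|x - y\|_\infty \le \frac{1}{p}$.
    - For $x^*$, there is a grid point $\bar{x}$ such that $\|x^* - \bar{x}\|_\infty \le \frac{1}{2p}$.
    - Hence $|f(\bar{x}) - f(x^*)| \le L \|\bar{x} - x^*\|_\infty \le \frac{L}{2p}$, and the best grid point is at least as good as $\bar{x}$.
    - Q: How many oracle calls does the method need?
    - Q: How to pick $p$ to guarantee an $\epsilon$-solution?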
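
A small sketch of the uniform grid method, assuming the grid consists of the centers of the $p^d$ cells of side $1/p$ (coordinates $(2i - 1)/(2p)$ for $i = 1, \dots, p$), which is what makes the $\frac{1}{2p}$ distance bound hold. The test function is hypothetical: $f(x) = \max_i |x_i - 0.37|$ is Lipschitz in the $\infty$-norm with $L = 1$ and has minimum value $0$.

```python
from itertools import product


def uniform_grid_method(f, d, p):
    """Query f at all p**d cell centers of the unit box; return the best."""
    coords = [(2 * i - 1) / (2 * p) for i in range(1, p + 1)]
    best_x, best_val, calls = None, float("inf"), 0
    for x in product(coords, repeat=d):
        val = f(x)  # one zero-order oracle call
        calls += 1
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val, calls


def f_example(x):
    # Hypothetical test function: L = 1 in the infinity norm, f* = 0
    return max(abs(xi - 0.37) for xi in x)


d, p, L = 2, 5, 1.0
best_x, best_val, calls = uniform_grid_method(f_example, d, p)
error_bound = L / (2 * p)  # guaranteed gap between best_val and f*
```

The method uses exactly $p^d$ oracle calls (25 here), and the returned value is within $\frac{L}{2p}$ of the optimum.
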
34. Final Complexity. To find an $\epsilon$-solution, we need $\frac{L}{2p} \le \epsilon \Rightarrow p = \lfloor \frac{L}{2\epsilon} \rfloor + 1$.
    - Analytical complexity. Q: How many calls of the zero-order oracle do we need?
    - A: We need $\left( \lfloor \frac{L}{2\epsilon} \rfloor + 1 \right)^d$ zero-order oracle calls.
    - Q: Is this also the worst-case behaviour (a lower bound), or are we just using a "very naïve" algorithm?
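
Spelled out, the choice of $p$ and the resulting call count follow in three short steps. The derivation assumes one oracle call per grid point and a grid of $p$ points per coordinate, and takes the accuracy to be $\epsilon > 0$.

```latex
\frac{L}{2p} \le \epsilon
\iff p \ge \frac{L}{2\epsilon},
\qquad
p = \left\lfloor \frac{L}{2\epsilon} \right\rfloor + 1 > \frac{L}{2\epsilon}
\;\Rightarrow\;
\frac{L}{2p} < \epsilon,
\qquad
\text{oracle calls} = p^d
= \left( \left\lfloor \frac{L}{2\epsilon} \right\rfloor + 1 \right)^d .
```

For instance, with the hypothetical values $L = 2$, $\epsilon = 0.01$, and $d = 2$, this gives $p = 101$ and $101^2 = 10201$ calls.
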
35. Lower Bound and Computational Need for a Tiny Problem
    - Lower bound: we can build an $L$-Lipschitz function that requires any method to explore $\lfloor \frac{L}{2\epsilon} \rfloor^d$ points before it can identify an $\epsilon$-solution.
    - Example: assume $L = 2$, $d = 10$, and $\epsilon = 0.01$.
    - If we change $d$ to $d + 1$, the estimate is multiplied by one hundred.
    - If we multiply $\epsilon$ by two, we reduce the complexity by a factor of a thousand.
    - If $\epsilon = 8\%$, then we need only two weeks.
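
The arithmetic behind the first claims in this example can be checked directly. The sketch assumes the lower bound has the form $\lfloor L / (2\epsilon) \rfloor^d$ and uses exact rationals to avoid floating-point issues in the floor; the helper name is made up for this example. It counts oracle calls only and makes no assumption about running time.

```python
from fractions import Fraction
from math import floor


def lower_bound_calls(L, eps, d):
    """Assumed lower bound floor(L / (2 * eps)) ** d on oracle calls."""
    return floor(Fraction(L) / (2 * Fraction(eps))) ** d


base = lower_bound_calls(2, "0.01", 10)       # L = 2, eps = 0.01, d = 10
more_dims = lower_bound_calls(2, "0.01", 11)  # d -> d + 1
looser = lower_bound_calls(2, "0.02", 10)     # eps -> 2 * eps

factor_dim = more_dims // base  # growth from one extra dimension
factor_eps = base // looser     # saving from doubling eps
```

Here `base` is $100^{10} = 10^{20}$, one extra dimension multiplies it by 100, and doubling $\epsilon$ divides it by $2^{10} = 1024$, i.e. roughly a thousand.
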
37. Conclusion
    - The simple example above shows that optimization is hard!
    - Q: What can save us?
    - We can assume some special properties of the problems.
    - We can use a different oracle (e.g., use gradients).
38. Bibliography
    - [N+18] Yurii Nesterov et al. Lectures on Convex Optimization, volume 137. Springer, 2018.
    - Thanks also to Prof. Martin Jaggi and Prof. Mark Schmidt for their slides and lectures, and to [N+18].
39. mbzuai.ac.ae Mohamed bin Zayed University of Artificial Intelligence Masdar City Abu Dhabi United Arab Emirates