
On the Convergence of Single-Call Stochastic Extra-Gradient Methods


Slides for the paper "On the Convergence of Single-Call Stochastic Extra-Gradient Methods" appearing in NeurIPS 2019
https://arxiv.org/abs/1908.08465


On the Convergence of Single-Call Stochastic Extra-Gradient Methods

  1. On the Convergence of Single-Call Stochastic Extra-Gradient Methods. Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos. NeurIPS, December 2019.
  2. Outline: 1. Variational Inequality; 2. Extra-Gradient; 3. Single-call Extra-Gradient [Main Focus]; 4. Conclusion.
  3. Variational Inequality
  4. [Variational Inequality] Introduction: Variational Inequalities in Machine Learning
     Generative adversarial networks (GANs):
       min_θ max_φ E_{x∼p_data}[log(D_φ(x))] + E_{z∼p_Z}[log(1 − D_φ(G_θ(z)))]
     More min-max (saddle-point) problems: distributionally robust learning, primal-dual formulations in optimization, ...
     Search for equilibria: games, multi-agent reinforcement learning, ...
  5. [Variational Inequality] Definition
     Stampacchia variational inequality: find x⋆ ∈ X such that ⟨V(x⋆), x − x⋆⟩ ≥ 0 for all x ∈ X. (SVI)
     Minty variational inequality: find x⋆ ∈ X such that ⟨V(x), x − x⋆⟩ ≥ 0 for all x ∈ X. (MVI)
     Here X ⊆ R^d is a closed convex set and V : R^d → R^d is a vector field.
  6. [Variational Inequality] Illustration
     SVI: V(x⋆) belongs to the dual cone DC(x⋆) of X at x⋆ [local].
     MVI: V(x) forms an acute angle with the tangent vector x − x⋆ ∈ TC(x⋆) [global].
  7. [Variational Inequality] Example: Function Minimization
     min_x f(x) subject to x ∈ X, where f : X → R is the differentiable function to minimize. Let V = ∇f.
     (SVI) ∀x ∈ X, ⟨∇f(x⋆), x − x⋆⟩ ≥ 0 [first-order optimality]
     (MVI) ∀x ∈ X, ⟨∇f(x), x − x⋆⟩ ≥ 0 [x⋆ is a minimizer of f]
     If f is convex, (SVI) and (MVI) are equivalent.
  8. [Variational Inequality] Example: Saddle-Point Problem
     Find x⋆ = (θ⋆, φ⋆) such that L(θ⋆, φ) ≤ L(θ⋆, φ⋆) ≤ L(θ, φ⋆) for all θ ∈ Θ and all φ ∈ Φ.
     Here X ≡ Θ × Φ and L : X → R is a differentiable function. Let V = (∇_θ L, −∇_φ L).
     (SVI) ∀(θ, φ) ∈ X, ⟨∇_θ L(x⋆), θ − θ⋆⟩ − ⟨∇_φ L(x⋆), φ − φ⋆⟩ ≥ 0 [stationarity]
     (MVI) ∀(θ, φ) ∈ X, ⟨∇_θ L(x), θ − θ⋆⟩ − ⟨∇_φ L(x), φ − φ⋆⟩ ≥ 0 [saddle point]
     If L is convex-concave, (SVI) and (MVI) are equivalent.
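
To make the construction concrete, here is a minimal numpy sketch of the operator V = (∇_θ L, −∇_φ L) for an illustrative bilinear objective L(θ, φ) = θᵀCφ; the matrix C and the dimensions are assumptions made for this example, not taken from the slides.

```python
import numpy as np

# Illustrative bilinear saddle objective L(theta, phi) = theta^T C phi.
C = np.array([[1.0, 0.5],
              [0.0, 2.0]])

def V(x):
    """V = (grad_theta L, -grad_phi L) for L(theta, phi) = theta^T C phi."""
    theta, phi = x[:2], x[2:]
    grad_theta = C @ phi           # d/d theta of theta^T C phi
    grad_phi = C.T @ theta         # d/d phi  of theta^T C phi
    return np.concatenate([grad_theta, -grad_phi])

x = np.array([1.0, -1.0, 0.5, 0.25])   # x = (theta, phi)
print(V(x))                             # the vector field evaluated at x
```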
  9. [Variational Inequality] Monotonicity
     The solutions of (SVI) and (MVI) coincide when V is continuous and monotone, i.e.,
       ⟨V(x′) − V(x), x′ − x⟩ ≥ 0 for all x, x′ ∈ R^d.
     In the two examples above, this corresponds to f being convex or L being convex-concave.
     The operator analogue of strong convexity is strong monotonicity:
       ⟨V(x′) − V(x), x′ − x⟩ ≥ α‖x′ − x‖² for some α > 0 and all x, x′ ∈ R^d.
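
As a quick sanity check, one can probe the monotonicity inequality numerically; the sketch below samples random pairs for the illustrative bilinear operator above (random sampling only gives evidence, not a proof, and the operator itself is an assumption of the example).

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[1.0, 0.5], [0.0, 2.0]])

def V(x):
    theta, phi = x[:2], x[2:]
    return np.concatenate([C @ phi, -(C.T @ theta)])

# Monotonicity: <V(x') - V(x), x' - x> >= 0 for all x, x'.
gaps = [np.dot(V(xp) - V(x), xp - x)
        for x, xp in (rng.normal(size=(2, 4)) for _ in range(1000))]
print("smallest sampled gap:", min(gaps))   # ~0: bilinear V is monotone, not strongly monotone
```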
  10. Extra-Gradient
  11. [Extra-Gradient] From Forward-Backward to Extra-Gradient
     Forward-backward:
       X_{t+1} = Π_X(X_t − γ_t V(X_t))   (FB)
     Extra-Gradient [Korpelevich 1976]:
       X_{t+1/2} = Π_X(X_t − γ_t V(X_t))
       X_{t+1}   = Π_X(X_t − γ_t V(X_{t+1/2}))   (EG)
     The Extra-Gradient method anticipates the landscape of V by taking an extrapolation step to reach the leading state X_{t+1/2}.
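
In code, the two updates differ only in where the operator is evaluated. A minimal sketch follows, using clipping to a box as a stand-in for the projection Π_X (the box constraint is an illustrative assumption).

```python
import numpy as np

def project(x, lo=-10.0, hi=10.0):
    """Euclidean projection onto X = [lo, hi]^d, standing in for Pi_X."""
    return np.clip(x, lo, hi)

def fb_step(x, V, gamma):
    """(FB): X_{t+1} = Pi_X(X_t - gamma * V(X_t))."""
    return project(x - gamma * V(x))

def eg_step(x, V, gamma):
    """(EG): extrapolate to the leading state X_{t+1/2}, then update from X_t."""
    x_half = project(x - gamma * V(x))        # X_{t+1/2}, first oracle call
    return project(x - gamma * V(x_half))     # X_{t+1},  second oracle call
```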
  12. [Extra-Gradient] From Forward-Backward to Extra-Gradient
     Forward-backward does not converge in bilinear games, while Extra-Gradient does:
       min_{θ∈R} max_{φ∈R} θφ
     [Figure: iterate trajectories; left: Forward-backward, right: Extra-Gradient.]
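
A tiny unconstrained experiment reproduces this picture for min_θ max_φ θφ, where V(θ, φ) = (φ, −θ) and the solution is x⋆ = 0; the step size and horizon below are illustrative choices.

```python
import numpy as np

V = lambda x: np.array([x[1], -x[0]])        # V(theta, phi) = (phi, -theta)
gamma, T = 0.2, 200
x_fb = x_eg = np.array([1.0, 1.0])

for _ in range(T):
    x_fb = x_fb - gamma * V(x_fb)            # forward-backward step
    x_half = x_eg - gamma * V(x_eg)          # EG extrapolation step
    x_eg = x_eg - gamma * V(x_half)          # EG update step

print("FB distance to x*:", np.linalg.norm(x_fb))   # grows: FB spirals away
print("EG distance to x*:", np.linalg.norm(x_eg))   # shrinks: EG converges
```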
  13. [Extra-Gradient] Stochastic Oracle
     If a stochastic oracle is involved:
       X_{t+1/2} = Π_X(X_t − γ_t V̂_t)
       X_{t+1}   = Π_X(X_t − γ_t V̂_{t+1/2})
     with V̂_t = V(X_t) + Z_t satisfying (and likewise for V̂_{t+1/2}):
       a) Zero mean: E[Z_t | F_t] = 0.
       b) Bounded variance: E[‖Z_t‖² | F_t] ≤ σ².
     Here (F_t)_{t∈N/2} is the natural filtration associated with the stochastic process (X_t)_{t∈N/2}.
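
A hedged sketch of such an oracle: wrap a deterministic V with additive zero-mean noise of variance at most σ². Gaussian noise is just one convenient choice satisfying a) and b); nothing here is prescribed by the slides.

```python
import numpy as np

def make_stochastic_oracle(V, sigma, rng=np.random.default_rng(1)):
    """Return V_hat with V_hat(x) = V(x) + Z, E[Z] = 0 and E[||Z||^2] = sigma^2."""
    def V_hat(x):
        z = rng.normal(scale=sigma / np.sqrt(x.size), size=x.shape)
        return V(x) + z
    return V_hat

V = lambda x: np.array([x[1], -x[0]])
V_hat = make_stochastic_oracle(V, sigma=0.1)
print(V_hat(np.array([1.0, 1.0])))           # a noisy evaluation of V
```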
  14. [Extra-Gradient] Convergence Metrics
     Ergodic convergence: restricted error function
       Err_R(x̂) = max_{x∈X_R} ⟨V(x), x̂ − x⟩, where X_R ≡ X ∩ B_R(0) = {x ∈ X : ‖x‖ ≤ R}.
     Last-iterate convergence: squared distance dist(x̂, X⋆)².
     Lemma [Nesterov 2007]: assume V is monotone. If x⋆ is a solution of (SVI), then Err_R(x⋆) = 0 for all sufficiently large R. Conversely, if Err_R(x̂) = 0 for large enough R > 0 and some x̂ ∈ X_R, then x̂ is a solution of (SVI).
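
Both metrics are easy to estimate in small experiments. The sketch below assumes X = R^d (so that X_R = B_R(0)) and approximates Err_R by maximizing over random points of the ball, which only lower-bounds the true restricted error.

```python
import numpy as np

def dist_sq(x_hat, x_star):
    """Last-iterate metric: squared distance to a known solution."""
    return float(np.sum((x_hat - x_star) ** 2))

def restricted_error(x_hat, V, R, n_samples=10_000, seed=2):
    """Monte-Carlo estimate of Err_R(x_hat) = max_{||x|| <= R} <V(x), x_hat - x>."""
    rng = np.random.default_rng(seed)
    d = x_hat.size
    best = -np.inf
    for _ in range(n_samples):
        u = rng.normal(size=d)
        x = R * rng.uniform() ** (1 / d) * u / np.linalg.norm(u)  # uniform point in B_R(0)
        best = max(best, float(np.dot(V(x), x_hat - x)))
    return best
```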
  15. [Extra-Gradient] Literature Review
     We further suppose that V is β-Lipschitz.
       Korpelevich 1976: last-iterate, asymptotic; hypothesis: pseudo-monotone.
       Tseng 1995: last-iterate, geometric; hypothesis: monotone + error bound (e.g., strongly monotone, affine).
       Nemirovski 2004: ergodic in O(1/t); hypothesis: monotone.
       Juditsky et al. 2011: ergodic in O(1/√t); hypothesis: stochastic monotone.
  16. [Extra-Gradient] In Deep Learning
     Extra-Gradient (EG) needs two oracle calls per iteration, while gradient computations can be very costly for deep models. What if we drop one oracle call per iteration?
  17. Single-call Extra-Gradient [Main Focus]
  18. [Single-call Extra-Gradient] Algorithms
     1. Past Extra-Gradient [Popov 1980]:
        X_{t+1/2} = Π_X(X_t − γ_t V̂_{t−1/2})
        X_{t+1}   = Π_X(X_t − γ_t V̂_{t+1/2})   (PEG)
     2. Reflected Gradient [Malitsky 2015]:
        X_{t+1/2} = X_t − (X_{t−1} − X_t)
        X_{t+1}   = Π_X(X_t − γ_t V̂_{t+1/2})   (RG)
     3. Optimistic Gradient [Daskalakis et al. 2018]:
        X_{t+1/2} = Π_X(X_t − γ_t V̂_{t−1/2})
        X_{t+1}   = X_{t+1/2} + γ_t V̂_{t−1/2} − γ_t V̂_{t+1/2}   (OG)
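
The three updates translate almost line by line into code. A sketch follows, with clipping standing in for Π_X (the box constraint is an assumption of the example); each step makes exactly one fresh oracle call, at the leading state, and otherwise reuses cached quantities.

```python
import numpy as np

def project(x, lo=-10.0, hi=10.0):                   # stand-in for Pi_X
    return np.clip(x, lo, hi)

def peg_step(x, v_prev_half, V_hat, gamma):
    """(PEG): reuse the previous leading-state gradient for the extrapolation."""
    x_half = project(x - gamma * v_prev_half)        # X_{t+1/2}
    v_half = V_hat(x_half)                           # the single oracle call
    return project(x - gamma * v_half), v_half

def rg_step(x, x_prev, V_hat, gamma):
    """(RG): extrapolate by reflecting the previous iterate, with no projection there."""
    x_half = x - (x_prev - x)                        # X_{t+1/2} = 2 X_t - X_{t-1}
    return project(x - gamma * V_hat(x_half)), x     # new iterate, new "previous" iterate

def og_step(x, v_prev_half, V_hat, gamma):
    """(OG): same leading state as PEG, then an unprojected corrected update."""
    x_half = project(x - gamma * v_prev_half)        # X_{t+1/2}
    v_half = V_hat(x_half)                           # the single oracle call
    return x_half + gamma * v_prev_half - gamma * v_half, v_half
```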
  19. [Single-call Extra-Gradient] A First Result
     Proxy for the two-call template X_{t+1/2} = Π_X(X_t − γ_t V̂_t), X_{t+1} = Π_X(X_t − γ_t V̂_{t+1/2}):
       PEG: [Step 1] V̂_t ← V̂_{t−1/2}
       RG:  [Step 1] V̂_t ← (X_{t−1} − X_t)/γ_t; no projection
       OG:  [Step 1] V̂_t ← V̂_{t−1/2}; [Step 2] X_t ← X_{t+1/2} + γ_t V̂_{t−1/2}; no projection
     Proposition: suppose the single-call extra-gradient (1-EG) methods above share the same initialization X_0 = X_1 ∈ X, V̂_{1/2} = 0, and the same constant step size (γ_t)_{t∈N} ≡ γ. If X = R^d, the generated iterates X_t coincide for all t ≥ 1.
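
The proposition is easy to verify numerically on an unconstrained problem. A sketch comparing PEG and RG under the stated initialization (X_0 = X_1, V̂_{1/2} = 0, constant γ, deterministic oracle); OG reduces to the same recursion once the projections disappear. The operator and step size are illustrative.

```python
import numpy as np

V = lambda x: np.array([x[1], -x[0]])        # deterministic oracle, X = R^2
gamma, T = 0.1, 50

# Past Extra-Gradient, started from X_1 with V_hat_{1/2} = 0.
x, v_prev = np.array([1.0, 1.0]), np.zeros(2)
peg = []
for _ in range(T):
    x_half = x - gamma * v_prev              # X_{t+1/2} = X_t - gamma * V_{t-1/2}
    v_prev = V(x_half)
    x = x - gamma * v_prev                   # X_{t+1}   = X_t - gamma * V_{t+1/2}
    peg.append(x.copy())

# Reflected Gradient, started from X_0 = X_1.
x, x_prev = np.array([1.0, 1.0]), np.array([1.0, 1.0])
rg = []
for _ in range(T):
    x_half = 2 * x - x_prev                  # X_{t+1/2} = X_t - (X_{t-1} - X_t)
    x, x_prev = x - gamma * V(x_half), x
    rg.append(x.copy())

print(max(np.max(np.abs(p - r)) for p, r in zip(peg, rg)))   # tiny (rounding only): iterates coincide
```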
  20. [Single-call Extra-Gradient] Global Convergence Rate
     Always with Lipschitz continuity. Stochastic strongly monotone: step size in O(1/t). New results!
       Deterministic, monotone: ergodic 1/t; last iterate unknown.
       Deterministic, strongly monotone: ergodic 1/t; last iterate e^{−ρt}.
       Stochastic, monotone: ergodic 1/√t; last iterate unknown.
       Stochastic, strongly monotone: ergodic 1/t; last iterate 1/t.
  21. [Single-call Extra-Gradient] Proof Ingredients
     Descent lemma [deterministic + monotone]: there exists (µ_t)_{t∈N} ∈ R_+^N such that for all p ∈ X,
       ‖X_{t+1} − p‖² + µ_{t+1} ≤ ‖X_t − p‖² − 2γ⟨V(X_{t+1/2}), X_{t+1/2} − p⟩ + µ_t.
     Descent lemma [stochastic + strongly monotone]: let x⋆ be the unique solution of (SVI). There exist (µ_t)_{t∈N} ∈ R_+^N and M ∈ R_+ such that
       E[‖X_{t+1} − x⋆‖²] + µ_{t+1} ≤ (1 − αγ_t)(E[‖X_t − x⋆‖²] + µ_t) + Mγ_t²σ².
  22. [Single-call Extra-Gradient] Regular Solution
     Definition [regular solution]: we say that x⋆ is a regular solution of (SVI) if V is C¹-smooth in a neighborhood of x⋆ and the Jacobian Jac_V(x⋆) is positive-definite along rays emanating from x⋆, i.e.,
       zᵀ Jac_V(x⋆) z ≡ ∑_{i,j=1}^d z_i (∂V_i/∂x_j)(x⋆) z_j > 0
     for all z ∈ R^d ∖ {0} that are tangent to X at x⋆.
     To be compared with positive definiteness of the Hessian along qualified constraints in minimization, and with differential equilibria in games.
     This is a localization of strong monotonicity.
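
The condition can be probed numerically at a candidate solution. The sketch below uses a finite-difference Jacobian and random directions for an illustrative unconstrained operator, so every nonzero z is tangent; the operator and the point are assumptions of the example.

```python
import numpy as np

def jacobian_fd(V, x, eps=1e-6):
    """Central finite-difference Jacobian of V at x."""
    d = x.size
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (V(x + e) - V(x - e)) / (2 * eps)
    return J

A = np.array([[1.0, 1.0], [-1.0, 1.0]])      # illustrative operator V(x) = A x
V = lambda x: A @ x
x_star = np.zeros(2)                          # V(x_star) = 0, so x_star solves (SVI)

J = jacobian_fd(V, x_star)
rng = np.random.default_rng(3)
vals = [z @ J @ z for z in rng.normal(size=(1000, 2))]
print("min z^T Jac_V(x*) z over sampled z:", min(vals))   # > 0: x* looks regular
```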
  23. [Single-call Extra-Gradient] Local Convergence
     Theorem [local convergence for stochastic non-monotone operators]: let x⋆ be a regular solution of (SVI) and fix a tolerance level δ > 0. Suppose (PEG) is run with step sizes of the form γ_t = γ/(t + b) for large enough γ and b. Then:
       (a) There are neighborhoods U and U₁ of x⋆ in X such that, if X_{1/2} ∈ U and X_1 ∈ U₁, the event E_∞ = {X_{t+1/2} ∈ U for all t = 1, 2, ...} occurs with probability at least 1 − δ.
       (b) Conditioning on the above, we have E[‖X_t − x⋆‖² | E_∞] = O(1/t).
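
To see the schedule in action, here is a sketch running stochastic (PEG) with γ_t = γ/(t + b) on an illustrative strongly monotone linear problem; γ, b, the noise level, and the operator are all assumptions of the example, not the theorem's constants.

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[1.0, 1.0], [-1.0, 1.0]])      # strongly monotone linear operator, x* = 0
V = lambda x: A @ x
gamma0, b, sigma, T, runs = 2.0, 10, 0.1, 2000, 20

errs = np.zeros(T)
for _ in range(runs):
    x, v_prev = np.array([1.0, 1.0]), np.zeros(2)
    for t in range(1, T + 1):
        gamma = gamma0 / (t + b)                               # gamma_t = gamma / (t + b)
        x_half = x - gamma * v_prev                            # leading state
        v_prev = V(x_half) + rng.normal(scale=sigma, size=2)   # single noisy oracle call
        x = x - gamma * v_prev
        errs[t - 1] += np.sum(x ** 2) / runs                   # average of ||X_t - x*||^2

print("mean squared error at t = 500, 1000, 2000:", errs[499], errs[999], errs[1999])
# The error should roughly halve as t doubles, consistent with the O(1/t) guarantee.
```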
  24. [Single-call Extra-Gradient] Experiments
     Test problem: L(θ, φ) = (ℓ₁/2) θᵀA₁θ + ℓ₂ (θᵀA₂θ)² − (ℓ₁/2) φᵀB₁φ − ℓ₂ (φᵀB₂φ)² + 4 θᵀCφ
     [Figure: ‖x − x⋆‖² versus number of oracle calls for EG and 1-EG, several step sizes γ.]
       (a) Strongly monotone (ℓ₁ = 1, ℓ₂ = 0); deterministic, last iterate; γ ∈ {0.1, 0.2, 0.3, 0.4}.
       (b) Monotone (ℓ₁ = 0, ℓ₂ = 1); deterministic, ergodic; γ ∈ {0.2, 0.4, 0.6, 0.8}.
       (c) Non-monotone (ℓ₁ = 1, ℓ₂ = −1); stochastic with Z_t i.i.d. ∼ N(0, σ² = 0.01), last iterate (b = 15); γ ∈ {0.3, 0.6, 0.9, 1.2}.
  25. Conclusion and Perspectives
  26. [Conclusion] Conclusion
     Single-call rates ∼ two-call rates.
     Localization of the stochastic guarantee.
     Last-iterate convergence: a first step to the non-monotone world.
     Some research directions: Bregman, universal, ...
  27. [Conclusion] Bibliography
     Daskalakis, Constantinos, et al. (2018). "Training GANs with optimism". In: ICLR '18: Proceedings of the 2018 International Conference on Learning Representations.
     Juditsky, Anatoli, Arkadi Semen Nemirovski, and Claire Tauvel (2011). "Solving variational inequalities with stochastic mirror-prox algorithm". In: Stochastic Systems 1.1, pp. 17–58.
     Korpelevich, G. M. (1976). "The extragradient method for finding saddle points and other problems". In: Èkonom. i Mat. Metody 12, pp. 747–756.
     Malitsky, Yura (2015). "Projected reflected gradient methods for monotone variational inequalities". In: SIAM Journal on Optimization 25.1, pp. 502–520.
     Nemirovski, Arkadi Semen (2004). "Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems". In: SIAM Journal on Optimization 15.1, pp. 229–251.
     Nesterov, Yurii (2007). "Dual extrapolation and its applications to solving variational inequalities and related problems". In: Mathematical Programming 109.2, pp. 319–344.
     Popov, Leonid Denisovich (1980). "A modification of the Arrow–Hurwicz method for search of saddle points". In: Mathematical Notes of the Academy of Sciences of the USSR 28.5, pp. 845–848.
     Tseng, Paul (1995). "On linear convergence of iterative methods for the variational inequality problem". In: Journal of Computational and Applied Mathematics 60.1-2, pp. 237–252.
