automating the analysis and design of large-scale optimization algorithms


1 Automating the analysis and design of large-scale optimization algorithms. Laurent Lessard, University of Wisconsin-Madison. Joint work with Ben Recht, Andy Packard, Bin Hu, Bryan Van Scoy, and Saman Cyrus. October 2018.

2 Unconstrained optimization: minimize $f(x)$ subject to $x \in \mathbb{R}^N$. We need algorithms that are fast and simple; the currently favored family is first-order methods.

3 Gradient method: $x_{k+1} = x_k - \alpha \nabla f(x_k)$.
Heavy ball method: $x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})$.
Nesterov's accelerated method: $y_k = x_k + \beta(x_k - x_{k-1})$, $x_{k+1} = y_k - \alpha \nabla f(y_k)$.
[Figure: error of the iterates $x_0, x_1, \dots$ plotted over the contours of a quadratic $f(x)$.]
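For concreteness, here is a minimal sketch of the three updates in Python/NumPy, applied to a small hypothetical quadratic test problem; the stepsize and momentum values are illustrative, not the tuned values discussed later.

```python
import numpy as np

def gradient_method(grad, x0, alpha, iters):
    # x_{k+1} = x_k - alpha * grad(x_k)
    x = x0.copy()
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

def heavy_ball(grad, x0, alpha, beta, iters):
    # x_{k+1} = x_k - alpha * grad(x_k) + beta * (x_k - x_{k-1})
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

def nesterov(grad, x0, alpha, beta, iters):
    # y_k = x_k + beta * (x_k - x_{k-1});  x_{k+1} = y_k - alpha * grad(y_k)
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + beta * (x - x_prev)
        x, x_prev = y - alpha * grad(y), x
    return x

# Hypothetical test problem: f(x) = 0.5 * x^T Q x, so grad f(x) = Q x.
Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x
x0 = np.array([1.0, 1.0])
print(gradient_method(grad, x0, alpha=0.1, iters=100))       # -> near 0
print(heavy_ball(grad, x0, alpha=0.1, beta=0.5, iters=100))  # -> near 0
print(nesterov(grad, x0, alpha=0.1, beta=0.5, iters=100))    # -> near 0
```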

4 1. Many algorithms can be viewed as dynamical systems with feedback (control systems!): algorithm convergence corresponds to system stability. 2. By solving a small convex program, we can recover state-of-the-art convergence results for these algorithms, automatically and efficiently. 3. The ultimate goal: to move from analysis to design.

5 Robust algorithm selection. $G \in \mathcal{G}$: algorithm to use; $f \in S$: function to minimize.
$$G_{\text{opt}} = \arg\min_{G \in \mathcal{G}} \; \max_{f \in S} \; \text{cost}(f, G)$$
A similar problem for a finite number of iterations: Drori, Teboulle (2012); Taylor, Hendrickx, Glineur (2016).

6 The candidate algorithms $G \in \mathcal{G}$:
Gradient method: $x_{k+1} = x_k - \alpha \nabla f(x_k)$
Heavy ball method: $x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})$
Nesterov's accelerated method: $x_{k+1} = x_k - \alpha \nabla f\big(x_k + \beta(x_k - x_{k-1})\big) + \beta(x_k - x_{k-1})$
Analytically solvable case $f \in S$: quadratic functions $f(x) = \frac{1}{2} x^T Q x - p^T x$ with the constraint $m \le \lambda(Q) \le L$.

7 [Figure: convergence rate $\rho$ vs. condition ratio $L/m$ on quadratic functions for the Gradient, Nesterov, and Heavy ball methods, together with the corresponding iterations to convergence.]
Convergence rate $\rho$: $\|x_k - x_\star\| \le C \rho^k \|x_0 - x_\star\|$. Iterations to convergence: proportional to $\frac{1}{-\log \rho}$.
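These quantities are easy to check numerically. Below is a small sketch on a hypothetical 10-dimensional quadratic: we estimate $\rho$ from the observed error decay and compare with the classical rate $(L/m - 1)/(L/m + 1)$ for the gradient method with $\alpha = 2/(L+m)$ on this class.

```python
import numpy as np

# Estimate rho from the error decay of the gradient method on a quadratic,
# then convert to iterations-to-convergence ~ 1 / (-log rho).
m, L = 1.0, 100.0
Q = np.diag(np.linspace(m, L, 10))       # eigenvalues span [m, L]
x = np.ones(10)                          # x_star = 0 for f(x) = 0.5 x^T Q x
alpha = 2 / (L + m)
errs = []
for _ in range(200):
    x = x - alpha * (Q @ x)
    errs.append(np.linalg.norm(x))
rho_hat = (errs[-1] / errs[99]) ** (1 / 100)  # geometric decay over 100 steps
print(rho_hat, (L/m - 1) / (L/m + 1))         # both ~ 0.9802
print(1 / -np.log(rho_hat))                   # iterations per e-fold of error
```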

8 Robust algorithm selection. $G \in \mathcal{G}$: algorithm to use; $f \in S$: function to minimize.
$$G_{\text{opt}} = \arg\min_{G \in \mathcal{G}} \; \max_{f \in S} \; \text{cost}(f, G)$$
Roadmap: 1. a mathematical representation for $G$; 2. a mathematical representation for $S$; 3. the main robustness result.

9 Dynamical system interpretation. Heavy ball: $x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})$. Define $u_k := \nabla f(x_k)$ and $p_k := x_{k-1}$. The algorithm block is linear, known, and decoupled:
$$\begin{bmatrix} x_{k+1} \\ p_{k+1} \end{bmatrix} = \begin{bmatrix} (1+\beta)I & -\beta I \\ I & 0 \end{bmatrix} \begin{bmatrix} x_k \\ p_k \end{bmatrix} + \begin{bmatrix} -\alpha I \\ 0 \end{bmatrix} u_k, \qquad y_k = \begin{bmatrix} I & 0 \end{bmatrix} \begin{bmatrix} x_k \\ p_k \end{bmatrix}$$
in feedback with the function block $u_k = \nabla f(y_k)$, which is nonlinear, uncertain, and coupled.

10 Dynamical system interpretation (per coordinate). Because the algorithm block is decoupled, the same interconnection can be written coordinate-wise for $i = 1, \dots, N$:
$$\begin{bmatrix} (x_{k+1})_i \\ (p_{k+1})_i \end{bmatrix} = \begin{bmatrix} 1+\beta & -\beta \\ 1 & 0 \end{bmatrix} \begin{bmatrix} (x_k)_i \\ (p_k)_i \end{bmatrix} + \begin{bmatrix} -\alpha \\ 0 \end{bmatrix} (u_k)_i, \qquad (y_k)_i = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} (x_k)_i \\ (p_k)_i \end{bmatrix}$$
with the nonlinear, uncertain, coupled feedback $u_k = \nabla f(y_k)$.

11 General form: the algorithm is an LTI system $G$ in feedback with the gradient,
$$\xi_{k+1} = A\xi_k + B u_k, \qquad y_k = C\xi_k, \qquad u_k = \nabla f(y_k),$$
with (per coordinate) $\begin{bmatrix} A & B \\ C & 0 \end{bmatrix}$ given by
Gradient: $\begin{bmatrix} 1 & -\alpha \\ 1 & 0 \end{bmatrix}$; Heavy ball: $\begin{bmatrix} 1+\beta & -\beta & -\alpha \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix}$; Nesterov: $\begin{bmatrix} 1+\beta & -\beta & -\alpha \\ 1 & 0 & 0 \\ 1+\beta & -\beta & 0 \end{bmatrix}$.
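A sketch constructing these $(A, B, C)$ triples and simulating the feedback interconnection on a scalar quadratic (all numerical values illustrative):

```python
import numpy as np

def algo_matrices(name, alpha, beta=0.0):
    """Per-coordinate (A, B, C) for each method's LTI representation."""
    if name == "gradient":
        return np.array([[1.0]]), np.array([[-alpha]]), np.array([[1.0]])
    A = np.array([[1 + beta, -beta], [1.0, 0.0]])  # shared by both momentum methods
    B = np.array([[-alpha], [0.0]])
    if name == "heavy_ball":
        C = np.array([[1.0, 0.0]])
    else:  # "nesterov"
        C = np.array([[1 + beta, -beta]])
    return A, B, C

# Closed loop: xi_{k+1} = A xi_k + B u_k, y_k = C xi_k, u_k = f'(y_k),
# here with the scalar quadratic f(y) = q y^2 / 2.
A, B, C = algo_matrices("heavy_ball", alpha=0.3, beta=0.4)
q = 2.0
xi = np.array([[1.0], [1.0]])   # state (x_k, x_{k-1})
for _ in range(100):
    y = C @ xi
    u = q * y                   # gradient of the quadratic
    xi = A @ xi + B @ u
print(xi.ravel())               # -> near 0: the loop is stable
```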

12 Function classes and the corresponding constraints on $\nabla f$: $f$ quadratic $\leftrightarrow$ $\nabla f$ linear; $f$ strongly convex + Lipschitz gradients $\leftrightarrow$ $\nabla f$ sector bounded + slope restricted; $f$ radially quasiconvex $\leftrightarrow$ $\nabla f$ sector bounded.
[Figure: representative plots of $\nabla f(x)$ vs. $x$ for each class.]

13 Representing function classes: express them as quadratic constraints on the signals $(y, u)$. Example: $\nabla f$ is a passive function: $y_k^T u_k \ge 0$ for all $k$. [Figure: the graph of $u_k$ vs. $y_k$ confined to a sector.]

14 Representing function classes: express them as quadratic constraints on $(y, u)$. Example: $\nabla f$ is sector-bounded on $[m, L]$:
$$\begin{bmatrix} y_k \\ u_k \end{bmatrix}^T \begin{bmatrix} -2mL & m+L \\ m+L & -2 \end{bmatrix} \begin{bmatrix} y_k \\ u_k \end{bmatrix} \ge 0$$
[Figure: the graph of $u_k$ vs. $y_k$ lies between lines of slope $m$ and $L$.]
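A quick numerical check of this constraint: for the hypothetical scalar function $f(y) = \frac{q}{2} y^2$ with $m \le q \le L$, the gradient $u = qy$ gives $z^T M z = -2y^2(q-m)(q-L) \ge 0$, with equality exactly at the sector boundaries.

```python
import numpy as np

m, L = 1.0, 10.0
M = np.array([[-2 * m * L, m + L],
              [m + L,      -2.0]])

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.uniform(m, L)        # gradient slope inside the sector [m, L]
    y = rng.normal()
    u = q * y                    # u_k = grad f(y_k) for f(y) = q y^2 / 2
    z = np.array([y, u])
    print(z @ M @ z >= 0)        # sector constraint holds: True
```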

15 Representing function classes: express them as quadratic constraints on $(y, u)$. If $\nabla f$ is sector-bounded and slope-restricted, the constraint on $(y_k, u_k)$ depends on the history $(y_0, \dots, y_{k-1}, u_0, \dots, u_{k-1})$.

16 Introduce extra dynamics: design a filter $\Psi$, mapping $(y, u)$ to an auxiliary signal $z$ with internal state $\zeta$, together with a multiplier matrix $M$. Instead of using a static quadratic $q(y_k, u_k)$, use $z_k^T M z_k$. There is a systematic way of doing this for strong convexity (Zames & Falb, 1968); the general theory is that of Integral Quadratic Constraints (Megretski & Rantzer, 1997).

17 Summary of the setup: the algorithm $G$ (Gradient, Heavy ball, or Nesterov) is encoded by its state-space matrices $\begin{bmatrix} A & B \\ C & 0 \end{bmatrix}$, and the assumption on the function ($f$ quadratic, strongly convex, or quasiconvex) is encoded by a pair $(\Psi, M)$.

18 Main result. Problem data: $G$ (the algorithm) and $\Psi$ (what we know about $f$). Auxiliary quantities: compute the matrices $(\hat{A}, \hat{B}, \hat{C}, \hat{D})$ from $(G, \Psi)$ and choose a candidate rate $0 < \rho < 1$. If there exists $P \succ 0$ such that
$$\begin{bmatrix} \hat{A}^T P \hat{A} - \rho^2 P & \hat{A}^T P \hat{B} \\ \hat{B}^T P \hat{A} & \hat{B}^T P \hat{B} \end{bmatrix} + \begin{bmatrix} \hat{C} & \hat{D} \end{bmatrix}^T M \begin{bmatrix} \hat{C} & \hat{D} \end{bmatrix} \preceq 0$$
then $\|x_k - x_\star\| \le \sqrt{\mathrm{cond}(P)}\, \rho^k \|x_0 - x_\star\|$ for all $k$. The size of the LMI does not grow with the problem dimension! e.g. $P \in \mathbb{S}^{3\times 3}$, LMI $\in \mathbb{S}^{4\times 4}$.
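As a concrete illustration, here is a sketch of the LMI feasibility test in Python with CVXPY, specialized to the simplest case of a static sector IQC (so $z_k = (y_k, u_k)$ and $[\hat{C}\ \hat{D}]$ just stacks $C$ with the identity), with a nonnegative multiplier $\lambda$ scaling $M$ as in the SIOPT 16 theorem. Bisecting on $\rho$ for the gradient method with $\alpha = 2/(L+m)$ should recover approximately $(L-m)/(L+m)$; the values $m = 1$, $L = 10$ are illustrative.

```python
import numpy as np
import cvxpy as cp

def lmi_feasible(A, B, C, M, rho):
    """Test the rate LMI for a static (sector) IQC with multiplier lam >= 0."""
    n, p = A.shape[0], B.shape[1]
    P = cp.Variable((n, n), symmetric=True)
    lam = cp.Variable(nonneg=True)
    AB = np.hstack([A, B])                        # maps (xi_k, u_k) to xi_{k+1}
    E = np.hstack([np.eye(n), np.zeros((n, p))])  # selects xi_k
    CD = np.block([[C, np.zeros((C.shape[0], p))],
                   [np.zeros((p, n)), np.eye(p)]])  # z_k = (y_k, u_k)
    lmi = AB.T @ P @ AB - rho**2 * (E.T @ P @ E) + lam * (CD.T @ M @ CD)
    lmi = 0.5 * (lmi + lmi.T)                     # enforce symmetry for CVXPY
    prob = cp.Problem(cp.Minimize(0), [P >> np.eye(n), lmi << 0])
    prob.solve(solver=cp.SCS)
    return prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)

def best_certified_rate(A, B, C, M, tol=1e-3):
    lo, hi = 0.0, 1.0                             # bisect on the candidate rate
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if lmi_feasible(A, B, C, M, mid) else (mid, hi)
    return hi

m, L = 1.0, 10.0
alpha = 2 / (L + m)
A, B, C = np.array([[1.0]]), np.array([[-alpha]]), np.array([[1.0]])
M = np.array([[-2 * m * L, m + L], [m + L, -2.0]])
print(best_certified_rate(A, B, C, M))            # ~ (L-m)/(L+m) = 0.8182
```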

19 main results: analytic and numerical

20 Gradient method: $x_{k+1} = x_k - \alpha \nabla f(x_k)$.
[Figure: convergence rate $\rho$ and iterations to convergence vs. condition ratio $L/m$ for Gradient (all functions), Nesterov (quadratic), and Heavy ball (quadratic).]
Analytic solution! The same rate holds for quadratics, strongly convex functions, and quasiconvex functions.

21 Nesterov's method: $x_{k+1} = x_k - \alpha \nabla f\big(x_k + \beta(x_k - x_{k-1})\big) + \beta(x_k - x_{k-1})$.
[Figure: Nesterov rate bounds and iterations to convergence vs. condition ratio $L/m$ for IQC (quasiconvex), IQC (strongly convex), Nesterov (strongly convex), Nesterov (quadratic), and Heavy ball (quadratic).]
We cannot certify stability for quasiconvex functions, but the IQC bound improves upon the best known bound!

22 Heavy ball method: $x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})$.
[Figure: Heavy ball rate bounds and iterations to convergence vs. condition ratio $L/m$ for IQC (quasiconvex), IQC (strongly convex), Nesterov (quadratic), and Heavy ball (quadratic).]
We cannot certify stability for quasiconvex functions, nor for strongly convex functions.

23 The heavy ball method is not stable! Counterexample:
$$f(x) = \begin{cases} \frac{25}{2}x^2 & x < 1 \\ \frac{1}{2}x^2 + 24x - 12 & 1 \le x < 2 \\ \frac{25}{2}x^2 - 24x + 36 & x \ge 2 \end{cases}$$
so that $L/m = 25$. Starting the heavy ball iteration at $x_0 = x_1 \in [3.07, 3.46]$, the iterates converge to a limit cycle. This is a simple counterexample to the Aizerman (1949) and Kalman (1957) conjectures.
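A simulation sketch of this counterexample. We assume the standard heavy-ball tuning $\alpha = 4/(\sqrt{L}+\sqrt{m})^2$, $\beta = \big((\sqrt{L}-\sqrt{m})/(\sqrt{L}+\sqrt{m})\big)^2$, which for $L/m = 25$ gives $\alpha = 1/9$, $\beta = 4/9$; the tail of the trajectory repeats instead of converging to $x_\star = 0$.

```python
import numpy as np

def grad_f(x):
    # Gradient of the piecewise-quadratic counterexample (m = 1, L = 25).
    if x < 1:
        return 25.0 * x
    elif x < 2:
        return x + 24.0
    return 25.0 * x - 24.0

L_, m_ = 25.0, 1.0
alpha = 4 / (np.sqrt(L_) + np.sqrt(m_))**2                             # 1/9
beta = ((np.sqrt(L_) - np.sqrt(m_)) / (np.sqrt(L_) + np.sqrt(m_)))**2  # 4/9

x_prev = x = 3.3          # x_0 = x_1 inside [3.07, 3.46]
tail = []
for k in range(2000):
    x, x_prev = x - alpha * grad_f(x) + beta * (x - x_prev), x
    if k >= 1994:
        tail.append(round(x, 4))
print(tail)               # a repeating pattern: a limit cycle, not x* = 0
```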

24 uncharted territory: noise robustness and algorithm design

25 Noise robustness. Insert an uncertain block $\delta$ between the function and the algorithm: the algorithm receives $w_k$ in place of the true gradient $u_k = \nabla f(y_k)$, with the multiplicative noise bound $\|u_k - w_k\| \le \delta \|w_k\|$. How does an algorithm perform in the presence of noise?
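One simple realization of this noise model on a quadratic, as a sketch: take $w_k = u_k/(1 + \delta \varepsilon_k)$ with $\varepsilon_k \in [-1, 1]$, so that $\|u_k - w_k\| = \delta|\varepsilon_k|\,\|w_k\| \le \delta\|w_k\|$. This random realization is benign (the plots that follow consider the worst case over the noise); all numerical values are illustrative.

```python
import numpy as np

def noisy_gradient_descent(alpha, delta, iters=300, seed=0):
    # The algorithm sees w_k = u_k / (1 + delta * eps_k) in place of the
    # true gradient u_k, so that ||u_k - w_k|| <= delta * ||w_k||.
    rng = np.random.default_rng(seed)
    Q = np.diag(np.linspace(1.0, 10.0, 5))   # m = 1, L = 10
    x = np.ones(5)
    for _ in range(iters):
        u = Q @ x
        w = u / (1 + delta * rng.uniform(-1, 1))
        x = x - alpha * w
    return np.linalg.norm(x)                 # distance to x* = 0

m, L = 1.0, 10.0
for delta in [0.0, 0.2, 0.5]:
    print(delta,
          noisy_gradient_descent(alpha=2 / (L + m), delta=delta),
          noisy_gradient_descent(alpha=1 / L, delta=delta))
```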

26 Gradient method with $\alpha = \frac{2}{L+m}$ (optimal stepsize with no noise): [Figure: convergence rates and iterations to convergence vs. condition ratio for $\delta \in \{.01, .02, .05, .1, .2, .5\}$.]
Gradient method with $\alpha = \frac{1}{L}$ (more conservative stepsize): [Figure: the same plots for the conservative stepsize.]

27 Nesterov's method (strongly convex $f$, with noise): [Figure: convergence rates and iterations to convergence vs. condition ratio $L/m$ for $\delta \in \{.05, .1, .2, .3, .4, .5\}$, with Nesterov (quadratic) for reference.]
Nesterov's method is not robust to noise. Can we have it all? (robustness AND performance)

28 Design approach: parameterize all proper $G$ of degree 2 in terms of $(\alpha, \beta, \eta)$:
$$x_{k+1} = x_k - \alpha \nabla f(y_k) + \beta(x_k - x_{k-1}), \qquad y_k = x_k + \eta(x_k - x_{k-1})$$
Special cases: $(\alpha, \beta, \eta) = (\alpha, 0, 0)$ is Gradient, $(\alpha, \beta, 0)$ is Heavy ball, and $(\alpha, \beta, \beta)$ is Nesterov.

29 Robust Momentum Method (ACC 18):
$$x_{k+1} = x_k - \alpha \nabla f(y_k) + \beta(x_k - x_{k-1}), \qquad y_k = x_k + \eta(x_k - x_{k-1})$$
Parameters designed via a root locus technique:
$$\alpha = \frac{\kappa(1-\rho)^2(1+\rho)}{L}, \qquad \beta = \frac{\kappa\rho^3}{\kappa-1}, \qquad \eta = \frac{\rho^3}{(\kappa-1)(1-\rho)^2(1+\rho)}$$
with tuning parameter $1 - \frac{1}{\sqrt{\kappa}}$ (fast + fragile) $\le \rho \le 1 - \frac{1}{\kappa}$ (slow + robust). When $\rho = 1 - \frac{1}{\kappa}$, we recover Gradient with $\alpha = \frac{1}{L}$; when $\rho = 1 - \frac{1}{\sqrt{\kappa}}$, we recover the Triple Momentum Method with optimal tuning (Van Scoy et al, 2018).
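A sketch of the Robust Momentum Method using the parameter formulas above, run at the fast end of the tuning range on an illustrative quadratic:

```python
import numpy as np

def rmm_params(kappa, L, rho):
    # Robust Momentum Method parameters (ACC 18 formulas).
    alpha = kappa * (1 - rho)**2 * (1 + rho) / L
    beta = kappa * rho**3 / (kappa - 1)
    eta = rho**3 / ((kappa - 1) * (1 - rho)**2 * (1 + rho))
    return alpha, beta, eta

m, L = 1.0, 100.0
kappa = L / m
rho = 1 - 1 / np.sqrt(kappa)        # fast + fragile end (Triple Momentum)
alpha, beta, eta = rmm_params(kappa, L, rho)

# x_{k+1} = x_k - alpha * grad f(y_k) + beta * (x_k - x_{k-1}),
# y_k = x_k + eta * (x_k - x_{k-1}), on f(x) = 0.5 x^T Q x.
Q = np.diag(np.linspace(m, L, 10))
x_prev = x = np.ones(10)
for _ in range(200):
    y = x + eta * (x - x_prev)
    x, x_prev = x - alpha * (Q @ y) + beta * (x - x_prev), x
print(np.linalg.norm(x))            # error decays roughly like rho^k
```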

30 Trade-off: performance vs. robustness. [Figure: convergence rate $\rho$ vs. noise strength $\delta$ for Gradient ($\alpha = 2/(L+m)$), Gradient ($\alpha = 1/L$), Gradient (min over $\alpha$), Nesterov's method, RMM ($\rho = 1 - 1/\sqrt{\kappa}$), two intermediate RMM tunings, and RMM (min over $\rho$); the rate axis is marked at $1 - 1/\sqrt{\kappa}$ and $1 - 1/\kappa$.]

31 Related works.
In this talk: unified analysis framework (SIOPT 16); robust momentum method (ACC 18).
Other families of algorithms: operator-splitting methods, e.g. ADMM (ICML 15); weakly convex functions ($1/k$ and $1/k^2$ rates) (ICML 17); distributed optimization algorithms (Allerton 17); stochastic variance reduction, e.g. SVRG (ICML 18).

32 Thank you! Manuscripts + code available:
