Expanding the reach of optimal methods
Dmitriy Drusvyatskiy (Mathematics, University of Washington)
Joint work with C. Kempton (UW), M. Fazel (UW), A.S. Lewis (Cornell), and S. Roy (UW)
BURKAPALOOZA! WCOM Spring 2016. AFOSR YIP.
Notation

A function f : Rⁿ → R is α-convex and β-smooth if q_x ≤ f ≤ Q_x, where

  Q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖²,
  q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (α/2)‖y − x‖².

[Figure: f sandwiched between the quadratics q_x and Q_x.]

Condition number: κ = β/α.
Complexity of first-order methods

Gradient descent: x_{k+1} = x_k − (1/β)∇f(x_k)

Majorization view: x_{k+1} = argmin Q_{x_k}(·)

                      | Gradient Descent | Optimal Methods
  β-smooth            | β/ɛ              | √(β/ɛ)
  β-smooth & α-convex | κ ln(1/ɛ)        | √κ ln(1/ɛ)

Table: Iterations until f(x_k) − f* < ɛ  (Nesterov '83, Yudin-Nemirovsky '83)

Optimal methods have downsides:
  - Not intuitive
  - Non-monotone
  - Difficult to augment with memory
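The majorization view can be sketched numerically; a minimal illustration (the quadratic test function, step count, and function names are my own, not from the talk):

```python
import numpy as np

def gradient_descent(grad, beta, x0, iters):
    """x_{k+1} = argmin Q_{x_k}(.) = x_k - (1/beta) grad f(x_k):
    each step exactly minimizes the quadratic upper model Q_{x_k}."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - grad(x) / beta
    return x

# Example: f(x) = 0.5 x^T A x with A = diag(1, 10),
# so beta = 10, alpha = 1, and kappa = 10.
A = np.diag([1.0, 10.0])
x = gradient_descent(lambda z: A @ z, beta=10.0, x0=[1.0, 1.0], iters=200)
```

On this α-convex instance the slow coordinate contracts by the factor 1 − 1/κ per step, matching the κ ln(1/ɛ) entry of the table.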
Optimal method by optimal averaging

Idea: maximize lower models of f.

Notation: x⁺ = x − (1/β)∇f(x) and x⁺⁺ = x − (1/α)∇f(x).

The convexity bound f ≥ q_x in canonical form:

  f(y) ≥ ( f(x) − ‖∇f(x)‖²/(2α) ) + (α/2)‖y − x⁺⁺‖²

Lower models:

  Q_A(x) = v_A + (α/2)‖x − x_A‖²,   Q_B(x) = v_B + (α/2)‖x − x_B‖²

⇒ for any λ ∈ [0, 1], a new lower model

  Q_λ := λQ_A + (1 − λ)Q_B = v_λ + (α/2)‖x − x_λ‖².

Key observation: v_λ ≤ f*.
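The canonical form is just the α-convexity bound f ≥ q_x with the square completed:

```latex
f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y-x\rangle + \tfrac{\alpha}{2}\|y-x\|^2
     \;=\; \Big(f(x) - \tfrac{\|\nabla f(x)\|^2}{2\alpha}\Big)
           + \tfrac{\alpha}{2}\Big\|y - \big(x - \tfrac{1}{\alpha}\nabla f(x)\big)\Big\|^2
     \;=\; \Big(f(x) - \tfrac{\|\nabla f(x)\|^2}{2\alpha}\Big)
           + \tfrac{\alpha}{2}\,\|y - x^{++}\|^2 .
```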
Optimal method by optimal averaging

The minimum v_λ is maximized at

  λ = proj_[0,1]( 1/2 + (v_A − v_B)/(α‖x_A − x_B‖²) ).

The quadratic Q_λ is the optimal averaging of (Q_A, Q_B).

Related to cutting-plane and bundle methods, and to geometric descent (Bubeck-Lee-Singh '15).
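The optimal λ can be checked directly; a small numeric sketch (the helper name and the test values are mine):

```python
import numpy as np

def optimal_average(vA, xA, vB, xB, alpha):
    """Optimal averaging of Q_A(x) = vA + (alpha/2)||x - xA||^2 and Q_B.
    Returns (v_lambda, x_lambda, lambda) with the minimum v_lambda maximized."""
    xA, xB = np.asarray(xA, float), np.asarray(xB, float)
    d2 = float(np.sum((xA - xB) ** 2))
    lam = 1.0 if d2 == 0 else min(1.0, max(0.0, 0.5 + (vA - vB) / (alpha * d2)))
    v = lam * vA + (1 - lam) * vB + 0.5 * alpha * lam * (1 - lam) * d2
    return v, lam * xA + (1 - lam) * xB, lam

# Q_A centered at 0 with minimum 0; Q_B centered at 2 with minimum -1.
v, c, lam = optimal_average(0.0, [0.0], -1.0, [2.0], alpha=1.0)
# lam = 0.75 and v = 0.125: the averaged lower bound beats both v_A and v_B.
```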
Optimal method by optimal averaging

Algorithm: Optimal averaging
  for k = 1, ..., K do
    Set Q(x) = ( f(x_k) − ‖∇f(x_k)‖²/(2α) ) + (α/2)‖x − x_k⁺⁺‖²
    Let Q_k(x) = v_k + (α/2)‖x − c_k‖² be the optimal average of (Q, Q_{k−1})
    Set x_{k+1} = line_search(c_k, x_k⁺)
  end

The line search is due to (Bubeck-Lee-Singh '15).

Optimal linear rate (D-Fazel-Roy '16): f(x_k⁺) − v_k ≤ ɛ after O(√(β/α) · ln(1/ɛ)) iterations.

  - Intuitive
  - Monotone in f(x_k⁺) and in v_k
  - Memory by optimally averaging (Q, Q_{k−1}, ..., Q_{k−t})
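A schematic implementation of the loop above, under two stated simplifications: the quadratic test problem is made up, and the Bubeck-Lee-Singh line search is replaced by a coarse grid search on the segment [c_k, x_k⁺] (so this is a sketch, not the talk's actual method):

```python
import numpy as np

def optimal_averaging_method(f, grad, alpha, beta, x0, iters):
    """Sketch of the optimal-averaging method for an alpha-convex,
    beta-smooth f.  Maintains Q_k(x) = v_k + (alpha/2)||x - c_k||^2
    with v_k <= f*."""
    x = np.asarray(x0, float)
    g = grad(x)
    v_k = f(x) - g @ g / (2 * alpha)     # canonical lower bound
    c_k = x - g / alpha                  # its center x^{++}
    for _ in range(iters):
        x_plus = x - g / beta            # gradient step x^+
        # crude stand-in for the line search of Bubeck-Lee-Singh:
        ts = np.linspace(0.0, 1.0, 21)
        x = min(((1 - t) * c_k + t * x_plus for t in ts), key=f)
        g = grad(x)
        vA = f(x) - g @ g / (2 * alpha)
        cA = x - g / alpha
        # optimally average the new lower model with the running one
        d2 = float(np.sum((cA - c_k) ** 2))
        lam = 1.0 if d2 == 0 else min(1.0, max(0.0,
                                               0.5 + (vA - v_k) / (alpha * d2)))
        v_k = lam * vA + (1 - lam) * v_k + 0.5 * alpha * lam * (1 - lam) * d2
        c_k = lam * cA + (1 - lam) * c_k
    return x, v_k

# Example: f(x) = 0.5 x^T A x with A = diag(1, 10); here f* = 0.
A = np.diag([1.0, 10.0])
x, v = optimal_averaging_method(lambda z: 0.5 * z @ A @ z, lambda z: A @ z,
                                alpha=1.0, beta=10.0, x0=[1.0, 1.0], iters=60)
```

Note how both monotonicity claims are visible: f(x_k⁺) only decreases, and the certified lower bound v_k only increases while staying below f*.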
Optimal method by optimal averaging

[Figure: convergence plot.] Figure: Logistic regression with regularization α = 10⁻⁴.
Nonsmooth & nonconvex minimization

Convex composite problem:

  min_x g(x) + h(c(x))

where
  - g : Rⁿ → R ∪ {+∞} is closed and convex;
  - h : Rᵐ → R is convex and L-Lipschitz;
  - c : Rⁿ → Rᵐ is C¹-smooth and ∇c is β-Lipschitz.

For convenience, set µ = Lβ.

Main examples:
  - Additive composite minimization: min_x g(x) + c(x)
  - Nonlinear least squares: min { ‖c(x)‖ : l_i ≤ x_i ≤ u_i for i = 1, ..., m }
  - Exact penalty subproblem: min_x g(x) + dist_K(c(x))

(Burke '85, '91; Fletcher '82; Powell '84; Wright '90; Yuan '83)
Prox-linear algorithm

Prox-linear mapping:

  x⁺ = argmin_y g(y) + h( c(x) + ∇c(x)(y − x) ) + (µ/2)‖y − x‖²

and the prox-gradient:

  G(x) = µ(x − x⁺).

Prox-linear method (Burke, Fletcher, Osborne, Powell, ... '80s): x_{k+1} = x_k⁺.

Examples: proximal gradient, Levenberg-Marquardt methods.

Convergence rate: ‖G(x_k)‖² < ɛ after O(µ²/ɛ) iterations.
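For the smooth least-squares specialization g = 0, h = ½‖·‖² (h is only locally Lipschitz, so this is purely illustrative), the prox-linear mapping has a closed-form solve: the classical Levenberg-Marquardt step. A sketch with a made-up residual map:

```python
import numpy as np

def prox_linear_step(c, J, x, mu):
    """Prox-linear mapping for f(x) = 0.5||c(x)||^2 (g = 0, h = 0.5||.||^2):
    x+ = argmin_y 0.5||c(x) + J(x)(y - x)||^2 + (mu/2)||y - x||^2,
    a linear solve in d = y - x (the Levenberg-Marquardt step)."""
    r, Jx = c(x), J(x)
    d = np.linalg.solve(Jx.T @ Jx + mu * np.eye(x.size), -Jx.T @ r)
    return x + d

# Made-up residual map with root (0, 0): c(x) = (exp(x1) - 1, x1 + x2).
c = lambda x: np.array([np.exp(x[0]) - 1.0, x[0] + x[1]])
J = lambda x: np.array([[np.exp(x[0]), 0.0], [1.0, 1.0]])

x = np.array([1.0, 1.0])
for _ in range(100):
    x_plus = prox_linear_step(c, J, x, mu=1.0)
    G = 1.0 * (x - x_plus)       # prox-gradient G(x) = mu (x - x+)
    x = x_plus
```

The quantity ‖G‖ computed in the loop is exactly the stationarity measure discussed next.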
Stopping criterion

What does ‖G(x)‖² < ɛ actually mean?

Stationarity for the target problem:
  0 ∈ ∂g(x) + ∇c(x)* ∂h(c(x))

Stationarity for the prox-subproblem:
  G(x) ∈ ∂g(x⁺) + ∇c(x)* ∂h( c(x) + ∇c(x)(x⁺ − x) )

Thm (D-Lewis '16): x⁺ is nearly stationary: there exists a pair (x̂, v̂) with

  v̂ ∈ ∂g(x̂) + ∇c(x̂)* ∂h(c(x̂)),

where ‖x̂ − x⁺‖ ≤ (1/µ)‖G(x)‖ and ‖v̂‖ ≤ 5‖G(x)‖.

(Proof: Ekeland's variational principle.)
Local linear convergence

Error bound property (Luo-Tseng '93):

  dist(x, {soln. set}) ≤ (1/α)‖G(x)‖  for x near x̄

⇒ local linear convergence:

  F(x_{k+1}) − F* ≤ (1 − α²/µ²)(F(x_k) − F*)

The following are essentially equivalent (D-Lewis '16):
  - EB property
  - Subregularity: dist(x, {soln. set}) ≤ (1/α) dist(0, ∂F(x)) for x near x̄
  - Quadratic growth: F(x) ≥ F(x̄) + (α/2) dist²(x, {soln. set}) for x near x̄

The rate becomes α/µ under tilt-stability (Poliquin-Rockafellar '98).
Acceleration

For nonconvex problems, the rate

  ‖G(x_k)‖² < ɛ after O(µ²/ɛ) iterations

is essentially best possible.

Measuring non-convexity:

  h(c(x)) = sup_w { ⟨w, c(x)⟩ − h*(w) }

Fact: x ↦ ⟨w, c(x)⟩ + (µ/2)‖x‖² is convex for every w ∈ dom h*.

Defn: ρ ∈ [0, 1] is a parameter such that x ↦ ⟨w, c(x)⟩ + ρ(µ/2)‖x‖² is convex for every w ∈ dom h*.
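The Fact combines the two Lipschitz constants: any w ∈ dom h* satisfies ‖w‖ ≤ L since h is L-Lipschitz, so the gradient of φ_w(x) := ⟨w, c(x)⟩ is µ-Lipschitz:

```latex
\|\nabla \varphi_w(x) - \nabla \varphi_w(y)\|
  = \|(\nabla c(x) - \nabla c(y))^* w\|
  \le \|w\|\,\|\nabla c(x) - \nabla c(y)\|
  \le L\beta\,\|x - y\| = \mu\,\|x - y\| ,
```

and a function with µ-Lipschitz gradient becomes convex after adding (µ/2)‖·‖².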
Acceleration

Accelerated prox-linear method!

Thm (D-Kempton '16):

  min_{i=1,...,k} ‖G(x_i)‖² ≤ O(µ²R²/k³) + ρ · O(µ²/k),

where R = diam(dom g).

Generalizes (Ghadimi-Lan '16) for additive composite problems.
Conclusions

A balanced approach:
  - Computational complexity
  - Acceleration
  - Variational analysis
Happy Birthday, Jim! 15/15