Expanding the reach of optimal methods
Dmitriy Drusvyatskiy (Mathematics, University of Washington)
Joint work with C. Kempton (UW), M. Fazel (UW), A.S. Lewis (Cornell), and S. Roy (UW)
BURKAPALOOZA! WCOM Spring 2016. AFOSR YIP.
Notation

A function f : Rⁿ → R is α-convex and β-smooth if q_x ≤ f ≤ Q_x, where

  Q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖²,
  q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (α/2)‖y − x‖².

[Figure: f sandwiched between the quadratics q_x and Q_x.]

Condition number: κ = β/α.
Complexity of first-order methods

Gradient descent: x_{k+1} = x_k − (1/β)∇f(x_k)

Majorization view: x_{k+1} = argmin Q_{x_k}(·)

                      | Gradient Descent | Optimal Methods
  β-smooth            | β/ɛ              | √(β/ɛ)
  β-smooth & α-convex | κ ln(1/ɛ)        | √κ ln(1/ɛ)

Table: Iterations until f(x_k) − f* < ɛ  (Nesterov '83, Yudin-Nemirovsky '83)

Optimal methods have downsides:
  - Not intuitive
  - Non-monotone
  - Difficult to augment with memory
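The majorization view can be sketched numerically; a minimal illustration (the quadratic test function, step count, and function names are my own, not from the talk):

```python
import numpy as np

def gradient_descent(grad, beta, x0, iters):
    """x_{k+1} = argmin Q_{x_k}(.) = x_k - (1/beta) grad f(x_k):
    each step exactly minimizes the quadratic upper model Q_{x_k}."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - grad(x) / beta
    return x

# Example: f(x) = 0.5 x^T A x with A = diag(1, 10),
# so beta = 10, alpha = 1, and kappa = 10.
A = np.diag([1.0, 10.0])
x = gradient_descent(lambda z: A @ z, beta=10.0, x0=[1.0, 1.0], iters=200)
```

On this α-convex instance the slow coordinate contracts by the factor 1 − 1/κ per step, matching the κ ln(1/ɛ) entry of the table.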
Optimal method by optimal averaging

Idea: maximize lower models of f.

Notation: x⁺ = x − (1/β)∇f(x) and x⁺⁺ = x − (1/α)∇f(x).

The convexity bound f ≥ q_x in canonical form:

  f(y) ≥ ( f(x) − ‖∇f(x)‖²/(2α) ) + (α/2)‖y − x⁺⁺‖²

Lower models:

  Q_A(x) = v_A + (α/2)‖x − x_A‖²,   Q_B(x) = v_B + (α/2)‖x − x_B‖²

⇒ for any λ ∈ [0, 1], a new lower model

  Q_λ := λQ_A + (1 − λ)Q_B = v_λ + (α/2)‖x − x_λ‖².

Key observation: v_λ ≤ f*.
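The canonical form is just the α-convexity bound f ≥ q_x with the square completed:

```latex
f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y-x\rangle + \tfrac{\alpha}{2}\|y-x\|^2
     \;=\; \Big(f(x) - \tfrac{\|\nabla f(x)\|^2}{2\alpha}\Big)
           + \tfrac{\alpha}{2}\Big\|y - \big(x - \tfrac{1}{\alpha}\nabla f(x)\big)\Big\|^2
     \;=\; \Big(f(x) - \tfrac{\|\nabla f(x)\|^2}{2\alpha}\Big)
           + \tfrac{\alpha}{2}\,\|y - x^{++}\|^2 .
```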
Optimal method by optimal averaging

The minimum v_λ is maximized at

  λ = proj_[0,1]( 1/2 + (v_A − v_B)/(α‖x_A − x_B‖²) ).

The quadratic Q_λ is the optimal averaging of (Q_A, Q_B).

Related to cutting-plane and bundle methods, and to geometric descent (Bubeck-Lee-Singh '15).
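The optimal λ can be checked directly; a small numeric sketch (the helper name and the test values are mine):

```python
import numpy as np

def optimal_average(vA, xA, vB, xB, alpha):
    """Optimal averaging of Q_A(x) = vA + (alpha/2)||x - xA||^2 and Q_B.
    Returns (v_lambda, x_lambda, lambda) with the minimum v_lambda maximized."""
    xA, xB = np.asarray(xA, float), np.asarray(xB, float)
    d2 = float(np.sum((xA - xB) ** 2))
    lam = 1.0 if d2 == 0 else min(1.0, max(0.0, 0.5 + (vA - vB) / (alpha * d2)))
    v = lam * vA + (1 - lam) * vB + 0.5 * alpha * lam * (1 - lam) * d2
    return v, lam * xA + (1 - lam) * xB, lam

# Q_A centered at 0 with minimum 0; Q_B centered at 2 with minimum -1.
v, c, lam = optimal_average(0.0, [0.0], -1.0, [2.0], alpha=1.0)
# lam = 0.75 and v = 0.125: the averaged lower bound beats both v_A and v_B.
```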
Optimal method by optimal averaging

Algorithm: Optimal averaging
  for k = 1, ..., K do
    Set Q(x) = ( f(x_k) − ‖∇f(x_k)‖²/(2α) ) + (α/2)‖x − x_k⁺⁺‖²
    Let Q_k(x) = v_k + (α/2)‖x − c_k‖² be the optimal average of (Q, Q_{k−1})
    Set x_{k+1} = line_search(c_k, x_k⁺)
  end

The line search is due to (Bubeck-Lee-Singh '15).

Optimal linear rate (D-Fazel-Roy '16): f(x_k⁺) − v_k ≤ ɛ after O(√(β/α) · ln(1/ɛ)) iterations.

  - Intuitive
  - Monotone in f(x_k⁺) and in v_k
  - Memory by optimally averaging (Q, Q_{k−1}, ..., Q_{k−t})
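A schematic implementation of the loop above, under two stated simplifications: the quadratic test problem is made up, and the Bubeck-Lee-Singh line search is replaced by a coarse grid search on the segment [c_k, x_k⁺] (so this is a sketch, not the talk's actual method):

```python
import numpy as np

def optimal_averaging_method(f, grad, alpha, beta, x0, iters):
    """Sketch of the optimal-averaging method for an alpha-convex,
    beta-smooth f.  Maintains Q_k(x) = v_k + (alpha/2)||x - c_k||^2
    with v_k <= f*."""
    x = np.asarray(x0, float)
    g = grad(x)
    v_k = f(x) - g @ g / (2 * alpha)     # canonical lower bound
    c_k = x - g / alpha                  # its center x^{++}
    for _ in range(iters):
        x_plus = x - g / beta            # gradient step x^+
        # crude stand-in for the line search of Bubeck-Lee-Singh:
        ts = np.linspace(0.0, 1.0, 21)
        x = min(((1 - t) * c_k + t * x_plus for t in ts), key=f)
        g = grad(x)
        vA = f(x) - g @ g / (2 * alpha)
        cA = x - g / alpha
        # optimally average the new lower model with the running one
        d2 = float(np.sum((cA - c_k) ** 2))
        lam = 1.0 if d2 == 0 else min(1.0, max(0.0,
                                               0.5 + (vA - v_k) / (alpha * d2)))
        v_k = lam * vA + (1 - lam) * v_k + 0.5 * alpha * lam * (1 - lam) * d2
        c_k = lam * cA + (1 - lam) * c_k
    return x, v_k

# Example: f(x) = 0.5 x^T A x with A = diag(1, 10); here f* = 0.
A = np.diag([1.0, 10.0])
x, v = optimal_averaging_method(lambda z: 0.5 * z @ A @ z, lambda z: A @ z,
                                alpha=1.0, beta=10.0, x0=[1.0, 1.0], iters=60)
```

Note how both monotonicity claims are visible: f(x_k⁺) only decreases, and the certified lower bound v_k only increases while staying below f*.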
Optimal method by optimal averaging

[Figure: convergence plot.] Figure: Logistic regression with regularization α = 10⁻⁴.
Nonsmooth & nonconvex minimization

Convex composite problem:

  min_x g(x) + h(c(x))

where
  - g : Rⁿ → R ∪ {+∞} is closed and convex;
  - h : Rᵐ → R is convex and L-Lipschitz;
  - c : Rⁿ → Rᵐ is C¹-smooth and ∇c is β-Lipschitz.

For convenience, set µ = Lβ.

Main examples:
  - Additive composite minimization: min_x g(x) + c(x)
  - Nonlinear least squares: min { ‖c(x)‖ : l_i ≤ x_i ≤ u_i for i = 1, ..., m }
  - Exact penalty subproblem: min_x g(x) + dist_K(c(x))

(Burke '85, '91; Fletcher '82; Powell '84; Wright '90; Yuan '83)
Prox-linear algorithm

Prox-linear mapping:

  x⁺ = argmin_y g(y) + h( c(x) + ∇c(x)(y − x) ) + (µ/2)‖y − x‖²

and the prox-gradient:

  G(x) = µ(x − x⁺).

Prox-linear method (Burke, Fletcher, Osborne, Powell, ... '80s): x_{k+1} = x_k⁺.

Examples: proximal gradient, Levenberg-Marquardt methods.

Convergence rate: ‖G(x_k)‖² < ɛ after O(µ²/ɛ) iterations.
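For the smooth least-squares specialization g = 0, h = ½‖·‖² (h is only locally Lipschitz, so this is purely illustrative), the prox-linear mapping has a closed-form solve: the classical Levenberg-Marquardt step. A sketch with a made-up residual map:

```python
import numpy as np

def prox_linear_step(c, J, x, mu):
    """Prox-linear mapping for f(x) = 0.5||c(x)||^2 (g = 0, h = 0.5||.||^2):
    x+ = argmin_y 0.5||c(x) + J(x)(y - x)||^2 + (mu/2)||y - x||^2,
    a linear solve in d = y - x (the Levenberg-Marquardt step)."""
    r, Jx = c(x), J(x)
    d = np.linalg.solve(Jx.T @ Jx + mu * np.eye(x.size), -Jx.T @ r)
    return x + d

# Made-up residual map with root (0, 0): c(x) = (exp(x1) - 1, x1 + x2).
c = lambda x: np.array([np.exp(x[0]) - 1.0, x[0] + x[1]])
J = lambda x: np.array([[np.exp(x[0]), 0.0], [1.0, 1.0]])

x = np.array([1.0, 1.0])
for _ in range(100):
    x_plus = prox_linear_step(c, J, x, mu=1.0)
    G = 1.0 * (x - x_plus)       # prox-gradient G(x) = mu (x - x+)
    x = x_plus
```

The quantity ‖G‖ computed in the loop is exactly the stationarity measure discussed next.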
Stopping criterion

What does ‖G(x)‖² < ɛ actually mean?

Stationarity for the target problem:
  0 ∈ ∂g(x) + ∇c(x)* ∂h(c(x))

Stationarity for the prox-subproblem:
  G(x) ∈ ∂g(x⁺) + ∇c(x)* ∂h( c(x) + ∇c(x)(x⁺ − x) )

Thm (D-Lewis '16): x⁺ is nearly stationary: there exists a pair (x̂, v̂) with

  v̂ ∈ ∂g(x̂) + ∇c(x̂)* ∂h(c(x̂)),

where ‖x̂ − x⁺‖ ≤ (1/µ)‖G(x)‖ and ‖v̂‖ ≤ 5‖G(x)‖.

(Proof: Ekeland's variational principle.)
Local linear convergence

Error bound property (Luo-Tseng '93):

  dist(x, {soln. set}) ≤ (1/α)‖G(x)‖  for x near x̄

⇒ local linear convergence:

  F(x_{k+1}) − F* ≤ (1 − α²/µ²)(F(x_k) − F*)

The following are essentially equivalent (D-Lewis '16):
  - EB property
  - Subregularity: dist(x, {soln. set}) ≤ (1/α) dist(0, ∂F(x)) for x near x̄
  - Quadratic growth: F(x) ≥ F(x̄) + (α/2) dist²(x, {soln. set}) for x near x̄

The rate becomes α/µ under tilt-stability (Poliquin-Rockafellar '98).
Acceleration

For nonconvex problems, the rate

  ‖G(x_k)‖² < ɛ after O(µ²/ɛ) iterations

is essentially best possible.

Measuring non-convexity:

  h(c(x)) = sup_w { ⟨w, c(x)⟩ − h*(w) }

Fact: x ↦ ⟨w, c(x)⟩ + (µ/2)‖x‖² is convex for every w ∈ dom h*.

Defn: ρ ∈ [0, 1] is a parameter such that x ↦ ⟨w, c(x)⟩ + ρ(µ/2)‖x‖² is convex for every w ∈ dom h*.
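The Fact combines the two Lipschitz constants: any w ∈ dom h* satisfies ‖w‖ ≤ L since h is L-Lipschitz, so the gradient of φ_w(x) := ⟨w, c(x)⟩ is µ-Lipschitz:

```latex
\|\nabla \varphi_w(x) - \nabla \varphi_w(y)\|
  = \|(\nabla c(x) - \nabla c(y))^* w\|
  \le \|w\|\,\|\nabla c(x) - \nabla c(y)\|
  \le L\beta\,\|x - y\| = \mu\,\|x - y\| ,
```

and a function with µ-Lipschitz gradient becomes convex after adding (µ/2)‖·‖².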
Acceleration

Accelerated prox-linear method!

Thm (D-Kempton '16):

  min_{i=1,...,k} ‖G(x_i)‖² ≤ O(µ²R²/k³) + ρ · O(µ²/k),

where R = diam(dom g).

Generalizes (Ghadimi-Lan '16) for additive composite problems.
Conclusions

A balanced approach:
  - Computational complexity
  - Acceleration
  - Variational analysis
Happy Birthday, Jim! 15/15