Composite nonlinear models at scale
Dmitriy Drusvyatskiy
Mathematics, University of Washington
Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell), C. Paquette (Lehigh), and S. Roy (UW)
Cornell ORIE 2017
AFOSR: FA9550-15-1-0237; NSF: DMS 1651851, CCF 1740551
Outline
1. Fast-gradient methods
   - Complexity theory (review)
   - New viewpoint: optimal quadratic averaging
2. Composite nonlinear models: $F(x) = h(c(x))$
   - Global complexity
   - Regularity and local rapid convergence
   - Illustration: phase retrieval
Notation
A function $f\colon \mathbb{R}^d \to \mathbb{R}$ is $\alpha$-convex and $\beta$-smooth if $q_x \le f \le Q_x$ for every $x$, where
$Q_x(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2}\|y - x\|^2$,
$q_x(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|^2$.
Condition number: $\kappa = \beta/\alpha$.
Complexity of first-order methods
Gradient descent: $x_{k+1} = x_k - \frac{1}{\beta}\nabla f(x_k)$
Majorization view: $x_{k+1} = \operatorname{argmin} Q_{x_k}(\cdot)$

Table: Iterations until $f(x_k) - f^\ast < \varepsilon$ (Nesterov 83, Yudin-Nemirovsky 83)
                                       Gradient Descent                          Optimal Methods
$\beta$-smooth                         $\frac{\beta}{\varepsilon}$               $\sqrt{\frac{\beta}{\varepsilon}}$
$\beta$-smooth & $\alpha$-convex       $\kappa \ln\frac{1}{\varepsilon}$         $\sqrt{\kappa}\,\ln\frac{1}{\varepsilon}$

Optimal methods have downsides:
- Not intuitive
- Not naturally monotone
- Difficult to augment with memory
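As a concrete illustration, here is a minimal NumPy sketch of the gradient-descent update above on a synthetic $\alpha$-convex, $\beta$-smooth quadratic; the test function, dimensions, and iteration count are illustrative choices, not part of the slides.

```python
import numpy as np

np.random.seed(0)
d = 50
# Synthetic alpha-convex, beta-smooth quadratic: f(x) = 0.5 x^T H x - b^T x
H = np.diag(np.linspace(1e-2, 1.0, d))      # eigenvalues lie in [alpha, beta]
b = np.random.randn(d)
alpha, beta = 1e-2, 1.0

f      = lambda x: 0.5 * x @ H @ x - b @ x
grad_f = lambda x: H @ x - b
f_star = f(np.linalg.solve(H, b))           # optimal value, for monitoring only

x = np.zeros(d)
for k in range(2000):
    x = x - (1.0 / beta) * grad_f(x)        # x_{k+1} = argmin Q_{x_k}
print("suboptimality:", f(x) - f_star)      # decays roughly like (1 - alpha/beta)^k
```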
Optimal quadratic averaging
Optimal method by optimal averaging
Idea: use lower models of $f$ instead.
Notation: $x^{+} = x - \frac{1}{\beta}\nabla f(x)$ and $x^{++} = x - \frac{1}{\alpha}\nabla f(x)$
Convexity bound $f \ge q_x$ in canonical form:
$f(y) \ge \Big(f(x) - \frac{\|\nabla f(x)\|^2}{2\alpha}\Big) + \frac{\alpha}{2}\|y - x^{++}\|^2$
Lower models:
$Q_A(x) = v_A + \frac{\alpha}{2}\|x - x_A\|^2$, $\qquad Q_B(x) = v_B + \frac{\alpha}{2}\|x - x_B\|^2$
$\Longrightarrow$ for any $\lambda \in [0,1]$, a new lower model
$Q_\lambda := \lambda Q_A + (1-\lambda) Q_B = v_\lambda + \frac{\alpha}{2}\|x - x_\lambda\|^2$
Key observation: $v_\lambda \le f^\ast$
Optimal method by optimal averaging
The minimum $v_\lambda$ is maximized at
$\lambda^\ast = \operatorname{proj}_{[0,1]}\Big(\frac{1}{2} + \frac{v_A - v_B}{\alpha\|x_A - x_B\|^2}\Big)$.
The quadratic $Q_{\lambda^\ast}$ is the optimal averaging of $(Q_A, Q_B)$.
Related to cutting plane and bundle methods.
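The optimal-averaging step above fits in a few lines. The sketch below (the helper name `optimal_average` is our own) computes $\lambda^\ast$ and returns the canonical form of the blended quadratic, using the identity $v_\lambda = \lambda v_A + (1-\lambda) v_B + \frac{\alpha}{2}\lambda(1-\lambda)\|x_A - x_B\|^2$.

```python
import numpy as np

def optimal_average(vA, xA, vB, xB, alpha):
    """Optimally average the lower models Q_A(x) = vA + (alpha/2)||x - xA||^2 and
    Q_B(x) = vB + (alpha/2)||x - xB||^2; return (v, center) of Q_{lambda*}."""
    gap2 = float(np.dot(xA - xB, xA - xB))
    if gap2 == 0.0:                           # same center: keep the larger minimum
        return max(vA, vB), np.array(xA, dtype=float)
    lam = np.clip(0.5 + (vA - vB) / (alpha * gap2), 0.0, 1.0)
    center = lam * xA + (1.0 - lam) * xB
    v = lam * vA + (1.0 - lam) * vB + 0.5 * alpha * lam * (1.0 - lam) * gap2
    return v, center
```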
Optimal method by optimal averaging
Algorithm: Optimal averaging
for k = 1, ..., K do
    Set $Q(x) = \Big(f(x_k) - \frac{\|\nabla f(x_k)\|^2}{2\alpha}\Big) + \frac{\alpha}{2}\|x - x_k^{++}\|^2$
    Let $Q_k(x) = v_k + \frac{\alpha}{2}\|x - c_k\|^2$ be the optimal average of $(Q, Q_{k-1})$
    Set $x_{k+1} = \mathrm{line\_search}(c_k, x_k^{+})$
end
- Equivalent to geometric descent (Bubeck-Lee-Singh 15).
- Optimal rate (Bubeck-Lee-Singh 15, D-Fazel-Roy 16): $f(x_k^{+}) - v_k \le \varepsilon$ after $O\big(\sqrt{\beta/\alpha}\,\ln\frac{1}{\varepsilon}\big)$ iterations.
- Intuitive; monotone in $f(x_k^{+})$ and in $v_k$; memory by optimally averaging $(Q, Q_{k-1}, \ldots, Q_{k-t})$.
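A sketch of the full loop, assuming the `optimal_average` helper above; a crude grid search on the segment $[c_k, x_k^{+}]$ stands in for `line_search`, and all other choices are illustrative.

```python
import numpy as np

def quad_average_descent(f, grad_f, alpha, beta, x0, K=200):
    """Optimal-averaging loop: maintain Q_k(x) = v_k + (alpha/2)||x - c_k||^2
    with v_k <= f*, using the optimal_average helper sketched earlier."""
    x = np.array(x0, dtype=float)
    g = grad_f(x)
    v, c = f(x) - g @ g / (2 * alpha), x - g / alpha         # initial lower model
    for _ in range(K):
        # x_{k+1} = line_search(c_k, x_k^+): a cheap grid search on the segment
        xp = x - g / beta
        x = min((c + t * (xp - c) for t in np.linspace(0.0, 1.0, 51)), key=f)
        g = grad_f(x)
        # lower model from the convexity bound at the new point, then average
        vQ, cQ = f(x) - g @ g / (2 * alpha), x - g / alpha
        v, c = optimal_average(vQ, cQ, v, c, alpha)
    return x - g / beta, v                                    # (x_K^+, lower bound v_K)
```

On the quadratic from the earlier gradient-descent sketch, `quad_average_descent(f, grad_f, 1e-2, 1.0, np.zeros(50))` should close the certified gap $f(x_K^{+}) - v_K$ at the accelerated $\sqrt{\kappa}$ rate rather than the $\kappa$ rate of plain gradient descent.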
Optimal method by optimal averaging
Figure: Logistic regression with regularization $\alpha = 10^{-4}$.
Extensions: proximal variants (Chen-Ma 16); underestimate sequences (Ma et al. 17).
Composite nonlinear models
Nonsmooth & nonconvex minimization
Convex composition: $\min_x F(x) = h(c(x))$, where
- $h\colon \mathbb{R}^m \to \mathbb{R}$ is convex and 1-Lipschitz,
- $c\colon \mathbb{R}^d \to \mathbb{R}^m$ is $C^1$-smooth and $\nabla c$ is $\beta$-Lipschitz.
(Burke 85, Cartis-Gould-Toint 11, Fletcher 82, Lewis-Wright 15, Nesterov 06, Powell 84, Wright 90, Yuan 83, ...)
Examples of $\min_x h(c(x))$:
- Robust phase retrieval: $\min_x \|(Ax)^2 - b\|_1$
- Robust PCA: $\min_{X \in \mathbb{R}^{d\times r},\, Y \in \mathbb{R}^{r\times k}} \|XY - D\|_1$
- Nonnegative factorization: $\min_{X, Y \ge 0} \|XY - D\|$
Prox-linear algorithm
$\min_x F(x) = h(c(x))$
Local model: $F_x(y) := h\big(c(x) + \nabla c(x)(y - x)\big)$
Accuracy: $|F_x(y) - F(y)| \le \frac{\beta}{2}\|y - x\|^2$ for all $x, y$
Prox-linear method (Burke, Fletcher, Nesterov, Powell, ...):
$x^{+} = \operatorname{argmin}_y \big\{F_x(y) + \frac{\beta}{2}\|y - x\|^2\big\}$
Big assumption: $x^{+}$ is computable (for now)
Figure: $f(x) = |x^2 - 1|$, its linearized model $f_x$ at $x = 0.5$, and the upper bound $f_x + (x - 0.5)^2$.
No finite termination.
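On this one-dimensional example the prox-linear subproblem has a closed form (soft-thresholding of $c(x)$ at level $c'(x)^2/\beta$), so the method is easy to run. The sketch below uses an illustrative starting point and shows rapid local convergence to $x = 1$ without finite termination.

```python
import numpy as np

# F(x) = |c(x)| with c(x) = x^2 - 1; here nabla c is 2-Lipschitz, so beta = 2.
beta = 2.0
c, dc = lambda x: x**2 - 1.0, lambda x: 2.0 * x

def prox_linear_step(x):
    """x+ = argmin_y |c(x) + c'(x)(y - x)| + (beta/2)(y - x)^2.
    In one dimension: soft-threshold c(x) at c'(x)^2 / beta, then map back."""
    cv, g = c(x), dc(x)
    if abs(g) < 1e-12:                 # degenerate linearization (e.g. at x = 0)
        return x
    u = np.sign(cv) * max(abs(cv) - g**2 / beta, 0.0)
    return x + (u - cv) / g

x = 2.0
for k in range(8):
    print(k, x, abs(c(x)))             # the error |x - 1| roughly squares each step
    x = prox_linear_step(x)
```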
Sublinear rate
Prox-gradient: $G(x) := \beta(x - x^{+})$
Philosophy (Nesterov 13): $G(x)$ plays the role of $\nabla F(x)$
Thm (D-Paquette 16): Define the Moreau envelope
$F_t(x) := \inf_y \big\{F(y) + \frac{t}{2}\|y - x\|^2\big\}$.
Then $F_{2\beta}$ is smooth with $\|G(x)\| \asymp \|\nabla F_{2\beta}(x)\|$.

Iterations: $\frac{\beta}{\varepsilon^2}\,(F(x_0) - F^\ast)$        Basic operations: $\frac{\beta c}{\varepsilon^3}\,(F(x_0) - F^\ast)$
Likely optimal (Carmon, Duchi, Hinder, Sidford 17)
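A numerical sanity check of the comparability claim on the toy function $F(x) = |x^2 - 1|$: both the prox-linear step and the Moreau-envelope proximal point are computed by brute-force grid search, so this is purely illustrative and not how either quantity would be computed at scale.

```python
import numpy as np

beta = 2.0
F = lambda x: abs(x**2 - 1.0)                         # F = h(c), h = |.|, c(x) = x^2 - 1
grid = np.linspace(-3.0, 3.0, 600001)                 # brute-force grid for both subproblems

def prox_grad(x):
    """G(x) = beta * (x - x+), with x+ the prox-linear step, found by grid search."""
    model = np.abs((x**2 - 1.0) + 2.0 * x * (grid - x)) + 0.5 * beta * (grid - x)**2
    return beta * (x - grid[np.argmin(model)])

def moreau_grad(x, t=2 * beta):
    """nabla F_t(x) = t * (x - prox_{F/t}(x)), also found by grid search."""
    vals = np.abs(grid**2 - 1.0) + 0.5 * t * (grid - x)**2
    return t * (x - grid[np.argmin(vals)])

for x in [0.25, 0.6, 1.5, 2.5]:
    print(x, abs(prox_grad(x)), abs(moreau_grad(x)))  # the two stay within a constant factor
```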
Two regularity conditions
Fix $\bar{x} \in S := \{x : 0 \in \partial F(x)\}$.
1) Tilt-stability (Poliquin-Rockafellar 98):
$v \mapsto \operatorname{argmin}_{x \in B_r(\bar{x})} \{F(x) - \langle v, x\rangle\}$ is $\frac{1}{\alpha}$-Lipschitz near $v = 0$.
2) Sharpness (Burke-Ferris 93):
$F(x) - F(\bar{x}) \ge \alpha \cdot \operatorname{dist}(x, S)$ for $x \in B_r(\bar{x})$.

Convergence rates (Nesterov 06, D-Lewis 15):
Regularity          Guarantee
Tilt-stability      $\frac{F(x_{k+1}) - F^\ast}{F(x_k) - F^\ast} \le 1 - \frac{\alpha}{\beta}$
Sharpness           $\|x_{k+1} - \bar{x}\| \le O(\|x_k - \bar{x}\|^2)$
Illustration: phase retrieval
Example: Robust phase retrieval
Problem: find $x \in \mathbb{R}^d$ satisfying $(a_i^T x)^2 \approx b_i$ for $a_1, \ldots, a_m \in \mathbb{R}^d$ and $b_1, \ldots, b_m \in \mathbb{R}$.
Composite formulation: $\min_x F(x) := \frac{1}{m}\|(Ax)^2 - b\|_1$
Assume $a_i \sim N(0, I)$ independently and $b = (A\bar{x})^2$.
Two key consequences: there are constants $\beta, \alpha > 0$ such that w.h.p.
- Approximation (Duchi-Ruan 17): $|F(y) - F_x(y)| \le \frac{\beta}{2}\|y - x\|_2^2$
- Sharpness (Eldar-Mendelson 14): $\frac{1}{m}\|(Ax)^2 - (Ay)^2\|_1 \ge \alpha\,\|x - y\|_2\,\|x + y\|_2$.
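A quick experiment probing the sharpness inequality on a random Gaussian instance; it only samples random pairs $(x, y)$ rather than certifying the bound uniformly, and the sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 500
A = rng.standard_normal((m, d))

def sharpness_ratio(x, y):
    """(1/m)||(Ax)^2 - (Ay)^2||_1 divided by ||x - y|| * ||x + y||; the sharpness
    bound predicts this stays above some alpha > 0 once m is a large multiple of d."""
    num = np.abs((A @ x)**2 - (A @ y)**2).mean()
    return num / (np.linalg.norm(x - y) * np.linalg.norm(x + y))

ratios = [sharpness_ratio(rng.standard_normal(d), rng.standard_normal(d))
          for _ in range(200)]
print("smallest ratio over 200 random pairs:", min(ratios))
```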
Intuition
$F$ approximates the population objective
$F_P(x) = \mathbb{E}_a\big[\,|(a^T x)^2 - (a^T \bar{x})^2|\,\big]$,
a convex function of the spectrum $\lambda(xx^T - \bar{x}\bar{x}^T)$.
Figure: Contour plot of $x \mapsto \|\nabla F_P(x)\|$.
Stationary point consistency
Thm (Davis-D-Paquette 17): Whenever $m \ge Cd$, with high probability every stationary point $x$ of $F$ either lies close to $\{\pm\bar{x}\}$, or has small norm and small correlation with $\bar{x}$, with both tolerances controlled by a power of $d/m$.
Prox-linear and subgradient methods
Prox-linear method: $x^{+} = \operatorname{argmin}_y \big\{F_x(y) + \frac{\beta}{2}\|y - x\|^2\big\}$
Polyak subgradient method: $x^{+} = x - \frac{F(x) - \inf F}{\|\zeta\|^2}\,\zeta$ for $\zeta \in \partial F(x)$
Thm: There exists $R > 0$ such that if $\frac{\|x_0 - \bar{x}\|}{\|\bar{x}\|} \le R$, then w.h.p.
- prox-linear iterates converge quadratically to $\bar{x}$ (Duchi-Ruan 17),
- Polyak subgradient iterates converge to $\bar{x}$ at a constant linear rate (Davis-D-Paquette 17).
Spectral initialization: (Wang et al. 16), (Candès et al. 15).
Convex approach: (Candès-Strohmer-Voroninski 13).
Other nonconvex approaches: (Candès-Li-Soltanolkotabi 15), (Tan-Vershynin 17), (Wang-Giannakis-Eldar 16), (Zhang-Chi-Liang 16), (Sun-Qu-Wright 17).
Nonconvex subgradient methods: (Davis-Grimmer 17).
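A self-contained NumPy sketch of the Polyak subgradient method for the robust phase retrieval objective, using $\inf F = 0$ in the noiseless model; for simplicity it starts from a warm start inside a small ball around $\bar{x}$ instead of a spectral initialization, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 100, 800
A = rng.standard_normal((m, d))
x_bar = rng.standard_normal(d)
b = (A @ x_bar)**2                                      # noiseless measurements

F = lambda x: np.abs((A @ x)**2 - b).mean()             # F(x) = (1/m)||(Ax)^2 - b||_1

def polyak_step(x):
    """One Polyak subgradient step; inf F = 0 in the noiseless model."""
    r = A @ x
    zeta = (2.0 / m) * (A.T @ (np.sign(r**2 - b) * r))  # a subgradient of F at x
    return x - (F(x) / (zeta @ zeta)) * zeta

x = x_bar + 0.1 * rng.standard_normal(d)                # warm start inside a small ball
for k in range(60):
    if F(x) == 0.0:
        break
    x = polyak_step(x)
print("relative error:", np.linalg.norm(x - x_bar) / np.linalg.norm(x_bar))
```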
Figure: $(d, m) \approx (2^9, 2^{11})$; relative error $\|x_k - \bar{x}\| / \|\bar{x}\|$ versus iteration $k$.
Figure: $(d, m) \approx (2^{22}, 2^{24})$ (left) and $(d, m) \approx (2^{24}, 2^{25})$ (right); $\|x_k - \bar{x}\| / \|\bar{x}\|$ versus iteration.
Summary:
- Optimal quadratic averaging
- Composite nonlinear models: complexity and regularity
- Illustration: phase retrieval

References:
1. The nonsmooth landscape of phase retrieval, Davis-D-Paquette, arXiv:1711.03247, 2017.
2. Error bounds, quadratic growth, and linear convergence of proximal methods, D-Lewis, Math. Oper. Res., 2017.
3. Efficiency of minimizing compositions of convex functions and smooth maps, D-Paquette, arXiv:1605.00125, 2016.
4. An optimal first order method based on optimal quadratic averaging, D-Fazel-Roy, SIAM J. Optim., 2016.

Thank you!