Taylor-like models in nonsmooth optimization

Dmitriy Drusvyatskiy
Mathematics, University of Washington

Joint work with Ioffe (Technion), Lewis (Cornell), and Paquette (UW)

SIAM Optimization 2017
AFOSR, NSF
Fix a closed function f : R^n → R ∪ {+∞}.

Slope: fastest instantaneous rate of decrease,
    |∇f|(x̄) := limsup_{x → x̄} (f(x̄) − f(x))⁺ / ‖x̄ − x‖.

If f is convex, then |∇f|(x) = dist(0; ∂f(x)).

Critical points: x̄ is critical for f ⟺ |∇f|(x̄) = 0.

Deficiency: |∇f| is discontinuous ⟹ cannot be used to terminate.
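A quick one-dimensional check of these definitions (an illustration, not from the slides): for f(x) = |x|,

    |∇f|(x̄) = limsup_{x → x̄} (|x̄| − |x|)⁺ / |x̄ − x| = 1 for every x̄ ≠ 0, while |∇f|(0) = 0,

which agrees with dist(0; ∂f(x̄)): the subdifferential is {sign(x̄)} away from the origin and [−1, 1] at it. The jump of |∇f| from 1 to 0 at the unique critical point is exactly the discontinuity that prevents using a small slope at nearby points as a termination test.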
Basic question: Is there a computable continuous surrogate G for |∇f|?

Desirable properties:
1. G is continuous,
2. G(x) = 0 ⟺ |∇f|(x) = 0,
3. epi G and epi |∇f| are close.

Various contexts: cutting planes (Kelley), bundle (Lemaréchal, Noll, Sagastizábal, Wolfe), gradient sampling (Goldstein, Burke-Lewis-Overton).
Outline

1. Taylor-like models: step-size, stationarity, error bounds
2. Convex composite g + h∘c: prox-linear method, local linear/quadratic rates
Taylor-like models

Task: determine the quality of a point x ∈ R^n for the problem min_y f(y).

Structural assumption: a Taylor-like model f_x is available, satisfying
    |f_x(y) − f(y)| ≤ (η/2)‖x − y‖²   for all y.

Slope surrogate:
    x⁺ ∈ argmin_y f_x(y)   and   G(x) := ‖x − x⁺‖.
Thm (D-Ioffe-Lewis '16): There exists x̂ satisfying
    (point proximity)     (1/2)‖x − x̂‖ ≤ G(x),
    (value proximity)     √((f(x̂) − f(x))⁺ / η) ≤ G(x),
    (near stationarity)   (1/η)·|∇f|(x̂) ≤ G(x).
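A minimal computational sketch of this framework (my illustration in Python, not from the talk), using the proximal gradient method as the Taylor-like model for an additive composite f = c + λ‖·‖₁ of my choosing:

import numpy as np

# Illustration only (not from the talk): proximal gradient viewed as a
# Taylor-like model for f(x) = c(x) + g(x), with
#   c(x) = 0.5*||A x - b||^2   (smooth, grad Lipschitz with constant eta = ||A^T A||),
#   g(x) = lam*||x||_1         (closed, convex).
# The model  f_x(y) = c(x) + <grad c(x), y-x> + (eta/2)||y-x||^2 + g(y)
# deviates from f by at most eta*||y-x||^2, so it is a Taylor-like model;
# its minimizer x_plus is a soft-thresholding step, and G(x) = ||x - x_plus||
# serves as the continuous stationarity surrogate / termination test.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient(A, b, lam, x0, tol=1e-6, max_iter=5000):
    eta = np.linalg.norm(A.T @ A, 2)          # Lipschitz constant of grad c
    x = x0.copy()
    G = np.inf
    for _ in range(max_iter):
        grad = A.T @ (A @ x - b)              # gradient of the smooth part
        x_plus = soft_threshold(x - grad / eta, lam / eta)   # argmin of the model
        G = np.linalg.norm(x - x_plus)        # slope surrogate G(x)
        x = x_plus
        if G < tol:                           # terminate on the step size
            break
    return x, G

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((40, 100))
    b = rng.standard_normal(40)
    x, G = prox_gradient(A, b, lam=0.1, x0=np.zeros(100))
    print("final G(x) =", G)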
Error bounds and linear rates

Thm (D-Ioffe-Lewis '16): Let S ⊆ R^n and x̄ ∈ S be arbitrary. Suppose

    (Slope EB)      dist(x; S) ≤ κ·|∇f|(x)   for all x near x̄.

Then it holds:

    (Step-size EB)  dist(x; S) ≤ (3κη + 2)·G(x)   for all x near x̄.

The Slope EB is the phenomenon underlying linear rates.
The Step-size EB aids linear rate analysis (Luo-Tseng '93).

Rem: a similar result holds for the surrogate G(x) := f(x) − f_x(x⁺).
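For intuition (a toy example of mine, not on the slides): for f(x) = x² with S = {0}, the slope is |∇f|(x) = 2|x|, so dist(x; S) = |x| = (1/2)·|∇f|(x) and the Slope EB holds with κ = 1/2; the theorem then converts this into the Step-size EB dist(x; S) ≤ (3η/2 + 2)·G(x) for any Taylor-like model with parameter η.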
Convex composite minimization: h∘c + g
Nonsmooth & nonconvex minimization

Convex composite:
    min_x f(x) = g(x) + h(c(x))
where
- g : R^d → R ∪ {+∞} is closed, convex,
- h : R^m → R is convex and L-Lipschitz,
- c : R^d → R^m is C¹-smooth and ∇c is β-Lipschitz.

For convenience, set η = Lβ.

(Burke '85, '91; Cartis-Gould-Toint '11; Fletcher '82; Lewis-Wright '15; Powell '84; Wright '90; Yuan '83)
Composite examples

Convex composite:
    min_x f(x) = g(x) + h(c(x))

Examples:
- Additive composite minimization: min_x g(x) + c(x)
- Nonlinear least squares: min_x { ‖c(x)‖ : l_i ≤ x_i ≤ u_i for i = 1, ..., m }
- Nonnegative matrix factorization: min_{X,Y} ‖XYᵀ − D‖ s.t. X, Y ≥ 0
- Robust phase retrieval (Duchi-Ruan '17): min_x Σ_i |⟨a_i, x⟩² − b_i|
- Exact penalty subproblem: min_x g(x) + dist_K(c(x))
Prox-linear algorithm

Prox-linear model:
    f_x(y) = g(y) + h( c(x) + ∇c(x)(y − x) ) + (η/2)‖y − x‖²,
    x⁺ = argmin_y f_x(y),   G(x) = η‖x − x⁺‖.

Justification:
    f_x(y) − η‖y − x‖² ≤ f(y) ≤ f_x(y)   for all x, y.

Prox-linear method (Burke, Fletcher, Osborne, Powell, ... '80s):
    x_{k+1} = x_k⁺.

Eg: proximal gradient, Levenberg-Marquardt methods.

Convergence rate: G(x_k) < ε after O(η/ε²) iterations.
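A sketch instantiating the prox-linear method above on the robust phase retrieval example from the examples slide (my illustration, not code from the talk); the inner convex subproblem is solved inexactly by projected gradient ascent on its dual, and eta is a crude computable stand-in for the theoretical constant Lβ:

import numpy as np

# Sketch (my illustration): prox-linear applied to  f(x) = sum_i |<a_i, x>^2 - b_i|,
# i.e. g = 0, h = ||.||_1, c(x) = (Ax)^2 - b (elementwise square).
# The convex subproblem
#     min_d  ||c(x) + J(x) d||_1 + (eta/2)||d||^2
# is solved inexactly through its dual
#     max_{||u||_inf <= 1}  <u, c(x)> - (1/(2 eta))||J(x)^T u||^2,   d = -(1/eta) J(x)^T u,
# by projected gradient ascent.

def prox_linear_phase_retrieval(A, b, x0, eta, tol=1e-6, max_iter=50, inner_iter=300):
    x = x0.copy()
    G = np.inf
    for _ in range(max_iter):
        Ax = A @ x
        r = Ax**2 - b                              # c(x)
        J = 2.0 * (Ax[:, None] * A)                # Jacobian of c at x
        u = np.zeros(len(b))                       # dual variable for the ell_1 term
        step = eta / (np.linalg.norm(J, 2)**2 + 1e-12)
        for _ in range(inner_iter):                # projected gradient ascent on the dual
            grad_u = r - (J @ (J.T @ u)) / eta
            u = np.clip(u + step * grad_u, -1.0, 1.0)
        d = -(J.T @ u) / eta                       # prox-linear step: x_plus = x + d
        G = eta * np.linalg.norm(d)                # stationarity surrogate G(x)
        x = x + d
        if G < tol:                                # step-size termination criterion
            break
    return x, G

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, m = 20, 80
    A = rng.standard_normal((m, n))
    x_true = rng.standard_normal(n)
    b = (A @ x_true)**2
    # crude computable stand-in for eta = L*beta (L from h = ||.||_1, beta from grad c)
    eta = np.sqrt(m) * 2.0 * np.linalg.norm(A, 2) * np.max(np.linalg.norm(A, axis=1))
    x0 = x_true + 0.05 * rng.standard_normal(n)    # start near the signal
    x, G = prox_linear_phase_retrieval(A, b, x0, eta)
    print("final G(x):", G)
    print("distance to {+/- x_true}:",
          min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true)))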
Stopping criterion

What does G(x) < ε actually mean?

Stationarity for the target problem:
    0 ∈ ∂g(x) + ∇c(x)ᵀ∂h(c(x)).

Stationarity for the prox-subproblem:
    G(x) ≥ dist( 0; ∂g(x⁺) + ∇c(x)ᵀ∂h( c(x) + ∇c(x)(x⁺ − x) ) ).
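The inequality G(x) ≥ dist(·) follows from the optimality condition for the prox-subproblem (a step worth making explicit): since x⁺ minimizes f_x, we have 0 ∈ ∂g(x⁺) + ∇c(x)ᵀ∂h(c(x) + ∇c(x)(x⁺ − x)) + η(x⁺ − x), and hence the displayed distance is at most η‖x⁺ − x‖ = G(x).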
Thm (D-Lewis '16): x⁺ is nearly stationary, because there exists x̂ with
    ‖x̂ − x‖ ≤ (1/η)·G(x)   and   |∇f|(x̂) ≤ G(x).

Thm (D-Paquette '16): G(x) is comparable, up to an absolute constant, to ‖2η(x − prox_{f/2η}(x))‖, the norm of the gradient of the Moreau envelope of f.
Local quadratic convergence

Let S = {stationary points} and fix x̄ ∈ S.

Thm (Burke-Ferris '95): Weak sharp minimum
    0 < α ≤ |∇f|(x)   for all x ∉ S near x̄
⟹ local quadratic convergence
    dist(x_{k+1}; S) ≤ O(dist²(x_k; S)).

Growth interpretation: weak sharp minimum ⟺
    f(x) ≥ f(proj(x; S)) + α·dist(x; S)   for x near x̄.
Local linear convergence

Thm (D-Lewis '16): Error bound property
    dist(x; S) ≤ (1/α)·|∇f|(x)   for x near x̄
⟹ local linear convergence
    f(x_{k+1}) − f* ≤ (1 − α²/η²)·(f(x_k) − f*).

Growth interpretation (D-Mordukhovich-Nghia '14): EB property ⟺
    f(x) ≥ f(proj(x; S)) + (α/2)·dist²(x; S)   for x near x̄.

The rate α²/η² improves to α/η under tilt-stability (Poliquin-Rockafellar '98).
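Two one-dimensional examples (my illustration, not from the talk) separate the two regimes: f(x) = |x| has a weak sharp minimum at x̄ = 0 with α = 1 (indeed f(x) = f(0) + 1·dist(x; S)), the regime of local quadratic convergence; while f(x) = x²/2 satisfies only the error bound property, with dist(x; S) = |x| = |∇f|(x) (so α = 1) and growth f(x) = f(0) + (1/2)·dist²(x; S), the regime of local linear convergence.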
Robust phase retrieval (Duchi-Ruan '17)

Problem: find x ∈ R^n satisfying ⟨a_i, x⟩² ≈ b_i for a_1, ..., a_m ∈ R^n and b_1, ..., b_m ∈ R_+.

Defn (Eldar-Mendelson '12): A ∈ R^{m×n} is stable if
    ‖(Ax)² − (Ay)²‖₁ ≥ λ·‖x − y‖·‖x + y‖   (squares taken elementwise).

Thm (Duchi-Ruan '17): If a_i ~ N(0, I_n) and m/n ≳ 1, then A is stable with high probability.

Two ingredients:
1) In the noiseless case b = (Ax*)², the problem min_x ‖(Ax)² − b‖₁ = ‖(Ax)² − (Ax*)²‖₁ has a weak sharp minimum ⟹ local quadratic convergence!
2) Can find x₀ in the attraction region w.h.p. using the spectrum.
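One standard way to realize the second ingredient (a sketch of the classical spectral initialization; Duchi-Ruan use a refined variant): take x₀ along the top eigenvector of the weighted covariance (1/m)Σ_i b_i a_i a_iᵀ, rescaled by the estimated signal norm √(mean(b)).

import numpy as np

# Spectral initialization sketch (illustration; not necessarily the exact variant
# used by Duchi-Ruan).  For Gaussian a_i and b_i = <a_i, x*>^2 one has
# E[b_i a_i a_i^T] = ||x*||^2 I + 2 x* x*^T, so the top eigenvector of the
# weighted covariance aligns with x*, and E[b_i] = ||x*||^2 gives the scale.

def spectral_init(A, b):
    m, n = A.shape
    M = (A * b[:, None]).T @ A / m          # (1/m) * sum_i b_i a_i a_i^T
    eigvals, eigvecs = np.linalg.eigh(M)    # symmetric eigendecomposition
    v = eigvecs[:, -1]                      # top eigenvector
    return np.sqrt(np.mean(b)) * v          # rescale to the estimated norm

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, m = 50, 400
    A = rng.standard_normal((m, n))
    x_true = rng.standard_normal(n)
    b = (A @ x_true)**2
    x0 = spectral_init(A, b)
    err = min(np.linalg.norm(x0 - x_true), np.linalg.norm(x0 + x_true))
    print("relative init error:", err / np.linalg.norm(x_true))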
RNA reconstruction (Duchi-Ruan '17), n = 222, m = 3n.
[Figure] Panels: (a) x₀, (b) 10 inaccurate solves, (c) one accurate solve, (d) original image.
Summary

1. Taylor-like models: step-size, stationarity, error bounds
2. Convex composite g + h∘c: prox-linear method, local linear/quadratic rates

Other recent works:
1. First-order complexity & acceleration (Paquette, 2:00-2:25)
2. Stochastic prox-linear algorithms (Duchi-Ruan '17)
3. Robust phase retrieval (Duchi-Ruan '17)
Thank you!
References

- Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria. D-Ioffe-Lewis, 2016, arXiv:1610.03446.
- Error bounds, quadratic growth, and linear convergence of proximal methods. D-Lewis, 2016, arXiv:1602.06661.
- Efficiency of minimizing compositions of convex functions and smooth maps. D-Paquette, 2016, arXiv:1605.00125.