Taylor-like models in nonsmooth optimization

Dmitriy Drusvyatskiy
Mathematics, University of Washington

Joint work with Ioffe (Technion), Lewis (Cornell), and Paquette (UW)

SIAM Optimization 2017
AFOSR, NSF
Fix a closed function f : R^n → R ∪ {+∞}.

Slope: fastest instantaneous rate of decrease,
    |∇f|(x̄) := limsup_{x → x̄} (f(x̄) − f(x))⁺ / ‖x̄ − x‖.

If f is convex, then |∇f|(x) = dist(0; ∂f(x)).

Critical points: x̄ is critical for f ⟺ |∇f|(x̄) = 0.

Deficiency: |∇f| is discontinuous ⟹ cannot be used to terminate.
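A quick one-dimensional check of these definitions (an illustration, not from the slides): for f(x) = |x|,

    |∇f|(x̄) = limsup_{x → x̄} (|x̄| − |x|)⁺ / |x̄ − x| = 1 for every x̄ ≠ 0, while |∇f|(0) = 0,

which agrees with dist(0; ∂f(x̄)): the subdifferential is {sign(x̄)} away from the origin and [−1, 1] at it. The jump of |∇f| from 1 to 0 at the unique critical point is exactly the discontinuity that prevents using a small slope at nearby points as a termination test.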
Basic question: Is there a computable continuous surrogate G for |∇f|?

Desirable properties:
1. G is continuous,
2. G(x) = 0 ⟺ |∇f|(x) = 0,
3. epi G and epi |∇f| are close.

Various contexts: cutting planes (Kelley), bundle (Lemaréchal, Noll, Sagastizábal, Wolfe), gradient sampling (Goldstein, Burke-Lewis-Overton).
Outline

1. Taylor-like models: step-size, stationarity, error bounds
2. Convex composite g + h∘c: prox-linear method, local linear/quadratic rates
Taylor-like models

Task: determine the quality of a point x ∈ R^n for the problem min_y f(y).

Structural assumption: a Taylor-like model f_x is available, satisfying
    |f_x(y) − f(y)| ≤ (η/2)‖x − y‖²   for all y.

Slope surrogate:
    x⁺ ∈ argmin_y f_x(y)   and   G(x) := ‖x − x⁺‖.
Thm (D-Ioffe-Lewis '16): There exists x̂ satisfying
    (point proximity)     (1/2)‖x − x̂‖ ≤ G(x),
    (value proximity)     √((f(x̂) − f(x))⁺ / η) ≤ G(x),
    (near stationarity)   (1/η)·|∇f|(x̂) ≤ G(x).
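A minimal computational sketch of this framework (my illustration in Python, not from the talk), using the proximal gradient method as the Taylor-like model for an additive composite f = c + λ‖·‖₁ of my choosing:

import numpy as np

# Illustration only (not from the talk): proximal gradient viewed as a
# Taylor-like model for f(x) = c(x) + g(x), with
#   c(x) = 0.5*||A x - b||^2   (smooth, grad Lipschitz with constant eta = ||A^T A||),
#   g(x) = lam*||x||_1         (closed, convex).
# The model  f_x(y) = c(x) + <grad c(x), y-x> + (eta/2)||y-x||^2 + g(y)
# deviates from f by at most eta*||y-x||^2, so it is a Taylor-like model;
# its minimizer x_plus is a soft-thresholding step, and G(x) = ||x - x_plus||
# serves as the continuous stationarity surrogate / termination test.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient(A, b, lam, x0, tol=1e-6, max_iter=5000):
    eta = np.linalg.norm(A.T @ A, 2)          # Lipschitz constant of grad c
    x = x0.copy()
    G = np.inf
    for _ in range(max_iter):
        grad = A.T @ (A @ x - b)              # gradient of the smooth part
        x_plus = soft_threshold(x - grad / eta, lam / eta)   # argmin of the model
        G = np.linalg.norm(x - x_plus)        # slope surrogate G(x)
        x = x_plus
        if G < tol:                           # terminate on the step size
            break
    return x, G

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((40, 100))
    b = rng.standard_normal(40)
    x, G = prox_gradient(A, b, lam=0.1, x0=np.zeros(100))
    print("final G(x) =", G)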
Error bounds and linear rates

Thm (D-Ioffe-Lewis '16): Let S ⊆ R^n and x̄ ∈ S be arbitrary. Suppose

    (Slope EB)      dist(x; S) ≤ κ·|∇f|(x)   for all x near x̄.

Then it holds:

    (Step-size EB)  dist(x; S) ≤ (3κη + 2)·G(x)   for all x near x̄.

The Slope EB is the phenomenon underlying linear rates.
The Step-size EB aids linear rate analysis (Luo-Tseng '93).

Rem: a similar result holds for the surrogate G(x) := f(x) − f_x(x⁺).
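For intuition (a toy example of mine, not on the slides): for f(x) = x² with S = {0}, the slope is |∇f|(x) = 2|x|, so dist(x; S) = |x| = (1/2)·|∇f|(x) and the Slope EB holds with κ = 1/2; the theorem then converts this into the Step-size EB dist(x; S) ≤ (3η/2 + 2)·G(x) for any Taylor-like model with parameter η.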
Convex composite minimization: h∘c + g
Nonsmooth & nonconvex minimization

Convex composite:
    min_x f(x) = g(x) + h(c(x))
where
- g : R^d → R ∪ {+∞} is closed, convex,
- h : R^m → R is convex and L-Lipschitz,
- c : R^d → R^m is C¹-smooth and ∇c is β-Lipschitz.

For convenience, set η = Lβ.

(Burke '85, '91; Cartis-Gould-Toint '11; Fletcher '82; Lewis-Wright '15; Powell '84; Wright '90; Yuan '83)
Composite examples

Convex composite:
    min_x f(x) = g(x) + h(c(x))

Examples:
- Additive composite minimization: min_x g(x) + c(x)
- Nonlinear least squares: min_x { ‖c(x)‖ : l_i ≤ x_i ≤ u_i for i = 1, ..., m }
- Nonnegative matrix factorization: min_{X,Y} ‖XYᵀ − D‖ s.t. X, Y ≥ 0
- Robust phase retrieval (Duchi-Ruan '17): min_x Σ_i |⟨a_i, x⟩² − b_i|
- Exact penalty subproblem: min_x g(x) + dist_K(c(x))
Prox-linear algorithm

Prox-linear model:
    f_x(y) = g(y) + h( c(x) + ∇c(x)(y − x) ) + (η/2)‖y − x‖²,
    x⁺ = argmin_y f_x(y),   G(x) = η‖x − x⁺‖.

Justification:
    f_x(y) − η‖y − x‖² ≤ f(y) ≤ f_x(y)   for all x, y.

Prox-linear method (Burke, Fletcher, Osborne, Powell, ... '80s):
    x_{k+1} = x_k⁺.

Eg: proximal gradient, Levenberg-Marquardt methods.

Convergence rate: G(x_k) < ε after O(η/ε²) iterations.
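A sketch instantiating the prox-linear method above on the robust phase retrieval example from the examples slide (my illustration, not code from the talk); the inner convex subproblem is solved inexactly by projected gradient ascent on its dual, and eta is a crude computable stand-in for the theoretical constant Lβ:

import numpy as np

# Sketch (my illustration): prox-linear applied to  f(x) = sum_i |<a_i, x>^2 - b_i|,
# i.e. g = 0, h = ||.||_1, c(x) = (Ax)^2 - b (elementwise square).
# The convex subproblem
#     min_d  ||c(x) + J(x) d||_1 + (eta/2)||d||^2
# is solved inexactly through its dual
#     max_{||u||_inf <= 1}  <u, c(x)> - (1/(2 eta))||J(x)^T u||^2,   d = -(1/eta) J(x)^T u,
# by projected gradient ascent.

def prox_linear_phase_retrieval(A, b, x0, eta, tol=1e-6, max_iter=50, inner_iter=300):
    x = x0.copy()
    G = np.inf
    for _ in range(max_iter):
        Ax = A @ x
        r = Ax**2 - b                              # c(x)
        J = 2.0 * (Ax[:, None] * A)                # Jacobian of c at x
        u = np.zeros(len(b))                       # dual variable for the ell_1 term
        step = eta / (np.linalg.norm(J, 2)**2 + 1e-12)
        for _ in range(inner_iter):                # projected gradient ascent on the dual
            grad_u = r - (J @ (J.T @ u)) / eta
            u = np.clip(u + step * grad_u, -1.0, 1.0)
        d = -(J.T @ u) / eta                       # prox-linear step: x_plus = x + d
        G = eta * np.linalg.norm(d)                # stationarity surrogate G(x)
        x = x + d
        if G < tol:                                # step-size termination criterion
            break
    return x, G

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, m = 20, 80
    A = rng.standard_normal((m, n))
    x_true = rng.standard_normal(n)
    b = (A @ x_true)**2
    # crude computable stand-in for eta = L*beta (L from h = ||.||_1, beta from grad c)
    eta = np.sqrt(m) * 2.0 * np.linalg.norm(A, 2) * np.max(np.linalg.norm(A, axis=1))
    x0 = x_true + 0.05 * rng.standard_normal(n)    # start near the signal
    x, G = prox_linear_phase_retrieval(A, b, x0, eta)
    print("final G(x):", G)
    print("distance to {+/- x_true}:",
          min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true)))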
Stopping criterion

What does G(x) < ε actually mean?

Stationarity for the target problem:
    0 ∈ ∂g(x) + ∇c(x)ᵀ∂h(c(x)).

Stationarity for the prox-subproblem:
    G(x) ≥ dist( 0; ∂g(x⁺) + ∇c(x)ᵀ∂h( c(x) + ∇c(x)(x⁺ − x) ) ).
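The inequality G(x) ≥ dist(·) follows from the optimality condition for the prox-subproblem (a step worth making explicit): since x⁺ minimizes f_x, we have 0 ∈ ∂g(x⁺) + ∇c(x)ᵀ∂h(c(x) + ∇c(x)(x⁺ − x)) + η(x⁺ − x), and hence the displayed distance is at most η‖x⁺ − x‖ = G(x).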
Thm (D-Lewis '16): x⁺ is nearly stationary, because there exists x̂ with
    ‖x̂ − x‖ ≤ (1/η)·G(x)   and   |∇f|(x̂) ≤ G(x).

Thm (D-Paquette '16): G(x) is comparable, up to an absolute constant, to ‖2η(x − prox_{f/2η}(x))‖, the norm of the gradient of the Moreau envelope of f.
Local quadratic convergence

Let S = {stationary points} and fix x̄ ∈ S.

Thm (Burke-Ferris '95): Weak sharp minimum
    0 < α ≤ |∇f|(x)   for all x ∉ S near x̄
⟹ local quadratic convergence
    dist(x_{k+1}; S) ≤ O(dist²(x_k; S)).

Growth interpretation: weak sharp minimum ⟺
    f(x) ≥ f(proj(x; S)) + α·dist(x; S)   for x near x̄.
Local linear convergence

Thm (D-Lewis '16): Error bound property
    dist(x; S) ≤ (1/α)·|∇f|(x)   for x near x̄
⟹ local linear convergence
    f(x_{k+1}) − f* ≤ (1 − α²/η²)·(f(x_k) − f*).

Growth interpretation (D-Mordukhovich-Nghia '14): EB property ⟺
    f(x) ≥ f(proj(x; S)) + (α/2)·dist²(x; S)   for x near x̄.

The rate α²/η² improves to α/η under tilt-stability (Poliquin-Rockafellar '98).
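Two one-dimensional examples (my illustration, not from the talk) separate the two regimes: f(x) = |x| has a weak sharp minimum at x̄ = 0 with α = 1 (indeed f(x) = f(0) + 1·dist(x; S)), the regime of local quadratic convergence; while f(x) = x²/2 satisfies only the error bound property, with dist(x; S) = |x| = |∇f|(x) (so α = 1) and growth f(x) = f(0) + (1/2)·dist²(x; S), the regime of local linear convergence.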
Robust phase retrieval (Duchi-Ruan '17)

Problem: find x ∈ R^n satisfying ⟨a_i, x⟩² ≈ b_i for a_1, ..., a_m ∈ R^n and b_1, ..., b_m ∈ R_+.

Defn (Eldar-Mendelson '12): A ∈ R^{m×n} is stable if
    ‖(Ax)² − (Ay)²‖₁ ≥ λ·‖x − y‖·‖x + y‖   (squares taken elementwise).

Thm (Duchi-Ruan '17): If a_i ~ N(0, I_n) and m/n ≳ 1, then A is stable with high probability.

Two ingredients:
1) In the noiseless case b = (Ax*)², the problem min_x ‖(Ax)² − b‖₁ = ‖(Ax)² − (Ax*)²‖₁ has a weak sharp minimum ⟹ local quadratic convergence!
2) Can find x₀ in the attraction region w.h.p. using the spectrum.
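One standard way to realize the second ingredient (a sketch of the classical spectral initialization; Duchi-Ruan use a refined variant): take x₀ along the top eigenvector of the weighted covariance (1/m)Σ_i b_i a_i a_iᵀ, rescaled by the estimated signal norm √(mean(b)).

import numpy as np

# Spectral initialization sketch (illustration; not necessarily the exact variant
# used by Duchi-Ruan).  For Gaussian a_i and b_i = <a_i, x*>^2 one has
# E[b_i a_i a_i^T] = ||x*||^2 I + 2 x* x*^T, so the top eigenvector of the
# weighted covariance aligns with x*, and E[b_i] = ||x*||^2 gives the scale.

def spectral_init(A, b):
    m, n = A.shape
    M = (A * b[:, None]).T @ A / m          # (1/m) * sum_i b_i a_i a_i^T
    eigvals, eigvecs = np.linalg.eigh(M)    # symmetric eigendecomposition
    v = eigvecs[:, -1]                      # top eigenvector
    return np.sqrt(np.mean(b)) * v          # rescale to the estimated norm

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, m = 50, 400
    A = rng.standard_normal((m, n))
    x_true = rng.standard_normal(n)
    b = (A @ x_true)**2
    x0 = spectral_init(A, b)
    err = min(np.linalg.norm(x0 - x_true), np.linalg.norm(x0 + x_true))
    print("relative init error:", err / np.linalg.norm(x_true))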
RNA reconstruction (Duchi-Ruan '17), n = 222, m = 3n.
[Figure] Panels: (a) x₀, (b) 10 inaccurate solves, (c) one accurate solve, (d) original image.
Summary

1. Taylor-like models: step-size, stationarity, error bounds
2. Convex composite g + h∘c: prox-linear method, local linear/quadratic rates

Other recent works:
1. First-order complexity & acceleration (Paquette, 2:00-2:25)
2. Stochastic prox-linear algorithms (Duchi-Ruan '17)
3. Robust phase retrieval (Duchi-Ruan '17)
Thank you!
References

- Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria. D-Ioffe-Lewis, 2016, arXiv:1610.03446.
- Error bounds, quadratic growth, and linear convergence of proximal methods. D-Lewis, 2016, arXiv:1602.06661.
- Efficiency of minimizing compositions of convex functions and smooth maps. D-Paquette, 2016, arXiv:1605.00125.