Taylor-like models in nonsmooth optimization

Dmitriy Drusvyatskiy
Mathematics, University of Washington
Joint work with Ioffe (Technion), Lewis (Cornell), and Paquette (UW)
SIAM Optimization 2017
AFOSR, NSF

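Fix a closed function $f\colon \mathbb{R}^n \to \overline{\mathbb{R}}$.
Slope: fastest instantaneous rate of decrease,
$$|\nabla f|(\bar x) := \limsup_{x \to \bar x} \frac{(f(\bar x) - f(x))^+}{\|\bar x - x\|}.$$
If $f$ is convex, then $|\nabla f|(x) = \operatorname{dist}(0; \partial f(x))$.
Critical points: $x$ is critical for $f$ $\iff$ $|\nabla f|(x) = 0$.
Deficiency: $|\nabla f|$ is discontinuous $\Longrightarrow$ cannot be used to terminate.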
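A minimal numerical sketch of the deficiency noted above, assuming the one-dimensional convex function f(x) = |x| (chosen here purely for illustration; it does not appear in the talk). For convex f the slope equals dist(0; ∂f(x)), so it is 1 at every x ≠ 0 and 0 at the minimizer. Iterates can therefore approach the minimizer while the slope never drops below 1.

```python
def slope_abs(x: float) -> float:
    """Slope of f(x) = |x|, computed as dist(0, subdifferential of f at x)."""
    if x == 0.0:
        # The subdifferential is the interval [-1, 1], which contains 0.
        return 0.0
    # The subdifferential is the single point {sign(x)}, at distance 1 from 0.
    return 1.0


# Points converging to the minimizer x = 0.
points = [10.0 ** (-k) for k in range(1, 8)]
slopes = [slope_abs(x) for x in points]

# The slope stays at 1 along the whole sequence, yet it is 0 at the limit.
print(slopes)
print(slope_abs(0.0))
```

A test of the form "stop when the slope is below ε" would never fire along this sequence, which is the motivation for a continuous surrogate.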

Basic question: Is there a computable continuous surrogate $\mathcal{G}$ for $|\nabla f|$?
Desirable properties:
1. $\mathcal{G}$ is continuous,
2. $\mathcal{G}(x) = 0 \iff |\nabla f|(x) = 0$,
3. $\operatorname{epi} \mathcal{G}$ and $\operatorname{epi} |\nabla f|$ are close.
Various contexts: cutting planes (Kelley), bundle (Lemaréchal, Noll, Sagastizábal, Wolfe), gradient sampling (Goldstein, Burke-Lewis-Overton)

Outline
1. Taylor-like models: step-size, stationarity, error bounds
2. Convex composite $g + h \circ c$: prox-linear method, local linear/quadratic rates

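Taylor-like models
Task: Determine quality of $x \in \mathbb{R}^n$ for $\min_y f(y)$.
Structural assumption: Taylor-like model $f_x$ available:
$$|f_x(y) - f(y)| \le \frac{\eta}{2}\|x - y\|^2 \quad \forall y.$$
Slope surrogate: $x^+ \in \operatorname{argmin}_y f_x(y)$ and $\mathcal{G}(x) := \|x - x^+\|$.
Thm (D-Ioffe-Lewis 16): There exists $\hat x$ satisfying
(point proximity) $\tfrac{1}{2}\|x - \hat x\| \lesssim \mathcal{G}(x)$,
(value proximity) $\tfrac{1}{\eta}\,(f(\hat x) - f(x)) \lesssim \mathcal{G}(x)$,
(near stationarity) $\tfrac{1}{\eta}\,|\nabla f|(\hat x) \lesssim \mathcal{G}(x)$.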
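A small sketch of the structural assumption above, under stated assumptions: the test function f(x) = Σ log cosh(x_i), the constant ETA = 1, and all function names are illustrative choices, not taken from the talk. Because this f is convex with a 1-Lipschitz gradient, the linearization plus (η/2)‖y − x‖² is a Taylor-like model, its minimizer x⁺ is a gradient step, and the surrogate reduces to ‖∇f(x)‖/η.

```python
import numpy as np

ETA = 1.0  # assumed model constant; the gradient of f below is 1-Lipschitz


def f(x):
    # Smooth convex test function: sum of log(cosh(x_i)), computed stably.
    return float(np.sum(np.logaddexp(x, -x) - np.log(2.0)))


def grad_f(x):
    return np.tanh(x)


def model(x, y):
    # Taylor-like model: linearization of f at x plus a quadratic penalty.
    d = y - x
    return f(x) + float(grad_f(x) @ d) + 0.5 * ETA * float(d @ d)


def step_and_surrogate(x):
    # The model is minimized in closed form by a gradient step of length 1/ETA.
    x_plus = x - grad_f(x) / ETA
    return x_plus, float(np.linalg.norm(x - x_plus))


rng = np.random.default_rng(0)
x = rng.normal(size=5)

# Check |f_x(y) - f(y)| <= (ETA/2) * ||x - y||^2 on random test points.
worst_ratio = 0.0
for _ in range(1000):
    y = x + rng.normal(size=5)
    gap = abs(model(x, y) - f(y))
    worst_ratio = max(worst_ratio, gap / (0.5 * ETA * float((y - x) @ (y - x))))

x_plus, G = step_and_surrogate(x)
print(worst_ratio, G)
```

The printed ratio should not exceed 1, and the surrogate value is a continuous function of x, unlike the slope.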

Error bounds and linear rates
Thm (D-Ioffe-Lewis 16): Let $S \subset \mathbb{R}^n$ be arbitrary and $\bar x \in S$ arbitrary. Suppose
(Slope EB) $\operatorname{dist}(x; S) \le \kappa\,|\nabla f|(x)$ for all $x$ near $\bar x$.
Then it holds:
(Step-size EB) $\operatorname{dist}(x; S) \le (3\kappa\eta + 2)\,\mathcal{G}(x)$ for all $x$ near $\bar x$.
Slope EB: the phenomenon underlying linear rates.
Step-size EB aids linear rate analysis (Luo-Tseng 93).
Rem: Similar for the surrogate $\mathcal{G}(x) := f(x) - f_x(x^+)$.
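The two error bounds can be checked numerically on a toy instance. The sketch below assumes a strongly convex quadratic f(x) = ½ xᵀQx with S = {0}, the gradient-step model with η equal to the largest eigenvalue of Q, and κ equal to the reciprocal of the smallest eigenvalue. None of these choices come from the talk; they are simply a case where both bounds can be verified by hand.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(6, 6))
Q = B.T @ B + 0.5 * np.eye(6)        # symmetric positive definite
eigs = np.linalg.eigvalsh(Q)         # eigenvalues in ascending order
eta = float(eigs[-1])                # model constant: largest eigenvalue
kappa = 1.0 / float(eigs[0])         # slope EB constant: dist <= kappa * slope

worst = 0.0
for _ in range(1000):
    x = rng.normal(size=6)
    grad = Q @ x                     # gradient of f(x) = 0.5 * x^T Q x
    dist_to_S = float(np.linalg.norm(x))   # S = {0}
    slope = float(np.linalg.norm(grad))
    G = slope / eta                  # ||x - x_plus|| for x_plus = x - grad / eta
    # Slope EB holds on this instance.
    assert dist_to_S <= kappa * slope + 1e-12
    # Track how tight the step-size EB is.
    worst = max(worst, dist_to_S / ((3 * kappa * eta + 2) * G))
print(worst)
```

The printed value is the largest observed ratio dist(x; S) / ((3κη + 2) G(x)); a value of at most 1 is what the step-size error bound predicts.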

Convex composite minimization $h \circ c + g$

Nonsmooth & Nonconvex minimization
Convex composition:
$$\min_x\; f(x) = g(x) + h(c(x))$$
where
$g\colon \mathbb{R}^d \to \overline{\mathbb{R}}$ is closed, convex.
$h\colon \mathbb{R}^m \to \mathbb{R}$ is convex and $L$-Lipschitz.
$c\colon \mathbb{R}^d \to \mathbb{R}^m$ is $C^1$-smooth and $\nabla c$ is $\beta$-Lipschitz.
For convenience, set $\eta = L\beta$.
(Burke 85, 91, Cartis-Gould-Toint 11, Fletcher 82, Lewis-Wright 15, Powell 84, Wright 90, Yuan 83)

Composite examples
Convex composition: $\min_x\; f(x) = g(x) + h(c(x))$
Examples:
Additive composite minimization: $\min_x\; g(x) + c(x)$
Nonlinear least squares: $\min_x\; \{\|c(x)\| : l_i \le x_i \le u_i \text{ for } i = 1, \ldots, m\}$
Nonnegative Matrix Factorization: $\min_{X,Y}\; \|XY^T - D\| \ \text{s.t.}\ X, Y \ge 0$
Robust Phase Retrieval (Duchi-Ruan 17): $\min_x\; \|\langle a, x\rangle^2 - b\|_1$
Exact penalty subproblem: $\min_x\; g(x) + \operatorname{dist}_K(c(x))$

Prox-linear algorithm
Prox-linear model:
$$f_x(y) = g(y) + h\big(c(x) + \nabla c(x)(y - x)\big) + \frac{\eta}{2}\|y - x\|^2$$
$$x^+ = \operatorname{argmin}_y f_x(y), \qquad \mathcal{G}(x) = \eta\,(x - x^+)$$
Justification: $f_x(y) - \eta\|y - x\|^2 \le f(y) \le f_x(y)$ for all $x, y$.
Prox-linear method (Burke, Fletcher, Osborne, Powell, ... 80s): $x_{k+1} = x_k^+$.
E.g.: proximal gradient, Levenberg-Marquardt methods
Convergence rate: $\|\mathcal{G}(x_k)\| < \epsilon$ after $O\!\left(\frac{\eta}{\epsilon^2}\right)$ iterations
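As a hedged illustration of the iteration x_{k+1} = x_k⁺, the sketch below specializes it to the proximal gradient case named on the slide. The problem instance is an assumption made for this example: g(x) = λ‖x‖₁, h(t) = t (so L = 1), and c(x) = ½‖Ax − b‖² (so β = ‖A‖₂²). The function names and the random data are hypothetical. The loop stops on the prox-gradient mapping ‖G(x)‖ < tol.

```python
import numpy as np


def soft_threshold(v, tau):
    # Proximal map of tau * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def prox_gradient(A, b, lam, tol=1e-8, max_iter=10000):
    """Prox-linear iteration for g = lam*||.||_1, h(t) = t, c(x) = 0.5*||Ax - b||^2."""
    eta = np.linalg.norm(A, 2) ** 2      # L * beta with L = 1, beta = ||A||_2^2
    x = np.zeros(A.shape[1])
    for k in range(max_iter):
        grad = A.T @ (A @ x - b)
        # Minimizer of the prox-linear model at x.
        x_plus = soft_threshold(x - grad / eta, lam / eta)
        G = eta * (x - x_plus)           # prox-gradient mapping at x
        x = x_plus
        if np.linalg.norm(G) < tol:      # stopping criterion on ||G(x)||
            break
    return x, float(np.linalg.norm(G)), k + 1


rng = np.random.default_rng(0)
A = rng.normal(size=(40, 10))
x_true = np.zeros(10)
x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true
x_hat, G_norm, iters = prox_gradient(A, b, lam=0.1)
print(iters, G_norm)
```

For a general h and vector-valued c the model minimization has no closed form and needs an inner convex solver; the soft-thresholding step here is specific to the additive composite case.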

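Stopping criterion
What does $\|\mathcal{G}(x)\| < \epsilon$ actually mean?
Stationarity for target problem:
$$0 \in \partial g(x) + \nabla c(x)^* \partial h(c(x))$$
Stationarity for prox-subproblem:
$$\|\mathcal{G}(x)\| \ge \operatorname{dist}\Big(0;\ \partial g(x^+) + \nabla c(x)^* \partial h\big(c(x) + \nabla c(x)(x^+ - x)\big)\Big)$$
Thm (D-Lewis 16): $x^+$ is nearly stationary because $\exists\, \hat x$ with $\|\hat x - x\| \lesssim \tfrac{1}{\eta}\|\mathcal{G}(x)\|$ and $|\nabla f|(\hat x) \lesssim \|\mathcal{G}(x)\|$.
Thm (D-Paquette 16): $\|\mathcal{G}(x)\| \approx \big\|2\eta\,\big(x - \operatorname{prox}_{F/2\eta}(x)\big)\big\|$.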

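Local quadratic convergence
Let $S = \{\text{stationary points}\}$ and fix $\bar x \in S$.
Thm (Burke-Ferris 95): Weak sharp minimum
$$0 < \alpha \le |\nabla f|(x) \quad \forall x \notin S \text{ near } \bar x$$
$\Longrightarrow$ local quadratic convergence
$$\operatorname{dist}(x_{k+1}; S) \le O\big(\operatorname{dist}^2(x_k; S)\big).$$
Growth interpretation: Weak sharp minimum $\Longrightarrow$
$$f(x) \ge f(\operatorname{proj}(x; S)) + \alpha \cdot \operatorname{dist}(x, S) \quad \forall x \text{ near } \bar x.$$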
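A one-dimensional sketch of this behaviour, assuming the toy objective f(x) = |x² − 1| with h = |·| (L = 1), c(x) = x² − 1 (β = 2), hence η = 2, and g = 0. This instance is chosen for illustration and is not from the talk. In one dimension the prox-linear subproblem has a closed-form solution, so no inner solver is needed; near the sharp minimizer x = 1 the step coincides with a Newton step on c.

```python
def prox_linear_step(x: float, eta: float = 2.0) -> float:
    """One prox-linear step for f(x) = |x**2 - 1| (h = |.|, c(x) = x**2 - 1)."""
    a = x * x - 1.0          # c(x)
    b = 2.0 * x              # c'(x)
    # Minimize |a + b*d| + (eta/2) * d**2 over the displacement d.
    if eta * abs(a) <= b * b:
        d = -a / b           # minimizer sits at the kink: a Newton step on c
    else:
        d = -(1.0 if a > 0 else -1.0) * b / eta   # minimizer on a smooth piece
    return x + d


x = 3.0
errors = [abs(x - 1.0)]      # distance to the stationary point x = 1
for _ in range(8):
    x = prox_linear_step(x)
    errors.append(abs(x - 1.0))
print(errors)
```

Along this run each error is bounded by the square of the previous one (up to rounding), which is the quadratic rate stated in the theorem for weak sharp minima.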

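Local linear convergence
Thm (D-Lewis 16): Error bound property
$$\operatorname{dist}(x; S) \le \frac{1}{\alpha}\,|\nabla f|(x) \quad \text{for } x \text{ near } \bar x$$
$\Longrightarrow$ local linear convergence
$$f(x_{k+1}) - f^* \le \Big(1 - \frac{\alpha^2}{\eta^2}\Big)\,\big(f(x_k) - f^*\big)$$
Growth interpretation (D-Mordukhovich-Nghia 14): EB property $\Longrightarrow$
$$f(x) \ge f(\operatorname{proj}(x, S)) + \frac{\alpha}{2}\operatorname{dist}^2(x, S) \quad \text{for } x \text{ near } \bar x$$
Rate becomes $\alpha/\eta$ under tilt-stability (Poliquin-Rockafellar 98)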
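The contraction factor 1 − α²/η² can be observed on a simple instance. The sketch below assumes a strongly convex quadratic f(x) = ½ xᵀQx (so S = {0} and f* = 0), with g = 0 and h(t) = t, in which case the prox-linear step is a gradient step of length 1/η. Here α and η are taken to be the extreme eigenvalues of Q. These are illustrative assumptions; the theorem itself covers far more general composite problems.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(8, 5))
Q = B.T @ B + 0.1 * np.eye(5)            # f(x) = 0.5 * x^T Q x, minimized at 0
eigs = np.linalg.eigvalsh(Q)             # ascending eigenvalues
alpha, eta = float(eigs[0]), float(eigs[-1])


def f(x):
    return 0.5 * float(x @ Q @ x)


x = rng.normal(size=5)
factor = 1.0 - (alpha / eta) ** 2        # contraction factor stated in the theorem
ratios = []
for _ in range(50):
    x_next = x - (Q @ x) / eta           # prox-linear step with g = 0, h(t) = t
    ratios.append(f(x_next) / f(x))      # f* = 0 on this instance
    x = x_next
print(max(ratios), factor)
```

On this instance the observed per-step ratio is in fact better than the stated factor, which is consistent with the remark that the rate improves under additional structure.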

Robust phase retrieval (Duchi-Ruan 17)
Problem: Find $x \in \mathbb{R}^n$ satisfying $\langle a_i, x\rangle^2 \approx b_i$ for $a_1, \ldots, a_m \in \mathbb{R}^n$ and $b_1, \ldots, b_m \in \mathbb{R}_+$.
Defn (Eldar-Mendelson 12): $A \in \mathbb{R}^{m \times n}$ is stable if
$$\|(Ax)^2 - (Ay)^2\|_1 \ge \lambda\,\|x - y\|\,\|x + y\|.$$
Thm (Duchi-Ruan 17): If $a_i \sim N(0, I_n)$ and $m/n \gtrsim 1$, then $A$ is stable with high probability.
Two ingredients:
1) Problem $\min_x \|(Ax)^2 - b\|_1 = \|(Ax)^2 - (A\bar x)^2\|_1$ has a weak sharp minimum $\Longrightarrow$ local quadratic convergence!
2) Can find $x_0$ in attraction region w.h.p. using spectrum.

RNA reconstruction (Duchi-Ruan 17)
$n = 2^{22}$, $m = 3n$
[Figure: (a) $x_0$, (b) 10 inaccurate solves, (c) one accurate solve, (d) original image.]

Summary
1. Taylor-like models: step-size, stationarity, error bounds
2. Convex composite $g + h \circ c$: prox-linear method, local linear/quadratic rates
Other recent works:
1. First-order complexity & Acceleration (Paquette 2:00-2:25)
2. Stochastic prox-linear algorithms (Duchi-Ruan 17)
3. Robust phase retrieval (Duchi-Ruan 17)

Thank you!

References
Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria. D-Ioffe-Lewis, 2016, arXiv.
Error bounds, quadratic growth, and linear convergence of proximal methods. D-Lewis, 2016, arXiv.
Efficiency of minimizing compositions of convex functions and smooth maps. D-Paquette, 2016, arXiv.
