Algorithms for Nonsmooth Optimization


Algorithms for Nonsmooth Optimization
Frank E. Curtis, Lehigh University
Presented at the Center for Optimization and Statistical Learning, Northwestern University, 2 March 2018

Outline: Motivating Examples; Subdifferential Theory; Fundamental Algorithms; Nonconvex Nonsmooth Functions; General Framework

Nonsmooth optimization
In mathematical optimization, one wants to solve min_{x ∈ X} f(x), i.e., minimize an objective subject to constraints.
Why nonsmooth optimization? Nonsmoothness can arise for different reasons:
physical (phenomena can be nonsmooth): phase changes in materials
technological (constraints impose nonsmoothness): obstacles in shape design
methodological (nonsmoothness introduced by solution method): decompositions; penalty formulations
numerical (analytically smooth, but practically nonsmooth): stiff problems
(Bagirov, Karmitsa, Mäkelä (2014))

Data fitting
min_{x ∈ R^n} θ(x) + ψ(x), where, e.g., θ(x) = ‖Ax − b‖₂² and ψ(x) = Σ_{i=1}^n φ(x_i) with
φ₁(t) = α|t|/(1 + α|t|),  φ₂(t) = log(α|t| + 1),  φ₃(t) = |t|^q,  or  φ₄(t) = (α − (α − |t|)₊²)/α
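The nonsmoothness here comes from the kink of each φ at t = 0. Below is a small sketch (mine, not the talk's) that evaluates the four penalties with numpy; the exact form of φ₄ is my reading of the slide, and the parameter values are arbitrary.

```python
import numpy as np

# Sketch (not from the talk): the separable penalties from the data-fitting slide,
# evaluated elementwise; alpha > 0 and 0 < q <= 1 are penalty parameters.
def phi1(t, alpha=2.0):   # fraction penalty: alpha|t| / (1 + alpha|t|)
    return alpha * np.abs(t) / (1.0 + alpha * np.abs(t))

def phi2(t, alpha=2.0):   # log penalty: log(alpha|t| + 1)
    return np.log(alpha * np.abs(t) + 1.0)

def phi3(t, q=0.5):       # lq penalty: |t|^q, nonsmooth (and nonconvex for q < 1) at t = 0
    return np.abs(t) ** q

def phi4(t, alpha=2.0):   # MCP-like penalty; this form is my reconstruction of the slide's phi_4
    return (alpha - np.maximum(alpha - np.abs(t), 0.0) ** 2) / alpha

t = np.linspace(-3.0, 3.0, 7)
for phi in (phi1, phi2, phi3, phi4):
    print(phi.__name__, np.round(phi(t), 3))   # every phi has a kink at t = 0
```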

Clusterwise linear regression (CLR)
Given a dataset of pairs A := {(a_i, b_i)}_{i=1}^ℓ, the goal of CLR is to simultaneously partition the dataset into k disjoint clusters and find regression coefficients {(x_j, y_j)}_{j=1}^k for each cluster in order to minimize the overall error in the fit; e.g.,
min_{ {(x_j, y_j)} } f_k({x_j, y_j}), where f_k({x_j, y_j}) = Σ_{i=1}^ℓ min_{j ∈ {1,...,k}} |x_jᵀ a_i + y_j − b_i|^p.
This objective is nonconvex (though it is a difference of convex functions).
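The pointwise min over clusters is what makes f_k nonsmooth. A minimal sketch (mine) of evaluating this objective for given coefficients follows; the function name, the random data, and the choice p = 1 are illustrative.

```python
import numpy as np

def clr_objective(X, Y, A, b, p=1):
    """f_k({x_j, y_j}) = sum_i min_j |x_j^T a_i + y_j - b_i|^p.

    X : (k, n) slopes x_j;  Y : (k,) intercepts y_j
    A : (l, n) data a_i;    b : (l,) targets b_i
    """
    residuals = A @ X.T + Y[None, :] - b[:, None]   # (l, k): residual of point i under model j
    return np.sum(np.min(np.abs(residuals) ** p, axis=1))

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))
b = rng.standard_normal(100)
X = rng.standard_normal((2, 3))    # k = 2 candidate linear models
Y = rng.standard_normal(2)
print(clr_objective(X, Y, A, b))   # the pointwise min over models makes this nonsmooth and nonconvex
```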

Decomposition
Various types of decomposition strategies introduce nonsmoothness.
Primal decomposition can be used for min_{(x_1,x_2,y)} f_1(x_1, y) + f_2(x_2, y), where y is the complicating/linking variable; this is equivalent to
min_y φ_1(y) + φ_2(y), where φ_1(y) := min_{x_1} f_1(x_1, y) and φ_2(y) := min_{x_2} f_2(x_2, y).
This master problem may be nonsmooth in y.
Dual decomposition can be used for the same problem, reformulating it as
min_{(x_1,y_1,x_2,y_2)} f_1(x_1, y_1) + f_2(x_2, y_2) s.t. y_1 = y_2.
The Lagrangian is separable, meaning the dual function decomposes:
g_1(λ) = inf_{(x_1,y_1)} (f_1(x_1, y_1) + λᵀy_1) and g_2(λ) = inf_{(x_2,y_2)} (f_2(x_2, y_2) − λᵀy_2).
The dual problem, to maximize g(λ) = g_1(λ) + g_2(λ), may be nonsmooth in λ.

Dual decomposition with constraints
Consider the nearly separable problem
min_{(x_1,...,x_m)} Σ_{i=1}^m f_i(x_i)
s.t. x_i ∈ X_i for all i ∈ {1,...,m} and Σ_{i=1}^m A_i x_i ≤ b (e.g., a shared resource constraint),
where the last are the complicating/linking constraints; dualizing leads to
g(λ) := min_{(x_1,...,x_m)} Σ_{i=1}^m f_i(x_i) + λᵀ(Σ_{i=1}^m A_i x_i − b)
s.t. x_i ∈ X_i for all i ∈ {1,...,m}.
Given λ ≥ 0, the value g(λ) comes from solving separable problems; the dual max_{λ ≥ 0} g(λ) is typically nonsmooth (and people often use poor algorithms!).
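To make "g(λ) comes from solving separable problems" concrete, here is a toy sketch (mine, not from the talk): quadratic block objectives, box sets X_i, and A_i = I, so each block minimization has a closed-form solution and a subgradient of the concave dual is available for free.

```python
import numpy as np

# Toy instance (assumed, not the talk's): f_i(x_i) = 0.5*||x_i - c_i||^2, X_i = [0, 1]^n,
# A_i = I, and a shared-resource constraint sum_i x_i <= b.
def dual_function(lam, C, b):
    X = np.clip(C - lam[None, :], 0.0, 1.0)   # blockwise argmin of 0.5||x_i - c_i||^2 + lam^T x_i over the box
    g = sum(0.5 * np.sum((x - c) ** 2) + lam @ x for x, c in zip(X, C)) - lam @ b
    subgrad = X.sum(axis=0) - b               # a subgradient of the (concave, nonsmooth) dual at lam
    return g, subgrad

rng = np.random.default_rng(1)
C = rng.uniform(0.0, 2.0, size=(3, 4))        # m = 3 blocks, n = 4 variables each
b = np.full(4, 1.5)
val, sg = dual_function(np.ones(4), C, b)
print(val, sg)   # maximizing g over lam >= 0 (e.g., by projected subgradient) solves the dual
```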

Control of dynamical systems
Consider the discrete-time linear dynamical system
y_{k+1} = A y_k + B u_k (state equation),
z_k = C y_k (observation equation).
Supposing we want to design a control such that u_k = X C y_k (where X is our variable), consider the closed-loop system given by
y_{k+1} = A y_k + B u_k = A y_k + B X C y_k = (A + B X C) y_k.
Common objectives are to minimize stability measures ρ(A + B X C), which are often functions of the eigenvalues of A + B X C.

Eigenvalue optimization
(Figure: plots of ordered eigenvalues as the matrix is perturbed along a given direction.)

Other sources of nonsmooth optimization problems
Lagrangian relaxation
Composite optimization (e.g., penalty methods for "soft constraints")
Parametric optimization (e.g., for model predictive control)
Multilevel optimization

Outline: Motivating Examples; Subdifferential Theory; Fundamental Algorithms; Nonconvex Nonsmooth Functions; General Framework

Derivatives
When I teach an optimization class, I always start with the same question: What is a derivative? (f : R → R)
Answer I get: the slope of the tangent line.
(Figure: the tangent line to the graph of f at x has slope f′(x).)

Gradients
Then I ask: What is a gradient? (f : R^n → R)
Answer I get: the direction along which the function increases at the fastest rate.

Derivative vs. gradient
So if a derivative is a magnitude (here, a slope), then why does it generalize in multiple dimensions to something that is a direction?
(n = 1)  f′(x) = df/dx (x)
(n ≥ 1)  ∇f(x) = [∂f/∂x_1 (x), ..., ∂f/∂x_n (x)]ᵀ
What's important? Magnitude? Direction?
Answer: The gradient is a vector in R^n, which
has magnitude (e.g., its 2-norm),
can be viewed as a direction,
and gives us a way to compute directional derivatives.

Differentiable f
How should we think about the gradient? If f is continuously differentiable (i.e., f ∈ C¹), then ∇f(x̄) is the unique vector in the linear (Taylor) approximation of f at x̄, namely f(x̄) + ∇f(x̄)ᵀ(x − x̄).
(Figure: the graph of f and the graph of x ↦ f(x̄) + ∇f(x̄)ᵀ(x − x̄); both are graphs of functions of x!)

Differentiable and convex f
If f ∈ C¹ is convex, then f(x) ≥ f(x̄) + ∇f(x̄)ᵀ(x − x̄) for all (x, x̄) ∈ R^n × R^n.
(Figure: the graph of a convex f lies above its linearization at x̄.)

Graphs and epigraphs
There is another interpretation of a gradient that is also useful. First... What is a graph? A set of points in R^{n+1}, namely, {(x, z) : f(x) = z}.
A related quantity, another set, is the epigraph: {(x, z) : f(x) ≤ z}.
(Figure: the graph {(x, f(x))} and, above it, the epigraph of f.)

Differentiable and convex f
If f ∈ C¹ is convex, then, for all (x, x̄) ∈ R^n × R^n,
f(x) ≥ f(x̄) + ∇f(x̄)ᵀ(x − x̄)
⟺ f(x) − ∇f(x̄)ᵀx ≥ f(x̄) − ∇f(x̄)ᵀx̄
⟺ [−∇f(x̄); 1]ᵀ [x; f(x)] ≥ [−∇f(x̄); 1]ᵀ [x̄; f(x̄)].
Note: Given x̄, the vector [−∇f(x̄); 1] is given, so the inequality above involves a linear function over R^{n+1} and says that the value at any point [x; f(x)] in the graph is at least the value at [x̄; f(x̄)].

Linearization
(Figure: the graph of f and its linearization f(x̄) + ∇f(x̄)ᵀ(x − x̄) at x̄.)

Linearization and supporting hyperplane for epigraph
(Figure: the vector [−∇f(x̄); 1] attached at the point [x̄; f(x̄)] of the graph {(x, f(x))} is normal to a supporting hyperplane of the epigraph, whose boundary is the graph of the linearization f(x̄) + ∇f(x̄)ᵀ(x − x̄).)

Subgradients (convex f)
Why was that useful? We can generalize this idea when the function is not differentiable somewhere.
A vector g ∈ R^n is a subgradient of a convex f : R^n → R at x̄ ∈ R^n if
f(x) ≥ f(x̄) + gᵀ(x − x̄) for all x ∈ R^n
⟺ [−g; 1]ᵀ [x; f(x)] ≥ [−g; 1]ᵀ [x̄; f(x̄)].
(Figure: the vector [−g; 1] attached at [x̄; f(x̄)] is normal to a supporting hyperplane of the epigraph, even at a kink of f.)

Subdifferentials
Theorem. If f is convex and differentiable at x̄, then ∇f(x̄) is its unique subgradient at x̄.
But in general, the set of all subgradients for a convex f at x̄ is the subdifferential of f at x̄:
∂f(x̄) := {g ∈ R^n : g is a subgradient of f at x̄}.
From the definition, it is easily seen that x* is a minimizer of f if and only if 0 ∈ ∂f(x*).

What about nonconvex, nonsmooth?
We need to generalize the idea of a subgradient further: directional derivatives, subgradients, subdifferentials.
Let's return to this after we discuss some algorithms...

Outline: Motivating Examples; Subdifferential Theory; Fundamental Algorithms; Nonconvex Nonsmooth Functions; General Framework

A fundamental iteration
Thinking of −∇f(x_k), we have a vector that directs us in a direction of descent and vanishes as we approach a minimizer.
Algorithm: Gradient Descent
1: Choose an initial point x_0 ∈ R^n and stepsize α ∈ (0, 1/L].
2: for k = 0, 1, 2, ... do
3: if ∇f(x_k) = 0, then return x_k
4: else set x_{k+1} ← x_k − α∇f(x_k)
I call this a fundamental iteration. Here, we suppose ∇f is Lipschitz continuous, i.e., there exists L ≥ 0 such that
‖∇f(x) − ∇f(x̄)‖₂ ≤ L‖x − x̄‖₂ for all (x, x̄) ∈ R^n × R^n
⟹ f(x) ≤ f(x̄) + ∇f(x̄)ᵀ(x − x̄) + (L/2)‖x − x̄‖₂² for all (x, x̄) ∈ R^n × R^n.
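A minimal sketch (mine, not the talk's code) of this iteration on a smooth quadratic, with the fixed stepsize α = 1/L taken from the Lipschitz constant of the gradient:

```python
import numpy as np

def gradient_descent(grad, x0, L, tol=1e-8, max_iter=10_000):
    """The fundamental iteration above with fixed stepsize alpha = 1/L in (0, 1/L]."""
    x, alpha = np.asarray(x0, dtype=float), 1.0 / L
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # stand-in for "if grad f(x_k) = 0, then return x_k"
            break
        x = x - alpha * g              # x_{k+1} <- x_k - alpha * grad f(x_k)
    return x

# Example: f(x) = 0.5 x^T A x - b^T x with A positive definite, so L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = gradient_descent(lambda x: A @ x - b, x0=np.zeros(2), L=np.linalg.eigvalsh(A).max())
print(x_star, np.linalg.solve(A, b))   # the two should agree
```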

Convergence of gradient descent
(Figure: the graph of f near x_k together with the quadratic upper bound f(x_k) + ∇f(x_k)ᵀ(x − x_k) + (L/2)‖x − x_k‖₂²; minimizing this upper bound over x gives the gradient descent step with α = 1/L, which therefore decreases f.)

Gradient descent for f
Theorem. If ∇f is Lipschitz continuous with constant L > 0 and α ∈ (0, 1/L], then
Σ_{j=0}^∞ ‖∇f(x_j)‖₂² < ∞, which implies {∇f(x_j)} → 0.
Proof. Let k ∈ N and recall that x_{k+1} − x_k = −α∇f(x_k). Then, since α ∈ (0, 1/L],
f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)ᵀ(x_{k+1} − x_k) + (L/2)‖x_{k+1} − x_k‖₂²
= f(x_k) − α‖∇f(x_k)‖₂² + (α²L/2)‖∇f(x_k)‖₂²
= f(x_k) − α(1 − αL/2)‖∇f(x_k)‖₂²
≤ f(x_k) − (α/2)‖∇f(x_k)‖₂².
Thus, summing over j ∈ {0,...,k}, one finds
∞ > f(x_0) − f_inf ≥ f(x_0) − f(x_{k+1}) ≥ (α/2) Σ_{j=0}^k ‖∇f(x_j)‖₂².

Strong convexity
Now suppose that f is c-strongly convex, which means that
f(x) ≥ f(x̄) + ∇f(x̄)ᵀ(x − x̄) + (c/2)‖x − x̄‖₂² for all (x, x̄) ∈ R^n × R^n.
Important consequences of this are that f has a unique global minimizer, call it x* with f* := f(x*), and the gradient norm grows with the optimality error in that
2c(f(x) − f*) ≤ ‖∇f(x)‖₂² for all x ∈ R^n.

Strong convexity, lower bound
(Figure: the graph of f near x_k lies above the quadratic lower bound f(x_k) + ∇f(x_k)ᵀ(x − x_k) + (c/2)‖x − x_k‖₂² and below the quadratic upper bound f(x_k) + ∇f(x_k)ᵀ(x − x_k) + (L/2)‖x − x_k‖₂².)

Gradient descent for strongly convex f
Theorem. If ∇f is Lipschitz with constant L > 0, f is c-strongly convex, and α ∈ (0, 1/L], then
f(x_{j+1}) − f* ≤ (1 − αc)^{j+1} (f(x_0) − f*) for all j ∈ N.
Proof. Let k ∈ N. Following the previous proof, one finds
f(x_{k+1}) ≤ f(x_k) − (α/2)‖∇f(x_k)‖₂² ≤ f(x_k) − αc(f(x_k) − f*).
Subtracting f* from both sides, one finds
f(x_{k+1}) − f* ≤ (1 − αc)(f(x_k) − f*).
Applying the result repeatedly over j ∈ {0,...,k} yields the result.
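A quick numerical check (mine) of this rate on a strongly convex quadratic, where c and L are the extreme eigenvalues of the Hessian:

```python
import numpy as np

# On f(x) = 0.5 x^T A x (so f* = 0 at x* = 0), gradient descent with alpha = 1/L satisfies
# f(x_k) - f* <= (1 - alpha*c)^k (f(x_0) - f*), with c, L the smallest/largest eigenvalues of A.
A = np.diag([1.0, 4.0, 10.0])
c, L = 1.0, 10.0
alpha = 1.0 / L
f = lambda x: 0.5 * x @ A @ x
x = np.array([1.0, 1.0, 1.0])
gap0 = f(x)
for k in range(1, 31):
    x = x - alpha * (A @ x)
    assert f(x) <= (1.0 - alpha * c) ** k * gap0 + 1e-12   # the guaranteed linear rate holds
print(f(x))
```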

A fundamental iteration when f is nonsmooth?
What is a fundamental iteration for nonsmooth optimization? Steepest descent! For convex f, the directional derivative of f at x along s is
f′(x; s) = max_{g ∈ ∂f(x)} gᵀs.
Along which direction is f decreasing at the fastest rate? The solution of an optimization problem!
min_{‖s‖₂ ≤ 1} f′(x; s) = min_{‖s‖₂ ≤ 1} max_{g ∈ ∂f(x)} gᵀs
= max_{g ∈ ∂f(x)} min_{‖s‖₂ ≤ 1} gᵀs (von Neumann minimax theorem)
= max_{g ∈ ∂f(x)} (−‖g‖₂)
= −min_{g ∈ ∂f(x)} ‖g‖₂
⟹ need the minimum norm subgradient.

Main challenge
But, typically, we can only access some g ∈ ∂f(x), not all of ∂f(x).
I would argue: there is no practical fundamental iteration for general nonsmooth optimization (no computable descent direction that vanishes near a minimizer). What are our options?
There are a few ways to design a convergent algorithm:
algorithmically (e.g., subgradient method)
iteratively (e.g., cutting plane / bundle methods)
randomly (e.g., gradient sampling)

Subgradient method
Algorithm: Subgradient method (not descent)
1: Choose an initial point x_0 ∈ R^n.
2: for k = 0, 1, 2, ... do
3: if a termination condition is satisfied, then return x_k
4: else compute g_k ∈ ∂f(x_k), choose α_k ∈ R_{>0}, and set x_{k+1} ← x_k − α_k g_k
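A minimal sketch (mine) of this method on f(x) = ‖x‖_1, for which sign(x) is a subgradient; the diminishing stepsize α_k = α/k anticipates the convergence condition on the next slides.

```python
import numpy as np

def subgradient_method(subgrad, f, x0, alpha=1.0, iters=5000):
    """Sketch of the (non-descent) subgradient method with alpha_k = alpha / k."""
    x = np.asarray(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    for k in range(1, iters + 1):
        x = x - (alpha / k) * subgrad(x)   # x_{k+1} <- x_k - alpha_k g_k with g_k in subdiff f(x_k)
        if f(x) < best_f:                  # track the best iterate; individual steps may increase f
            best_x, best_f = x.copy(), f(x)
    return best_x, best_f

# Example: f(x) = ||x||_1, with subgradient sign(x) (taking 0 at the kinks).
x, fx = subgradient_method(np.sign, lambda z: np.abs(z).sum(), x0=np.array([2.0, -3.0, 0.5]))
print(x, fx)   # approaches the minimizer x* = 0, though not monotonically
```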

Why not subgradient descent?
Consider min_{x ∈ R²} f(x), where f(x_1, x_2) := x_1 + x_2 + max{0, x_1² + x_2² − 4}.
At x̄ = (0, −2), we have ∂f(x̄) = conv{[1, 1]ᵀ, [1, −3]ᵀ}, but −[1, 1]ᵀ and −[1, −3]ᵀ are both directions of ascent for f from x̄!
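The point x̄ = (0, −2) and the signs above are my reading of the slide; the following quick check (mine) confirms the claim numerically: stepping along the negative of either extreme subgradient increases f.

```python
import numpy as np

# f(x) = x1 + x2 + max(0, x1^2 + x2^2 - 4); at xbar = (0, -2) the subdifferential is
# conv{(1, 1), (1, -3)}, yet moving along minus either extreme subgradient increases f.
f = lambda x: x[0] + x[1] + max(0.0, x[0] ** 2 + x[1] ** 2 - 4.0)
xbar = np.array([0.0, -2.0])
t = 1e-3
for g in (np.array([1.0, 1.0]), np.array([1.0, -3.0])):
    print(f(xbar - t * g) - f(xbar))   # both differences are positive: -g is an ascent direction
```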

Decreasing the distance to a solution
The objective f is not the only measure of progress. Given an arbitrary subgradient g_k for f at x_k, we have
f(x) ≥ f(x_k) + g_kᵀ(x − x_k) for all x ∈ R^n,  (1)
which means that all points with an objective value lower than f(x_k) lie in
H_k := {x ∈ R^n : g_kᵀ(x − x_k) ≤ 0}.
Thus, a small step along −g_k should decrease the distance to a solution. (Convexity is crucial for this idea.)

Algorithmic convergence
Theorem. If f has a minimizer, ‖g_k‖₂ ≤ G ∈ R_{>0} for all k ∈ N, and the stepsizes satisfy
Σ_{k=1}^∞ α_k = ∞ and Σ_{k=1}^∞ α_k² < ∞,  (2)
then lim_{k→∞} { min_{j ∈ {0,...,k}} f_j } = f*.
An example sequence satisfying (2) is α_k = α/k for k = 1, 2, ...

Proof that lim_{k→∞} { min_{j ∈ {0,...,k}} f_j } = f*, part 1.
Let k ∈ N. By (1), the iterates satisfy
‖x_{k+1} − x*‖₂² = ‖x_k − α_k g_k − x*‖₂²
= ‖x_k − x*‖₂² − 2α_k g_kᵀ(x_k − x*) + α_k²‖g_k‖₂²
≤ ‖x_k − x*‖₂² − 2α_k(f_k − f*) + α_k²‖g_k‖₂².
Applying this inequality recursively, we have
0 ≤ ‖x_{k+1} − x*‖₂² ≤ ‖x_0 − x*‖₂² − 2 Σ_{j=0}^k α_j(f_j − f*) + Σ_{j=0}^k α_j²‖g_j‖₂²,
which implies that
2 Σ_{j=0}^k α_j(f_j − f*) ≤ ‖x_0 − x*‖₂² + Σ_{j=0}^k α_j²‖g_j‖₂²
⟹ min_{j ∈ {0,...,k}} f_j − f* ≤ (‖x_0 − x*‖₂² + G² Σ_{j=0}^k α_j²) / (2 Σ_{j=0}^k α_j).  (3)

Proof, part 2.
Now consider an arbitrary scalar ε > 0. By (2), there exists a nonnegative integer K such that, for all k > K,
α_k ≤ ε/G² and Σ_{j=0}^k α_j ≥ (1/ε)(‖x_0 − x*‖₂² + G² Σ_{j=0}^K α_j²).
Then, by (3), it follows that for all k > K we have
min_{j ∈ {0,...,k}} f_j − f* ≤ (‖x_0 − x*‖₂² + G² Σ_{j=0}^K α_j²) / (2 Σ_{j=0}^k α_j) + (G² Σ_{j=K+1}^k α_j²) / (2 Σ_{j=0}^k α_j)
≤ ε/2 + (ε Σ_{j=K+1}^k α_j) / (2 Σ_{j=0}^k α_j)
≤ ε/2 + ε/2 = ε,
where the first bound uses the lower bound on Σ_{j=0}^k α_j and the second uses α_j² ≤ (ε/G²)α_j for j > K.
The result follows since ε > 0 was chosen arbitrarily.

Cutting plane method
Subgradient methods lose previously computed information in every iteration. Suppose, after a sequence of iterates, we have the affine underestimators
f_i(x) = f(x_i) + g_iᵀ(x − x_i) for all i ∈ {0,...,k}.
(Figure: the cuts f(x_0) + g_0ᵀ(x − x_0) and f(x_1) + g_1ᵀ(x − x_1) underestimate f; their pointwise max is minimized at x_2.)
At iteration k, we can compute the next iterate by solving the master problem
x_{k+1} ∈ arg min_{x ∈ X} f̂_k(x), where f̂_k(x) := max_{i ∈ {0,...,k}} (f(x_i) + g_iᵀ(x − x_i)).
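A sketch (mine, not the talk's) of this iteration in one dimension, with X an interval and the master problem solved as a small linear program via scipy:

```python
import numpy as np
from scipy.optimize import linprog

# Cutting planes for f(x) = |x - 1| + 0.2 x^2 on X = [-4, 4] (an illustrative choice).
f = lambda x: abs(x - 1.0) + 0.2 * x * x
g = lambda x: np.sign(x - 1.0) + 0.4 * x       # a subgradient of f at x

lo, hi = -4.0, 4.0
xs = [lo]                                      # x_0
for k in range(20):
    # Master problem over z = (x, v): min v  s.t.  f(x_i) + g(x_i)(x - x_i) <= v for each cut, x in X.
    A_ub = [[g(xi), -1.0] for xi in xs]
    b_ub = [g(xi) * xi - f(xi) for xi in xs]
    res = linprog(c=[0.0, 1.0], A_ub=A_ub, b_ub=b_ub, bounds=[(lo, hi), (None, None)])
    x_next, v_next = res.x
    if f(x_next) - v_next <= 1e-8:             # lower bound v_{k+1} meets f(x_{k+1}): optimal
        break
    xs.append(x_next)

print(x_next, f(x_next))                       # approaches the minimizer at x = 1
```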

Cutting plane method convergence
The iterates of the cutting plane method yield lower bounds on the optimal value:
v_{k+1} := min_{x ∈ X} f̂_k(x) ≤ min_{x ∈ X} f(x) =: f*.
Therefore, if v_{k+1} = f(x_{k+1}), then we terminate since f(x_{k+1}) = f*.
If f is piecewise linear, then convergence occurs in finitely many iterations!
(Figure: cuts for a piecewise linear f.)
However, in general, we have the following theorem.
Theorem. The cutting plane method yields {x_k} satisfying {f(x_k)} → f*.

Bundle method
A bundle method attempts to combine the practical advantages of a cutting plane method with the theoretical strengths of a proximal point method. Given x_k, consider the regularized master problem
min_{x ∈ R^n} f̂_k(x) + (γ/2)‖x − x_k‖₂², where f̂_k(x) := max_{i ∈ I_k} (f(x_i) + g_iᵀ(x − x_i)).
Here, I_k ⊆ {1,...,k−1} indicates a subset of previous iterations. This problem is equivalent to the quadratic optimization problem
min_{(x,v) ∈ R^n × R} v + (γ/2)‖x − x_k‖₂²
s.t. f(x_i) + g_iᵀ(x − x_i) ≤ v for all i ∈ I_k.
Only move to a new point when a sufficient decrease is obtained.
Convergence rate analyses are limited; O((1/ε) log(1/ε)) for strongly convex f.

Bundle method convergence
The analysis makes use of the Moreau–Yosida regularization
f_γ(x̄) = min_{x ∈ R^n} ( f(x) + (γ/2)‖x − x̄‖₂² ).
Theorem. If x_k is not a minimizer, then f_γ(x_k) < f(x_k).

Bundle method convergence
Theorem. For all (k, j) ∈ N × N in a bundle method,
v_{k,j} + γ‖x_{k,j} − x_k‖₂² ≤ f_γ(x_k) < f(x_k).

Outline: Motivating Examples; Subdifferential Theory; Fundamental Algorithms; Nonconvex Nonsmooth Functions; General Framework

Clarke subdifferential
What if f is nonconvex and nonsmooth? What are subgradients? We still need some structure; we assume f is locally Lipschitz and f is differentiable on a full measure set D.
The Clarke subdifferential of f at x is
∂f(x) = conv{ lim_{j→∞} ∇f(x_j) : x_j → x and x_j ∈ D },
i.e., the convex hull of limits of gradients of f at points in D converging to x.
Theorem. If f is continuously differentiable at x, then ∂f(x) = {∇f(x)}.

Differentiable, but nonsmooth
Theorem. If f is differentiable at x, then {∇f(x)} ⊆ ∂f(x) (not necessarily equal).
Considering
f(x) = x² cos(1/x) if x ≠ 0, f(x) = 0 if x = 0,
one finds that f′(0) = 0 yet [−1, 1] ⊆ ∂f(0).
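A quick numerical illustration (mine) of why [−1, 1] ⊆ ∂f(0) here: away from zero, f′(x) = 2x cos(1/x) + sin(1/x), and its values approach +1 and −1 along points converging to 0.

```python
import numpy as np

# f(x) = x^2 cos(1/x), f(0) = 0: the difference quotients at 0 vanish (f'(0) = 0), but the
# derivative f'(x) = 2x cos(1/x) + sin(1/x) at nearby points has limit points filling [-1, 1].
fprime = lambda x: 2.0 * x * np.cos(1.0 / x) + np.sin(1.0 / x)
k = np.arange(1, 6)
print(fprime(1.0 / (2.0 * np.pi * k + np.pi / 2)))   # values tend to +1
print(fprime(1.0 / (2.0 * np.pi * k - np.pi / 2)))   # values tend to -1
```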

Clarke ε-subdifferential and gradient sampling
As before, we typically cannot compute ∂f(x). It is approximated by the Clarke ε-subdifferential, namely,
∂_ε f(x) = conv{ ∂f(B(x, ε)) },
which in turn can be approximated as in
∂_ε f(x) ≈ conv{ ∇f(x_k), ∇f(x_{k,1}), ..., ∇f(x_{k,m}) }, where {x_{k,1},...,x_{k,m}} ⊂ B(x_k, ε).
In gradient sampling, we compute the minimum norm element in
conv{ ∇f(x_k), ∇f(x_{k,1}), ..., ∇f(x_{k,m}) },
which is equivalent to solving
min_{(x,v) ∈ R^n × R} v + (1/2)‖x − x_k‖₂²
s.t. f(x_k) + ∇f(x_{k,i})ᵀ(x − x_k) ≤ v for all i ∈ {1,...,m}.
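A sketch (mine, not the talk's) of the minimum norm element computation: sample gradients in B(x_k, ε) and minimize ‖Gω‖₂² over the unit simplex, here with scipy's SLSQP; −Gω* then serves as the gradient-sampling search direction.

```python
import numpy as np
from scipy.optimize import minimize

def min_norm_element(G):
    """Min-norm element of conv{columns of G}: minimize ||G w||^2 over the unit simplex."""
    m = G.shape[1]
    obj = lambda w: float(np.sum((G @ w) ** 2))
    res = minimize(obj, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=[(0.0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return G @ res.x

# Example (illustrative): gradients of f(x) = ||x||_1 + 0.5*||x||_2^2 sampled near a kink.
grad = lambda x: np.sign(x) + x
rng = np.random.default_rng(0)
x_k, eps, m = np.array([1e-3, 1.0]), 0.1, 10
points = [x_k] + [x_k + eps * rng.uniform(-1.0, 1.0, size=2) for _ in range(m)]
G = np.column_stack([grad(z) for z in points])
g = min_norm_element(G)
print(g)   # -g is the search direction; its small first component reflects the nearby kink
```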

Outline: Motivating Examples; Subdifferential Theory; Fundamental Algorithms; Nonconvex Nonsmooth Functions; General Framework

Popular and effective method
Despite all I've talked about, a very effective method: BFGS.
Approximate second-order information with gradient displacements (figure: consecutive iterates x_k and x_{k+1} on the graph of f).
Secant equation H_{k+1} s_k = y_k (equivalently W_{k+1} y_k = s_k) to match the gradient of f at x_k, where s_k := x_{k+1} − x_k and y_k := ∇f(x_{k+1}) − ∇f(x_k).

BFGS-type updates
Inverse Hessian and Hessian approximation updating formulas (requiring s_kᵀv_k > 0):
W_{k+1} ← (I − (v_k s_kᵀ)/(s_kᵀv_k))ᵀ W_k (I − (v_k s_kᵀ)/(s_kᵀv_k)) + (s_k s_kᵀ)/(s_kᵀv_k)
H_{k+1} ← (I − (s_k s_kᵀ H_k)/(s_kᵀH_k s_k))ᵀ H_k (I − (s_k s_kᵀ H_k)/(s_kᵀH_k s_k)) + (v_k v_kᵀ)/(s_kᵀv_k)
With an appropriate technique for choosing v_k, we attain self-correcting properties for {H_k} and {W_k}: (inverse) Hessian approximations that can be used in other algorithms.
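A sketch (mine) of the inverse update above as a function; here v_k is simply taken to be y_k, the standard BFGS choice, whereas the self-correcting variant chooses v_k differently (details not given on the slide). The check confirms the secant equation W_{k+1} v_k = s_k.

```python
import numpy as np

def bfgs_inverse_update(W, s, v):
    """W_{k+1} = (I - v s^T/(s^T v))^T W (I - v s^T/(s^T v)) + s s^T/(s^T v), valid when s^T v > 0."""
    rho = 1.0 / (s @ v)
    E = np.eye(len(s)) - rho * np.outer(v, s)
    return E.T @ W @ E + rho * np.outer(s, s)

rng = np.random.default_rng(0)
W = np.eye(3)
s = rng.standard_normal(3)
v = s + 0.1 * rng.standard_normal(3)   # plays the role of y_k = grad f(x_{k+1}) - grad f(x_k)
W_new = bfgs_inverse_update(W, s, v)
print(np.allclose(W_new @ v, s))       # True: the updated inverse satisfies the secant equation
```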

Subproblems in nonsmooth optimization algorithms
With sets of points, scalars, and (sub)gradients {x_{k,j}}_{j=1}^m, {f_{k,j}}_{j=1}^m, {g_{k,j}}_{j=1}^m, nonsmooth optimization methods involve the primal subproblem
min_{x ∈ R^n} max_{j ∈ {1,...,m}} {f_{k,j} + g_{k,j}ᵀ(x − x_{k,j})} + (1/2)(x − x_k)ᵀ H_k (x − x_k)
s.t. ‖x − x_k‖ ≤ δ_k,  (P)
but, with G_k ← [g_{k,1} ··· g_{k,m}], it is typically more efficient to solve the dual
sup_{(ω,γ) ∈ R^m_+ × R^n} −(1/2)(G_k ω + γ)ᵀ W_k (G_k ω + γ) + b_kᵀω − δ_k‖γ‖
s.t. 1_mᵀω = 1.  (D)
The primal solution can then be recovered by x_k − W_k g_k, where g_k := G_k ω_k + γ_k.

Algorithm: Self-Correcting Variable-Metric Algorithm for Nonsmooth Optimization
1: Choose x_1 ∈ R^n.
2: Choose a symmetric positive definite W_1 ∈ R^{n×n}.
3: Choose α ∈ (0, 1).
4: for k = 1, 2, ... do
5: Solve (P)–(D) such that setting G_k ← [g_{k,1} ··· g_{k,m}], s_k ← −W_k(G_k ω_k + γ_k), and x_{k+1} ← x_k + s_k
6: yields f(x_{k+1}) ≤ f(x_k) − (1/2)α(G_k ω_k + γ_k)ᵀ W_k (G_k ω_k + γ_k).
7: Choose v_k (details omitted, but very simple).
8: Set W_{k+1} ← (I − (v_k s_kᵀ)/(s_kᵀv_k))ᵀ W_k (I − (v_k s_kᵀ)/(s_kᵀv_k)) + (s_k s_kᵀ)/(s_kᵀv_k).

Instances of the framework
Cutting plane / bundle methods: points added incrementally until sufficient decrease obtained; finite number of additions until an accepted step.
Gradient sampling methods: points added randomly / incrementally until sufficient decrease obtained; sufficient number of iterations with good steps.
In any case: convergence guarantees require {W_k} to be uniformly positive definite and bounded on a sufficient number of accepted steps.

C++ implementation: NonOpt
Results on the test problems maxq, mxhilb, chained lq, chained cb3 1, chained cb3 2, active faces, brown function 2, chained mifflin 2, chained crescent 1, and chained crescent 2, reporting exit status, ε_end, f(x_end), #iter, #func, #grad, and #subs (numerical columns not reproduced here).
BFGS with weak Wolfe line search: reaches the Stationary exit on maxq; on the remaining problems it terminates with the Stepsize exit.
Bundle method with self-correcting properties: reaches the Stationary exit on all of the problems.

Thanks!
NonOpt coming soon... Andreas could finish in a day what has taken me 6 months on sabbatical, so it'll be done when he has a free day ;-)
Thanks for listening!
