Nonsmooth optimization: beyond first-order methods. A tutorial focusing on bundle methods


1 Nonsmooth optimization: beyond first-order methods. A tutorial focusing on bundle methods. Claudia Sagastizábal (IMECC-UniCamp, Campinas, Brazil, adjunct researcher). SESO 2018, Paris, May 23 and 25, 2018

2 Computational NSO: what do we mean? For the unconstrained problem min f(x), where f : IRⁿ → IR is convex but not differentiable at some points, algorithms are defined according to how much information is provided by a certain oracle

3 Computational NSO: what do we mean? For the unconstrained problem min f(x), where f : IRⁿ → IR is convex but not differentiable at some points, algorithms are defined according to how much information is provided by a certain oracle: an informative oracle, x ↦ f(x) and the full ∂f(x)

4 Computational NSO: what do we mean? For the unconstrained problem min f(x), where f : IRⁿ → IR is convex but not differentiable at some points, algorithms are defined according to how much information is provided by a certain oracle: a black box, x ↦ f(x) and ONE g(x) ∈ ∂f(x)

5 Computational NSO: what do we mean? For the unconstrained problem min f(x), where f : IRⁿ → IR is convex but not differentiable at some points, algorithms are defined according to how much information is provided by a certain oracle: a black box, x ↦ f(x) and ONE g(x) ∈ ∂f(x). How common are nonsmooth objective functions in optimization?
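
To make the black-box setting concrete, here is a minimal Python sketch (my own illustration, not part of the tutorial) of such an oracle for f(x) = ‖x‖₁: it returns the function value and ONE subgradient, not the whole subdifferential.

```python
import numpy as np

def l1_oracle(x):
    """Black-box oracle for f(x) = ||x||_1.

    Returns f(x) and ONE subgradient g(x): g_i = sign(x_i) for x_i != 0,
    and g_i = 0 at the kinks x_i = 0 (any value in [-1, 1] would do).
    """
    x = np.asarray(x, dtype=float)
    f_val = np.abs(x).sum()
    g = np.sign(x)          # one valid subgradient, not the whole set
    return f_val, g

# At a kink the oracle still answers, but it cannot reveal the full
# subdifferential [-1, 1] of |.| at 0.
f0, g0 = l1_oracle([0.0, -2.0, 3.0])   # f0 = 5.0, g0 = [0., -1., 1.]
```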

6 When does nonsmoothness appear? * if the nature of the problem imposes a nonsmooth model; or * if sparsity of the solution is a concern; or * in problems that are difficult to solve, because they are large scale or because they are heterogeneous; sometimes the solution method itself induces nonsmoothness

7 Example of NS model: recovery of blocky images (ℓ₁-regularization of TV)

8 Example of sparse optimization min{ ‖x‖₁ : Ax = b }. Basis pursuit: find the least 1-norm point on the affine plane. Tends to return a sparse point (sometimes, the sparsest)

9 Example of sparse optimization min{ ‖x‖₁ : Ax = b }. Basis pursuit: find the least 1-norm point on the affine plane. Tends to return a sparse point (sometimes, the sparsest). LASSO denoises basis pursuit: min{ ‖Ax − b‖₂² : ‖x‖₁ ≤ τ }  or  min{ ‖x‖₁ + (µ/2)‖Ax − b‖₂² }  or  min{ ‖x‖₁ : ‖Ax − b‖₂² ≤ σ }

10 Example of sparse optimization min{ ‖x‖₁ : h(x) ≤ b }. Basis pursuit: find the least 1-norm point on a nonlinear set. Tends to return a sparse point (sometimes, the sparsest). LASSO denoises basis pursuit: min{ ‖(h(x) − b)₊‖₂² : ‖x‖₁ ≤ τ }  or  min{ ‖x‖₁ + (µ/2)‖(h(x) − b)₊‖₂² }  or  min{ ‖x‖₁ : ‖(h(x) − b)₊‖₂² ≤ σ }

11 Lagrangian Relaxation Example. Real-life optimization problems (primal): min ∑_{j∈J} C_j(p_j) s.t. for j ∈ J: p_j ∈ P_j, and ∑_{j∈J} g_j(p_j) = Dem (coupling constraint, with multiplier x)

12 Lagrangian Relaxation Example. Real-life optimization problems (primal): max ∑_{j∈J} C_j(p_j) s.t. for j ∈ J: p_j ∈ P_j, and ∑_{j∈J} g_j(p_j) = Dem (coupling constraint, with multiplier x) often exhibit separable structure after dualization

13 Lagrangian Relaxation Example. Real-life optimization problems (primal): max ∑_{j∈J} C_j(p_j) s.t. for j ∈ J: p_j ∈ P_j, and ∑_{j∈J} g_j(p_j) = Dem (coupling constraint, with multiplier x) often exhibit separable structure passing to the (dual):

14 Lagrangian Relaxation Example. Real-life optimization problems (primal): max ∑_{j∈J} C_j(p_j) s.t. for j ∈ J: p_j ∈ P_j, and ∑_{j∈J} g_j(p_j) = Dem (coupling constraint, with multiplier x) often exhibit separable structure passing to the (dual): min_x f(x) := f₀(x) + ∑_{j∈J} f_j(x), with f₀(x) := −⟨x, Dem⟩ and f_j(x) := max_{p_j∈P_j} { C_j(p_j) + ⟨x, g_j(p_j)⟩ }

16 Benders Decomposition Example. Similar situation, but now the uncoupling is done at the primal level. (primal) min ∑_{j∈J} [ I_j(p̄_j) + C_j(p_j) ] s.t. for j ∈ J: p_j ∈ P_j, p_j ≤ p̲_j + p̄_j, and p̄ ∈ D

17 Benders Decomposition Example. Similar situation, but now the uncoupling is done at the primal level. (primal) min_{p̄∈D} ∑_{j∈J} [ I_j(p̄_j) + V_j(p̄_j) ], with V_j(p̄_j) := min { C_j(p_j) : p_j ∈ P_j, p_j ≤ p̲_j + p̄_j }, i.e. min f(x) := ∑_{j∈J} f_j(p̄_j) for f_j(p̄_j) :=

18 Benders Decomposition Example. Similar situation, but now the uncoupling is done at the primal level. (primal) min_{p̄∈D} ∑_{j∈J} [ I_j(p̄_j) + V_j(p̄_j) ], with V_j(p̄_j) := min { C_j(p_j) : p_j ∈ P_j, p_j ≤ p̲_j + p̄_j }, i.e. min f(x) := ∑_{j∈J} f_j(p̄_j) for f_j(p̄_j) := I_j(p̄_j) + V_j(p̄_j)

19 Computing f(x^k): how difficult is it? 1. f(x) = |x|, for n = 1. 2. A linear Lasso function, f(x) = ‖x‖₁ + (µ/2)‖Ax − b‖₂². 3. A nonlinear Lasso function, h ∈ C¹, f(x) = ‖x‖₁ + (µ/2)‖(h(x) − b)₊‖₂². 4. One of the local subproblems in the Lagrangian example, f_j(x^k) := max_{p_j∈P_j} { C_j(p_j) + ⟨x^k, g_j(p_j)⟩ }. 5. One of the local subproblems in the Benders example, f_j(x_{k,j}) = I_j(x_{k,j}) + V_j(x_{k,j}), with V_j(x_{k,j}) = min { C_j(p_j) : p_j ≤ p̲_j + x_{k,j} }

20 But why would one want ALL of ∂f(x^k)?

21 But why would one want ALL of ∂f(x^k)? Indispensable to calculate the proximal point p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂² ⇔ 0 ∈ ∂f(p) + (1/t)(p − x) ⇔ (1/t)(x − p) ∈ ∂f(p)

22 But why would one want ALL of ∂f(x^k)? Indispensable to calculate the proximal point p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂² ⇔ 0 ∈ ∂f(p) + (1/t)(p − x) ⇔ (1/t)(x − p) ∈ ∂f(p)

23 But why would one want ALL of ∂f(x^k)? Indispensable to calculate the proximal point p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂² ⇔ 0 ∈ ∂f(p) + (1/t)(p − x) ⇔ (1/t)(x − p) ∈ ∂f(p)

24 But why would one want ALL of ∂f(x^k)? Indispensable to calculate the proximal point p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂² ⇔ 0 ∈ ∂f(p) + (1/t)(p − x) ⇔ (1/t)(x − p) ∈ ∂f(p). Without full knowledge of the subdifferential, the implicit inclusion cannot be solved!

25 But why would one want ALL of ∂f(x^k)? Indispensable to calculate the proximal point p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂² ⇔ 0 ∈ ∂f(p) + (1/t)(p − x) ⇔ (1/t)(x − p) ∈ ∂f(p). Without full knowledge of the subdifferential, the implicit inclusion cannot be solved! Note: p ∈ x − t∂f(p), akin to a subgradient method
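
As a concrete illustration (mine, not from the slides): for f = ‖·‖₁ the proximal point has the well-known soft-thresholding closed form, and the inclusion (1/t)(x − p) ∈ ∂f(p) can be checked componentwise.

```python
import numpy as np

def prox_l1(x, t):
    """Proximal point of f = ||.||_1: argmin_y ||y||_1 + (1/2t)||y - x||^2.
    Closed form: componentwise soft-thresholding with threshold t."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([2.5, -0.3, 0.0, 1.0])
t = 0.7
p = prox_l1(x, t)
g = (x - p) / t          # should be a subgradient of ||.||_1 at p

# check the inclusion (1/t)(x - p) in the subdifferential of ||.||_1 at p:
# |g_i| <= 1 everywhere, and g_i = sign(p_i) wherever p_i != 0
assert np.all(np.abs(g) <= 1 + 1e-12)
assert np.allclose(g[p != 0], np.sign(p[p != 0]))
print(p, g)   # p = [1.8, 0., 0., 0.3], g ~ [1., -0.43, 0., 1.]
```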

26 Proximal point algorithms (Accel. Nesterov, FISTA, AugLag): x^{k+1} = prox_{t_k}^f(x^k), i.e. x^{k+1} = argmin_y f(y) + (1/2t_k)‖y − x^k‖₂²

27 Proximal point algorithms (Accel. Nesterov, FISTA, AugLag): x^{k+1} = prox_{t_k}^f(x^k), i.e. x^{k+1} = argmin_y f(y) + (1/2t_k)‖y − x^k‖₂². Of interest if computing prox_{t_k}^f(x^k) is much easier than minimizing f. The stepsize t_k > 0 impacts the number of iterations

29 Proximal point: calculus rules. Separable sum: f(x,y) = g(x) + h(y) ⇒ prox_t^f(x,y) = ( prox_t^g(x), prox_t^h(y) ). Scalar factor (α ≠ 0) and translation (v ≠ 0): f(x) = g(αx + v) ⇒ prox_t^f(x) = α⁻¹( prox_t^{α²g}(αx + v) − v ). "Perspective" (α > 0): f(x) = α g(α⁻¹x) ⇒ prox_t^f(x) = α prox_t^{g/α}(α⁻¹x)

30 Proximal point: special functions. Linear term (v ≠ 0): f(x) = g(x) + ⟨v, x⟩ ⇒ prox_t^f(x) = prox_t^g(x − tv). Convex quadratic term: f(x) = g(x) + (1/2)‖x − v‖² ⇒ prox_t^f(x) = prox_t^{λg}(λx + (1 − λ)v) for λ = 1/(t + 1). Composition with a linear map such that AAᵀ = α⁻¹I (α ≠ 0): f(x) = g(Ax + v) ⇒ prox_t^f(x) = (I − αAᵀA)x + αAᵀ[ prox_t^{g/α}(Ax + v) − v ]
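
A quick numerical sanity check (my own) of the linear-term rule above for g = |·| in one dimension, with the prox defined as argmin_y f(y) + (1/2t)(y − x)²; the grid-search minimizer and the rule should agree.

```python
import numpy as np

# Sketch (mine, not from the slides): check numerically the rule
#   f(x) = g(x) + <v, x>  =>  prox_t^f(x) = prox_t^g(x - t v)
# for g = |.| in one dimension.

def prox_abs(x, t):
    return np.sign(x) * max(abs(x) - t, 0.0)

x, v, t = 1.3, 0.4, 0.5

# brute-force minimization of |y| + v*y + (1/2t)(y - x)^2 on a fine grid
ys = np.linspace(-5, 5, 2_000_001)
vals = np.abs(ys) + v * ys + (ys - x) ** 2 / (2 * t)
p_grid = ys[np.argmin(vals)]

p_rule = prox_abs(x - t * v, t)          # the calculus rule
print(p_grid, p_rule)                    # both ~ 0.6
assert abs(p_grid - p_rule) < 1e-4
```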

31 Proximal point algorithm: convergence. If argmin f ≠ ∅, then f(x^k) − f(x̄) ≤ ‖x⁰ − x̄‖₂² / ∑_{i=1}^k t_i

32 Proximal point algorithm: convergence. If argmin f ≠ ∅, then f(x^k) − f(x̄) ≤ ‖x⁰ − x̄‖₂² / ∑_{i=1}^k t_i ⇒ convergence if ∑ t_i = +∞ ⇒ rate 1/k if {t_k} is bounded away from zero

33 Proximal point algorithm: acceleration. x^{k+1} = prox_{t_k}^f( x^k + θ_{k+1}(θ_k⁻¹ − 1)(x^k − x^{k−1}) ) for θ²_{k+1} t_{k+1} = (1 − θ_{k+1}) θ²_k t_k

34 Proximal point algorithm: acceleration. x^{k+1} = prox_{t_k}^f( x^k + θ_{k+1}(θ_k⁻¹ − 1)(x^k − x^{k−1}) ) for θ²_{k+1} t_{k+1} = (1 − θ_{k+1}) θ²_k t_k ⇒ convergence if ∑ t_i = +∞ ⇒ rate 1/k² if {t_k} is bounded away from zero

35 What if prox_t^f is not computable?

36 What if prox_t^f is not computable? Use bundle methods!

37 What if prox_t^f is not computable? Use bundle methods! When do bundle methods prove most useful?

38 What if prox_t^f is not computable? Use bundle methods! When do bundle methods prove most useful? In situations when the objective function is not available explicitly, and/or when we do not have access to the full subdifferential, and/or when calculations need to be done with high precision

39 Bundling to approximate the prox. WANT: p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂² ⇔ 0 ∈ ∂f(p) + (1/t)(p − x) ⇔ (1/t)(x − p) ∈ ∂f(p)

40 Bundling to approximate the prox. WANT: p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂² ⇔ 0 ∈ ∂f(p) + (1/t)(p − x) ⇔ (1/t)(x − p) ∈ ∂f(p). HAVE: q = prox_t^M(x): q = argmin_y M(y) + (1/2t)‖y − x‖₂² ⇔ 0 ∈ ∂M(q) + (1/t)(q − x) ⇔ (1/t)(x − q) ∈ ∂M(q)

41 Bundling to approximate the prox. WANT: p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂² ⇔ (1/t)(x − p) ∈ ∂f(p). HAVE: q = prox_t^M(x): q = argmin_y M(y) + (1/2t)‖y − x‖₂² ⇔ (1/t)(x − q) ∈ ∂M(q). M is a model of f for which we do have full knowledge of the subdifferential: the implicit inclusion can be solved!

43 How is the model built?

44 Model built with the black box

45 A quick overview of Convex Analysis An example of a convex nonsmooth function

46 A quick overview of Convex Analysis An example of a convex nonsmooth function

47 A quick overview of Convex Analysis. An example of a convex nonsmooth function. {∇f(x)} = {slope of the linearization supporting f, tangent at x}

48 A quick overview of Convex Analysis. An example of a convex nonsmooth function. {∇f(x)} = {slope of the linearization supporting f, tangent at x}. By convexity, f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for all y

53 A quick overview of Convex Analysis. An example of a convex nonsmooth function. ∂f(x) = {g ∈ IRⁿ : f(y) ≥ f(x) + ⟨g, y − x⟩ for all y} = {slopes of linearizations supporting f, tangent at x}

54 A quick overview of Convex Analysis. An example of a convex nonsmooth function. ∂f(x) = {g ∈ IRⁿ : f(y) ≥ f(x) + ⟨g, y − x⟩ for all y} = {slopes of linearizations supporting f, tangent at x}

55 What can be done with the oracle output? An example of a convex nonsmooth function. ∂f(x) = {g ∈ IRⁿ : f(y) ≥ f(x) + ⟨g, y − x⟩ for all y} = {slopes of linearizations supporting f, tangent at x}

56 What can be done with the oracle output? An example of a convex nonsmooth function. ∂f(x) = {g ∈ IRⁿ : f(y) ≥ f(x) + ⟨g, y − x⟩ for all y} = {slopes of linearizations supporting f, tangent at x}. 1 oracle call, x^k ↦ f(x^k), g(x^k) ∈ ∂f(x^k), = 1 linearization

57 BEWARE: x^k ↦ f(x^k), g(x^k) ∈ ∂f(x^k): if the oracle output is not accurate, the linearization can be wrong! A wrong g(x^k) gives a bad linearization at x^k, since ∂f(x) = {g ∈ IRⁿ : f(y) ≥ f(x) + ⟨g, y − x⟩ for all y} (similarly if f(x^k) is wrong, more on this later)

58 How is the oracle information used? Putting together linearizations creates a cutting-plane model M for f: x^i ↦ f_i = f(x^i), g_i = g(x^i) ∈ ∂f(x^i) ⇒ M(y) = max_i { f_i + ⟨g_i, y − x^i⟩ }

59 How is the oracle information used? Putting together linearizations creates a cutting-plane model M for f: x^i ↦ f_i = f(x^i), g_i = g(x^i) ∈ ∂f(x^i) ⇒ M(y) = max_i { f_i + ⟨g_i, y − x^i⟩ }

60 How is the oracle information used? Putting together linearizations creates a cutting-plane model M for f: x^i ↦ f_i = f(x^i), g_i = g(x^i) ∈ ∂f(x^i) ⇒ M(y) = max_i { f_i + ⟨g_i, y − x^i⟩ }

61 How is the oracle information used? Putting together linearizations creates a cutting-plane model M for f: x^i ↦ f_i = f(x^i), g_i = g(x^i) ∈ ∂f(x^i) ⇒ M(y) = max_i { f_i + ⟨g_i, y − x^i⟩ } (just one type of model, many others are possible)

62 How is the oracle information used? Putting together linearizations creates a cutting-plane model M for f: x^i ↦ f_i = f(x^i), g_i = g(x^i) ∈ ∂f(x^i) ⇒ M_k(y) = max_{i≤k} { f_i + ⟨g_i, y − x^i⟩ } (just one type of model, many others are possible)
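
As an illustration (my own sketch, not the tutorial's code), building and evaluating such a cutting-plane model from black-box calls, here for f(x) = ‖x‖₁; the class name CuttingPlaneModel is mine.

```python
import numpy as np

def l1_oracle(x):
    x = np.asarray(x, dtype=float)
    return np.abs(x).sum(), np.sign(x)

class CuttingPlaneModel:
    """M_k(y) = max_i { f_i + <g_i, y - x_i> }, built from oracle calls."""
    def __init__(self):
        self.cuts = []                      # list of (f_i, g_i, x_i)

    def add_point(self, x, oracle):
        f_i, g_i = oracle(x)
        self.cuts.append((f_i, np.asarray(g_i, float), np.asarray(x, float)))

    def __call__(self, y):
        y = np.asarray(y, dtype=float)
        return max(f_i + g_i @ (y - x_i) for f_i, g_i, x_i in self.cuts)

M = CuttingPlaneModel()
for pt in [[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0]]:
    M.add_point(pt, l1_oracle)

y = np.array([0.2, -0.3])
print(M(y), l1_oracle(y)[0])   # M(y) <= f(y): the model lies below f (convexity)
```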

63 Infinite bundling yields prox_t^f. WANT: p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂². HAVE: q^k = prox_{t_k}^{M_k}(x): q^k = argmin_y M_k(y) + (1/2t_k)‖y − x‖₂², i.e. 0 = G^k + (1/t_k)(q^k − x) for G^k ∈ ∂M_k(q^k). Theorem [CL93]: suppose the models satisfy M_k(y) ≤ f(y) for all k and y, M_{k+1}(y) ≥ f(q^k) + ⟨g(q^k), y − q^k⟩, and M_{k+1}(y) ≥ M_k(q^k) + ⟨G^k, y − q^k⟩. If 0 < t_min ≤ t_{k+1} ≤ t_k, then lim_k q^k = p and lim_k M_k(q^k) = f(p)

65 Models for the half-and-half function. STRUCTURE none: f(x) = √(xᵀAx) + xᵀBx; sum: f₁(x) + f₂(x), with f₁(x) = √(xᵀAx) and f₂(x) = xᵀBx (f₂ is smooth); composition: (h∘c)(x), with c(x) = (x, xᵀBx) ∈ IRⁿ⁺¹ (c is smooth) and h(C) = √(C_{1:n}ᵀ A C_{1:n}) + C_{n+1} (h is sublinear)

67 Models for the half-and-half function. STRUCTURE and oracle information. None: f(x) = √(xᵀAx) + xᵀBx, the oracle returns f^k := f(x^k), g^k ∈ ∂f(x^k). Sum: f₁(x) + f₂(x), f₁(x) = √(xᵀAx), f₂(x) = xᵀBx (f₂ smooth), the oracle returns f₁^k, g₁^k, f₂^k, ∇f₂(x^k). Composition: (h∘c)(x), c(x) = (x, xᵀBx) ∈ IRⁿ⁺¹ (c smooth), h(C) = √(C_{1:n}ᵀ A C_{1:n}) + C_{n+1} (h sublinear), the oracle returns c^k = c(x^k), c′(x^k), h^k, g^k ∈ ∂h(c^k)

70 NSO pitfalls

71 Stopping test in smooth optimization. Algorithms for unconstrained smooth optimization use as optimality certificate Fermat's rule, 0 = ∇f(x̄), and generate a minimizing sequence {x^k} → x̄ such that ∇f(x^k) → 0. If f ∈ C¹, then ∇f(x̄) = 0

72 Stopping test in smooth optimization. Algorithms for unconstrained smooth optimization use as optimality certificate Fermat's rule, 0 = ∇f(x̄), and generate a minimizing sequence {x^k} → x̄ such that ∇f(x^k) → 0. If f ∈ C¹, then ∇f(x̄) = 0. Things are less straightforward if f is nonsmooth...

73 What happens with the stopping test in NSO? Algorithms for unconstrained NSO use as optimality certificate the inclusion 0 ∈ ∂f(x̄). As a set-valued mapping, ∂f(x) is osc: ( x^k → x̄, g(x^k) ∈ ∂f(x^k), g(x^k) → ḡ ) ⇒ ḡ ∈ ∂f(x̄)

74 What happens with the stopping test in NSO? Algorithms for unconstrained NSO use as optimality certificate the inclusion 0 ∈ ∂f(x̄). As a set-valued mapping, ∂f(x) is osc: ( x^k → x̄, g(x^k) ∈ ∂f(x^k), g(x^k) → ḡ ) ⇒ ḡ ∈ ∂f(x̄). As a set-valued mapping, ∂f(x) is not isc: given ḡ ∈ ∂f(x̄), there may exist NO sequence ( x^k → x̄, g(x^k) ∈ ∂f(x^k) ) with g(x^k) → ḡ

75 What happens with the stopping test in NSO? Algorithms for unconstrained NSO use as optimality certificate the inclusion 0 ∈ ∂f(x̄). As a set-valued mapping, ∂f(x) is osc: ( x^k → x̄, g(x^k) ∈ ∂f(x^k), g(x^k) → ḡ ) ⇒ ḡ ∈ ∂f(x̄). As a set-valued mapping, ∂f(x) is not isc: given ḡ ∈ ∂f(x̄), there may exist NO sequence ( x^k → x̄, g(x^k) ∈ ∂f(x^k) ) with g(x^k) → ḡ

76 The subdifferential. For the absolute value function f(x) = |x|: ∂f(x) = {−1} if x < 0, [−1, 1] if x = 0, {1} if x > 0; and ∂_ε f(x) = [−1, −1 − ε/x] if x < −ε/2, [−1, 1] if −ε/2 ≤ x ≤ ε/2, [1 − ε/x, 1] if x > ε/2

77 What happens with the stopping test in NSO? We need to design a sound stopping test that does not rely on the straightforward extension of Fermat's rule.

78 What happens with the stopping test in NSO? We need to design a sound stopping test that does not rely on the straightforward extension of Fermat's rule. We use instead ḡ ∈ ∂_ε f(x̄) for ‖ḡ‖ and ε small, where the ε-subdifferential contains the slopes of linearizations supporting f up to ε, tangent at x: ∂_ε f(x) = {g ∈ IRⁿ : f(y) ≥ f(x) + ⟨g, y − x⟩ − ε for all y}

79 The ε-subdifferential. ∂_ε f(x) = {g ∈ IRⁿ : f(y) ≥ f(x) + ⟨g, y − x⟩ − ε for all y} [figure: f, f + ε, f − ε, and a subgradient linearization]

80 The ε-subdifferential. ∂_ε f(x) = {g ∈ IRⁿ : f(y) ≥ f(x) + ⟨g, y − x⟩ − ε for all y} [figure: f, f + ε, f − ε, and an ε-subgradient linearization]

83 The ε-subdifferential. For the absolute value function f(x) = |x|: ∂_ε f(x) = [−1, −1 − ε/x] if x < −ε/2, [−1, 1] if −ε/2 ≤ x ≤ ε/2, [1 − ε/x, 1] if x > ε/2; while ∂f(x) = {−1} if x < 0, [−1, 1] if x = 0, {1} if x > 0

84 The ε-subdifferential. For the absolute value function f(x) = |x|: ∂_ε f(x) = [−1, −1 − ε/x] if x < −ε/2, [−1, 1] if −ε/2 ≤ x ≤ ε/2, [1 − ε/x, 1] if x > ε/2; while ∂f(x) = {−1} if x < 0, [−1, 1] if x = 0, {1} if x > 0 [figure: graphs of ∂f and ∂_ε f, with breakpoints at −ε/2 and ε/2]
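
A small numerical check (mine) of this formula: for x > ε/2 the extreme slope g = 1 − ε/x should satisfy the ε-subgradient inequality |y| ≥ |x| + g(y − x) − ε for every y.

```python
import numpy as np

# check that g = 1 - eps/x is an eps-subgradient of |.| at x (with x > eps/2)
x, eps = 0.8, 0.5
g = 1 - eps / x

ys = np.linspace(-10, 10, 100_001)
violation = (np.abs(x) + g * (ys - x) - eps) - np.abs(ys)
print(violation.max())      # <= 0: the inequality holds for every sampled y
assert violation.max() <= 1e-12
```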

85 The ε-subdifferential. As a set-valued mapping, ∂_ε f(x) is osc: ( ε_k → ε, x^k → x̄, g(x^k) ∈ ∂_{ε_k} f(x^k), g(x^k) → ḡ ) ⇒ ḡ ∈ ∂_ε f(x̄). As a set-valued mapping, ∂_ε f(x) is also isc: given ḡ ∈ ∂_ε f(x̄), there exist ( ε_k → ε, x^k → x̄, g(x^k) ∈ ∂_{ε_k} f(x^k) ) such that g(x^k) → ḡ

86 The ε-subdifferential and bundle methods. Generate iterates so that, for a subsequence {x̂^k}: by osc, ( ε_k → ε, x̂^k → x̄, g(x̂^k) ∈ ∂_{ε_k} f(x̂^k), g(x̂^k) → ḡ ) ⇒ ḡ ∈ ∂_ε f(x̄), with ε = 0 and ḡ = 0; by isc, given ḡ ∈ ∂_ε f(x̄), there exist ( ε_k → ε, x̂^k → x̄, g(x̂^k) ∈ ∂_{ε_k} f(x̂^k) ) with g(x̂^k) → ḡ

87 The ε-subdifferential and bundle methods You told us we were going to use subgradient information provided by an oracle or a black box, and now you want to use ε-subgradients!

88 The transportation formula. How to express subgradients at x^i as ε-subgradients at x̂^k? g^i ∈ ∂f(x^i) if and only if, for all y ∈ IRⁿ, f(y) ≥ f(x^i) + ⟨g^i, y − x^i⟩ = f(x^i) + ⟨g^i, y − x^i⟩ ± f(x̂^k) = f(x̂^k) + ⟨g^i, y − x^i⟩ − ( f(x̂^k) − f(x^i) ) = f(x̂^k) + ⟨g^i, y − x^i ± x̂^k⟩ − ( f(x̂^k) − f(x^i) ) = f(x̂^k) + ⟨g^i, y − x̂^k⟩ − ( f(x̂^k) − f(x^i) − ⟨g^i, x̂^k − x^i⟩ ) = f(x̂^k) + ⟨g^i, y − x̂^k⟩ − e_i(x̂^k)

89 The transportation formula. How to express subgradients at x^i as ε-subgradients at x̂^k? g^i ∈ ∂f(x^i) if and only if, for all y ∈ IRⁿ, f(y) ≥ f(x^i) + ⟨g^i, y − x^i⟩ = f(x^i) + ⟨g^i, y − x^i⟩ ± f(x̂^k) = f(x̂^k) + ⟨g^i, y − x^i⟩ − ( f(x̂^k) − f(x^i) ) = f(x̂^k) + ⟨g^i, y − x^i ± x̂^k⟩ − ( f(x̂^k) − f(x^i) ) = f(x̂^k) + ⟨g^i, y − x̂^k⟩ − ( f(x̂^k) − f(x^i) − ⟨g^i, x̂^k − x^i⟩ ) = f(x̂^k) + ⟨g^i, y − x̂^k⟩ − e_i(x̂^k), that is, g^i ∈ ∂_{e_i(x̂^k)} f(x̂^k), where e_i(x̂^k) := f(x̂^k) − f(x^i) − ⟨g^i, x̂^k − x^i⟩ ≥ 0

96 Linearization errors [figure: f, the cutting-plane model M_k(·), the linearization f(x^j) + ⟨g^j, · − x^j⟩, and the linearization error e_j(x̂^k) measured at x̂^k]

97 The ε-subdifferential and bundle methods. We collect the black-box output at x^i, i = 1, 2, ..., k, so that at iteration k we can define a bundle of information, centered at a special iterate x̂^k ∈ {x^i}: B_k := { ( e_i(x̂^k) = f(x̂^k) − f(x^i) − ⟨g^i, x̂^k − x^i⟩, g^i ∈ ∂_{e_i(x̂^k)} f(x̂^k) ) }

98 The ε-subdifferential and bundle methods. We collect the black-box output at x^i, i = 1, 2, ..., k, so that at iteration k we can define a bundle of information, centered at a special iterate x̂^k ∈ {x^i}: B_k := { ( e_i(x̂^k) = f(x̂^k) − f(x^i) − ⟨g^i, x̂^k − x^i⟩, g^i ∈ ∂_{e_i(x̂^k)} f(x̂^k) ) }. A suitable convex combination ε_k := ∑_{i∈B_k} α_i e_i(x̂^k) and G_k := ∑_{i∈B_k} α_i g^i will eventually satisfy the optimality condition!
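
To make this concrete, a small numerical check (my own) for f = ‖·‖₁: the transported linearization errors are nonnegative, and the aggregate pair (ε_k, G_k) built from any convex weights α is indeed an ε_k-subgradient of f at the center x̂.

```python
import numpy as np
rng = np.random.default_rng(0)

f = lambda x: np.abs(x).sum()
subg = lambda x: np.sign(x)

# bundle points and the current center (stability center) x_hat
pts = [np.array([1.0, -2.0]), np.array([0.5, 0.5]), np.array([-1.0, 1.0])]
x_hat = np.array([0.3, -0.4])

# transportation formula: e_i = f(x_hat) - f(x_i) - <g_i, x_hat - x_i> >= 0
g = [subg(x) for x in pts]
e = [f(x_hat) - f(x) - gi @ (x_hat - x) for x, gi in zip(pts, g)]
assert all(ei >= -1e-12 for ei in e)

# aggregate with arbitrary convex weights alpha
alpha = np.array([0.2, 0.5, 0.3])
eps_k = alpha @ e
G_k = sum(a * gi for a, gi in zip(alpha, g))

# check G_k is an eps_k-subgradient at x_hat:
# f(y) >= f(x_hat) + <G_k, y - x_hat> - eps_k for all y
for _ in range(1000):
    y = rng.normal(size=2) * 5
    assert f(y) >= f(x_hat) + G_k @ (y - x_hat) - eps_k - 1e-12
```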

99 Why special NSO methods? Smooth optimization techniques do not work. For f(x) = |x|: |f′(x^k)| = 1 whenever x^k ≠ 0, while ∂f(0) = [−1, 1]. The smooth stopping test fails: ‖∇f(x^k)‖ ≤ TOL never triggers (nor does ‖g(x^k)‖ ≤ TOL)

100 Why special NSO methods? Smooth optimization techniques do not work. Smooth approximations of derivatives by finite differences fail. For f : IR³ → IR defined by f(x) = max(x₁, x₂, x₃), ∂f(0) = ? Forward finite differences [f(x + Δe_i) − f(x)]/Δ give (1, 1, 1); central finite differences [f(x + Δe_i) − f(x − Δe_i)]/(2Δ) give (1/2, 1/2, 1/2): none of them is in the subdifferential, which is the unit simplex {g ≥ 0 : ∑ g_i = 1}!

101 Why special NSO methods? Smooth optimization techniques do not work. Smooth approximations of derivatives by finite differences fail. For f : IR³ → IR defined by f(x) = max(x₁, x₂, x₃), ∂f(0) = ? Forward finite differences [f(x + Δe_i) − f(x)]/Δ give (1, 1, 1); central finite differences [f(x + Δe_i) − f(x − Δe_i)]/(2Δ) give (1/2, 1/2, 1/2): none of them is in the subdifferential, which is the unit simplex {g ≥ 0 : ∑ g_i = 1}!
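
A short sketch (mine) reproducing this finite-difference computation numerically:

```python
import numpy as np

f = lambda x: np.max(x)           # f(x) = max(x1, x2, x3), nonsmooth at 0
x = np.zeros(3)
d = 1e-6
I = np.eye(3)

forward = np.array([(f(x + d*e) - f(x)) / d for e in I])
central = np.array([(f(x + d*e) - f(x - d*e)) / (2*d) for e in I])
print(forward)   # [1. 1. 1.]
print(central)   # [0.5 0.5 0.5]

# Neither vector lies in the subdifferential of max at 0, which is the
# unit simplex {g >= 0, sum(g) = 1}: their components sum to 3 and 1.5.
```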

102 Why special NSO methods? Smooth optimization techniques do not work Linesearches get trapped in kinks and fail

103 Why special NSO methods? Smooth optimization techniques do not work Linesearches get trapped in kinks and fail Example 9.1 Instability of steepest descent"

104 Why special NSO methods? Smooth optimization techniques do not work. −g(x^k) may not provide descent

105 Why special NSO methods? Smooth optimization techniques do not work. −g(x^k) may not provide descent

106 Why special NSO methods? Smooth optimization techniques do not work Smooth stopping test fails Finite difference approximations fail Linesearches get trapped in kinks and fail Direction opposite to a subgradient may increase the functional values

107

108

109

110 Bundle Methods. WANT: p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂². HAVE: q^k = prox_{t_k}^{M_k}(x): q^k = argmin_y M_k(y) + (1/2t_k)‖y − x‖₂², i.e. x = q^k + t_k G^k with G^k ∈ ∂M_k(q^k). Then G^k ∈ ∂_{ε_k} f(x), for ε_k = f(x) − M_k(q^k) − t_k‖G^k‖₂²

111 Bundle Methods. WANT: p = prox_t^f(x): p = argmin_y f(y) + (1/2t)‖y − x‖₂². HAVE: q^k = prox_{t_k}^{M_k}(x): q^k = argmin_y M_k(y) + (1/2t_k)‖y − x‖₂², i.e. x = q^k + t_k G^k with G^k ∈ ∂M_k(q^k). Then G^k ∈ ∂_{ε_k} f(x), for ε_k = f(x) − M_k(q^k) − t_k‖G^k‖₂². Two subsequences: iterates giving sufficiently good approximal points (serious), and iterates just helping the optimization process, for which [CL93] eventually applies (null)

112 Bundle Methods. HAVE: q^k = prox_{t_k}^{M_k}(x) = x − t_k G^k, with G^k ∈ ∂_{ε_k} f(x) for ε_k = f(x) − M_k(q^k) − t_k‖G^k‖₂². Two subsequences: iterates giving sufficiently good approximal points, moving towards a minimum in a manner that drives δ_k := ε_k + t_k‖G^k‖² to zero (serious), and iterates just helping the optimization process, for which [CL93] eventually applies (null)

113 Bundle Methods

114 Bundle Methods [figure: minimizing M₁(·) + (1/2t₁)‖· − x̂¹‖²]

115 Bundle Methods [figure: minimizing M₂(·) + (1/2t₂)‖· − x̂²‖²; iterates x̂¹, x²]

116 Bundle Methods [figure: iterates x̂¹, x̂³, x²]

117 Bundle Methods [figure: iterates x̂¹, x̂⁴, x̂³, x²]

118 Bundle Methods. 0. Choose x¹, set k = 1, and let x̂¹ = x¹. 1. Compute x^{k+1} = argmin_x M_k(x) + (1/2t_k)‖x − x̂^k‖₂². 2. If δ_k := f(x̂^k) − M_k(x^{k+1}) ≤ tol, STOP. 3. Call the oracle at x^{k+1}. If f(x^{k+1}) ≤ f(x̂^k) − mδ_k, set x̂^{k+1} = x^{k+1} (Serious Step); otherwise keep x̂^{k+1} = x̂^k (Null Step). 4. Define M_{k+1}, t_{k+1}, set k = k + 1, and loop to 1.
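
To fix ideas, here is a self-contained Python sketch (my own, not the tutorial's implementation) of these steps, with the QP of step 1 solved in its (x, r) form by scipy's SLSQP. The test problem f(x) = ‖x‖₁ + ½‖x − a‖² with a = (1, −2) and minimizer (0, −1), the function names proximal_bundle and solve_subproblem, and the parameter values t, m, tol are all my choices.

```python
import numpy as np
from scipy.optimize import minimize

# Test problem (mine, not from the slides): f(x) = ||x||_1 + 0.5*||x - a||^2,
# whose minimizer is the soft-thresholding of a; here x* = (0, -1), f(x*) = 2.
a = np.array([1.0, -2.0])
def oracle(x):
    x = np.asarray(x, dtype=float)
    return np.abs(x).sum() + 0.5 * np.sum((x - a) ** 2), np.sign(x) + (x - a)

def solve_subproblem(cuts, x_hat, t):
    """Step 1 as a QP in (x, r): min r + (1/2t)||x - x_hat||^2
       s.t. r >= f_i + <g_i, x - x_i> for every cut in the bundle."""
    n = len(x_hat)
    def obj(z):                      # z = (x, r)
        return z[n] + np.sum((z[:n] - x_hat) ** 2) / (2 * t)
    cons = [{'type': 'ineq',
             'fun': lambda z, fi=fi, gi=gi, xi=xi: z[n] - fi - gi @ (z[:n] - xi)}
            for fi, gi, xi in cuts]
    r0 = max(fi + gi @ (x_hat - xi) for fi, gi, xi in cuts)   # M_k(x_hat)
    res = minimize(obj, np.concatenate([x_hat, [r0]]),
                   constraints=cons, method='SLSQP')
    return res.x[:n], res.x[n]       # candidate x^{k+1} and M_k(x^{k+1})

def proximal_bundle(x1, t=1.0, m=0.1, tol=1e-5, max_iter=100):
    x_hat = np.asarray(x1, dtype=float)
    f_hat, g_hat = oracle(x_hat)
    cuts = [(f_hat, g_hat, x_hat.copy())]
    for _ in range(max_iter):
        x_new, model_val = solve_subproblem(cuts, x_hat, t)
        delta = f_hat - model_val              # delta_k = f(x_hat) - M_k(x^{k+1})
        if delta <= tol:
            break
        f_new, g_new = oracle(x_new)
        cuts.append((f_new, g_new, x_new))
        if f_new <= f_hat - m * delta:         # descent test: Serious Step
            x_hat, f_hat = x_new, f_new
        # otherwise: Null Step, keep the stability center x_hat
    return x_hat, f_hat

print(proximal_bundle([3.0, 3.0]))   # approx. (array([0., -1.]), 2.0)
```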

119 Bundle Methods: selection mechanism. M_{k+1}(·) = max( M_k(·), f_{k+1} + ⟨g_{k+1}, · − x^{k+1}⟩ ); now the choice of the new model is more flexible: x^{k+1} ∈ argmin_x M_k(x) + (1/2t_k)‖x − x̂^k‖², with M_k(x) = max_{i≤k} { f_i + ⟨g_i, x − x^i⟩ }, is equivalent to a QP: min_{r∈IR, x∈IRⁿ} r + (1/2t_k)‖x − x̂^k‖² s.t. r ≥ f_i + ⟨g_i, x − x^i⟩ for i ≤ k. A posteriori, the solution remains the same if...

120 Bundle Methods: selection mechanism. M_{k+1}(·) = max( M_k(·), f_{k+1} + ⟨g_{k+1}, · − x^{k+1}⟩ ); now the choice of the new model is more flexible: x^{k+1} ∈ argmin_x M_k(x) + (1/2t_k)‖x − x̂^k‖², with M_k(x) = max_{i≤k} { f_i + ⟨g_i, x − x^i⟩ }, is equivalent to a QP: min_{r∈IR, x∈IRⁿ} r + (1/2t_k)‖x − x̂^k‖² s.t. r ≥ f_i + ⟨g_i, x − x^i⟩ for i ≤ k. A posteriori, the solution remains the same if all, or...

121 Bundle Methods: selection mechanism. M_{k+1}(·) = max( M_k(·), f_{k+1} + ⟨g_{k+1}, · − x^{k+1}⟩ ); now the choice of the new model is more flexible: x^{k+1} ∈ argmin_x M_k(x) + (1/2t_k)‖x − x̂^k‖², with M_k(x) = max_{i≤k} { f_i + ⟨g_i, x − x^i⟩ }, is equivalent to a QP: min_{r∈IR, x∈IRⁿ} r + (1/2t_k)‖x − x̂^k‖² s.t. r ≥ f_i + ⟨g_i, x − x^i⟩ for the active i's. A posteriori, the solution remains the same if all, or active, or...

122 Bundle Methods: selection mechanism. M_{k+1}(·) = max( M_k(·), f_{k+1} + ⟨g_{k+1}, · − x^{k+1}⟩ ); now the choice of the new model is more flexible: x^{k+1} ∈ argmin_x M_k(x) + (1/2t_k)‖x − x̂^k‖², with M_k(x) = max_{i≤k} { f_i + ⟨g_i, x − x^i⟩ }, is equivalent to a QP: min_{r∈IR, x∈IRⁿ} r + (1/2t_k)‖x − x̂^k‖² s.t. r ≥ ∑_i ᾱ_i ( f_i + ⟨g_i, x − x^i⟩ ). A posteriori, the solution remains the same if all, or active, or the optimal convex combination

123 Bundle Methods: selection mechanism. M_{k+1}(·) = max( M_k(·), f_{k+1} + ⟨g_{k+1}, · − x^{k+1}⟩ ); now the choice of the new model is more flexible: x^{k+1} ∈ argmin_x M_k(x) + (1/2t_k)‖x − x̂^k‖², with M_k(x) = max_{i≤k} { f_i + ⟨g_i, x − x^i⟩ }, is equivalent to a QP: min_{r∈IR, x∈IRⁿ} r + (1/2t_k)‖x − x̂^k‖² s.t. r ≥ ∑_i ᾱ_i ( f_i + ⟨g_i, x − x^i⟩ ). A posteriori, the solution remains the same if all, or active, or the optimal convex combination are kept

124 Bundle Methods: next model options. M_{k+1}(·) = max( M_k(·), f_{k+1} + ⟨g_{k+1}, · − x^{k+1}⟩ ), or M_{k+1}(·) = max( max over the active cuts, f_{k+1} + ⟨g_{k+1}, · − x^{k+1}⟩ ), or M_{k+1}(·) = max( aggregate, f_{k+1} + ⟨g_{k+1}, · − x^{k+1}⟩ ). Same QP solution if all, or the active, or the optimal convex combination are kept. aggregate = full Bundle Compression: QP with only 2 constraints (but it slows down the overall process)

125 The cutting-plane model. You told us we were going to use a bundle B_k composed of linearization errors and ε-subgradients at x̂^k, but the model uses f_i and g_i ∈ ∂f(x^i)

126 Rewriting the cutting-plane model. The transportation formula centers the i-th linearization at the serious iterate: f(y) ≥ f(x^i) + ⟨g^i, y − x^i⟩ = f(x̂^k) + ⟨g^i, y − x̂^k⟩ − e_i(x̂^k) [figure: f, the model M_k(·), the linearization f(x^j) + ⟨g^j, · − x^j⟩ and the error e_j(x̂^k) at x̂^k]

127 Rewriting the cutting-plane model. The transportation formula centers the i-th linearization at the serious iterate: f(y) ≥ f(x^i) + ⟨g^i, y − x^i⟩ = f(x̂^k) + ⟨g^i, y − x̂^k⟩ − e_i(x̂^k). This translates into the model as follows: M_k(y) = max{ f(x^i) + ⟨g^i, y − x^i⟩ : i ∈ B_k } = max{ f(x̂^k) + ⟨g^i, y − x̂^k⟩ − e_i(x̂^k) : i ∈ B_k } = f(x̂^k) + max{ ⟨g^i, y − x̂^k⟩ − e_i(x̂^k) : i ∈ B_k }
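
A two-line numerical check (mine) that both ways of writing the cutting-plane model coincide, reusing the kind of bundle data used in the earlier sketches.

```python
import numpy as np

f = lambda x: np.abs(x).sum()
pts = [np.array([1.0, -2.0]), np.array([0.5, 0.5]), np.array([-1.0, 1.0])]
gs = [np.sign(x) for x in pts]
fs = [f(x) for x in pts]
x_hat = np.array([0.3, -0.4])
e = [f(x_hat) - fi - gi @ (x_hat - xi) for fi, gi, xi in zip(fs, gs, pts)]

y = np.array([1.1, 0.7])
M1 = max(fi + gi @ (y - xi) for fi, gi, xi in zip(fs, gs, pts))
M2 = f(x_hat) + max(gi @ (y - x_hat) - ei for gi, ei in zip(gs, e))
assert abs(M1 - M2) < 1e-12    # the two forms of M_k coincide
```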

128 Bundle Method. 0. Choose x¹, set k = 1, and let x̂¹ = x¹. 1. Compute x^{k+1} = argmin_x M_k(x) + (1/2t_k)‖x − x̂^k‖₂². 2. If δ_k := f(x̂^k) − M_k(x^{k+1}) ≤ tol, STOP. 3. Call the oracle at x^{k+1}. If f(x^{k+1}) ≤ f(x̂^k) − mδ_k, set x̂^{k+1} = x^{k+1}; otherwise keep x̂^{k+1} = x̂^k. 4. Define M_{k+1}, t_{k+1}, set k = k + 1, and loop to 1.

129 Bundle Method. 0. Choose x¹, set k = 1, and let x̂¹ = x¹. 1. Compute x^{k+1} = argmin_x M_k(x) + (1/2t_k)‖x − x̂^k‖₂². 2. If δ_k := f(x̂^k) − M_k(x^{k+1}) ≤ tol, STOP. 3. Call the oracle at x^{k+1}. If f(x^{k+1}) ≤ f(x̂^k) − mδ_k, set x̂^{k+1} = x^{k+1} (Serious Step), k ∈ K_S; otherwise keep x̂^{k+1} = x̂^k (Null Step), k ∈ K_N. 4. Define M_{k+1}, t_{k+1}, set k = k + 1, and loop to 1.

130 Bundle Method. When k → ∞, the algorithm generates two subsequences. Convergence analysis addresses the mutually exclusive situations: either the SS subsequence is infinite, K := {k ∈ K_S} (a limit point minimizes f); or there is a last SS, followed by infinitely many null steps, K := {k ∈ K_N : k ≥ last SS} (the last SS minimizes f and the null steps converge to the last SS)

133 Equivalent QPs. 1. Given t_k, the stepsize parameter of the proximal bundle method, with QP subproblem (PB)_k: min_x M_k(x) + (1/2t_k)‖x − x̂^k‖². 2. Given Δ_k, the radius parameter of the trust-region bundle method, with QP subproblem (TRB)_k: min_x M_k(x) s.t. ‖x − x̂^k‖² ≤ Δ_k. 3. Given l_k, the level parameter of the level bundle method,

134 with QP subproblem (LB)_k: min_x (1/2)‖x − x̂^k‖² s.t. M_k(x) ≤ l_k. Show that: 1. given t_k, there exists Δ_k such that if x^{k+1} solves (PB)_k, then x^{k+1} solves (TRB)_k; 2. given Δ_k, there exists l_k such that if x^{k+1} solves (TRB)_k, then x^{k+1} solves (LB)_k; 3. given l_k, there exists t_k such that if x^{k+1} solves (LB)_k, then x^{k+1} solves (PB)_k.

135 Theorem (K := {k ∈ K_S}). Suppose the bundle method loops forever and there are infinitely many serious steps. Either the solution set of min f is empty and f(x̂^k) → −∞, or the following holds: (i) lim_{k∈K_S} δ_k = 0 and lim_{k∈K_S} ε_k = 0; (ii) if the stepsizes are chosen so that ∑_{k∈K_S} t_k = +∞, then {x̂^k} is a minimizing sequence; (iii) if, in addition, t_k ≤ t_up for all k ∈ K_S, then the subsequence {x̂^k} is bounded. In this case, any limit point x̄ minimizes f and the whole sequence converges to x̄

136 Theorem (K := {k ∈ K_N : k ≥ k̂}). Suppose the bundle method loops forever and there are infinitely many null steps after a last serious one, denoted by x̂ and generated at iteration k̂. Suppose the stepsizes are chosen so that t_lo ≤ t_{k+1} ≤ t_k for all k ∈ K. The following holds: 1. the sequence {x^{k+1}} is bounded; 2. lim_{k∈K} M_k(x^{k+1}) = f(x̂); 3. x̂ minimizes f; 4. lim_{k∈K} x^{k+1} = x̂

137 Model requirements. 1. M_k ≤ f. 2. If k was declared a null step: a) M_{k+1}(x) ≥ f_{k+1} + ⟨g_{k+1}, x − x^{k+1}⟩; b) M_{k+1}(x) ≥ A_k(x) := M_k(x^{k+1}) + ⟨G^k, x − x^{k+1}⟩

138 Model requirements. 1. M_k ≤ f. 2. If k was declared a null step: a) M_{k+1}(x) ≥ f_{k+1} + ⟨g_{k+1}, x − x^{k+1}⟩; b) M_{k+1}(x) ≥ A_k(x) := M_k(x^{k+1}) + ⟨G^k, x − x^{k+1}⟩. Any model satisfying these conditions that is used in the QP maintains the convergence results

139 Comparing the methods: bundle and SG Typical performance on a battery of Unit Commitment problems

140 Bundle Methods with on-demand accuracy the new generation

141 Oracle types: exact and upper. f₁(x) is easy: exact oracle f₁(x)/g₁(x). f₂(x) is difficult: inexact oracle f_{2x}/g_{2x}. [figure: f₁, f₂ and the linearizations f₁(x) + ⟨g₁(x), · − x⟩ and f_{2x} + ⟨g_{2x}, · − x⟩] The oracle f_{2x}/g_{2x} is NOT of lower type

142 Oracle types: exact and lower. f₁(x) is easy: exact oracle f₁(x)/g₁(x). f₂(x) is less difficult: inexact oracle f_{2x}/g_{2x}. [figure: f₁, f₂ and the linearizations f₁(x) + ⟨g₁(x), · − x⟩ and f_{2x} + ⟨g_{2x}, · − x⟩] The oracle f_{2x}/g_{2x} is of lower type

143 For the EM problem, f_j(x) = max{ C_j(q_j) + ⟨x, g_j(q_j)⟩ : q_j ∈ P_j }. By computing f_{x^k} and g_{x^k} satisfying f_{x^k} = f(x^k) − η_k and g_{x^k} ∈ ∂_{η_k} f(x^k), we can build: a lower oracle; an asymptotically exact oracle (η_k → 0 as k → ∞); a partly asymptotically exact oracle (η_k → 0 as K_s ∋ k → ∞); an on-demand accuracy oracle (η_k ≤ η̄_k when f_{x^{k}} ≤ f_{x̂^k} − mδ_k)
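
As an illustration of the lower-oracle concept (my own construction, not the tutorial's), an inexact oracle for f = ‖·‖₁ can be obtained by linearizing f at a perturbed point z: the returned value f_x = f(z) + ⟨g_z, x − z⟩ never exceeds f(x), and by the transportation formula g_z is an η-subgradient at x with η = f(x) − f_x.

```python
import numpy as np
rng = np.random.default_rng(1)

f = lambda x: np.abs(x).sum()

def lower_oracle(x, noise=0.3):
    """Inexact oracle of lower type: returns f_x <= f(x) and an eta-subgradient."""
    x = np.asarray(x, dtype=float)
    z = x + noise * rng.normal(size=x.size)   # evaluation done "somewhere nearby"
    g_z = np.sign(z)                          # exact subgradient at z
    f_x = f(z) + g_z @ (x - z)                # lower bound on f(x) (convexity)
    return f_x, g_z

x = np.array([0.4, -1.0, 0.2])
f_x, g_x = lower_oracle(x)
eta = f(x) - f_x                              # the (unknown to the user) inaccuracy
assert eta >= -1e-12

# g_x is an eta-subgradient of f at x: f(y) >= f(x) + <g_x, y - x> - eta
for _ in range(1000):
    y = rng.normal(size=3) * 4
    assert f(y) >= f(x) + g_x @ (y - x) - eta - 1e-12
```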

144 BM with lower inexact oracles. M_k(x) = max{ f_{x^i} + ⟨g_{x^i}, x − x^i⟩ : i ∈ B_k }, δ_k = ε_k + t_k‖G^k‖². SS test: f_{x^{k+1}} ≤ f̂_k − mδ_k, with f̂_k := max{ f_{x̂^k}, max_j M_j(x̂^k) }. Oracle inaccuracy is locally bounded: for all R ≥ 0 there is η(R) ≥ 0 such that ‖x‖ ≤ R ⇒ η ≤ η(R). ⇒ convergence as before, up to the accuracy on SS

145 BM with lower inexact oracles. M_k(x) = max{ f_{x^i} + ⟨g_{x^i}, x − x^i⟩ : i ∈ B_k }, δ_k = ε_k + t_k‖G^k‖². SS test: f_{x^{k+1}} ≤ f̂_k − mδ_k, with f̂_k := max{ f_{x̂^k}, max_j M_j(x̂^k) }. Oracle inaccuracy is locally bounded: for all R ≥ 0 there is η(R) ≥ 0 such that ‖x‖ ≤ R ⇒ η ≤ η(R). ⇒ convergence as before, up to the accuracy on SS. Convex proximal bundle methods in depth: a unified analysis for inexact oracles, W. de Oliveira, C. Sagastizábal, C. Lemaréchal, MathProg 148, 2014

146 General comments Bundle methods are robust (do not oscillate, as CP methods do) reliable (have a stopping test, unlike SG methods) can deal with inaccuracy in a reasonable manner

147 Extending bundle methods

148 Constrained NSO problems: an example

149 Optimal management of the hydrovalley. max λ E_η(ρ(u)) s.t. (x,u) ∈ P, P_η( Au + a^min ≤ Mη ≤ Au + a^max ) ≥ p

150 Optimal management of the hydrovalley. max λ E_η(ρ(u)) s.t. (x,u) ∈ P, P_η( Au + a^min ≤ Mη ≤ Au + a^max ) ≥ p. Is this a convex program?

151 Optimal management of the hydrovalley. max λ E_η(ρ(u)) s.t. (x,u) ∈ P, P_η( Au + a^min ≤ Mη ≤ Au + a^max ) ≥ p. Is this a convex program? YES: the function u ↦ −log( P_η( Au + a^min ≤ Mη ≤ Au + a^max ) ) is convex. We need to solve min f(u) s.t. (x,u) ∈ P, c(u) ≤ 0, for linear f and with c(u) := −log( P_η( Au + a^min ≤ Mη ≤ Au + a^max ) ) + log p

152 Optimal management of the hydrovalley. max λ E_η(ρ(u)) s.t. (x,u) ∈ P, P_η( Au + a^min ≤ Mη ≤ Au + a^max ) ≥ p. Is this a convex program? YES: the function u ↦ −log( P_η( Au + a^min ≤ Mη ≤ Au + a^max ) ) is convex. We need to solve min f(u) s.t. (x,u) ∈ P, c(u) ≤ 0, for linear f and with c(u) := −log( P_η( Au + a^min ≤ Mη ≤ Au + a^max ) ) + log p, which is difficult to compute!

153 Need to solve the constrained problem (P): min f(u) s.t. (x,u) ∈ P, c(u) ≤ 0, for linear f and with inexact evaluation of c and its gradient, via a black box with controllable inaccuracy (bounded by a given tolerance ε, with confidence level 99%, noting that evaluation errors can be positive or negative)

154 Handling constraints in NSO. For nonsmooth constrained problems, use the Improvement Function: replace min_u { f(u) : c(u) ≤ 0 } by min_u max{ f(u) − f(û), c(u) } (it changes with each serious point û and supposes exact f/c values available) [SagSol SiOPT, 2005 and KarasRibSagSol MPB, 2009]

155 Improvement function. Let (x̄, ū) be a solution to (P). The function H_ū(u) := max{ f(u) − f(ū), c(u) } has perfect theoretical properties: if the Slater condition (∃ (x,u) ∈ P s.t. c(u) < 0) holds, then ū solves (P) min { f(u) : c(u) ≤ 0, (x,u) ∈ P } ⇔ min_{(x,u)∈P} H_ū(u) = H_ū(ū) = 0 ⇔ 0 ∈ ∂H(ū) for H(·) := H_ū(·)

156 Improvement function. Let (x̄, ū) be a solution to (P). The function H_ū(u) := max{ f(u) − f(ū), c(u) } has perfect theoretical properties: without the Slater condition, ū need not solve (P) min { f(u) : c(u) ≤ 0, (x,u) ∈ P }, BUT still min_{(x,u)∈P} H_ū(u) = H_ū(ū) = 0 and also 0 ∈ ∂H(ū) for H(·) := H_ū(·)

157 Improvement function. Let (x̄, ū) be a solution to (P). The function H_ū(u) := max{ f(u) − f(ū), c(u) } has perfect theoretical properties: without the Slater condition, when c(ū) ≤ 0, ū solves (P) min { f(u) : c(u) ≤ 0, (x,u) ∈ P }, otherwise it minimizes infeasibility over P; and still min_{(x,u)∈P} H_ū(u) = H_ū(ū) = 0 and 0 ∈ ∂H(ū) for H(·) := H_ū(·)

158 Handling nonconvex problems

159 Nonconvex proximal point mapping [PR96]. p_R f(x) := argmin_{y∈IR^N} { f(y) + (R/2)‖y − x‖² }; x is the prox-center and R > R_x is the prox-parameter. Theorem: if f is convex, then p_R f is well defined for any R > 0; p_R f is single-valued and locally Lipschitz; p = p_R f(x) ⇔ R(x − p) ∈ ∂f(p); x* minimizes f ⇔ x* = p_R f(x*) for any R > 0; and x^{k+1} = p_R f(x^k) converges to a minimizer x*. What if f is nonconvex?

160 Nonconvex difficulties. Proximal Bundle Methods are the most robust and reliable (oracle) methods for convex minimization. Their success relies heavily on convexity. If f is convex: x^{k+1} = p_R f(x^k) converges to a minimizer x*; the cutting-plane model f̌_k lies entirely below f. This may no longer be true for nonconvex f

164 Nonconvex difficulties. Proximal Bundle Methods are the most robust and reliable (oracle) methods for convex minimization. Their success relies heavily on convexity. If f is convex: x^{k+1} = p_R f(x^k) converges to a minimizer x*; the cutting-plane model f̌_k lies entirely below f. This may no longer be true for nonconvex f. How has this difficulty been addressed?

165 Take each plane in the model, f_i + ⟨g_i, · − y^i⟩, and rewrite it, centered at x^k: f_i + ⟨g_i, · − y^i⟩ = f(x^k) − [ f(x^k) − f_i − ⟨g_i, x^k − y^i⟩ ] + ⟨g_i, · − x^k⟩ = f(x^k) − e^f_{k,i} + ⟨g_i, · − x^k⟩, so that f̌_k(y) = max_i { f(x^k) − e^f_{k,i} + ⟨g_i, y − x^k⟩ }. Good: e^f_{k,i} positive ⇒ convergence. Good: if f is convex, e^f_{k,i} is positive. BAD: if f is nonconvex, e^f_{k,i} may be negative

166 Nonconvex bundle methods fix negative linearization errors, replacing f̌_k by: f̌_k^FIX(y) = max_i { f(x^k) − e^f_{k,i} + ⟨g_i, y − x^k⟩ } (with the negative errors suitably modified) [Mif77, Lem80, Kiw85, Luk98]

172 A new method. A different approach (ours) is based on the following trick. Take η, µ > 0 with R = η + µ and note: p_R f(x^k) = argmin_w { f(w) + (R/2)‖w − x^k‖² } = argmin_w { f(w) + ((η + µ)/2)‖w − x^k‖² } = argmin_w { f(w) + (η/2)‖w − x^k‖² + (µ/2)‖w − x^k‖² } = argmin_w { F_η(w) + (µ/2)‖w − x^k‖² } = p_µ(F_η)(x^k), that is, p_R f = p_µ(F_η)
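
A one-dimensional numerical check (mine) of this identity for f(w) = |w|, by brute-force grid minimization: splitting R = η + µ and absorbing the η-part into F_η leaves the proximal point unchanged.

```python
import numpy as np

f = lambda w: np.abs(w)
x_k, R = 1.5, 2.0
eta, mu = 1.2, 0.8                      # any split with eta + mu = R

ws = np.linspace(-5, 5, 2_000_001)
p_R  = ws[np.argmin(f(ws) + (R / 2) * (ws - x_k) ** 2)]

F_eta = lambda w: f(w) + (eta / 2) * (w - x_k) ** 2
p_mu  = ws[np.argmin(F_eta(ws) + (mu / 2) * (ws - x_k) ** 2)]

print(p_R, p_mu)                        # both ~ 1.0 (= x_k - 1/R here)
assert abs(p_R - p_mu) < 1e-5
```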

173 Redistributed Proximal Bundle Method. At the l-th iteration, for k = k(l), given R_k, x^k and a bundle B = {y^i, f_i, g_i, i ∈ I_l}: 0. Split R_k into η_l and µ_l. 1. Model F_{η_l} by F̌_{η_l,l}(y) = max_{i∈B} { F^{η_l}_i + ⟨g^{η_l}_i, y − y^i⟩ }. 2. Minimize the penalized model: y^{l+1} = argmin{ F̌_{η_l,l}(y) + (µ_l/2)‖y − x^k‖² }. 3. Descent test: if y^{l+1} is good, set x^{k+1} ← y^{l+1} and define R_{k+1} (serious step); if y^{l+1} is bad, null step. 4. Update the bundle: B ← B ∪ {y^{l+1}, f_{l+1}, g_{l+1}}.

174 VU quasi-newton bundle For x IR n, given matrices A 0, B 0, f(x) = x Ax + x Bx has a unique minimizer at x = 0. On N (A) the function is not differentiable, and the first term vanishes: f N (A) looks smooth. R(A) N (A) V U parallel to f( x) U perpendicular to V V is parallel to N (A), the ridge" of nonsmoothness

175

176 VU-Algorithm (Mifflin & Sagastizábal, MathProg 05). Recall that f restricted to N(A) is nice: the key is the two QP solves. [figure: R(A), N(A), V, U; 2 bundle QPs; Newton move] Two successive bundle QPs identify the ridge of nonsmoothness; a 2nd QP is solved to create an approximation of V based on ∂f̌(p̂^k)

177 VU-Algorithm: superlinear convergence on the serious subsequence (Mifflin & Sagastizábal, MathProg 05)

178 To learn more. Bundle methods history: R. MIFFLIN, C. SAGASTIZÁBAL, A Science Fiction Story in Nonsmooth Optimization Originating at IIASA, Documenta Math., vol-ismp/44_mifflin-robert.pdf (exact). Bundle books: J.F. BONNANS, J.C. GILBERT, C. LEMARÉCHAL, AND C. SAGASTIZÁBAL, Numerical Optimization: Theoretical and Practical Aspects, Springer, 2nd ed.; J.B. HIRIART-URRUTY AND C. LEMARÉCHAL, Convex Analysis and Minimization Algorithms II, no. 306 in Grund. der math. Wissenschaften, Springer, 2nd ed. Inexact Bundle theory (next page)

179 Inexact Bundle theory. M. HINTERMÜLLER, A proximal bundle method based on approximate subgradients, COAp 20 (2001). M.V. SOLODOV, On Approximations with Finite Precision in Bundle Methods for Nonsmooth Optimization, JOTA (2003). K.C. KIWIEL, A proximal bundle method with approximate subgradient linearizations, SiOpt 16 (2006). W. DE OLIVEIRA, C. SAGASTIZÁBAL, AND C. LEMARÉCHAL, Convex proximal bundle methods in depth: a unified analysis for inexact oracles, MathProg 148 (2014). Inexact Bundle variants with applications: G. EMIEL AND C. SAGASTIZÁBAL, Incremental-like bundle methods with application to energy planning, COAp 46 (2010). W. DE OLIVEIRA, C. SAGASTIZÁBAL, AND S. SCHEIMBERG, Inexact bundle methods for two-stage stochastic programming, SiOpt 21 (2011). (next page)

180 W. VAN ACKOOIJ AND C. SAGASTIZÁBAL, Constrained bundle methods for upper inexact oracles with application to joint chance constrained energy problems, SiOpt 24 (2014). W. DE OLIVEIRA AND C. SAGASTIZÁBAL, Level bundle methods for oracles with on-demand accuracy, OMS 29 (2014). W. DE OLIVEIRA AND C. SAGASTIZÁBAL, Bundle methods in the XXI century: a bird's-eye view, Pesquisa Operacional 34 (2014). W. DE OLIVEIRA AND M. SOLODOV, A doubly stabilized bundle method for nonsmooth convex optimization, MathProg 156(1) (2016)

181 Any doubts or questions? Just me


More information

Subgradient. Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes. definition. subgradient calculus

Subgradient. Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes. definition. subgradient calculus 1/41 Subgradient Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes definition subgradient calculus duality and optimality conditions directional derivative Basic inequality

More information

Primal/Dual Decomposition Methods

Primal/Dual Decomposition Methods Primal/Dual Decomposition Methods Daniel P. Palomar Hong Kong University of Science and Technology (HKUST) ELEC5470 - Convex Optimization Fall 2018-19, HKUST, Hong Kong Outline of Lecture Subgradients

More information

Lecture 7: September 17

Lecture 7: September 17 10-725: Optimization Fall 2013 Lecture 7: September 17 Lecturer: Ryan Tibshirani Scribes: Serim Park,Yiming Gu 7.1 Recap. The drawbacks of Gradient Methods are: (1) requires f is differentiable; (2) relatively

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

Dual and primal-dual methods

Dual and primal-dual methods ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method

More information

The Newton Bracketing method for the minimization of convex functions subject to affine constraints

The Newton Bracketing method for the minimization of convex functions subject to affine constraints Discrete Applied Mathematics 156 (2008) 1977 1987 www.elsevier.com/locate/dam The Newton Bracketing method for the minimization of convex functions subject to affine constraints Adi Ben-Israel a, Yuri

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

A Tutorial on Primal-Dual Algorithm

A Tutorial on Primal-Dual Algorithm A Tutorial on Primal-Dual Algorithm Shenlong Wang University of Toronto March 31, 2016 1 / 34 Energy minimization MAP Inference for MRFs Typical energies consist of a regularization term and a data term.

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Bundle methods for inexact data

Bundle methods for inexact data Bundle methods for inexact data Welington de Oliveira and Mihail Solodov Abstract Many applications of optimization to real-life problems lead to nonsmooth objective and/or constraint functions that are

More information

LIMITED MEMORY BUNDLE METHOD FOR LARGE BOUND CONSTRAINED NONSMOOTH OPTIMIZATION: CONVERGENCE ANALYSIS

LIMITED MEMORY BUNDLE METHOD FOR LARGE BOUND CONSTRAINED NONSMOOTH OPTIMIZATION: CONVERGENCE ANALYSIS LIMITED MEMORY BUNDLE METHOD FOR LARGE BOUND CONSTRAINED NONSMOOTH OPTIMIZATION: CONVERGENCE ANALYSIS Napsu Karmitsa 1 Marko M. Mäkelä 2 Department of Mathematics, University of Turku, FI-20014 Turku,

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

c 2013 Society for Industrial and Applied Mathematics

c 2013 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 3, No., pp. 109 115 c 013 Society for Industrial and Applied Mathematics AN ACCELERATED HYBRID PROXIMAL EXTRAGRADIENT METHOD FOR CONVEX OPTIMIZATION AND ITS IMPLICATIONS TO SECOND-ORDER

More information

Math 273a: Optimization Overview of First-Order Optimization Algorithms

Math 273a: Optimization Overview of First-Order Optimization Algorithms Math 273a: Optimization Overview of First-Order Optimization Algorithms Wotao Yin Department of Mathematics, UCLA online discussions on piazza.com 1 / 9 Typical flow of numerical optimization Optimization

More information

Dual Proximal Gradient Method

Dual Proximal Gradient Method Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

More information

Penalty and Barrier Methods General classical constrained minimization problem minimize f(x) subject to g(x) 0 h(x) =0 Penalty methods are motivated by the desire to use unconstrained optimization techniques

More information

10. Unconstrained minimization

10. Unconstrained minimization Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Lecture 6: September 12

Lecture 6: September 12 10-725: Optimization Fall 2013 Lecture 6: September 12 Lecturer: Ryan Tibshirani Scribes: Micol Marchetti-Bowick Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not

More information

ε-enlargements of maximal monotone operators: theory and applications

ε-enlargements of maximal monotone operators: theory and applications ε-enlargements of maximal monotone operators: theory and applications Regina S. Burachik, Claudia A. Sagastizábal and B. F. Svaiter October 14, 2004 Abstract Given a maximal monotone operator T, we consider

More information

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University SIAM Conference on Imaging Science, Bologna, Italy, 2018 Adaptive FISTA Peter Ochs Saarland University 07.06.2018 joint work with Thomas Pock, TU Graz, Austria c 2018 Peter Ochs Adaptive FISTA 1 / 16 Some

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

Fast proximal gradient methods

Fast proximal gradient methods L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

THE NEWTON BRACKETING METHOD FOR THE MINIMIZATION OF CONVEX FUNCTIONS SUBJECT TO AFFINE CONSTRAINTS

THE NEWTON BRACKETING METHOD FOR THE MINIMIZATION OF CONVEX FUNCTIONS SUBJECT TO AFFINE CONSTRAINTS THE NEWTON BRACKETING METHOD FOR THE MINIMIZATION OF CONVEX FUNCTIONS SUBJECT TO AFFINE CONSTRAINTS ADI BEN-ISRAEL AND YURI LEVIN Abstract. The Newton Bracketing method [9] for the minimization of convex

More information

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005 3 Numerical Solution of Nonlinear Equations and Systems 3.1 Fixed point iteration Reamrk 3.1 Problem Given a function F : lr n lr n, compute x lr n such that ( ) F(x ) = 0. In this chapter, we consider

More information

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES IJMMS 25:6 2001) 397 409 PII. S0161171201002290 http://ijmms.hindawi.com Hindawi Publishing Corp. A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

More information

Benchmark of Some Nonsmooth Optimization Solvers for Computing Nonconvex Proximal Points

Benchmark of Some Nonsmooth Optimization Solvers for Computing Nonconvex Proximal Points Benchmark of Some Nonsmooth Optimization Solvers for Computing Nonconvex Proximal Points Warren Hare Claudia Sagastizábal February 28, 2006 Key words: nonconvex optimization, proximal point, algorithm

More information

ALGORITHMS FOR MINIMIZING DIFFERENCES OF CONVEX FUNCTIONS AND APPLICATIONS

ALGORITHMS FOR MINIMIZING DIFFERENCES OF CONVEX FUNCTIONS AND APPLICATIONS ALGORITHMS FOR MINIMIZING DIFFERENCES OF CONVEX FUNCTIONS AND APPLICATIONS Mau Nam Nguyen (joint work with D. Giles and R. B. Rector) Fariborz Maseeh Department of Mathematics and Statistics Portland State

More information

Parallel Coordinate Optimization

Parallel Coordinate Optimization 1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Newton s Method. Ryan Tibshirani Convex Optimization /36-725 Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x

More information

Lecture 14: October 17

Lecture 14: October 17 1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

A STABILIZED SQP METHOD: SUPERLINEAR CONVERGENCE

A STABILIZED SQP METHOD: SUPERLINEAR CONVERGENCE A STABILIZED SQP METHOD: SUPERLINEAR CONVERGENCE Philip E. Gill Vyacheslav Kungurtsev Daniel P. Robinson UCSD Center for Computational Mathematics Technical Report CCoM-14-1 June 30, 2014 Abstract Regularized

More information

10 Numerical methods for constrained problems

10 Numerical methods for constrained problems 10 Numerical methods for constrained problems min s.t. f(x) h(x) = 0 (l), g(x) 0 (m), x X The algorithms can be roughly divided the following way: ˆ primal methods: find descent direction keeping inside

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

Convex Analysis Background

Convex Analysis Background Convex Analysis Background John C. Duchi Stanford University Park City Mathematics Institute 206 Abstract In this set of notes, we will outline several standard facts from convex analysis, the study of

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

Proximal splitting methods on convex problems with a quadratic term: Relax!

Proximal splitting methods on convex problems with a quadratic term: Relax! Proximal splitting methods on convex problems with a quadratic term: Relax! The slides I presented with added comments Laurent Condat GIPSA-lab, Univ. Grenoble Alpes, France Workshop BASP Frontiers, Jan.

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization

More information