Math 273a: Optimization
Subgradient Methods
Instructor: Wotao Yin, Department of Mathematics, UCLA, Fall 2015
Online discussions on piazza.com
Nonsmooth convex function

Recall: for $\bar{x} \in \mathbb{R}^n$,
$$\partial f(\bar{x}) := \{ g \in \mathbb{R}^n : f(y) \ge f(\bar{x}) + \langle g, y - \bar{x} \rangle \ \forall y \}.$$
If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$ is a singleton.
For any $x, y$, $g_x \in \partial f(x)$, and $g_y \in \partial f(y)$, we have $\langle g_x - g_y, x - y \rangle \ge 0$ ($\partial f$ is a monotone point-to-set map).
Which functions have subgradients?

Theorem (Nesterov Thm 3.1.13)
Let $f$ be a closed convex function and $x_0 \in \operatorname{int}(\operatorname{dom}(f))$. Then $\partial f(x_0)$ is a nonempty bounded set.

The proof uses supporting hyperplanes of the epigraph to show existence, and local Lipschitz continuity of convex functions to show boundedness.

This and the next slide are taken from Damek's lecture. Figure taken from Boyd and Vandenberghe, http://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf.
The converse

Lemma (Nesterov Lm 3.1.6)
If $\partial f(x) \ne \emptyset$ for all $x \in \operatorname{dom}(f)$, then $f$ is convex.

Proof. Let $x, y \in \operatorname{dom}(f)$, $\alpha \in [0,1]$, $y_\alpha = (1-\alpha)x + \alpha y = x + \alpha(y - x)$, and $g \in \partial f(y_\alpha)$. Then
$$f(y) \ge f(y_\alpha) + \langle g, y - y_\alpha \rangle = f(y_\alpha) + (1-\alpha)\langle g, y - x \rangle, \quad (1)$$
$$f(x) \ge f(y_\alpha) + \langle g, x - y_\alpha \rangle = f(y_\alpha) - \alpha \langle g, y - x \rangle. \quad (2)$$
Multiply equation (1) by $\alpha$ and equation (2) by $(1-\alpha)$. Add them together to get $\alpha f(y) + (1-\alpha) f(x) \ge f(y_\alpha)$.
Compute subgradients: general rules

- Differentiable functions: $\partial f(x) = \{\nabla f(x)\}$.
- Composition with an affine map: $\varphi(x) = f(Ax + b)$ gives $\partial \varphi(x) = A^T \partial f(Ax + b)$.
- Positive sums: for $\alpha, \beta > 0$ and $f(x) = \alpha f_1(x) + \beta f_2(x)$, $\partial f(x) = \alpha\, \partial f_1(x) + \beta\, \partial f_2(x)$.
- Maximums: $f(x) = \max_{i \in \{1,\dots,n\}} f_i(x)$ gives $\partial f(x) = \operatorname{conv}\{ \partial f_i(x) : f_i(x) = f(x) \}$.
Examples

$f(x) = |x|$:
$$\partial f(x) = \begin{cases} \{\operatorname{sign}(x)\}, & x \ne 0; \\ [-1, 1], & x = 0. \end{cases}$$

Figure taken from Boyd and Vandenberghe, http://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf.
Examples

$f(x) = \sum_{i=1}^n |\langle a_i, x \rangle - b_i|$. Define
$$I_-(x) = \{ i : \langle a_i, x \rangle - b_i < 0 \}, \quad I_+(x) = \{ i : \langle a_i, x \rangle - b_i > 0 \}, \quad I_0(x) = \{ i : \langle a_i, x \rangle - b_i = 0 \}.$$
Then
$$\partial f(x) = \sum_{i \in I_+(x)} a_i - \sum_{i \in I_-(x)} a_i + \sum_{i \in I_0(x)} [-a_i, a_i].$$

$f(x) = \max_{i \in \{1,\dots,n\}} x_i$. Then
$$\partial f(x) = \operatorname{conv}\{ e_i : x_i = f(x) \}, \qquad \partial f(0) = \operatorname{conv}\{ e_i : i \in \{1,\dots,n\} \}.$$
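As an illustration, the index-set formula for $f(x) = \sum_i |\langle a_i, x \rangle - b_i|$ can be turned into a small routine that returns one valid subgradient. This is a minimal sketch assuming NumPy; the function name is illustrative, and picking $0$ from each interval $[-a_i, a_i]$ on $I_0$ is just one of many valid selections.

```python
import numpy as np

def subgrad_abs_residuals(A, b, x, tol=1e-12):
    """One subgradient of f(x) = sum_i |<a_i, x> - b_i|:
    +a_i on I+, -a_i on I-, and 0 (a valid pick from [-a_i, a_i]) on I0."""
    r = A @ x - b
    s = np.sign(r)
    s[np.abs(r) < tol] = 0.0   # treat near-zero residuals as the active set I0
    return A.T @ s
```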
Examples

$f(x) = \|x\|_2$. $f$ is differentiable away from $0$, so
$$\partial f(x) = \left\{ \frac{x}{\|x\|_2} \right\}, \quad x \ne 0.$$
At $0$, go back to the subgradient inequality: $g \in \partial f(0)$ if and only if $\|y\|_2 \ge 0 + \langle g, y - 0 \rangle$ for all $y$, i.e., $\frac{\langle g, y \rangle}{\|y\|_2} \le 1$ for all $y \ne 0$. Thus $g$ is in the dual ball: $\partial f(0) = B_2(0, 1)$. This is a common pattern!
Examples

$f(x) = \|x\|_\infty = \max_{i \in \{1,\dots,n\}} |x_i|$. For $x \ne 0$,
$$\partial f(x) = \operatorname{conv}\{ \operatorname{sign}(x_i)\, e_i : |x_i| = f(x) \}.$$
Going back to the subgradient inequality at $0$: $g \in \partial f(0)$ if and only if $\|y\|_\infty \ge 0 + \langle g, y \rangle$ for all $y$, i.e., $\frac{\langle g, y \rangle}{\|y\|_\infty} \le 1$ for all $y \ne 0$. Thus $\partial f(0)$ is the dual ball to the $\ell_\infty$ norm: $B_1(0, 1)$.
Examples

$f(x) = \|x\|_1 = \sum_{i=1}^n |x_i|$. Then
$$\partial f(x) = \sum_{x_i > 0} e_i - \sum_{x_i < 0} e_i + \sum_{x_i = 0} [-e_i, e_i]$$
for all $x$. In particular,
$$\partial f(0) = \sum_{i=1}^n [-e_i, e_i] = B_\infty(0, 1).$$
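A quick numerical sanity check of the defining inequality for the $\ell_1$ example: $g = \operatorname{sign}(x)$ (with $0$ at zero entries) lies in $\partial \|x\|_1$, so $f(y) \ge f(x) + \langle g, y - x \rangle$ should hold at every $y$. A minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda z: np.abs(z).sum()        # f(z) = ||z||_1
x = rng.standard_normal(5)
g = np.sign(x)                       # one subgradient of ||.||_1 at x
# the subgradient inequality must hold for every y (checked on random samples)
ok = all(f(y) >= f(x) + g @ (y - x) - 1e-12
         for y in rng.standard_normal((1000, 5)))
```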
Optimality condition: $0 \in \partial f(x)$ if and only if $x$ is a minimizer

Suppose that $0 \in \partial f(x)$.
- If $f$ is smooth and convex, $0 \in \partial f(x) = \{\nabla f(x)\}$ implies $\nabla f(x) = 0$.
- In general: if $0 \in \partial f(x)$, then $f(y) \ge f(x) + \langle 0, y - x \rangle = f(x)$ for all $y \in \mathbb{R}^n$, so $x$ is a minimizer!
- Converse also true: $f(y) \ge f(x^*) = f(x^*) + \langle 0, y - x^* \rangle$ for all $y$ implies $0 \in \partial f(x^*)$.
The subgradient method

Iteration: $x^{k+1} = x^k - \alpha_k g^k$, where $g^k \in \partial f(x^k)$.

Questions:
- Applications?
- Are $f(x^k)$ and $\|x^k - x^*\|$ monotonic?
- How to choose $\alpha_k$?
Applications

- Finding a point in the intersection of closed convex sets $C_1, \dots, C_m$:
  minimize $f(x) = \max\{ \operatorname{dist}(x, C_1), \dots, \operatorname{dist}(x, C_m) \}$.
  Subgradient: if $f(x) = \operatorname{dist}(x, C_j)$ and $x \notin C_j$, then
  $$g = \frac{x - \operatorname{Proj}_{C_j}(x)}{\| x - \operatorname{Proj}_{C_j}(x) \|} = \frac{x - \operatorname{Proj}_{C_j}(x)}{\operatorname{dist}(x, C_j)}.$$
- Minimizing non-smooth convex functions, e.g., piecewise-linear convex functions.
- Dual subgradient method (generalizes the Uzawa algorithm), dual decomposition (more to come in this lecture).
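The set-intersection application can be sketched in a few lines. This is a minimal example assuming NumPy, with two Euclidean balls as the (illustrative) sets; since their intersection is nonempty, $f^* = 0$, and with the unit-norm subgradient above the Polyak step discussed later in this lecture reduces to projecting onto the farthest set.

```python
import numpy as np

def proj_ball(x, c, r):
    """Euclidean projection onto the ball B(c, r)."""
    d = np.linalg.norm(x - c)
    return x if d <= r else c + (r / d) * (x - c)

# two example balls whose intersection is nonempty (so f* = 0)
centers = [np.array([0.0, 0.0]), np.array([1.5, 0.0])]
radius = 1.0

x = np.array([5.0, 4.0])
for _ in range(100):
    projs = [proj_ball(x, c, radius) for c in centers]
    dists = [np.linalg.norm(x - p) for p in projs]
    j = int(np.argmax(dists))            # the farthest set achieves the max in f
    if dists[j] < 1e-9:
        break                            # x is (numerically) in every set
    g = (x - projs[j]) / dists[j]        # subgradient of f at x, ||g|| = 1
    x = x - dists[j] * g                 # Polyak step with f* = 0
```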
Convergence overview

- Typically, convergence is established by identifying a monotonically nonincreasing sequence, such as $f(x^k) - f^*$ or $\|x^k - x^*\|^2$.
- However, since the subgradient $g(x)$ is not continuous in $x$, neither sequence is monotonic here.
- Instead, we will define the total descent and use it to bound $f^k_{\text{best}}$.
- The choice of step sizes $\alpha_k$ is critical.
Monotonicity of $f(x^k)$?

The definition $f(y) \ge f(x) + \langle g, y - x \rangle$, $g \in \partial f(x)$, applied at $x^{k+1}$, yields
$$f(x^{k+1}) \le f(x^k) - \langle g^{k+1}, x^k - x^{k+1} \rangle = f(x^k) - \alpha_k \langle g^{k+1}, g^k \rangle.$$
It is generally difficult to estimate $\langle g^{k+1}, g^k \rangle$ since $g$ is not continuous. (No matter how close $x^{k+1}$ is to $x^k$, their subgradients $g^{k+1}$ and $g^k$ can be very different.) Therefore, we cannot guarantee $f(x^{k+1}) < f(x^k)$.

Note: taking the implicit iteration $x^{k+1} = x^k - \alpha_k g^{k+1}$, $g^{k+1} \in \partial f(x^{k+1})$ (the proximal method), we would ensure $f(x^{k+1}) \le f(x^k)$. It is more expensive to compute, though.
Monotonicity of $\|x^k - x^*\|^2$?

Let us assume that $x^*$ exists. Then
$$\|x^{k+1} - x^*\|^2 = \|(x^k - x^*) - \alpha_k g^k\|^2 = \|x^k - x^*\|^2 - 2\alpha_k \langle g^k, x^k - x^* \rangle + \alpha_k^2 \|g^k\|^2.$$
To have monotonicity $\|x^{k+1} - x^*\|^2 \le \|x^k - x^*\|^2$, we need
$$-2\alpha_k \langle g^k, x^k - x^* \rangle + \alpha_k^2 \|g^k\|^2 \le 0 \iff \langle g^k, x^k - x^* \rangle \ge \frac{\alpha_k}{2} \|g^k\|^2.$$
However, even as $x^k \to x^*$, $g^k$ may not vanish. Therefore, $\|x^k - x^*\|^2$ is generally not monotonic unless $\|g^k\| \le G$ (commonly assumed for the subgradient method) and $\alpha_k = O(\|x^k - x^*\|)$, which is unrealistic to ensure since $x^*$ is unknown.
However, it is often easy to have an estimate of $f^*$. For example, $f^* = 0$ in many applications. The definition $f(y) \ge f(x) + \langle g, y - x \rangle$, $g \in \partial f(x)$, yields
$$\alpha_k \langle g^k, x^k - x^* \rangle \ge \alpha_k (f(x^k) - f^*) \ge 0.$$
Interpretation: the term $\alpha_k \langle g^k, x^k - x^* \rangle$ guarantees a sufficient descent of at least $\alpha_k (f(x^k) - f^*)$. However, the ascent term $\alpha_k^2 \|g^k\|^2$ can be as large as $\alpha_k^2 G^2$.
After substitution, we get the bound
$$\|x^{k+1} - x^*\|^2 \le \|x^k - x^*\|^2 - 2\alpha_k (f(x^k) - f^*) + \alpha_k^2 \|g^k\|^2.$$
A telescoping sum over $k$ gives
$$\|x^{k+1} - x^*\|^2 + 2 \sum_{i=0}^k \alpha_i (f(x^i) - f^*) \le \|x^0 - x^*\|^2 + \sum_{i=0}^k \alpha_i^2 \|g^i\|^2,$$
which bounds the total descent $\sum_{i=0}^k \alpha_i (f(x^i) - f^*)$ by the total ascent $\sum_{i=0}^k \alpha_i^2 \|g^i\|^2$. Clearly, the $\alpha_i$ play the key role.
Step size and convergence

By
$$\|x^{k+1} - x^*\|^2 + 2 \sum_{i=0}^k \alpha_i (f(x^i) - f^*) \le \|x^0 - x^*\|^2 + \sum_{i=0}^k \alpha_i^2 \|g^i\|^2,$$
and letting
- $f^k_{\text{best}} = \min\{ f(x^i) : i = 0, 1, \dots, k \}$ (thus $f^k_{\text{best}} - f^* \le f(x^i) - f^*$ for all $i \le k$),
- $\|g\| \le G$ (equivalent to $f$ being Lipschitz: $|f(x) - f(y)| \le G \|x - y\|$ for all $x, y$),

we have
$$f^k_{\text{best}} - f^* \le \frac{\|x^0 - x^*\|^2 + G^2 \sum_{i=0}^k \alpha_i^2}{2 \sum_{i=0}^k \alpha_i}.$$
We need $\sum_{i=0}^k \alpha_i$ unbounded and $\sum_{i=0}^k \alpha_i^2$ bounded (or growing more slowly) as $k \to \infty$.

To have $f^k_{\text{best}} - f^* \to 0$, we require, for example,
$$\sum_{i=0}^\infty \alpha_i = \infty \quad \text{and} \quad \sum_{i=0}^\infty \alpha_i^2 < \infty;$$
or, more weakly,
$$\sum_{i=0}^\infty \alpha_i = \infty \quad \text{and} \quad \lim_{k \to \infty} \alpha_k = 0 \quad \text{(the truncation trick)}.$$
Otherwise, $f^k_{\text{best}} \not\to f^*$ in general.
Fixed step size

Fixing $\alpha_k \equiv \alpha$, we get
$$f^k_{\text{best}} - f^* \le \frac{\|x^0 - x^*\|^2 + G^2 \sum_{i=0}^k \alpha^2}{2 \sum_{i=0}^k \alpha} = \frac{\|x^0 - x^*\|^2}{2\alpha(k+1)} + \frac{\alpha G^2}{2}.$$
- As $k \to \infty$: $f^k_{\text{best}} - f^* \to \alpha G^2 / 2 = O(\alpha)$.
- In the early stage, while $k < \frac{\|x^0 - x^*\|^2}{\alpha^2 G^2}$, we have
$$\frac{\|x^0 - x^*\|^2}{2\alpha(k+1)} + \frac{\alpha G^2}{2} \le \frac{\|x^0 - x^*\|^2}{\alpha(k+1)},$$
and thus the (non-asymptotic, conditional) rate of convergence is $O(\frac{1}{\alpha k})$.
- Larger $\alpha$: faster convergence, lower final accuracy. Smaller $\alpha$: slower convergence, higher final accuracy.
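The accuracy floor is easy to observe numerically. A minimal sketch, assuming NumPy, on the illustrative problem $f(x) = |x|$ (so $G = 1$): the iterate walks down toward $0$, then oscillates around it, and $f^k_{\text{best}}$ stalls at the $O(\alpha)$ level instead of reaching $0$.

```python
import numpy as np

alpha, x = 0.1, 1.05             # fixed step size; start away from the minimizer 0
f_best = abs(x)
for _ in range(200):
    x = x - alpha * np.sign(x)   # sign(x) is a subgradient of |x| (0 at x = 0)
    f_best = min(f_best, abs(x))
# f_best stalls near alpha * G^2 / 2 = 0.05 rather than converging to f* = 0
```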
Fixed step length $\alpha_k = \alpha / \|g^k\|$

Since each step then has length $\|\alpha_k g^k\| = \alpha$, we have
$$f^k_{\text{best}} - f^* \le \frac{\|x^0 - x^*\|^2 + \sum_{i=0}^k \alpha_i^2 \|g^i\|^2}{2 \sum_{i=0}^k \alpha_i} \le \frac{G \|x^0 - x^*\|^2}{2\alpha(k+1)} + \frac{\alpha G}{2}.$$
- As $k \to \infty$: $f^k_{\text{best}} - f^* \to \alpha G / 2 = O(\alpha)$, slightly better than with a fixed step size.
- While $k < \frac{\|x^0 - x^*\|^2}{\alpha^2}$, we have the (non-asymptotic, conditional) rate of convergence $O(\frac{G}{\alpha k})$.
- Larger $\alpha$: faster convergence, lower final accuracy. Smaller $\alpha$: slower convergence, higher final accuracy.
Polyak step size

Assume that $f^*$ is known (but not $x^*$). Example: $f^* = 0$ in projection problems. Set
$$\alpha_k = \frac{f(x^k) - f^*}{\|g^k\|^2} = \arg\min_\alpha \left\{ \|x^k - x^*\|^2 - 2\alpha (f(x^k) - f^*) + \alpha^2 \|g^k\|^2 \right\}.$$
Then
$$\|x^k - x^*\|^2 - 2\alpha_k (f(x^k) - f^*) + \alpha_k^2 \|g^k\|^2 = \|x^k - x^*\|^2 - \frac{(f(x^k) - f^*)^2}{\|g^k\|^2}$$
(so $\|x^k - x^*\|^2$ is monotonic), and thus, summing over $k$,
$$\|x^{k+1} - x^*\|^2 \le \|x^0 - x^*\|^2 - \frac{1}{G^2} \sum_{i=0}^k (f(x^i) - f^*)^2.$$
Finally,
$$f^k_{\text{best}} - f^* \le \frac{G \|x^0 - x^*\|}{\sqrt{k+1}} = O\left(\frac{1}{\sqrt{k}}\right).$$
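The Polyak step rule might be sketched as follows (assuming NumPy; the function name is illustrative, and the $\ell_1$ test problem with $f^* = 0$ is only an example):

```python
import numpy as np

def polyak_subgradient(f, subgrad, x0, f_star, iters=500):
    """Subgradient method with Polyak's step alpha_k = (f(x^k) - f*) / ||g^k||^2."""
    x = np.asarray(x0, dtype=float)
    f_best = f(x)
    for _ in range(iters):
        fx, g = f(x), subgrad(x)
        f_best = min(f_best, fx)
        gg = g @ g
        if fx - f_star <= 0 or gg == 0:
            break                            # already optimal (or zero subgradient)
        x = x - ((fx - f_star) / gg) * g     # Polyak step
    return x, f_best

# example: f(x) = ||x||_1 with f* = 0; sign(x) is a valid subgradient choice
x, f_best = polyak_subgradient(lambda z: np.abs(z).sum(), np.sign,
                               [3.0, -2.0], f_star=0.0)
```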
Oracle optimality

For an $O(1/\sqrt{k})$ method to guarantee $f^k_{\text{best}} - f^* \le \epsilon$, we need $O(1/\epsilon^2)$ iterations. Is this tight for the subgradient method? Answer: yes. Suppose $x^{k+1}$ is computed by an arbitrary method as
$$x^{k+1} \in x^0 + \operatorname{span}\{ g^0, g^1, \dots, g^k \},$$
where the oracle returns arbitrary $g^k \in \partial f(x^k)$ and $f(x^k)$.

Theorem (Nesterov Thm 3.2.1)
There is a nonsmooth convex function with $\|g\| \le G$ uniformly so that the above method obeys
$$f(x^k) - f(x^*) \ge \frac{\|x^0 - x^*\| \, G}{2(1 + \sqrt{k+1})}.$$
The subgradient algorithm

$f$ is a proper closed convex function. Problem: minimize$_x$ $f(x)$.

Algorithm:
- pick any starting point $x^0$
- pick $g^k \in \partial f(x^k)$
- set $\alpha_k$ ($\alpha_0/k$, fixed size, fixed length, or Polyak)
- $x^{k+1} \leftarrow x^k - \alpha_k g^k$
- $k \leftarrow k + 1$
- (monitor $f(x^k)$ and $f^k_{\text{best}}$, especially if using fixed size or fixed length)
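The algorithm above, with diminishing steps $\alpha_k = \alpha_0/\sqrt{k+1}$ and $f^k_{\text{best}}$ monitoring, might be sketched as follows. This is a minimal version assuming NumPy; the test problem $f(x) = \|x - b\|_1$ is illustrative (its minimizer is $x^* = b$ with $f^* = 0$).

```python
import numpy as np

def subgradient_method(f, subgrad, x0, alpha0=1.0, iters=5000):
    """Plain subgradient method with diminishing steps alpha_k = alpha0/sqrt(k+1).
    f(x^k) need not decrease, so we track the running best value f_best^k."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(iters):
        x = x - (alpha0 / np.sqrt(k + 1)) * subgrad(x)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# example: minimize f(x) = ||x - b||_1
b = np.array([0.7, -0.3])
x_best, f_best = subgradient_method(lambda z: np.abs(z - b).sum(),
                                    lambda z: np.sign(z - b),
                                    np.zeros(2))
```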
Variant: projected subgradient method

$f$ is a proper closed convex function; $C$ is a nonempty closed convex set. Problem:
$$\text{minimize}_x \ f(x), \quad \text{subject to } x \in C.$$
Pick $g^k \in \partial f(x^k)$ and $\alpha_k$, and apply
$$x^{k+1} = \operatorname{Proj}_C(x^k - \alpha_k g^k).$$
Since projection is nonexpansive,
$$\|\operatorname{Proj}_C(x) - \operatorname{Proj}_C(y)\| \le \|x - y\|, \quad \forall x, y \in \mathbb{R}^n,$$
the analysis remains the same (using $\operatorname{Proj}_C(x^*) = x^*$ since $x^* \in C$):
$$\|x^{k+1} - x^*\|^2 = \|\operatorname{Proj}_C(x^k - \alpha_k g^k) - \operatorname{Proj}_C(x^*)\|^2 \le \|(x^k - \alpha_k g^k) - x^*\|^2 = \|x^k - x^*\|^2 - 2\alpha_k \langle g^k, x^k - x^* \rangle + \alpha_k^2 \|g^k\|^2 = \dots$$
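Combining the projection with the subgradient step gives, for example, this minimal sketch (assuming NumPy): minimize $f(x) = \|x - b\|_1$ over the unit Euclidean ball, with the illustrative data $b = (2, 0)$ outside the ball, so that $x^* = (1, 0)$ and $f^* = 1$.

```python
import numpy as np

def proj_unit_ball(x):
    """Euclidean projection onto C = {x : ||x||_2 <= 1}."""
    n = np.linalg.norm(x)
    return x / n if n > 1 else x

b = np.array([2.0, 0.0])                 # example data; x* = (1, 0), f* = 1
f = lambda z: np.abs(z - b).sum()
x, f_best = np.zeros(2), np.inf
for k in range(3000):
    g = np.sign(x - b)                   # subgradient of f at x
    x = proj_unit_ball(x - g / np.sqrt(k + 1))   # diminishing steps
    f_best = min(f_best, f(x))
```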
Summary for subgradient methods

- Universal: it handles non-differentiable convex problems and, in particular, the dual problem of linearly constrained convex problems (later lectures).
- No monotonicity of either $f(x^k)$ or $\operatorname{dist}(x^k, X^*)$, except with Polyak's step size (which requires known $f^*$).
- Convergence relies on uniformly bounded subgradients (or Lipschitz $f$).
- The rate of convergence of $f^k_{\text{best}} - f^*$ depends on the step size.
- A constant step size (or length) does not guarantee $f^k_{\text{best}} \to f^*$.
- If we need $f^k_{\text{best}} \to f^*$, use diminishing step sizes; the rate is at best $O(1/\sqrt{k})$.
- Convergence is quite slow (but the method is widely applicable).
- Some non-smooth problems have better rates by other methods, e.g., prox-linear iterations, operator splitting, dual smoothing (later lectures).