IE 521: Convex Optimization, Spring 2017, UIUC
Lecture 5: Subgradient Method and Bundle Methods (April 4)
Instructor: Niao He    Scribe: Shuanglong Wang

Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the scribe (swang157@illinois.edu) if you find any typos or mistakes.

In this lecture, we cover the following topics:
- Subgradient Method
- Bundle Methods
  - Kelley cutting plane method
  - Level method

Reference: Nesterov 2004, Chapters 3.2.3 and 3.3.

5.1 Subgradient Method

Recall that the subgradient method works as follows:
\[
x_{t+1} = \Pi_X(x_t - \gamma_t g_t), \quad t = 1, 2, \ldots
\]
where $g_t \in \partial f(x_t)$, $\gamma_t > 0$, and $\Pi_X(x) = \arg\min_{y \in X} \|y - x\|_2$ is the projection operator. Note that the projection onto $X$ is easy to compute when $X$ is simple, e.g. when $X$ is a ball, box, simplex, polyhedron, etc. (see the sketch after Lemma 5.1).

Lemma 5.1 (Projection) For any $x \in \mathbb{R}^n$ and any $z \in X$,
\[
\|x - z\|_2^2 \geq \|x - \Pi_X(x)\|_2^2 + \|z - \Pi_X(x)\|_2^2.
\]

Proof: When $x \in X$, we have $\Pi_X(x) = x$ and the inequality immediately holds true. Let $x \notin X$. By definition,
\[
\Pi_X(x) = \arg\min_{z \in X} \|z - x\|_2^2.
\]
By the optimality condition, this implies
\[
(\Pi_X(x) - x)^\top (z - \Pi_X(x)) \geq 0, \quad \forall z \in X.
\]
Hence,
\[
\|x - z\|_2^2 = \|x - \Pi_X(x) + \Pi_X(x) - z\|_2^2
= \|x - \Pi_X(x)\|_2^2 + 2 (x - \Pi_X(x))^\top (\Pi_X(x) - z) + \|\Pi_X(x) - z\|_2^2
\geq \|x - \Pi_X(x)\|_2^2 + \|\Pi_X(x) - z\|_2^2.
\]
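Since the method leans on $\Pi_X$ being cheap for simple sets, here is a minimal NumPy sketch (our illustration, not from the lecture; function names are ours) of two closed-form projections together with one subgradient step. Projection onto the simplex also admits an efficient sort-based routine, omitted here.

```python
import numpy as np

def proj_box(x, lo, hi):
    """Project x onto the box {y : lo <= y <= hi} (componentwise clipping)."""
    return np.clip(x, lo, hi)

def proj_ball(x, center, radius):
    """Project x onto the Euclidean ball {y : ||y - center||_2 <= radius}."""
    d = x - center
    dist = np.linalg.norm(d)
    return x if dist <= radius else center + radius * d / dist

def subgradient_step(x, g, gamma, proj):
    """One iteration x_{t+1} = Pi_X(x_t - gamma_t * g_t)."""
    return proj(x - gamma * g)

# Example: one step toward minimizing f(x) = ||x||_1 over the unit ball.
x = np.array([0.9, -0.4])
g = np.sign(x)  # a subgradient of ||.||_1 at x
x_next = subgradient_step(x, g, 0.1, lambda y: proj_ball(y, np.zeros(2), 1.0))
```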
Lemma 5.2 (Key relation) For the subgradient method, we have
\[
\|x_{t+1} - x^*\|_2^2 \leq \|x_t - x^*\|_2^2 - 2 \gamma_t \left( f(x_t) - f^* \right) + \gamma_t^2 \|g_t\|_2^2.
\]

Proof:
\[
\|x_{t+1} - x^*\|_2^2 = \|\Pi_X(x_t - \gamma_t g_t) - x^*\|_2^2
\leq \|x_t - \gamma_t g_t - x^*\|_2^2
= \|x_t - x^*\|_2^2 - 2 \gamma_t g_t^\top (x_t - x^*) + \gamma_t^2 \|g_t\|_2^2,
\]
where the inequality follows from Lemma 5.1. Due to convexity of $f$, we have $f^* \geq f(x_t) + g_t^\top (x^* - x_t)$, i.e.
\[
g_t^\top (x_t - x^*) \geq f(x_t) - f^*.
\]
Combining these two inequalities leads to the desired result.

Remark: Note that when $f^*$ is known, we can choose the optimal $\gamma_t$ by minimizing the right hand side of the key relation:
\[
\gamma_t^* = \frac{f(x_t) - f^*}{\|g_t\|_2^2},
\]
which is Polyak's stepsize.

In fact, knowing $f^*$ is not a problem sometimes. For instance, when the goal is to solve the convex feasibility problem
\[
\text{find } x \in X \ \text{ s.t. } g_i(x) \leq 0, \ i = 1, \ldots, m,
\]
we can formulate it as
\[
\min_{x \in X} \max_{1 \leq i \leq m} g_i(x) \quad \text{or} \quad \min_{x \in X} \sum_{i=1}^m \max(g_i(x), 0).
\]
The optimal value $f^*$ is known to be 0 in this case (assuming the system is feasible).

If $f^*$ is not known, one can replace $f^*$ by its online estimate.

Theorem 5.3 (Convergence) Suppose $f(x)$ is convex and Lipschitz continuous on $X$:
\[
|f(x) - f(y)| \leq M_f \|x - y\|_2, \quad \forall x, y \in X,
\]
where $M_f < +\infty$. Then the subgradient method satisfies
\[
f(\hat{x}_T) - f^* \leq \frac{\|x_1 - x^*\|_2^2 + \sum_{t=1}^T \gamma_t^2 M_f^2}{2 \sum_{t=1}^T \gamma_t},
\]
where $\hat{x}_T = \left( \sum_{t=1}^T \gamma_t \right)^{-1} \left( \sum_{t=1}^T \gamma_t x_t \right)$.

Proof: The Lipschitz continuity implies that $\|g_t\|_2 \leq M_f$ for all $t$. Summing up the key relation from $t = 1$ to $t = T$, we obtain
\[
2 \sum_{t=1}^T \gamma_t \left( f(x_t) - f^* \right) \leq \|x_1 - x^*\|_2^2 - \|x_{T+1} - x^*\|_2^2 + \sum_{t=1}^T \gamma_t^2 M_f^2.
\]
By convexity of $f$: $\left( \sum_{t=1}^T \gamma_t \right) f(\hat{x}_T) \leq \sum_{t=1}^T \gamma_t f(x_t)$. This further leads to
\[
\left( \sum_{t=1}^T \gamma_t \right) \left[ f(\hat{x}_T) - f^* \right] \leq \frac{1}{2} \|x_1 - x^*\|_2^2 + \frac{M_f^2}{2} \sum_{t=1}^T \gamma_t^2,
\]
and concludes the proof.
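As a quick numerical sanity check of Theorem 5.3 (our addition, not part of the notes), the following self-contained script runs the projected subgradient method with the weighted average $\hat{x}_T$ on $f(x) = \|x\|_1$ over the box $[-1, 1]^n$, where $f^* = 0$, $M_f = \sqrt{n}$, and $D_X = 2\sqrt{n}$; the printed gap should sit well below the theorem's bound.

```python
import numpy as np

# f(x) = ||x||_1 over X = [-1, 1]^n: f* = 0 at x* = 0, subgradients are
# sign vectors, so M_f = sqrt(n); the box has diameter D_X = 2*sqrt(n).
n, T = 5, 10_000
M_f, D_X = np.sqrt(n), 2 * np.sqrt(n)
gamma = D_X / (M_f * np.sqrt(T))        # constant stepsize tuned to horizon T

x = np.ones(n)                          # x_1: a corner of the box
x_hat_num, gamma_sum = np.zeros(n), 0.0
for t in range(T):
    x_hat_num += gamma * x              # accumulate the weighted average
    gamma_sum += gamma
    g = np.sign(x)                      # g_t in the subdifferential of ||.||_1
    x = np.clip(x - gamma * g, -1.0, 1.0)   # projected subgradient step
x_hat = x_hat_num / gamma_sum

gap = np.abs(x_hat).sum()               # f(x_hat_T) - f*, since f* = 0
print(f"gap = {gap:.5f}  vs  bound D_X*M_f/sqrt(T) = {D_X*M_f/np.sqrt(T):.5f}")
```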
Convergence under various stepsizes

Assume $D_X = \max_{x, y \in X} \|x - y\|_2$ is the diameter of the set $X$. It is interesting to see how the bound in the above theorem implies convergence, and even convergence rates, under different choices of stepsizes. By abuse of notation, we denote both $\min_{1 \leq t \leq T} f(x_t) - f^*$ and $f(\hat{x}_T) - f^*$ as $\epsilon_T$.

1. Constant stepsize: with $\gamma_t \equiv \gamma$,
\[
\epsilon_T \leq \frac{D_X^2 + \gamma^2 M_f^2 T}{2 \gamma T} = \frac{D_X^2}{2 \gamma T} + \frac{M_f^2 \gamma}{2} \;\xrightarrow{T \to \infty}\; \frac{M_f^2 \gamma}{2}.
\]
It is worth noticing that the error upper bound does not diminish to zero as $T$ grows to infinity, which shows one of the drawbacks of using an arbitrary constant stepsize. In addition, to optimize the upper bound, we can select the optimal stepsize $\gamma$:
\[
\gamma = \frac{D_X}{M_f \sqrt{T}} \quad \Longrightarrow \quad \epsilon_T \leq \frac{D_X M_f}{\sqrt{T}}.
\]
Thus, under this optimal choice, $\epsilon_T \leq O(D_X M_f / \sqrt{T})$. However, this exhibits another drawback of the constant stepsize: in practice, $T$ is not known a priori when evaluating the optimal $\gamma$.

2. Scaled stepsize: with $\gamma_t = \frac{\gamma}{\|g(x_t)\|_2}$, using $\|g(x_t)\|_2 \leq M_f$,
\[
\epsilon_T \leq \frac{D_X^2 + \gamma^2 T}{2 \sum_{t=1}^T \gamma / \|g(x_t)\|_2} \leq \frac{M_f}{2} \left( \frac{D_X^2}{\gamma T} + \gamma \right).
\]
Similarly, we can select the optimal $\gamma$ by minimizing the right hand side, i.e. $\gamma = \frac{D_X}{\sqrt{T}}$:
\[
\gamma_t = \frac{D_X}{\|g(x_t)\|_2 \sqrt{T}} \quad \Longrightarrow \quad \epsilon_T \leq \frac{D_X M_f}{\sqrt{T}}.
\]
The same convergence rate is achieved, while the same drawback of not knowing $T$ a priori still exists in choosing $\gamma_t$.

3. Non-summable but diminishing stepsize: $\gamma_t \to 0$ and $\sum_{t=1}^\infty \gamma_t = \infty$. For any $1 \leq T_1 \leq T$,
\[
\epsilon_T \leq \frac{D_X^2 + M_f^2 \sum_{t=1}^T \gamma_t^2}{2 \sum_{t=1}^T \gamma_t}
\leq \frac{D_X^2 + M_f^2 \sum_{t=1}^{T_1} \gamma_t^2}{2 \sum_{t=1}^T \gamma_t} + \frac{M_f^2 \sum_{t=T_1+1}^T \gamma_t^2}{2 \sum_{t=T_1+1}^T \gamma_t}.
\]
As $T \to \infty$, select a large $T_1$: the first term on the right hand side $\to 0$ since $\{\gamma_t\}$ is non-summable, and the second term also $\to 0$ because $\gamma_t^2$ approaches zero faster than $\gamma_t$ (indeed, $\sum_{t > T_1} \gamma_t^2 / \sum_{t > T_1} \gamma_t \leq \max_{t > T_1} \gamma_t \to 0$). Consequently, $\epsilon_T \to 0$.

An example choice of the stepsize is $\gamma_t = O(1/t^q)$ with $q \in (0, 1]$. As in the above cases, if we choose $\gamma_t = \frac{D_X}{M_f \sqrt{t}}$, then
\[
\epsilon_T \leq O\!\left( \frac{D_X M_f \ln T}{\sqrt{T}} \right).
\]
In fact, if we average from $\lceil T/2 \rceil$ instead of $1$, we have
\[
\min_{T/2 \leq t \leq T} f(x_t) - f^* \leq O\!\left( \frac{M_f D_X}{\sqrt{T}} \right).
\]
4. Non-summable but square-summable stepsize: $\sum_{t=1}^\infty \gamma_t = \infty$ and $\sum_{t=1}^\infty \gamma_t^2 < \infty$. It is obvious that
\[
\epsilon_T \leq \frac{D_X^2 + M_f^2 \sum_{t=1}^\infty \gamma_t^2}{2 \sum_{t=1}^T \gamma_t} \to 0.
\]
A typical choice is $\gamma_t = \frac{1}{t^{(1+q)/2}}$ with $q \in (0, 1)$, which results in a rate of $O(1/T^{(1-q)/2})$.

5. Polyak stepsize: $\gamma_t = \frac{f(x_t) - f^*}{\|g(x_t)\|_2^2}$. The stepsize yields
\[
\|x_{t+1} - x^*\|_2^2 \leq \|x_t - x^*\|_2^2 - \frac{\left( f(x_t) - f^* \right)^2}{\|g(x_t)\|_2^2}, \tag{5.1}
\]
which guarantees that $\|x_t - x^*\|_2$ decreases at each step. Applying (5.1) recursively, we obtain
\[
\sum_{t=1}^\infty \left( f(x_t) - f^* \right)^2 \leq D_X^2 M_f^2 < \infty.
\]
Therefore, $\epsilon_T \to 0$ as $T \to \infty$, and $\epsilon_T \leq O(1/\sqrt{T})$. (A worked feasibility example with this stepsize appears after the remark below.)

Corollary 5.4 When $T$ is known, setting $\gamma_t \equiv \frac{D_X}{M_f \sqrt{T}}$, we have
\[
f(\hat{x}_T) - f^* \leq \frac{D_X M_f}{\sqrt{T}}.
\]

Remark: The subgradient method converges sublinearly. For an accuracy $\epsilon > 0$, it needs $O\!\left( \frac{D_X^2 M_f^2}{\epsilon^2} \right)$ iterations.
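To make the Polyak stepsize concrete in the feasibility setting mentioned after Lemma 5.2 (where $f^* = 0$ is known), here is a small self-contained sketch with synthetic data of our own choosing; taking $X = \mathbb{R}^n$ keeps the projection trivial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Convex feasibility: find x with a_i^T x <= b_i, i = 1..m.
# Built to be feasible (b = A x_feas + margin), so f* = 0 is known.
m, n = 50, 10
A = rng.standard_normal((m, n))
x_feas = rng.standard_normal(n)
b = A @ x_feas + 0.1

# f(x) = max_i (a_i^T x - b_i); x is feasible iff f(x) <= 0.
x = 10 * rng.standard_normal(n)     # start far from the feasible set
for t in range(1000):
    viol = A @ x - b
    i = int(np.argmax(viol))
    fx = viol[i]
    if fx <= 0:                     # feasible point found
        break
    g = A[i]                        # subgradient of the active piece
    gamma = fx / (g @ g)            # Polyak stepsize with f* = 0
    x = x - gamma * g               # X = R^n here, so no projection

print(f"stopped at t = {t}, max violation = {np.max(A @ x - b):.3e}")
```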
5.2 Bundle Methods

When running the subgradient method, we obtain a bundle of affine underestimates of $f(x)$:
\[
f(x_t) + g_t^\top (x - x_t), \quad t = 1, 2, \ldots
\]

Definition 5.5 The piecewise linear function
\[
f_t(x) = \max_{1 \leq i \leq t} \left\{ f(x_i) + g_i^\top (x - x_i) \right\},
\]
where $g_i \in \partial f(x_i)$, is called the $t$-th model of the convex function $f$.

Note that:
1. $f_t(x) \leq f(x)$, $\forall x \in X$;
2. $f_t(x_i) = f(x_i)$, $1 \leq i \leq t$;
3. $f_1(x) \leq f_2(x) \leq \ldots \leq f_t(x) \leq \ldots \leq f(x)$.

5.2.1 Kelley Method (Kelley, 1960)

The Kelley method works as follows:
\[
x_{t+1} = \arg\min_{x \in X} f_t(x).
\]
Obviously, the above algorithm converges as long as $X$ is compact. The auxiliary problem is not so disturbing (it reduces to an LP) when $X$ is a polyhedron; a code sketch appears at the end of this subsection. However, the issue is that $x_{t+1}$ is not unique, and the Kelley method can be very unstable. Indeed, the worst-case complexity of the Kelley method is at least of order $(1/\epsilon)^{(n-1)/2}$.

Remedy: To prevent the instability issue, a possible remedy is to update $x_{t+1}$ by
\[
x_{t+1} = \arg\min_{x \in X} \left\{ f_t(x) + \frac{\alpha_t}{2} \|x - x_t\|_2^2 \right\}
\]
with properly selected $\alpha_t > 0$.
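Below is a minimal sketch of the Kelley method for a box-constrained problem, with the auxiliary problem solved as an epigraph LP ($\min_{x, s} s$ subject to each cut lying below $s$) via SciPy; the problem instance and function names are our own assumptions, not from the lecture.

```python
import numpy as np
from scipy.optimize import linprog

def kelley(f, subgrad, lo, hi, x1, T):
    """Kelley cutting-plane method over the box X = [lo, hi]^n.

    Each iteration minimizes the model f_t via the epigraph LP:
        min_{x, s} s   s.t.   f(x_i) + g_i^T (x - x_i) <= s,  lo <= x <= hi.
    """
    n = len(x1)
    cuts = []                        # pairs (g_i, c_i) with c_i = f(x_i) - g_i^T x_i
    x = np.array(x1, dtype=float)
    for _ in range(T):
        g = subgrad(x)
        cuts.append((g, f(x) - g @ x))
        # Variables z = (x, s): minimize s subject to g_i^T x - s <= -c_i.
        obj = np.r_[np.zeros(n), 1.0]
        A_ub = np.array([np.r_[g_i, -1.0] for g_i, _ in cuts])
        b_ub = np.array([-c_i for _, c_i in cuts])
        bounds = [(lo, hi)] * n + [(None, None)]
        res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        x = res.x[:n]                # next query point x_{t+1}
    return x

# Example: f(x) = ||x - a||_1 over X = [-1, 1]^2 (minimizer is a).
a = np.array([0.3, -0.5])
x_T = kelley(lambda x: np.abs(x - a).sum(),
             lambda x: np.sign(x - a), -1.0, 1.0, np.zeros(2), 20)
print(x_T)
```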
5.2.2 Level Method (Lemaréchal, Nemirovski, Nesterov, 1995)

Denote
\[
\underline{f}_t = \min_{x \in X} f_t(x) \quad \text{(minimal value of the model)}, \qquad
\bar{f}_t = \min_{1 \leq i \leq t} f(x_i) \quad \text{(record value)}.
\]
We have
\[
\underline{f}_1 \leq \underline{f}_2 \leq \ldots \leq f^* \leq \ldots \leq \bar{f}_2 \leq \bar{f}_1.
\]
Denote the level set
\[
L_t = \left\{ x \in X : f_t(x) \leq \ell_t := (1 - \alpha) \underline{f}_t + \alpha \bar{f}_t \right\}.
\]
Note that $L_t$ is nonempty, convex, and closed, and does not contain the search points $\{x_1, \ldots, x_t\}$.

The level method works as follows:
\[
x_{t+1} = \Pi_{L_t}(x_t) = \arg\min_x \left\{ \|x - x_t\|_2 : x \in X,\ f_t(x) \leq \ell_t \right\}.
\]
Note that when $\alpha = 0$, the method reduces to the Kelley method; when $\alpha = 1$, there will be no progress. The auxiliary problem reduces to a quadratic program when $X$ is a polyhedron (see the sketch after the remark below).

Theorem 5.6 When
\[
t > \frac{1}{(1 - \alpha)^2 \alpha (2 - \alpha)} \left( \frac{M_f D_X}{\epsilon} \right)^2,
\]
we have $\bar{f}_t - \underline{f}_t \leq \epsilon$, where $M_f$ is the Lipschitz constant of $f$ and $D_X$ is the diameter of the set $X$.

Remark: The level method achieves the same complexity as the subgradient method (which is indeed unimprovable), but it can perform much better than the subgradient method in practice.
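In the same spirit, here is a sketch of the level method for a box $X$ (again our illustration, with our own names and data): the model minimum $\underline{f}_t$ comes from the epigraph LP, and the projection onto $L_t$ is solved as a small QP with SciPy's SLSQP; $\alpha$ is set near $1 - 1/\sqrt{2} \approx 0.29$, the minimizer of the constant in Theorem 5.6.

```python
import numpy as np
from scipy.optimize import linprog, minimize

def level_method(f, subgrad, lo, hi, x1, T, alpha=0.29):
    """Level method over the box X = [lo, hi]^n (illustrative sketch)."""
    n = len(x1)
    x = np.array(x1, dtype=float)
    G, c = [], []                    # cuts: f_t(y) = max_i (G_i @ y + c_i)
    f_rec = np.inf                   # record value \bar{f}_t
    for _ in range(T):
        g = subgrad(x)
        G.append(g.copy()); c.append(f(x) - g @ x)
        f_rec = min(f_rec, f(x))
        A, b = np.array(G), np.array(c)
        # LP for the model minimum \underline{f}_t (epigraph form).
        obj = np.r_[np.zeros(n), 1.0]
        res = linprog(obj, A_ub=np.c_[A, -np.ones(len(b))], b_ub=-b,
                      bounds=[(lo, hi)] * n + [(None, None)])
        lev = (1 - alpha) * res.fun + alpha * f_rec   # level l_t
        # Projection onto L_t: min ||y - x_t||^2 s.t. A y + b <= l_t, y in X.
        cons = [{'type': 'ineq',
                 'fun': lambda y, A=A, b=b, lev=lev: lev - (A @ y + b)}]
        qp = minimize(lambda y: np.sum((y - x) ** 2), x, method='SLSQP',
                      constraints=cons, bounds=[(lo, hi)] * n)
        x = qp.x
    return x, f_rec

# Example: f(x) = ||x - a||_1 over X = [-1, 1]^2.
a = np.array([0.3, -0.5])
x_T, f_rec = level_method(lambda x: np.abs(x - a).sum(),
                          lambda x: np.sign(x - a), -1.0, 1.0,
                          np.zeros(2), 15)
print(x_T, f_rec)
```

Minimizing the squared distance in the QP is equivalent to minimizing the distance itself, and keeps the subproblem smooth for SLSQP.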