10-725/36-725: Convex Optimization                                    Fall 2015
Lecture 20: November 17
Lecturer: Ryan Tibshirani    Scribes: Varsha Chinnaobireddy, Joon Sik Kim, Lingyao Zhang

Note: LaTeX template courtesy of UC Berkeley EECS dept.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

20.1 Background of Coordinate Descent

We have studied many sophisticated methods for solving convex minimization problems: gradient descent, proximal gradient descent, stochastic gradient descent, Newton's method, quasi-Newton methods, proximal Newton's method, the barrier method, and the primal-dual interior point method. All of these methods update every coordinate of the variable simultaneously. But the coordinates need not be equally important: one coordinate may influence the criterion value more than the others do. What if we instead minimize the criterion over one coordinate at a time? We first answer the following questions.

Q: Given a convex, differentiable function f : ℝ^n → ℝ, if we are at a point x such that f(x) is minimized along each coordinate axis, have we found a global minimizer? That is, does f(x + δe_i) ≥ f(x) for all δ and all i imply f(x) = min_z f(z)? Here e_i = (0, ..., 1, ..., 0) ∈ ℝ^n is the i-th standard basis vector.

A: Yes! Proof: since x minimizes f along every coordinate axis, each partial derivative is zero, so

    0 = ∇f(x) = ( ∂f/∂x_1 (x), ..., ∂f/∂x_n (x) ),    (20.1)

and for a convex differentiable function, a zero gradient implies a global minimizer.

Q: Same question, but now for f convex and not differentiable?

A: No. See the counterexample in Figure 20.1. At the intersection of the two red lines, where f is not differentiable, moving along either axis only increases the criterion value, yet this point is not a global minimum.

Q: Same question again, but now for f(x) = g(x) + h(x) = g(x) + Σ_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex? (Here the non-smooth part is called separable.)

A: Yes! Proof: we want to show that for any y ∈ ℝ^n,

    f(y) − f(x) ≥ 0.    (20.2)

We know that

    f(x + δe_i) = g(x + δe_i) + Σ_{j≠i} h_j(x_j) + h_i(x_i + δ).    (20.3)
Figure 20.1: A counterexample

Since x is optimal along the i-th axis, subgradient optimality gives

    0 ∈ ∇_i g(x) + ∂h_i(x_i),    (20.4)

i.e. −∇_i g(x) ∈ ∂h_i(x_i). By the definition of the subgradient, this means

    h_i(y_i) − h_i(x_i) ≥ −∇_i g(x)(y_i − x_i),  i.e.  ∇_i g(x)(y_i − x_i) + h_i(y_i) − h_i(x_i) ≥ 0.

Since g is convex, the first-order characterization g(y) ≥ g(x) + ∇g(x)^T (y − x) gives

    f(y) − f(x) ≥ ∇g(x)^T (y − x) + Σ_{i=1}^n [h_i(y_i) − h_i(x_i)]    (20.5)
               = Σ_{i=1}^n [∇_i g(x)(y_i − x_i) + h_i(y_i) − h_i(x_i)] ≥ 0.

20.2 Coordinate Descent

For the problem

    min_x f(x)    (20.6)

where f(x) = g(x) + Σ_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex, we can use coordinate descent:
Let x^(0) ∈ ℝ^n, and for k = 1, 2, 3, ... repeat

    x_i^(k) = argmin_{x_i} f(x_1^(k), ..., x_{i−1}^(k), x_i, x_{i+1}^(k−1), ..., x_n^(k−1)),    i = 1, 2, ..., n.

Note that we always use the most recent information possible. Tseng [4] proves that for such f (provided f is continuous on the compact set {x : f(x) ≤ f(x^(0))} and f attains its minimum), every limit point of x^(k), k = 1, 2, 3, ..., is a minimizer of f. Some useful and important notes on coordinate descent:

1. The order of the cycle through the coordinates is arbitrary; any permutation of {1, 2, ..., n} can be used.
2. Individual coordinates can everywhere be replaced by blocks of coordinates, i.e., we can update a group of coordinates at a time.
3. The one-at-a-time update scheme is critical; the all-at-once scheme does not necessarily converge.
4. The analogy for solving linear systems: the Gauss-Seidel method versus the Jacobi method.

20.3 Examples of Coordinate Descent

20.3.1 Linear Regression

For classical linear regression we consider

    min_β (1/2) ‖y − Xβ‖_2^2    (20.7)

where y ∈ ℝ^n and X ∈ ℝ^{n×p}. Take the gradient of the objective with respect to β_i (the i-th element of β), with all β_j, j ≠ i fixed, and set it to zero to get the update:

    X_i^T (Xβ − y) = X_i^T X_i β_i + X_i^T (X_{−i} β_{−i} − y) = 0  ⟹  β_i ← X_i^T (y − X_{−i} β_{−i}) / (X_i^T X_i),    (20.8)

where X_{−i} and β_{−i} denote X and β with the i-th column and element removed, respectively. Repeat this update for i = 1, 2, ..., p, 1, 2, .... These are exactly Gauss-Seidel updates.

Remark. The computational cost (in flops) of one cycle of coordinate descent is O(np): O(n) to compute X_i^T (y − X_{−i} β_{−i}) for each update in a cycle. This is the same as the cost of one iteration of gradient descent.

20.3.2 LASSO Regression

For the classical LASSO we consider

    min_β (1/2) ‖y − Xβ‖_2^2 + λ ‖β‖_1    (20.9)

where y ∈ ℝ^n and X ∈ ℝ^{n×p}. Notice that we can use coordinate descent here, since the regularizer decomposes as a sum of convex functions: ‖β‖_1 = Σ_{i=1}^p |β_i|.
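Before deriving the lasso update, the linear-regression updates (20.8) are simple enough to implement directly. Here is a minimal NumPy sketch; the function name and the residual-tracking trick (which keeps each coordinate update at O(n) cost) are our own choices, not from the notes:

```python
import numpy as np

def coordinate_descent_ls(X, y, n_cycles=500):
    """Cyclic coordinate descent for min_beta (1/2)||y - X beta||_2^2.

    Each step applies update (20.8):
    beta_i <- X_i^T (y - X_{-i} beta_{-i}) / (X_i^T X_i).
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                      # residual, maintained so each update is O(n)
    for _ in range(n_cycles):
        for i in range(p):
            Xi = X[:, i]
            # y - X_{-i} beta_{-i} is the residual with X_i beta_i added back
            new_i = Xi @ (r + Xi * beta[i]) / (Xi @ Xi)
            r += Xi * (beta[i] - new_i)   # keep the residual in sync
            beta[i] = new_i
    return beta
```

On a well-conditioned problem these cycles converge quickly to the least-squares solution; for highly correlated columns the same iterations still converge, just more slowly.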
Take the subgradient of the objective with respect to β_i, with all β_j, j ≠ i fixed, and set it to zero to get the update:

    X_i^T X_i β_i + X_i^T (X_{−i} β_{−i} − y) + λ s_i = 0  ⟹  β_i ← S_{λ/‖X_i‖_2^2}( X_i^T (y − X_{−i} β_{−i}) / (X_i^T X_i) ),    (20.10)
where s_i ∈ ∂|β_i| and S_λ is the soft-thresholding operator,

    [S_λ(β)]_i = { β_i − λ   if β_i > λ
                 { 0         if −λ ≤ β_i ≤ λ
                 { β_i + λ   if β_i < −λ.

Repeat this update for i = 1, 2, ..., p, 1, 2, ....

20.3.3 Box-constrained QP

A box-constrained QP has the form

    min_x (1/2) x^T Q x + b^T x  subject to  l ≤ x ≤ u    (20.11)

for b ∈ ℝ^n, Q ∈ S^n_+. Notice that we can use coordinate descent here, since the constraint decomposes into element-wise convex constraints: I(l ≤ x ≤ u) = Σ_{i=1}^n I(l_i ≤ x_i ≤ u_i), with I the indicator function. Taking the gradient of the objective with respect to x_i, with all x_j, j ≠ i fixed, setting it to zero, and projecting onto the constraint gives the update:

    x_i ← T_{[l_i, u_i]}( −( b_i + Σ_{j≠i} Q_{ij} x_j ) / Q_{ii} ),    (20.12)

where T_{[l_i, u_i]} is the projection operator onto the interval [l_i, u_i], which clips the value:

    T_{[l,u]}(z) = { u   if z > u
                  { z   if l ≤ z ≤ u
                  { l   if z < l.

Repeat this update for i = 1, 2, ..., n, 1, 2, ....

20.3.4 Support Vector Machines

Consider the SVM dual objective:

    min_α (1/2) α^T X̃ X̃^T α − 1^T α  subject to  0 ≤ α ≤ C·1,  α^T y = 0,    (20.13)

where X̃ = diag(y) X. Platt [3] introduces Sequential Minimal Optimization (SMO), a blockwise coordinate descent method that uses greedy heuristics to select the next block of 2 coordinates instead of simple cycling. SMO repeats the following updates:

1. Greedily choose a block of two indices i and j such that α_i, α_j violate the complementary slackness conditions

    α_i ( 1 − ξ_i − (X̃β)_i − y_i β_0 ) = 0,    (C − α_i) ξ_i = 0,

where β, β_0, ξ are the primal variables; that is, select two indices (according to some heuristic) for which these conditions fail.

2. Minimize the objective over the two chosen variables while keeping all others fixed. (Because of the equality constraint α^T y = 0, the pair moves along a line, and this two-variable subproblem can be solved in closed form.)

For more recent work on coordinate descent methods for SVMs, see [2].
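As a concrete illustration of the lasso updates (20.10) together with the soft-thresholding operator S_λ from Section 20.3.2, here is a minimal NumPy sketch in the same style as before; the function names and residual bookkeeping are our own, not from the notes:

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise soft-thresholding operator S_lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def coordinate_descent_lasso(X, y, lam, n_cycles=500):
    """Cyclic coordinate descent for min_beta (1/2)||y - X beta||_2^2 + lam*||beta||_1.

    Each step applies update (20.10):
    beta_i <- S_{lam/||X_i||^2}( X_i^T (y - X_{-i} beta_{-i}) / (X_i^T X_i) ).
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                          # residual y - X beta (beta starts at 0)
    for _ in range(n_cycles):
        for i in range(p):
            Xi = X[:, i]
            sq = Xi @ Xi                  # ||X_i||_2^2
            z = Xi @ (r + Xi * beta[i]) / sq
            new_i = soft_threshold(z, lam / sq)
            r += Xi * (beta[i] - new_i)
            beta[i] = new_i
    return beta
```

At the solution, subgradient optimality gives |X_i^T (y − Xβ)| ≤ λ for every i, with equality (and the matching sign) on the nonzero coordinates; this is exactly the KKT check used by the active-set strategy of Section 20.5.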
20.4 History of Coordinate Descent

Until Friedman et al. 2007 [1], coordinate descent was considered an interesting toy method. This may be because people were implementing the Jacobi version of it, without distinguishing between one-at-a-time and all-at-once updates.

20.4.1 Why is coordinate descent used today?

Coordinate descent is very simple and easy to implement, and it can achieve state-of-the-art performance when implemented with the tricks described in the next section. This is especially true for functions consisting of a quadratic part plus a separable part, either directly or under proximal Newton. Examples: lasso regression, lasso GLMs (under proximal Newton), SVMs, group lasso, graphical lasso (applied to the dual), etc.

20.5 Implementation Tricks: Pathwise Coordinate Descent

Pathwise coordinate descent for the lasso has the following structure.

Outer loop (pathwise strategy): the idea is to go from a sparse solution to a dense one.

- Compute the solution over a sequence λ_1 > λ_2 > ... > λ_r of tuning parameter values.
- For tuning parameter value λ_k, initialize the coordinate descent algorithm at the computed solution for λ_{k−1} (warm start).

Inner loop (active set strategy): this is efficient since we only work with the active set.

- Perform one coordinate cycle (or a small number of cycles), and record the active set A of coefficients that are nonzero.
- Cycle over only the coefficients in A until convergence.
- Check the KKT conditions over all coefficients; if they are not all satisfied, add the offending coefficients to A and go back one step.

Pathwise coordinate descent combined with screening rules makes practical coordinate descent very efficient.

20.6 Coordinate Gradient Descent

For a smooth function f, the iterations

    x_i^(k) = x_i^(k−1) − t_{k,i} · ∇_i f(x_1^(k), ..., x_{i−1}^(k), x_i^(k−1), x_{i+1}^(k−1), ..., x_n^(k−1)),    i = 1, ..., n,    (20.14)

for k = 1, 2, 3, ... are called coordinate gradient descent, and when f = g + h, with g smooth and h = Σ_{i=1}^n h_i, the iterations

    x_i^(k) = prox_{h_i, t_{k,i}}( x_i^(k−1) − t_{k,i} · ∇_i g(x_1^(k), ..., x_{i−1}^(k), x_i^(k−1), x_{i+1}^(k−1), ..., x_n^(k−1)) ),    i = 1, ..., n,    (20.15)
for k = 1, 2, 3, ... are called coordinate proximal gradient descent. When g is quadratic, (proximal) coordinate gradient descent is the same as coordinate descent under a proper choice of step size. Roughly speaking, theory suggests that the convergence results for coordinate descent are similar to those for proximal gradient descent.

References

[1] Jerome Friedman, Trevor Hastie, Holger Höfling, Robert Tibshirani, et al. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302-332, 2007.

[2] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408-415. ACM, 2008.

[3] John Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. 1998.

[4] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475-494, 2001.