arxiv: v1 [cs.lg] 22 Feb 2015


SDCA without Duality

Shai Shalev-Shwartz
School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel

Abstract

Stochastic Dual Coordinate Ascent (SDCA) is a popular method for solving regularized loss minimization for the case of convex losses. In this paper we show how a variant of SDCA can be applied for non-convex losses. We prove a linear convergence rate even if individual loss functions are non-convex, as long as the expected loss is convex.

1 Introduction

The following regularized loss minimization problem is associated with many machine learning methods:

$$\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(w) + \frac{\lambda}{2} \|w\|^2 .$$

One of the most popular methods for solving this problem is Stochastic Dual Coordinate Ascent (SDCA). [8] analyzed this method and showed that when each $\phi_i$ is $L$-smooth and convex, the convergence rate of SDCA is $\tilde{O}\left((L/\lambda + n)\log(1/\epsilon)\right)$.

As its name indicates, SDCA is derived by considering a dual problem. In this paper, we consider the possibility of applying SDCA to problems in which individual $\phi_i$ are non-convex, e.g., deep learning optimization problems. In many such cases, the dual problem is meaningless. Instead of directly using the dual problem, we describe and analyze a variant of SDCA in which only gradients of $\phi_i$ are being used (similar to option 5 in the pseudo code of Prox-SDCA given in [6]). Following [3], we show that SDCA is a variant of Stochastic Gradient Descent (SGD), that is, its update is based on an unbiased estimate of the gradient. But, unlike vanilla SGD, for SDCA the variance of the estimation of the gradient tends to zero as we converge to a minimum.

For the case in which each $\phi_i$ is $L$-smooth and convex, we derive the same linear convergence rate of $\tilde{O}\left((L/\lambda + n)\log(1/\epsilon)\right)$ as in [8], but with a simpler, direct, dual-free proof. We also provide a linear convergence rate for the case in which individual $\phi_i$ can be non-convex, as long as the average of the $\phi_i$ is convex. The rate for non-convex losses has a worse dependence on $L/\lambda$, and we leave it open to see if a better rate can be obtained for the non-convex case.

Related work: In recent years, many methods for optimizing regularized loss minimization problems have been proposed. For example, SAG [5], SVRG [3], Finito [2], SAGA [1], and S2GD [4]. The best convergence rate is for accelerated SDCA [6]. A systematic study of the convergence rate of the different methods under non-convex losses is left to future work.

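2 SDCA without Duality

We maintain pseudo-dual vectors $\alpha_1, \ldots, \alpha_n$, where each $\alpha_i \in \mathbb{R}^d$.

Dual-Free SDCA($P, T, \eta, \alpha^{(0)}$)
Goal: Minimize $P(w) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(w) + \frac{\lambda}{2} \|w\|^2$
Input: Objective $P$, number of iterations $T$, step size $\eta$ s.t. $\beta := \eta \lambda n < 1$, initial dual vectors $\alpha^{(0)} = (\alpha_1^{(0)}, \ldots, \alpha_n^{(0)})$
Initialize: $w^{(0)} = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i^{(0)}$
For $t = 1, \ldots, T$
    Pick $i$ uniformly at random from $[n]$
    Update: $\alpha_i^{(t)} = \alpha_i^{(t-1)} - \eta \lambda n \left( \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right)$
    Update: $w^{(t)} = w^{(t-1)} - \eta \left( \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right)$

Observe that SDCA keeps the primal-dual relation

$$w^{(t-1)} = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i^{(t-1)} .$$

Observe also that the update of $\alpha$ can be rewritten as

$$\alpha_i^{(t)} = (1 - \beta)\, \alpha_i^{(t-1)} + \beta \left( -\nabla \phi_i(w^{(t-1)}) \right),$$

namely, the new value of $\alpha_i$ is a convex combination of its old value and the negation of the gradient. Finally, observe that, conditioned on the value of $w^{(t-1)}$ and $\alpha^{(t-1)}$, we have that

$$\mathbb{E}[w^{(t)}] = w^{(t-1)} - \eta \left( \mathbb{E} \nabla \phi_i(w^{(t-1)}) + \mathbb{E}\, \alpha_i^{(t-1)} \right) = w^{(t-1)} - \eta \left( \nabla \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^{(t-1)}) + \lambda w^{(t-1)} \right) = w^{(t-1)} - \eta \nabla P(w^{(t-1)}) .$$

That is, SDCA is in fact an instance of Stochastic Gradient Descent. As we will see in the analysis section below, the advantage of SDCA over a vanilla SGD algorithm is that the variance of the update goes to zero as we converge to an optimum.

3 Analysis

The theorem below provides a linear convergence rate for smooth and convex functions. The rate matches the analysis given in [8], but the analysis is simpler and does not rely on duality.

Theorem 1. Assume that each $\phi_i$ is $L$-smooth and convex, and the algorithm is run with $\eta \le \frac{1}{L + \lambda n}$. Let $w^*$ be the minimizer of $P(w)$ and let $\alpha_i^* = -\nabla \phi_i(w^*)$. Then, for every $t \ge 1$,

$$\mathbb{E}\left[ \frac{\lambda}{2} \|w^{(t)} - w^*\|^2 + \frac{1}{2Ln} \sum_{i=1}^{n} \|\alpha_i^{(t)} - \alpha_i^*\|^2 \right] \le e^{-\eta \lambda t} \left[ \frac{\lambda}{2} \|w^{(0)} - w^*\|^2 + \frac{1}{2Ln} \sum_{i=1}^{n} \|\alpha_i^{(0)} - \alpha_i^*\|^2 \right] .$$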
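The Dual-Free SDCA procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the author's implementation: the least-squares losses $\phi_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$, the synthetic data, the helper names `dual_free_sdca` and `make_least_squares`, and the smoothness constant $L = \max_i \|x_i\|^2$ are all introduced here for illustration. The step size $\eta = 1/(L + \lambda n)$ is chosen to satisfy the condition of Theorem 1.

```python
import random


def dual_free_sdca(grads, lam, eta, T, alpha0, seed=0):
    """Sketch of Dual-Free SDCA. grads[i](w) returns the gradient of phi_i at w as a list."""
    n, d = len(grads), len(alpha0[0])
    beta = eta * lam * n
    assert 0 < beta < 1, "step size must satisfy eta * lam * n < 1"
    rng = random.Random(seed)
    alpha = [list(a) for a in alpha0]
    # Initialise w so that w = (1 / (lam * n)) * sum_i alpha_i holds.
    w = [sum(a[k] for a in alpha) / (lam * n) for k in range(d)]
    v_sq = []  # squared norm of the update direction v_t at every step
    for _ in range(T):
        i = rng.randrange(n)  # pick i uniformly at random
        g = grads[i](w)
        v = [g[k] + alpha[i][k] for k in range(d)]  # v_t = grad phi_i(w) + alpha_i
        for k in range(d):
            alpha[i][k] -= beta * v[k]  # alpha_i <- alpha_i - eta * lam * n * v_t
            w[k] -= eta * v[k]          # w <- w - eta * v_t
        v_sq.append(sum(x * x for x in v))
    return w, alpha, v_sq


def make_least_squares(n, d, seed=1):
    """Hypothetical toy problem: phi_i(w) = 0.5 * (x_i . w - y_i)^2 on synthetic data."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    y = [rng.uniform(-1, 1) for _ in range(n)]

    def make_grad(i):
        def grad(w):
            r = sum(X[i][k] * w[k] for k in range(d)) - y[i]
            return [r * X[i][k] for k in range(d)]
        return grad

    L = max(sum(x * x for x in row) for row in X)  # each phi_i is ||x_i||^2-smooth
    return [make_grad(i) for i in range(n)], L


n, d, lam = 20, 3, 1.0
grads, L = make_least_squares(n, d)
eta = 1.0 / (L + lam * n)  # satisfies eta <= 1 / (L + lam * n)
w, alpha, v_sq = dual_free_sdca(
    grads, lam, eta, T=3000, alpha0=[[0.0] * d for _ in range(n)]
)
```

On a toy problem of this kind, the recorded values in `v_sq` should shrink toward zero as the iterates approach the minimizer, which is the variance-reduction behaviour that distinguishes this update from vanilla SGD.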

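In particular, setting $\eta = \frac{1}{L + \lambda n}$, then after $T \ge \tilde{\Omega}\left( \frac{L}{\lambda} + n \right)$ iterations (with $\tilde{\Omega}$ hiding logarithmic factors) we will have $\mathbb{E}[P(w^{(T)}) - P(w^*)] \le \epsilon$.

The theorem below provides a linear convergence rate for smooth functions, without assuming that individual $\phi_i$ are convex. We only require that the average of the $\phi_i$ is convex. The dependence on $L/\lambda$ is worse in this case.

Theorem 2. Assume that each $\phi_i$ is $L$-smooth and that the average function, $\frac{1}{n} \sum_{i=1}^{n} \phi_i$, is convex. Let $w^*$ be the minimizer of $P(w)$ and let $\alpha_i^* = -\nabla \phi_i(w^*)$. Then, if we run SDCA with $\eta = \min\left\{ \frac{\lambda}{2L^2}, \frac{1}{2 \lambda n} \right\}$, we have that

$$\mathbb{E}\left[ \frac{\lambda}{2} \|w^{(t)} - w^*\|^2 + \frac{\lambda}{2L^2 n} \sum_{i=1}^{n} \|\alpha_i^{(t)} - \alpha_i^*\|^2 \right] \le e^{-\eta \lambda t} \left[ \frac{\lambda}{2} \|w^{(0)} - w^*\|^2 + \frac{\lambda}{2L^2 n} \sum_{i=1}^{n} \|\alpha_i^{(0)} - \alpha_i^*\|^2 \right] .$$

It follows that whenever $T \ge \tilde{\Omega}\left( \frac{L^2}{\lambda^2} + n \right)$ we have that $\mathbb{E}[P(w^{(T)}) - P(w^*)] \le \epsilon$.

3.1 SDCA as variance-reduced SGD

As we have shown before, SDCA is an instance of SGD, in the sense that the update can be written as $w^{(t)} = w^{(t-1)} - \eta v_t$, with $v_t = \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)}$ satisfying $\mathbb{E}[v_t] = \nabla P(w^{(t-1)})$.

The advantage of SDCA over a generic SGD is that the variance of the update goes to zero as we converge to the optimum. To see this, observe that

$$\mathbb{E}\|v_t\|^2 = \mathbb{E}\|\alpha_i^{(t-1)} + \nabla \phi_i(w^{(t-1)})\|^2 = \mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^* + \alpha_i^* + \nabla \phi_i(w^{(t-1)})\|^2 \le 2\, \mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + 2\, \mathbb{E}\|\nabla \phi_i(w^{(t-1)}) + \alpha_i^*\|^2 .$$

Theorem 1 or Theorem 2 tells us that the term $\mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2$ goes to zero as $e^{-\eta \lambda t}$. For the second term, by smoothness of $\phi_i$ we have

$$\|\nabla \phi_i(w^{(t-1)}) + \alpha_i^*\| = \|\nabla \phi_i(w^{(t-1)}) - \nabla \phi_i(w^*)\| \le L \|w^{(t-1)} - w^*\| ,$$

and therefore, using Theorem 1 or Theorem 2 again, the second term also goes to zero as $e^{-\eta \lambda t}$. All in all, when $t \ge \Omega\left( \frac{1}{\eta \lambda} \log(1/\epsilon) \right)$ we will have that $\mathbb{E}\|v_t\|^2 \le \epsilon$.

4 Proofs

Observe that $0 = \nabla P(w^*) = \frac{1}{n} \sum_{i=1}^{n} \nabla \phi_i(w^*) + \lambda w^*$, which implies that

$$w^* = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i^* .$$

Define $u_i = -\nabla \phi_i(w^{(t-1)})$ and $v_t = -u_i + \alpha_i^{(t-1)}$. We also denote two potentials:

$$A_t = \frac{1}{n} \sum_{j=1}^{n} \|\alpha_j^{(t)} - \alpha_j^*\|^2 , \qquad B_t = \|w^{(t)} - w^*\|^2 .$$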
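The one-step changes of these two potentials, which the proofs rely on, are purely algebraic identities and can be sanity-checked numerically. The sketch below assumes the normalizations $A_t = \frac{1}{n} \sum_j \|\alpha_j^{(t)} - \alpha_j^*\|^2$ and $B_t = \|w^{(t)} - w^*\|^2$, and uses random vectors as stand-ins for the gradient and the reference points; all variable names and the parameter values are hypothetical choices made for illustration.

```python
import random


def sq_norm(x):
    return sum(c * c for c in x)


def diff(a, b):
    return [p - q for p, q in zip(a, b)]


rng = random.Random(0)
d, n, lam, eta = 4, 10, 0.5, 0.05
beta = eta * lam * n  # 0.25, so beta < 1 as the algorithm requires


def rand_vec():
    return [rng.uniform(-1, 1) for _ in range(d)]


# Random stand-ins: the identities hold for any vectors, not only true optima.
w_prev, w_star = rand_vec(), rand_vec()
alpha_prev, alpha_star = rand_vec(), rand_vec()  # alpha_i^{(t-1)} and alpha_i^*
grad = rand_vec()                                # plays the role of grad phi_i(w^{(t-1)})

u = [-g for g in grad]                           # u_i = -grad phi_i(w^{(t-1)})
v = diff(alpha_prev, u)                          # v_t = -u_i + alpha_i^{(t-1)}
alpha_new = [a - beta * vi for a, vi in zip(alpha_prev, v)]
w_new = [wk - eta * vi for wk, vi in zip(w_prev, v)]

# Only coordinate i changes, so A_{t-1} - A_t is 1/n times the change of that term.
lhs_A = (sq_norm(diff(alpha_prev, alpha_star)) - sq_norm(diff(alpha_new, alpha_star))) / n
rhs_A = eta * lam * (
    sq_norm(diff(alpha_prev, alpha_star))
    - sq_norm(diff(u, alpha_star))
    + (1 - beta) * sq_norm(v)
)

lhs_B = sq_norm(diff(w_prev, w_star)) - sq_norm(diff(w_new, w_star))
rhs_B = 2 * eta * sum(p * q for p, q in zip(diff(w_prev, w_star), v)) - eta ** 2 * sq_norm(v)
```

Here `lhs_A` and `rhs_A` (and likewise `lhs_B` and `rhs_B`) should agree up to floating-point rounding, since they are two ways of writing the same quantity.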

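We will first analyze the evolution of $A_t$ and $B_t$. If on round $t$ we update using element $i$, then $\alpha_i^{(t)} = (1 - \beta)\, \alpha_i^{(t-1)} + \beta u_i$, where $\beta = \eta \lambda n$. It follows that

$$\begin{aligned}
A_{t-1} - A_t &= -\frac{1}{n} \|\alpha_i^{(t)} - \alpha_i^*\|^2 + \frac{1}{n} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 \\
&= -\frac{1}{n} \|(1 - \beta)(\alpha_i^{(t-1)} - \alpha_i^*) + \beta (u_i - \alpha_i^*)\|^2 + \frac{1}{n} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 \\
&= \frac{1}{n} \left( -(1 - \beta) \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \beta \|u_i - \alpha_i^*\|^2 + \beta (1 - \beta) \|\alpha_i^{(t-1)} - u_i\|^2 + \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 \right) \\
&= \frac{\beta}{n} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 + (1 - \beta) \|v_t\|^2 \right) \\
&= \eta \lambda \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 + (1 - \beta) \|v_t\|^2 \right) .
\end{aligned}$$

In addition,

$$B_{t-1} - B_t = \|w^{(t-1)} - w^*\|^2 - \|w^{(t)} - w^*\|^2 = 2 \eta\, (w^{(t-1)} - w^*)^\top v_t - \eta^2 \|v_t\|^2 .$$

The proofs of Theorem 1 and Theorem 2 will follow by studying different combinations of $A_t$ and $B_t$.

4.1 Proof of Theorem 2

Define

$$C_t = \frac{\lambda}{2L^2} A_t + \frac{\lambda}{2} B_t .$$

Combining the expressions for $A_{t-1} - A_t$ and $B_{t-1} - B_t$ we obtain

$$\begin{aligned}
C_{t-1} - C_t &= \frac{\eta \lambda^2}{2L^2} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 + (1 - \beta) \|v_t\|^2 \right) + \frac{\lambda}{2} \left( 2 \eta\, (w^{(t-1)} - w^*)^\top v_t - \eta^2 \|v_t\|^2 \right) \\
&= \eta \lambda \left( \frac{\lambda}{2L^2} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 \right) + \left( \frac{\lambda (1 - \beta)}{2L^2} - \frac{\eta}{2} \right) \|v_t\|^2 + (w^{(t-1)} - w^*)^\top v_t \right) .
\end{aligned}$$

The definition of $\eta$ implies that $\eta \le \lambda (1 - \beta) / L^2$ (since $\eta \le \frac{1}{2 \lambda n}$ gives $\beta \le \frac{1}{2}$), so the coefficient of $\|v_t\|^2$ is non-negative. By smoothness of each $\phi_i$ we have $\|u_i - \alpha_i^*\|^2 = \|\nabla \phi_i(w^{(t-1)}) - \nabla \phi_i(w^*)\|^2 \le L^2 \|w^{(t-1)} - w^*\|^2$. Therefore,

$$C_{t-1} - C_t \ge \eta \lambda \left( \frac{\lambda}{2L^2} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2 + (w^{(t-1)} - w^*)^\top v_t \right) .$$

Taking expectation of both sides (w.r.t. the choice of $i$ and conditioned on $w^{(t-1)}$ and $\alpha^{(t-1)}$) and noting that $\mathbb{E}[v_t] = \nabla P(w^{(t-1)})$, we obtain that

$$\mathbb{E}[C_{t-1} - C_t] \ge \eta \lambda \left( \frac{\lambda}{2L^2}\, \mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2 + (w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)}) \right) .$$

Using the strong convexity of $P$ we have $(w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)}) \ge P(w^{(t-1)}) - P(w^*) + \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2$ and $P(w^{(t-1)}) - P(w^*) \ge \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2$, which together yield $(w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)}) \ge \lambda \|w^{(t-1)} - w^*\|^2$. Therefore,

$$\mathbb{E}[C_{t-1} - C_t] \ge \eta \lambda \left( \frac{\lambda}{2L^2}\, \mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + \left( \lambda - \frac{\lambda}{2} \right) \|w^{(t-1)} - w^*\|^2 \right) = \eta \lambda\, C_{t-1} .$$

It follows that

$$\mathbb{E}[C_t] \le (1 - \eta \lambda)\, C_{t-1} ,$$

and repeating this recursively we end up with

$$\mathbb{E}[C_t] \le (1 - \eta \lambda)^t\, C_0 \le e^{-\eta \lambda t}\, C_0 ,$$

which concludes the proof of the first part of Theorem 2. The second part follows by observing that $P$ is $(L + \lambda)$-smooth, which gives $P(w) - P(w^*) \le \frac{L + \lambda}{2} \|w - w^*\|^2$.

4.2 Proof of Theorem 1

In the proof of Theorem 2 we bounded the term $\|u_i - \alpha_i^*\|^2$ by $L^2 \|w^{(t-1)} - w^*\|^2$ based on the smoothness of $\phi_i$. We now assume that $\phi_i$ is also convex, which enables us to bound $\|u_i - \alpha_i^*\|^2$ based on the current sub-optimality.

Lemma 1. Assume that each $\phi_i$ is $L$-smooth and convex. Then, for every $w$,

$$\frac{1}{n} \sum_{i=1}^{n} \|\nabla \phi_i(w) - \nabla \phi_i(w^*)\|^2 \le 2L \left( P(w) - P(w^*) - \frac{\lambda}{2} \|w - w^*\|^2 \right) .$$

Proof. For every $i$, define

$$g_i(w) = \phi_i(w) - \phi_i(w^*) - \nabla \phi_i(w^*)^\top (w - w^*) .$$

Clearly, since $\phi_i$ is $L$-smooth so is $g_i$. In addition, by convexity of $\phi_i$ we have $g_i(w) \ge 0$ for all $w$. It follows that $g_i$ is non-negative and smooth, and therefore, it is self-bounded (see Section 12.1.3 in [7]):

$$\|\nabla g_i(w)\|^2 \le 2L\, g_i(w) .$$

Using the definition of $g_i$, we obtain

$$\|\nabla \phi_i(w) - \nabla \phi_i(w^*)\|^2 = \|\nabla g_i(w)\|^2 \le 2L\, g_i(w) = 2L \left( \phi_i(w) - \phi_i(w^*) - \nabla \phi_i(w^*)^\top (w - w^*) \right) .$$

Taking expectation over $i$ and observing that $P(w) = \mathbb{E}\, \phi_i(w) + \frac{\lambda}{2} \|w\|^2$ and $0 = \nabla P(w^*) = \mathbb{E}\, \nabla \phi_i(w^*) + \lambda w^*$ we obtain

$$\begin{aligned}
\mathbb{E}\|\nabla \phi_i(w) - \nabla \phi_i(w^*)\|^2 &\le 2L \left( P(w) - \frac{\lambda}{2} \|w\|^2 - P(w^*) + \frac{\lambda}{2} \|w^*\|^2 + \lambda\, w^{*\top} (w - w^*) \right) \\
&= 2L \left( P(w) - P(w^*) - \frac{\lambda}{2} \|w - w^*\|^2 \right) .
\end{aligned}$$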
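The bound of Lemma 1 can be checked numerically on a small convex instance. The sketch below is an illustration under assumptions introduced here, not part of the original analysis: it uses one-dimensional least-squares losses $\phi_i(w) = \frac{1}{2}(x_i w - y_i)^2$ on synthetic data, for which $L = \max_i x_i^2$ and the minimizer $w^*$ has a closed form. The names `P`, `grad_phi`, and `lemma_sides` are hypothetical helpers.

```python
import random

rng = random.Random(0)
n, lam = 50, 0.1
x = [rng.uniform(-2, 2) for _ in range(n)]
y = [rng.uniform(-1, 1) for _ in range(n)]
L = max(xi * xi for xi in x)  # smoothness constant of phi_i(w) = 0.5 * (x_i * w - y_i)^2


def P(w):
    """Regularized objective: average loss plus (lam / 2) * w^2."""
    return sum(0.5 * (xi * w - yi) ** 2 for xi, yi in zip(x, y)) / n + 0.5 * lam * w * w


def grad_phi(i, w):
    """Derivative of phi_i at w."""
    return (x[i] * w - y[i]) * x[i]


# Closed-form minimizer of P in one dimension (from setting P'(w) = 0).
w_star = (sum(xi * yi for xi, yi in zip(x, y)) / n) / (sum(xi * xi for xi in x) / n + lam)


def lemma_sides(w):
    """Return (left-hand side, right-hand side) of the Lemma 1 inequality at w."""
    lhs = sum((grad_phi(i, w) - grad_phi(i, w_star)) ** 2 for i in range(n)) / n
    rhs = 2 * L * (P(w) - P(w_star) - 0.5 * lam * (w - w_star) ** 2)
    return lhs, rhs


checks = [lemma_sides(w) for w in (-3.0, -0.5, 0.0, 0.7, 5.0)]
```

Each pair in `checks` should have its first entry no larger than its second (up to rounding), as the lemma predicts for smooth convex losses.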

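We now consider the potential

$$D_t = \frac{1}{2L} A_t + \frac{\lambda}{2} B_t .$$

Combining the expressions for $A_{t-1} - A_t$ and $B_{t-1} - B_t$ we obtain

$$\begin{aligned}
D_{t-1} - D_t &= \frac{\eta \lambda}{2L} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 + (1 - \beta) \|v_t\|^2 \right) + \frac{\lambda}{2} \left( 2 \eta\, (w^{(t-1)} - w^*)^\top v_t - \eta^2 \|v_t\|^2 \right) \\
&= \eta \lambda \left( \frac{1}{2L} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 \right) + \left( \frac{1 - \beta}{2L} - \frac{\eta}{2} \right) \|v_t\|^2 + (w^{(t-1)} - w^*)^\top v_t \right) \\
&\ge \eta \lambda \left( \frac{1}{2L} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 \right) + (w^{(t-1)} - w^*)^\top v_t \right) ,
\end{aligned}$$

where in the last inequality we used the assumption $\eta \le \frac{1}{L + \lambda n}$, which implies $\eta \le \frac{1 - \beta}{L}$.

Take expectation of the above w.r.t. the choice of $i$, using Lemma 1, using $\mathbb{E}[v_t] = \nabla P(w^{(t-1)})$, and using convexity of $P$, which yields $P(w^*) - P(w^{(t-1)}) \ge (w^* - w^{(t-1)})^\top \nabla P(w^{(t-1)})$, we obtain

$$\begin{aligned}
\mathbb{E}[D_{t-1} - D_t] &\ge \eta \lambda \left( \frac{1}{2L}\, \mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \frac{1}{2L}\, \mathbb{E}\|u_i - \alpha_i^*\|^2 + (w^{(t-1)} - w^*)^\top \mathbb{E}[v_t] \right) \\
&\ge \eta \lambda \left( \frac{1}{2L}\, \mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \left( P(w^{(t-1)}) - P(w^*) - \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2 \right) + (w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)}) \right) \\
&\ge \eta \lambda \left( \frac{1}{2L}\, \mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2 \right) = \eta \lambda\, D_{t-1} .
\end{aligned}$$

This gives $\mathbb{E}[D_t] \le (1 - \eta \lambda)\, D_{t-1} \le e^{-\eta \lambda}\, D_{t-1}$, which concludes the proof of the first part of the theorem. The second part follows by observing that $P$ is $(L + \lambda)$-smooth, which gives $P(w) - P(w^*) \le \frac{L + \lambda}{2} \|w - w^*\|^2$.

References

[1] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

[2] Aaron J. Defazio, Tibério S. Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. arXiv preprint arXiv:1407.2710, 2014.

[3] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.

[4] Jakub Konečný and Peter Richtárik. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.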

[5] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.

[6] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, Series A and B (to appear).

[7] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[8] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14, Feb 2013.
