New Optimisation Methods for Machine Learning Aaron Defazio


New Optimisation Methods for Machine Learning

Aaron Defazio

A thesis submitted for the degree of Doctor of Philosophy of The Australian National University

October 2015

© Aaron Defazio 2014

Except where otherwise indicated, this thesis is my own original work.

Aaron Defazio
7 October 2015


Acknowledgements

I would like to thank several NICTA researchers for conversations and brainstorming sessions during the course of my PhD, particularly Scott Sanner and my supervisor Tiberio Caetano. I would like to thank Justin Domke for many discussions about the Finito algorithm, and his assistance with developing and checking the proof. Likewise, for the SAGA algorithm I would like to thank Francis Bach and Simon Lacoste-Julien for discussion and assistance with the proofs. The SAGA algorithm was discovered in collaboration with them while visiting the INRIA lab, with some financial support from INRIA.

I would also like to thank my family for all their support during the course of my PhD. Particularly my mother, for giving me a place to stay for part of the duration of the PhD as well as food, love and support. I do not thank her often enough.

I also would like to thank NICTA for their scholarship during the course of the PhD. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.


Abstract

In this work we introduce several new optimisation methods for problems in machine learning. Our algorithms broadly fall into two categories: optimisation of finite sums and of graph structured objectives. The finite sum problem is simply the minimisation of objective functions that are naturally expressed as a summation over a large number of terms, where each term has a similar or identical weight. Such objectives most often appear in machine learning in the empirical risk minimisation framework in the non-online learning setting. The second category, that of graph structured objectives, consists of objectives that result from applying maximum likelihood to Markov random field models. Unlike the finite sum case, all the non-linearity is contained within a partition function term, which does not readily decompose into a summation.

For the finite sum problem, we introduce the Finito and SAGA algorithms, as well as variants of each. The Finito algorithm is best suited to strongly convex problems where the number of terms is of the same order as the condition number of the problem. We prove the fast convergence rate of Finito for strongly convex problems and demonstrate its state-of-the-art empirical performance on 5 datasets. The SAGA algorithm we introduce is complementary to the Finito algorithm. It is more generally applicable, as it can be applied to problems without strong convexity, and to problems that have a non-differentiable regularisation term. In both cases we establish strong convergence rate proofs. It is also better suited to sparser problems than Finito. The SAGA method has a broader and simpler theory than any existing fast method for the problem class of finite sums, in particular it is the first such method that can provably be applied to non-strongly convex problems with non-differentiable regularisers without introduction of additional regularisation.

For graph-structured problems, we take three complementary approaches. We look at learning the parameters for a fixed structure, learning the structure independently, and learning both simultaneously. Specifically, for the combined approach, we introduce a new method for encouraging graph structures with the scale-free property. For the structure learning problem, we establish SHORTCUT, an O(n^2.5) expected time approximate structure learning method for Gaussian graphical models. For problems where the structure is known but the parameters unknown, we introduce an approximate maximum likelihood learning algorithm that is capable of learning a useful subclass of Gaussian graphical models.

Our thesis as a whole introduces a new suite of techniques for machine learning practitioners that increases the size and type of problems that can be efficiently solved. Our work is backed by extensive theory, including proofs of convergence for each method discussed.

Contents

1 Introduction and Overview
  1.1 Convex Machine Learning Problems
  1.2 Problem Structure and Black Box Methods
  1.3 Early & Late Stage Convergence
  1.4 Approximations
  1.5 Non-differentiability in Machine Learning
  1.6 Publications Related to This Thesis

2 Incremental Gradient Methods
  2.1 Problem Setup
    2.1.1 Exploiting problem structure
    2.1.2 Randomness and expected convergence rates
    2.1.3 Data access order
  2.2 Early Incremental Gradient Methods
  2.3 Stochastic Dual Coordinate Descent (SDCA)
    2.3.1 Alternative steps
    2.3.2 Reducing storage requirements
    2.3.3 Accelerated SDCA
  2.4 Stochastic Average Gradient (SAG)
  2.5 Stochastic Variance Reduced Gradient (SVRG)

3 New Dual Incremental Gradient Methods
  3.1 The Finito Algorithm
    3.1.1 Additional notation
    3.1.2 Method
    3.1.3 Storage costs
  3.2 Permutation & the Importance of Randomness
  3.3 Experiments
  3.4 The MISO Method
  3.5 A Primal Form of SDCA
  3.6 Prox-Finito: a Novel Midpoint Algorithm
    3.6.1 Prox-Finito relation to Finito
    3.6.2 Non-Uniform Lipschitz Constants
  3.7 Finito Theory
    3.7.1 Main proof
  3.8 Prox-Finito Theory
    3.8.1 Main result
    3.8.2 Proof of Theorem 3.4

4 New Primal Incremental Gradient Methods
  4.1 Composite Objectives
  4.2 SAGA Algorithm
  4.3 Relation to Existing Methods
    4.3.1 SAG
    4.3.2 SVRG
    4.3.3 Finito
  4.4 Implementation
  4.5 Experiments
  4.6 SAGA Theory
    4.6.1 Linear convergence for strongly convex problems
    4.6.2 1/k convergence for non-strongly convex problems
  4.7 Understanding the Convergence of the SVRG Method
  4.8 Verifying SAGA Constants
    4.8.1 Strongly convex step size γ = 1/(2(µ + L))
    4.8.2 Strongly convex step size γ = 1/(3L)
    4.8.3 Non-strongly convex step size γ = 1/(3L)

5 Access Orders and Complexity Bounds
  5.1 Lower Complexity Bounds
    5.1.1 Technical assumptions
    5.1.2 Simple (1 − 1/n)^k bound
    5.1.3 Minimisation of non-strongly convex finite sums
    5.1.4 Open problems
  5.2 Access Orderings
  5.3 MISO Robustness

6 Beyond Finite Sums: Learning Graphical Models
  6.1 Beyond the Finite Sum Structure
  6.2 The Structure Learning Problem
  6.3 Covariance Selection
    6.3.1 Direct optimisation approaches
    6.3.2 Neighbourhood selection
    6.3.3 Thresholding approaches
    6.3.4 Conditional thresholding
  6.4 Alternative Regularisers

7 Learning Scale Free Networks
  7.1 Combinatorial Objective
  7.2 Submodularity
  7.3 Optimisation
    7.3.1 Alternating direction method of multipliers
    7.3.2 Proximal operator using dual decomposition
  7.4 Alternative Degree Priors
  7.5 Experiments
    7.5.1 Reconstruction of synthetic networks
    7.5.2 Reconstruction of a gene activation network
    7.5.3 Runtime comparison: different proximal operator methods
    7.5.4 Runtime comparison: submodular relaxation against other approaches
  7.6 Proof of Correctness

8 Fast Approximate Structural Inference
  8.1 SHORTCUT
  8.2 Running Time
  8.3 Experiments
    8.3.1 Synthetic datasets
    8.3.2 Real world datasets
  8.4 Theoretical Properties

9 Fast Approximate Parameter Inference
  9.1 Model Class
    9.1.1 Improper models
    9.1.2 Precision matrix restrictions
  9.2 An Approximate Constrained Maximum Entropy Learning Algorithm
    9.2.1 Maximum Entropy Learning
    9.2.2 The Bethe Approximation
    9.2.3 Maximum entropy learning of unconstrained Gaussian distributions
    9.2.4 Restricted Gaussian distributions
  9.3 Maximum Likelihood Learning with Belief Propagation
  9.4 Collaborative Filtering
  9.5 The Item Graph
    9.5.1 Limitations of previous approaches
  9.6 The Item Field Model
  9.7 Prediction Rule
  9.8 Experiments
  9.9 Related Work
  9.10 Extensions
    9.10.1 Missing Data & Kernel Functions
    9.10.2 Conditional Random Field Variants

10 Conclusion and Discussion
  10.1 Incremental Gradient Methods
    10.1.1 Summary of contributions
    10.1.2 Applications
    10.1.3 Open problems
  10.2 Learning Graph Models
    10.2.1 Summary of contributions
    10.2.2 Applications
    10.2.3 Open Problems

A Basic Convexity Theorems
  A.1 Definitions
  A.2 Useful Properties of Convex Conjugates
  A.3 Types of Duality
  A.4 Properties of Differentiable Functions
  A.5 Convexity Bounds
    A.5.1 Taylor like bounds
    A.5.2 Gradient difference bounds
    A.5.3 Inner product bounds
    A.5.4 Strengthened bounds using both Lipschitz and strong convexity

B Miscellaneous Lemmas

Bibliography

Chapter 1

Introduction and Overview

Numerical optimisation is in many ways the core problem in modern machine learning. Virtually all learning problems can be tackled by formulating a real valued objective function expressing some notion of loss or suboptimality which can be optimised over. Indeed, approaches that do not have well founded objective functions are rare, perhaps contrastive divergence (Hinton, 2002) and some sampling schemes being notable examples. Many methods that started as heuristics were able to be significantly improved once well-founded objectives were discovered and exploited. Often a convex variant can be developed. A prime example is belief propagation, the relation to the Bethe approximation (Yedidia et al., 2000), and the later development of tree weighted variants (Wainwright et al., 2003) which allowed a convex formulation.

The core of this thesis is the development of several new numerical optimisation schemes, primarily focusing on convex objectives, which either address limitations of existing approaches, or improve on the performance of state-of-the-art algorithms. These methods increase the breadth and depth of machine learning problems that are tractable on modern computers.

1.1 Convex Machine Learning Problems

In this work we particularly focus on problems that have convex objectives. This is a major restriction, and one at the core of much of modern optimisation theory, but one that nevertheless requires justification. The primary reasons for targeting convex problems are their ubiquitousness in applications and their relative ease of solution. Logistic regression, least-squares, support vector machines, conditional random fields and tree-weighted belief propagation all involve convex models. All of these techniques have seen real world application, although their use has been overshadowed in recent years by non-convex models such as neural networks.

Convex optimisation is still of interest when addressing non-convex problems though. Many algorithms that were developed for convex problems, motivated by their provably fast convergence, have later been applied to non-convex problems with good

empirical results. Additionally, often the best approach to solving a non-convex problem is through the repeated solution of convex sub-problems, or by replacing the problem entirely with a close convex surrogate.

The class of convex numerical problems is sometimes considered synonymous with that of computationally tractable problems. This is not strictly true in the usual computer science sense of tractability, as some convex problems on complicated but convex sets can still be NP-hard (de Klerk and Pasechnik, 2006). On the other hand, we can sometimes approximately solve non-convex problems of massive scale using modern approaches (e.g. Dean et al., 2012). Instead, convex problems can be better thought of as the reliably solvable problems. For convex problems we can almost always establish theoretical results giving a practical bound on the amount of computation time required to solve a given convex problem (Nesterov and Nemirovski, 1994). For non-convex problems we can rarely do better than finding a locally optimal solution. Together with the small or no tuning required by convex optimisation algorithms, they can be used as building blocks within larger programs; details of the problem can be abstracted away from the users. This is rarely the case for non-convex problems, where the most commonly used methods require substantial hand tuning.

When tasked with solving a convex problem, we have at our disposal powerful and flexible algorithms such as interior point methods and in particular Newton's method. While Newton's method is strikingly successful on small problems, its approximately cubic running time per iteration resulting from the need to do a linear solve means that it scales extremely poorly to problems with large numbers of variables. It is also unable to directly handle non-differentiable problems common in machine learning. Both of these shortcomings have been addressed to some degree (Nocedal, 1980; Liu and Nocedal, 1989; Andrew and Gao, 2007), by the use of low-rank approximations and tricks for specific non-differentiable structures, although problems remain.

An additional complication is a divergence between the numerical optimisation and machine learning communities. Numerical convex optimisation researchers in the 80s and 90s largely focused on solving problems with large numbers of complex constraints, particularly Quadratic Programming (QP) and Linear Programming (LP) problems. These advances were applicable to the kernel methods of the early 2000s, but at odds with many of the more modern machine learning problems, which are characterised by large numbers of potentially non-differentiable terms. The core examples would be linear support vector machines, other max-margin methods and neural networks with non-differentiable activation functions. The problem we address in Chapter 7 also fits into this class.

In this thesis we will focus on smooth optimisation problems, allowing only a controlled level of non-smooth structure in the form of certain non-differentiable regularisation terms (detailed in Section 1.2). The notion of smoothness we use is that

of Lipschitz smoothness. A function f is Lipschitz smooth with constant L if its gradients are Lipschitz continuous. That is, for all x, y ∈ R^d:

    ‖f′(x) − f′(y)‖ ≤ L ‖x − y‖.

Lipschitz smooth functions are differentiable, and if their Hessian matrix exists it is bounded in spectral norm. The other assumption we will sometimes make is that of strong convexity. A function f is strongly convex with constant µ if for all x, y ∈ R^d and a ∈ [0, 1]:

    f(ax + (1 − a)y) ≤ a f(x) + (1 − a) f(y) − a(1 − a)(µ/2) ‖x − y‖².

Essentially, rather than the usual convexity interpolation bound f(ax + (1 − a)y) ≤ a f(x) + (1 − a) f(y), we have it strengthened by a quadratic term.
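To make these two definitions concrete, here is a small numerical check (our own illustration, not part of the thesis) using a quadratic f(x) = ½ xᵀAx, for which L is the largest eigenvalue of A and µ the smallest:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])              # symmetric positive definite
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

mu, L = np.linalg.eigvalsh(A)[[0, -1]]  # strong convexity and smoothness constants

for _ in range(1000):
    x, y, a = rng.normal(size=2), rng.normal(size=2), rng.uniform()
    # Lipschitz smoothness: ||f'(x) - f'(y)|| <= L ||x - y||
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-12
    # Strong convexity: the interpolation bound strengthened by a quadratic term
    lhs = f(a * x + (1 - a) * y)
    rhs = a * f(x) + (1 - a) * f(y) - a * (1 - a) * (mu / 2) * np.linalg.norm(x - y) ** 2
    assert lhs <= rhs + 1e-12
```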

1.2 Problem Structure and Black Box Methods

The last few years have seen a resurgence in convex optimisation centred around the technique of exploiting problem structure, an approach we take as well. When no structure is assumed by the optimisation method about the problem other than the degree of convexity, very strong results are known about the best possible convergence rates obtainable. These results date back to the seminal work of Nemirovsky and Yudin (1983) and Nesterov (1998, earlier work in Russian). These results have contributed to the widely held attitude that convex optimisation is a solved problem. But when the problem has some sort of additional structure these worst-case theoretical results are no longer applicable. Indeed, a series of recent results suggest that practically all problems of interest have such structure, allowing advances in theoretical, not just practical, convergence.

For example, non-differentiable problems under reasonable Lipschitz smoothness assumptions can be solved with an error reduction of O(√t) times after t iterations, for standard measures of convergence rate, at best (Nesterov, 1998, Theorem 3.2.1). In practice, virtually all non-differentiable problems can be treated by a smoothing transformation, giving an O(t) reduction in error after t iterations when an optimal algorithm is used (Nesterov, 2005).

Many problems of interest have a structure where most terms in the objective involve only a small number of variables. This is the case for example in inference problems on graphical models. In such cases block coordinate descent methods can give better theoretical and practical results (Richtarik and Takac, 2011).

Another exploitable structure involves a sum of two terms F(x) = f(x) + h(x), where the first term f(x) is structurally nice, say smooth and differentiable, but potentially complex to evaluate, and where the second term h(x) is non-differentiable. As long as h(x) is simple in the sense that its proximal operator is easy to evaluate, then algorithms exist with the same theoretical convergence rate as if h(x) was not part of the objective at all (F(x) = f(x)) (Beck and Teboulle, 2009).

The proximal operator is a key construction in this work, and indeed in modern optimisation theory. It is defined for a function h and constant γ as:

    prox_h^γ(v) = arg min_x { h(x) + (γ/2) ‖x − v‖² }.

Some definitions of the proximal operator use the weighting 1/(2γ) instead of γ/2; we use this form throughout this work. The proximal operator is itself an optimisation problem, and so in general it is only useful when the function h is simple. In many cases of interest the proximal operator has a closed form solution.
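As one concrete instance of such a closed form (our own illustrative example, not taken from the thesis): for the scaled L1 norm h(x) = λ‖x‖₁, the proximal operator under the definition above is coordinate-wise soft thresholding with threshold λ/γ. A minimal sketch:

```python
import numpy as np

def prox_l1(v, lam, gamma):
    """Proximal operator of h(x) = lam * ||x||_1 under the convention
    prox_h^gamma(v) = argmin_x { h(x) + (gamma/2) ||x - v||^2 },
    i.e. coordinate-wise soft thresholding with threshold lam / gamma."""
    t = lam / gamma
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Entries smaller than lam/gamma in magnitude are set exactly to zero.
v = np.array([0.3, -1.5, 0.05, 2.0])
print(prox_l1(v, lam=0.5, gamma=1.0))   # [ 0.  -1.   0.   1.5]
```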

The first four chapters of this work focus on quite possibly the simplest problem structure, that of a finite summation. This occurs when there is a large number of terms with similar structure added together or averaged in the objective. Recent results have shown that for strongly convex problems better convergence rates are possible under such summation structures than is possible for black box problems (Schmidt et al., 2013; Shalev-Shwartz and Zhang, 2013b). We provide three new algorithms for this problem structure, discussed in Chapters 3 and 4. We also discuss properties of problems in the finite sum class extensively in Chapter 5.

1.3 Early & Late Stage Convergence

When dealing with problems with a finite sum structure, practitioners have traditionally had to make a key trade-off between stochastic methods, which access the objective one term at a time, and batch methods, which work directly with the full objective. Stochastic methods such as SGD (stochastic gradient descent, Robbins and Monro, 1951) exhibit rapid convergence during early stages of optimisation, yielding a good approximate solution quickly, but this convergence slows down over time; getting a high accuracy solution is nearly impossible with SGD. Fortunately, in machine learning it is often the case that a low accuracy solution gives just as good a result as a high accuracy solution for minimising the test loss on held out data. A high accuracy solution can effectively over-fit to the training data. Running SGD for a small number of epochs is common practice.

Batch methods on the other hand are slowly converging but steady; if run for long enough they yield a high accuracy solution. For strongly convex problems, the difference in convergence is between an O(1/t) error after t iterations for SGD versus an O(r^t) error (r < 1) for LBFGS, the most popular batch method (Nocedal, 1980). We have illustrated the difference schematically in Figure 1.1.

[Figure 1.1: Schematic illustration of convergence rates. Suboptimality against iteration number for LBFGS, SGD and an incremental gradient method.]

The SGD and LBFGS lines here are typical of simple prediction problems, where SGD gives acceptable solutions after 5-10 epochs (passes over the data), whereas LBFGS eventually gives a better solution, taking 30-100 iterations to do so. LBFGS is particularly well suited to use in a distributed computing setting, and it is sometimes the case that LBFGS will give better results ultimately on the test loss, particularly for poorly conditioned (high-curvature) problems.

Figure 1.1 also illustrates the kind of convergence that the recently developed class of incremental gradient methods potentially offers. Incremental gradient methods have the same linear O(r^t) error after t epochs as a batch method, but with a coefficient r dramatically better. The difference being in theory thousands of times faster convergence, and in practice usually 10-20 times better. With favorable problem structure incremental gradient methods have the potential to offer the best of both worlds, having rapid initial convergence without the later stage slow-down of SGD.

Another traditional advantage of batch methods over stochastic methods is their ease of use. Methods such as LBFGS require no hand tuning to be applied to virtually any smooth problem. Some tuning of the memory constant that holds the number of past gradients to remember at each step can give faster convergence, but bad choices of this constant still result in convergence. SGD and other traditional stochastic methods on the other hand require a step size parameter and a parameter annealing schedule to be set. SGD is sensitive to these choices, and will diverge for poor choices.

[Footnote: Quasi-Newton methods are often cited as having local super-linear convergence. This is only true if the dimensionality of the underlying parameter space is comparable to the number of iterations used. In machine learning the parameter space is usually much larger in effective dimension than the number of iterations.]

[Figure 1.2: A Gaussian graphical model defined by the precision matrix P, together with the non-sparse covariance matrix C it induces, with rounding to 1 significant figure. Correlations are indicated by negative edge weights in a Gaussian model.]

Incremental gradient methods offer a solution to the tuning problem as well. Most incremental gradient algorithms have only a single step size parameter that needs to be set. Fortunately the convergence rate is fairly robust to the value of this parameter. The SDCA algorithm (Shalev-Shwartz and Zhang 2013b) reduces this to 0 parameters, but at the expense of being limited to problems with efficient to compute proximal operators.

1.4 Approximations

The exploitation of problem structure is not always directly possible with the objectives we encounter in machine learning. A case we focus on in this work is the learning of weight parameters in a Gaussian graphical model structure. This is an undirected graph structure with weights associated with both edges and nodes. These weights are the entries of the precision matrix (inverse covariance matrix) of a Gaussian distribution. Absent edges effectively have a weight of zero (Figure 1.2). A formal definition is given in Chapter 6.

A key approach to such problems is the use of approximations that introduce additional structure in the objective which we can exploit. The regularised maximum likelihood objective for fitting a Gaussian graphical model can require time O(n³) to evaluate (theoretically it takes time equivalent to the big-O cost of a fast matrix multiplication such as Strassen's algorithm, around O(n^2.8), but in practice simpler O(n³) techniques are used). This is prohibitively long on many problems of interest. Instead, approximations can be introduced that decompose the objective, allowing more efficient techniques to be used. In Chapter 9 we show how the Bethe approximation may be applied for learning the edge weights on restricted classes of

Gaussian graphical models. This approximation allows for the use of an efficient dual decomposition optimisation method, and has direct practical applicability in the domain of recommendation systems.

Besides parameter learning, the other primary task involving graphs is directly learning the structure. Structure learning for Gaussian graphical models is a problem that has seen a lot of interest in machine learning. The structure can be used in a machine learning pipeline as the precursor to parameter learning, or it can be used for its own sake as an indicator of correlation structure in a dataset. The use of approximations in structure learning is more widespread than in parameter learning, and we give an overview of approaches in Chapter 6. We improve on an existing technique in Chapter 8, where we show that an existing approximation can be further approximated, giving a substantial practical and theoretical speed-up by a factor of O(√n).

1.5 Non-differentiability in Machine Learning

As mentioned, machine learning problems tend to have substantial non-differentiable structure compared to the constraint structures more commonly addressed in numerical optimisation. These two forms of structure are in a sense two sides of the same coin, as for convex problems the transformation to the dual problem can often convert from one to the other. The primary example being support vector machines, where non-differentiability in the primal hinge loss is converted to a constraint set when the dual is considered.

Recent progress in optimisation has seen the use of proximal methods as the tool of choice for handling both structures in machine learning problems. When using a regularised loss objective of the form F(x) = f(x) + h(x) as mentioned above in Section 1.2, the non-differentiability can be in the regulariser h(x) or in the loss term f(x). We introduce methods addressing both cases in this work.

The SAGA algorithm of Chapter 4 is a new primal method, the first primal incremental gradient method able to be used on non-strongly convex problems with non-differentiable regularisers directly. It makes use of the proximal operator of the regulariser. It can also be used on problems with constraints, where the function h(x) is the indicator function of the constraint set, and the proximal operator is projection onto the constraint set. In this work we also introduce a new non-differentiable regulariser for the above mentioned graph structure learning problem, which can also be attacked using proximal methods. Its non-differentiable structure is atypically complex compared to other regularisers used in machine learning, requiring a special optimisation procedure to be used just to evaluate the proximal operator.

For non-differentiable losses, we introduce the Prox-Finito algorithm (Section 3.6). This incremental gradient algorithm uses the proximal operator of the single datapoint loss. It provides a bridge between the Finito algorithm (Section 3.1) and

the SDCA algorithm (Shalev-Shwartz and Zhang, 2013b), having properties of both methods.

1.6 Publications Related to This Thesis

The majority of the content in this thesis has been published as conference articles. For the work on incremental gradient methods, the Finito method has been published as Defazio et al. (2014b), and the SAGA method as Defazio et al. (2014a). Chapters 3 & 4 contain much more detailed theory than has been previously published. Some of the discussion in Chapter 5 appears in Defazio et al. (2014b) also. For the portion of this thesis on Gaussian graphical models, Chapter 7 largely follows the publication Defazio and Caetano (2012a). Chapter 9 is based on the work in Defazio and Caetano (2012b), although heavily revised.

Chapter 2

Incremental Gradient Methods

In this chapter we give an introduction to the class of incremental gradient (IG) methods. Incremental gradient methods are simply a class of methods that can take advantage of known summation structure in an optimisation objective by accessing the objective one term at a time. Objectives that are decomposable as a sum of a number of terms come up often in applied mathematics and scientific computing, but are particularly prevalent in machine learning applications. Research in the last two decades on optimisation problems with a summation structure has focused more on the stochastic approximation setting, where the summation is assumed to be over an infinite set of terms. The finite sum case that incremental gradient methods cover has seen a resurgence in recent years after the discovery that there exist fast incremental gradient methods whose convergence rates are better than any possible black box method for finite sums with particular (common) structures.

We provide an extensive overview of all known fast incremental gradient methods in the later parts of this chapter. Building on the described methods, in Chapters 3 & 4 we introduce three novel fast incremental gradient methods. Depending on the problem structure, each of these methods can have state-of-the-art performance.

2.1 Problem Setup

We are interested in minimising functions of the form

    f(x) = (1/n) Σ_{i=1}^{n} f_i(x),

where x ∈ R^d and each f_i is convex and Lipschitz smooth with constant L. We will also consider the case where each f_i is additionally strongly convex with constant µ. See Appendix A.1 for definitions of Lipschitz smoothness and strong convexity. Incremental gradient methods are algorithms that at each step evaluate the gradient and function value of only a single f_i.

We will measure convergence rates in terms of the number of (f_i(x), f_i'(x)) evaluations; normally these are much cheaper computationally than evaluations of the whole function gradient f', such as performed by the gradient descent algorithm. We use the notation x* to denote a minimiser of f. For strongly convex problems this is the unique minimiser.

This setup differs from the traditional black box smooth convex optimisation problem only in that we are assuming that our function is decomposable into a finite sum structure. This finite sum structure is widespread in machine learning applications. For example, the standard framework of Empirical Risk Minimisation (ERM) takes this form, where for a loss function L : R × R → R and data label tuples (x_i, y_i), we have:

    R_emp(h) = (1/n) Σ_i L(h(x_i), y_i),

where h is the hypothesis function that we intend to optimise over. The most common case of ERM is minimisation of the negative log-likelihood, for instance the classical logistic regression problem (see standard texts such as Bishop 2006). Often ERM is an approximation to an underlying stochastic programming problem, where the summation is replaced with an expectation over a population of data tuples which we can sample from.

2.1.1 Exploiting problem structure

Given the very general nature of the finite sum structure, we can not expect to get faster convergence than we would by accessing the whole gradient, without additional assumptions. For example, suppose the summation only has one term, or alternatively each f_i is the zero function except one of them. Notice that the Lipschitz smoothness and strong convexity assumptions we made are on each f_i rather than on f. This is a key point. If the directions of maximum curvature of each term are aligned and of similar magnitude, then we can expect the term Lipschitz smoothness to be similar to the smoothness of the whole function. However, it is easy to construct problems for which this is not the case; in fact the Lipschitz constant of f may be n times smaller than that of each f_i. In that case incremental gradient methods will give no improvement over black box optimisation methods.

For machine learning problems, and particularly the empirical risk minimisation problem, this worst case behavior is not common. The curvature, and hence the Lipschitz constants, are defined largely by the loss function, which is shared between the terms, rather than the data point. Common data preprocessing methods such as data whitening can improve this even further.
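To make the finite sum structure concrete, here is a small sketch (our own illustration, with made-up data and names) of binary logistic regression written as an ERM objective, together with the single-term oracle (f_i(x), f_i'(x)) that incremental gradient methods access one index at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))          # data points a_i as rows
b = rng.choice([-1.0, 1.0], size=n)  # binary labels y_i

def f_i(i, x):
    """Single term: logistic loss f_i(x) = log(1 + exp(-b_i <a_i, x>))."""
    m = b[i] * A[i].dot(x)
    return np.log1p(np.exp(-m))

def grad_f_i(i, x):
    """Gradient of the single term f_i at x."""
    m = b[i] * A[i].dot(x)
    return -b[i] * A[i] / (1.0 + np.exp(m))

def f(x):
    """Full objective f(x) = (1/n) sum_i f_i(x); n times more expensive to evaluate."""
    return np.mean([f_i(i, x) for i in range(n)])

x = np.zeros(d)
j = rng.integers(n)
print(f(x), f_i(j, x), grad_f_i(j, x))  # one full evaluation vs one cheap oracle call
```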

The requirement that the magnitude of the Lipschitz constants be approximately balanced can be relaxed in some cases. It is possible to formulate IG methods where the convergence is stated in terms of the average of the Lipschitz constants of the f_i instead of the maximum. This is the case for the Prox-Finito algorithm described in Section 3.6. All known methods that make use of the average Lipschitz constant require knowledge of the ratios of the Lipschitz constants of the f_i terms, which limits their practicality unfortunately.

Regardless of the condition number of the problem, if we have a summation with enough terms optimisation becomes easy. This is made precise in the definition that follows.

Definition 2.1. The big data condition: For some known constant b,

    n ≥ b L/µ.

This condition obviously requires strong convexity and Lipschitz smoothness so that L/µ is well defined. It is a very strong assumption for small n, as the condition number L/µ in typical machine learning problems is at least in the thousands. For applications of this assumption, b is typically between 1 and 8. Several of the methods we describe below have a fixed and very fast convergence rate independent of the condition number when this big-data condition holds.

2.1.2 Randomness and expected convergence rates

This thesis works extensively with optimisation methods that make random decisions during the course of the algorithm. Unlike the stochastic approximation setting, we are dealing with deterministic, known optimisation problems; the stochasticity is introduced by our optimisation methods, it is not inherent in the problem. We introduce randomness because it allows us to get convergence rates faster than that of any currently known deterministic methods. The caveat is that these convergence rates are in expectation, so they do not always hold precisely.

This is not as bad as it first seems though. Determining that the expectation of a general random variable converges is normally quite a weak result, as its value may vary around the expectation substantially in practice, potentially by far more than it converges by. The reason why this is not an issue for the optimisation methods we consider is that all the random variables we bound are non-negative. A non-negative random variable X with a very small expectation, say E[X] = 10^{-5}, is with high probability close to its expectation. This is a fundamental result implied by Markov's inequality. For example, suppose E[X] = 10^{-5} and we want to bound the probability that X is greater than 10^{-3}, i.e. a factor of 100 worse than

its expectation. Then Markov's inequality tells us that:

    P(X ≥ 10^{-3}) ≤ 1/100.

So there is only a 1% chance of X being larger than 100 times its expected value here. We will largely focus on methods with linear convergence in the following chapters, so in order to increase the probability of the value X holding by a factor r, only a logarithmic number of additional iterations in r is required (O(log r)). We would also like to note that Markov's inequality can be quite conservative. Our experiments in later chapters show little in the way of random noise attributable to the optimisation procedure, particularly when the amount of data is large.

2.1.3 Data access order

The source of randomness in all the methods considered in this chapter is the order of accessing the f_i terms. By access we mean the evaluation of f_i(x) and f_i'(x) at an x of our choice. This is more formally known as an oracle evaluation (see Section 5.1), and typically constitutes the most computationally expensive part of the main loop of each algorithm we consider. The access order is defined on a per-epoch basis, where an epoch is n evaluations. Only three different access orders are considered in this work (a short sketch generating each order appears at the end of this subsection):

Cyclic: Each step with j = 1 + (k mod n). Effectively we access the f_i in the order they appear, then loop back to the beginning at the end of every epoch.

Permuted: Each epoch, j is sampled without replacement from the set of indices not accessed yet in that epoch. This is equivalent to permuting the f_i at the beginning of each epoch, then using the cyclic order within the epoch.

Randomised: The value of j is sampled uniformly at random with replacement from 1, ..., n.

The permuted terminology is our nomenclature, whereas the other two terms are standard.
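A minimal sketch (our own, purely illustrative) generating one epoch of indices under each of the three access orders:

```python
import numpy as np

def cyclic_epoch(n):
    """Cyclic: access the terms in the order they appear."""
    return list(range(n))

def permuted_epoch(n, rng):
    """Permuted: a fresh random permutation at the start of each epoch."""
    return list(rng.permutation(n))

def randomised_epoch(n, rng):
    """Randomised: sample uniformly with replacement; an epoch is n evaluations."""
    return list(rng.integers(n, size=n))

rng = np.random.default_rng(0)
print(cyclic_epoch(5), permuted_epoch(5, rng), randomised_epoch(5, rng))
```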

2.2 Early Incremental Gradient Methods

The classical incremental gradient (IG) method is simply a step of the form:

    x^{k+1} = x^k − γ_k f_j'(x^k),

where at step k we use cyclic access, taking j = 1 + (k mod n). This is similar to the more well known stochastic gradient descent, but with a cyclic order of access of the data instead of a random order. We have introduced here a superscript notation x^k for the variable x at step k. We use this notation throughout this work.

It turns out to be much easier to analyse such methods under a random access ordering. For the random order IG method (i.e. SGD) on smooth strongly convex problems, the following rate holds for appropriately chosen step sizes:

    E[f(x^k)] − f(x*) ≤ (L/2k) ‖x^0 − x*‖².

The step size scheme required is of the form γ_k = q/k, where q is a constant that depends on the gradient norm bound R as well as the degree of strong convexity µ. It may be required to be quite small in some cases. This is what is known as a sublinear rate of convergence, as the dependence on k is of the form O(L/2k), which is slower than the linear rate O((1 − a)^k) for any a ∈ (0, 1) asymptotically.

Incremental gradient methods for strongly convex smooth problems were of less practical utility in machine learning up until the development of fast variants (discussed below), as the sublinear rates for the previously known methods did not compare favourably to the (super-)linear rate of quasi-Newton methods. For non-strongly convex problems, or strongly convex but non-smooth problems, the story is quite different. In those cases, the theoretical and practical rates are hard to beat with full (sub-)gradient methods. The non-convex case is of particular interest in machine learning. SGD has been the de facto standard optimisation method for neural networks for example since the 1980s (Rumelhart et al., 1986). Such incremental gradient methods have a long history, having been applied to specific problems as far back as the 1960s (Widrow and Hoff, 1960). An up-to-date survey can be found in Bertsekas (2012).

2.3 Stochastic Dual Coordinate Descent (SDCA)

The stochastic dual coordinate descent method (Shalev-Shwartz and Zhang, 2013b) is based on the principle that for problems with explicit quadratic regularisers, the dual takes a particularly easy to work with form. Recall the finite sum structure f(x) = (1/n) Σ_{i=1}^{n} f_i(x) defined earlier. Instead of assuming that each f_i is strongly convex, we instead need to consider the regularised objective:

    f(x) = (1/n) Σ_{i=1}^{n} f_i(x) + (µ/2) ‖x‖².

For any strongly convex f_i, we may transform our function to this form by replacing each f_i with f_i − (µ/2)‖x‖², then including a separate regulariser. This changes the Lipschitz smoothness constant for each f_i to L − µ, and preserves convexity. We are now ready to consider the dual transformation. We apply the technique of dual decomposition, where we decouple the terms in our objective as follows:

    min_x f(x) = min_{x, x_1, ..., x_i, ..., x_n} (1/n) Σ_{i=1}^{n} f_i(x_i) + (µ/2)‖x‖²,
    subject to x_i = x for all i.

This reformulation initially achieves nothing, but the key idea is that we now have a constrained optimisation problem, and so we may apply Lagrangian duality (Section A.3). The Lagrangian function is:

    L(x, x_1, ..., α_1, ...) = (1/n) Σ_i f_i(x_i) + (µ/2)‖x‖² + (1/n) Σ_i ⟨α_i, x − x_i⟩
                             = (1/n) Σ_i ( f_i(x_i) − ⟨α_i, x_i⟩ ) + (µ/2)‖x‖² + ⟨(1/n) Σ_i α_i, x⟩,    (2.1)

where the α_i ∈ R^d are the introduced dual variables. The Lagrangian dual function is formed by taking the minimum of L with respect to each x_i, leaving α, the set of α_i for i = 1...n, free:

    D(α) = (1/n) Σ_i min_{x_i} { f_i(x_i) − ⟨α_i, x_i⟩ } + min_x { (µ/2)‖x‖² + ⟨(1/n) Σ_i α_i, x⟩ }.    (2.2)

Now recall that the definition of the convex conjugate (Section A.2) says that:

    min_x { f(x) − ⟨a, x⟩ } = − sup_x { ⟨a, x⟩ − f(x) } = − f*(a).

Clearly we can plug this in for each f_i to get:

    D(α) = −(1/n) Σ_i f_i*(α_i) + min_x { (µ/2)‖x‖² + ⟨(1/n) Σ_i α_i, x⟩ }.

We still need to simplify the remaining min term, which is also in the form of a convex conjugate. We know that squared norms are self-conjugate, and scaling a function by a positive constant b transforms its conjugate from f*(a) to b f*(a/b), so we in fact have:

    D(α) = −(1/n) Σ_i f_i*(α_i) − (µ/2) ‖ (1/µn) Σ_i α_i ‖².
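To spell out that last simplification (a small worked step added here for clarity; it is not in the original text but follows directly by setting the gradient of the quadratic term to zero):

    min_x { (µ/2)‖x‖² + ⟨(1/n) Σ_i α_i, x⟩ } = −(µ/2) ‖ (1/µn) Σ_i α_i ‖²,   attained at x = −(1/µn) Σ_i α_i,

which is also where the closed form primal point x^k = −(1/µn) Σ_i α_i^k used below comes from.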

Algorithm 2.1 SDCA (exact coordinate descent)

Initialise x^0 and each α_i^0 as the zero vector, for all i.

Step k + 1:

1. Pick an index j uniformly at random.

2. Update α_j, leaving the other α_i unchanged:

       α_j^{k+1} = arg min_y [ f_j*(y) + (µn/2) ‖ x^k − (1/µn)(y − α_j^k) ‖² ].

3. Update x:

       x^{k+1} = x^k − (1/µn)( α_j^{k+1} − α_j^k ).

At completion, for smooth f_i return x^k. For non-smooth, return a tail average of the x^k sequence.

This is the objective directly maximised by SDCA. As the name implies, SDCA is randomised (block) coordinate ascent on this objective, where only one α_i is changed each step. In coordinate descent we have the option of performing a gradient step in a coordinate direction, or an exact minimisation. For the exact coordinate minimisation, the update is easy to derive:

    α_j^{k+1} = arg min_{α_j} [ (1/n) Σ_i f_i*(α_i) + (µ/2) ‖ (1/µn) Σ_i α_i ‖² ]
              = arg min_{α_j} [ f_j*(α_j) + (µn/2) ‖ (1/µn) Σ_i α_i ‖² ].    (2.3)

The primal point x^k corresponding to the dual variables α^k at step k is the minimiser of the conjugate problem x^k = arg min_x { (µ/2)‖x‖² + ⟨(1/n) Σ_i α_i^k, x⟩ }, which in closed form is simply x^k = −(1/µn) Σ_i α_i^k. This can be used to further simplify Equation 2.3. The full method is Algorithm 2.1.

The SDCA method has a geometric convergence rate in the dual objective D of the form:

    E[ D(α*) − D(α^k) ] ≤ (1 − µ/(L + µn))^k [ D(α*) − D(α^0) ].

This is easily extended to a statement about the duality gap f(x^k) − D(α^k), and hence the suboptimality f(x^k) − f(x*), by using the relation:

    f(x^k) − D(α^k) ≤ ((L + µn)/(µn)) ( D(α*) − D(α^k) ).

2.3.1 Alternative steps

The full coordinate minimisation step discussed in the previous section is not always practical. If we are treating each element f_i in the summation (1/n) Σ_i f_i(x) as a single data point loss, then even for the simple binary logistic loss there is not a closed form solution for the exact coordinate step. We can use a black-box 1D optimisation method to find the coordinate minimiser, but this will generally require 20-30 exponential function evaluations, together with one vector dot product. For the multiclass logistic loss, the subproblem solve is not fast enough to yield a usable algorithm.

In the case of non-differentiable losses, the situation is better. Most non-differentiable functions we use in machine learning, such as the hinge loss, yield closed form solutions. For performance reasons we often want to treat each f_i as a minibatch loss, in which case we virtually never have a closed form solution for the subproblem, even in the non-differentiable case.

Shalev-Shwartz and Zhang (2013a) describe a number of other possible steps which lead to the same theoretical convergence rate as the exact minimisation step, but which are more usable in practice:

Interval line search: It turns out that it is sufficient to perform the minimisation in Equation 2.3 along the interval between the current dual variable α_j^k and the point u = f_j'(x^k). The update takes the form:

    s = arg min_{s ∈ [0,1]} [ f_j*( α_j^k + s(u − α_j^k) ) + (µn/2) ‖ x^k − (s/µn)(u − α_j^k) ‖² ],
    α_j^{k+1} = α_j^k + s(u − α_j^k).

Constant step: If the value of the Lipschitz smoothness constant L is known, we can calculate a conservative value for the parameter s instead of optimising over it with an interval line search. This gives an update of the form:

    α_j^{k+1} = α_j^k + s(u − α_j^k)    where    s = µn/(µn + L).

This method is much slower in practice than performing a line-search, just as a 1/L step size with gradient descent is much slower than performing a line search.
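A minimal sketch of the constant step variant just described (our own illustrative code; it reuses the binary logistic loss setup from earlier, and the step size, data and smoothness bound are made up rather than taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 100, 5, 0.1
A = rng.normal(size=(n, d))
b = rng.choice([-1.0, 1.0], size=n)
L = 0.25 * np.max(np.sum(A * A, axis=1))   # smoothness bound for the logistic terms

def grad_f_i(i, x):
    m = b[i] * A[i].dot(x)
    return -b[i] * A[i] / (1.0 + np.exp(m))

alpha = np.zeros((n, d))     # one dual vector per term
x = np.zeros(d)              # primal point, kept equal to -(1/(mu*n)) * sum_i alpha_i
s = mu * n / (mu * n + L)    # conservative constant step from the text

for k in range(20 * n):
    j = rng.integers(n)
    u = grad_f_i(j, x)                           # candidate point u = f_j'(x^k)
    new_alpha_j = alpha[j] + s * (u - alpha[j])  # constant-step dual update
    x = x - (new_alpha_j - alpha[j]) / (mu * n)  # maintain x^k = -(1/(mu n)) sum_i alpha_i
    alpha[j] = new_alpha_j

# x now approximately minimises (1/n) sum_i f_i(x) + (mu/2)||x||^2
```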

2.3.2 Reducing storage requirements

We have presented the SDCA algorithm in full generality above. This results in dual variables α_i of dimension d, for which the total storage nd can be prohibitive. In practice, the dual variables often lie on a low-dimensional subspace. This is the case with linear classifiers and regressors, where an r class problem has gradients on an r dimensional subspace. A linear classifier takes the form f_i(x) = φ_i(X_i^T x), for a fixed loss φ_i : R^r → R and data instance matrix X_i : d × r. In the simplest case X_i is just the data point duplicated as r rows. Then the dual variables α_i are r dimensional, and the x^k updates change to:

    x^k = −(1/µn) Σ_i X_i α_i,

    α_j^{k+1} = arg min_α [ φ_j*(α) + (µn/2) ‖ x^k + (1/µn) X_j ( α_j^k − α ) ‖² ].

This is the form of SDCA presented by Shalev-Shwartz and Zhang (2013a), although with the negation of our dual variables.

2.3.3 Accelerated SDCA

The SDCA method is also currently the only fast incremental gradient method to have a known accelerated variant. By acceleration, we refer to the modification of an optimisation method to improve the convergence rate by an amount greater than any constant factor. This terminology is common in optimisation, although a precise definition is not normally given.

The ASDCA method (Shalev-Shwartz and Zhang, 2013a) works by utilising the regular SDCA method as a sub-procedure. It has an outer loop, which at each step invokes SDCA on a modified problem

    x^{k+1} = arg min_x { f(x) + (λ/2) ‖x − y‖² },

where y is chosen as an over-relaxed step of the form:

    y = x^k + β(x^k − x^{k−1}),

for some known constant β. The constant λ is likewise computed from the Lipschitz smoothness and strong convexity constants. These regularised sub-problems f(x) + (λ/2)‖x − y‖² have a greater degree of strong convexity than f(x), and so individually are much faster to solve. By a careful choice of the accuracy to which they are computed, the total number of steps made between all the subproblem solves is

much smaller than would be required if regular SDCA is applied directly to f(x) to reach the same accuracy. In particular, they state that to reach an accuracy of ε in expectation for the function value, they need k iterations, where:

    k = Õ( dn + min{ √(dnL/µ), dL/µ } ) log(1/ε).

The Õ notation hides constant factors. This rate is not of the same precise form as the other convergence rates we will discuss in this chapter. We can make some general statements though. When n is in the range of the big-data condition, this rate is no better than regular SDCA's rate, and probably worse in practice due to overheads hidden by the Õ notation. When n is much smaller than L/µ, then potentially it can be much faster than regular SDCA.

Unfortunately, the ASDCA procedure has significant computational overheads that make it not necessarily the best choice in practice. Probably the biggest issue however is a sensitivity to the Lipschitz smoothness and strong convexity constants. It assumes these are known, and if the used values differ from the true values, it may be significantly slower than regular SDCA. In contrast, regular SDCA requires no knowledge of the Lipschitz smoothness constants (for the prox variant at least), just the strong convexity (regularisation) constant.

2.4 Stochastic Average Gradient (SAG)

The SAG algorithm (Schmidt et al., 2013) is the closest in form to the classical SGD algorithm among the fast incremental gradient methods. Instead of storing dual variables α_i like SDCA above, we store a table of past gradients y_i, which has the same storage cost in general, nd. The SAG method is given in Algorithm 2.2. The key equation for SAG is the step:

    x^{k+1} = x^k − (γ/n) Σ_i y_i^k.

Essentially we move in the direction of the average of the past gradients. Note that this average contains one past gradient for each term, and they are equally weighted. This can be contrasted to the SGD method with momentum, which uses a geometrically decaying weighted sum of all past gradient evaluations. SGD with momentum however is not a linearly convergent method. It is surprising that using equal weights like this actually yields a much faster converging algorithm, even though some of the gradients in the summation can be extremely out of date.

Algorithm 2.2 SAG

Initialise x^0 as the zero vector, and y_i = f_i'(x^0) for each i.

Step k + 1:

1. Pick an index j uniformly at random.

2. Update x using step length constant γ:

       x^{k+1} = x^k − (γ/n) Σ_i y_i^k.

3. Set y_j^{k+1} = f_j'(x^{k+1}). Leave y_i^{k+1} = y_i^k for i ≠ j.

SAG is an evolution of the earlier incremental averaged gradient method (IAG, Blatt et al., 2007), which has the same update with a different constant factor, and with cyclic access used instead of randomised. IAG has a more limited convergence theory covering quadratic or bounded gradient problems, and a much slower rate of convergence.

The convergence rate of SAG for strongly convex problems is of the same order as SDCA, although the constants are not quite as good. In particular, we have an expected convergence rate in terms of function value suboptimality of:

    E[ f(x^k) − f(x*) ] ≤ (1 − min{ 1/(8n), µ/(16L) })^k L_0,

where L_0 is a complex expression involving f(x^0 + γ Σ_i y_i^0) and a quadratic form of x^0 and each y_i^0. This theoretical convergence rate is between 8 and 16 times worse than SDCA. In practice SAG is often faster than SDCA though, suggesting that the SAG theory is not tight.

A nice feature of SAG is that, unlike SDCA, it can be directly applied to non-strongly convex problems. Differentiability is still required though. The convergence rate is then in terms of the average iterate x̄^k = (1/k) Σ_{l=1}^{k} x^l:

    E[ f(x̄^k) − f(x*) ] ≤ (32n/k) L_0.

The SAG algorithm has great practical performance, but it is surprisingly difficult to analyse theoretically. The above rates are likely conservative by a factor of between 4 and 8. Due to the difficulty of analysis, the proximal version for composite losses has not yet had its theoretical convergence established.
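A minimal sketch of Algorithm 2.2 (our own code; the oracle grad_f_i, the problem sizes and the step size are assumed to come from the earlier logistic regression illustration and are not prescribed by the thesis):

```python
import numpy as np

def sag(grad_f_i, n, d, gamma, num_steps, rng):
    """Stochastic Average Gradient, following Algorithm 2.2: keep one stored
    gradient y_i per term and step along their average."""
    x = np.zeros(d)
    y = np.stack([grad_f_i(i, x) for i in range(n)])  # table of past gradients
    g = y.sum(axis=0)                                 # running sum of the table
    for _ in range(num_steps):
        j = rng.integers(n)
        x = x - (gamma / n) * g          # x^{k+1} = x^k - (gamma/n) sum_i y_i^k
        new_grad = grad_f_i(j, x)        # refresh only entry j, at the new iterate
        g += new_grad - y[j]
        y[j] = new_grad
    return x

# Example usage, with names from the earlier sketch:
# x = sag(grad_f_i, n, d, gamma=1e-2, num_steps=20 * n, rng=np.random.default_rng(0))
```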

2.5 Stochastic Variance Reduced Gradient (SVRG)

Algorithm 2.3 SVRG

Initialise x^0 as the zero vector, g^0 = (1/n) Σ_i f_i'(x^0) and x̄^0 = x^0.

Step k + 1:

1. Pick j uniformly at random.

2. Update x:

       x^{k+1} = x^k − η [ f_j'(x^k) − f_j'(x̄^k) + g^k ].

3. Every m iterations, set x̄ and recalculate the full gradient at that point:

       x̄^{k+1} = x^{k+1},
       g^{k+1} = (1/n) Σ_i f_i'(x̄^{k+1}).

   Otherwise leave x̄^{k+1} = x̄^k and g^{k+1} = g^k.

At completion return x̄.

The SVRG method (Johnson and Zhang, 2013) is a recently developed fast incremental gradient method. It was developed to address the potentially high storage costs of SDCA and SAG, by trading off storage against computation. The SVRG method is given in Algorithm 2.3. Unlike the other methods discussed, there is a tunable parameter m, which specifies the number of iterations to complete before the current gradient approximation is recalibrated by computing a full gradient f'(x̄) at the last iterate before the recalibration, x̄ := x^k.

Essentially, instead of maintaining a table of past gradients y_i for each i like SAG does, the algorithm just stores the location x̄ at which those gradients should be evaluated, then re-evaluates them when needed by just computing f_j'(x̄). Like the SAG algorithm, at each step we need to know the updated term gradient f_j'(x^k), the old term gradient f_j'(x̄) and the average of the old gradients f'(x̄). Since we are not storing the old term gradient, just its average, we need to calculate two term gradients instead of the one term gradient calculated by SAG at each step.

The S2GD method (Konečný and Richtárik, 2013) was concurrently developed with SVRG. It has the same update as SVRG, just differing in the theoretical choice of x̄ discussed in the next paragraph. We use SVRG henceforth to refer to both methods.

The update x̄^{k+1} = x^{k+1} in step 3 above is technically not supported by the theory. Instead, one of the following two updates are used:

1. x̄ is the average of the x values from the last m iterations. This is the variant suggested by Johnson and Zhang (2013).

2. x̄ is a randomly sampled x from the last m iterations. This is used in the S2GD variant (Konečný and Richtárik, 2013).

These alternative updates are required theoretically as the convergence between recalibrations is expressed in terms of the average of function values of the last m points,

    (1/m) Σ_{r=k−m}^{k} [ f(x^r) − f(x*) ],

instead of in terms of f(x^k) − f(x*) directly. Variant 1 avoids this issue by using Jensen's inequality to pull the summation inside:

    (1/m) Σ_{r=k−m}^{k} [ f(x^r) − f(x*) ] ≥ f( (1/m) Σ_{r=k−m}^{k} x^r ) − f(x*).

Variant 2 uses a sampled x̄, which in expectation will also have the required value. In practice, there is a very high probability that f(x^k) − f(x*) is less than the last-m average, so just taking x̄ = x^k works.

The SVRG method has the following convergence rate if k is a multiple of m:

    E[ f(x̄^k) − f(x*) ] ≤ ρ^{k/m} [ f(x̄^0) − f(x*) ],
    where ρ = 1/(µη(1 − 4Lη)m) + 4Lη(m + 1)/((1 − 4Lη)m).

Note also that each step requires two term gradients, so the rate must be halved when comparing against the other methods described in this chapter. There is also the cost of the recalibration pass, which (depending on m) can further increase the run time to three times that of SAG per step. This convergence rate has quite a different form from that of the other methods considered in this section, making direct comparison difficult. However, for most parameter values this theoretical rate is worse than that of the other fast incremental gradient methods. In Section 4.7 we give an analysis of SVRG that requires additional assumptions, but gives a rate that is directly comparable to the other fast incremental gradient methods.
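A minimal sketch of the SVRG update loop (our own code, using the practical x̄ = x^k recalibration discussed above; grad_f_i, n and d are assumed to come from the earlier logistic regression illustration, and the step size and m are made up):

```python
import numpy as np

def svrg(grad_f_i, n, d, eta, m, num_epochs, rng):
    """SVRG following Algorithm 2.3: keep a snapshot point x_bar and its full
    gradient g, and correct each single-term gradient using them."""
    x = np.zeros(d)
    x_bar = x.copy()
    g = np.mean([grad_f_i(i, x_bar) for i in range(n)], axis=0)  # full gradient at snapshot
    for _ in range(num_epochs):
        for _ in range(m):
            j = rng.integers(n)
            # Variance reduced step: f_j'(x) - f_j'(x_bar) + g
            x = x - eta * (grad_f_i(j, x) - grad_f_i(j, x_bar) + g)
        # Recalibration every m iterations: snapshot the current iterate
        x_bar = x.copy()
        g = np.mean([grad_f_i(i, x_bar) for i in range(n)], axis=0)
    return x

# Example usage, with names from the earlier sketch:
# x = svrg(grad_f_i, n, d, eta=1e-2, m=2 * n, num_epochs=10, rng=np.random.default_rng(0))
```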