Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization


Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization

Lin Xiao
Microsoft Research, 1 Microsoft Way, Redmond, WA 98052, USA

Revised March 8, 2010

Abstract

We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the l1-norm for promoting sparsity. We develop extensions of Nesterov's dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of l1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method.

Keywords: stochastic learning, online optimization, l1-regularization, structural convex optimization, dual averaging methods, accelerated gradient methods.

1. Introduction

In machine learning, online algorithms operate by repetitively drawing random examples, one at a time, and adjusting the learning variables using simple calculations that are usually based on the single example only. The low computational complexity (per iteration) of online algorithms is often associated with their slow convergence and low accuracy in solving the underlying optimization problems. As argued by Bottou and Bousquet (2008), the combined low complexity and low accuracy, together with other tradeoffs in statistical learning theory, still make online algorithms favorite choices for solving large-scale learning problems. Nevertheless, traditional online algorithms, such as stochastic gradient descent, have limited capability of exploiting problem structure in solving regularized learning problems. As a result, their low accuracy often makes it hard to obtain the desired regularization effects, e.g., sparsity under l1-regularization.

In this paper, we develop a new class of online algorithms, the regularized dual averaging (RDA) methods, that can exploit the regularization structure more effectively in an online setting. In this section, we describe the two types of problems that we consider, and explain the motivation of our work.

1.1 Regularized Stochastic Learning

The regularized stochastic learning problems we consider are of the following form:

    minimize_w   phi(w) := E_z f(w, z) + Psi(w)                                  (1)

where w in R^n is the optimization variable (often called weights in learning problems), z = (x, y) is an input-output pair of data drawn from an (unknown) underlying distribution, f(w, z) is the loss function of using w and x to predict y, and Psi(w) is a regularization term. We assume Psi(w) is a closed convex function (Rockafellar, 1970), and its effective domain, dom Psi = {w in R^n | Psi(w) < +inf}, is closed. We also assume that f(w, z) is convex in w for each z, and it is subdifferentiable (a subgradient always exists) on dom Psi.

Examples of the loss function f(w, z) include:

- Least-squares: x in R^n, y in R, and f(w, (x, y)) = (y - w^T x)^2.
- Hinge loss: x in R^n, y in {+1, -1}, and f(w, (x, y)) = max{0, 1 - y (w^T x)}.
- Logistic regression: x in R^n, y in {+1, -1}, and f(w, (x, y)) = log(1 + exp(-y (w^T x))).

Examples of the regularization term Psi(w) include:

- l1-regularization: Psi(w) = lambda ||w||_1 with lambda > 0. With l1-regularization, we hope to get a relatively sparse solution, i.e., with many entries of the weight vector w being zeroes.
- l2-regularization: Psi(w) = (sigma/2) ||w||_2^2, with sigma > 0. When l2-regularization is used with the hinge loss function, we have the standard setup of support vector machines.
- Convex constraints: Psi(w) is the indicator function of a closed convex set C, i.e.,

      Psi(w) = I_C(w) := { 0     if w in C,
                           +inf  otherwise.

We can also consider mixed regularizations such as Psi(w) = lambda ||w||_1 + (sigma/2) ||w||_2^2. These examples cover a wide range of practical problems in machine learning.

A common approach for solving stochastic learning problems is to approximate the expected loss function phi(w) by using a finite set of independent observations z_1, ..., z_T, and solve the following problem to minimize the empirical loss:

    minimize_w   (1/T) Sum_{t=1}^T f(w, z_t) + Psi(w).                           (2)

By our assumptions, this is a convex optimization problem. Depending on the structure of particular problems, they can be solved efficiently by interior-point methods (e.g., Ferris and Munson, 2003; Koh et al., 2007), quasi-Newton methods (e.g., Andrew and Gao, 2007), or accelerated first-order methods (Nesterov, 2007; Tseng, 2008; Beck and Teboulle, 2009). However, this batch optimization approach may not scale well for very large problems: even with first-order methods, evaluating one single gradient of the objective function in (2) requires going through the whole data set.
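The loss functions and regularizers listed above are each one or two lines of code. The following sketch is my own illustration (not part of the paper), with NumPy-based conventions and function names chosen for clarity; it implements the hinge and logistic losses, one subgradient of each with respect to w, and the l1 regularizer.

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss f(w,(x,y)) = max(0, 1 - y * w^T x)."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def hinge_subgradient(w, x, y):
    """One subgradient of the hinge loss with respect to w."""
    return -y * x if y * np.dot(w, x) < 1.0 else np.zeros_like(w)

def logistic_loss(w, x, y):
    """Logistic loss f(w,(x,y)) = log(1 + exp(-y * w^T x))."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

def logistic_gradient(w, x, y):
    """Gradient of the logistic loss with respect to w."""
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

def l1_regularizer(w, lam):
    """Psi(w) = lam * ||w||_1 with lam > 0."""
    return lam * np.abs(w).sum()
```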

In this paper, we consider online algorithms that process samples sequentially as they become available. More specifically, we draw a sequence of i.i.d. samples z_1, z_2, z_3, ..., and use them to calculate a sequence w_1, w_2, w_3, .... Suppose at time t, we have the most up-to-date weight vector w_t. Whenever z_t is available, we can evaluate the loss f(w_t, z_t), and also a subgradient g_t in d f(w_t, z_t) (here d f(w, z) denotes the subdifferential of f(w, z) with respect to w). Then we compute w_{t+1} based on these information.

The most widely used online algorithm is the stochastic gradient descent (SGD) method. Consider the general case Psi(w) = I_C(w) + psi(w), where I_C(w) is a hard set constraint and psi(w) is a soft regularization. The SGD method takes the form

    w_{t+1} = Pi_C( w_t - alpha_t (g_t + xi_t) ),                                (3)

where alpha_t is an appropriate stepsize, xi_t is a subgradient of psi at w_t, and Pi_C(.) denotes Euclidean projection onto the set C. The SGD method belongs to the general scheme of stochastic approximation, which can be traced back to Robbins and Monro (1951) and Kiefer and Wolfowitz (1952). In general we are also allowed to use all previous information to compute w_{t+1}, and even second-order derivatives if the loss functions are smooth.

In a stochastic online setting, each weight vector w_t is a random variable that depends on {z_1, ..., z_{t-1}}, and so is the objective value phi(w_t). Assume an optimal solution w* to the problem (1) exists, and let phi* = phi(w*). The goal of online algorithms is to generate a sequence {w_t}_{t>=1} such that lim_{t->inf} E phi(w_t) = phi*, and hopefully with reasonable convergence rate. This is the case for the SGD method (3) if we choose the stepsize alpha_t = c/sqrt(t), where c is a positive constant. The corresponding convergence rate is O(1/sqrt(t)), which is indeed best possible for subgradient schemes with a black-box model, even in the case of deterministic optimization (Nemirovsky and Yudin, 1983). Despite such slow convergence and the associated low accuracy in the solutions (compared with batch optimization using, e.g., interior-point methods), the SGD method has been very popular in the machine learning community due to its capability of scaling with very large data sets and good generalization performances observed in practice (e.g., Bottou and LeCun, 2004; Zhang, 2004; Shalev-Shwartz et al., 2007).

Nevertheless, a main drawback of the SGD method is its lack of capability in exploiting problem structure, especially for problems with explicit regularization. More specifically, the SGD method (3) treats the soft regularization psi(w) as a general convex function, and only uses its subgradient in computing the next weight vector. In this case, we can simply lump psi(w) into f(w, z_t) and treat them as a single loss function. Although in theory the algorithm converges to an optimal solution (in expectation) as t goes to infinity, in practice it is usually stopped far before that. Even in the case of convergence in expectation, we still face (possibly big) variations in the solution due to the stochastic nature of the algorithm. Therefore, the regularization effect we hope to have by solving the problem (1) may be elusive for any particular solution generated by (3) based on finite random samples.

An important example and main motivation for this paper is l1-regularized stochastic learning, where Psi(w) = lambda ||w||_1. In the case of batch learning, the empirical minimization problem (2) can be solved to very high precision, e.g., by interior-point methods. Therefore simply rounding the weights with very small magnitudes toward zero is usually enough to produce the desired sparsity.
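A minimal sketch of the SGD update (3) discussed above, assuming for illustration that the hard constraint C is a Euclidean norm ball; the helper names and the ball constraint are my own choices, not the paper's.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto C = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def sgd_step(w, g, xi, alpha, radius):
    """One step of (3): w <- Pi_C(w - alpha * (g + xi)), where g is a
    subgradient of the loss at w and xi a subgradient of the soft
    regularization psi at w."""
    return project_l2_ball(w - alpha * (g + xi), radius)
```

With the stepsize schedule alpha_t = c/sqrt(t), repeatedly calling sgd_step gives the O(1/sqrt(t)) behavior described above.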

As a result, l1-regularization has been very effective in obtaining sparse solutions using the batch optimization approach in statistical learning (e.g., Tibshirani, 1996) and signal processing (e.g., Chen et al., 1998). In contrast, the SGD method (3) hardly generates any sparse solution, and its inherent low accuracy makes the simple rounding approach very unreliable. Several principled soft-thresholding or truncation methods have been developed to address this problem (e.g., Langford et al., 2009; Duchi and Singer, 2009), but the levels of sparsity in their solutions are still unsatisfactory compared with the corresponding batch solutions.

In this paper, we develop regularized dual averaging (RDA) methods that can exploit the structure of (1) more effectively in a stochastic online setting. More specifically, each iteration of the RDA methods takes the form

    w_{t+1} = argmin_w { (1/t) Sum_{tau=1}^t <g_tau, w> + Psi(w) + (beta_t/t) h(w) },    (4)

where h(w) is an auxiliary strongly convex function, and {beta_t}_{t>=1} is a nonnegative and nondecreasing input sequence, which determines the convergence properties of the algorithm. Essentially, at each iteration, this method minimizes the sum of three terms: a linear function obtained by averaging all previous subgradients (the dual average), the original regularization function Psi(w), and an additional strongly convex regularization term (beta_t/t) h(w).

The RDA method is an extension of the simple dual averaging scheme of Nesterov (2009), which is equivalent to letting Psi(w) be the indicator function of a closed convex set. For the RDA method to be practically efficient, we assume that the functions Psi(w) and h(w) are simple, meaning that we are able to find a closed-form solution for the minimization problem in (4). Then the computational effort per iteration is only O(n), the same as the SGD method. This assumption indeed holds in many cases. For example, if we let Psi(w) = lambda ||w||_1 and h(w) = (1/2) ||w||_2^2, then w_{t+1} has an entry-wise closed-form solution. This solution uses a much more aggressive truncation threshold than previous methods, thus results in significantly improved sparsity (see discussions in Section 5).

In terms of iteration complexity, we show that if beta_t = Theta(sqrt(t)), i.e., with order exactly sqrt(t), then the RDA method (4) has the standard convergence rate

    E phi(wbar_t) - phi* <= O( G / sqrt(t) ),

where wbar_t = (1/t) Sum_{tau=1}^t w_tau is the primal average, and G is a uniform upper bound on the norms of the subgradients g_t. If the regularization term Psi(w) is strongly convex, then setting beta_t <= O(ln t) gives a faster convergence rate O(ln t / t).

For stochastic learning problems in which the loss functions f(w, z) are all differentiable and have Lipschitz continuous gradients, we also develop an accelerated version of the RDA method that has the convergence rate

    E phi(wbar_t) - phi* <= O( L/t^2 + Q/sqrt(t) ),

where L is the Lipschitz constant of the gradients, and Q is an upper bound on the variances of the stochastic gradients. In addition to convergence in expectation, we show that the same orders of convergence rates hold with high probability.

1.2 Regularized Online Optimization

In online optimization, we use an online algorithm to generate a sequence of decisions w_t, one at a time, for t = 1, 2, 3, .... At each time t, a previously unknown cost function f_t is revealed, and we encounter a loss f_t(w_t). We assume that the cost functions f_t are convex for all t >= 1. The goal of the online algorithm is to ensure that the total cost up to each time t, Sum_{tau=1}^t f_tau(w_tau), is not much larger than min_w Sum_{tau=1}^t f_tau(w), the smallest total cost of any fixed decision w from hindsight. The difference between these two costs is called the regret of the online algorithm. Applications of online optimization include online prediction of time series and sequential investment (e.g., Cesa-Bianchi and Lugosi, 2006).

In regularized online optimization, we add a convex regularization term Psi(w) to each cost function. The regret with respect to any fixed decision w in dom Psi is

    R_t(w) := Sum_{tau=1}^t ( f_tau(w_tau) + Psi(w_tau) ) - Sum_{tau=1}^t ( f_tau(w) + Psi(w) ).    (5)

As in the stochastic setting, the online algorithm can query a subgradient g_t in d f_t(w_t) at each step, and possibly use all previous information, to compute the next decision w_{t+1}. It turns out that the simple subgradient method (3) is well suited for online optimization: with a stepsize alpha_t = Theta(1/sqrt(t)), it has a regret R_t(w) <= O(sqrt(t)) for all w in dom Psi (Zinkevich, 2003). This regret bound cannot be improved in general for convex cost functions. However, if the cost functions are strongly convex, say with convexity parameter sigma, then the same algorithm with stepsize alpha_t = 1/(sigma t) gives an O(ln t) regret bound (e.g., Hazan et al., 2006; Bartlett et al., 2008).

Similar to the discussions on regularized stochastic learning, the online subgradient method (3) in general lacks the capability of exploiting the regularization structure. In this paper, we show that the same RDA method (4) can effectively exploit such structure in an online setting, and ensure the O(sqrt(t)) regret bound with beta_t = Theta(sqrt(t)). For strongly convex regularizations, setting beta_t = O(ln t) yields the improved regret bound O(ln t). Since there are no specifications on the probability distribution of the sequence of functions, nor assumptions like mutual independence, online optimization can be considered as a more general framework than stochastic learning. In this paper, we will first establish regret bounds of the RDA method for solving online optimization problems, then use them to derive convergence rates for solving stochastic learning problems.

1.3 Outline of Contents

The methods we develop apply to more general settings than R^n with Euclidean geometry. In Section 1.4, we introduce the necessary notations and definitions associated with a general finite-dimensional real vector space. In Section 2, we present the generic RDA method for solving both the stochastic learning and online optimization problems, and give several concrete examples of the method. In Section 3, we present the precise regret bounds of the RDA method for solving regularized online optimization problems. In Section 4, we derive convergence rates of the RDA method for solving regularized stochastic learning problems. In addition to the rates of convergence in expectation, we also give associated high probability bounds.

In Section 5, we explain the connections of the RDA method to several related work, and analyze its capability of generating better sparse solutions than other methods. In Section 6, we give an enhanced version of the l1-RDA method, and present computational experiments on the MNIST handwritten digits dataset. Our experiments show that the RDA method is capable of generating sparse solutions that are comparable to those obtained by batch learning using interior-point methods. In Section 7, we discuss the RDA methods in the context of structural convex optimization and their connections to incremental subgradient methods. As an extension, we develop an accelerated version of the RDA method for stochastic optimization problems with smooth loss functions. We also discuss in detail the p-norm based RDA methods. Appendices A-D contain technical proofs of our main results.

1.4 Notations and Generalities

Let E be a finite-dimensional real vector space, endowed with a norm ||.||. This norm defines a system of balls: B(w, r) = {u in E : ||u - w|| <= r}. Let E* be the vector space of all linear functions on E, and let <s, w> denote the value of s in E* at w in E. The dual space E* is endowed with the dual norm ||s||_* = max_{||w|| <= 1} <s, w>.

A function h : E -> R union {+inf} is called strongly convex with respect to the norm ||.|| if there exists a constant sigma > 0 such that

    h(alpha w + (1 - alpha) u) <= alpha h(w) + (1 - alpha) h(u) - (sigma/2) alpha (1 - alpha) ||w - u||^2,   for all w, u in dom h.

The constant sigma is called the convexity parameter, or the modulus of strong convexity. Let rint C denote the relative interior of a convex set C (Rockafellar, 1970). If h is strongly convex with modulus sigma, then for any w in dom h and u in rint(dom h),

    h(w) >= h(u) + <s, w - u> + (sigma/2) ||w - u||^2,   for all s in d h(u).

See, e.g., Goebel and Rockafellar (2008) and Juditsky and Nemirovski (2008).

In the special case of the coordinate vector space E = R^n, we have E* = E, and the standard inner product <s, w> = s^T w = Sum_{i=1}^n s^(i) w^(i), where w^(i) denotes the i-th coordinate of w. For the standard Euclidean norm, ||w|| = ||w||_2 = sqrt(<w, w>) and ||s||_* = ||s||_2. For any w_0 in R^n, the function h(w) = (sigma/2) ||w - w_0||_2^2 is strongly convex with modulus sigma. For another example, consider the l1-norm ||w|| = ||w||_1 = Sum_{i=1}^n |w^(i)| and its associated dual norm ||s||_* = ||s||_inf = max_{1<=i<=n} |s^(i)|. Let S_n be the standard simplex in R^n, i.e., S_n = { w in R^n_+ : Sum_{i=1}^n w^(i) = 1 }. Then the negative entropy function

    h(w) = Sum_{i=1}^n w^(i) ln w^(i) + ln n,                                    (6)

with dom h = S_n, is strongly convex with respect to ||.||_1 with modulus 1 (see, e.g., Nesterov, 2005, Lemma 3). In this case, the unique minimizer of h is w_0 = (1/n, ..., 1/n).

For a closed proper convex function Psi, we use Argmin_w Psi(w) to denote the (convex) set of minimizing solutions. If a convex function h has a unique minimizer, e.g., when h is strongly convex, then we use argmin_w h(w) to denote that single point.

Algorithm 1  Regularized dual averaging (RDA) method

input:
  - an auxiliary function h(w) that is strongly convex on dom Psi and also satisfies

        argmin_w h(w) in Argmin_w Psi(w);                                        (7)

  - a nonnegative and nondecreasing sequence {beta_t}_{t>=1}.

initialize: set w_1 = argmin_w h(w) and gbar_0 = 0.

for t = 1, 2, 3, ... do
  1. Given the function f_t, compute a subgradient g_t in d f_t(w_t).
  2. Update the average subgradient:

        gbar_t = ((t - 1)/t) gbar_{t-1} + (1/t) g_t.

  3. Compute the next weight vector:

        w_{t+1} = argmin_w { <gbar_t, w> + Psi(w) + (beta_t/t) h(w) }.           (8)
end for

2. Regularized Dual Averaging Method

In this section, we present the generic RDA method (Algorithm 1) for solving regularized stochastic learning and online optimization problems, and give several concrete examples. To unify notation, we use f_t(w) to denote the cost function at each step t. For stochastic learning problems, we simply let f_t(w) = f(w, z_t).

At the input to the RDA method, we need an auxiliary function h that is strongly convex on dom Psi. The condition (7) requires that its unique minimizer must also minimize the regularization function Psi. This can be done, e.g., by first choosing a starting point w_0 in Argmin_w Psi(w) and an arbitrary strongly convex function h_0(w), then letting

    h(w) = h_0(w) - h_0(w_0) - <grad h_0(w_0), w - w_0>.

In other words, h(w) is the Bregman divergence from w_0 induced by h_0(w). If h_0 is not differentiable, but subdifferentiable at w_0, we can replace grad h_0(w_0) with a subgradient. The input sequence {beta_t} determines the convergence rate, or regret bound, of the algorithm.

There are three steps in each iteration of the RDA method. Step 1 is to compute a subgradient of f_t at w_t, which is standard for all subgradient or gradient based methods. Step 2 is the online version of computing the average subgradient:

    gbar_t = (1/t) Sum_{tau=1}^t g_tau.

The name dual averaging comes from the fact that the subgradients live in the dual space E*.
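A sketch of the generic loop of Algorithm 1 in Python (my own rendering, not the paper's code). The step-3 minimization is delegated to a user-supplied solver, which is assumed to return the exact minimizer of <gbar_t, w> + Psi(w) + (beta_t/t) h(w); the initialization assumes argmin_w h(w) = 0, as in h(w) = (1/2)||w||_2^2.

```python
import numpy as np

def rda(subgradient_oracle, solve_step, n, T, beta):
    """Generic RDA loop (Algorithm 1).

    subgradient_oracle(t, w) -> g_t, a subgradient of f_t at w.
    solve_step(g_bar, beta_t, t) -> argmin_w { <g_bar, w> + Psi(w) + (beta_t/t)*h(w) }.
    beta(t) -> the nonnegative, nondecreasing input sequence beta_t.
    """
    w = np.zeros(n)           # assumes argmin_w h(w) = 0
    g_bar = np.zeros(n)
    iterates = [w]
    for t in range(1, T + 1):
        g = subgradient_oracle(t, w)            # step 1
        g_bar = ((t - 1) * g_bar + g) / t       # step 2: running average of subgradients
        w = solve_step(g_bar, beta(t), t)       # step 3: closed-form or simple minimization
        iterates.append(w)
    return iterates
```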

Step 3 is most interesting and worth further explanation. In particular, the efficiency in computing w_{t+1} determines how useful the method is in practice. For this reason, we assume the regularization functions Psi(w) and h(w) are simple. This means the minimization problem in (8) can be solved with little effort, especially if we are able to find a closed-form solution for w_{t+1}. At first sight, this assumption seems to be quite restrictive. However, the examples below show that this indeed is the case for many important learning problems in practice.

2.1 RDA Methods with General Convex Regularization

For a general convex regularization Psi, we can choose any positive sequence {beta_t} that is order exactly sqrt(t), to obtain an O(1/sqrt(t)) convergence rate for stochastic learning, or an O(sqrt(t)) regret bound for online optimization. We will state the formal convergence theorems in Sections 3 and 4. Here, we give several concrete examples. To be more specific, we choose a parameter gamma > 0 and use the sequence

    beta_t = gamma sqrt(t),   t = 1, 2, 3, ....

Nesterov's dual averaging method. Let Psi(w) be the indicator function of a closed convex set C. This recovers the simple dual averaging scheme in Nesterov (2009). If we choose h(w) = (1/2) ||w||_2^2, then the equation (8) yields

    w_{t+1} = Pi_C( -(sqrt(t)/gamma) gbar_t ) = Pi_C( -(1/(gamma sqrt(t))) Sum_{tau=1}^t g_tau ).    (9)

When C = { w in R^n : ||w||_1 <= delta } for some delta > 0, we have "hard" l1-regularization. In this case, although there is no closed-form solution for w_{t+1}, efficient algorithms for projection onto the l1-ball can be found, e.g., in Duchi et al. (2008).

Soft l1-regularization. Let Psi(w) = lambda ||w||_1 for some lambda > 0, and h(w) = (1/2) ||w||_2^2. In this case, w_{t+1} has a closed-form solution (see Appendix A for the derivation):

    w_{t+1}^(i) = { 0                                                        if |gbar_t^(i)| <= lambda,
                    -(sqrt(t)/gamma) ( gbar_t^(i) - lambda sgn(gbar_t^(i)) ) otherwise,
                                                                             i = 1, ..., n.          (10)

Here sgn(.) is the sign or signum function, i.e., sgn(omega) equals 1 if omega > 0, -1 if omega < 0, and 0 if omega = 0. Whenever a component of gbar_t is less than lambda in magnitude, the corresponding component of w_{t+1} is set to zero. Further extensions of the l1-RDA method, and associated computational experiments, are given in Section 6.

Exponentiated dual averaging method. Let Psi(w) be the indicator function of the standard simplex S_n, and h(w) be the negative entropy function defined in (6). In this case,

    w_{t+1}^(i) = (1/Z_{t+1}) exp( -(sqrt(t)/gamma) gbar_t^(i) ),   i = 1, ..., n,

where Z_{t+1} is a normalization parameter such that Sum_{i=1}^n w_{t+1}^(i) = 1. This is the dual averaging version of the exponentiated gradient algorithm (Kivinen and Warmuth, 1997); see also Tseng and Bertsekas (1993) and Juditsky et al. (2005). We note that this example is also covered by Nesterov's dual averaging method. We discuss in detail the special case of p-norm RDA methods in Section 7.2.
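The entry-wise solution (10) for soft l1-regularization is a short vectorized computation; the sketch below is my own illustration and can serve as the solve_step of the generic loop given earlier, for the choice Psi(w) = lambda ||w||_1 and h(w) = (1/2)||w||_2^2.

```python
import numpy as np

def l1_rda_step(g_bar, lam, gamma, t):
    """Closed-form solution (10): zero out coordinates with |g_bar_i| <= lam,
    shrink the remaining ones by lam, and scale by -sqrt(t)/gamma."""
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -(np.sqrt(t) / gamma) * shrunk
```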

Several other examples, including l2-norm and a hybrid l1/l2-norm (Berhu) regularization, also admit closed-form solutions for w_{t+1}. Their solutions are similar in form to those obtained in the context of the FOBOS algorithm in Duchi and Singer (2009).

2.2 RDA Methods with Strongly Convex Regularization

If the regularization term Psi(w) is strongly convex, we can use any nonnegative and nondecreasing sequence {beta_t} that grows no faster than O(ln t), to obtain an O(ln t / t) convergence rate for stochastic learning, or an O(ln t) regret bound for online optimization. For simplicity, in the following examples, we use the zero sequence beta_t = 0 for all t >= 1. In this case, we do not need the auxiliary function h(w), and the equation (8) becomes

    w_{t+1} = argmin_w { <gbar_t, w> + Psi(w) }.

l2-regularization. Let Psi(w) = (sigma/2) ||w||_2^2 for some sigma > 0. In this case,

    w_{t+1} = -(1/sigma) gbar_t = -(1/(sigma t)) Sum_{tau=1}^t g_tau.

Mixed l1/l2-regularization. Let Psi(w) = lambda ||w||_1 + (sigma/2) ||w||_2^2 with lambda > 0 and sigma > 0. In this case, we have

    w_{t+1}^(i) = { 0                                                   if |gbar_t^(i)| <= lambda,
                    -(1/sigma) ( gbar_t^(i) - lambda sgn(gbar_t^(i)) )  otherwise,
                                                                        i = 1, ..., n.

Kullback-Leibler (KL) divergence regularization. Let Psi(w) = sigma D_KL(w || p), where the given probability distribution p in rint S_n, and

    D_KL(w || p) := Sum_{i=1}^n w^(i) ln( w^(i) / p^(i) ).

Here D_KL(w || p) is strongly convex with respect to ||w||_1 with modulus 1. In this case,

    w_{t+1}^(i) = ( p^(i) / Z_{t+1} ) exp( -(1/sigma) gbar_t^(i) ),

where Z_{t+1} is a normalization parameter such that Sum_{i=1}^n w_{t+1}^(i) = 1. KL divergence regularization has the pseudo-sparsity effect, meaning that most elements in w can be replaced by elements in the constant vector p without significantly increasing the loss function (e.g., Bradley and Bagnell, 2009).
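With strongly convex Psi and beta_t = 0, the updates above are again closed-form. The short sketch below (my own illustration, not from the paper) implements the l2 and mixed l1/l2 cases.

```python
import numpy as np

def l2_rda_step(g_bar, sigma):
    """Psi(w) = (sigma/2)*||w||_2^2, beta_t = 0:  w_{t+1} = -(1/sigma) * gbar_t."""
    return -g_bar / sigma

def mixed_l1_l2_rda_step(g_bar, lam, sigma):
    """Psi(w) = lam*||w||_1 + (sigma/2)*||w||_2^2, beta_t = 0:
    soft-threshold gbar_t at lam, then scale by -1/sigma."""
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -shrunk / sigma
```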

3. Regret Bounds for Online Optimization

In this section, we give the precise regret bounds of the RDA method for solving regularized online optimization problems. The convergence rates for stochastic learning problems can be established based on these regret bounds, and will be given in the next section. For clarity, we gather here the general assumptions used throughout this paper:

- The regularization term Psi(w) is a closed proper convex function, and dom Psi is closed. The symbol sigma is dedicated to the convexity parameter of Psi. Without loss of generality, we assume min_w Psi(w) = 0.
- For each t >= 1, the function f_t(w) is convex and subdifferentiable on dom Psi.
- The function h(w) is strongly convex on dom Psi, and subdifferentiable on rint(dom Psi). Without loss of generality, assume h(w) has convexity parameter 1 and min_w h(w) = 0.

We will not repeat these general assumptions when stating our formal results later. To facilitate regret analysis, we first give a few definitions. For any constant D > 0, we define the set

    F_D := { w in dom Psi : h(w) <= D^2 },

and let

    Gamma_D := sup_{w in F_D} inf_{g in d Psi(w)} ||g||_*.                       (11)

We use the convention inf_{g in emptyset} ||g||_* = +inf. As a result, if Psi is not subdifferentiable everywhere on F_D, i.e., if d Psi(w) = emptyset at some w in F_D, then we have Gamma_D = +inf. Note that Gamma_D is not a Lipschitz-type constant, which would be required to be an upper bound on all the subgradients; instead, we only require that at least one subgradient is bounded in norm by Gamma_D at every point in the set F_D.

We assume that the sequence of subgradients {g_t}_{t>=1} generated by Algorithm 1 is bounded, i.e., there exists a constant G such that

    ||g_t||_* <= G,   for all t >= 1.                                            (12)

This is true, for example, if dom Psi is compact and each f_t has Lipschitz-continuous gradient on dom Psi. We require that the input sequence {beta_t}_{t>=1} be chosen such that

    max{ sigma, beta_1 } > 0,                                                    (13)

where sigma is the convexity parameter of Psi(w). For convenience, we let beta_0 = max{sigma, beta_1} and define the sequence of regret bounds

    Delta_t := beta_t D^2 + (G^2/2) Sum_{tau=0}^{t-1} 1/(sigma tau + beta_tau) + (beta_0 - beta_1) G^2 / (2(beta_1 + sigma)),   t = 1, 2, 3, ...,   (14)

where D is the constant used in the definition of F_D. We could always set beta_1 >= sigma, so that beta_0 = beta_1 and therefore the term (beta_0 - beta_1) G^2 / (2(beta_1 + sigma)) vanishes in the definition (14). However, when sigma > 0, we would like to keep the flexibility of setting beta_t = 0 for all t >= 1, as we did in Section 2.2.

Theorem 1  Let the sequences {w_tau}_{tau>=1} and {g_tau}_{tau>=1} be generated by Algorithm 1, and assume (12) and (13) hold. Then for any t >= 1 and any w in F_D, we have:

(a) The regret defined in (5) is bounded by Delta_t, i.e.,

    R_t(w) <= Delta_t.                                                           (15)

(b) The primal variables are bounded as

    ||w_{t+1} - w||^2 <= (2/(sigma t + beta_t)) ( Delta_t - R_t(w) ).            (16)

(c) If w is an interior point, i.e., B(w, r) is contained in F_D for some r > 0, then

    ||gbar_t||_* <= Gamma_D + sigma r + (1/(r t)) ( Delta_t - R_t(w) ).          (17)

In Theorem 1, the bounds on ||w_{t+1} - w|| and ||gbar_t||_* depend on the regret R_t(w). More precisely, they depend on Delta_t - R_t(w), which is the slack of the regret bound in (15). A smaller slack is equivalent to a larger regret R_t(w), which means w is a better fixed solution for the online optimization problem (the best one gives the largest regret); correspondingly, the inequality (16) gives a tighter bound on ||w_{t+1} - w||. In (17), the left-hand side ||gbar_t||_* does not depend on any particular interior point w to compare with, but the right-hand side depends on both R_t(w) and how far w is from the boundary of F_D. The tightest bound on ||gbar_t||_* can be obtained by taking the infimum of the right-hand side over all w in int F_D. We further elaborate on part (c) through the following two examples:

- Consider the case when Psi is the indicator function of a closed convex set C. In this case, sigma = 0 and d Psi(w) is the normal cone to C at w (Rockafellar, 1970, Section 23). By the definition (11), we have Gamma_D = 0 because the zero vector is a subgradient at every w in C, even though the normal cones can be unbounded at the boundary of C. In this case, if B(w, r) is contained in F_D for some r > 0, then (17) simplifies to

      ||gbar_t||_* <= (1/(r t)) ( Delta_t - R_t(w) ).

- Consider the function Psi(w) = sigma D_KL(w || p) with dom Psi = S_n (assuming p in rint S_n). In this case, dom Psi, and hence F_D, have empty interior. Therefore the bound in part (c) does not apply. In fact, the quantity Gamma_D can be unbounded anyway. In particular, the subdifferentials of Psi at the relative boundary of S_n are all empty. In the relative interior of S_n, the subgradients (actually gradients) of Psi always exist, but can become unbounded for points approaching the relative boundary. Nevertheless, the bounds in parts (a) and (b) still hold.

The proof of Theorem 1 is given in Appendix B. In the rest of this section, we discuss more concrete regret bounds depending on whether or not Psi is strongly convex.

3.1 Regret Bound with General Convex Regularization

For a general convex regularization term Psi, any nonnegative and nondecreasing sequence beta_t = Theta(sqrt(t)) gives an O(sqrt(t)) regret bound. Here we give detailed analysis for the sequence used in Section 2.1. More specifically, we choose a constant gamma > 0 and let

    beta_t = gamma sqrt(t),   for all t >= 1.                                    (18)

We have the following corollary of Theorem 1.

Corollary 2  Let the sequences {w_tau}_{tau>=1} and {g_tau}_{tau>=1} be generated by Algorithm 1 using {beta_t} defined in (18), and assume (12) holds. Then for any t >= 1 and any w in F_D:

(a) The regret is bounded as

    R_t(w) <= ( gamma D^2 + G^2/gamma ) sqrt(t).

(b) The primal variables are bounded as

    ||w_{t+1} - w||^2 <= 2 D^2 + 2 G^2/gamma^2 - (2/(gamma sqrt(t))) R_t(w).

(c) If w is an interior point, i.e., B(w, r) is contained in F_D for some r > 0, then

    ||gbar_t||_* <= Gamma_D + (1/(r sqrt(t))) ( gamma D^2 + G^2/gamma ) - (1/(r t)) R_t(w).

Proof  To simplify regret analysis, let gamma >= sigma. Therefore beta_0 = beta_1 = gamma. Then Delta_t defined in (14) becomes

    Delta_t = gamma sqrt(t) D^2 + (G^2/(2 gamma)) ( 1 + Sum_{tau=1}^{t-1} 1/sqrt(tau) ).

Next using the inequality

    Sum_{tau=1}^{t-1} 1/sqrt(tau) <= 1 + Integral_1^{t-1} (1/sqrt(tau)) d tau <= 2 sqrt(t) - 1,

we get

    Delta_t <= gamma sqrt(t) D^2 + (G^2/(2 gamma)) 2 sqrt(t) = ( gamma D^2 + G^2/gamma ) sqrt(t).

Combining the above inequality and the conclusions of Theorem 1 proves the corollary.

The regret bound in Corollary 2 is essentially the same as for the online gradient descent method of Zinkevich (2003), which has the form (3), with the stepsize alpha_t = 1/(gamma sqrt(t)). The main advantage of the RDA method is its capability of exploiting the regularization structure, as shown in Section 2.

The parameters D and G are not used explicitly in the algorithm. However, we need good estimates of them for choosing a reasonable value for gamma. The best gamma that minimizes the expression gamma D^2 + G^2/gamma is

    gamma* = G/D,

which leads to the simplified regret bound

    R_t(w) <= 2 G D sqrt(t).

If the total number of online iterations T is known in advance, then using a constant stepsize in the classical gradient method (3), say

    alpha_t = 1/(gamma sqrt(T)) = D/(G sqrt(T)),   t = 1, ..., T,                (19)

gives a slightly improved bound R_T(w) <= sqrt(2) G D sqrt(T) (see, e.g., Nemirovski et al., 2009).

The bound in part (b) does not converge to zero. This result is still interesting because there is no special caution taken in the RDA method, more specifically in (8), to ensure the boundedness of the sequence w_t. In the case Psi(w) = 0, as pointed out by Nesterov (2009), this may even look surprising since we are minimizing over E the sum of a linear function and a regularization term (gamma/sqrt(t)) h(w) that eventually goes to zero.

Part (c) gives a bound on the norm of the dual average. If Psi(w) is the indicator function of a closed convex set, then Gamma_D = 0 and part (c) shows that ||gbar_t||_* actually converges to zero if there exists an interior w in F_D such that R_t(w) >= 0. However, a properly scaled version of gbar_t, namely -(sqrt(t)/gamma) gbar_t, tracks the optimal solution; see the examples in Section 2.1.

3.2 Regret Bounds with Strongly Convex Regularization

If the regularization function Psi(w) is strongly convex, i.e., with a convexity parameter sigma > 0, then any nonnegative, nondecreasing sequence that satisfies beta_t <= O(ln t) will give an O(ln t) regret bound. If {beta_t} is not the all zero sequence, we can simply choose the auxiliary function h(w) = (1/sigma) Psi(w). Here are several possibilities:

- Positive constant sequences. For simplicity, let beta_t = sigma for t >= 1. In this case,

      Delta_t = sigma D^2 + (G^2/(2 sigma)) ( 1 + Sum_{tau=1}^{t-1} 1/(tau + 1) ) <= sigma D^2 + (G^2/(2 sigma)) (1 + ln t).

- Logarithmic sequences. Let beta_t = sigma (1 + ln t) for t >= 1. In this case, beta_0 = beta_1 = sigma and

      Delta_t = sigma (1 + ln t) D^2 + (G^2/(2 sigma)) ( 1 + Sum_{tau=1}^{t-1} 1/(tau + 1 + ln tau) ) <= ( sigma D^2 + G^2/(2 sigma) ) (1 + ln t).

- The zero sequence. Let beta_t = 0 for t >= 1. In this case, beta_0 = sigma and

      Delta_t = (G^2/2) ( 1/sigma + Sum_{tau=1}^{t-1} 1/(sigma tau) ) + G^2/2 <= (G^2/(2 sigma)) (6 + ln t).    (20)

  Notice that in this last case, the regret bound Delta_t does not depend on D.
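Each of the three Delta_t bounds above relies on an elementary harmonic-sum estimate; the following short derivation is added here for completeness (it is not spelled out in the text above) and records the bounds being used.

```latex
\sum_{\tau=1}^{t-1} \frac{1}{\tau+1}
  \;\le\; \int_{1}^{t} \frac{d\tau}{\tau}
  \;=\; \ln t,
\qquad\qquad
\sum_{\tau=1}^{t-1} \frac{1}{\tau}
  \;\le\; 1 + \int_{1}^{t-1} \frac{d\tau}{\tau}
  \;\le\; 1 + \ln t .
```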

When Psi is strongly convex, we also conclude that, given two different points u and v, the regrets R_t(u) and R_t(v) cannot be nonnegative simultaneously if t is large enough. To see this, we notice that if R_t(u) and R_t(v) are nonnegative simultaneously for some t, then part (b) of Theorem 1 implies

    ||w_{t+1} - u||^2 <= O( (ln t)/t )   and   ||w_{t+1} - v||^2 <= O( (ln t)/t ),

which again implies

    ||u - v||^2 <= 2 ( ||w_{t+1} - u||^2 + ||w_{t+1} - v||^2 ) <= O( (ln t)/t ).

Therefore, if the event that R_t(u) and R_t(v) are both nonnegative happens for infinitely many t, we must have u = v. If u is different from v, then eventually at least one of the regrets associated with them will become negative. However, it is possible to construct sequences of functions f_t such that the points with nonnegative regrets do not converge to a fixed point.

4. Convergence Rates for Stochastic Learning

In this section, we give convergence rates of the RDA method when it is used to solve the regularized stochastic learning problem (1), and also the related high probability bounds. These rates and bounds are established not for the individual w_t's generated by the RDA method, but rather for the primal average

    wbar_t = (1/t) Sum_{tau=1}^t w_tau,   t >= 1.

4.1 Rate of Convergence in Expectation

Theorem 3  Assume there exists an optimal solution w* to the problem (1) that satisfies h(w*) <= D^2 for some D > 0, and let phi* = phi(w*). Let the sequences {w_tau}_{tau>=1} and {g_tau}_{tau>=1} be generated by Algorithm 1, and assume (12) holds. Then for any t >= 1, we have:

(a) The expected cost associated with the random variable wbar_t is bounded as

    E phi(wbar_t) - phi* <= (1/t) Delta_t.

(b) The primal variables are bounded as

    E ||w_{t+1} - w*||^2 <= (2/(sigma t + beta_t)) Delta_t.

(c) If w* is an interior point, i.e., B(w*, r) is contained in F_D for some r > 0, then

    E ||gbar_t||_* <= Gamma_D + sigma r + (1/(r t)) Delta_t.

Proof  First, we substitute all f_tau(.) by f(., z_tau) in the definition of the regret

    R_t(w*) = Sum_{tau=1}^t ( f(w_tau, z_tau) + Psi(w_tau) ) - Sum_{tau=1}^t ( f(w*, z_tau) + Psi(w*) ).

Let z[t] denote the collection of i.i.d. random variables (z_1, ..., z_t). All the expectations in Theorem 3 are taken with respect to z[t], i.e., the symbol E can be written more explicitly as E_{z[t]}. We note that the random variable w_tau, where tau <= t, is a function of (z_1, ..., z_{tau-1}), and is independent of (z_tau, ..., z_t). Therefore

    E_{z[t]} ( f(w_tau, z_tau) + Psi(w_tau) ) = E_{z[tau-1]} ( E_{z_tau} f(w_tau, z_tau) + Psi(w_tau) ) = E_{z[tau-1]} phi(w_tau) = E_{z[t]} phi(w_tau),

and

    E_{z[t]} ( f(w*, z_tau) + Psi(w*) ) = E_{z_tau} f(w*, z_tau) + Psi(w*) = phi(w*) = phi*.

Since phi* = phi(w*) = min_w phi(w), we have

    E_{z[t]} R_t(w*) = Sum_{tau=1}^t E_{z[t]} phi(w_tau) - t phi* >= 0.           (21)

By convexity of phi, we have

    phi(wbar_t) = phi( (1/t) Sum_{tau=1}^t w_tau ) <= (1/t) Sum_{tau=1}^t phi(w_tau).

Taking expectation with respect to z[t] and subtracting phi*, we have

    E_{z[t]} phi(wbar_t) - phi* <= (1/t) Sum_{tau=1}^t E_{z[t]} phi(w_tau) - phi* = (1/t) E_{z[t]} R_t(w*).

Then part (a) follows from that of Theorem 1, which states that R_t(w*) <= Delta_t for all realizations of z[t]. Similarly, parts (b) and (c) follow from those of Theorem 1 and (21).

Specific convergence rates can be obtained in parallel with the regret bounds discussed in Sections 3.1 and 3.2. We only need to divide every regret bound by t to obtain the corresponding rate of convergence in expectation. More specifically, using appropriate sequences {beta_t}, we have E phi(wbar_t) converging to phi* with rate O(1/sqrt(t)) for general convex regularization, and O(ln t / t) for strongly convex regularization.

The bound in part (b) applies to both the case sigma = 0 and the case sigma > 0. For the latter, we can derive a slightly different and more specific bound. When Psi has convexity parameter sigma > 0, so is the function phi. Therefore,

    phi(w) >= phi(w*) + <s, w - w*> + (sigma/2) ||w - w*||^2,   for all s in d phi(w*).

Since w* is the minimizer of phi, we must have 0 in d phi(w*) (Rockafellar, 1970, Section 27). Setting s = 0 in the above inequality and rearranging terms, we have

    ||w - w*||^2 <= (2/sigma) ( phi(w) - phi* ).

Taking expectation of both sides of the above inequality, with w = wbar_t, leads to

    E ||wbar_t - w*||^2 <= (2/sigma) ( E phi(wbar_t) - phi* ) <= (2/(sigma t)) Delta_t,    (22)

where in the last step we used part (a) of Theorem 3. This bound directly relates wbar_t to the regret bound Delta_t. Next we take a closer look at the quantity E ||wbar_t - w*||^2. By convexity of the squared norm, we have

    E ||wbar_t - w*||^2 <= (1/t) Sum_{tau=1}^t E ||w_tau - w*||^2.               (23)

If sigma = 0, then the right-hand side is simply bounded by a constant, because each E ||w_tau - w*||^2 for tau <= t is bounded by a constant. When sigma > 0, the optimal solution w* is unique, and we have:

Corollary 4  If Psi is strongly convex with convexity parameter sigma > 0 and beta_t = O(ln t), then

    E ||wbar_t - w*||^2 <= O( (ln t)^2 / t ).

Proof  For the ease of presentation, we consider the case beta_t = 0 for all t >= 1. Substituting the bound on Delta_t in (20) into the inequality in part (b) of Theorem 3 gives

    E ||w_{t+1} - w*||^2 <= (6 + ln t) G^2 / (sigma^2 t),   for all t >= 1.

Then by (23),

    E ||wbar_t - w*||^2 <= (1/t) Sum_{tau=1}^t (6 + ln tau) G^2 / (sigma^2 tau) <= (6 + ln t)(1 + ln t) G^2 / (sigma^2 t).

In other words, E ||wbar_t - w*||^2 converges to zero with rate O((ln t)^2 / t). This can be shown for any beta_t = O(ln t); see Section 3.2 for other choices of beta_t.

As a further note, the conclusions in Theorem 3 still hold if the assumption (12) is weakened to

    E ||g_t||_*^2 <= G^2,   for all t >= 1.                                      (24)

However, we need (12) in order to prove the high probability bounds presented next.

4.2 High Probability Bounds

For stochastic learning problems, in addition to the rates of convergence in expectation, it is often desirable to obtain confidence level bounds for approximate solutions. For this purpose, we start from part (a) of Theorem 3, which states E phi(wbar_t) - phi* <= Delta_t / t. By Markov's inequality, we have for any epsilon > 0,

    Prob( phi(wbar_t) - phi* > epsilon ) <= Delta_t / (t epsilon).               (25)

This bound holds even with the weakened assumption (24). However, it is possible to have much tighter bounds under more restrictive assumptions. To this end, we have the following result.

Theorem 5  Assume there exist constants D and G such that h(w*) <= D^2, and h(w_t) <= D^2 and ||g_t||_* <= G for all t >= 1. Then for any delta in (0, 1), we have, with probability at least 1 - delta,

    phi(wbar_t) - phi* <= Delta_t/t + 8 G D sqrt(ln(1/delta)) / sqrt(t),   for all t >= 1.    (26)

Theorem 5 is proved in Appendix C. From our results in Section 3.1, with the input sequence beta_t = gamma sqrt(t) for all t >= 1, we have Delta_t = O(sqrt(t)) regardless of sigma = 0 or sigma > 0. Therefore, phi(wbar_t) - phi* = O(1/sqrt(t)) with high probability. To simplify further discussion, let gamma = G/D, hence Delta_t <= 2 G D sqrt(t) (see Section 3.1). In this case, if delta <= 1/e, i.e., about 0.368, then with probability at least 1 - delta,

    phi(wbar_t) - phi* <= 10 G D sqrt(ln(1/delta)) / sqrt(t).

Letting epsilon = 10 G D sqrt(ln(1/delta)) / sqrt(t), the above bound is equivalent to

    Prob( phi(wbar_t) - phi* > epsilon ) <= exp( - t epsilon^2 / (10 G D)^2 ),

which is much tighter than the one in (25). It follows that for any chosen accuracy epsilon and 0 < delta <= 1/e, the sample size

    t >= (10 G D)^2 ln(1/delta) / epsilon^2

guarantees that, with probability at least 1 - delta, wbar_t is an epsilon-optimal solution of the original stochastic optimization problem (1).

When Psi is strongly convex (sigma > 0), our results in Section 3.2 show that we can obtain regret bounds Delta_t = O(ln t) using beta_t = O(ln t). However, the high probability bound in Theorem 5 does not improve: we still have phi(wbar_t) - phi* = O(1/sqrt(t)), not O(ln t / t). The reason is that the concentration inequality (Azuma, 1967) used in proving Theorem 5 cannot take advantage of the strong-convexity property. By using a refined concentration inequality due to Freedman (1975), Kakade and Tewari (2009, Theorem 2) showed that for strongly convex stochastic learning problems, with probability at least 1 - 4 delta ln t,

    phi(wbar_t) - phi* <= R_t(w*)/t + (4/t) sqrt( R_t(w*) G^2 ln(1/delta) / sigma ) + max{ 16 G^2/sigma, 6 B } ln(1/delta) / t.

In our context, the constant B is an upper bound on f(w, z) + Psi(w) for w in F_D. Using the regret bound R_t(w*) <= Delta_t, this gives

    phi(wbar_t) - phi* <= Delta_t/t + O( sqrt( Delta_t ln(1/delta) )/t + ln(1/delta)/t ).

Here the constants hidden in the O-notation are determined by G, sigma and D. Plugging in Delta_t = O(ln t), we have phi(wbar_t) - phi* = O(ln t / t) with high probability. The additional penalty of getting the high probability bound, compared with the rate of convergence in expectation, is only O(sqrt(ln t)/t).

5. Related Work

As we pointed out in Section 2.1, if Psi is the indicator function of a convex set C, then the RDA method recovers the simple dual averaging scheme in Nesterov (2009). This special case also belongs to a more general primal-dual algorithmic framework developed by Shalev-Shwartz and Singer (2006), which can be expressed equivalently in our notation:

    w_{t+1} = argmin_{w in C} { (1/(gamma sqrt(t))) Sum_{tau=1}^t <d_tau, w> + h(w) },

where (d_1, ..., d_t) is the set of dual variables that can be chosen at time t. The simple dual averaging scheme (9) is in fact the passive extreme of their framework in which the dual variables are simply chosen as the subgradients and do not change over time, i.e.,

    d_tau = g_tau,   for all tau <= t,   for all t >= 1.                         (27)

However, with the addition of a general regularization term Psi(w) as in (4), the convergence analysis and O(sqrt(t)) regret bound of the RDA method do not follow directly as corollaries of either Nesterov (2009) or Shalev-Shwartz and Singer (2006). Our analysis in Appendix B extends the framework of Nesterov (2009).

Shalev-Shwartz and Kakade (2009) extended the primal-dual framework of Shalev-Shwartz and Singer (2006) to strongly convex functions and obtained an O(ln t) regret bound. In the context of this paper, their algorithm takes the form

    w_{t+1} = argmin_{w in C} { (1/(sigma t)) Sum_{tau=1}^t <d_tau, w> + h(w) },

where sigma is the convexity parameter of Psi, and h(w) = (1/sigma) Psi(w). The passive extreme of this method, with the dual variables chosen in (27), is equivalent to a special case of the RDA method with beta_t = 0 for all t >= 1.

Other than improving the iteration complexity, the idea of treating the regularization explicitly in each step of a subgradient-based method (instead of lumping it together with the loss function and taking their subgradients) is mainly motivated by practical considerations, such as obtaining sparse solutions. In the case of l1-regularization, this leads to soft-thresholding types of algorithms, in both batch learning (e.g., Figueiredo et al., 2007; Wright et al., 2009; Bredies and Lorenz, 2008; Beck and Teboulle, 2009) and the online setting (e.g., Langford et al., 2009; Duchi and Singer, 2009; Shalev-Shwartz and Tewari, 2009). Most of these algorithms can be viewed as extensions of classical gradient methods (including mirror-descent methods) in which the new iterate is obtained by stepping from the current iterate along a single subgradient, and then followed by a truncation. Other types of algorithms include an interior-point based stochastic approximation scheme by Carbonetto et al. (2009), and Balakrishnan and Madigan (2008), where a modified shrinkage algorithm is developed based on sequential quadratic approximations of the loss function.

The main point of this paper is to show that dual-averaging based methods can be more effective in exploiting the regularization structure, especially in a stochastic or online setting. To demonstrate this point, we compare the RDA method with the FOBOS method studied in Duchi and Singer (2009).

In an online setting, each iteration of the FOBOS method consists of the following two steps:

    w_{t+1/2} = w_t - alpha_t g_t,
    w_{t+1} = argmin_w { (1/2) ||w - w_{t+1/2}||^2 + alpha_t Psi(w) }.

For convergence with optimal rates, the stepsize alpha_t is set to be Theta(1/sqrt(t)) for general convex regularizations and Theta(1/t) if Psi is strongly convex. This method is based on a technique known as forward-backward splitting, which was first proposed by Lions and Mercier (1979) and later analyzed by Chen and Rockafellar (1997) and Tseng (2000). For easy comparison with the RDA method, we rewrite the FOBOS method in an equivalent form

    w_{t+1} = argmin_w { <g_t, w> + Psi(w) + (1/(2 alpha_t)) ||w - w_t||^2 }.     (28)

Compared with this form of the FOBOS method, the RDA method (8) uses the average subgradient gbar_t instead of the current subgradient g_t; it uses a global proximal function, say h(w) = (1/2) ||w||_2^2, instead of its local Bregman divergence (1/2) ||w - w_t||^2; moreover, the coefficient for the proximal function is beta_t/t = Theta(1/sqrt(t)) instead of 1/(2 alpha_t) = Theta(sqrt(t)) for general convex regularization, and O(ln t / t) instead of Theta(t) for strongly convex regularization. Although these two methods have the same order of iteration complexity, the differences listed above contribute to quite different properties of their solutions.

These differences can be better understood in the special case of l1-regularization, i.e., when Psi(w) = lambda ||w||_1. In this case, the FOBOS method is equivalent to a special case of the Truncated Gradient (TG) method of Langford et al. (2009). The TG method truncates the solutions obtained by the standard SGD method every K steps; more specifically,

    w_{t+1}^(i) = { trnc( w_t^(i) - alpha_t g_t^(i), lambda_t^TG, theta )   if mod(t, K) = 0,
                    w_t^(i) - alpha_t g_t^(i)                               otherwise,          (29)

where lambda_t^TG = alpha_t lambda K, mod(t, K) is the remainder on division of t by K, and

    trnc(omega, lambda_t^TG, theta) = { 0                              if |omega| <= lambda_t^TG,
                                        omega - lambda_t^TG sgn(omega) if lambda_t^TG < |omega| <= theta,
                                        omega                          if |omega| > theta.       (30)

When K = 1 and theta = +inf, the TG method is the same as the FOBOS method (28) with l1-regularization. Now comparing the truncation threshold lambda_t^TG and the threshold lambda used in the l1-RDA method (10): with alpha_t = Theta(1/sqrt(t)), we have lambda_t^TG = Theta(1/sqrt(t)) lambda. This Theta(1/sqrt(t)) discount factor is also common for other previous work that uses soft-thresholding, including Shalev-Shwartz and Tewari (2009). It is clear that the RDA method uses a much more aggressive truncation threshold, thus is able to generate significantly more sparse solutions. This is confirmed by our computational experiments in the next section.

Most recently, Duchi et al. (2010) developed a family of subgradient methods that can adaptively modify the proximal function (squared Mahalanobis norms) at each iteration, in order to better incorporate learned knowledge about the geometry of the data. Their methods include extensions for both the mirror-descent type of algorithms like (28) and the RDA methods studied in this paper.
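To make the threshold comparison above concrete, here is a small sketch (my own, with hypothetical names) of the truncation operator (30) and of the two thresholds being compared: lambda_t^TG = alpha_t lambda K with alpha_t = c/sqrt(t), versus the constant lambda used by l1-RDA in (10).

```python
import numpy as np

def trnc(omega, lam_tg, theta):
    """Truncation operator (30), applied element-wise to the array omega."""
    out = np.where(np.abs(omega) <= lam_tg,
                   0.0,
                   omega - lam_tg * np.sign(omega))
    return np.where(np.abs(omega) > theta, omega, out)

def tg_threshold(lam, c, K, t):
    """TG / FOBOS threshold lambda_t^TG = alpha_t * lambda * K with alpha_t = c/sqrt(t);
    it shrinks like 1/sqrt(t), whereas the l1-RDA threshold stays at lambda
    (or lambda + gamma*rho/sqrt(t) in the enhanced version of Section 6)."""
    return (c * K / np.sqrt(t)) * lam
```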

Algorithm 2  Enhanced l1-RDA method

Input: gamma > 0, rho >= 0.
Initialize: w_1 = 0, gbar_0 = 0.
for t = 1, 2, 3, ... do
  1. Given the function f_t, compute a subgradient g_t in d f_t(w_t).
  2. Compute the dual average gbar_t = ((t - 1)/t) gbar_{t-1} + (1/t) g_t.
  3. Let lambda_t^RDA = lambda + gamma rho / sqrt(t), and compute w_{t+1} entry-wise:

        w_{t+1}^(i) = { 0                                                                 if |gbar_t^(i)| <= lambda_t^RDA,
                        -(sqrt(t)/gamma) ( gbar_t^(i) - lambda_t^RDA sgn(gbar_t^(i)) )    otherwise,
                                                                                          i = 1, ..., n.   (32)
end for

6. Computational Experiments with l1-Regularization

In this section, we provide computational experiments of the l1-RDA method on the MNIST dataset of handwritten digits (LeCun et al., 1998). Our purpose here is mainly to illustrate the basic characteristics of the l1-RDA method, rather than a comprehensive performance evaluation on a wide range of datasets. First, we describe a variant of the l1-RDA method that is capable of getting enhanced sparsity in the solution.

6.1 Enhanced l1-RDA Method

The enhanced l1-RDA method shown in Algorithm 2 is a special case of Algorithm 1. It is derived by setting Psi(w) = lambda ||w||_1, beta_t = gamma sqrt(t), and replacing h(w) with a parametrized version

    h_rho(w) = (1/2) ||w||_2^2 + rho ||w||_1,                                    (31)

where rho >= 0 is a sparsity-enhancing parameter. Note that h_rho(w) is strongly convex with modulus 1 for any rho >= 0. Hence the convergence rate of this algorithm is the same as if we choose h(w) = (1/2) ||w||_2^2. In this case, the equation (8) becomes

    w_{t+1} = argmin_w { <gbar_t, w> + lambda ||w||_1 + (gamma/sqrt(t)) ( (1/2) ||w||_2^2 + rho ||w||_1 ) }
            = argmin_w { <gbar_t, w> + lambda_t^RDA ||w||_1 + (gamma/(2 sqrt(t))) ||w||_2^2 },

where lambda_t^RDA = lambda + gamma rho / sqrt(t). The above minimization problem has a closed-form solution given in (32) (see Appendix A for the derivation). By letting rho > 0, the effective truncation threshold lambda_t^RDA is larger than lambda, especially in the initial phase of the online process. For problems without explicit l1-regularization in the objective function, i.e., when lambda = 0, this still gives a diminishing truncation threshold gamma rho / sqrt(t).
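A compact sketch of Algorithm 2 (my own rendering; the subgradient oracle is an assumed callable, not part of the paper).

```python
import numpy as np

def enhanced_l1_rda(subgradient_oracle, n, T, lam, gamma, rho):
    """Enhanced l1-RDA (Algorithm 2): RDA with Psi(w) = lam*||w||_1 and
    h_rho(w) = 0.5*||w||_2^2 + rho*||w||_1, i.e. threshold lam + gamma*rho/sqrt(t)."""
    w = np.zeros(n)
    g_bar = np.zeros(n)
    for t in range(1, T + 1):
        g = subgradient_oracle(t, w)                      # step 1
        g_bar = ((t - 1) * g_bar + g) / t                 # step 2: dual average
        lam_rda = lam + gamma * rho / np.sqrt(t)          # step 3: enhanced threshold
        shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam_rda, 0.0)
        w = -(np.sqrt(t) / gamma) * shrunk                # closed-form update (32)
    return w
```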

Figure 1: Sample images from the MNIST dataset, with gray-scale values from 0 to 255.

We can also restrict l1-regularization to part of the optimization variables only. For example, in support vector machines or logistic regression, we usually want the bias terms to be free of regularization. In this case, we can simply replace lambda_t^RDA by 0 for the corresponding coordinates in (32).

6.2 Experiments on the MNIST Dataset

Each image in the MNIST dataset is represented by a 28 x 28 gray-scale pixel-map, for a total of 784 features. Each of the 10 digits has roughly 6,000 training examples and 1,000 testing examples. Some of the samples are shown in Figure 1. From the perspective of using stochastic and online algorithms, the number of features and size of the dataset are considered very small. Nevertheless, we choose this dataset because the computational results are easy to visualize. No preprocessing of the data is employed.

We use l1-regularized logistic regression to do binary classification on each of the 45 pairs of digits. More specifically, let z = (x, y) where x in R^784 represents a gray-scale image and y in {+1, -1} is the binary label, and let w = (wtilde, b) where wtilde in R^784 and b is the bias. Then the loss function and regularization term in (1) are

    f(w, z) = log( 1 + exp( -y (wtilde^T x + b) ) ),    Psi(w) = lambda ||wtilde||_1.

Note that we do not apply regularization on the bias term b. In the experiments, we compare the (enhanced) l1-RDA method (Algorithm 2) with the SGD method

    w_{t+1}^(i) = w_t^(i) - alpha_t ( g_t^(i) + lambda sgn(w_t^(i)) ),

and the TG method (29) with theta = +inf. These three online algorithms have similar convergence rates and the same order of computational cost per iteration. We also compare them with the batch optimization approach, more specifically solving the empirical minimization problem (2) using an efficient interior-point method (IPM) of Koh et al. (2007).

Each pair of digits has about 12,000 training examples and 2,000 testing examples. We use the online algorithms to go through the (randomly permuted) data only once, therefore the algorithms stop at T = 12,000. We vary the regularization parameter lambda from 0.01 to 10. As a reference, the maximum lambda for the batch optimization case (Koh et al., 2007) is mostly in the range of 30 to 50 (beyond which the optimal weights are all zeros).
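For reference, a sketch (my own, not from the paper) of the per-example gradient of the loss used in these experiments, the logistic loss with an unregularized bias term:

```python
import numpy as np

def logistic_loss_grad(w, b, x, y):
    """f(w,z) = log(1 + exp(-y*(w^T x + b))); returns (grad_w, grad_b).
    Only w is l1-regularized in the experiments; b is left unregularized."""
    s = 1.0 / (1.0 + np.exp(y * (np.dot(w, x) + b)))   # = sigmoid(-y*(w^T x + b))
    return -y * s * x, -y * s
```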

Figure 2: Sparsity patterns of w_T and wbar_T for classifying the digits 6 and 7 when varying the parameter lambda from 0.01 to 10 in l1-regularized logistic regression. The background gray represents the value zero, bright spots represent positive values and dark spots represent negative values. Each column corresponds to a value of lambda labeled at the top. The top three rows are the weights w_T (without averaging) from the last iteration of the three online algorithms; the middle row shows optimal solutions of the batch optimization problem solved by the interior-point method (IPM); the bottom three rows show the averaged weights wbar_T in the three online algorithms. Both the TG and RDA methods were run with parameters for enhanced l1-regularization, i.e., K = 10 for TG and gamma rho = 25 for RDA.

Figure 3: Number of non-zeros (NNZs) in w_t for the three online algorithms (classifying the pair 6 and 7). The left column shows SGD, TG with K = 1, and RDA with rho = 0; the right column shows SGD, TG with K = 10, and RDA with gamma rho = 25. The same curves for SGD are plotted in both columns for clear comparison. The two rows correspond to lambda = 0.1 and lambda = 10, respectively.

In the l1-RDA method, we use gamma = 5,000, and set rho to be either 0 for basic regularization, or 0.005 (effectively gamma rho = 25) for an enhanced regularization effect. These parameters are chosen by cross-validation. For the SGD and TG methods, we use a constant stepsize alpha = (1/gamma) sqrt(2/T) for a comparable convergence rate; see (19) and related discussions. In the TG method, the period K is set to be either 1 for basic regularization (same as FOBOS), or 10 for a periodic enhanced regularization effect.

Figure 2 shows the sparsity patterns of the solutions w_T and wbar_T for classifying the digits 6 and 7. The algorithmic parameters used are: K = 10 for the TG method, and gamma rho = 25 for the RDA method. It is clear that the RDA method gives more sparse solutions than both the SGD and TG methods. The sparsity pattern obtained by the RDA method is very similar to the batch optimization results solved by IPM, especially for larger lambda.

Figure 4: Tradeoffs between testing error rates and NNZs in solutions when varying lambda from 0.01 to 10 (for classifying 6 and 7). The left column shows SGD, TG with K = 1, RDA with rho = 0, and IPM. The right column shows SGD, TG with K = 10, RDA with gamma rho = 25, and IPM. The same curves for SGD and IPM are plotted in both columns for clear comparison. The top two rows show the testing error rates and NNZs of the final weights w_T, and the bottom two rows are for the averaged weights wbar_T. All horizontal axes have logarithmic scale. For vertical axes, only the two plots in the first row have logarithmic scale.

Figure 5: Testing error rates and NNZs in solutions for the RDA method when varying the parameter gamma from 1,000 to 100,000, and setting rho such that gamma rho = 25. The three rows show results for lambda = 0.1, 1, and 10, respectively. The corresponding batch optimization results found by IPM are shown as a horizontal line in each plot.

To have a better understanding of the behaviors of the algorithms, we plot the number of non-zeros (NNZs) in w_t in Figure 3. Only the RDA method and TG with K = 1 give explicit zero weights using soft-thresholding at every step. In order to count the NNZs in all other cases, we have to set a small threshold for rounding the weights to zero. Considering that the magnitudes of the largest weights in Figure 2 are mostly on the order of 10^-3, we set 10^-5 as the threshold and verified that rounding elements less than 10^-5 to zero does not affect the testing errors. Note that we do not truncate the weights for RDA and TG with K = 1 further, even if some of their components are below 10^-5. It can be seen that the RDA method maintains a much more sparse w_t than the other online algorithms. While the TG method generates more sparse solutions than the SGD method when lambda is large, the NNZs in w_t oscillate within a very big range. The oscillation becomes more severe with K = 10.

Figure 6: Sparsity patterns of wbar_T obtained by varying the parameter gamma in the RDA method from 1,000 to 100,000 (for classifying the pair 6 and 7). The first column shows results of batch optimization using IPM, and the other columns show results of the RDA method using the values of gamma labeled at the top.

In contrast, the RDA method demonstrates a much smoother behavior of the NNZs. For the RDA method, the effect of enhanced regularization using gamma rho = 25 is more pronounced for relatively small lambda.

Next we illustrate the tradeoffs between sparsity and testing error rates. Figure 4 shows that the solutions obtained by the RDA method match the batch optimization results very well. Since the performance of the online algorithms varies when the training data are given in different random sequences (permutations), we run them on randomly permuted sequences of the same training set, and plot the means and standard deviations shown as error bars. For the SGD and TG methods, the testing error rates of w_T vary a lot for different random sequences. In contrast, the RDA method demonstrates very robust performance (small standard deviations) for w_T, even though the theorems only give convergence bounds for the averaged weight wbar_T. For large values of lambda, the averaged weights wbar_T obtained by the SGD and TG methods actually have much smaller error rates than those of RDA and batch optimization. This can be explained by the limitation of the SGD and TG methods in obtaining sparse solutions: these lower error rates are obtained with many more nonzero features than used by the RDA and batch optimization methods.

Figure 5 shows the results of choosing different values for the parameter gamma in the RDA method. We see that smaller values of gamma, which correspond to faster learning rates, lead to more sparse w_T and higher testing error rates; larger values of gamma result in less sparse w_T with lower testing error rates. But interestingly, the effect on the averaged solution wbar_T is almost the opposite: smaller values of gamma lead to less sparse wbar_T (in this case, we count the NNZs using the rounding threshold 10^-5). For large regularization parameter lambda, smaller values of gamma also give lower testing error rates.

Figure 6 shows the sparsity patterns of wbar_T when varying gamma from 1,000 to 100,000. We see that smaller values of gamma give more sparse wbar_T, which are also more scattered, like the batch optimization solution by IPM.

Figure 7 shows a summary of classification results for all the 45 pairs of digits. For clarity, we only show results of the l1-RDA method and batch optimization using IPM. We see that the solutions obtained by the l1-RDA method demonstrate very similar tradeoffs between sparsity and testing error rates as rendered by the batch optimization solutions.


More information

Final Spring 2007

Final Spring 2007 .615 Final Spring 7 Overview The purpose of he final exam is o calculae he MHD β limi in a high-bea oroidal okamak agains he dangerous n = 1 exernal ballooning-kink mode. Effecively, his corresponds o

More information

WEEK-3 Recitation PHYS 131. of the projectile s velocity remains constant throughout the motion, since the acceleration a x

WEEK-3 Recitation PHYS 131. of the projectile s velocity remains constant throughout the motion, since the acceleration a x WEEK-3 Reciaion PHYS 131 Ch. 3: FOC 1, 3, 4, 6, 14. Problems 9, 37, 41 & 71 and Ch. 4: FOC 1, 3, 5, 8. Problems 3, 5 & 16. Feb 8, 018 Ch. 3: FOC 1, 3, 4, 6, 14. 1. (a) The horizonal componen of he projecile

More information

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models.

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models. Technical Repor Doc ID: TR--203 06-March-203 (Las revision: 23-Februar-206) On formulaing quadraic funcions in opimizaion models. Auhor: Erling D. Andersen Convex quadraic consrains quie frequenl appear

More information

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details!

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details! MAT 257, Handou 6: Ocober 7-2, 20. I. Assignmen. Finish reading Chaper 2 of Spiva, rereading earlier secions as necessary. handou and fill in some missing deails! II. Higher derivaives. Also, read his

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Lecture 20: Riccati Equations and Least Squares Feedback Control

Lecture 20: Riccati Equations and Least Squares Feedback Control 34-5 LINEAR SYSTEMS Lecure : Riccai Equaions and Leas Squares Feedback Conrol 5.6.4 Sae Feedback via Riccai Equaions A recursive approach in generaing he marix-valued funcion W ( ) equaion for i for he

More information

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t M ah 5 2 7 Fall 2 0 0 9 L ecure 1 0 O c. 7, 2 0 0 9 Hamilon- J acobi Equaion: Explici Formulas In his lecure we ry o apply he mehod of characerisics o he Hamilon-Jacobi equaion: u + H D u, x = 0 in R n

More information

EXERCISES FOR SECTION 1.5

EXERCISES FOR SECTION 1.5 1.5 Exisence and Uniqueness of Soluions 43 20. 1 v c 21. 1 v c 1 2 4 6 8 10 1 2 2 4 6 8 10 Graph of approximae soluion obained using Euler s mehod wih = 0.1. Graph of approximae soluion obained using Euler

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

Lecture 4: November 13

Lecture 4: November 13 Compuaional Learning Theory Fall Semeser, 2017/18 Lecure 4: November 13 Lecurer: Yishay Mansour Scribe: Guy Dolinsky, Yogev Bar-On, Yuval Lewi 4.1 Fenchel-Conjugae 4.1.1 Moivaion Unil his lecure we saw

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes Some common engineering funcions 2.7 Inroducion This secion provides a caalogue of some common funcions ofen used in Science and Engineering. These include polynomials, raional funcions, he modulus funcion

More information

Ensamble methods: Bagging and Boosting

Ensamble methods: Bagging and Boosting Lecure 21 Ensamble mehods: Bagging and Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Ensemble mehods Mixure of expers Muliple base models (classifiers, regressors), each covers a differen par

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 38, No. 2, May 2013, pp. 209 227 ISSN 0364-765X (prin) ISSN 1526-5471 (online) hp://dx.doi.org/10.1287/moor.1120.0562 2013 INFORMS On Boundedness of Q-Learning Ieraes

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning Journal of Machine Learning Research 13 (2012) 1665-1705 Submied 7/11; Revised 3/12; Published 5/12 Manifold Idenificaion in Dual Averaging for Regularized Sochasic Online Learning Sangkyun Lee Fakulä

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

STATE-SPACE MODELLING. A mass balance across the tank gives:

STATE-SPACE MODELLING. A mass balance across the tank gives: B. Lennox and N.F. Thornhill, 9, Sae Space Modelling, IChemE Process Managemen and Conrol Subjec Group Newsleer STE-SPACE MODELLING Inroducion: Over he pas decade or so here has been an ever increasing

More information

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8) I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Optimality Conditions for Unconstrained Problems

Optimality Conditions for Unconstrained Problems 62 CHAPTER 6 Opimaliy Condiions for Unconsrained Problems 1 Unconsrained Opimizaion 11 Exisence Consider he problem of minimizing he funcion f : R n R where f is coninuous on all of R n : P min f(x) x

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

Ensamble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

14 Autoregressive Moving Average Models

14 Autoregressive Moving Average Models 14 Auoregressive Moving Average Models In his chaper an imporan parameric family of saionary ime series is inroduced, he family of he auoregressive moving average, or ARMA, processes. For a large class

More information

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux Gues Lecures for Dr. MacFarlane s EE3350 Par Deux Michael Plane Mon., 08-30-2010 Wrie name in corner. Poin ou his is a review, so I will go faser. Remind hem o go lisen o online lecure abou geing an A

More information

Class Meeting # 10: Introduction to the Wave Equation

Class Meeting # 10: Introduction to the Wave Equation MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

Ordinary dierential equations

Ordinary dierential equations Chaper 5 Ordinary dierenial equaions Conens 5.1 Iniial value problem........................... 31 5. Forward Euler's mehod......................... 3 5.3 Runge-Kua mehods.......................... 36

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

Matlab and Python programming: how to get started

Matlab and Python programming: how to get started Malab and Pyhon programming: how o ge sared Equipping readers he skills o wrie programs o explore complex sysems and discover ineresing paerns from big daa is one of he main goals of his book. In his chaper,

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

Chapter 4. Truncation Errors

Chapter 4. Truncation Errors Chaper 4. Truncaion Errors and he Taylor Series Truncaion Errors and he Taylor Series Non-elemenary funcions such as rigonomeric, eponenial, and ohers are epressed in an approimae fashion using Taylor

More information

5.1 - Logarithms and Their Properties

5.1 - Logarithms and Their Properties Chaper 5 Logarihmic Funcions 5.1 - Logarihms and Their Properies Suppose ha a populaion grows according o he formula P 10, where P is he colony size a ime, in hours. When will he populaion be 2500? We

More information

A New Perturbative Approach in Nonlinear Singularity Analysis

A New Perturbative Approach in Nonlinear Singularity Analysis Journal of Mahemaics and Saisics 7 (: 49-54, ISSN 549-644 Science Publicaions A New Perurbaive Approach in Nonlinear Singulariy Analysis Ta-Leung Yee Deparmen of Mahemaics and Informaion Technology, The

More information

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY ECO 504 Spring 2006 Chris Sims RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY 1. INTRODUCTION Lagrange muliplier mehods are sandard fare in elemenary calculus courses, and hey play a cenral role in economic

More information

Manifold Identification of Dual Averaging Methods for Regularized Stochastic Online Learning

Manifold Identification of Dual Averaging Methods for Regularized Stochastic Online Learning for Regularized Sochasic Online Learning Sangkyun Lee sklee@cs.wisc.edu Sephen J. Wrigh swrigh@cs.wisc.edu Compuer Sciences Deparmen, Universiy of Wisconsin, W. Dayon Sree, Madison, WI 5376 USA Absrac

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates Biol. 356 Lab 8. Moraliy, Recruimen, and Migraion Raes (modified from Cox, 00, General Ecology Lab Manual, McGraw Hill) Las week we esimaed populaion size hrough several mehods. One assumpion of all hese

More information

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should Cambridge Universiy Press 978--36-60033-7 Cambridge Inernaional AS and A Level Mahemaics: Mechanics Coursebook Excerp More Informaion Chaper The moion of projeciles In his chaper he model of free moion

More information

Problem Set 5. Graduate Macro II, Spring 2017 The University of Notre Dame Professor Sims

Problem Set 5. Graduate Macro II, Spring 2017 The University of Notre Dame Professor Sims Problem Se 5 Graduae Macro II, Spring 2017 The Universiy of Nore Dame Professor Sims Insrucions: You may consul wih oher members of he class, bu please make sure o urn in your own work. Where applicable,

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC his documen was generaed a 1:4 PM, 9/1/13 Copyrigh 213 Richard. Woodward 4. End poins and ransversaliy condiions AGEC 637-213 F z d Recall from Lecure 3 ha a ypical opimal conrol problem is o maimize (,,

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

BU Macro BU Macro Fall 2008, Lecture 4

BU Macro BU Macro Fall 2008, Lecture 4 Dynamic Programming BU Macro 2008 Lecure 4 1 Ouline 1. Cerainy opimizaion problem used o illusrae: a. Resricions on exogenous variables b. Value funcion c. Policy funcion d. The Bellman equaion and an

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Rapid Termination Evaluation for Recursive Subdivision of Bezier Curves

Rapid Termination Evaluation for Recursive Subdivision of Bezier Curves Rapid Terminaion Evaluaion for Recursive Subdivision of Bezier Curves Thomas F. Hain School of Compuer and Informaion Sciences, Universiy of Souh Alabama, Mobile, AL, U.S.A. Absrac Bézier curve flaening

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015 Explaining Toal Facor Produciviy Ulrich Kohli Universiy of Geneva December 2015 Needed: A Theory of Toal Facor Produciviy Edward C. Presco (1998) 2 1. Inroducion Toal Facor Produciviy (TFP) has become

More information

Lecture 4 Notes (Little s Theorem)

Lecture 4 Notes (Little s Theorem) Lecure 4 Noes (Lile s Theorem) This lecure concerns one of he mos imporan (and simples) heorems in Queuing Theory, Lile s Theorem. More informaion can be found in he course book, Bersekas & Gallagher,

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3 and d = c b - b c c d = c b - b c c This process is coninued unil he nh row has been compleed. The complee array of coefficiens is riangular. Noe ha in developing he array an enire row may be divided or

More information

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter Sae-Space Models Iniializaion, Esimaion and Smoohing of he Kalman Filer Iniializaion of he Kalman Filer The Kalman filer shows how o updae pas predicors and he corresponding predicion error variances when

More information

Introduction to Numerical Analysis. In this lesson you will be taken through a pair of techniques that will be used to solve the equations of.

Introduction to Numerical Analysis. In this lesson you will be taken through a pair of techniques that will be used to solve the equations of. Inroducion o Nuerical Analysis oion In his lesson you will be aen hrough a pair of echniques ha will be used o solve he equaions of and v dx d a F d for siuaions in which F is well nown, and he iniial

More information

Lecture 10: The Poincaré Inequality in Euclidean space

Lecture 10: The Poincaré Inequality in Euclidean space Deparmens of Mahemaics Monana Sae Universiy Fall 215 Prof. Kevin Wildrick n inroducion o non-smooh analysis and geomery Lecure 1: The Poincaré Inequaliy in Euclidean space 1. Wha is he Poincaré inequaliy?

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

How to Deal with Structural Breaks in Practical Cointegration Analysis

How to Deal with Structural Breaks in Practical Cointegration Analysis How o Deal wih Srucural Breaks in Pracical Coinegraion Analysis Roselyne Joyeux * School of Economic and Financial Sudies Macquarie Universiy December 00 ABSTRACT In his noe we consider he reamen of srucural

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information