Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization


Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization

Lin Xiao
Microsoft Research, 1 Microsoft Way, Redmond, WA 98052, USA

Revised March 8, 2010

Abstract

We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the l1-norm for promoting sparsity. We develop extensions of Nesterov's dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of l1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method.

Keywords: stochastic learning, online optimization, l1-regularization, structural convex optimization, dual averaging methods, accelerated gradient methods.

1. Introduction

In machine learning, online algorithms operate by repetitively drawing random examples, one at a time, and adjusting the learning variables using simple calculations that are usually based on the single example only. The low computational complexity (per iteration) of online algorithms is often associated with their slow convergence and low accuracy in solving the underlying optimization problems. As argued by Bottou and Bousquet (2008), the combined low complexity and low accuracy, together with other tradeoffs in statistical learning theory, still make online algorithms favorite choices for solving large-scale learning problems. Nevertheless, traditional online algorithms, such as stochastic gradient descent, have limited capability of exploiting problem structure in solving regularized learning problems. As a result, their low accuracy often makes it hard to obtain the desired regularization effects, e.g., sparsity under l1-regularization.

In this paper, we develop a new class of online algorithms, the regularized dual averaging (RDA) methods, that can exploit the regularization structure more effectively in an online setting. In this section, we describe the two types of problems that we consider, and explain the motivation of our work.

1.1 Regularized Stochastic Learning

The regularized stochastic learning problems we consider are of the following form:

    minimize_w   phi(w) := E_z f(w, z) + Psi(w)                                  (1)

where w in R^n is the optimization variable (often called weights in learning problems), z = (x, y) is an input-output pair of data drawn from an (unknown) underlying distribution, f(w, z) is the loss function of using w and x to predict y, and Psi(w) is a regularization term. We assume Psi(w) is a closed convex function (Rockafellar, 1970), and its effective domain, dom Psi = {w in R^n | Psi(w) < +inf}, is closed. We also assume that f(w, z) is convex in w for each z, and it is subdifferentiable (a subgradient always exists) on dom Psi.

Examples of the loss function f(w, z) include:

- Least-squares: x in R^n, y in R, and f(w, (x, y)) = (y - w^T x)^2.
- Hinge loss: x in R^n, y in {+1, -1}, and f(w, (x, y)) = max{0, 1 - y (w^T x)}.
- Logistic regression: x in R^n, y in {+1, -1}, and f(w, (x, y)) = log(1 + exp(-y (w^T x))).

Examples of the regularization term Psi(w) include:

- l1-regularization: Psi(w) = lambda ||w||_1 with lambda > 0. With l1-regularization, we hope to get a relatively sparse solution, i.e., with many entries of the weight vector w being zeroes.
- l2-regularization: Psi(w) = (sigma/2) ||w||_2^2, with sigma > 0. When l2-regularization is used with the hinge loss function, we have the standard setup of support vector machines.
- Convex constraints: Psi(w) is the indicator function of a closed convex set C, i.e.,

      Psi(w) = I_C(w) := { 0     if w in C,
                           +inf  otherwise.

We can also consider mixed regularizations such as Psi(w) = lambda ||w||_1 + (sigma/2) ||w||_2^2. These examples cover a wide range of practical problems in machine learning.

A common approach for solving stochastic learning problems is to approximate the expected loss function phi(w) by using a finite set of independent observations z_1, ..., z_T, and solve the following problem to minimize the empirical loss:

    minimize_w   (1/T) Sum_{t=1}^T f(w, z_t) + Psi(w).                           (2)

By our assumptions, this is a convex optimization problem. Depending on the structure of particular problems, they can be solved efficiently by interior-point methods (e.g., Ferris and Munson, 2003; Koh et al., 2007), quasi-Newton methods (e.g., Andrew and Gao, 2007), or accelerated first-order methods (Nesterov, 2007; Tseng, 2008; Beck and Teboulle, 2009). However, this batch optimization approach may not scale well for very large problems: even with first-order methods, evaluating one single gradient of the objective function in (2) requires going through the whole data set.
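The loss functions and regularizers listed above are each one or two lines of code. The following sketch is my own illustration (not part of the paper), with NumPy-based conventions and function names chosen for clarity; it implements the hinge and logistic losses, one subgradient of each with respect to w, and the l1 regularizer.

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss f(w,(x,y)) = max(0, 1 - y * w^T x)."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def hinge_subgradient(w, x, y):
    """One subgradient of the hinge loss with respect to w."""
    return -y * x if y * np.dot(w, x) < 1.0 else np.zeros_like(w)

def logistic_loss(w, x, y):
    """Logistic loss f(w,(x,y)) = log(1 + exp(-y * w^T x))."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

def logistic_gradient(w, x, y):
    """Gradient of the logistic loss with respect to w."""
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

def l1_regularizer(w, lam):
    """Psi(w) = lam * ||w||_1 with lam > 0."""
    return lam * np.abs(w).sum()
```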

In this paper, we consider online algorithms that process samples sequentially as they become available. More specifically, we draw a sequence of i.i.d. samples z_1, z_2, z_3, ..., and use them to calculate a sequence w_1, w_2, w_3, .... Suppose at time t, we have the most up-to-date weight vector w_t. Whenever z_t is available, we can evaluate the loss f(w_t, z_t), and also a subgradient g_t in d f(w_t, z_t) (here d f(w, z) denotes the subdifferential of f(w, z) with respect to w). Then we compute w_{t+1} based on these information.

The most widely used online algorithm is the stochastic gradient descent (SGD) method. Consider the general case Psi(w) = I_C(w) + psi(w), where I_C(w) is a hard set constraint and psi(w) is a soft regularization. The SGD method takes the form

    w_{t+1} = Pi_C( w_t - alpha_t (g_t + xi_t) ),                                (3)

where alpha_t is an appropriate stepsize, xi_t is a subgradient of psi at w_t, and Pi_C(.) denotes Euclidean projection onto the set C. The SGD method belongs to the general scheme of stochastic approximation, which can be traced back to Robbins and Monro (1951) and Kiefer and Wolfowitz (1952). In general we are also allowed to use all previous information to compute w_{t+1}, and even second-order derivatives if the loss functions are smooth.

In a stochastic online setting, each weight vector w_t is a random variable that depends on {z_1, ..., z_{t-1}}, and so is the objective value phi(w_t). Assume an optimal solution w* to the problem (1) exists, and let phi* = phi(w*). The goal of online algorithms is to generate a sequence {w_t}_{t>=1} such that lim_{t->inf} E phi(w_t) = phi*, and hopefully with reasonable convergence rate. This is the case for the SGD method (3) if we choose the stepsize alpha_t = c/sqrt(t), where c is a positive constant. The corresponding convergence rate is O(1/sqrt(t)), which is indeed best possible for subgradient schemes with a black-box model, even in the case of deterministic optimization (Nemirovsky and Yudin, 1983). Despite such slow convergence and the associated low accuracy in the solutions (compared with batch optimization using, e.g., interior-point methods), the SGD method has been very popular in the machine learning community due to its capability of scaling with very large data sets and good generalization performances observed in practice (e.g., Bottou and LeCun, 2004; Zhang, 2004; Shalev-Shwartz et al., 2007).

Nevertheless, a main drawback of the SGD method is its lack of capability in exploiting problem structure, especially for problems with explicit regularization. More specifically, the SGD method (3) treats the soft regularization psi(w) as a general convex function, and only uses its subgradient in computing the next weight vector. In this case, we can simply lump psi(w) into f(w, z_t) and treat them as a single loss function. Although in theory the algorithm converges to an optimal solution (in expectation) as t goes to infinity, in practice it is usually stopped far before that. Even in the case of convergence in expectation, we still face (possibly big) variations in the solution due to the stochastic nature of the algorithm. Therefore, the regularization effect we hope to have by solving the problem (1) may be elusive for any particular solution generated by (3) based on finite random samples.

An important example and main motivation for this paper is l1-regularized stochastic learning, where Psi(w) = lambda ||w||_1. In the case of batch learning, the empirical minimization problem (2) can be solved to very high precision, e.g., by interior-point methods. Therefore simply rounding the weights with very small magnitudes toward zero is usually enough to produce the desired sparsity.
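A minimal sketch of the SGD update (3) discussed above, assuming for illustration that the hard constraint C is a Euclidean norm ball; the helper names and the ball constraint are my own choices, not the paper's.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto C = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def sgd_step(w, g, xi, alpha, radius):
    """One step of (3): w <- Pi_C(w - alpha * (g + xi)), where g is a
    subgradient of the loss at w and xi a subgradient of the soft
    regularization psi at w."""
    return project_l2_ball(w - alpha * (g + xi), radius)
```

With the stepsize schedule alpha_t = c/sqrt(t), repeatedly calling sgd_step gives the O(1/sqrt(t)) behavior described above.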

As a result, l1-regularization has been very effective in obtaining sparse solutions using the batch optimization approach in statistical learning (e.g., Tibshirani, 1996) and signal processing (e.g., Chen et al., 1998). In contrast, the SGD method (3) hardly generates any sparse solution, and its inherent low accuracy makes the simple rounding approach very unreliable. Several principled soft-thresholding or truncation methods have been developed to address this problem (e.g., Langford et al., 2009; Duchi and Singer, 2009), but the levels of sparsity in their solutions are still unsatisfactory compared with the corresponding batch solutions.

In this paper, we develop regularized dual averaging (RDA) methods that can exploit the structure of (1) more effectively in a stochastic online setting. More specifically, each iteration of the RDA methods takes the form

    w_{t+1} = argmin_w { (1/t) Sum_{tau=1}^t <g_tau, w> + Psi(w) + (beta_t/t) h(w) },    (4)

where h(w) is an auxiliary strongly convex function, and {beta_t}_{t>=1} is a nonnegative and nondecreasing input sequence, which determines the convergence properties of the algorithm. Essentially, at each iteration, this method minimizes the sum of three terms: a linear function obtained by averaging all previous subgradients (the dual average), the original regularization function Psi(w), and an additional strongly convex regularization term (beta_t/t) h(w).

The RDA method is an extension of the simple dual averaging scheme of Nesterov (2009), which is equivalent to letting Psi(w) be the indicator function of a closed convex set. For the RDA method to be practically efficient, we assume that the functions Psi(w) and h(w) are simple, meaning that we are able to find a closed-form solution for the minimization problem in (4). Then the computational effort per iteration is only O(n), the same as the SGD method. This assumption indeed holds in many cases. For example, if we let Psi(w) = lambda ||w||_1 and h(w) = (1/2) ||w||_2^2, then w_{t+1} has an entry-wise closed-form solution. This solution uses a much more aggressive truncation threshold than previous methods, thus results in significantly improved sparsity (see discussions in Section 5).

In terms of iteration complexity, we show that if beta_t = Theta(sqrt(t)), i.e., with order exactly sqrt(t), then the RDA method (4) has the standard convergence rate

    E phi(wbar_t) - phi* <= O( G / sqrt(t) ),

where wbar_t = (1/t) Sum_{tau=1}^t w_tau is the primal average, and G is a uniform upper bound on the norms of the subgradients g_t. If the regularization term Psi(w) is strongly convex, then setting beta_t <= O(ln t) gives a faster convergence rate O(ln t / t).

For stochastic learning problems in which the loss functions f(w, z) are all differentiable and have Lipschitz continuous gradients, we also develop an accelerated version of the RDA method that has the convergence rate

    E phi(wbar_t) - phi* <= O( L/t^2 + Q/sqrt(t) ),

where L is the Lipschitz constant of the gradients, and Q is an upper bound on the variances of the stochastic gradients. In addition to convergence in expectation, we show that the same orders of convergence rates hold with high probability.

1.2 Regularized Online Optimization

In online optimization, we use an online algorithm to generate a sequence of decisions w_t, one at a time, for t = 1, 2, 3, .... At each time t, a previously unknown cost function f_t is revealed, and we encounter a loss f_t(w_t). We assume that the cost functions f_t are convex for all t >= 1. The goal of the online algorithm is to ensure that the total cost up to each time t, Sum_{tau=1}^t f_tau(w_tau), is not much larger than min_w Sum_{tau=1}^t f_tau(w), the smallest total cost of any fixed decision w from hindsight. The difference between these two costs is called the regret of the online algorithm. Applications of online optimization include online prediction of time series and sequential investment (e.g., Cesa-Bianchi and Lugosi, 2006).

In regularized online optimization, we add a convex regularization term Psi(w) to each cost function. The regret with respect to any fixed decision w in dom Psi is

    R_t(w) := Sum_{tau=1}^t ( f_tau(w_tau) + Psi(w_tau) ) - Sum_{tau=1}^t ( f_tau(w) + Psi(w) ).    (5)

As in the stochastic setting, the online algorithm can query a subgradient g_t in d f_t(w_t) at each step, and possibly use all previous information, to compute the next decision w_{t+1}. It turns out that the simple subgradient method (3) is well suited for online optimization: with a stepsize alpha_t = Theta(1/sqrt(t)), it has a regret R_t(w) <= O(sqrt(t)) for all w in dom Psi (Zinkevich, 2003). This regret bound cannot be improved in general for convex cost functions. However, if the cost functions are strongly convex, say with convexity parameter sigma, then the same algorithm with stepsize alpha_t = 1/(sigma t) gives an O(ln t) regret bound (e.g., Hazan et al., 2006; Bartlett et al., 2008).

Similar to the discussions on regularized stochastic learning, the online subgradient method (3) in general lacks the capability of exploiting the regularization structure. In this paper, we show that the same RDA method (4) can effectively exploit such structure in an online setting, and ensure the O(sqrt(t)) regret bound with beta_t = Theta(sqrt(t)). For strongly convex regularizations, setting beta_t = O(ln t) yields the improved regret bound O(ln t). Since there are no specifications on the probability distribution of the sequence of functions, nor assumptions like mutual independence, online optimization can be considered as a more general framework than stochastic learning. In this paper, we will first establish regret bounds of the RDA method for solving online optimization problems, then use them to derive convergence rates for solving stochastic learning problems.

1.3 Outline of Contents

The methods we develop apply to more general settings than R^n with Euclidean geometry. In Section 1.4, we introduce the necessary notations and definitions associated with a general finite-dimensional real vector space. In Section 2, we present the generic RDA method for solving both the stochastic learning and online optimization problems, and give several concrete examples of the method. In Section 3, we present the precise regret bounds of the RDA method for solving regularized online optimization problems. In Section 4, we derive convergence rates of the RDA method for solving regularized stochastic learning problems. In addition to the rates of convergence in expectation, we also give associated high probability bounds.

In Section 5, we explain the connections of the RDA method to several related work, and analyze its capability of generating better sparse solutions than other methods. In Section 6, we give an enhanced version of the l1-RDA method, and present computational experiments on the MNIST handwritten digits dataset. Our experiments show that the RDA method is capable of generating sparse solutions that are comparable to those obtained by batch learning using interior-point methods. In Section 7, we discuss the RDA methods in the context of structural convex optimization and their connections to incremental subgradient methods. As an extension, we develop an accelerated version of the RDA method for stochastic optimization problems with smooth loss functions. We also discuss in detail the p-norm based RDA methods. Appendices A-D contain technical proofs of our main results.

1.4 Notations and Generalities

Let E be a finite-dimensional real vector space, endowed with a norm ||.||. This norm defines a system of balls: B(w, r) = {u in E : ||u - w|| <= r}. Let E* be the vector space of all linear functions on E, and let <s, w> denote the value of s in E* at w in E. The dual space E* is endowed with the dual norm ||s||_* = max_{||w|| <= 1} <s, w>.

A function h : E -> R union {+inf} is called strongly convex with respect to the norm ||.|| if there exists a constant sigma > 0 such that

    h(alpha w + (1 - alpha) u) <= alpha h(w) + (1 - alpha) h(u) - (sigma/2) alpha (1 - alpha) ||w - u||^2,   for all w, u in dom h.

The constant sigma is called the convexity parameter, or the modulus of strong convexity. Let rint C denote the relative interior of a convex set C (Rockafellar, 1970). If h is strongly convex with modulus sigma, then for any w in dom h and u in rint(dom h),

    h(w) >= h(u) + <s, w - u> + (sigma/2) ||w - u||^2,   for all s in d h(u).

See, e.g., Goebel and Rockafellar (2008) and Juditsky and Nemirovski (2008).

In the special case of the coordinate vector space E = R^n, we have E* = E, and the standard inner product <s, w> = s^T w = Sum_{i=1}^n s^(i) w^(i), where w^(i) denotes the i-th coordinate of w. For the standard Euclidean norm, ||w|| = ||w||_2 = sqrt(<w, w>) and ||s||_* = ||s||_2. For any w_0 in R^n, the function h(w) = (sigma/2) ||w - w_0||_2^2 is strongly convex with modulus sigma. For another example, consider the l1-norm ||w|| = ||w||_1 = Sum_{i=1}^n |w^(i)| and its associated dual norm ||s||_* = ||s||_inf = max_{1<=i<=n} |s^(i)|. Let S_n be the standard simplex in R^n, i.e., S_n = { w in R^n_+ : Sum_{i=1}^n w^(i) = 1 }. Then the negative entropy function

    h(w) = Sum_{i=1}^n w^(i) ln w^(i) + ln n,                                    (6)

with dom h = S_n, is strongly convex with respect to ||.||_1 with modulus 1 (see, e.g., Nesterov, 2005, Lemma 3). In this case, the unique minimizer of h is w_0 = (1/n, ..., 1/n).

For a closed proper convex function Psi, we use Argmin_w Psi(w) to denote the (convex) set of minimizing solutions. If a convex function h has a unique minimizer, e.g., when h is strongly convex, then we use argmin_w h(w) to denote that single point.

Algorithm 1  Regularized dual averaging (RDA) method

input:
  - an auxiliary function h(w) that is strongly convex on dom Psi and also satisfies

        argmin_w h(w) in Argmin_w Psi(w);                                        (7)

  - a nonnegative and nondecreasing sequence {beta_t}_{t>=1}.

initialize: set w_1 = argmin_w h(w) and gbar_0 = 0.

for t = 1, 2, 3, ... do
  1. Given the function f_t, compute a subgradient g_t in d f_t(w_t).
  2. Update the average subgradient:

        gbar_t = ((t - 1)/t) gbar_{t-1} + (1/t) g_t.

  3. Compute the next weight vector:

        w_{t+1} = argmin_w { <gbar_t, w> + Psi(w) + (beta_t/t) h(w) }.           (8)
end for

2. Regularized Dual Averaging Method

In this section, we present the generic RDA method (Algorithm 1) for solving regularized stochastic learning and online optimization problems, and give several concrete examples. To unify notation, we use f_t(w) to denote the cost function at each step t. For stochastic learning problems, we simply let f_t(w) = f(w, z_t).

At the input to the RDA method, we need an auxiliary function h that is strongly convex on dom Psi. The condition (7) requires that its unique minimizer must also minimize the regularization function Psi. This can be done, e.g., by first choosing a starting point w_0 in Argmin_w Psi(w) and an arbitrary strongly convex function h_0(w), then letting

    h(w) = h_0(w) - h_0(w_0) - <grad h_0(w_0), w - w_0>.

In other words, h(w) is the Bregman divergence from w_0 induced by h_0(w). If h_0 is not differentiable, but subdifferentiable at w_0, we can replace grad h_0(w_0) with a subgradient. The input sequence {beta_t} determines the convergence rate, or regret bound, of the algorithm.

There are three steps in each iteration of the RDA method. Step 1 is to compute a subgradient of f_t at w_t, which is standard for all subgradient or gradient based methods. Step 2 is the online version of computing the average subgradient:

    gbar_t = (1/t) Sum_{tau=1}^t g_tau.

The name dual averaging comes from the fact that the subgradients live in the dual space E*.
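A sketch of the generic loop of Algorithm 1 in Python (my own rendering, not the paper's code). The step-3 minimization is delegated to a user-supplied solver, which is assumed to return the exact minimizer of <gbar_t, w> + Psi(w) + (beta_t/t) h(w); the initialization assumes argmin_w h(w) = 0, as in h(w) = (1/2)||w||_2^2.

```python
import numpy as np

def rda(subgradient_oracle, solve_step, n, T, beta):
    """Generic RDA loop (Algorithm 1).

    subgradient_oracle(t, w) -> g_t, a subgradient of f_t at w.
    solve_step(g_bar, beta_t, t) -> argmin_w { <g_bar, w> + Psi(w) + (beta_t/t)*h(w) }.
    beta(t) -> the nonnegative, nondecreasing input sequence beta_t.
    """
    w = np.zeros(n)           # assumes argmin_w h(w) = 0
    g_bar = np.zeros(n)
    iterates = [w]
    for t in range(1, T + 1):
        g = subgradient_oracle(t, w)            # step 1
        g_bar = ((t - 1) * g_bar + g) / t       # step 2: running average of subgradients
        w = solve_step(g_bar, beta(t), t)       # step 3: closed-form or simple minimization
        iterates.append(w)
    return iterates
```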

Step 3 is most interesting and worth further explanation. In particular, the efficiency in computing w_{t+1} determines how useful the method is in practice. For this reason, we assume the regularization functions Psi(w) and h(w) are simple. This means the minimization problem in (8) can be solved with little effort, especially if we are able to find a closed-form solution for w_{t+1}. At first sight, this assumption seems to be quite restrictive. However, the examples below show that this indeed is the case for many important learning problems in practice.

2.1 RDA Methods with General Convex Regularization

For a general convex regularization Psi, we can choose any positive sequence {beta_t} that is order exactly sqrt(t), to obtain an O(1/sqrt(t)) convergence rate for stochastic learning, or an O(sqrt(t)) regret bound for online optimization. We will state the formal convergence theorems in Sections 3 and 4. Here, we give several concrete examples. To be more specific, we choose a parameter gamma > 0 and use the sequence

    beta_t = gamma sqrt(t),   t = 1, 2, 3, ....

Nesterov's dual averaging method. Let Psi(w) be the indicator function of a closed convex set C. This recovers the simple dual averaging scheme in Nesterov (2009). If we choose h(w) = (1/2) ||w||_2^2, then the equation (8) yields

    w_{t+1} = Pi_C( -(sqrt(t)/gamma) gbar_t ) = Pi_C( -(1/(gamma sqrt(t))) Sum_{tau=1}^t g_tau ).    (9)

When C = { w in R^n : ||w||_1 <= delta } for some delta > 0, we have "hard" l1-regularization. In this case, although there is no closed-form solution for w_{t+1}, efficient algorithms for projection onto the l1-ball can be found, e.g., in Duchi et al. (2008).

Soft l1-regularization. Let Psi(w) = lambda ||w||_1 for some lambda > 0, and h(w) = (1/2) ||w||_2^2. In this case, w_{t+1} has a closed-form solution (see Appendix A for the derivation):

    w_{t+1}^(i) = { 0                                                        if |gbar_t^(i)| <= lambda,
                    -(sqrt(t)/gamma) ( gbar_t^(i) - lambda sgn(gbar_t^(i)) ) otherwise,
                                                                             i = 1, ..., n.          (10)

Here sgn(.) is the sign or signum function, i.e., sgn(omega) equals 1 if omega > 0, -1 if omega < 0, and 0 if omega = 0. Whenever a component of gbar_t is less than lambda in magnitude, the corresponding component of w_{t+1} is set to zero. Further extensions of the l1-RDA method, and associated computational experiments, are given in Section 6.

Exponentiated dual averaging method. Let Psi(w) be the indicator function of the standard simplex S_n, and h(w) be the negative entropy function defined in (6). In this case,

    w_{t+1}^(i) = (1/Z_{t+1}) exp( -(sqrt(t)/gamma) gbar_t^(i) ),   i = 1, ..., n,

where Z_{t+1} is a normalization parameter such that Sum_{i=1}^n w_{t+1}^(i) = 1. This is the dual averaging version of the exponentiated gradient algorithm (Kivinen and Warmuth, 1997); see also Tseng and Bertsekas (1993) and Juditsky et al. (2005). We note that this example is also covered by Nesterov's dual averaging method. We discuss in detail the special case of p-norm RDA methods in Section 7.2.
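The entry-wise solution (10) for soft l1-regularization is a short vectorized computation; the sketch below is my own illustration and can serve as the solve_step of the generic loop given earlier, for the choice Psi(w) = lambda ||w||_1 and h(w) = (1/2)||w||_2^2.

```python
import numpy as np

def l1_rda_step(g_bar, lam, gamma, t):
    """Closed-form solution (10): zero out coordinates with |g_bar_i| <= lam,
    shrink the remaining ones by lam, and scale by -sqrt(t)/gamma."""
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -(np.sqrt(t) / gamma) * shrunk
```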

Several other examples, including l2-norm and a hybrid l1/l2-norm (Berhu) regularization, also admit closed-form solutions for w_{t+1}. Their solutions are similar in form to those obtained in the context of the FOBOS algorithm in Duchi and Singer (2009).

2.2 RDA Methods with Strongly Convex Regularization

If the regularization term Psi(w) is strongly convex, we can use any nonnegative and nondecreasing sequence {beta_t} that grows no faster than O(ln t), to obtain an O(ln t / t) convergence rate for stochastic learning, or an O(ln t) regret bound for online optimization. For simplicity, in the following examples, we use the zero sequence beta_t = 0 for all t >= 1. In this case, we do not need the auxiliary function h(w), and the equation (8) becomes

    w_{t+1} = argmin_w { <gbar_t, w> + Psi(w) }.

l2-regularization. Let Psi(w) = (sigma/2) ||w||_2^2 for some sigma > 0. In this case,

    w_{t+1} = -(1/sigma) gbar_t = -(1/(sigma t)) Sum_{tau=1}^t g_tau.

Mixed l1/l2-regularization. Let Psi(w) = lambda ||w||_1 + (sigma/2) ||w||_2^2 with lambda > 0 and sigma > 0. In this case, we have

    w_{t+1}^(i) = { 0                                                   if |gbar_t^(i)| <= lambda,
                    -(1/sigma) ( gbar_t^(i) - lambda sgn(gbar_t^(i)) )  otherwise,
                                                                        i = 1, ..., n.

Kullback-Leibler (KL) divergence regularization. Let Psi(w) = sigma D_KL(w || p), where the given probability distribution p in rint S_n, and

    D_KL(w || p) := Sum_{i=1}^n w^(i) ln( w^(i) / p^(i) ).

Here D_KL(w || p) is strongly convex with respect to ||w||_1 with modulus 1. In this case,

    w_{t+1}^(i) = ( p^(i) / Z_{t+1} ) exp( -(1/sigma) gbar_t^(i) ),

where Z_{t+1} is a normalization parameter such that Sum_{i=1}^n w_{t+1}^(i) = 1. KL divergence regularization has the pseudo-sparsity effect, meaning that most elements in w can be replaced by elements in the constant vector p without significantly increasing the loss function (e.g., Bradley and Bagnell, 2009).
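With strongly convex Psi and beta_t = 0, the updates above are again closed-form. The short sketch below (my own illustration, not from the paper) implements the l2 and mixed l1/l2 cases.

```python
import numpy as np

def l2_rda_step(g_bar, sigma):
    """Psi(w) = (sigma/2)*||w||_2^2, beta_t = 0:  w_{t+1} = -(1/sigma) * gbar_t."""
    return -g_bar / sigma

def mixed_l1_l2_rda_step(g_bar, lam, sigma):
    """Psi(w) = lam*||w||_1 + (sigma/2)*||w||_2^2, beta_t = 0:
    soft-threshold gbar_t at lam, then scale by -1/sigma."""
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -shrunk / sigma
```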

3. Regret Bounds for Online Optimization

In this section, we give the precise regret bounds of the RDA method for solving regularized online optimization problems. The convergence rates for stochastic learning problems can be established based on these regret bounds, and will be given in the next section. For clarity, we gather here the general assumptions used throughout this paper:

- The regularization term Psi(w) is a closed proper convex function, and dom Psi is closed. The symbol sigma is dedicated to the convexity parameter of Psi. Without loss of generality, we assume min_w Psi(w) = 0.
- For each t >= 1, the function f_t(w) is convex and subdifferentiable on dom Psi.
- The function h(w) is strongly convex on dom Psi, and subdifferentiable on rint(dom Psi). Without loss of generality, assume h(w) has convexity parameter 1 and min_w h(w) = 0.

We will not repeat these general assumptions when stating our formal results later. To facilitate regret analysis, we first give a few definitions. For any constant D > 0, we define the set

    F_D := { w in dom Psi : h(w) <= D^2 },

and let

    Gamma_D := sup_{w in F_D} inf_{g in d Psi(w)} ||g||_*.                       (11)

We use the convention inf_{g in emptyset} ||g||_* = +inf. As a result, if Psi is not subdifferentiable everywhere on F_D, i.e., if d Psi(w) = emptyset at some w in F_D, then we have Gamma_D = +inf. Note that Gamma_D is not a Lipschitz-type constant, which would be required to be an upper bound on all the subgradients; instead, we only require that at least one subgradient is bounded in norm by Gamma_D at every point in the set F_D.

We assume that the sequence of subgradients {g_t}_{t>=1} generated by Algorithm 1 is bounded, i.e., there exists a constant G such that

    ||g_t||_* <= G,   for all t >= 1.                                            (12)

This is true, for example, if dom Psi is compact and each f_t has Lipschitz-continuous gradient on dom Psi. We require that the input sequence {beta_t}_{t>=1} be chosen such that

    max{ sigma, beta_1 } > 0,                                                    (13)

where sigma is the convexity parameter of Psi(w). For convenience, we let beta_0 = max{sigma, beta_1} and define the sequence of regret bounds

    Delta_t := beta_t D^2 + (G^2/2) Sum_{tau=0}^{t-1} 1/(sigma tau + beta_tau) + (beta_0 - beta_1) G^2 / (2(beta_1 + sigma)),   t = 1, 2, 3, ...,   (14)

where D is the constant used in the definition of F_D. We could always set beta_1 >= sigma, so that beta_0 = beta_1 and therefore the term (beta_0 - beta_1) G^2 / (2(beta_1 + sigma)) vanishes in the definition (14). However, when sigma > 0, we would like to keep the flexibility of setting beta_t = 0 for all t >= 1, as we did in Section 2.2.

Theorem 1  Let the sequences {w_tau}_{tau>=1} and {g_tau}_{tau>=1} be generated by Algorithm 1, and assume (12) and (13) hold. Then for any t >= 1 and any w in F_D, we have:

(a) The regret defined in (5) is bounded by Delta_t, i.e.,

    R_t(w) <= Delta_t.                                                           (15)

(b) The primal variables are bounded as

    ||w_{t+1} - w||^2 <= (2/(sigma t + beta_t)) ( Delta_t - R_t(w) ).            (16)

(c) If w is an interior point, i.e., B(w, r) is contained in F_D for some r > 0, then

    ||gbar_t||_* <= Gamma_D + sigma r + (1/(r t)) ( Delta_t - R_t(w) ).          (17)

In Theorem 1, the bounds on ||w_{t+1} - w|| and ||gbar_t||_* depend on the regret R_t(w). More precisely, they depend on Delta_t - R_t(w), which is the slack of the regret bound in (15). A smaller slack is equivalent to a larger regret R_t(w), which means w is a better fixed solution for the online optimization problem (the best one gives the largest regret); correspondingly, the inequality (16) gives a tighter bound on ||w_{t+1} - w||. In (17), the left-hand side ||gbar_t||_* does not depend on any particular interior point w to compare with, but the right-hand side depends on both R_t(w) and how far w is from the boundary of F_D. The tightest bound on ||gbar_t||_* can be obtained by taking the infimum of the right-hand side over all w in int F_D. We further elaborate on part (c) through the following two examples:

- Consider the case when Psi is the indicator function of a closed convex set C. In this case, sigma = 0 and d Psi(w) is the normal cone to C at w (Rockafellar, 1970, Section 23). By the definition (11), we have Gamma_D = 0 because the zero vector is a subgradient at every w in C, even though the normal cones can be unbounded at the boundary of C. In this case, if B(w, r) is contained in F_D for some r > 0, then (17) simplifies to

      ||gbar_t||_* <= (1/(r t)) ( Delta_t - R_t(w) ).

- Consider the function Psi(w) = sigma D_KL(w || p) with dom Psi = S_n (assuming p in rint S_n). In this case, dom Psi, and hence F_D, have empty interior. Therefore the bound in part (c) does not apply. In fact, the quantity Gamma_D can be unbounded anyway. In particular, the subdifferentials of Psi at the relative boundary of S_n are all empty. In the relative interior of S_n, the subgradients (actually gradients) of Psi always exist, but can become unbounded for points approaching the relative boundary. Nevertheless, the bounds in parts (a) and (b) still hold.

The proof of Theorem 1 is given in Appendix B. In the rest of this section, we discuss more concrete regret bounds depending on whether or not Psi is strongly convex.

3.1 Regret Bound with General Convex Regularization

For a general convex regularization term Psi, any nonnegative and nondecreasing sequence beta_t = Theta(sqrt(t)) gives an O(sqrt(t)) regret bound. Here we give detailed analysis for the sequence used in Section 2.1. More specifically, we choose a constant gamma > 0 and let

    beta_t = gamma sqrt(t),   for all t >= 1.                                    (18)

We have the following corollary of Theorem 1.

Corollary 2  Let the sequences {w_tau}_{tau>=1} and {g_tau}_{tau>=1} be generated by Algorithm 1 using {beta_t} defined in (18), and assume (12) holds. Then for any t >= 1 and any w in F_D:

(a) The regret is bounded as

    R_t(w) <= ( gamma D^2 + G^2/gamma ) sqrt(t).

(b) The primal variables are bounded as

    ||w_{t+1} - w||^2 <= 2 D^2 + 2 G^2/gamma^2 - (2/(gamma sqrt(t))) R_t(w).

(c) If w is an interior point, i.e., B(w, r) is contained in F_D for some r > 0, then

    ||gbar_t||_* <= Gamma_D + (1/(r sqrt(t))) ( gamma D^2 + G^2/gamma ) - (1/(r t)) R_t(w).

Proof  To simplify regret analysis, let gamma >= sigma. Therefore beta_0 = beta_1 = gamma. Then Delta_t defined in (14) becomes

    Delta_t = gamma sqrt(t) D^2 + (G^2/(2 gamma)) ( 1 + Sum_{tau=1}^{t-1} 1/sqrt(tau) ).

Next using the inequality

    Sum_{tau=1}^{t-1} 1/sqrt(tau) <= 1 + Integral_1^{t-1} (1/sqrt(tau)) d tau <= 2 sqrt(t) - 1,

we get

    Delta_t <= gamma sqrt(t) D^2 + (G^2/(2 gamma)) 2 sqrt(t) = ( gamma D^2 + G^2/gamma ) sqrt(t).

Combining the above inequality and the conclusions of Theorem 1 proves the corollary.

The regret bound in Corollary 2 is essentially the same as for the online gradient descent method of Zinkevich (2003), which has the form (3), with the stepsize alpha_t = 1/(gamma sqrt(t)). The main advantage of the RDA method is its capability of exploiting the regularization structure, as shown in Section 2.

The parameters D and G are not used explicitly in the algorithm. However, we need good estimates of them for choosing a reasonable value for gamma. The best gamma that minimizes the expression gamma D^2 + G^2/gamma is

    gamma* = G/D,

which leads to the simplified regret bound

    R_t(w) <= 2 G D sqrt(t).

If the total number of online iterations T is known in advance, then using a constant stepsize in the classical gradient method (3), say

    alpha_t = 1/(gamma sqrt(T)) = D/(G sqrt(T)),   t = 1, ..., T,                (19)

gives a slightly improved bound R_T(w) <= sqrt(2) G D sqrt(T) (see, e.g., Nemirovski et al., 2009).

The bound in part (b) does not converge to zero. This result is still interesting because there is no special caution taken in the RDA method, more specifically in (8), to ensure the boundedness of the sequence w_t. In the case Psi(w) = 0, as pointed out by Nesterov (2009), this may even look surprising since we are minimizing over E the sum of a linear function and a regularization term (gamma/sqrt(t)) h(w) that eventually goes to zero.

Part (c) gives a bound on the norm of the dual average. If Psi(w) is the indicator function of a closed convex set, then Gamma_D = 0 and part (c) shows that ||gbar_t||_* actually converges to zero if there exists an interior w in F_D such that R_t(w) >= 0. However, a properly scaled version of gbar_t, namely -(sqrt(t)/gamma) gbar_t, tracks the optimal solution; see the examples in Section 2.1.

3.2 Regret Bounds with Strongly Convex Regularization

If the regularization function Psi(w) is strongly convex, i.e., with a convexity parameter sigma > 0, then any nonnegative, nondecreasing sequence that satisfies beta_t <= O(ln t) will give an O(ln t) regret bound. If {beta_t} is not the all zero sequence, we can simply choose the auxiliary function h(w) = (1/sigma) Psi(w). Here are several possibilities:

- Positive constant sequences. For simplicity, let beta_t = sigma for t >= 1. In this case,

      Delta_t = sigma D^2 + (G^2/(2 sigma)) ( 1 + Sum_{tau=1}^{t-1} 1/(tau + 1) ) <= sigma D^2 + (G^2/(2 sigma)) (1 + ln t).

- Logarithmic sequences. Let beta_t = sigma (1 + ln t) for t >= 1. In this case, beta_0 = beta_1 = sigma and

      Delta_t = sigma (1 + ln t) D^2 + (G^2/(2 sigma)) ( 1 + Sum_{tau=1}^{t-1} 1/(tau + 1 + ln tau) ) <= ( sigma D^2 + G^2/(2 sigma) ) (1 + ln t).

- The zero sequence. Let beta_t = 0 for t >= 1. In this case, beta_0 = sigma and

      Delta_t = (G^2/2) ( 1/sigma + Sum_{tau=1}^{t-1} 1/(sigma tau) ) + G^2/2 <= (G^2/(2 sigma)) (6 + ln t).    (20)

  Notice that in this last case, the regret bound Delta_t does not depend on D.
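Each of the three Delta_t bounds above relies on an elementary harmonic-sum estimate; the following short derivation is added here for completeness (it is not spelled out in the text above) and records the bounds being used.

```latex
\sum_{\tau=1}^{t-1} \frac{1}{\tau+1}
  \;\le\; \int_{1}^{t} \frac{d\tau}{\tau}
  \;=\; \ln t,
\qquad\qquad
\sum_{\tau=1}^{t-1} \frac{1}{\tau}
  \;\le\; 1 + \int_{1}^{t-1} \frac{d\tau}{\tau}
  \;\le\; 1 + \ln t .
```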

When Psi is strongly convex, we also conclude that, given two different points u and v, the regrets R_t(u) and R_t(v) cannot be nonnegative simultaneously if t is large enough. To see this, we notice that if R_t(u) and R_t(v) are nonnegative simultaneously for some t, then part (b) of Theorem 1 implies

    ||w_{t+1} - u||^2 <= O( (ln t)/t )   and   ||w_{t+1} - v||^2 <= O( (ln t)/t ),

which again implies

    ||u - v||^2 <= 2 ( ||w_{t+1} - u||^2 + ||w_{t+1} - v||^2 ) <= O( (ln t)/t ).

Therefore, if the event that R_t(u) and R_t(v) are both nonnegative happens for infinitely many t, we must have u = v. If u is different from v, then eventually at least one of the regrets associated with them will become negative. However, it is possible to construct sequences of functions f_t such that the points with nonnegative regrets do not converge to a fixed point.

4. Convergence Rates for Stochastic Learning

In this section, we give convergence rates of the RDA method when it is used to solve the regularized stochastic learning problem (1), and also the related high probability bounds. These rates and bounds are established not for the individual w_t's generated by the RDA method, but rather for the primal average

    wbar_t = (1/t) Sum_{tau=1}^t w_tau,   t >= 1.

4.1 Rate of Convergence in Expectation

Theorem 3  Assume there exists an optimal solution w* to the problem (1) that satisfies h(w*) <= D^2 for some D > 0, and let phi* = phi(w*). Let the sequences {w_tau}_{tau>=1} and {g_tau}_{tau>=1} be generated by Algorithm 1, and assume (12) holds. Then for any t >= 1, we have:

(a) The expected cost associated with the random variable wbar_t is bounded as

    E phi(wbar_t) - phi* <= (1/t) Delta_t.

(b) The primal variables are bounded as

    E ||w_{t+1} - w*||^2 <= (2/(sigma t + beta_t)) Delta_t.

(c) If w* is an interior point, i.e., B(w*, r) is contained in F_D for some r > 0, then

    E ||gbar_t||_* <= Gamma_D + sigma r + (1/(r t)) Delta_t.

Proof  First, we substitute all f_tau(.) by f(., z_tau) in the definition of the regret

    R_t(w*) = Sum_{tau=1}^t ( f(w_tau, z_tau) + Psi(w_tau) ) - Sum_{tau=1}^t ( f(w*, z_tau) + Psi(w*) ).

Let z[t] denote the collection of i.i.d. random variables (z_1, ..., z_t). All the expectations in Theorem 3 are taken with respect to z[t], i.e., the symbol E can be written more explicitly as E_{z[t]}. We note that the random variable w_tau, where tau <= t, is a function of (z_1, ..., z_{tau-1}), and is independent of (z_tau, ..., z_t). Therefore

    E_{z[t]} ( f(w_tau, z_tau) + Psi(w_tau) ) = E_{z[tau-1]} ( E_{z_tau} f(w_tau, z_tau) + Psi(w_tau) ) = E_{z[tau-1]} phi(w_tau) = E_{z[t]} phi(w_tau),

and

    E_{z[t]} ( f(w*, z_tau) + Psi(w*) ) = E_{z_tau} f(w*, z_tau) + Psi(w*) = phi(w*) = phi*.

Since phi* = phi(w*) = min_w phi(w), we have

    E_{z[t]} R_t(w*) = Sum_{tau=1}^t E_{z[t]} phi(w_tau) - t phi* >= 0.           (21)

By convexity of phi, we have

    phi(wbar_t) = phi( (1/t) Sum_{tau=1}^t w_tau ) <= (1/t) Sum_{tau=1}^t phi(w_tau).

Taking expectation with respect to z[t] and subtracting phi*, we have

    E_{z[t]} phi(wbar_t) - phi* <= (1/t) Sum_{tau=1}^t E_{z[t]} phi(w_tau) - phi* = (1/t) E_{z[t]} R_t(w*).

Then part (a) follows from that of Theorem 1, which states that R_t(w*) <= Delta_t for all realizations of z[t]. Similarly, parts (b) and (c) follow from those of Theorem 1 and (21).

Specific convergence rates can be obtained in parallel with the regret bounds discussed in Sections 3.1 and 3.2. We only need to divide every regret bound by t to obtain the corresponding rate of convergence in expectation. More specifically, using appropriate sequences {beta_t}, we have E phi(wbar_t) converging to phi* with rate O(1/sqrt(t)) for general convex regularization, and O(ln t / t) for strongly convex regularization.

The bound in part (b) applies to both the case sigma = 0 and the case sigma > 0. For the latter, we can derive a slightly different and more specific bound. When Psi has convexity parameter sigma > 0, so is the function phi. Therefore,

    phi(w) >= phi(w*) + <s, w - w*> + (sigma/2) ||w - w*||^2,   for all s in d phi(w*).

Since w* is the minimizer of phi, we must have 0 in d phi(w*) (Rockafellar, 1970, Section 27). Setting s = 0 in the above inequality and rearranging terms, we have

    ||w - w*||^2 <= (2/sigma) ( phi(w) - phi* ).

Taking expectation of both sides of the above inequality, with w = wbar_t, leads to

    E ||wbar_t - w*||^2 <= (2/sigma) ( E phi(wbar_t) - phi* ) <= (2/(sigma t)) Delta_t,    (22)

where in the last step we used part (a) of Theorem 3. This bound directly relates wbar_t to the regret bound Delta_t. Next we take a closer look at the quantity E ||wbar_t - w*||^2. By convexity of the squared norm, we have

    E ||wbar_t - w*||^2 <= (1/t) Sum_{tau=1}^t E ||w_tau - w*||^2.               (23)

If sigma = 0, then the right-hand side is simply bounded by a constant, because each E ||w_tau - w*||^2 for tau <= t is bounded by a constant. When sigma > 0, the optimal solution w* is unique, and we have:

Corollary 4  If Psi is strongly convex with convexity parameter sigma > 0 and beta_t = O(ln t), then

    E ||wbar_t - w*||^2 <= O( (ln t)^2 / t ).

Proof  For the ease of presentation, we consider the case beta_t = 0 for all t >= 1. Substituting the bound on Delta_t in (20) into the inequality in part (b) of Theorem 3 gives

    E ||w_{t+1} - w*||^2 <= (6 + ln t) G^2 / (sigma^2 t),   for all t >= 1.

Then by (23),

    E ||wbar_t - w*||^2 <= (1/t) Sum_{tau=1}^t (6 + ln tau) G^2 / (sigma^2 tau) <= (6 + ln t)(1 + ln t) G^2 / (sigma^2 t).

In other words, E ||wbar_t - w*||^2 converges to zero with rate O((ln t)^2 / t). This can be shown for any beta_t = O(ln t); see Section 3.2 for other choices of beta_t.

As a further note, the conclusions in Theorem 3 still hold if the assumption (12) is weakened to

    E ||g_t||_*^2 <= G^2,   for all t >= 1.                                      (24)

However, we need (12) in order to prove the high probability bounds presented next.

4.2 High Probability Bounds

For stochastic learning problems, in addition to the rates of convergence in expectation, it is often desirable to obtain confidence level bounds for approximate solutions. For this purpose, we start from part (a) of Theorem 3, which states E phi(wbar_t) - phi* <= Delta_t / t. By Markov's inequality, we have for any epsilon > 0,

    Prob( phi(wbar_t) - phi* > epsilon ) <= Delta_t / (t epsilon).               (25)

This bound holds even with the weakened assumption (24). However, it is possible to have much tighter bounds under more restrictive assumptions. To this end, we have the following result.

Theorem 5  Assume there exist constants D and G such that h(w*) <= D^2, and h(w_t) <= D^2 and ||g_t||_* <= G for all t >= 1. Then for any delta in (0, 1), we have, with probability at least 1 - delta,

    phi(wbar_t) - phi* <= Delta_t/t + 8 G D sqrt(ln(1/delta)) / sqrt(t),   for all t >= 1.    (26)

Theorem 5 is proved in Appendix C. From our results in Section 3.1, with the input sequence beta_t = gamma sqrt(t) for all t >= 1, we have Delta_t = O(sqrt(t)) regardless of sigma = 0 or sigma > 0. Therefore, phi(wbar_t) - phi* = O(1/sqrt(t)) with high probability. To simplify further discussion, let gamma = G/D, hence Delta_t <= 2 G D sqrt(t) (see Section 3.1). In this case, if delta <= 1/e, i.e., about 0.368, then with probability at least 1 - delta,

    phi(wbar_t) - phi* <= 10 G D sqrt(ln(1/delta)) / sqrt(t).

Letting epsilon = 10 G D sqrt(ln(1/delta)) / sqrt(t), the above bound is equivalent to

    Prob( phi(wbar_t) - phi* > epsilon ) <= exp( - t epsilon^2 / (10 G D)^2 ),

which is much tighter than the one in (25). It follows that for any chosen accuracy epsilon and 0 < delta <= 1/e, the sample size

    t >= (10 G D)^2 ln(1/delta) / epsilon^2

guarantees that, with probability at least 1 - delta, wbar_t is an epsilon-optimal solution of the original stochastic optimization problem (1).

When Psi is strongly convex (sigma > 0), our results in Section 3.2 show that we can obtain regret bounds Delta_t = O(ln t) using beta_t = O(ln t). However, the high probability bound in Theorem 5 does not improve: we still have phi(wbar_t) - phi* = O(1/sqrt(t)), not O(ln t / t). The reason is that the concentration inequality (Azuma, 1967) used in proving Theorem 5 cannot take advantage of the strong-convexity property. By using a refined concentration inequality due to Freedman (1975), Kakade and Tewari (2009, Theorem 2) showed that for strongly convex stochastic learning problems, with probability at least 1 - 4 delta ln t,

    phi(wbar_t) - phi* <= R_t(w*)/t + (4/t) sqrt( R_t(w*) G^2 ln(1/delta) / sigma ) + max{ 16 G^2/sigma, 6 B } ln(1/delta) / t.

In our context, the constant B is an upper bound on f(w, z) + Psi(w) for w in F_D. Using the regret bound R_t(w*) <= Delta_t, this gives

    phi(wbar_t) - phi* <= Delta_t/t + O( sqrt( Delta_t ln(1/delta) )/t + ln(1/delta)/t ).

Here the constants hidden in the O-notation are determined by G, sigma and D. Plugging in Delta_t = O(ln t), we have phi(wbar_t) - phi* = O(ln t / t) with high probability. The additional penalty of getting the high probability bound, compared with the rate of convergence in expectation, is only O(sqrt(ln t)/t).

5. Related Work

As we pointed out in Section 2.1, if Psi is the indicator function of a convex set C, then the RDA method recovers the simple dual averaging scheme in Nesterov (2009). This special case also belongs to a more general primal-dual algorithmic framework developed by Shalev-Shwartz and Singer (2006), which can be expressed equivalently in our notation:

    w_{t+1} = argmin_{w in C} { (1/(gamma sqrt(t))) Sum_{tau=1}^t <d_tau, w> + h(w) },

where (d_1, ..., d_t) is the set of dual variables that can be chosen at time t. The simple dual averaging scheme (9) is in fact the passive extreme of their framework in which the dual variables are simply chosen as the subgradients and do not change over time, i.e.,

    d_tau = g_tau,   for all tau <= t,   for all t >= 1.                         (27)

However, with the addition of a general regularization term Psi(w) as in (4), the convergence analysis and O(sqrt(t)) regret bound of the RDA method do not follow directly as corollaries of either Nesterov (2009) or Shalev-Shwartz and Singer (2006). Our analysis in Appendix B extends the framework of Nesterov (2009).

Shalev-Shwartz and Kakade (2009) extended the primal-dual framework of Shalev-Shwartz and Singer (2006) to strongly convex functions and obtained an O(ln t) regret bound. In the context of this paper, their algorithm takes the form

    w_{t+1} = argmin_{w in C} { (1/(sigma t)) Sum_{tau=1}^t <d_tau, w> + h(w) },

where sigma is the convexity parameter of Psi, and h(w) = (1/sigma) Psi(w). The passive extreme of this method, with the dual variables chosen in (27), is equivalent to a special case of the RDA method with beta_t = 0 for all t >= 1.

Other than improving the iteration complexity, the idea of treating the regularization explicitly in each step of a subgradient-based method (instead of lumping it together with the loss function and taking their subgradients) is mainly motivated by practical considerations, such as obtaining sparse solutions. In the case of l1-regularization, this leads to soft-thresholding types of algorithms, in both batch learning (e.g., Figueiredo et al., 2007; Wright et al., 2009; Bredies and Lorenz, 2008; Beck and Teboulle, 2009) and the online setting (e.g., Langford et al., 2009; Duchi and Singer, 2009; Shalev-Shwartz and Tewari, 2009). Most of these algorithms can be viewed as extensions of classical gradient methods (including mirror-descent methods) in which the new iterate is obtained by stepping from the current iterate along a single subgradient, and then followed by a truncation. Other types of algorithms include an interior-point based stochastic approximation scheme by Carbonetto et al. (2009), and Balakrishnan and Madigan (2008), where a modified shrinkage algorithm is developed based on sequential quadratic approximations of the loss function.

The main point of this paper is to show that dual-averaging based methods can be more effective in exploiting the regularization structure, especially in a stochastic or online setting. To demonstrate this point, we compare the RDA method with the FOBOS method studied in Duchi and Singer (2009).

In an online setting, each iteration of the FOBOS method consists of the following two steps:

    w_{t+1/2} = w_t - alpha_t g_t,
    w_{t+1} = argmin_w { (1/2) ||w - w_{t+1/2}||^2 + alpha_t Psi(w) }.

For convergence with optimal rates, the stepsize alpha_t is set to be Theta(1/sqrt(t)) for general convex regularizations and Theta(1/t) if Psi is strongly convex. This method is based on a technique known as forward-backward splitting, which was first proposed by Lions and Mercier (1979) and later analyzed by Chen and Rockafellar (1997) and Tseng (2000). For easy comparison with the RDA method, we rewrite the FOBOS method in an equivalent form

    w_{t+1} = argmin_w { <g_t, w> + Psi(w) + (1/(2 alpha_t)) ||w - w_t||^2 }.     (28)

Compared with this form of the FOBOS method, the RDA method (8) uses the average subgradient gbar_t instead of the current subgradient g_t; it uses a global proximal function, say h(w) = (1/2) ||w||_2^2, instead of its local Bregman divergence (1/2) ||w - w_t||^2; moreover, the coefficient for the proximal function is beta_t/t = Theta(1/sqrt(t)) instead of 1/(2 alpha_t) = Theta(sqrt(t)) for general convex regularization, and O(ln t / t) instead of Theta(t) for strongly convex regularization. Although these two methods have the same order of iteration complexity, the differences listed above contribute to quite different properties of their solutions.

These differences can be better understood in the special case of l1-regularization, i.e., when Psi(w) = lambda ||w||_1. In this case, the FOBOS method is equivalent to a special case of the Truncated Gradient (TG) method of Langford et al. (2009). The TG method truncates the solutions obtained by the standard SGD method every K steps; more specifically,

    w_{t+1}^(i) = { trnc( w_t^(i) - alpha_t g_t^(i), lambda_t^TG, theta )   if mod(t, K) = 0,
                    w_t^(i) - alpha_t g_t^(i)                               otherwise,          (29)

where lambda_t^TG = alpha_t lambda K, mod(t, K) is the remainder on division of t by K, and

    trnc(omega, lambda_t^TG, theta) = { 0                              if |omega| <= lambda_t^TG,
                                        omega - lambda_t^TG sgn(omega) if lambda_t^TG < |omega| <= theta,
                                        omega                          if |omega| > theta.       (30)

When K = 1 and theta = +inf, the TG method is the same as the FOBOS method (28) with l1-regularization. Now comparing the truncation threshold lambda_t^TG and the threshold lambda used in the l1-RDA method (10): with alpha_t = Theta(1/sqrt(t)), we have lambda_t^TG = Theta(1/sqrt(t)) lambda. This Theta(1/sqrt(t)) discount factor is also common for other previous work that uses soft-thresholding, including Shalev-Shwartz and Tewari (2009). It is clear that the RDA method uses a much more aggressive truncation threshold, thus is able to generate significantly more sparse solutions. This is confirmed by our computational experiments in the next section.

Most recently, Duchi et al. (2010) developed a family of subgradient methods that can adaptively modify the proximal function (squared Mahalanobis norms) at each iteration, in order to better incorporate learned knowledge about the geometry of the data. Their methods include extensions for both the mirror-descent type of algorithms like (28) and the RDA methods studied in this paper.
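To make the threshold comparison above concrete, here is a small sketch (my own, with hypothetical names) of the truncation operator (30) and of the two thresholds being compared: lambda_t^TG = alpha_t lambda K with alpha_t = c/sqrt(t), versus the constant lambda used by l1-RDA in (10).

```python
import numpy as np

def trnc(omega, lam_tg, theta):
    """Truncation operator (30), applied element-wise to the array omega."""
    out = np.where(np.abs(omega) <= lam_tg,
                   0.0,
                   omega - lam_tg * np.sign(omega))
    return np.where(np.abs(omega) > theta, omega, out)

def tg_threshold(lam, c, K, t):
    """TG / FOBOS threshold lambda_t^TG = alpha_t * lambda * K with alpha_t = c/sqrt(t);
    it shrinks like 1/sqrt(t), whereas the l1-RDA threshold stays at lambda
    (or lambda + gamma*rho/sqrt(t) in the enhanced version of Section 6)."""
    return (c * K / np.sqrt(t)) * lam
```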

Algorithm 2  Enhanced l1-RDA method

Input: gamma > 0, rho >= 0.
Initialize: w_1 = 0, gbar_0 = 0.
for t = 1, 2, 3, ... do
  1. Given the function f_t, compute a subgradient g_t in d f_t(w_t).
  2. Compute the dual average gbar_t = ((t - 1)/t) gbar_{t-1} + (1/t) g_t.
  3. Let lambda_t^RDA = lambda + gamma rho / sqrt(t), and compute w_{t+1} entry-wise:

        w_{t+1}^(i) = { 0                                                                 if |gbar_t^(i)| <= lambda_t^RDA,
                        -(sqrt(t)/gamma) ( gbar_t^(i) - lambda_t^RDA sgn(gbar_t^(i)) )    otherwise,
                                                                                          i = 1, ..., n.   (32)
end for

6. Computational Experiments with l1-Regularization

In this section, we provide computational experiments of the l1-RDA method on the MNIST dataset of handwritten digits (LeCun et al., 1998). Our purpose here is mainly to illustrate the basic characteristics of the l1-RDA method, rather than a comprehensive performance evaluation on a wide range of datasets. First, we describe a variant of the l1-RDA method that is capable of getting enhanced sparsity in the solution.

6.1 Enhanced l1-RDA Method

The enhanced l1-RDA method shown in Algorithm 2 is a special case of Algorithm 1. It is derived by setting Psi(w) = lambda ||w||_1, beta_t = gamma sqrt(t), and replacing h(w) with a parametrized version

    h_rho(w) = (1/2) ||w||_2^2 + rho ||w||_1,                                    (31)

where rho >= 0 is a sparsity-enhancing parameter. Note that h_rho(w) is strongly convex with modulus 1 for any rho >= 0. Hence the convergence rate of this algorithm is the same as if we choose h(w) = (1/2) ||w||_2^2. In this case, the equation (8) becomes

    w_{t+1} = argmin_w { <gbar_t, w> + lambda ||w||_1 + (gamma/sqrt(t)) ( (1/2) ||w||_2^2 + rho ||w||_1 ) }
            = argmin_w { <gbar_t, w> + lambda_t^RDA ||w||_1 + (gamma/(2 sqrt(t))) ||w||_2^2 },

where lambda_t^RDA = lambda + gamma rho / sqrt(t). The above minimization problem has a closed-form solution given in (32) (see Appendix A for the derivation). By letting rho > 0, the effective truncation threshold lambda_t^RDA is larger than lambda, especially in the initial phase of the online process. For problems without explicit l1-regularization in the objective function, i.e., when lambda = 0, this still gives a diminishing truncation threshold gamma rho / sqrt(t).
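A compact sketch of Algorithm 2 (my own rendering; the subgradient oracle is an assumed callable, not part of the paper).

```python
import numpy as np

def enhanced_l1_rda(subgradient_oracle, n, T, lam, gamma, rho):
    """Enhanced l1-RDA (Algorithm 2): RDA with Psi(w) = lam*||w||_1 and
    h_rho(w) = 0.5*||w||_2^2 + rho*||w||_1, i.e. threshold lam + gamma*rho/sqrt(t)."""
    w = np.zeros(n)
    g_bar = np.zeros(n)
    for t in range(1, T + 1):
        g = subgradient_oracle(t, w)                      # step 1
        g_bar = ((t - 1) * g_bar + g) / t                 # step 2: dual average
        lam_rda = lam + gamma * rho / np.sqrt(t)          # step 3: enhanced threshold
        shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam_rda, 0.0)
        w = -(np.sqrt(t) / gamma) * shrunk                # closed-form update (32)
    return w
```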

Figure 1: Sample images from the MNIST dataset, with gray-scale values from 0 to 255.

We can also restrict l1-regularization to part of the optimization variables only. For example, in support vector machines or logistic regression, we usually want the bias terms to be free of regularization. In this case, we can simply replace lambda_t^RDA by 0 for the corresponding coordinates in (32).

6.2 Experiments on the MNIST Dataset

Each image in the MNIST dataset is represented by a 28 x 28 gray-scale pixel-map, for a total of 784 features. Each of the 10 digits has roughly 6,000 training examples and 1,000 testing examples. Some of the samples are shown in Figure 1. From the perspective of using stochastic and online algorithms, the number of features and size of the dataset are considered very small. Nevertheless, we choose this dataset because the computational results are easy to visualize. No preprocessing of the data is employed.

We use l1-regularized logistic regression to do binary classification on each of the 45 pairs of digits. More specifically, let z = (x, y) where x in R^784 represents a gray-scale image and y in {+1, -1} is the binary label, and let w = (wtilde, b) where wtilde in R^784 and b is the bias. Then the loss function and regularization term in (1) are

    f(w, z) = log( 1 + exp( -y (wtilde^T x + b) ) ),    Psi(w) = lambda ||wtilde||_1.

Note that we do not apply regularization on the bias term b. In the experiments, we compare the (enhanced) l1-RDA method (Algorithm 2) with the SGD method

    w_{t+1}^(i) = w_t^(i) - alpha_t ( g_t^(i) + lambda sgn(w_t^(i)) ),

and the TG method (29) with theta = +inf. These three online algorithms have similar convergence rates and the same order of computational cost per iteration. We also compare them with the batch optimization approach, more specifically solving the empirical minimization problem (2) using an efficient interior-point method (IPM) of Koh et al. (2007).

Each pair of digits has about 12,000 training examples and 2,000 testing examples. We use the online algorithms to go through the (randomly permuted) data only once, therefore the algorithms stop at T = 12,000. We vary the regularization parameter lambda from 0.01 to 10. As a reference, the maximum lambda for the batch optimization case (Koh et al., 2007) is mostly in the range of 30 to 50 (beyond which the optimal weights are all zeros).
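For reference, a sketch (my own, not from the paper) of the per-example gradient of the loss used in these experiments, the logistic loss with an unregularized bias term:

```python
import numpy as np

def logistic_loss_grad(w, b, x, y):
    """f(w,z) = log(1 + exp(-y*(w^T x + b))); returns (grad_w, grad_b).
    Only w is l1-regularized in the experiments; b is left unregularized."""
    s = 1.0 / (1.0 + np.exp(y * (np.dot(w, x) + b)))   # = sigmoid(-y*(w^T x + b))
    return -y * s * x, -y * s
```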

Figure 2: Sparsity patterns of w_T and wbar_T for classifying the digits 6 and 7 when varying the parameter lambda from 0.01 to 10 in l1-regularized logistic regression. The background gray represents the value zero, bright spots represent positive values and dark spots represent negative values. Each column corresponds to a value of lambda labeled at the top. The top three rows are the weights w_T (without averaging) from the last iteration of the three online algorithms; the middle row shows optimal solutions of the batch optimization problem solved by the interior-point method (IPM); the bottom three rows show the averaged weights wbar_T in the three online algorithms. Both the TG and RDA methods were run with parameters for enhanced l1-regularization, i.e., K = 10 for TG and gamma rho = 25 for RDA.

Figure 3: Number of non-zeros (NNZs) in w_t for the three online algorithms (classifying the pair 6 and 7). The left column shows SGD, TG with K = 1, and RDA with rho = 0; the right column shows SGD, TG with K = 10, and RDA with gamma rho = 25. The same curves for SGD are plotted in both columns for clear comparison. The two rows correspond to lambda = 0.1 and lambda = 10, respectively.

In the l1-RDA method, we use gamma = 5,000, and set rho to be either 0 for basic regularization, or 0.005 (effectively gamma rho = 25) for an enhanced regularization effect. These parameters are chosen by cross-validation. For the SGD and TG methods, we use a constant stepsize alpha = (1/gamma) sqrt(2/T) for a comparable convergence rate; see (19) and related discussions. In the TG method, the period K is set to be either 1 for basic regularization (same as FOBOS), or 10 for a periodic enhanced regularization effect.

Figure 2 shows the sparsity patterns of the solutions w_T and wbar_T for classifying the digits 6 and 7. The algorithmic parameters used are: K = 10 for the TG method, and gamma rho = 25 for the RDA method. It is clear that the RDA method gives more sparse solutions than both the SGD and TG methods. The sparsity pattern obtained by the RDA method is very similar to the batch optimization results solved by IPM, especially for larger lambda.

Figure 4: Tradeoffs between testing error rates and NNZs in solutions when varying lambda from 0.01 to 10 (for classifying 6 and 7). The left column shows SGD, TG with K = 1, RDA with rho = 0, and IPM. The right column shows SGD, TG with K = 10, RDA with gamma rho = 25, and IPM. The same curves for SGD and IPM are plotted in both columns for clear comparison. The top two rows show the testing error rates and NNZs of the final weights w_T, and the bottom two rows are for the averaged weights wbar_T. All horizontal axes have logarithmic scale. For vertical axes, only the two plots in the first row have logarithmic scale.

Figure 5: Testing error rates and NNZs in solutions for the RDA method when varying the parameter gamma from 1,000 to 100,000, and setting rho such that gamma rho = 25. The three rows show results for lambda = 0.1, 1, and 10, respectively. The corresponding batch optimization results found by IPM are shown as a horizontal line in each plot.

To have a better understanding of the behaviors of the algorithms, we plot the number of non-zeros (NNZs) in w_t in Figure 3. Only the RDA method and TG with K = 1 give explicit zero weights using soft-thresholding at every step. In order to count the NNZs in all other cases, we have to set a small threshold for rounding the weights to zero. Considering that the magnitudes of the largest weights in Figure 2 are mostly on the order of 10^-3, we set 10^-5 as the threshold and verified that rounding elements less than 10^-5 to zero does not affect the testing errors. Note that we do not truncate the weights for RDA and TG with K = 1 further, even if some of their components are below 10^-5. It can be seen that the RDA method maintains a much more sparse w_t than the other online algorithms. While the TG method generates more sparse solutions than the SGD method when lambda is large, the NNZs in w_t oscillate within a very big range. The oscillation becomes more severe with K = 10.

Figure 6: Sparsity patterns of wbar_T obtained by varying the parameter gamma in the RDA method from 1,000 to 100,000 (for classifying the pair 6 and 7). The first column shows results of batch optimization using IPM, and the other columns show results of the RDA method using the values of gamma labeled at the top.

In contrast, the RDA method demonstrates a much smoother behavior of the NNZs. For the RDA method, the effect of enhanced regularization using gamma rho = 25 is more pronounced for relatively small lambda.

Next we illustrate the tradeoffs between sparsity and testing error rates. Figure 4 shows that the solutions obtained by the RDA method match the batch optimization results very well. Since the performance of the online algorithms varies when the training data are given in different random sequences (permutations), we run them on randomly permuted sequences of the same training set, and plot the means and standard deviations shown as error bars. For the SGD and TG methods, the testing error rates of w_T vary a lot for different random sequences. In contrast, the RDA method demonstrates very robust performance (small standard deviations) for w_T, even though the theorems only give convergence bounds for the averaged weight wbar_T. For large values of lambda, the averaged weights wbar_T obtained by the SGD and TG methods actually have much smaller error rates than those of RDA and batch optimization. This can be explained by the limitation of the SGD and TG methods in obtaining sparse solutions: these lower error rates are obtained with many more nonzero features than used by the RDA and batch optimization methods.

Figure 5 shows the results of choosing different values for the parameter gamma in the RDA method. We see that smaller values of gamma, which correspond to faster learning rates, lead to more sparse w_T and higher testing error rates; larger values of gamma result in less sparse w_T with lower testing error rates. But interestingly, the effect on the averaged solution wbar_T is almost the opposite: smaller values of gamma lead to less sparse wbar_T (in this case, we count the NNZs using the rounding threshold 10^-5). For large regularization parameter lambda, smaller values of gamma also give lower testing error rates.

Figure 6 shows the sparsity patterns of wbar_T when varying gamma from 1,000 to 100,000. We see that smaller values of gamma give more sparse wbar_T, which are also more scattered, like the batch optimization solution by IPM.

Figure 7 shows a summary of classification results for all the 45 pairs of digits. For clarity, we only show results of the l1-RDA method and batch optimization using IPM. We see that the solutions obtained by the l1-RDA method demonstrate very similar tradeoffs between sparsity and testing error rates as rendered by the batch optimization solutions.


More information

Final Spring 2007

Final Spring 2007 .615 Final Spring 7 Overview The purpose of he final exam is o calculae he MHD β limi in a high-bea oroidal okamak agains he dangerous n = 1 exernal ballooning-kink mode. Effecively, his corresponds o

More information

WEEK-3 Recitation PHYS 131. of the projectile s velocity remains constant throughout the motion, since the acceleration a x

WEEK-3 Recitation PHYS 131. of the projectile s velocity remains constant throughout the motion, since the acceleration a x WEEK-3 Reciaion PHYS 131 Ch. 3: FOC 1, 3, 4, 6, 14. Problems 9, 37, 41 & 71 and Ch. 4: FOC 1, 3, 5, 8. Problems 3, 5 & 16. Feb 8, 018 Ch. 3: FOC 1, 3, 4, 6, 14. 1. (a) The horizonal componen of he projecile

More information

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models.

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models. Technical Repor Doc ID: TR--203 06-March-203 (Las revision: 23-Februar-206) On formulaing quadraic funcions in opimizaion models. Auhor: Erling D. Andersen Convex quadraic consrains quie frequenl appear

More information

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details!

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details! MAT 257, Handou 6: Ocober 7-2, 20. I. Assignmen. Finish reading Chaper 2 of Spiva, rereading earlier secions as necessary. handou and fill in some missing deails! II. Higher derivaives. Also, read his

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Lecture 20: Riccati Equations and Least Squares Feedback Control

Lecture 20: Riccati Equations and Least Squares Feedback Control 34-5 LINEAR SYSTEMS Lecure : Riccai Equaions and Leas Squares Feedback Conrol 5.6.4 Sae Feedback via Riccai Equaions A recursive approach in generaing he marix-valued funcion W ( ) equaion for i for he

More information

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t M ah 5 2 7 Fall 2 0 0 9 L ecure 1 0 O c. 7, 2 0 0 9 Hamilon- J acobi Equaion: Explici Formulas In his lecure we ry o apply he mehod of characerisics o he Hamilon-Jacobi equaion: u + H D u, x = 0 in R n

More information

EXERCISES FOR SECTION 1.5

EXERCISES FOR SECTION 1.5 1.5 Exisence and Uniqueness of Soluions 43 20. 1 v c 21. 1 v c 1 2 4 6 8 10 1 2 2 4 6 8 10 Graph of approximae soluion obained using Euler s mehod wih = 0.1. Graph of approximae soluion obained using Euler

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

Lecture 4: November 13

Lecture 4: November 13 Compuaional Learning Theory Fall Semeser, 2017/18 Lecure 4: November 13 Lecurer: Yishay Mansour Scribe: Guy Dolinsky, Yogev Bar-On, Yuval Lewi 4.1 Fenchel-Conjugae 4.1.1 Moivaion Unil his lecure we saw

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes Some common engineering funcions 2.7 Inroducion This secion provides a caalogue of some common funcions ofen used in Science and Engineering. These include polynomials, raional funcions, he modulus funcion

More information

Ensamble methods: Bagging and Boosting

Ensamble methods: Bagging and Boosting Lecure 21 Ensamble mehods: Bagging and Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Ensemble mehods Mixure of expers Muliple base models (classifiers, regressors), each covers a differen par

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 38, No. 2, May 2013, pp. 209 227 ISSN 0364-765X (prin) ISSN 1526-5471 (online) hp://dx.doi.org/10.1287/moor.1120.0562 2013 INFORMS On Boundedness of Q-Learning Ieraes

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning Journal of Machine Learning Research 13 (2012) 1665-1705 Submied 7/11; Revised 3/12; Published 5/12 Manifold Idenificaion in Dual Averaging for Regularized Sochasic Online Learning Sangkyun Lee Fakulä

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

STATE-SPACE MODELLING. A mass balance across the tank gives:

STATE-SPACE MODELLING. A mass balance across the tank gives: B. Lennox and N.F. Thornhill, 9, Sae Space Modelling, IChemE Process Managemen and Conrol Subjec Group Newsleer STE-SPACE MODELLING Inroducion: Over he pas decade or so here has been an ever increasing

More information

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8) I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Optimality Conditions for Unconstrained Problems

Optimality Conditions for Unconstrained Problems 62 CHAPTER 6 Opimaliy Condiions for Unconsrained Problems 1 Unconsrained Opimizaion 11 Exisence Consider he problem of minimizing he funcion f : R n R where f is coninuous on all of R n : P min f(x) x

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

Ensamble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

14 Autoregressive Moving Average Models

14 Autoregressive Moving Average Models 14 Auoregressive Moving Average Models In his chaper an imporan parameric family of saionary ime series is inroduced, he family of he auoregressive moving average, or ARMA, processes. For a large class

More information

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux Gues Lecures for Dr. MacFarlane s EE3350 Par Deux Michael Plane Mon., 08-30-2010 Wrie name in corner. Poin ou his is a review, so I will go faser. Remind hem o go lisen o online lecure abou geing an A

More information

Class Meeting # 10: Introduction to the Wave Equation

Class Meeting # 10: Introduction to the Wave Equation MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

Ordinary dierential equations

Ordinary dierential equations Chaper 5 Ordinary dierenial equaions Conens 5.1 Iniial value problem........................... 31 5. Forward Euler's mehod......................... 3 5.3 Runge-Kua mehods.......................... 36

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

Matlab and Python programming: how to get started

Matlab and Python programming: how to get started Malab and Pyhon programming: how o ge sared Equipping readers he skills o wrie programs o explore complex sysems and discover ineresing paerns from big daa is one of he main goals of his book. In his chaper,

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

Chapter 4. Truncation Errors

Chapter 4. Truncation Errors Chaper 4. Truncaion Errors and he Taylor Series Truncaion Errors and he Taylor Series Non-elemenary funcions such as rigonomeric, eponenial, and ohers are epressed in an approimae fashion using Taylor

More information

5.1 - Logarithms and Their Properties

5.1 - Logarithms and Their Properties Chaper 5 Logarihmic Funcions 5.1 - Logarihms and Their Properies Suppose ha a populaion grows according o he formula P 10, where P is he colony size a ime, in hours. When will he populaion be 2500? We

More information

A New Perturbative Approach in Nonlinear Singularity Analysis

A New Perturbative Approach in Nonlinear Singularity Analysis Journal of Mahemaics and Saisics 7 (: 49-54, ISSN 549-644 Science Publicaions A New Perurbaive Approach in Nonlinear Singulariy Analysis Ta-Leung Yee Deparmen of Mahemaics and Informaion Technology, The

More information

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY ECO 504 Spring 2006 Chris Sims RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY 1. INTRODUCTION Lagrange muliplier mehods are sandard fare in elemenary calculus courses, and hey play a cenral role in economic

More information

Manifold Identification of Dual Averaging Methods for Regularized Stochastic Online Learning

Manifold Identification of Dual Averaging Methods for Regularized Stochastic Online Learning for Regularized Sochasic Online Learning Sangkyun Lee sklee@cs.wisc.edu Sephen J. Wrigh swrigh@cs.wisc.edu Compuer Sciences Deparmen, Universiy of Wisconsin, W. Dayon Sree, Madison, WI 5376 USA Absrac

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates Biol. 356 Lab 8. Moraliy, Recruimen, and Migraion Raes (modified from Cox, 00, General Ecology Lab Manual, McGraw Hill) Las week we esimaed populaion size hrough several mehods. One assumpion of all hese

More information

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should Cambridge Universiy Press 978--36-60033-7 Cambridge Inernaional AS and A Level Mahemaics: Mechanics Coursebook Excerp More Informaion Chaper The moion of projeciles In his chaper he model of free moion

More information

Problem Set 5. Graduate Macro II, Spring 2017 The University of Notre Dame Professor Sims

Problem Set 5. Graduate Macro II, Spring 2017 The University of Notre Dame Professor Sims Problem Se 5 Graduae Macro II, Spring 2017 The Universiy of Nore Dame Professor Sims Insrucions: You may consul wih oher members of he class, bu please make sure o urn in your own work. Where applicable,

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC his documen was generaed a 1:4 PM, 9/1/13 Copyrigh 213 Richard. Woodward 4. End poins and ransversaliy condiions AGEC 637-213 F z d Recall from Lecure 3 ha a ypical opimal conrol problem is o maimize (,,

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

BU Macro BU Macro Fall 2008, Lecture 4

BU Macro BU Macro Fall 2008, Lecture 4 Dynamic Programming BU Macro 2008 Lecure 4 1 Ouline 1. Cerainy opimizaion problem used o illusrae: a. Resricions on exogenous variables b. Value funcion c. Policy funcion d. The Bellman equaion and an

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Rapid Termination Evaluation for Recursive Subdivision of Bezier Curves

Rapid Termination Evaluation for Recursive Subdivision of Bezier Curves Rapid Terminaion Evaluaion for Recursive Subdivision of Bezier Curves Thomas F. Hain School of Compuer and Informaion Sciences, Universiy of Souh Alabama, Mobile, AL, U.S.A. Absrac Bézier curve flaening

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015 Explaining Toal Facor Produciviy Ulrich Kohli Universiy of Geneva December 2015 Needed: A Theory of Toal Facor Produciviy Edward C. Presco (1998) 2 1. Inroducion Toal Facor Produciviy (TFP) has become

More information

Lecture 4 Notes (Little s Theorem)

Lecture 4 Notes (Little s Theorem) Lecure 4 Noes (Lile s Theorem) This lecure concerns one of he mos imporan (and simples) heorems in Queuing Theory, Lile s Theorem. More informaion can be found in he course book, Bersekas & Gallagher,

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3 and d = c b - b c c d = c b - b c c This process is coninued unil he nh row has been compleed. The complee array of coefficiens is riangular. Noe ha in developing he array an enire row may be divided or

More information

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter Sae-Space Models Iniializaion, Esimaion and Smoohing of he Kalman Filer Iniializaion of he Kalman Filer The Kalman filer shows how o updae pas predicors and he corresponding predicion error variances when

More information

Introduction to Numerical Analysis. In this lesson you will be taken through a pair of techniques that will be used to solve the equations of.

Introduction to Numerical Analysis. In this lesson you will be taken through a pair of techniques that will be used to solve the equations of. Inroducion o Nuerical Analysis oion In his lesson you will be aen hrough a pair of echniques ha will be used o solve he equaions of and v dx d a F d for siuaions in which F is well nown, and he iniial

More information

Lecture 10: The Poincaré Inequality in Euclidean space

Lecture 10: The Poincaré Inequality in Euclidean space Deparmens of Mahemaics Monana Sae Universiy Fall 215 Prof. Kevin Wildrick n inroducion o non-smooh analysis and geomery Lecure 1: The Poincaré Inequaliy in Euclidean space 1. Wha is he Poincaré inequaliy?

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

How to Deal with Structural Breaks in Practical Cointegration Analysis

How to Deal with Structural Breaks in Practical Cointegration Analysis How o Deal wih Srucural Breaks in Pracical Coinegraion Analysis Roselyne Joyeux * School of Economic and Financial Sudies Macquarie Universiy December 00 ABSTRACT In his noe we consider he reamen of srucural

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information