Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization

Size: px
Start display at page:

Download "Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization"

Transcription

1 Stochastc Prmal-Dual Coordate Method for Regularzed Emprcal Rsk Mmzato Yuche Zhag L Xao September 24 Abstract We cosder a geerc covex optmzato problem assocated wth regularzed emprcal rsk mmzato of lear predctors The problem structure allows us to reformulate t as a covexcocave saddle pot problem We propose a stochastc prmal-dual coordate ( method, whch alterates betwee maxmzg over a radomly chose dual varable ad mmzg over the prmal varable A extrapolato step o the prmal varable s performed to obta accelerated covergece rate We also develop a m-batch verso of the method whch facltates parallel computg, ad a exteso wth weghted samplg probabltes o the dual varables, whch has a better complexty tha uform samplg o uormalzed data Both theoretcally ad emprcally, we show that the method has comparable or better performace tha several state-of-the-art optmzato methods Itroducto We cosder a geerc covex optmzato problem that arses ofte mache learg: regularzed emprcal rsk mmzato (ERM of lear predctors More specfcally, let a,,a R d be the feature vectors of data samples, φ : R R be a covex loss fucto assocated wth the lear predcto a T x, for =,,, ad g : Rd R be a covex regularzato fucto for the predctor x R d Our goal s to solve the followg optmzato problem: { } mmze x R d P(x def = φ (a T x+g(x = ( Examples of the above formulato clude may well-kow classfcato ad regresso problems For bary classfcato, each feature vector a s assocated wth a label b {±} We obta the lear SVM (support vector mache by settg φ (z = max{, b z} (the hge loss ad g(x = (λ/2 x 2 2, where λ > s a regularzato parameter Regularzed logstc regresso s obtaed by settg φ (z = log( + exp( b z For lear regresso problems, each feature vector a s assocated wth a depedet varable b R, ad φ (z = (/2(z b 2 The we get rdge regresso wth g(x = (λ/2 x 2 2, ad the Lasso wth g(x = λ x Further backgrouds o regularzed ERM mache learg ad statstcs ca be foud, eg, the book [3] Departmet of Electrcal Egeerg ad Computer Scece, Uversty of Calfora, Berkekey, CA 9472, USA Emal: yuczhag@eecsberkeleyedu (Ths work was perfomed durg a tershp at Mcrosoft Research Mache Learg Groups, Mcrosoft Research, Redmod, WA 9853, USA Emal: lxao@mcrosoftcom

2 We are especally terested developg effcet algorthms for solvg problem ( whe the umber of samples s very large I ths case, evaluatg the full gradet or subgradet of the fucto P(x s very expesve, thus cremetal methods that operate o a sgle compoet fucto φ at each terato ca be very attractve There have bee extesve research o cremetal (subgradet methods (eg [4, 4, 2, 2, 3] as well as varats of the stochastc gradet method (eg, [46, 5,, 7, 43] Whle the computatoal cost per terato of these methods s oly a small fracto, say /, of that of the batch gradet methods, ther terato complextes are much hgher (t takes may more teratos for them to reach the same precso I order to better quatfy the complextes of varous algorthms ad posto our cotrbutos, we eed to make some cocrete assumptos ad troduce the oto of codto umber ad batch complexty Codto umber ad batch complexty Let γ ad λ be two postve real parameters We make the followg assumpto: Assumpto A Each φ s covex ad dfferetable, ad ts dervatve s (/γ-lpschtz cotuous (same as φ beg (/γ-smooth, e, φ (α φ (β (/γ α β, α,β R, =,, I addto, the regularzato fucto g s λ-strogly covex, e, g(y g(x+g (y T (x y+ λ 2 x y 2 2, g (y g(y, x,y R For example, the logstc loss φ (z = log( + exp( b z s (/4-smooth, the squared error φ (z = (/2(z b 2 s -smooth, ad the squared l 2 -orm g(x = (λ/2 x 2 2 s λ-strogly covex The hge loss φ (z = max{, b z} ad the l -regularzato g(x = λ x do ot satsfy Assumpto A Nevertheless, we ca treat them usg smoothg ad strogly covex perturbatos, respectvely, so that our algorthm ad theoretcal framework stll apply (see Secto 3 Uder Assumpto A, the gradet of each compoet fucto, φ (a T x, s also Lpschtz cotuous, wth Lpschtz costat L = a 2 2 /γ R2 /γ, where R = max a 2 I other words, each φ (a T x s (R2 /γ-smooth We defe a codto umber κ = R 2 /(λγ, ad focus o ll-codtoed problems where κ I the statstcal learg cotext, the regularzato parameter λ s usually o the order of / or / (eg, [6], thus κ s o the order of or It ca be eve larger f the strog covexty g s added purely for umercal regularzato purposes (see Secto 3 We ote that the actual codtog of problem ( may be better tha κ, f the emprcal loss fucto (/ = φ (a T x by tself s strogly covex I those cases, our complexty estmates terms of κ ca be loose (upper bouds, but they are stll useful comparg dfferet algorthms for solvg the same gve problem Let P be the optmal value of problem (, e, P = m x R d P(x I order to fd a approxmate soluto ˆx satsfyg P(ˆx P ǫ, the classcal full gradet method ad ts proxmal varats requre O(( + κ log(/ǫ teratos (eg, [24, 26] Accelerated full gradet ( methods [24, 4,, 26] eoy the mproved terato complexty O((+ κlog(/ǫ However, each terato of these batch methods requres a full pass over the dataset, computg the gradet For the aalyss of full gradet methods, we should use (R 2 /γ + λ/λ = + κ as the codto umber of problem (; see [26, Secto 5] Here we used the upper boud +κ < + κ for easy comparso Whe κ, the addtve costat ca be dropped 2

3 of each compoet fucto ad formg ther average, whch cost O(d operatos (assumg the features vectors a R d are dese I cotrast, the stochastc gradet method ad ts proxmal varats operate o oe sgle compoet φ (a T x (chose radomly at each terato, whch oly costs O(d But ther terato complextes are far worse Uder Assumpto A, t takes them O(κ/ǫ teratos to fd a ˆx such that E[P(ˆx P ] ǫ, where the expectato s wth respect to the radom choces made at all the teratos (see, eg, [3, 23,, 7, 43] To make far comparsos wth batch methods, we measure the complexty of stochastc or cremetal gradet methods terms of the umber of equvalet passes over the dataset requred to reach a expected precso ǫ We call ths measure the batch complexty, whch are usually obtaed by dvdg ther terato complextes by For example, the batch complexty of the stochastc gradet method s O(κ/(ǫ The batch complextes of full gradet methods are the same as ther terato complextes By carefully explotg the fte average structure ( ad other smlar problems, several recet work [32, 36, 6, 44] proposed ew varats of the stochastc gradet or dual coordate ascet methods ad obtaed the terato complexty O(( + κ log(/ǫ Sce ther computatoal cost per terato s O(d, the equvalet batch complexty s O(( + κ/ log(/ǫ Ths complexty has much weaker depedece o tha the full gradet methods, ad also much weaker depedece o ǫ tha the stochastc gradet methods I ths paper, we preset a ew algorthm that has the batch complexty O ( (+ κ/log(/ǫ, (2 whch s more effcet whe κ > 2 Outle of the paper Our approach s based o reformulatg problem ( as a covex-cocave saddle pot problem, ad the devsg a prmal-dual algorthm to approxmate the saddle pot More specfcally, we replace each compoet fucto φ (a T x through covex cougato, e, φ (a T x = sup {y a,x φ (y }, y R where φ (y = sup α R {αy φ (α}, ad a,x deotes the er product of a ad x (whch s the same as a T x, but s more coveet for later presetato Ths leads to a covex-cocave saddle pot problem { m max f(x,y def = ( y a,x φ x R d y R (y } +g(x (3 = Uder Assumpto A, each φ s γ-strogly covex (sce φ s (/γ-smooth; see, eg, [4, Theorem 422] ad g s λ-strogly covex As a cosequece, the saddle pot problem (3 has a uque soluto, whch we deote by (x,y I Secto 2, we propose a stochastc prmal-dual coordate ( method, whch alterates betwee maxmzg f over a radomly chose dual coordate y ad mmzg f over the prmal varable x We also apply a extrapolato step to the prmal varable x to accelerate the covergece The method has terato complexty O(( + κlog(/ǫ Sce each terato of oly operates o a sgle dual coordate y, ts batch complexty s gve by (2 We also preset a m-batch algorthm whch s well suted for dstrbuted computg 3

4 Algorthm : The method Iput: parameters τ,σ,θ R +, umber of teratos T, ad tal pots x ( ad y ( Italze: x ( = x (, u ( = (/ = y( a for t =,,2,,T do Pck a dex k {,2,,} uformly at radom, ad execute the followg updates: { { } y (t+ argmaxβ R β a,x (t φ = (β 2σ (β y(t 2 f = k, (4 y (t f k, { } x (t+ = arg m g(x+ u (t +(y (t+ x R d k k a k, x + x x(t 2 2, (5 2τ u (t+ = u (t + (y(t+ k k a k, (6 x (t+ = x (t+ +θ(x (t+ x (t (7 ed Output: x (T ad y (T I Secto 3, we preset two extesos of the method We frst expla how to solve problem ( whe Assumpto A does ot hold The dea s to apply small regularzatos to the saddle pot fucto so that ca stll be appled, whch results accelerated sublear rates The secod exteso s a method wth o-uform samplg The batch complexty of ths algorthm has the same form as (2, but κ s defed as κ = R/(λγ, where R = = a, whch ca be much smaller tha R = max a f there s cosderable varato the orms a I Secto 4, we dscuss related work I partcular, the method ca be vewed as a coordate-update exteso of the batch prmal-dual algorthm developed by Chambolle ad Pock[8] Wealsodscusstwoveryrecetwork [34,8]whchachevethesamebatch complexty (2 I Secto 5, we dscuss effcet mplemetato of the method whe the feature vectors a are sparse We focus o two popular cases: whe g s a squared l 2 -orm pealty ad whe g s a l +l 2 pealty We show that the computatoal cost per terato of oly depeds o the umber of o-zero elemets the feature vectors I Secto 6, we preset expermet results comparg wth several state-of-the-art optmzato methods, cludg two effcet batch methods ( [24] ad L-BFGS [27, Secto 72], the stochastc average gradet ( method [32, 33], ad the stochastc dual coordate ascet ( method [36] O all scearos we tested, has comparable or better performace 2 The method I ths secto, we descrbe ad aalyze the Stochastc Prmal-Dual Coordate ( method The basc dea of s qute smple: to approach the saddle pot of f(x,y defed (3, we alteratvely maxmze f wth respect to y, ad mmze f wth respect to x Sce the dual vector y has coordates ad each coordate s assocated wth a feature vector a R d, maxmzg f wth respect to y takes O(d computato, whch ca be very expesve f s large We reduce the computatoal cost by radomly pckg a sgle coordate of y at a tme, ad 4

5 Algorthm 2: The M-Batch method Iput: m-batch sze m, parameters τ,σ,θ R +, umber of teratos T, ad x ( ad y ( Italze: x ( = x (, u ( = (/ = y( a for t =,,2,,T do Radomly pck a subset of dces K {,2,,} of sze m, such that the probablty of each dex beg pcked s equal to m/ Execute the followg updates: { { } y (t+ argmaxβ R β a,x (t φ = (β 2σ (β y(t 2 f K, (8 y (t x (t+ = arg m x R d u (t+ = u (t + f / K, { g(x+ u (t + (y (t+ m k k a k, x k K (y (t+ k k a k, x (t+ = x (t+ +θ(x (t+ x (t ed Output: x (T ad y (T k K + x x(t 2 2 2τ }, (9 maxmzg f oly wth respect to ths coordate Cosequetly, the computatoal cost of each terato s O(d WegvethedetalsofthemethodAlgorthm Thedualcoordateupdateadprmal vector update are gve equatos (4 ad (5 respectvely Istead of maxmzg f over y k ad mmzg f over x drectly, we add two quadratc regularzato terms to pealze y (t+ k ad x (t+ from devatg from y (t k ad x (t The parameters σ ad τ cotrol ther regularzato stregth, whch we wll specfy the covergece aalyss (Theorem Moreover, we troduce two auxlary varables u (t ad x (t From the talzato u ( = (/ = y( a ad the update rules (4 ad (6, we have u (t = = y (t a, t =,,T Equato (7 obtas x (t+ based o extrapolato from x (t ad x (t+ Ths step s smlar to Nesterov s accelerato techque [24, Secto 22], ad yelds faster covergece rate Before presetg the theoretcal results, we troduce a M-Batch method Algorthm 2, whch s a atural exteso of Algorthm The dfferece betwee these two algorthms s that, the M-Batch method may smultaeously select more tha oe dual coordates to update Let m be the m-batch sze Durg each terato, the M-Batch method radomly pcks a subset of dces K {,,} of sze m, such that the probablty of each dex beg pcked s equal to m/ The followg s a smple procedure to acheve ths Frst, partto the set of dces to m dsot subsets, so that the cardalty of each subset s equal to /m (assumg m dvdes The, durg each terato, radomly select a sgle dex from each subset ad add t to K Other approaches for m-batch selecto are also possble Wth a sgle processor, each terato of Algorthm 2 takes O(md tme to accomplsh Sce 5

6 the updates of each coordate y k are depedet of each other, we ca use parallel computg to accelerate the M-Batch method Cocretely, we ca use m processors to update the m coordates the subset K parallel, the aggregate them to update x (t+ Such a procedure ca be acheved by a sgle roud of commucato, for example, usg the Allreduce operato MPI [2] or MapReduce [] If we gore the commucato delay, the each terato takes O(d tme, whch s the same as rug oe terato of the basc algorthm Not surprsgly, we wll show that the M-Batch algorthm coverges faster tha terms of the terato complexty, because t processes multple dual coordates a sgle terato 2 Covergece aalyss Sce the basc algorthm s a specal case of M-Batch wth m =, we oly preset a covergece theorem for the m-batch verso Theorem Assume that each φ s (/γ-smooth ad g s λ-strogly covex (Assumpto A Let R = max{ a 2 : =,,} If the parameters τ,σ ad θ Algorthm 2 are chose such that τ = mγ 2R λ, σ = λ 2R mγ, θ = (/m+r (/m/(λγ, ( the for each t, the M-Batch algorthm acheves ( 2τ +λ E [ x (t x 2 ] ( [ E y (t 2 + 4σ +γ y 2 2] m ( ( ( y θ t 2τ +λ x ( x 2 ( 2 + 2σ +γ y 2 2 m The proof of Theorem s gve Appedx A The followg corollary establshes the expected terato complexty of M-Batch for obtag a ǫ-accurate soluto Corollary Suppose Assumpto A holds ad the parameters τ, σ ad θ are set as ( I order for Algorthm 2 to obta t suffces to have the umber of teratos T satsfy ( T m +R mλγ where C = E[ x (T x 2 2] ǫ, E[ y (T y 2 2] ǫ, ( log ( C, ǫ ( /(2τ+λ x ( x 2 2 +( /(2σ+γ y ( y 2 2 /m m { /(2τ+λ, (/(4σ+γ/m } Proof By Theorem, we have E[ x (T x 2 2 ] θt C ad E[ y (T y 2 2 ] θt C To obta (, t suffces to esure that θ T C ǫ, whch s equvalet to T log(c/ǫ log(θ = log(c/ǫ ( ( log (/m+r (/m/(λγ Applyg the equalty log( x x to the deomator above completes the proof 6

7 Recallthedeftoofthecodtoumberκ = R 2 /(λγsecto Corollaryestablshes that the terato complexty of the M-Batch method for achevg ( s ( ((/m+ O κ(/m log(/ǫ So a larger batch sze m leads to less umber of teratos I the extreme case of = m, we obta a full batch algorthm, whch has terato or batch complexty O((+ κlog(/ǫ Ths complexty s also shared by the methods [24, 26] (see Secto, as well as the batch prmal-dual algorthm of Chambolle ad Pock [8] (see dscussos o related work Secto 4 Sce a equvalet pass over the dataset correspods to /m teratos, the batch complexty (the umber of equvalet passes over the data of M-Batch s ( (+ O κ(m/ log(/ǫ The above expresso mples that a smaller batch sze m leads to less umber of passes through the data I ths sese, the basc method wth m = s the most effcet oe However, f we prefer the least amout of wall-clock tme, the the best choce s to choose a m-batch sze m that matches the umber of parallel processors avalable 22 Covergece of prmal obectve I the prevous subsecto, we establshed terato complexty of the M-Batch method terms of approxmatg the saddle pot of the mmax problem (3, more specfcally, to meet the requremet ( Next we show that t has the same order of complexty reducg the prmal obectve gap P(x (T P(x But we eed a extra assumpto Assumpto B There exst costats G ad H such that for ay x R d, g(x g(x G x x 2 + H 2 x x 2 2 We ote that Assumpto B s weaker tha ether G-Lpschtz cotuty or H-smoothess It s satsfed by the l orm, the squared l 2 -orm, ad mxed l +l 2 regularzatos Corollary 2 Suppose both Assumptos A ad B hold, ad the parameters τ, σ ad θ are set as ( To guaratee E[P(x (T P(x ] ǫ, t suffces to ru Algorthm 2 for T teratos, wth ( ( C(4G 2 T m +R +H +/γ log mλγ ǫ 2, where C = x ( x ( /(2σ+γ y ( y 2 2 /(2τ+λ m Proof Usg the (/γ-smoothess of P g ad Assumpto B, we have P(x (T P(x (P g (x,x (T x + 2γ x(t x 2 2 +g(x (T g(x ( (P g (x 2 +G x (T x 2 + H +/γ x (T x

8 Sce x mmzes P, we have (P g (x g(x Hece, Assumpto B mples that (P g (x 2 G Substtutg ths relato to the above equalty, ad usg Hölder s equalty, we have E[P(x (T P(x ] 2G ( E[ x (T x 2 2] /2 H +/γ + E[ x (T x 2 2 2] To make E[P(x (T P(x ] ǫ, t suffces to let the rght-had sde of the above equalty bouded by ǫ Sce ǫ, ths s guarateed by E[ x (T x 2 2] ǫ 2 4G 2 +H +/γ (2 By Theorem, we have E[ x (T x 2 2 ] θt C To secure equalty (2, t s suffcet to make θ T ǫ 2, whch s equvalet to C(4G 2 +H+/γ T log(c(4g2 +H +/γ/ǫ 2 log(θ = log(c(4g 2 +H +/γ/ǫ 2 ( ( log (/m+r (/m/(λγ Applyg log( x x to the deomator above completes the proof 3 Extesos of I ths secto, we derve two extesos of the method The frst oe hadles problems for whch Assumpto A does ot hold The secod oe employs a o-uform samplg scheme to mprove the terato complexty whe the feature vectors a are uormalzed 3 No-smooth or o-strogly covex fuctos The complexty bouds establshed Secto 2 requre each φ to be γ-strogly covex, whch correspodstothecodtothatthefrstdervatveofφ s(/γ-lpschtzcotuous Iaddto, thefuctog eedstobeλ-stroglycovex Forgeerallossfuctoswhereetherorbothofthese codtos fal (eg, the hge loss ad l -regularzato, we ca slghtly perturb the saddle-pot fucto f(x,y so that the method ca stll be appled For smplcty, here we cosder the case where ether φ s smooth or g s strogly covex Formally, we assume that each φ ad g are covex ad Lpschtz cotuous, ad f(x,y has a saddle pot (x,y We choose a scalar δ > ad cosder the modfed saddle-pot fucto: f δ (x,y def = = ( ( y a,x φ (y + δy2 +g(x+ δ 2 2 x 2 2 (3 Deote by (x δ,y δ the saddle-pot of f δ We employ the M-Batch method (Algorthm 2 to approxmate (x δ,y δ, treatg φ + δ 2 ( 2 as φ ad g + δ as g, whch ow are all δ-strogly covex We ote that addg strogly covex perturbato o φ s equvalet to smoothg φ, whch becomes (/δ-smooth Lettg γ = λ = δ, the parameters τ, σ ad θ ( become τ = 2R m, σ = 2R m, ad θ = ( m + R δ m 8

9 Although (x δ,y δ s ot exactly the saddle pot of f, the followg corollary shows that applyg the method to the perturbed fucto f δ effectvely mmzes the orgal loss fucto P Corollary 3 Assume that each φ s covex ad G φ -Lpschtz cotuous, ad g s covex ad G g -Lpschtz cotuous Defe two costats: ( /(2σ+δ y C = ( x 2 2 +G 2 φ, C 2 = (G φ R+G g ( x 2 ( x ( y δ 2 δ /(2τ+δ m If we choose δ ǫ/c, ad ru the M-Batch algorthm for T teratos where ( T m + R ( 4C2 log δ m ǫ 2, the E[P(x (T P(x ] ǫ Proof Let ỹ = argmax y f(x δ,y be a shorthad otato We have P(x δ ( = f(x ( δ,ỹ f δ (x δ,ỹ+ δ ỹ (v f(x,yδ + δ x δ ỹ ( f δ (x δ,y δ + δ ỹ (v f(x,y + δ x δ ỹ (v f δ (x,y δ + δ ỹ (v = P(x + δ x δ ỹ Here, equatos ( ad (v use the defto of the fucto f, equaltes ( ad (v use the defto of the fucto f δ, equaltes ( ad (v use the fact that (x δ,y δ s the saddle pot of f δ, ad equalty (v s due to the fact that (x,y s the saddle pot of f Sce φ s G φ -Lpschtz cotuous, the doma of φ s the terval [ G φ,g φ ], whch mples ỹ 2 2 G2 φ (see, eg, [34, Lemma ] Thus, we have P(x δ P(x δ 2 ( x 2 2 +G 2 φ = δ 2 C (4 O the other had, sce P s (G φ R+G g -Lpschtz cotuous, Theorem mples E[P(x (T P(x δ ] (G φr+g g E[ x (T x δ 2] ( C 2 ( m + R T/2 (5 δ m Combg equalty (4 ad equalty (5, to guaratee E[P(x (T P(x ] ǫ, t suffces to have C δ ǫ ad ( ( C2 m + R T/2 ǫ δ m 2 (6 The corollary s establshed by fdg the smallest T that satsfes equalty (6 There are two other cases that ca be cosdered: whe φ s ot smooth but g s strogly covex, ad whe φ s smooth but g s ot strogly covex They ca be hadled wth the same techque descrbed above, ad we omt the detals here (Alteratvely, t s possble to use the techques descrbed [8, Secto 5] to obta accelerated sublear covergece rates wthout usg strogly covex perturbatos I Table, we lst the complextes of the M-Batch method for fdg a ǫ-optmal soluto of problem ( uder varous assumptos Smlar results are also obtaed [34] 9

10 φ g terato complexty Õ( (/γ-smooth λ-strogly covex /m+ (/m/(λγ (/γ-smooth o-strogly covex /m+ (/m/(ǫγ o-smooth λ-strogly covex /m+ (/m/(ǫλ o-smooth o-strogly covex /m+ /m/ǫ Table: Iteratocomplextesofthemethoduderdfferetassumptosothefuctosφ ad g For the last three cases, we solve the perturbed saddle-pot problem wth δ = ǫ/c 32 wth o-uform samplg Oe potetal drawback of the algorthm s that, ts covergece rate depeds o a problemspecfc costat R, whch s the largest l 2 -orm of the feature vectors a As a cosequece, the algorthm may perform badly o uormalzed data, especally f the l 2 -orms of some feature vectors are substatally larger tha others I ths secto, we propose a exteso of the method to mtgate ths problem, whch s gve Algorthm 3 The basc dea s to use o-uform samplg pckg the dual coordate to update at each terato I Algorthm 3, we pck coordate k wth the probablty p k = 2 + a k 2 2 = a, k =,, (7 2 Therefore, staces wth large feature orms are sampled more frequetly Smultaeously, we adopt a adaptve regularzato step (8, mposg stroger regularzato o such staces I addto, we adust the weght of a k (9 for updatg the prmal varable As a cosequece, the covergece rate of Algorthm 3 depeds o the average orm of feature vectors Ths s summarzed by the followg theorem Theorem 2 Suppose Assumpto A holds Let R = = a 2 If the parameters τ,σ,θ Algorthm 3 are chose such that τ = γ 4 R λ, σ = λ 4 R γ, θ = 2+2 R /(λγ, the for each t, we have ( 2τ +λ E [ x (t x 2 ] ( θ t ( ( 2τ +λ x ( x σ + 2γ E [ y (t y 2 ] 2 ( 2σ +2γ y ( y 2 2 Comparg the costat θ Theorem 2 to that of Theorem, we ca fd two dffereces Frst, there s a addtoal factor of 2 multpled to the deomator R /(λγ, makg the value of θ larger Secod, the costat R here s determed by the average orm of features, stead of the largest oe, whch makes the value of θ smaller The secod dfferece makes the algorthm more robust to uormalzed feature vectors For example, f the a s are sampled d

11 Algorthm 3: method wth weghted samplg Iput: parameters τ,σ,θ R +, umber of teratos T, ad tal pots x ( ad y ( Italze: x ( = x (, u ( = (/ = y( a for t =,,2,,T do Radomly pck k {,2,,}, wth probablty p k = 2 + a k 2 2 = a 2 Execute the followg updates: { { } y (t+ argmaxβ R β a,x (t φ = (β p 2σ (β y(t 2 = k, (8 y (t k, { x (t+ = arg m g(x+ u (t + } x R d p k (y(t+ k k a k, x + x x(t 2 2, (9 2τ u (t+ = u (t + (y(t+ k k a k, x (t+ = x (t+ +θ(x (t+ x (t ed Output: x (T ad y (T from a multvarate ormal dstrbuto, the max { a 2 } almost surely goes to fty as, but the average orm = a 2 coverges to E[ a 2 ] For smplcty of presetato, we descrbed Algorthm 3 a weghted samplg method wth sgle dual coordate update, e, the case of m = It s ot hard to see that the ouform samplg scheme ca also be exteded to M-Batch wth m > Moreover, the o-uform samplg scheme ca also be appled to solve problems wth o-smooth φ or ostrogly covex g, leadg to smlar coclusos as Corollary 3 Here, we omt the techcal detals 4 Related Work Chambolle ad Pock [8] cosdered a class of covex optmzato problems wth the followg saddle-pot structure: m max { Kx,y +G(x F (y }, (2 x R d y R where K R m d, G ad F are proper closed covex fuctos, wth F tself beg the cougate of a covex fucto F They developed the followg frst-order prmal-dual algorthm: { y (t+ = argmax Kx (t,y F (y } y R 2σ y y(t 2 2, (2 { x (t+ = arg m K T y (t+,x +G(x+ } x R d 2τ x x(t 2 2, (22 x (t+ = x (t+ +θ(x (t+ x (t (23 Whe both F ad G are strogly covex ad the parameters τ, σ ad θ are chose approprately, ths algorthm obtas accelerated lear covergece rate [8, Theorem 3]

12 algorthm τ σ θ batch complexty ( Chambolle-Pock [8] γ λ A 2 λ A 2 γ + A 2 /(2 + A 2 λγ 2 log(/ǫ λγ ( wth m = γ λ 2R λ 2R γ +R/ + R λγ λγ log(/ǫ ( wth m = γ λ 2R λ 2R γ + R +R /λγ λγ log(/ǫ Table 2: Comparg wth Chambolle ad Pock [8, Algorthm 3, Theorem 3] We ca map the saddle-pot problem (3 to the form of (2 by lettg A = [a,,a ] T ad K = A, G(x = g(x, F (y = φ (y (24 The method developed ths paper ca be vewed as a exteso of the batch method (2-(23, where the dual update step (2 s replaced by a sgle coordate update (4 or a mbatch update (8 However, order to obta accelerated covergece rate, more subtle chages are ecessary the prmal update step More specfcally, we troduced the auxlary varable = y(t a = K T y (t, ad replaced the prmal update step (22 by (5 ad (9 The prmal u (t = extrapolato step (23 stays the same To compare the batch complexty of wth that of (2-(23, we use the followg facts mpled by Assumpto A ad the relatos (24: K 2 = A 2, G(x s λ-strogly covex, ad F (y s (γ/-strogly covex Based o these codtos, we lst Table 2 the equvalet parameters used [8, Algorthm 3] ad the batch complexty obtaed [8, Theorem 3], ad compare them wth The batch complexty of the Chambolle-Pock algorthm s Õ( + A 2/(2 λγ, where the Õ( otato hdes the log(/ǫ factor We ca boud the spectral orm A 2 by the Frobeus orm A F ad obta A 2 A F max{ a 2 } = R (Note that the secod equalty above would be a equalty f the colums of A are ormalzed So the worst case, the batch complexty of the Chambolle-Pock algorthm becomes ( Õ +R/ λγ = Õ( + κ, where κ = R 2 /(λγ, whch matches the worst-case complexty of the methods [24, 26] (see Secto ad also the dscussos [8, Secto 5] Ths s also of the same order as the complexty of wth m = (see Secto 2 Whe the codto umber κ, they ca be worse tha the batch complexty of wth m =, whch s Õ(+ κ/ If ether G(x or F (y (2 s ot strogly covex, Chambolle ad Pock proposed varats of the prmal-dual batch algorthm to acheve accelerated sublear covergece rates [8, Secto 5] It s also possble to exted them to coordate update methods for solvg problem ( whe ether φ or g s ot strogly covex Ther complextes would be smlar to those Table 2 =

13 4 Dual coordate ascet methods We ca also solve the prmal problem ( va ts dual: { maxmze y R D(y def = = φ (y g ( } y a, (25 where g (u = sup x R d{x T u g(x} s the cougate fucto of g Here aga, coordate ascet methods (eg, [29, 9, 5, 36] ca be more effcet tha full gradet methods I the stochastc dual coordate ascet ( method [36], a dual coordate y s pcked at radom durg each terato ad updated to crease the dual obectve value Shalev-Shwartz ad Zhag [36] showed that the terato complexty of s O(( + κ log(/ǫ, whch correspods to the batch complexty Õ(+κ/ Therefore, the method, whch has batch complexty Õ(+ κ/, ca be much better whe κ >, e, for ll-codtoed problems For more geeral covex optmzato problems, there s a vast lterature o coordate descet methods I partcular, Nesterov s work o radomzed coordate descet [25] sparked a lot of recet actvtes o ths topc Rchtárk ad Takáč [3] exteded the algorthm ad aalyss to composte covex optmzato Whe appled to the dual problem (25, t becomes oe varat of studed [36] M-batch ad dstrbuted versos of have bee proposed ad aalyzed [39] ad [45] respectvely No-uform samplg schemes smlar to the oe used Algorthm 3 have bee studed for both stochastc gradet ad methods (eg, [22, 44, 48] Shalev-Shwartz ad Zhag [35] proposed a accelerated m-batch method whch corporates addtoal prmal updates tha, ad bears some smlarty to our M-Batch method They showed that ts complexty terpolates betwee that of ad by varyg the m-batch sze m I partcular, for m =, t matches that of the methods (as does But for m =, the complexty of ther method s the same as, whch s worse tha for ll-codtoed problems I addto, Shalev-Shwartz ad Zhag [34] developed a accelerated proxmal method whch acheves the same batch complexty Õ( + κ/ as Ther method s a er-outer terato procedure, where the outer loop s a full-dmesoal accelerated gradet method the prmal space x R d At each terato of the outer loop, the method [36] s called to solve the dual problem (25 wth customzed regularzato parameter ad precso I cotrast, s a straghtforward sgle-loop coordate optmzato methods More recetly, L et al [8] developed a accelerated proxmal coordate gradet (APCG method for solvg a more geeral class of composte covex optmzato problems Whe appled to the dual problem (25, APCG eoys the same batch complexty Õ( + κ/ as of However, t eeds a extra prmal proxmal-gradet step to have theoretcal guaratees o the covergece of prmal-dual gap [8, Secto 5] The computatoal cost of ths addtoal step s equvalet to oe pass of the dataset, thus t does ot affect the overall complexty 42 Other related work Aother way to approach problem ( s to reformulate t as a costraed optmzato problem mmze φ (z +g(x (26 = subect to a T x = z,, =,,, = 3

14 ad solve t by ADMM type of operator-splttg methods (eg, [9] I fact, as show [8], the batch prmal-dual algorthm (2-(23 s equvalet to a pre-codtoed ADMM (or exact Uzawa method; see, eg, [47] Several authors [42, 28, 37, 49] have cosdered a more geeral formulato tha (26, where each φ s a fucto of the whole vector z R They proposed ole or stochastc versos of ADMM whch operate o oly oe φ each terato, ad obtaed sublear covergece rates However, ther cost per terato s O(d stead of O(d Suzuk[38] cosdered a problem smlar to(, but wth more complex regularzato fucto g, meag that g does ot have a smple proxmal mappg Thus prmal updates such as step (5 or(9adsmlarstepscaotbecomputedeffcetly Heproposedaalgorthm that combes [36] ad ADMM (eg, [7], ad showed that t has lear rate of covergece uder smlar codtos as Assumpto A It would be terestg to see f the method ca be exteded to ther settg to obta accelerated lear covergece rate 5 Effcet Implemetato wth Sparse Data Durg each terato of the methods, the updates of prmal varables (e, computg x (t+ requre full d-dmesoal vector operatos; see the step (5 of Algorthm, the step (9 of Algorthm 2 ad the step (9 of Algorthm 3 So the computatoal cost per terato s O(d, ad ths ca be too expesve f the dmeso d s very hgh I ths secto, we show how to explot problem structure to avod hgh-dmesoal vector operatos whe the feature vectors a are sparse We llustrate the effcet mplemetato for two popular cases: whe g s a squared-l 2 pealty ad whe g s a l +l 2 pealty For both cases, we show that the computato cost per terato oly depeds o the umber of o-zero compoets of the feature vector 5 Squared l 2 -orm pealty Suppose that g(x = λ 2 x 2 2 For ths case, the updates for each coordate of x are depedet of each other More specfcally, x (t+ ca be computed coordate-wse closed form: where u deotes (y (t+ k (y (t+ k x (t+ = +λτ (x(t τu (t τ u, =,,, (27 k a k Algorthm, or m k K (y(t+ k k a k Algorthm 2, or k a k/(p k Algorthm 3, ad u represets the -th coordate of u Although the dmeso d ca be very large, we assume that each feature vector a k s sparse We deote by J (t the set of o-zero coordates at terato t, that s, f for some dex k K pcked at terato t we have a k, the J (t If / J (t, the the algorthm (ad ts varats updates y (t+ wthout usg the value of x (t or x (t Ths ca be see from the updates (4, (8 ad (8, where the value of the er product a k,x (t does ot deped o the value of x (t As a cosequece, we ca delay the updates o x ad x wheever / J (t wthout affectg the updates o y (t, ad process all the mssg updates at the ext tme whe J (t Such a delayed update ca be carred out very effcetly We assume that t s the last tme whe J (t, ad t s the curret terato where we wat to update x ad x Sce / J (t mples u =, we have x t+ = +λτ (x(t τu (t, t = t +,t +2,,t (28 4

15 Notce that u (t s updated oly at teratos where J (t The value of u (t does t chage durg teratos [t +,t ], so we have u (t u (t + for t [t +,t ] Substtutg ths equato to the recursve formula (28, we obta x (t = (+λτ t t ( x (t + + u(t+ λ u(t + (29 λ The update (29 takes O( tme to compute Usg the same formula, we ca compute x (t ad subsequetly compute x (t = x (t +θ(x (t x (t Thus, the computatoal complexty of a sgle terato s proportoal to J (t, depedet of the dmeso d 52 (l +l 2 -orm pealty Suppose that g(x = λ x + λ 2 2 x 2 2 Sce both the l -orm ad the squared l 2 -orm are decomposable, the updates for each coordate of x (t+ are depedet More specfcally, { } x (t+ = argm λ α + λ 2α 2 +(u (t + u α+ (α x(t 2, (3 α R 2 2τ where u follows the defto Secto 5 If / J (t, the u = ad equato (3 ca be smplfed as x (t+ = +λ 2 τ (x(t τu (t τλ f x (t τu (t > τλ, +λ 2 τ (x(t τu (t +τλ f x (t τu (t < τλ, otherwse Smlar to the approach of Secto 5, we delay the update of x utl J (t We assume t to be the last terato whe J (t, ad let t be the curret terato whe we wat to update x Durg teratos [t +,t ], the value of u (t does t chage, so we have u (t u (t + for t [t +,t ] Usg equato (3 ad the varace of u (t for t [t +,t ], we have a O( tme algorthm to calculate x (t, whch we detal Appedx C The vector x (t ca be updated by the same algorthm sce t s a lear combato of x (t ad x (t As a cosequece, the computatoal complexty of each terato s proportoal to J (t, depedet of the dmeso d 6 Expermets I ths secto, we compare the basc method (Algorthm wth several state-of-the-art optmzato algorthms for solvg problem ( They clude two batch-update algorthms: the accelerated full gradet (FAG method [24, Secto 22], ad the lmted-memory quas-newto method L-BFGS [27, Secto 72] For the method, we adopt a adaptve le search scheme (eg, [26] to mprove ts effcecy For the L-BFGS method, we use the memory sze 3 as suggested by [27] We also compare wth two stochastc algorthms: the stochastc average gradet ( method [32, 33], ad the stochastc dual coordate descet ( method [36] We coduct expermets o a sythetc dataset ad three real datasets (3 5

16 (a λ = (b λ = (c λ = (d λ = Fgure : Comparg wth other methods o sythetc data, wth the regularzato coeffcet λ { 3,,, } The horzotal axs s the umber of passes through the etre dataset, ad the vertcal axs s the logarthmc gap log(p(x (T P(x 6 Rdge regresso wth sythetc data We frst compare wth other algorthms o a smple quadratc problem usg sythetc data We geerate = 5 d trag examples {a,b } = accordg to the model b = a,x +ε, a N(,Σ, ε N(,, where a R d ad d = 5, ad x s the all-oes vector To make the problem ll-codtoed, the covarace matrx Σ s set to be dagoal wth Σ =, for =,,d Gve the set of examples {a,b } =, we the solved a stadard rdge regresso problem { mmze x R d P(x def = = } 2 (at x b 2 + λ 2 x 2 2 I the form of problem (, we have φ (z = z 2 /2 ad g(x = (/2 x 2 2 As a cosequece, the dervatve of φ s -Lpschtz cotuous ad g s λ-strogly covex 6

17 Dataset ame umber of samples umber of features d sparsty Covtype 58, % RCV 2,242 47,236 6% News2 9,996,355,9 4% Table 3: Characterstcs of three real datasets obtaed from LIBSVM data [2] We evaluate the algorthms by the logarthmc optmalty gap log(p(x (t P(x, where x (t s the output of the algorthms after t passes over the etre dataset, ad x s the global mmum Whe the regularzato coeffcet s relatvely large, eg, λ = or, the problem s wellcodtoed ad we observe fast covergece of the stochastc algorthms, ad, whch are substatally faster tha the two batch methods ad L-BFGS Fgure shows the covergece of the fve dfferet algorthms whe we vared λ from 3 to As the plot shows, whe the codto umber s greater tha, the algorthm also coverges substatally faster tha the other two stochastc methods ad It s also otably faster tha L-BFGS These results support our theory that eoys a faster covergece rate o ll-codtoed problems I terms of ther batch complextes, s up to tmes faster tha, ad (λ /2 tmes faster tha ad 62 Bary classfcato wth real data Fally we show the results of solvg the bary classfcato problem o three real datasets The datasets are obtaed from LIBSVM data [2] ad summarzed Table 3 The three datasets are selected to reflect dfferet relatos betwee the sample sze ad the feature dmesoalty d, whch cover d (Covtype, d (RCV ad d (News2 For all tasks, the data pots take the form of (a,b, where a R d s the feature vector, ad b {,} s the bary class label Our goal s to mmze the regularzed emprcal rsk: P(x = φ (a T x+ λ 2 x 2 2 where φ (z = = f b z 2 b z f b z 2 ( b z 2 otherwse Here, φ s the smoothed hge loss (see, eg, [36] It s easy to verfy that the cougate fucto of φ s φ (β = b β + 2 β2 for b β [,] ad otherwse The performace of the fve algorthms are plotted Fgure 2 ad Fgure 3 I Fgure 2, we compare wth the two batch methods: ad L-BFGS The results show that s substatally faster tha ad L-BFGS for relatvely large λ, llustratg the advatage of stochastc methods over batch methods o well-codtoed problems As λ decreases to, the batch methods (especally L-BFGS become comparable to I Fgure 3, we compare wth the two stochastc methods: ad Here, the observatos are ust the opposte to that of Fgure 2 The three stochastc algorthms have comparable performace o relatvely large λ, but becomes substatally faster whe λ gets closer to zero Summarzg Fgure 2 ad Fgure 3, the performace of are always comparable or better tha the other methods comparso 7

18 λ RCV Covtype News Fgure 2: Comparg wth ad L-BFGS o three real datasets wth smoothed hge loss The horzotal axs s the umber of passes through the etre dataset, ad the vertcal axs s the logarthmc optmalty gap log(p(x (t P(x The algorthm s faster tha the two batch methods whe λ s relatvely large 8

19 λ RCV Covtype News Fgure 3: Comparg wth ad o three real datasets wth smoothed hge loss The horzotal axs s the umber of passes through the etre dataset, ad the vertcal axs s the logarthmc optmalty gap log(p(x (T P(x The algorthm s faster tha the other two stochastc methods whe λ s small 9

20 A Proof of Theorem We focus o characterzg the values of x ad y after the t-th update Algorthm 2 For ay {,,}, let ỹ be the value of y (t+ f K, e, ỹ = argmax {y a,x (t φ (β (y y(t y R 2σ Sce φ s (/γ-smooth by assumpto, ts cougate φ s γ-strogly covex (eg, [4, Theorem 422] Thus the fucto beg maxmzed above s (/σ + γ-strogly cocave Therefore, y a,x (t +φ (y+ (y y(t 2 2σ 2 } ỹ a,x (t +φ (ỹ + (ỹ 2 ( + σ +γ (ỹ y 2 2 O the other had, sce y mmzes φ k (y y a,x (by property of the saddle-pot, we have φ (ỹ ỹ a,x φ (y y a,x + γ 2 (ỹ y 2 Summg up the above two equaltes, we obta (y (t y 2 2σ ( 2σ +γ (ỹ y 2 + (ỹ 2 2σ 2σ +(ỹ y a,x x (t (32 Accordg to Algorthm 2, the set K of dces to be updated are chose radomly For every specfc dex, the evet K happes wth probablty m/ If K, the y (t+ s updated to the value ỹ, whch satsfes equalty (32 Otherwse, y (t+ s assged by ts old value y (t Let F t be the sgma feld geerated by all radom varables defed before roud t, ad takg expectato codtoed o F t, we have E[(y (t+ y 2 F t ] = m(ỹ y 2 2 F t ] = m(ỹ E[(y (t+ 2 + ( m(y(t E[y (t+ F t ] = mỹ + ( my(t, y 2 As a result, we ca represet (ỹ y 2, (ỹ 2 ad ỹ terms of the codtoal expectatos o (y (t+ y 2, (y (t+ 2 ad y (t+ Pluggg these represetatos to equalty (32, we have ( 2mσ + ( mγ ( (y (t y m 2 2mσ + γ m ( + a,x x (t E[(y (t+ y 2 F t ]+ E[(y(t+ y (t y + m E[y(t+, F t ] 2 F t ] 2mσ (33 2

21 The summg over all dces =,2,, ad dvdg both sdes of the equalty by, we have ( 2mσ + ( mγ ( y (t y 2 2 m 2mσ + γ E[ y (t+ y 2 m 2 F t ]+ E[ y(t+ 2 2 F t] 2mσ [ +E u (t u + m k K (y (t+ k = y(t k a k, x x (t Ft ], where u = = y a s a shorthad otato, ad u (t = a s defed Algorthm 2 We used the fact that = (y(t+ y (t a = k K (y(t+ k y (t k a k, sce oly the coordates K are updated We stll eed a equalty characterzg the relato betwee x (t+ ad x (t Followg the same steps for dervg equalty (32, ad usg the λ-strog covexty of fucto g, t s ot dffcult to show that x (t x 2 2 2τ ( 2τ +λ x (t+ x x(t+ x (t 2 2 2τ + u (t u + (y (t+ m k k K (34 k a k,x (t+ x (35 Takg expectato over both sde of equalty (35, the addg t to equalty (34, we have x (t x 2 ( 2 + 2τ 2σ + ( mγ y (t y 2 ( 2 m 2τ +λ E[ x (t+ x 2 2 F t ] ( E[ y (t+ + 2σ +γ y 2 2 F t] + E[ x(t+ x (t 2 2 F t] + E[ y(t+ 2 2 F t] m 2τ 2mσ ( T +E y (t y + y(t+ A(x (t+ x (t θ(x (t x (t F t (36 m For the last term of equalty (36, we have plugged the deftos of u (t+, u ad x (t, ad used the relato that (y (t+ T A = k K (y(t+ k k at k The matrx A s a -by-d matrx, whose -th row s equal to the vector a T For the rest of the proof, we lower boud the last term o the rght-had-sde of equalty (36 I partcular, we have ( y (t y T + y(t+ A(x (t+ x (t θ(x (t x (t = (y(t+ y T A(x (t+ x (t m θ(y(t y T A(x (t x (t + m m (y(t+ T A(x (t+ x (t θ m (y(t+ T A(x (t x (t (37 2

22 Recall that a k 2 R ad /τ = 4σR 2 accordg to ( We have (y (t+ T A(x (t+ x (t x(t+ x (t 2 2 /m = x(t+ x (t 2 2 /m + (y(t+ T A 2 2 m/τ + ( k K y(t+ k k a k 2 2 4mσR 2 Smlarly, we have m x(t+ x (t y(t+ 2 2, 4σ (y (t+ T A(x (t x (t m x(t x (t 2 2 The above upper bouds o the absolute values mply + y(t σ (y (t+ T A(x (t+ x (t m x(t+ x (t 2 2 (y (t+ T A(x (t x (t m x(t x (t 2 2 y(t+ 2 2, 4σ y(t σ Combg the above two equaltes wth lower bouds (36 ad (37, we obta x (t x 2 2 2τ + ( + ( 2σ +γ 2σ + ( mγ y (t y 2 2 m E[ y (t+ y 2 2 F t] m ( 2τ +λ E[ x (t+ x 2 2 F t ] + E[ x(t+ x (t 2 2 F t] θ x (t x (t E[(y(t+ y T A(x (t+ x (t F t ] θ(y (t y T A(x (t x (t (38 Recall that the parameters τ, σ, ad θ are chose to be τ = mγ 2R λ, σ = λ 2R mγ, ad θ = (/m+r (/m/(λγ Pluggg these assgmets, we fd that /(2τ /(2τ+λ = +/(2τλ θ ad /(2σ+( mγ/ = /(2σ+γ /m+/(2mσγ = θ Therefore, f we defe a sequece (t such that (t = ( ( E[ y 2τ +λ E[ x (t x 2 (t 2]+ 2σ +γ y 2 2 ] m + E[ x(t x (t 2 2 ] + E[(y(t y T A(x (t x (t ], 22

23 the equalty (38 mples the recursve relato (t+ θ (t, whch mples where ( ( E[ y 2τ +λ E[ x (t x 2 (t 2]+ 2σ +γ y 2 2 ] m + E[ x(t x (t 2 2 ] ( = + E[(y(t y T A(x (t x (t ] ( ( y 2τ +λ x ( x 2 ( 2 + 2σ +γ y 2 2 m θ t (, (39 To elmate the last two terms o the left-had sde of equalty (39, we otce that (y (t y T A(x (t x (t x(t x (t A 2 2 y(t y /τ x(t x (t R2 y (t y /τ = x(t x (t 2 2 x(t x (t y(t y 2 2 4σ + y(t y 2 2, 4mσ where the secod equalty we used A 2 2 A 2 F R2, the equalty we used τσ = /(4R 2, ad the last equalty we used m The above upper boud o absolute value mples (y (t y T A(x (t x (t x(t x (t 2 2 y(t y 2 2 4mσ The theorem s establshed by combg the above equalty wth equalty (39 B Proof of Theorem 2 The proof of Theorem 2 mmcs the steps for provg Theorem We start by establshg relato betwee(y (t,y (t+ adbetwee(x (t,x (t+ Supposethatthequattyỹ mmzesthefucto φ (y y a,x (t + p 2σ (y y(t 2 The, followgthesameargumetforestablshgequalty(32, we obta p ( 2σ (y(t y 2 p 2σ +γ (ỹ y 2 + p(ỹ 2 + a,x x (t (ỹ y 2σ (4 Note that = k wth probablty p Therefore, we have (ỹ y 2 = p E[(y (t+ (ỹ 2 = E[(y (t+ 2 F t ], p y 2 F t ] p p (y (t y 2, ỹ = E[y (t+ F t ] p y (t, p p 23

24 where F t represets the sgma feld geerated by all radom varables defed before terato t Substtutg the above equatos to equalty (4, ad averagg over =,2,,, we have = ( 2σ + ( p γ (y (t y p 2 = ( 2σ + γ E[(y (t+ p +E[ (u (t u +(y (t+ k y 2 F t ]+ E[(y(t+ k k 2 F t ] 2σ k a k/(p k,x x (t F t ], where u = = y a ad u (t = = y(t a have the same defto as the proof of Theorem For the relato betwee x (t ad x (t+, we follow the steps the proof of Theorem to obta x (t x 2 ( 2 2τ 2τ +λ x (t+ x x(t+ x (t 2 2 2τ = + (u (t u +(y (t+ k (4 k a k/(p k,x (t+ x (42 Takg expectato over both sdes of equalty (42 ad addg t to equalty (4 yelds x (t x 2 ( 2 + 2τ 2σ + ( p ( γ (y (t y p = 2 2τ +λ E[ x (t+ x 2 2 F t ] ( + 2σ + γ E[(y (t+ y p 2 F t ]+ x(t+ x (t E[(y(t+ k k 2 F t ] 2τ 2σ [( (y (t y T A +E + (y(t+ k k at k p k ((x (t+ x (t θ(x (t x (t F t ], (43 } {{ } v where the matrx A s a -by-d matrx, whose -th row s equal to the vector a T Next, we lower boud the last term o the rght-had sde of equalty (43 Ideed, t ca be expaded as v = (y(t+ y T A(x (t+ x (t θ(y(t y T A(x (t x (t + p k p k (y(t+ k k at k (x(t+ x (t θ p k (y(t+ k k at k (x(t x (t (44 Note that the probablty p k gve (7 satsfes p k a k 2 2 = a = a k R, k =,, Sce the parameters τ ad σ satsfes στ R 2 = /6, we have p 2 k 2 /τ 4σ a k 2 2 ad cosequetly (y (t+ k k at k (x(t+ x (t x(t+ x (t 2 2 p k x(t+ x (t (y(t+ k + (y(t+ k k a k 2 2 p 2 k 2 /τ k 2 4σ 24

25 Smlarly, we have (y (t+ k k at k (x(t x (t x(t x (t 2 2 p k + (y(t+ k k 2 Combg the above two equaltes wth lower bouds (43 ad (44, we obta x (t x τ + = = ( 2σ + γ p ( 2σ + ( p γ p 4σ ( (y (t y 2 2τ +λ E[ x (t+ x 2 2 F t ] E[(y (t+ y 2 F t ]+ E[ x(t+ x (t 2 2 F t] θ x (t x (t E[(y(t+ y T A(x (t+ x (t F t ] θ(y (t y A(x (t x (t (45 Recall that the parameters τ, σ, ad θ are chose to be τ = γ 4 R λ, σ = λ 4 R γ, ad θ = 2+2 R /(λγ Pluggg these assgmets ad usg the fact that p /(2, we fd that /(2τ /(2τ+λ θ ad /(2σ+( p γ/(p θ for =,2,, /(2σ+γ/(p Therefore, f we defe a sequece (t such that (t = ( 2τ +λ E[ x (t x 2 2]+ + E[ x(t x (t 2 2 ] = ( 2σ + γ E[(y (t y p 2 ] + E[(y(t y T A(x (t x (t ], the equalty (45 mples the recursve relato (t+ θ (t, whch mples ( ( 2τ +λ E[ x (t x 2 2]+ 2σ + 2γ E[ y (t y 2 2] where + E[ x(t x (t 2 2 ] ( ( = 2τ +λ x ( x ( 2τ +λ x ( x E[(y(t y T A(x (t x (t ] = ( 2σ + γ p ( 2σ +2γ θ T (, (46 (y ( y 2 y ( y

26 To elmate the last two terms o the left-had sde of equalty (46, we otce that (y (t y T A(x (t x (t x(t x (t 2 2 x(t x (t 2 2 = x(t x (t y(t y 2 2 A /τ + y(t y 2 2 A 2 F 2 /τ + y(t y 2 2 = a 2 2 6σ( = a 2 2 x(t x (t y(t y 2 2, 6σ where the equalty we used 2 /τ = 2 6σ R 2 = 6σ( = a 2 2 Ths mples (y (t y T A(x (t x (t x(t x (t 2 2 Substtutg the above equalty to equalty (46 completes the proof C Effcet update for (l +l 2 -orm pealty y(t y 2 2 6σ From Secto 52, we have the followg recursve formula for t [t +,t ], τu (t+ τλ f x (t τu (t + > τλ, x (t+ = +λ 2 τ (x(t +λ 2 τ (x(t τu (t + +τλ f x (t τu (t + < τλ, otherwse (47 Gve x (t + at terato t, we preset a effcet algorthm for calculatg x (t We beg by examg the sg of x (t + Case I (x (t + = : If u (t + Cosequetly, we have a closed-form formula for x (t : x (t = > λ, the equato (47 mples x (t > for all t > t + ( (+λ 2 τ t t x (t + + u(t+ +λ λ 2 u(t + +λ λ 2 (48 If u (t + < λ, the equato (47 mples x (t < for all t > t + Therefore, we have the closed-form formula: x (t = ( (+λ 2 τ t t x (t + + u(t+ λ λ 2 Fally, f u (t + [ λ,λ ], the equato (47 mples x (t = u(t + λ λ 2 (49 26

27 Case II (x (t + > : If u (t + λ, the t s easy to verfy that x (t s obtaed by equato (48 Otherwse, We use the recursve formula (47 to derve the latest tme t + [t +,t ] such that x t+ > s true Ideed, sce x (t > for all t [t +,t + ], we have a closed-form formula for x t+ : x t+ = ( x (t + (+λ 2 τ t+ t + u(t+ +λ λ 2 u(t + +λ λ 2 (5 We look for the largest t + such that the rght-had sde of equato (5 s postve, whch s equvalet of t + t < log (+ λ 2x (t + u (t /log(+λ 2 τ (5 + +λ Thus, t + s the largest teger [t +,t ] such that equalty (5 holds If t + = t, the x (t s obtaed by (5 Otherwse, we ca calculate x t+ + by formula (47, the resort to Case I or Case III, treatg t + as t Case III (x (t + < : If u (t + λ, the x (t s obtaed by equato (49 Otherwse, we calculate the largest teger t [t +,t ] such that x t < s true Usg the same argumet as for Case II, we have the closed-form expresso x t = ( x (t + (+λ 2 τ t t + u(t+ λ λ 2 u(t + λ λ 2 (52 where t s the largest teger [t +,t ] such that the followg equalty holds: t t < log (+ λ 2x (t + u (t /log(+λ 2 τ (53 + λ If t = t, the x (t s obtaed by (52 Otherwse, we ca calculate x t + by formula (47, the resort to Case I or Case II, treatg t as t Fally, we ote that formula (47 mples the mootocty of x (t (t = t +,t +2, As a cosequece, the procedure of ether Case I, Case II or Case III s executed for at most oce Hece, the algorthm for calculatg x (t has O( tme complexty Refereces [] A Beck ad M Teboulle A fast teratve shrkage-threshold algorthm for lear verse problems SIAM Joural o Imagg Sceces, 2(:83 22, 29 [2] D P Bertsekas Icremetal proxmal methods for large scale covex optmzato Mathematcal Programmg, Ser B, 29:63 95, 2 27

28 [3] D P Bertsekas Icremetal gradet, subgradet, ad proxmal methods for covex optmzato: a survey I S Sra, S Nowoz, ad S J Wrght, edtors, Optmzato for Mache Learg, chapter 4 The MIT Press, 22 [4] D Blatt, A O Hero, ad H Gauchma A coverget cremetal gradet method wth a costat step sze SIAM Joural o Optmzato, 8(:29 5, 27 [5] L Bottou Large-scale mache learg wth stochastc gradet descet I Y Lechevaller ad G Saporta, edtors, Proceedgs of the 9th Iteratoal Coferece o Computatoal Statstcs (COMPSTAT 2, pages 77 87, Pars, Frace, August 2 Sprger [6] O Bousquet ad A Elsseeff Stablty ad geeralzato Joural of Mache Learg Research, 2: , 22 [7] S Boyd, N Parkh, E Chu, B Peleato, ad J Eckste Dstrbuted optmzato ad statstcal learg va the alteratg drecto method of multplers Foudatos ad Treds Mache Learg, 3(: 22, 2 [8] A Chambolle ad T Pock A frst-order prmal-dual algorthm for covex problems wth applcatos to magg Joural of Mathematcal Imagg ad Vso, 4(:2 45, 2 [9] K-W Chag, C-J Hseh, ad C-J L Coordate descet method for large-scale l 2 -loss lear support vector maches Joural of Mache Learg Research, 9: , 28 [] J Dea ad S Ghemawat MapReduce: Smplfed data processg o large clusters Commucatos of the ACM, 5(:7 3, 28 [] J Duch ad Y Sger Effcet ole ad batch learg usg forward backward splttg Joural of Mache Learg Research, : , 29 [2] R-E Fa ad C-J L LIBSVM data: Classfcato, regresso ad mult-label URL: cl/lbsvmtools/datasets, 2 [3] T Haste, R Tbshra, ad J Fredma The Elemets of Statstcal Learg: Data Mg, Iferece, ad Predcto Sprger, New York, 2d edto, 29 [4] J-B Hrart-Urruty ad C Lemaréchal Fudametals of Covex Aalyss Sprger, 2 [5] C-J Hseh, K-W Chag, C-J L, S Keerth, ad S Sudararaa A dual coordate descet method for large-scale lear svm I Proceedgs of the 25th Iteratoal Coferece o Mache Learg (ICML, pages 48 45, 28 [6] R Johso ad T Zhag Acceleratg stochastc gradet descet usg predctve varace reducto I Advaces Neural Iformato Processg Systems 26, pages [7] J Lagford, L L, ad T Zhag Sparse ole learg va trucated gradet Joural of Mache Learg Research, :777 8, 29 [8] Q L, Z Lu, ad L Xao A accelerated proxmal coordate gradet method ad ts applcato to regularzed emprcal rsk mmzato Techcal Report MSR-TR-24-94, Mcrosoft Research, 24 arxv:

arxiv: v1 [cs.lg] 22 Feb 2015

arxiv: v1 [cs.lg] 22 Feb 2015 SDCA wthout Dualty Sha Shalev-Shwartz arxv:50.0677v cs.lg Feb 05 Abstract Stochastc Dual Coordate Ascet s a popular method for solvg regularzed loss mmzato for the case of covex losses. I ths paper we

More information

Feature Selection: Part 2. 1 Greedy Algorithms (continued from the last lecture)

Feature Selection: Part 2. 1 Greedy Algorithms (continued from the last lecture) CSE 546: Mache Learg Lecture 6 Feature Selecto: Part 2 Istructor: Sham Kakade Greedy Algorthms (cotued from the last lecture) There are varety of greedy algorthms ad umerous amg covetos for these algorthms.

More information

Part 4b Asymptotic Results for MRR2 using PRESS. Recall that the PRESS statistic is a special type of cross validation procedure (see Allen (1971))

Part 4b Asymptotic Results for MRR2 using PRESS. Recall that the PRESS statistic is a special type of cross validation procedure (see Allen (1971)) art 4b Asymptotc Results for MRR usg RESS Recall that the RESS statstc s a specal type of cross valdato procedure (see Alle (97)) partcular to the regresso problem ad volves fdg Y $,, the estmate at the

More information

Dimensionality Reduction and Learning

Dimensionality Reduction and Learning CMSC 35900 (Sprg 009) Large Scale Learg Lecture: 3 Dmesoalty Reducto ad Learg Istructors: Sham Kakade ad Greg Shakharovch L Supervsed Methods ad Dmesoalty Reducto The theme of these two lectures s that

More information

Rademacher Complexity. Examples

Rademacher Complexity. Examples Algorthmc Foudatos of Learg Lecture 3 Rademacher Complexty. Examples Lecturer: Patrck Rebesch Verso: October 16th 018 3.1 Itroducto I the last lecture we troduced the oto of Rademacher complexty ad showed

More information

An Accelerated Proximal Coordinate Gradient Method

An Accelerated Proximal Coordinate Gradient Method A Accelerated Proxmal Coordate Gradet Method Qhag L Uversty of Iowa Iowa Cty IA USA qhag-l@uowaedu Zhaosog Lu Smo Fraser Uversty Buraby BC Caada zhaosog@sfuca L Xao Mcrosoft Research Redmod WA USA lxao@mcrosoftcom

More information

Bayes (Naïve or not) Classifiers: Generative Approach

Bayes (Naïve or not) Classifiers: Generative Approach Logstc regresso Bayes (Naïve or ot) Classfers: Geeratve Approach What do we mea by Geeratve approach: Lear p(y), p(x y) ad the apply bayes rule to compute p(y x) for makg predctos Ths s essetally makg

More information

Chapter 5 Properties of a Random Sample

Chapter 5 Properties of a Random Sample Lecture 6 o BST 63: Statstcal Theory I Ku Zhag, /0/008 Revew for the prevous lecture Cocepts: t-dstrbuto, F-dstrbuto Theorems: Dstrbutos of sample mea ad sample varace, relatoshp betwee sample mea ad sample

More information

CHAPTER 4 RADICAL EXPRESSIONS

CHAPTER 4 RADICAL EXPRESSIONS 6 CHAPTER RADICAL EXPRESSIONS. The th Root of a Real Number A real umber a s called the th root of a real umber b f Thus, for example: s a square root of sce. s also a square root of sce ( ). s a cube

More information

Lecture 9: Tolerant Testing

Lecture 9: Tolerant Testing Lecture 9: Tolerat Testg Dael Kae Scrbe: Sakeerth Rao Aprl 4, 07 Abstract I ths lecture we prove a quas lear lower boud o the umber of samples eeded to do tolerat testg for L dstace. Tolerat Testg We have

More information

TESTS BASED ON MAXIMUM LIKELIHOOD

TESTS BASED ON MAXIMUM LIKELIHOOD ESE 5 Toy E. Smth. The Basc Example. TESTS BASED ON MAXIMUM LIKELIHOOD To llustrate the propertes of maxmum lkelhood estmates ad tests, we cosder the smplest possble case of estmatg the mea of the ormal

More information

Econometric Methods. Review of Estimation
