Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization
Yuchen Zhang    Lin Xiao

September 2014

Abstract

We consider a generic convex optimization problem associated with regularized empirical risk minimization of linear predictors. The problem structure allows us to reformulate it as a convex-concave saddle point problem. We propose a stochastic primal-dual coordinate (SPDC) method, which alternates between maximizing over a randomly chosen dual variable and minimizing over the primal variable. An extrapolation step on the primal variable is performed to obtain accelerated convergence rate. We also develop a mini-batch version of the SPDC method which facilitates parallel computing, and an extension with weighted sampling probabilities on the dual variables, which has a better complexity than uniform sampling on unnormalized data. Both theoretically and empirically, we show that the SPDC method has comparable or better performance than several state-of-the-art optimization methods.

1 Introduction

We consider a generic convex optimization problem that arises often in machine learning: regularized empirical risk minimization (ERM) of linear predictors. More specifically, let a_1, ..., a_n ∈ R^d be the feature vectors of n data samples, φ_i : R → R be a convex loss function associated with the linear prediction a_i^T x, for i = 1, ..., n, and g : R^d → R be a convex regularization function for the predictor x ∈ R^d. Our goal is to solve the following optimization problem:

    minimize_{x ∈ R^d}   P(x) := (1/n) Σ_{i=1}^n φ_i(a_i^T x) + g(x).    (1)

Examples of the above formulation include many well-known classification and regression problems. For binary classification, each feature vector a_i is associated with a label b_i ∈ {±1}. We obtain the linear SVM (support vector machine) by setting φ_i(z) = max{0, 1 − b_i z} (the hinge loss) and g(x) = (λ/2)||x||_2^2, where λ > 0 is a regularization parameter. Regularized logistic regression is obtained by setting φ_i(z) = log(1 + exp(−b_i z)). For linear regression problems, each feature vector a_i is associated with a dependent variable b_i ∈ R, and φ_i(z) = (1/2)(z − b_i)^2. Then we get ridge regression with g(x) = (λ/2)||x||_2^2, and the Lasso with g(x) = λ||x||_1. Further background on regularized ERM in machine learning and statistics can be found in, e.g., the book [3].

Department of Electrical Engineering and Computer Science,
University of California, Berkeley, CA 94720, USA. Email: yuczhang@eecs.berkeley.edu. (This work was performed during an internship at Microsoft Research.) Machine Learning Group, Microsoft Research, Redmond, WA 98053, USA. Email: lin.xiao@microsoft.com
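The three loss functions named above plug into the same objective (1). The following minimal sketch (our own illustration; the helper names are not from the paper) evaluates P(x) for the hinge, logistic, and squared losses.

```python
import math

# Losses from Section 1 and the regularized ERM objective P(x) in (1).
# A is a list of feature vectors a_i, B the labels/targets b_i.

def hinge(z, b):          # SVM loss: max{0, 1 - b*z}
    return max(0.0, 1.0 - b * z)

def logistic(z, b):       # logistic loss: log(1 + exp(-b*z))
    return math.log(1.0 + math.exp(-b * z))

def squared(z, b):        # regression loss: (1/2)(z - b)^2
    return 0.5 * (z - b) ** 2

def erm_objective(x, A, B, loss, lam):
    """P(x) = (1/n) sum_i loss(a_i^T x, b_i) + (lam/2)*||x||_2^2."""
    n = len(A)
    avg_loss = sum(loss(sum(ai * xi for ai, xi in zip(a, x)), b)
                   for a, b in zip(A, B)) / n
    return avg_loss + 0.5 * lam * sum(xi * xi for xi in x)
```

Swapping `loss` between the three functions reproduces the SVM, logistic regression, and ridge regression objectives with the l_2 regularizer; replacing the quadratic penalty with λ||x||_1 would give the Lasso.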
We are especially interested in developing efficient algorithms for solving problem (1) when the number of samples n is very large. In this case, evaluating the full gradient or subgradient of the function P(x) is very expensive, thus incremental methods that operate on a single component function φ_i at each iteration can be very attractive. There has been extensive research on incremental (sub)gradient methods (e.g. [4, 4, 2, 2, 3]) as well as variants of the stochastic gradient method (e.g., [46, 5, 7, 43]). While the computational cost per iteration of these methods is only a small fraction, say 1/n, of that of the batch gradient methods, their iteration complexities are much higher (it takes many more iterations for them to reach the same precision). In order to better quantify the complexities of various algorithms and position our contributions, we need to make some concrete assumptions and introduce the notions of condition number and batch complexity.

1.1 Condition number and batch complexity

Let γ and λ be two positive real parameters. We make the following assumption:

Assumption A. Each φ_i is convex and differentiable, and its derivative is (1/γ)-Lipschitz continuous (same as φ_i being (1/γ)-smooth), i.e.,

    |φ_i'(α) − φ_i'(β)| ≤ (1/γ)|α − β|,   for all α, β ∈ R,  i = 1, ..., n.

In addition, the regularization function g is λ-strongly convex, i.e.,

    g(x) ≥ g(y) + g'(y)^T (x − y) + (λ/2)||x − y||_2^2,   for all g'(y) ∈ ∂g(y),  x, y ∈ R^d.

For example, the logistic loss φ_i(z) = log(1 + exp(−b_i z)) is (1/4)-smooth, the squared error φ_i(z) = (1/2)(z − b_i)^2 is 1-smooth, and the squared l_2-norm g(x) = (λ/2)||x||_2^2 is λ-strongly convex. The hinge loss φ_i(z) = max{0, 1 − b_i z} and the l_1-regularization g(x) = λ||x||_1 do not satisfy Assumption A. Nevertheless, we can treat them using smoothing and strongly convex perturbations, respectively, so that our algorithm and theoretical framework still apply (see Section 3.1).

Under Assumption A, the gradient of each component function, ∇φ_i(a_i^T x), is also Lipschitz continuous, with Lipschitz constant L_i = ||a_i||_2^2/γ ≤ R^2/γ, where R = max_i ||a_i||_2. In other words, each φ_i(a_i^T x) is (R^2/γ)-smooth. We define a condition number

    κ = R^2/(λγ),

and focus on ill-conditioned problems where κ ≫ 1. In the statistical learning context, the regularization parameter λ is usually on the order of 1/√n or 1/n (e.g., [6]), thus κ is on the order of √n or n. It can be even
larger if the strong convexity of g is added purely for numerical regularization purposes (see Section 3.1). We note that the actual conditioning of problem (1) may be better than κ, if the empirical loss function (1/n) Σ_{i=1}^n φ_i(a_i^T x) by itself is strongly convex. In those cases, our complexity estimates in terms of κ can be loose (upper bounds), but they are still useful in comparing different algorithms for solving the same given problem.

Let P* be the optimal value of problem (1), i.e., P* = min_{x∈R^d} P(x). In order to find an approximate solution x̂ satisfying P(x̂) − P* ≤ ǫ, the classical full gradient method and its proximal variants require O((1+κ) log(1/ǫ)) iterations (e.g., [24, 26]). Accelerated full gradient (AFG) methods [24, 4, 26] enjoy the improved iteration complexity O((1+√κ) log(1/ǫ)).¹ However, each iteration of these batch methods requires a full pass over the dataset, computing the gradient

¹ For the analysis of full gradient methods, we should use (R²/γ + λ)/λ = 1 + κ as the condition number of problem (1); see [26, Section 5]. Here we used the upper bound √(1+κ) ≤ 1 + √κ for easy comparison. When κ ≫ 1, the additive constant 1 can be dropped.
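The condition number κ = R²/(λγ) drives all the complexity comparisons that follow. The short sketch below (our own illustration; log factors are dropped and the constants are schematic) evaluates the batch complexities discussed in this section for a hypothetical setting with λ on the order of 1/n.

```python
# Schematic batch complexities (number of equivalent passes, log(1/eps)
# factor dropped): full gradient ~ 1 + kappa, AFG ~ 1 + sqrt(kappa),
# SAG/SDCA-style ~ 1 + kappa/n, and the rate (2) below ~ 1 + sqrt(kappa/n).

def kappa(R, lam, gamma):
    return R * R / (lam * gamma)

def batch_complexities(n, R, lam, gamma):
    k = kappa(R, lam, gamma)
    return {
        "full_gradient": 1 + k,
        "accelerated_full_gradient": 1 + k ** 0.5,
        "sdca_sag_like": 1 + k / n,
        "spdc_like": 1 + (k / n) ** 0.5,
    }
```

For n = 10^4 and λ = 10^-6 (so κ = 10^6 > n), the ranking comes out exactly as argued in the text: the accelerated stochastic rate beats the SAG/SDCA-style rate, which beats both batch rates.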
of each component function and forming their average, which costs O(nd) operations (assuming the feature vectors a_i ∈ R^d are dense). In contrast, the stochastic gradient method and its proximal variants operate on one single component φ_i(a_i^T x) (chosen randomly) at each iteration, which only costs O(d). But their iteration complexities are far worse. Under Assumption A, it takes them O(κ/ǫ) iterations to find an x̂ such that E[P(x̂) − P*] ≤ ǫ, where the expectation is with respect to the random choices made at all the iterations (see, e.g., [3, 23, 7, 43]).

To make fair comparisons with batch methods, we measure the complexity of stochastic or incremental gradient methods in terms of the number of equivalent passes over the dataset required to reach an expected precision ǫ. We call this measure the batch complexity, which is usually obtained by dividing the iteration complexity by n. For example, the batch complexity of the stochastic gradient method is O(κ/(nǫ)). The batch complexities of full gradient methods are the same as their iteration complexities.

By carefully exploiting the finite average structure in (1) and other similar problems, several recent works [32, 36, 6, 44] proposed new variants of the stochastic gradient or dual coordinate ascent methods and obtained the iteration complexity O((n + κ) log(1/ǫ)). Since their computational cost per iteration is O(d), the equivalent batch complexity is O((1 + κ/n) log(1/ǫ)). This complexity has much weaker dependence on n than the full gradient methods, and also much weaker dependence on ǫ than the stochastic gradient methods.

In this paper, we present a new algorithm that has the batch complexity

    O((1 + √(κ/n)) log(1/ǫ)),    (2)

which is more efficient when κ > n.

1.2 Outline of the paper

Our approach is based on reformulating problem (1) as a convex-concave saddle point problem, and then devising a primal-dual algorithm to approximate the saddle point. More specifically, we replace each component function φ_i(a_i^T x) through convex conjugation, i.e.,

    φ_i(a_i^T x) = sup_{y_i ∈ R} { y_i ⟨a_i, x⟩ − φ*_i(y_i) },

where φ*_i(y_i) = sup_{α∈R} { α y_i − φ_i(α) }, and ⟨a_i, x⟩ denotes the inner product of a_i and x (which is the same as a_i^T x, but is more convenient for later presentation). This leads to a convex-concave saddle point problem

    min_{x∈R^d} max_{y∈R^n}  { f(x, y) := (1/n) Σ_{i=1}^n ( y_i ⟨a_i, x⟩ − φ*_i(y_i) ) + g(x) }.    (3)

Under Assumption A, each φ*_i is γ-strongly convex (since φ_i is (1/γ)-smooth; see, e.g., [4, Theorem 4.2.2]) and g is λ-strongly convex. As a consequence, the saddle point problem (3) has a unique solution, which we denote by (x*, y*).

In Section 2, we propose a stochastic primal-dual coordinate (SPDC) method, which alternates between maximizing f over a randomly chosen dual coordinate y_i and minimizing f over the primal variable x. We also apply an extrapolation step to the primal variable x to accelerate the convergence. The SPDC method has iteration complexity O((n + √(κn)) log(1/ǫ)). Since each iteration of SPDC only operates on a single dual coordinate y_i, its batch complexity is given by (2). We also present a mini-batch SPDC algorithm which is well suited for distributed computing.
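The conjugation identity used to derive (3) can be checked numerically. Below is a small sketch (our own, not from the paper) for the squared loss φ_i(z) = (1/2)(z − b_i)², whose conjugate φ*_i(y) = y b_i + y²/2 is a standard computation; a grid search over y recovers φ_i(z) = sup_y { y z − φ*_i(y) }.

```python
# Check phi(z) = sup_y { y*z - phi*(y) } for the squared loss
# phi(z) = (1/2)(z - b)^2, with conjugate phi*(y) = y*b + y^2/2.

def phi(z, b):
    return 0.5 * (z - b) ** 2

def phi_conj(y, b):
    return y * b + 0.5 * y * y

def sup_over_grid(z, b, lo=-10.0, hi=10.0, steps=200001):
    # brute-force supremum over a fine grid of y values
    h = (hi - lo) / (steps - 1)
    return max((lo + i * h) * z - phi_conj(lo + i * h, b)
               for i in range(steps))
```

The maximizing y is z − b, which is exactly the dual optimality relation y*_i = φ_i'(a_i^T x*) that the saddle point (x*, y*) of (3) satisfies.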
Algorithm 1: The SPDC method

Input: parameters τ, σ, θ ∈ R_+, number of iterations T, and initial points x^(0) and y^(0).
Initialize: x̄^(0) = x^(0), u^(0) = (1/n) Σ_{i=1}^n y_i^(0) a_i.
for t = 0, 1, 2, ..., T−1 do
    Pick an index k ∈ {1, 2, ..., n} uniformly at random, and execute the following updates:

    y_i^(t+1) = argmax_{β∈R} { β⟨a_i, x̄^(t)⟩ − φ*_i(β) − (1/(2σ))(β − y_i^(t))² }   if i = k,
              = y_i^(t)   if i ≠ k,    (4)

    x^(t+1) = argmin_{x∈R^d} { g(x) + ⟨u^(t) + (y_k^(t+1) − y_k^(t)) a_k, x⟩ + ||x − x^(t)||_2²/(2τ) },    (5)

    u^(t+1) = u^(t) + (1/n)(y_k^(t+1) − y_k^(t)) a_k,    (6)

    x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t)).    (7)
end
Output: x^(T) and y^(T)

In Section 3, we present two extensions of the SPDC method. We first explain how to solve problem (1) when Assumption A does not hold. The idea is to apply small regularizations to the saddle point function so that SPDC can still be applied, which results in accelerated sublinear rates. The second extension is a SPDC method with non-uniform sampling. The batch complexity of this algorithm has the same form as (2), but with κ defined as κ = R̄²/(λγ), where R̄ = (1/n) Σ_{i=1}^n ||a_i||_2, which can be much smaller than R = max_i ||a_i||_2 if there is considerable variation in the norms ||a_i||_2.

In Section 4, we discuss related work. In particular, the SPDC method can be viewed as a coordinate-update extension of the batch primal-dual algorithm developed by Chambolle and Pock [8]. We also discuss two very recent works [34, 18] which achieve the same batch complexity (2).

In Section 5, we discuss efficient implementation of the SPDC method when the feature vectors a_i are sparse. We focus on two popular cases: when g is a squared l_2-norm penalty and when g is an l_1 + l_2 penalty. We show that the computational cost per iteration of SPDC only depends on the number of non-zero elements in the feature vectors.

In Section 6, we present experiment results comparing SPDC with several state-of-the-art optimization methods, including two efficient batch methods (AFG [24] and L-BFGS [27, Section 7.2]), the stochastic average gradient (SAG) method [32, 33], and the stochastic dual coordinate ascent (SDCA) method [36]. In all scenarios we tested, SPDC has comparable or better performance.

2 The SPDC method

In this section, we describe and analyze the Stochastic Primal-Dual Coordinate (SPDC) method. The basic idea of SPDC is quite simple: to approach the saddle point of f(x, y) defined in (3), we alternately maximize f with respect to y, and minimize f with respect to x. Since the
dual vector y has n coordinates and each coordinate is associated with a feature vector a_i ∈ R^d, maximizing f with respect to y takes O(nd) computation, which can be very expensive if n is large. We reduce the computational cost by randomly picking a single coordinate of y at a time, and
Algorithm 2: The Mini-Batch SPDC method

Input: mini-batch size m, parameters τ, σ, θ ∈ R_+, number of iterations T, and x^(0) and y^(0).
Initialize: x̄^(0) = x^(0), u^(0) = (1/n) Σ_{i=1}^n y_i^(0) a_i.
for t = 0, 1, 2, ..., T−1 do
    Randomly pick a subset of indices K ⊆ {1, 2, ..., n} of size m, such that the probability of each index being picked is equal to m/n. Execute the following updates:

    y_i^(t+1) = argmax_{β∈R} { β⟨a_i, x̄^(t)⟩ − φ*_i(β) − (1/(2σ))(β − y_i^(t))² }   if i ∈ K,
              = y_i^(t)   if i ∉ K,    (8)

    x^(t+1) = argmin_{x∈R^d} { g(x) + ⟨u^(t) + (1/m) Σ_{k∈K} (y_k^(t+1) − y_k^(t)) a_k, x⟩ + ||x − x^(t)||_2²/(2τ) },    (9)

    u^(t+1) = u^(t) + (1/n) Σ_{k∈K} (y_k^(t+1) − y_k^(t)) a_k,

    x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t)).
end
Output: x^(T) and y^(T)

maximizing f only with respect to this coordinate. Consequently, the computational cost of each iteration is O(d).

We give the details of the SPDC method in Algorithm 1. The dual coordinate update and primal vector update are given in equations (4) and (5) respectively. Instead of maximizing f over y_k and minimizing f over x directly, we add two quadratic regularization terms to penalize y_k^(t+1) and x^(t+1) from deviating from y_k^(t) and x^(t). The parameters σ and τ control their regularization strength, which we will specify in the convergence analysis (Theorem 1). Moreover, we introduce two auxiliary variables u^(t) and x̄^(t). From the initialization u^(0) = (1/n) Σ_{i=1}^n y_i^(0) a_i and the update rules (4) and (6), we have

    u^(t) = (1/n) Σ_{i=1}^n y_i^(t) a_i,   t = 0, 1, ..., T.

Equation (7) obtains x̄^(t+1) based on extrapolation from x^(t) and x^(t+1). This step is similar to Nesterov's acceleration technique [24, Section 2.2], and yields a faster convergence rate.

Before presenting the theoretical results, we introduce a Mini-Batch SPDC method in Algorithm 2, which is a natural extension of Algorithm 1. The difference between these two algorithms is that the Mini-Batch SPDC method may simultaneously select more than one dual coordinate to update. Let m be the mini-batch size. During each iteration, the Mini-Batch SPDC method randomly picks a subset of indices K ⊆ {1, ..., n} of size m, such that the probability of each index being picked is equal to m/n. The following is a simple procedure to achieve this. First, partition the set of indices into m disjoint subsets, so that the cardinality of each subset is equal to n/m (assuming m divides n). Then, during each iteration, randomly select a single index from each subset and add it to K. Other approaches for mini-batch selection are
also possible. With a single processor, each iteration of Algorithm 2 takes O(md) time to accomplish. Since
the updates of each coordinate y_k are independent of each other, we can use parallel computing to accelerate the Mini-Batch SPDC method. Concretely, we can use m processors to update the m coordinates in the subset K in parallel, then aggregate them to update x^(t+1). Such a procedure can be achieved by a single round of communication, for example, using the AllReduce operation in MPI [2] or MapReduce [1]. If we ignore the communication delay, then each iteration takes O(d) time, which is the same as running one iteration of the basic SPDC algorithm. Not surprisingly, we will show that the Mini-Batch SPDC algorithm converges faster than SPDC in terms of the iteration complexity, because it processes multiple dual coordinates in a single iteration.

2.1 Convergence analysis

Since the basic SPDC algorithm is a special case of Mini-Batch SPDC with m = 1, we only present a convergence theorem for the mini-batch version.

Theorem 1. Assume that each φ_i is (1/γ)-smooth and g is λ-strongly convex (Assumption A). Let R = max{||a_i||_2 : i = 1, ..., n}. If the parameters τ, σ and θ in Algorithm 2 are chosen such that

    τ = (1/(2R)) √(mγ/(nλ)),   σ = (1/(2R)) √(nλ/(mγ)),   θ = 1 − 1/( n/m + R √((n/m)/(λγ)) ),    (10)

then for each t ≥ 1, the Mini-Batch SPDC algorithm achieves

    (1/(2τ) + λ) E[||x^(t) − x*||_2²] + ((1/(4σ) + γ)/m) E[||y^(t) − y*||_2²]
      ≤ θ^t ( (1/(2τ) + λ) ||x^(0) − x*||_2² + ((1/(2σ) + γ)/m) ||y^(0) − y*||_2² ).

The proof of Theorem 1 is given in Appendix A. The following corollary establishes the expected iteration complexity of Mini-Batch SPDC for obtaining an ǫ-accurate solution.

Corollary 1. Suppose Assumption A holds and the parameters τ, σ and θ are set as in (10). In order for Algorithm 2 to obtain

    E[||x^(T) − x*||_2²] ≤ ǫ,   E[||y^(T) − y*||_2²] ≤ ǫ,    (11)

it suffices to have the number of iterations T satisfy

    T ≥ ( n/m + R √(n/(mλγ)) ) log(C/ǫ),

where

    C = ( (1/(2τ)+λ) ||x^(0) − x*||_2² + ((1/(2σ)+γ)/m) ||y^(0) − y*||_2² ) / min{ 1/(2τ)+λ, (1/(4σ)+γ)/m }.

Proof. By Theorem 1, we have E[||x^(T) − x*||_2²] ≤ θ^T C and E[||y^(T) − y*||_2²] ≤ θ^T C. To obtain (11), it suffices to ensure that θ^T C ≤ ǫ, which is equivalent to

    T ≥ log(C/ǫ) / (−log(θ)) = log(C/ǫ) / ( −log( 1 − 1/( n/m + R √((n/m)/(λγ)) ) ) ).

Applying the inequality −log(1 − x) ≥ x to the denominator above completes the proof.
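To make the updates (4)-(7) and the parameter choices (10) concrete, here is a minimal self-contained sketch (our own illustration, not code from the paper) of Algorithm 1 with m = 1 on a tiny one-dimensional ridge regression instance, where both steps have closed forms: for φ_i(z) = (1/2)(z − b_i)² the conjugate is φ*_i(y) = y b_i + y²/2 and step (4) is a scalar shrinkage, and for g(x) = (λ/2)x² step (5) is a scaled gradient step.

```python
import random

def spdc_ridge_1d(a, b, lam, T, seed=0):
    """Run T iterations of SPDC (Algorithm 1) on ridge regression in 1-d."""
    random.seed(seed)
    n = len(a)
    gamma = 1.0                      # the squared loss is 1-smooth
    R = max(abs(ai) for ai in a)
    # parameter choices (10) with m = 1
    tau = (1.0 / (2 * R)) * (gamma / (n * lam)) ** 0.5
    sigma = (1.0 / (2 * R)) * (n * lam / gamma) ** 0.5
    theta = 1.0 - 1.0 / (n + R * (n / (lam * gamma)) ** 0.5)
    x = xbar = 0.0
    y = [0.0] * n
    u = 0.0                          # u = (1/n) sum_i y_i * a_i, as in (6)
    for _ in range(T):
        k = random.randrange(n)
        # dual step (4): argmax_beta beta*a_k*xbar - phi*_k(beta) - (beta-y_k)^2/(2*sigma)
        y_new = (sigma * (a[k] * xbar - b[k]) + y[k]) / (sigma + 1.0)
        # primal step (5) with the un-averaged difference, then (6) and (7)
        v = u + (y_new - y[k]) * a[k]
        x_new = (x - tau * v) / (1.0 + lam * tau)
        u += (y_new - y[k]) * a[k] / n
        y[k] = y_new
        xbar = x_new + theta * (x_new - x)
        x = x_new
    return x
```

In one dimension the ridge optimum is available in closed form, x* = ((1/n) Σ a_i b_i)/((1/n) Σ a_i² + λ), which makes it easy to verify that the iterates converge to x*.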
Recall the definition of the condition number κ = R²/(λγ) in Section 1.1. Corollary 1 establishes that the iteration complexity of the Mini-Batch SPDC method for achieving (11) is

    O( ( n/m + √(κ (n/m)) ) log(1/ǫ) ).

So a larger batch size m leads to a smaller number of iterations. In the extreme case of m = n, we obtain a full batch algorithm, which has iteration or batch complexity O((1 + √κ) log(1/ǫ)). This complexity is also shared by the AFG methods [24, 26] (see Section 1.1), as well as the batch primal-dual algorithm of Chambolle and Pock [8] (see the discussion of related work in Section 4).

Since an equivalent pass over the dataset corresponds to n/m iterations, the batch complexity (the number of equivalent passes over the data) of Mini-Batch SPDC is

    O( ( 1 + √(κ (m/n)) ) log(1/ǫ) ).

The above expression implies that a smaller batch size m leads to a smaller number of passes through the data. In this sense, the basic SPDC method with m = 1 is the most efficient one. However, if we prefer the least amount of wall-clock time, then the best choice is to choose a mini-batch size m that matches the number of parallel processors available.

2.2 Convergence of primal objective

In the previous subsection, we established the iteration complexity of the Mini-Batch SPDC method in terms of approximating the saddle point of the minimax problem (3), more specifically, to meet the requirement in (11). Next we show that it has the same order of complexity in reducing the primal objective gap P(x^(T)) − P(x*). But we need an extra assumption.

Assumption B. There exist constants G and H such that for any x ∈ R^d,

    g(x) − g(x*) ≤ G ||x − x*||_2 + (H/2) ||x − x*||_2².

We note that Assumption B is weaker than either G-Lipschitz continuity or H-smoothness. It is satisfied by the l_1 norm, the squared l_2-norm, and mixed l_1 + l_2 regularizations.

Corollary 2. Suppose both Assumptions A and B hold, and the parameters τ, σ and θ are set as in (10). To guarantee E[P(x^(T)) − P(x*)] ≤ ǫ, it suffices to run Algorithm 2 for T iterations, with

    T ≥ ( n/m + R √(n/(mλγ)) ) log( C(4G² + H + 1/γ)/ǫ² ),

where

    C = ||x^(0) − x*||_2² + ((1/(2σ)+γ) ||y^(0) − y*||_2²) / ((1/(2τ)+λ) m).

Proof. Using the (1/γ)-smoothness of P − g and Assumption B, we have

    P(x^(T)) − P(x*) ≤ ⟨∇(P−g)(x*), x^(T) − x*⟩ + (1/(2γ)) ||x^(T) − x*||_2² + g(x^(T)) − g(x*)
                     ≤ ( ||∇(P−g)(x*)||_2 + G ) ||x^(T) − x*||_2 + ((H + 1/γ)/2) ||x^(T) − x*||_2².
Since x* minimizes P, we have −∇(P−g)(x*) ∈ ∂g(x*). Hence, Assumption B implies ||∇(P−g)(x*)||_2 ≤ G. Substituting this relation into the above inequality, and using Hölder's inequality, we have

    E[P(x^(T)) − P(x*)] ≤ 2G ( E[||x^(T) − x*||_2²] )^{1/2} + ((H + 1/γ)/2) E[||x^(T) − x*||_2²].

To make E[P(x^(T)) − P(x*)] ≤ ǫ, it suffices to let the right-hand side of the above inequality be bounded by ǫ. Since ǫ ≤ 1, this is guaranteed by

    E[||x^(T) − x*||_2²] ≤ ǫ² / (4G² + H + 1/γ).    (12)

By Theorem 1, we have E[||x^(T) − x*||_2²] ≤ θ^T C. To secure inequality (12), it is sufficient to make θ^T ≤ ǫ²/(C(4G² + H + 1/γ)), which is equivalent to

    T ≥ log( C(4G² + H + 1/γ)/ǫ² ) / (−log(θ)) = log( C(4G² + H + 1/γ)/ǫ² ) / ( −log( 1 − 1/( n/m + R √((n/m)/(λγ)) ) ) ).

Applying −log(1 − x) ≥ x to the denominator above completes the proof.

3 Extensions of SPDC

In this section, we derive two extensions of the SPDC method. The first one handles problems for which Assumption A does not hold. The second one employs a non-uniform sampling scheme to improve the iteration complexity when the feature vectors a_i are unnormalized.

3.1 Non-smooth or non-strongly convex functions

The complexity bounds established in Section 2 require each φ*_i to be γ-strongly convex, which corresponds to the condition that the first derivative of φ_i is (1/γ)-Lipschitz continuous. In addition, the function g needs to be λ-strongly convex. For general loss functions where either or both of these conditions fail (e.g., the hinge loss and l_1-regularization), we can slightly perturb the saddle-point function f(x, y) so that the SPDC method can still be applied. For simplicity, here we consider the case where neither φ_i is smooth nor g is strongly convex. Formally, we assume that each φ_i and g are convex and Lipschitz continuous, and f(x, y) has a saddle point (x*, y*). We choose a scalar δ > 0 and consider the modified saddle-point function:

    f_δ(x, y) := (1/n) Σ_{i=1}^n ( y_i ⟨a_i, x⟩ − φ*_i(y_i) − (δ/2) y_i² ) + g(x) + (δ/2) ||x||_2².    (13)

Denote by (x*_δ, y*_δ) the saddle point of f_δ. We employ the Mini-Batch SPDC method (Algorithm 2) to approximate (x*_δ, y*_δ), treating φ*_i + (δ/2)(·)² as φ*_i and g + (δ/2)||·||_2² as g, which now are all δ-strongly convex. We note that adding a strongly convex perturbation to φ*_i is equivalent to smoothing φ_i, which becomes (1/δ)-smooth. Letting γ = λ = δ, the parameters τ, σ and θ in (10) become

    τ = (1/(2R)) √(m/n),   σ = (1/(2R)) √(n/m),   and   θ = 1 − 1/( n/m + (R/δ) √(n/m) ).
Although (x*_δ, y*_δ) is not exactly the saddle point of f, the following corollary shows that applying the SPDC method to the perturbed function f_δ effectively minimizes the original loss function P.

Corollary 3. Assume that each φ_i is convex and G_φ-Lipschitz continuous, and g is convex and G_g-Lipschitz continuous. Define two constants:

    C₁ = ||x*||_2² + G_φ²,
    C₂ = (G_φ R + G_g) √( ||x^(0) − x*_δ||_2² + ((1/(2σ)+δ) ||y^(0) − y*_δ||_2²) / ((1/(2τ)+δ) m) ).

If we choose δ ≤ ǫ/C₁, and run the Mini-Batch SPDC algorithm for T iterations where

    T ≥ ( n/m + (R/δ) √(n/m) ) log(4C₂²/ǫ²),

then E[P(x^(T)) − P(x*)] ≤ ǫ.

Proof. Let ỹ = argmax_y f(x*_δ, y) be a shorthand notation. We have

    P(x*_δ) = f(x*_δ, ỹ)
            ≤ f_δ(x*_δ, ỹ) + (δ/(2n)) ||ỹ||_2²        (definition of f_δ)
            ≤ f_δ(x*_δ, y*_δ) + (δ/(2n)) ||ỹ||_2²      (y*_δ maximizes f_δ(x*_δ, ·))
            ≤ f_δ(x*, y*_δ) + (δ/(2n)) ||ỹ||_2²        (x*_δ minimizes f_δ(·, y*_δ))
            ≤ f(x*, y*_δ) + (δ/2) ||x*||_2² + (δ/(2n)) ||ỹ||_2²   (definition of f_δ)
            ≤ f(x*, y*) + (δ/2) ||x*||_2² + (δ/(2n)) ||ỹ||_2²     ((x*, y*) is the saddle point of f)
            = P(x*) + (δ/2) ||x*||_2² + (δ/(2n)) ||ỹ||_2².

Since φ_i is G_φ-Lipschitz continuous, the domain of φ*_i is contained in the interval [−G_φ, G_φ], which implies ||ỹ||_2² ≤ n G_φ² (see, e.g., [34, Lemma 1]). Thus, we have

    P(x*_δ) − P(x*) ≤ (δ/2) ( ||x*||_2² + G_φ² ) = (δ/2) C₁.    (14)

On the other hand, since P is (G_φ R + G_g)-Lipschitz continuous, Theorem 1 implies

    E[P(x^(T)) − P(x*_δ)] ≤ (G_φ R + G_g) E[||x^(T) − x*_δ||_2] ≤ C₂ ( 1 − 1/( n/m + (R/δ) √(n/m) ) )^{T/2}.    (15)

Combining inequality (14) and inequality (15), to guarantee E[P(x^(T)) − P(x*)] ≤ ǫ, it suffices to have

    C₁ δ ≤ ǫ   and   C₂ ( 1 − 1/( n/m + (R/δ) √(n/m) ) )^{T/2} ≤ ǫ/2.    (16)

The corollary is established by finding the smallest T that satisfies inequality (16).

There are two other cases that can be considered: when φ_i is not smooth but g is strongly convex, and when φ_i is smooth but g is not strongly convex. They can be handled with the same technique described above, and we omit the details here. (Alternatively, it is possible to use the techniques described in [8, Section 5] to obtain accelerated sublinear convergence rates without using strongly convex perturbations.) In Table 1, we list the complexities of the Mini-Batch SPDC method for finding an ǫ-optimal solution of problem (1) under various assumptions. Similar results are also obtained in [34].
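For the hinge loss (label b = 1), the perturbation described above has a fully explicit form: adding (δ/2)y² to the conjugate φ*(y) = y on [−1, 0] and re-conjugating yields a (1/δ)-smooth "smoothed hinge" that never deviates from the hinge by more than δ/2. The sketch below (our own illustration) computes that closed form and checks the approximation bound.

```python
# Smoothing the hinge loss phi(z) = max{0, 1 - z} by strongly convex
# perturbation of its conjugate:
#   phi_delta(z) = max_{y in [-1, 0]} { y*(z - 1) - (delta/2)*y^2 }.
# The inner maximizer is y = (z - 1)/delta clipped to [-1, 0], giving
# the three-piece closed form below.

def hinge_loss(z):
    return max(0.0, 1.0 - z)

def smoothed_hinge(z, delta):
    if z >= 1.0:
        return 0.0
    if z <= 1.0 - delta:
        return 1.0 - z - delta / 2.0
    return (1.0 - z) ** 2 / (2.0 * delta)
```

The gap hinge_loss(z) − smoothed_hinge(z, δ) lies in [0, δ/2] for every z, which is exactly the O(δ) approximation error absorbed into (14) when δ is set proportional to ǫ.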
    φ_i             g                     iteration complexity Õ(·)
    (1/γ)-smooth    λ-strongly convex     n/m + R √((n/m)/(λγ))
    (1/γ)-smooth    non-strongly convex   n/m + R √((n/m)/(ǫγ))
    non-smooth      λ-strongly convex     n/m + R √((n/m)/(ǫλ))
    non-smooth      non-strongly convex   n/m + R √(n/m)/ǫ

Table 1: Iteration complexities of the Mini-Batch SPDC method under different assumptions on the functions φ_i and g. For the last three cases, we solve the perturbed saddle-point problem with δ = ǫ/C₁.

3.2 SPDC with non-uniform sampling

One potential drawback of the SPDC algorithm is that its convergence rate depends on a problem-specific constant R, which is the largest l_2-norm of the feature vectors a_i. As a consequence, the algorithm may perform badly on unnormalized data, especially if the l_2-norms of some feature vectors are substantially larger than others. In this section, we propose an extension of the SPDC method to mitigate this problem, which is given in Algorithm 3.

The basic idea is to use non-uniform sampling in picking the dual coordinate to update at each iteration. In Algorithm 3, we pick coordinate k with the probability

    p_k = 1/(2n) + ||a_k||_2 / ( 2 Σ_{i=1}^n ||a_i||_2 ),   k = 1, ..., n.    (17)

Therefore, instances with large feature norms are sampled more frequently. Simultaneously, we adopt an adaptive regularization in step (18), imposing stronger regularization on such instances. In addition, we adjust the weight of a_k in (19) for updating the primal variable. As a consequence, the convergence rate of Algorithm 3 depends on the average norm of the feature vectors. This is summarized by the following theorem.

Theorem 2. Suppose Assumption A holds. Let R̄ = (1/n) Σ_{i=1}^n ||a_i||_2. If the parameters τ, σ, θ in Algorithm 3 are chosen such that

    τ = (1/(4R̄)) √(γ/(nλ)),   σ = (1/(4R̄)) √(nλ/γ),   θ = 1 − 1/( 2n + 2R̄ √(n/(λγ)) ),

then for each t ≥ 1, we have

    (1/(2τ) + λ) E[||x^(t) − x*||_2²] + (1/(4σ) + 2γ) E[||y^(t) − y*||_2²]
      ≤ θ^t ( (1/(2τ) + λ) ||x^(0) − x*||_2² + (1/(2σ) + 2γ) ||y^(0) − y*||_2² ).

Comparing the constant θ in Theorem 2 to that of Theorem 1 (with m = 1), we can find two differences. First, there is an additional factor of 2 multiplying the denominator n + R̄ √(n/(λγ)), making the value of θ larger. Second, the constant R̄ here is determined by the average norm of the features, instead of the largest one, which makes the value of θ smaller. The second difference makes the algorithm more robust to unnormalized feature vectors. For example, if the a_i's are sampled i.i.d.
Algorithm 3: SPDC method with weighted sampling

Input: parameters τ, σ, θ ∈ R_+, number of iterations T, and initial points x^(0) and y^(0).
Initialize: x̄^(0) = x^(0), u^(0) = (1/n) Σ_{i=1}^n y_i^(0) a_i.
for t = 0, 1, 2, ..., T−1 do
    Randomly pick k ∈ {1, 2, ..., n}, with probability p_k = 1/(2n) + ||a_k||_2 / (2 Σ_{i=1}^n ||a_i||_2). Execute the following updates:

    y_i^(t+1) = argmax_{β∈R} { β⟨a_i, x̄^(t)⟩ − φ*_i(β) − (n p_i/(2σ))(β − y_i^(t))² }   if i = k,
              = y_i^(t)   if i ≠ k,    (18)

    x^(t+1) = argmin_{x∈R^d} { g(x) + ⟨u^(t) + (1/(n p_k))(y_k^(t+1) − y_k^(t)) a_k, x⟩ + ||x − x^(t)||_2²/(2τ) },    (19)

    u^(t+1) = u^(t) + (1/n)(y_k^(t+1) − y_k^(t)) a_k,

    x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t)).
end
Output: x^(T) and y^(T)

from a multivariate normal distribution, then max_i{||a_i||_2} almost surely goes to infinity as n → ∞, but the average norm (1/n) Σ_{i=1}^n ||a_i||_2 converges to E[||a_i||_2].

For simplicity of presentation, we described in Algorithm 3 a weighted sampling method with a single dual coordinate update, i.e., the case of m = 1. It is not hard to see that the non-uniform sampling scheme can also be extended to Mini-Batch SPDC with m > 1. Moreover, the non-uniform sampling scheme can also be applied to solve problems with non-smooth φ_i or non-strongly convex g, leading to similar conclusions as in Corollary 3. Here, we omit the technical details.

4 Related Work

Chambolle and Pock [8] considered a class of convex optimization problems with the following saddle-point structure:

    min_{x∈R^d} max_{y∈R^n} { ⟨Kx, y⟩ + G(x) − F*(y) },    (20)

where K ∈ R^{n×d}, G and F* are proper closed convex functions, with F* itself being the conjugate of a convex function F. They developed the following first-order primal-dual algorithm:

    y^(t+1) = argmax_{y∈R^n} { ⟨Kx̄^(t), y⟩ − F*(y) − ||y − y^(t)||_2²/(2σ) },    (21)
    x^(t+1) = argmin_{x∈R^d} { ⟨K^T y^(t+1), x⟩ + G(x) + ||x − x^(t)||_2²/(2τ) },    (22)
    x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t)).    (23)

When both F* and G are strongly convex and the parameters τ, σ and θ are chosen appropriately, this algorithm obtains an accelerated linear convergence rate [8, Theorem 3].
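Returning to Algorithm 3, the weighted distribution (17), as reconstructed above, is a 50/50 mixture of uniform sampling and norm-proportional sampling; every coordinate therefore keeps probability at least 1/(2n). A small sketch (our own, with inverse-CDF sampling for simplicity):

```python
import random

# p_k = 1/(2n) + ||a_k||_2 / (2 * sum_i ||a_i||_2), as in (17).

def sampling_probs(A):
    norms = [sum(x * x for x in a) ** 0.5 for a in A]
    total = sum(norms)
    n = len(A)
    return [0.5 / n + 0.5 * nk / total for nk in norms]

def sample_index(p):
    # inverse-CDF sampling; O(n) per draw, fine for a sketch
    r = random.random()
    acc = 0.0
    for k, pk in enumerate(p):
        acc += pk
        if r < acc:
            return k
    return len(p) - 1
```

The uniform half of the mixture is what keeps the importance weights 1/(n p_k) in step (19) bounded by 2, so no single instance can blow up the primal step.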
    algorithm             τ                      σ                      θ                                        batch complexity
    Chambolle-Pock [8]    (1/||A||_2)√(nγ/λ)     (n/||A||_2)√(nλ/γ)     1 − 1/(1 + ||A||_2/(2√(nλγ)))            (1 + ||A||_2/(2√(nλγ))) log(1/ǫ)
    SPDC with m = 1       (1/(2R))√(γ/(nλ))      (1/(2R))√(nλ/γ)        1 − 1/(n + R√(n/(λγ)))                   (1 + R/√(nλγ)) log(1/ǫ)
    SPDC with m = n       (1/(2R))√(γ/λ)         (1/(2R))√(λ/γ)         1 − 1/(1 + R/√(λγ))                      (1 + R/√(λγ)) log(1/ǫ)

Table 2: Comparing SPDC with Chambolle and Pock [8, Algorithm 3, Theorem 3].

We can map the saddle-point problem (3) to the form of (20) by letting A = [a_1, ..., a_n]^T and

    K = (1/n) A,   G(x) = g(x),   F*(y) = (1/n) Σ_{i=1}^n φ*_i(y_i).    (24)

The SPDC method developed in this paper can be viewed as an extension of the batch method (21)-(23), where the dual update step (21) is replaced by a single coordinate update (4) or a mini-batch update (8). However, in order to obtain an accelerated convergence rate, more subtle changes are necessary in the primal update step. More specifically, we introduced the auxiliary variable u^(t) = (1/n) Σ_{i=1}^n y_i^(t) a_i = K^T y^(t), and replaced the primal update step (22) by (5) and (9). The primal extrapolation step (23) stays the same.

To compare the batch complexity of SPDC with that of (21)-(23), we use the following facts implied by Assumption A and the relations in (24): ||K||_2 = ||A||_2/n, G(x) is λ-strongly convex, and F*(y) is (γ/n)-strongly convex. Based on these conditions, we list in Table 2 the equivalent parameters used in [8, Algorithm 3] and the batch complexity obtained in [8, Theorem 3], and compare them with SPDC.

The batch complexity of the Chambolle-Pock algorithm is Õ(1 + ||A||_2/(2√(nλγ))), where the Õ(·) notation hides the log(1/ǫ) factor. We can bound the spectral norm ||A||_2 by the Frobenius norm ||A||_F and obtain

    ||A||_2 ≤ ||A||_F ≤ √n max_i{||a_i||_2} = √n R.

(Note that the second inequality above would be an equality if the rows of A are normalized, i.e., have equal norms.) So in the worst case, the batch complexity of the Chambolle-Pock algorithm becomes

    Õ(1 + R/√(λγ)) = Õ(1 + √κ),

where κ = R²/(λγ), which matches the worst-case complexity of the AFG methods [24, 26] (see Section 1.1 and also the discussions in [8, Section 5]). This is also of the same order as the complexity of Mini-Batch SPDC with m = n (see Section 2.1). When the condition number κ ≫ n, they can be worse than the batch complexity of SPDC with m = 1, which is Õ(1 + √(κ/n)).

If either G(x) or F*(y) in (20) is not strongly convex, Chambolle and Pock proposed variants of the primal-dual batch algorithm to achieve accelerated sublinear convergence rates [8, Section 5]. It is
also possible to extend them to coordinate update methods for solving problem (1) when either φ_i or g is not strongly convex. Their complexities would be similar to those in Table 1.
4.1 Dual coordinate ascent methods

We can also solve the primal problem (1) via its dual:

    maximize_{y∈R^n}   D(y) := −(1/n) Σ_{i=1}^n φ*_i(y_i) − g*( −(1/n) Σ_{i=1}^n y_i a_i ),    (25)

where g*(u) = sup_{x∈R^d}{ x^T u − g(x) } is the conjugate function of g. Here again, coordinate ascent methods (e.g., [29, 9, 5, 36]) can be more efficient than full gradient methods. In the stochastic dual coordinate ascent (SDCA) method [36], a dual coordinate y_i is picked at random during each iteration and updated to increase the dual objective value. Shalev-Shwartz and Zhang [36] showed that the iteration complexity of SDCA is O((n + κ) log(1/ǫ)), which corresponds to the batch complexity Õ(1 + κ/n). Therefore the SPDC method, which has batch complexity Õ(1 + √(κ/n)), can be much better when κ > n, i.e., for ill-conditioned problems.

For more general convex optimization problems, there is a vast literature on coordinate descent methods. In particular, Nesterov's work on randomized coordinate descent [25] sparked a lot of recent activity on this topic. Richtárik and Takáč [3] extended the algorithm and analysis to composite convex optimization. When applied to the dual problem (25), it becomes one variant of SDCA studied in [36]. Mini-batch and distributed versions of SDCA have been proposed and analyzed in [39] and [45] respectively. Non-uniform sampling schemes similar to the one used in Algorithm 3 have been studied for both stochastic gradient and SDCA methods (e.g., [22, 44, 48]).

Shalev-Shwartz and Zhang [35] proposed an accelerated mini-batch SDCA method which incorporates additional primal updates beyond SDCA, and bears some similarity to our Mini-Batch SPDC method. They showed that its complexity interpolates between that of SDCA and AFG by varying the mini-batch size m. In particular, for m = n, it matches that of the AFG methods (as Mini-Batch SPDC does). But for m = 1, the complexity of their method is the same as SDCA, which is worse than SPDC for ill-conditioned problems.

In addition, Shalev-Shwartz and Zhang [34] developed an accelerated proximal SDCA method which achieves the same batch complexity Õ(1 + √(κ/n)) as SPDC. Their method is an inner-outer iteration procedure, where the outer loop is a full-dimensional accelerated gradient method in the primal space x ∈ R^d. At each iteration of the outer loop, the SDCA method [36] is called to solve the dual problem (25) with customized regularization parameter
and precision. In contrast, SPDC is a straightforward single-loop coordinate optimization method.

More recently, Lin et al. [18] developed an accelerated proximal coordinate gradient (APCG) method for solving a more general class of composite convex optimization problems. When applied to the dual problem (25), APCG enjoys the same batch complexity Õ(1 + √(κ/n)) as SPDC. However, it needs an extra primal proximal-gradient step to have theoretical guarantees on the convergence of the primal-dual gap [18, Section 5]. The computational cost of this additional step is equivalent to one pass over the dataset, thus it does not affect the overall complexity.

4.2 Other related work

Another way to approach problem (1) is to reformulate it as a constrained optimization problem

    minimize   (1/n) Σ_{i=1}^n φ_i(z_i) + g(x)    (26)
    subject to   a_i^T x = z_i,   i = 1, ..., n,
and solve it by ADMM-type operator-splitting methods (e.g., [9]). In fact, as shown in [8], the batch primal-dual algorithm (21)-(23) is equivalent to a pre-conditioned ADMM (or inexact Uzawa method; see, e.g., [47]). Several authors [42, 28, 37, 49] have considered a more general formulation than (26), where each φ_i is a function of the whole vector z ∈ R^n. They proposed online or stochastic versions of ADMM which operate on only one φ_i in each iteration, and obtained sublinear convergence rates. However, their cost per iteration is O(nd) instead of O(d).

Suzuki [38] considered a problem similar to (1), but with a more complex regularization function g, meaning that g does not have a simple proximal mapping. Thus primal updates such as step (5) or (19) and similar steps cannot be computed efficiently. He proposed an algorithm that combines SDCA [36] and ADMM (e.g., [7]), and showed that it has a linear rate of convergence under similar conditions as Assumption A. It would be interesting to see if the SPDC method can be extended to their setting to obtain an accelerated linear convergence rate.

5 Efficient Implementation with Sparse Data

During each iteration of the SPDC methods, the updates of the primal variables (i.e., computing x^(t+1)) require full d-dimensional vector operations; see step (5) of Algorithm 1, step (9) of Algorithm 2 and step (19) of Algorithm 3. So the computational cost per iteration is O(d), and this can be too expensive if the dimension d is very high. In this section, we show how to exploit problem structure to avoid high-dimensional vector operations when the feature vectors a_i are sparse. We illustrate the efficient implementation for two popular cases: when g is a squared-l_2 penalty and when g is an l_1 + l_2 penalty. For both cases, we show that the computational cost per iteration only depends on the number of non-zero components of the feature vector.

5.1 Squared l_2-norm penalty

Suppose that g(x) = (λ/2)||x||_2². For this case, the updates for each coordinate of x are independent of each other. More specifically, x^(t+1) can be computed coordinate-wise in closed form:

    x_i^(t+1) = (1/(1+λτ)) ( x_i^(t) − τ u_i^(t) − τ Δu_i ),   i = 1, ..., d,    (27)

where Δu denotes (y_k^(t+1) − y_k^(t)) a_k in Algorithm 1, or (1/m) Σ_{k∈K} (y_k^(t+1) − y_k^(t)) a_k in Algorithm 2, or (y_k^(t+1) − y_k^(t)) a_k/(n p_k) in Algorithm 3, and Δu_i represents the i-th coordinate of
Δu. Although the dimension d can be very large, we assume that each feature vector a_k is sparse. We denote by J^(t) the set of non-zero coordinates at iteration t; that is, if for some index k ∈ K picked at iteration t we have a_{ki} ≠ 0, then i ∈ J^(t). If i ∉ J^(t), then the SPDC algorithm (and its variants) updates y^(t+1) without using the value of x_i^(t) or x̄_i^(t). This can be seen from the updates (4), (8) and (18), where the value of the inner product ⟨a_k, x̄^(t)⟩ does not depend on the value of x̄_i^(t). As a consequence, we can delay the updates on x_i and x̄_i whenever i ∉ J^(t) without affecting the updates on y^(t), and process all the missing updates at the next time when i ∈ J^(t).

Such a delayed update can be carried out very efficiently. We assume that t₁ is the last time when i ∈ J^(t), and t₂ is the current iteration where we want to update x_i and x̄_i. Since i ∉ J^(t) implies Δu_i = 0, we have

    x_i^(t+1) = (1/(1+λτ)) ( x_i^(t) − τ u_i^(t) ),   t = t₁+1, t₁+2, ..., t₂−1.    (28)
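The recursion (28) is an affine contraction with fixed point −u_i/λ, so all the skipped iterations can be collapsed into a single O(1) update. A quick numerical sketch (our own) confirming that the iterated and collapsed forms agree:

```python
# Iterating x <- (x - tau*u)/(1 + lam*tau) for s steps with u frozen
# equals the closed form
#   x_s = (1 + lam*tau)^(-s) * (x_0 + u/lam) - u/lam,
# because -u/lam is the fixed point of the affine recursion.

def iterate_lazy(x0, u, tau, lam, s):
    x = x0
    for _ in range(s):
        x = (x - tau * u) / (1.0 + lam * tau)
    return x

def closed_form(x0, u, tau, lam, s):
    return (1.0 + lam * tau) ** (-s) * (x0 + u / lam) - u / lam
```

This is exactly why the delayed coordinate update costs O(1) regardless of how many iterations were skipped.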
Notice that u_i^(t) is updated only at iterations where i ∈ J^(t). The value of u_i^(t) doesn't change during the iterations [t₁+1, t₂], so we have u_i^(t) ≡ u_i^(t₁+1) for t ∈ [t₁+1, t₂]. Substituting this equation into the recursive formula (28), we obtain

    x_i^(t₂) = (1+λτ)^{−(t₂−t₁−1)} ( x_i^(t₁+1) + u_i^(t₁+1)/λ ) − u_i^(t₁+1)/λ.    (29)

The update (29) takes O(1) time to compute. Using the same formula, we can compute x_i^(t₂−1) and subsequently compute x̄_i^(t₂) = x_i^(t₂) + θ(x_i^(t₂) − x_i^(t₂−1)). Thus, the computational complexity of a single iteration is proportional to |J^(t)|, independent of the dimension d.

5.2 (l_1 + l_2)-norm penalty

Suppose that g(x) = λ₁||x||_1 + (λ₂/2)||x||_2². Since both the l_1-norm and the squared l_2-norm are decomposable, the updates for each coordinate of x^(t+1) are independent. More specifically,

    x_i^(t+1) = argmin_{α∈R} { λ₁|α| + (λ₂/2)α² + (u_i^(t) + Δu_i)α + (α − x_i^(t))²/(2τ) },    (30)

where Δu_i follows the definition in Section 5.1. If i ∉ J^(t), then Δu_i = 0 and equation (30) can be simplified as

    x_i^(t+1) = (1/(1+λ₂τ)) ( x_i^(t) − τu_i^(t) − τλ₁ )   if x_i^(t) − τu_i^(t) > τλ₁,
              = (1/(1+λ₂τ)) ( x_i^(t) − τu_i^(t) + τλ₁ )   if x_i^(t) − τu_i^(t) < −τλ₁,
              = 0   otherwise.    (31)

Similar to the approach of Section 5.1, we delay the update of x_i until i ∈ J^(t). We assume t₁ to be the last iteration when i ∈ J^(t), and let t₂ be the current iteration when we want to update x_i. During the iterations [t₁+1, t₂], the value of u_i^(t) doesn't change, so we have u_i^(t) ≡ u_i^(t₁+1) for t ∈ [t₁+1, t₂]. Using equation (31) and the invariance of u_i^(t) for t ∈ [t₁+1, t₂], we have an O(1)-time algorithm to calculate x_i^(t₂), which we detail in Appendix C. The vector x̄^(t₂) can be updated by the same algorithm since it is a linear combination of x^(t₂) and x^(t₂−1). As a consequence, the computational complexity of each iteration is proportional to |J^(t)|, independent of the dimension d.

6 Experiments

In this section, we compare the basic SPDC method (Algorithm 1) with several state-of-the-art optimization algorithms for solving problem (1). They include two batch-update algorithms: the accelerated full gradient (AFG) method [24, Section 2.2], and the limited-memory quasi-Newton method L-BFGS [27, Section 7.2]. For the AFG method, we adopt an adaptive line search scheme (e.g., [26]) to improve its efficiency. For the L-BFGS method, we use the memory size 30 as suggested by [27]. We also compare with two stochastic algorithms: the stochastic average gradient (SAG) method
[32, 33], and the stochastic dual coordinate ascent (SDCA) method [36]. We conduct experiments on a synthetic dataset and three real datasets.
(a) λ = 10^{-3}    (b) λ = 10^{-4}    (c) λ = 10^{-5}    (d) λ = 10^{-6}

Figure 1: Comparing SPDC with other methods on synthetic data, with the regularization coefficient λ ∈ {10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}}. The horizontal axis is the number of passes through the entire dataset, and the vertical axis is the logarithmic optimality gap log(P(x^{(T)}) - P(x*)).

6.1 Ridge regression with synthetic data

We first compare SPDC with the other algorithms on a simple quadratic problem using synthetic data. We generate n i.i.d. training examples {a_i, b_i}_{i=1}^n according to the model

    b = <a, \bar{x}> + ε,    a ~ N(0, Σ),    ε ~ N(0, 1),

where a ∈ R^d and \bar{x} is the all-ones vector. To make the problem ill-conditioned, the covariance matrix Σ is set to be diagonal with decreasing diagonal entries Σ_jj, j = 1, ..., d. Given the set of examples {a_i, b_i}_{i=1}^n, we then solve a standard ridge regression problem

    minimize_{x ∈ R^d}  P(x) := (1/n) Σ_{i=1}^n (1/2)(a_i^T x - b_i)^2 + (λ/2)||x||_2^2.

In the form of problem (1), we have φ_i(z) = (1/2)(z - b_i)^2 and g(x) = (λ/2)||x||_2^2. As a consequence, the derivative of φ_i is 1-Lipschitz continuous and g is λ-strongly convex.
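As a minimal sketch of this synthetic setup, the snippet below generates data from a diagonal covariance with decaying entries (the sizes n, d, the decay rate, and λ are our own illustrative choices, not the values used in the paper's experiments) and computes the global minimizer of the ridge objective in closed form:

```python
import numpy as np

# Illustrative sketch of the synthetic ridge-regression setup; all constants
# here are our own choices, not the paper's.
rng = np.random.default_rng(0)
n, d, lam = 200, 50, 1e-3

sigma_diag = 1.0 / (np.arange(1, d + 1) ** 2)      # decaying variances -> ill-conditioned
A = rng.normal(size=(n, d)) * np.sqrt(sigma_diag)  # rows a_i ~ N(0, diag(sigma_diag))
x_bar = np.ones(d)                                 # the all-ones vector
b = A @ x_bar + rng.normal(size=n)

def P(x):
    """Ridge objective P(x) = (1/n) sum_i (a_i^T x - b_i)^2 / 2 + (lam/2)||x||^2."""
    return 0.5 * np.mean((A @ x - b) ** 2) + 0.5 * lam * np.dot(x, x)

# For this quadratic problem the minimizer is available in closed form:
# x* = (A^T A / n + lam I)^{-1} (A^T b / n).
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)
```

The closed-form x* serves as the reference point P(x*) when plotting the optimality gap of the iterative methods.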
Dataset name    number of samples n    number of features d    sparsity
Covtype         581,012                54                      22%
RCV1            20,242                 47,236                  0.16%
News20          19,996                 1,355,191               0.04%

Table 3: Characteristics of the three real datasets, obtained from the LIBSVM data collection [2].

We evaluate the algorithms by the logarithmic optimality gap log(P(x^{(t)}) - P(x*)), where x^{(t)} is the output of an algorithm after t passes over the entire dataset, and x* is the global minimum. When the regularization coefficient is relatively large, e.g., λ = 10^{-3} or 10^{-4}, the problem is well-conditioned and we observe fast convergence of the stochastic algorithms SAG, SDCA and SPDC, which are substantially faster than the two batch methods AFG and L-BFGS. Figure 1 shows the convergence of the five different algorithms as we vary λ from 10^{-3} to 10^{-6}. As the plots show, when the condition number is greater than n, the SPDC algorithm converges substantially faster than the other two stochastic methods, SAG and SDCA. It is also notably faster than L-BFGS. These results support our theory that SPDC enjoys a faster convergence rate on ill-conditioned problems. In terms of their batch complexities, SPDC is up to √n times faster than AFG, and roughly (λn)^{-1/2} times faster than SAG and SDCA.

6.2 Binary classification with real data

Finally, we show the results of solving the binary classification problem on three real datasets. The datasets are obtained from the LIBSVM data collection [2] and summarized in Table 3. The three datasets are selected to reflect different relations between the sample size n and the feature dimensionality d: n ≫ d (Covtype), n ≈ d (RCV1) and n ≪ d (News20). For all tasks, the data points take the form (a_i, b_i), where a_i ∈ R^d is the feature vector and b_i ∈ {-1, 1} is the binary class label. Our goal is to minimize the regularized empirical risk

    P(x) = (1/n) Σ_{i=1}^n φ_i(a_i^T x) + (λ/2)||x||_2^2,

where

    φ_i(z) = 0                    if b_i z ≥ 1,
             1/2 - b_i z          if b_i z ≤ 0,
             (1/2)(1 - b_i z)^2   otherwise.

Here, φ_i is the smoothed hinge loss (see, e.g., [36]). It is easy to verify that the conjugate function of φ_i is φ_i*(β) = b_i β + β^2/2 for b_i β ∈ [-1, 0], and +∞ otherwise.

The performance of the five algorithms is plotted in Figures 2 and 3. In Figure 2, we compare SPDC with the two batch methods, AFG and L-BFGS. The results show that SPDC is substantially faster than AFG and L-BFGS for relatively large λ, illustrating the advantage of stochastic methods over batch methods on
well-conditioned problems. As λ decreases toward zero, the batch methods (especially L-BFGS) become comparable to SPDC. In Figure 3, we compare SPDC with the two stochastic methods, SAG and SDCA. Here, the observations are just the opposite of those in Figure 2: the three stochastic algorithms have comparable performance for relatively large λ, but SPDC becomes substantially faster as λ gets closer to zero. Summarizing Figures 2 and 3, the performance of SPDC is always comparable to or better than that of the other methods in comparison.
Figure 2: Comparing SPDC with AFG and L-BFGS on the three real datasets (RCV1, Covtype and News20) with the smoothed hinge loss, for several values of λ. The horizontal axis is the number of passes through the entire dataset, and the vertical axis is the logarithmic optimality gap log(P(x^{(t)}) - P(x*)). The SPDC algorithm is faster than the two batch methods when λ is relatively large.
Figure 3: Comparing SPDC with SAG and SDCA on the three real datasets (RCV1, Covtype and News20) with the smoothed hinge loss, for several values of λ. The horizontal axis is the number of passes through the entire dataset, and the vertical axis is the logarithmic optimality gap log(P(x^{(T)}) - P(x*)). The SPDC algorithm is faster than the other two stochastic methods when λ is small.
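For concreteness, the primal-dual iteration compared in these experiments can be sketched for the ridge regression case (uniform sampling, mini-batch size one). This is our own minimal illustration, not the authors' implementation: the closed-form dual and primal steps below specialize the generic prox updates to φ_i(z) = (z - b_i)^2/2 (so φ_i*(y) = y b_i + y^2/2) and g(x) = (λ/2)||x||^2, with step sizes following the theory with γ = 1, and all data sizes chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam, gamma = 100, 20, 1e-2, 1.0          # gamma: smoothness constant of phi_i
A = rng.normal(size=(n, d)) / np.sqrt(d)       # rows a_i (illustrative data)
b = A @ np.ones(d) + 0.1 * rng.normal(size=n)

R = np.max(np.linalg.norm(A, axis=1))          # R >= ||a_i||_2 for all i
tau = 0.5 / R * np.sqrt(gamma / (n * lam))     # primal step size
sigma = 0.5 / R * np.sqrt(n * lam / gamma)     # dual step size
theta = 1.0 - 1.0 / (n + R * np.sqrt(n / (lam * gamma)))  # extrapolation weight

def P(x):
    return 0.5 * np.mean((A @ x - b) ** 2) + 0.5 * lam * np.dot(x, x)

x = np.zeros(d); x_ex = x.copy()               # x_ex plays the role of x-bar
y = np.zeros(n)
u = A.T @ y / n                                # u^{(t)} = (1/n) sum_i y_i a_i

for t in range(20 * n):                        # roughly 20 passes over the data
    i = rng.integers(n)
    # dual step: maximize y<a_i, x_ex> - phi_i*(y) - (y - y_i)^2/(2 sigma)
    y_new = (y[i] + sigma * (A[i] @ x_ex - b[i])) / (1.0 + sigma)
    # primal proximal step for g(x) = (lam/2)||x||^2
    x_new = (x - tau * (u + (y_new - y[i]) * A[i])) / (1.0 + lam * tau)
    # extrapolation, then bookkeeping
    x_ex = x_new + theta * (x_new - x)
    u = u + (y_new - y[i]) / n * A[i]
    y[i] = y_new
    x = x_new
```

Running the loop drives the primal objective P(x) down from its value at the zero vector; the per-iteration cost is dominated by two O(d) inner products, as the text emphasizes.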
A Proof of Theorem 1

We focus on characterizing the values of x and y after the t-th update in Algorithm 2. For any i ∈ {1, ..., n}, let ỹ_i be the value that y_i^{(t+1)} would take if i ∈ K, i.e.,

    ỹ_i = argmax_{y ∈ R} { y <a_i, \bar{x}^{(t)}> - φ_i*(y) - (y - y_i^{(t)})^2/(2σ) }.

Since φ_i is (1/γ)-smooth by assumption, its conjugate φ_i* is γ-strongly convex (e.g., [14, Theorem 4.2.2]). Thus the function being maximized above is (1/σ + γ)-strongly concave. Therefore,

    y_i* <a_i, \bar{x}^{(t)}> - φ_i*(y_i*) - (y_i* - y_i^{(t)})^2/(2σ)
      ≤ ỹ_i <a_i, \bar{x}^{(t)}> - φ_i*(ỹ_i) - (ỹ_i - y_i^{(t)})^2/(2σ) - (1/2)(1/σ + γ)(ỹ_i - y_i*)^2.

On the other hand, since y_i* minimizes φ_i*(y) - y <a_i, x*> (by the property of the saddle point), we have

    φ_i*(ỹ_i) - ỹ_i <a_i, x*> ≥ φ_i*(y_i*) - y_i* <a_i, x*> + (γ/2)(ỹ_i - y_i*)^2.

Summing up the above two inequalities, we obtain

    (y_i^{(t)} - y_i*)^2/(2σ) ≥ (1/(2σ) + γ)(ỹ_i - y_i*)^2 + (ỹ_i - y_i^{(t)})^2/(2σ) + (ỹ_i - y_i*) <a_i, x* - \bar{x}^{(t)}>.    (32)

According to Algorithm 2, the set K of indices to be updated is chosen randomly. For every specific index i, the event i ∈ K happens with probability m/n. If i ∈ K, then y_i^{(t+1)} is updated to the value ỹ_i, which satisfies inequality (32); otherwise y_i^{(t+1)} keeps its old value y_i^{(t)}. Let F_t be the sigma field generated by all random variables defined before round t. Taking expectations conditioned on F_t, we have

    E[(y_i^{(t+1)} - y_i*)^2 | F_t] = (m/n)(ỹ_i - y_i*)^2 + (1 - m/n)(y_i^{(t)} - y_i*)^2,
    E[(y_i^{(t+1)} - y_i^{(t)})^2 | F_t] = (m/n)(ỹ_i - y_i^{(t)})^2,
    E[y_i^{(t+1)} | F_t] = (m/n) ỹ_i + (1 - m/n) y_i^{(t)}.

As a result, we can represent (ỹ_i - y_i*)^2, (ỹ_i - y_i^{(t)})^2 and ỹ_i in terms of the conditional expectations of (y_i^{(t+1)} - y_i*)^2, (y_i^{(t+1)} - y_i^{(t)})^2 and y_i^{(t+1)}. Plugging these representations into inequality (32), we have

    (n/(2mσ) + (n - m)γ/m)(y_i^{(t)} - y_i*)^2
      ≥ (n/(2mσ) + nγ/m) E[(y_i^{(t+1)} - y_i*)^2 | F_t] + (n/(2mσ)) E[(y_i^{(t+1)} - y_i^{(t)})^2 | F_t]
        + <a_i, x* - \bar{x}^{(t)}> ( (n/m) E[y_i^{(t+1)} | F_t] - ((n - m)/m) y_i^{(t)} - y_i* ).    (33)
Then, summing over all indices i = 1, 2, ..., n and dividing both sides of the inequality by n, we have

    (1/(2mσ) + (n - m)γ/(mn)) ||y^{(t)} - y*||_2^2
      ≥ (1/(2mσ) + γ/m) E[||y^{(t+1)} - y*||_2^2 | F_t] + (1/(2mσ)) E[||y^{(t+1)} - y^{(t)}||_2^2 | F_t]
        + E[ <u^{(t)} - u* + (1/m) Σ_{k∈K} (y_k^{(t+1)} - y_k^{(t)}) a_k,  x* - \bar{x}^{(t)}> | F_t ],    (34)

where u* = (1/n) Σ_{i=1}^n y_i* a_i is a shorthand notation, and u^{(t)} = (1/n) Σ_{i=1}^n y_i^{(t)} a_i is defined in Algorithm 2. We used the fact that (1/n) Σ_{i=1}^n (y_i^{(t+1)} - y_i^{(t)}) a_i = (1/n) Σ_{k∈K} (y_k^{(t+1)} - y_k^{(t)}) a_k, since only the coordinates in K are updated.

We still need an inequality characterizing the relation between x^{(t+1)} and x^{(t)}. Following the same steps used to derive inequality (32), and using the λ-strong convexity of the function g, it is not difficult to show that

    ||x^{(t)} - x*||_2^2/(2τ)
      ≥ (1/(2τ) + λ) ||x^{(t+1)} - x*||_2^2 + ||x^{(t+1)} - x^{(t)}||_2^2/(2τ)
        + <u^{(t)} - u* + (1/m) Σ_{k∈K} (y_k^{(t+1)} - y_k^{(t)}) a_k,  x^{(t+1)} - x*>.    (35)

Taking expectation over both sides of inequality (35) and adding it to inequality (34), we have

    ||x^{(t)} - x*||_2^2/(2τ) + (1/(2mσ) + (n - m)γ/(mn)) ||y^{(t)} - y*||_2^2
      ≥ (1/(2τ) + λ) E[||x^{(t+1)} - x*||_2^2 | F_t] + (1/(2mσ) + γ/m) E[||y^{(t+1)} - y*||_2^2 | F_t]
        + E[||x^{(t+1)} - x^{(t)}||_2^2 | F_t]/(2τ) + E[||y^{(t+1)} - y^{(t)}||_2^2 | F_t]/(2mσ)
        + E[ (1/n)(y^{(t)} - y* + (n/m)(y^{(t+1)} - y^{(t)}))^T A (x^{(t+1)} - x^{(t)} - θ(x^{(t)} - x^{(t-1)})) | F_t ].    (36)

For the last term of inequality (36), we have plugged in the definitions of u^{(t)}, u* and \bar{x}^{(t)}, and used the relation (y^{(t+1)} - y^{(t)})^T A = Σ_{k∈K} (y_k^{(t+1)} - y_k^{(t)}) a_k^T. The matrix A is an n-by-d matrix whose i-th row is equal to the vector a_i^T.

For the rest of the proof, we lower bound the last term on the right-hand side of inequality (36). In particular, we have

    (1/n)(y^{(t)} - y* + (n/m)(y^{(t+1)} - y^{(t)}))^T A (x^{(t+1)} - x^{(t)} - θ(x^{(t)} - x^{(t-1)}))
      = (1/n)(y^{(t+1)} - y*)^T A (x^{(t+1)} - x^{(t)}) - (θ/n)(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)})
        + ((n - m)/(nm))(y^{(t+1)} - y^{(t)})^T A (x^{(t+1)} - x^{(t)}) - (θ/m)(y^{(t+1)} - y^{(t)})^T A (x^{(t)} - x^{(t-1)}).    (37)
Recall that ||a_k||_2 ≤ R for all k, and τσ = 1/(4R^2) according to the choice of parameters. By Young's inequality, we have

    |(y^{(t+1)} - y^{(t)})^T A (x^{(t+1)} - x^{(t)})|
      ≤ (m/(4τ)) ||x^{(t+1)} - x^{(t)}||_2^2 + ||(y^{(t+1)} - y^{(t)})^T A||_2^2/(m/τ)
      = (m/(4τ)) ||x^{(t+1)} - x^{(t)}||_2^2 + ||Σ_{k∈K} (y_k^{(t+1)} - y_k^{(t)}) a_k||_2^2/(4mσR^2)
      ≤ (m/(4τ)) ||x^{(t+1)} - x^{(t)}||_2^2 + (m/(4σ)) ||y^{(t+1)} - y^{(t)}||_2^2/m · m
      = (m/(4τ)) ||x^{(t+1)} - x^{(t)}||_2^2 + (m/(4σ)) ||y^{(t+1)} - y^{(t)}||_2^2/m,

where the last inequality uses ||Σ_{k∈K} (y_k^{(t+1)} - y_k^{(t)}) a_k||_2^2 ≤ m R^2 ||y^{(t+1)} - y^{(t)}||_2^2. Similarly, we have

    |(y^{(t+1)} - y^{(t)})^T A (x^{(t)} - x^{(t-1)})| ≤ (m/(4τ)) ||x^{(t)} - x^{(t-1)}||_2^2 + ||y^{(t+1)} - y^{(t)}||_2^2/(4σ).

The above upper bounds on the absolute values imply

    (1/m)(y^{(t+1)} - y^{(t)})^T A (x^{(t+1)} - x^{(t)}) ≥ -||x^{(t+1)} - x^{(t)}||_2^2/(4τ) - ||y^{(t+1)} - y^{(t)}||_2^2/(4mσ),
    (1/m)(y^{(t+1)} - y^{(t)})^T A (x^{(t)} - x^{(t-1)}) ≥ -||x^{(t)} - x^{(t-1)}||_2^2/(4τ) - ||y^{(t+1)} - y^{(t)}||_2^2/(4mσ).

Combining the above two inequalities with the expansion (37) and the lower bound (36), and using θ ≤ 1 to absorb the ||y^{(t+1)} - y^{(t)}||_2^2 terms, we obtain

    ||x^{(t)} - x*||_2^2/(2τ) + (1/(2mσ) + (n - m)γ/(mn)) ||y^{(t)} - y*||_2^2
      ≥ (1/(2τ) + λ) E[||x^{(t+1)} - x*||_2^2 | F_t] + (1/(2mσ) + γ/m) E[||y^{(t+1)} - y*||_2^2 | F_t]
        + E[||x^{(t+1)} - x^{(t)}||_2^2 | F_t]/(4τ) - θ ||x^{(t)} - x^{(t-1)}||_2^2/(4τ)
        + (1/n) E[(y^{(t+1)} - y*)^T A (x^{(t+1)} - x^{(t)}) | F_t] - (θ/n)(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)}).    (38)

Recall that the parameters τ, σ and θ are chosen as

    τ = (1/(2R)) √(mγ/(nλ)),    σ = (1/(2R)) √(nλ/(mγ)),    θ = 1 - ( n/m + R √(n/(mλγ)) )^{-1}.

Plugging in these assignments, we find that

    (1/(2τ)) / (1/(2τ) + λ) = 1/(1 + 2τλ) ≤ θ    and    (1/(2σ) + (n - m)γ/n) / (1/(2σ) + γ) = 1 - 1/(n/m + n/(2mσγ)) = θ.

Therefore, if we define a sequence Δ^{(t)} such that

    Δ^{(t)} = (1/(2τ) + λ) E[||x^{(t)} - x*||_2^2] + (1/(2mσ) + γ/m) E[||y^{(t)} - y*||_2^2]
              + E[||x^{(t)} - x^{(t-1)}||_2^2]/(4τ) + (1/n) E[(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)})],
then inequality (38) implies the recursive relation Δ^{(t+1)} ≤ θ Δ^{(t)}, which implies

    (1/(2τ) + λ) E[||x^{(t)} - x*||_2^2] + (1/(2mσ) + γ/m) E[||y^{(t)} - y*||_2^2]
      + E[||x^{(t)} - x^{(t-1)}||_2^2]/(4τ) + (1/n) E[(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)})]
      ≤ θ^t ( (1/(2τ) + λ) ||x^{(0)} - x*||_2^2 + (1/(2mσ) + γ/m) ||y^{(0)} - y*||_2^2 ),    (39)

where the last two terms of Δ^{(0)} vanish because x^{(-1)} = x^{(0)}. To eliminate the last two terms on the left-hand side of inequality (39), we notice that

    |(1/n)(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)})|
      ≤ ||x^{(t)} - x^{(t-1)}||_2^2/(4τ) + (τ/n^2) ||A||_2^2 ||y^{(t)} - y*||_2^2
      ≤ ||x^{(t)} - x^{(t-1)}||_2^2/(4τ) + (τ R^2/n) ||y^{(t)} - y*||_2^2
      = ||x^{(t)} - x^{(t-1)}||_2^2/(4τ) + ||y^{(t)} - y*||_2^2/(4σn)
      ≤ ||x^{(t)} - x^{(t-1)}||_2^2/(4τ) + ||y^{(t)} - y*||_2^2/(4mσ),

where in the second inequality we used ||A||_2^2 ≤ ||A||_F^2 ≤ nR^2, in the equality we used τσ = 1/(4R^2), and in the last inequality we used m ≤ n. The above upper bound on the absolute value implies

    (1/n)(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)}) ≥ -||x^{(t)} - x^{(t-1)}||_2^2/(4τ) - ||y^{(t)} - y*||_2^2/(4mσ).

The theorem is established by combining the above inequality with inequality (39).

B Proof of Theorem 2

The proof of Theorem 2 mimics the steps used to prove Theorem 1. We start by establishing relations between (y^{(t)}, y^{(t+1)}) and between (x^{(t)}, x^{(t+1)}). Suppose that the quantity ỹ_i minimizes the function

    φ_i*(y) - y <a_i, \bar{x}^{(t)}> + (p_i n/(2σ)) (y - y_i^{(t)})^2.

Then, following the same argument used to establish inequality (32), we obtain

    (p_i n/(2σ)) (y_i^{(t)} - y_i*)^2
      ≥ (p_i n/(2σ) + γ)(ỹ_i - y_i*)^2 + (p_i n/(2σ)) (ỹ_i - y_i^{(t)})^2 + (ỹ_i - y_i*) <a_i, x* - \bar{x}^{(t)}>.    (40)

Note that i = k with probability p_k. Therefore, we have

    (ỹ_i - y_i*)^2 = (1/p_i) E[(y_i^{(t+1)} - y_i*)^2 | F_t] - ((1 - p_i)/p_i)(y_i^{(t)} - y_i*)^2,
    (ỹ_i - y_i^{(t)})^2 = (1/p_i) E[(y_i^{(t+1)} - y_i^{(t)})^2 | F_t],
    ỹ_i = (1/p_i) E[y_i^{(t+1)} | F_t] - ((1 - p_i)/p_i) y_i^{(t)},
where F_t represents the sigma field generated by all random variables defined before iteration t. Substituting the above equations into inequality (40), summing over i = 1, 2, ..., n, and dividing by n, we have

    (1/n) Σ_{i=1}^n (n/(2σ) + (1 - p_i)γ/p_i)(y_i^{(t)} - y_i*)^2
      ≥ (1/n) Σ_{i=1}^n (n/(2σ) + γ/p_i) E[(y_i^{(t+1)} - y_i*)^2 | F_t] + (1/(2σ)) E[||y^{(t+1)} - y^{(t)}||_2^2 | F_t]
        + E[ <u^{(t)} - u* + (y_k^{(t+1)} - y_k^{(t)}) a_k/(n p_k),  x* - \bar{x}^{(t)}> | F_t ],    (41)

where u* = (1/n) Σ_{i=1}^n y_i* a_i and u^{(t)} = (1/n) Σ_{i=1}^n y_i^{(t)} a_i have the same definitions as in the proof of Theorem 1.

For the relation between x^{(t)} and x^{(t+1)}, we follow the steps in the proof of Theorem 1 to obtain

    ||x^{(t)} - x*||_2^2/(2τ)
      ≥ (1/(2τ) + λ) ||x^{(t+1)} - x*||_2^2 + ||x^{(t+1)} - x^{(t)}||_2^2/(2τ)
        + <u^{(t)} - u* + (y_k^{(t+1)} - y_k^{(t)}) a_k/(n p_k),  x^{(t+1)} - x*>.    (42)

Taking expectation over both sides of inequality (42) and adding it to inequality (41) yields

    ||x^{(t)} - x*||_2^2/(2τ) + (1/n) Σ_{i=1}^n (n/(2σ) + (1 - p_i)γ/p_i)(y_i^{(t)} - y_i*)^2
      ≥ (1/(2τ) + λ) E[||x^{(t+1)} - x*||_2^2 | F_t] + (1/n) Σ_{i=1}^n (n/(2σ) + γ/p_i) E[(y_i^{(t+1)} - y_i*)^2 | F_t]
        + E[||x^{(t+1)} - x^{(t)}||_2^2 | F_t]/(2τ) + (1/(2σ)) E[||y^{(t+1)} - y^{(t)}||_2^2 | F_t]
        + E[ (1/n)( (y^{(t)} - y*)^T A + (y_k^{(t+1)} - y_k^{(t)}) a_k^T/p_k ) (x^{(t+1)} - x^{(t)} - θ(x^{(t)} - x^{(t-1)})) | F_t ],    (43)
                 └──────────────────────────── v ────────────────────────────┘

where the matrix A is the n-by-d matrix whose i-th row is equal to the vector a_i^T. Next, we lower bound the last term v on the right-hand side of inequality (43). Indeed, it can be expanded as

    v = (1/n)(y^{(t+1)} - y*)^T A (x^{(t+1)} - x^{(t)}) - (θ/n)(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)})
        + ((1 - p_k)/(n p_k))(y_k^{(t+1)} - y_k^{(t)}) a_k^T (x^{(t+1)} - x^{(t)})
        - (θ/(n p_k))(y_k^{(t+1)} - y_k^{(t)}) a_k^T (x^{(t)} - x^{(t-1)}).    (44)

Note that the probability p_k given in (7) satisfies p_k ≥ ||a_k||_2/(2 Σ_{i=1}^n ||a_i||_2) = ||a_k||_2/(2n\bar{R}) for k = 1, ..., n, where \bar{R} = (1/n) Σ_{i=1}^n ||a_i||_2. Since the parameters τ and σ satisfy στ \bar{R}^2 = 1/16, we have p_k^2 n^2/τ ≥ 4σ ||a_k||_2^2, and consequently

    |(1/(n p_k))(y_k^{(t+1)} - y_k^{(t)}) a_k^T (x^{(t+1)} - x^{(t)})|
      ≤ ||x^{(t+1)} - x^{(t)}||_2^2/(4τ) + (y_k^{(t+1)} - y_k^{(t)})^2 ||a_k||_2^2 (τ/(p_k^2 n^2))
      ≤ ||x^{(t+1)} - x^{(t)}||_2^2/(4τ) + (y_k^{(t+1)} - y_k^{(t)})^2/(4σ).
Similarly, we have

    |(1/(n p_k))(y_k^{(t+1)} - y_k^{(t)}) a_k^T (x^{(t)} - x^{(t-1)})| ≤ ||x^{(t)} - x^{(t-1)}||_2^2/(4τ) + (y_k^{(t+1)} - y_k^{(t)})^2/(4σ).

Combining the above two inequalities with the lower bounds (43) and (44), and using θ ≤ 1, we obtain

    ||x^{(t)} - x*||_2^2/(2τ) + (1/n) Σ_{i=1}^n (n/(2σ) + (1 - p_i)γ/p_i)(y_i^{(t)} - y_i*)^2
      ≥ (1/(2τ) + λ) E[||x^{(t+1)} - x*||_2^2 | F_t] + (1/n) Σ_{i=1}^n (n/(2σ) + γ/p_i) E[(y_i^{(t+1)} - y_i*)^2 | F_t]
        + E[||x^{(t+1)} - x^{(t)}||_2^2 | F_t]/(4τ) - θ ||x^{(t)} - x^{(t-1)}||_2^2/(4τ)
        + (1/n) E[(y^{(t+1)} - y*)^T A (x^{(t+1)} - x^{(t)}) | F_t] - (θ/n)(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)}).    (45)

Recall that the parameters τ, σ and θ are chosen as

    τ = (1/(4\bar{R})) √(γ/(nλ)),    σ = (1/(4\bar{R})) √(nλ/γ),    θ = 1 - ( 2n + 2\bar{R} √(n/(λγ)) )^{-1}.

Plugging in these assignments and using the fact that p_i ≥ 1/(2n), we find that 1/(2τ)/(1/(2τ) + λ) ≤ θ and (n/(2σ) + (1 - p_i)γ/p_i)/(n/(2σ) + γ/p_i) ≤ θ for i = 1, 2, ..., n. Therefore, if we define a sequence Δ^{(t)} such that

    Δ^{(t)} = (1/(2τ) + λ) E[||x^{(t)} - x*||_2^2] + (1/n) Σ_{i=1}^n (n/(2σ) + γ/p_i) E[(y_i^{(t)} - y_i*)^2]
              + E[||x^{(t)} - x^{(t-1)}||_2^2]/(4τ) + (1/n) E[(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)})],

then inequality (45) implies the recursive relation Δ^{(t+1)} ≤ θ Δ^{(t)}, which implies

    (1/(2τ) + λ) E[||x^{(t)} - x*||_2^2] + (1/(2σ) + γ/n) E[||y^{(t)} - y*||_2^2]
      + E[||x^{(t)} - x^{(t-1)}||_2^2]/(4τ) + (1/n) E[(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)})]
      ≤ θ^t ( (1/(2τ) + λ) ||x^{(0)} - x*||_2^2 + (1/(2σ) + 2γ) ||y^{(0)} - y*||_2^2 ),    (46)

where on the left-hand side we used (1/n)(n/(2σ) + γ/p_i) ≥ 1/(2σ) + γ/n, on the right-hand side we used (1/n)(n/(2σ) + γ/p_i) ≤ 1/(2σ) + 2γ (since p_i ≥ 1/(2n)), and the last two terms of Δ^{(0)} vanish because x^{(-1)} = x^{(0)}.
To eliminate the last two terms on the left-hand side of inequality (46), we notice that

    |(1/n)(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)})|
      ≤ ||x^{(t)} - x^{(t-1)}||_2^2/(4τ) + ||A||_2^2 ||y^{(t)} - y*||_2^2/(n^2/τ)
      ≤ ||x^{(t)} - x^{(t-1)}||_2^2/(4τ) + ||A||_F^2 ||y^{(t)} - y*||_2^2/(n^2/τ)
      = ||x^{(t)} - x^{(t-1)}||_2^2/(4τ) + (Σ_{i=1}^n ||a_i||_2^2) ||y^{(t)} - y*||_2^2 / (16σ (Σ_{i=1}^n ||a_i||_2)^2)
      ≤ ||x^{(t)} - x^{(t-1)}||_2^2/(4τ) + ||y^{(t)} - y*||_2^2/(16σ),

where in the equality we used n^2/τ = 16σ n^2 \bar{R}^2 = 16σ (Σ_{i=1}^n ||a_i||_2)^2, and in the last inequality we used Σ_{i=1}^n ||a_i||_2^2 ≤ (Σ_{i=1}^n ||a_i||_2)^2. This implies

    (1/n)(y^{(t)} - y*)^T A (x^{(t)} - x^{(t-1)}) ≥ -||x^{(t)} - x^{(t-1)}||_2^2/(4τ) - ||y^{(t)} - y*||_2^2/(16σ).

Substituting the above inequality into inequality (46) completes the proof.

C Efficient update for the (ℓ1 + ℓ2)-norm penalty

From Section 5.2, we have the following recursive formula for t ∈ [t_1+1, t_2]:

    x_i^{(t+1)} = (x_i^{(t)} - τ u_i^{(t_1+1)} - τλ_1)/(1 + λ_2 τ)   if x_i^{(t)} - τ u_i^{(t_1+1)} > τλ_1,
                  (x_i^{(t)} - τ u_i^{(t_1+1)} + τλ_1)/(1 + λ_2 τ)   if x_i^{(t)} - τ u_i^{(t_1+1)} < -τλ_1,
                  0   otherwise.    (47)

Given x_i^{(t_1+1)}, we present an efficient algorithm for calculating x_i^{(t_2)}. We begin by examining the sign of x_i^{(t_1+1)}.

Case I (x_i^{(t_1+1)} = 0): If u_i^{(t_1+1)} < -λ_1, then equation (47) implies x_i^{(t)} > 0 for all t > t_1+1. Consequently, we have a closed-form formula for x_i^{(t_2)}:

    x_i^{(t_2)} = (1 + λ_2 τ)^{-(t_2 - t_1 - 1)} ( x_i^{(t_1+1)} + (u_i^{(t_1+1)} + λ_1)/λ_2 ) - (u_i^{(t_1+1)} + λ_1)/λ_2.    (48)

If u_i^{(t_1+1)} > λ_1, then equation (47) implies x_i^{(t)} < 0 for all t > t_1+1. Therefore, we have the closed-form formula

    x_i^{(t_2)} = (1 + λ_2 τ)^{-(t_2 - t_1 - 1)} ( x_i^{(t_1+1)} + (u_i^{(t_1+1)} - λ_1)/λ_2 ) - (u_i^{(t_1+1)} - λ_1)/λ_2.    (49)

Finally, if u_i^{(t_1+1)} ∈ [-λ_1, λ_1], then equation (47) implies x_i^{(t_2)} = 0.
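As a quick sanity check on the closed form (48) above, the following snippet (our own illustration, with arbitrary parameter values chosen so the iterate stays on the positive branch) iterates the recursion (47) directly and compares the result with the closed-form expression:

```python
# Sanity check: iterate the recursion (47) on one coordinate and compare with
# the closed form (48). Parameter values are arbitrary illustrative choices,
# picked so that x stays on the positive branch (u < -lam1 and x0 = 0).
lam1, lam2, tau = 0.5, 1.0, 0.1
u, x0, steps = -2.0, 0.0, 25   # u_i^{(t1+1)}, x_i^{(t1+1)}, and t2 - t1 - 1

x = x0
for _ in range(steps):
    # positive branch of (47); verify we never leave it in this regime
    assert x - tau * u > tau * lam1
    x = (x - tau * u - tau * lam1) / (1.0 + lam2 * tau)

c = (u + lam1) / lam2                                       # -c is the fixed point
x_closed = (1.0 + lam2 * tau) ** (-steps) * (x0 + c) - c    # formula (48)
```

The closed form replaces the `steps`-iteration loop with a single power computation, which is what gives the O(1) lazy update.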
Case II (x_i^{(t_1+1)} > 0): If u_i^{(t_1+1)} ≤ -λ_1, then it is easy to verify that x_i^{(t_2)} is obtained by equation (48). Otherwise, we use the recursive formula (47) to derive the latest time t_3 ∈ [t_1+1, t_2] such that x_i^{(t_3)} > 0 holds. Indeed, since x_i^{(t)} > 0 for all t ∈ [t_1+1, t_3], we have the closed-form formula

    x_i^{(t_3)} = (1 + λ_2 τ)^{-(t_3 - t_1 - 1)} ( x_i^{(t_1+1)} + (u_i^{(t_1+1)} + λ_1)/λ_2 ) - (u_i^{(t_1+1)} + λ_1)/λ_2.    (50)

We look for the largest t_3 such that the right-hand side of equation (50) is positive, which is equivalent to

    t_3 - t_1 - 1 < log( 1 + λ_2 x_i^{(t_1+1)}/(u_i^{(t_1+1)} + λ_1) ) / log(1 + λ_2 τ).    (51)

Thus, t_3 is the largest integer in [t_1+1, t_2] such that inequality (51) holds. If t_3 = t_2, then x_i^{(t_2)} is obtained by (50). Otherwise, we can calculate x_i^{(t_3+1)} by formula (47), then resort to Case I or Case III, treating t_3 as t_1.

Case III (x_i^{(t_1+1)} < 0): If u_i^{(t_1+1)} ≥ λ_1, then x_i^{(t_2)} is obtained by equation (49). Otherwise, we calculate the largest integer t_4 ∈ [t_1+1, t_2] such that x_i^{(t_4)} < 0 holds. Using the same argument as in Case II, we have the closed-form expression

    x_i^{(t_4)} = (1 + λ_2 τ)^{-(t_4 - t_1 - 1)} ( x_i^{(t_1+1)} + (u_i^{(t_1+1)} - λ_1)/λ_2 ) - (u_i^{(t_1+1)} - λ_1)/λ_2,    (52)

where t_4 is the largest integer in [t_1+1, t_2] such that the following inequality holds:

    t_4 - t_1 - 1 < log( 1 + λ_2 x_i^{(t_1+1)}/(u_i^{(t_1+1)} - λ_1) ) / log(1 + λ_2 τ).    (53)

If t_4 = t_2, then x_i^{(t_2)} is obtained by (52). Otherwise, we can calculate x_i^{(t_4+1)} by formula (47), then resort to Case I or Case II, treating t_4 as t_1.

Finally, we note that formula (47) implies the monotonicity of x_i^{(t)} for t = t_1+1, t_1+2, .... As a consequence, the procedure of either Case I, Case II or Case III is executed at most once. Hence, the algorithm for calculating x_i^{(t_2)} has O(1) time complexity.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

[2] D. P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, Ser. B, 129:163-195, 2011.
[3] D. P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, chapter 4. The MIT Press, 2012.

[4] D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29-51, 2007.

[5] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Y. Lechevallier and G. Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT 2010), pages 177-187, Paris, France, August 2010. Springer.

[6] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.

[7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

[8] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.

[9] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale l2-loss linear support vector machines. Journal of Machine Learning Research, 9:1369-1398, 2008.

[10] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.

[11] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899-2934, 2009.

[12] R.-E. Fan and C.-J. Lin. LIBSVM data: Classification, regression and multi-label. URL: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets, 2011.

[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2nd edition, 2009.

[14] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.

[15] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 408-415, 2008.

[16] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315-323, 2013.

[17] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777-801, 2009.

[18] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. Technical Report MSR-TR-2014-94, Microsoft Research, 2014.
More information8.1 Hashing Algorithms
CS787: Advaced Algorthms Scrbe: Mayak Maheshwar, Chrs Hrchs Lecturer: Shuch Chawla Topc: Hashg ad NP-Completeess Date: September 21 2007 Prevously we looked at applcatos of radomzed algorthms, ad bega
More informationDiscrete Mathematics and Probability Theory Fall 2016 Seshia and Walrand DIS 10b
CS 70 Dscrete Mathematcs ad Probablty Theory Fall 206 Sesha ad Walrad DIS 0b. Wll I Get My Package? Seaky delvery guy of some compay s out delverg packages to customers. Not oly does he had a radom package
More informationKernel-based Methods and Support Vector Machines
Kerel-based Methods ad Support Vector Maches Larr Holder CptS 570 Mache Learg School of Electrcal Egeerg ad Computer Scece Washgto State Uverst Refereces Muller et al. A Itroducto to Kerel-Based Learg
More informationSome Notes on the Probability Space of Statistical Surveys
Metodološk zvezk, Vol. 7, No., 200, 7-2 ome Notes o the Probablty pace of tatstcal urveys George Petrakos Abstract Ths paper troduces a formal presetato of samplg process usg prcples ad cocepts from Probablty
More informationMATH 247/Winter Notes on the adjoint and on normal operators.
MATH 47/Wter 00 Notes o the adjot ad o ormal operators I these otes, V s a fte dmesoal er product space over, wth gve er * product uv, T, S, T, are lear operators o V U, W are subspaces of V Whe we say
More informationMu Sequences/Series Solutions National Convention 2014
Mu Sequeces/Seres Solutos Natoal Coveto 04 C 6 E A 6C A 6 B B 7 A D 7 D C 7 A B 8 A B 8 A C 8 E 4 B 9 B 4 E 9 B 4 C 9 E C 0 A A 0 D B 0 C C Usg basc propertes of arthmetc sequeces, we fd a ad bm m We eed
More informationNaïve Bayes MIT Course Notes Cynthia Rudin
Thaks to Şeyda Ertek Credt: Ng, Mtchell Naïve Bayes MIT 5.097 Course Notes Cytha Rud The Naïve Bayes algorthm comes from a geeratve model. There s a mportat dstcto betwee geeratve ad dscrmatve models.
More informationBounds on the expected entropy and KL-divergence of sampled multinomial distributions. Brandon C. Roy
Bouds o the expected etropy ad KL-dvergece of sampled multomal dstrbutos Brado C. Roy bcroy@meda.mt.edu Orgal: May 18, 2011 Revsed: Jue 6, 2011 Abstract Iformato theoretc quattes calculated from a sampled
More information1 Onto functions and bijections Applications to Counting
1 Oto fuctos ad bectos Applcatos to Coutg Now we move o to a ew topc. Defto 1.1 (Surecto. A fucto f : A B s sad to be surectve or oto f for each b B there s some a A so that f(a B. What are examples of
More informationC-1: Aerodynamics of Airfoils 1 C-2: Aerodynamics of Airfoils 2 C-3: Panel Methods C-4: Thin Airfoil Theory
ROAD MAP... AE301 Aerodyamcs I UNIT C: 2-D Arfols C-1: Aerodyamcs of Arfols 1 C-2: Aerodyamcs of Arfols 2 C-3: Pael Methods C-4: Th Arfol Theory AE301 Aerodyamcs I Ut C-3: Lst of Subects Problem Solutos?
More informationChapter 9 Jordan Block Matrices
Chapter 9 Jorda Block atrces I ths chapter we wll solve the followg problem. Gve a lear operator T fd a bass R of F such that the matrx R (T) s as smple as possble. f course smple s a matter of taste.
More informationLecture Note to Rice Chapter 8
ECON 430 HG revsed Nov 06 Lecture Note to Rce Chapter 8 Radom matrces Let Y, =,,, m, =,,, be radom varables (r.v. s). The matrx Y Y Y Y Y Y Y Y Y Y = m m m s called a radom matrx ( wth a ot m-dmesoal dstrbuto,
More informationChapter 4 Multiple Random Variables
Revew for the prevous lecture: Theorems ad Examples: How to obta the pmf (pdf) of U = g (, Y) ad V = g (, Y) Chapter 4 Multple Radom Varables Chapter 44 Herarchcal Models ad Mxture Dstrbutos Examples:
More informationChapter 8. Inferences about More Than Two Population Central Values
Chapter 8. Ifereces about More Tha Two Populato Cetral Values Case tudy: Effect of Tmg of the Treatmet of Port-We tas wth Lasers ) To vestgate whether treatmet at a youg age would yeld better results tha
More informationCubic Nonpolynomial Spline Approach to the Solution of a Second Order Two-Point Boundary Value Problem
Joural of Amerca Scece ;6( Cubc Nopolyomal Sple Approach to the Soluto of a Secod Order Two-Pot Boudary Value Problem W.K. Zahra, F.A. Abd El-Salam, A.A. El-Sabbagh ad Z.A. ZAk * Departmet of Egeerg athematcs
More informationA Remark on the Uniform Convergence of Some Sequences of Functions
Advaces Pure Mathematcs 05 5 57-533 Publshed Ole July 05 ScRes. http://www.scrp.org/joural/apm http://dx.do.org/0.436/apm.05.59048 A Remark o the Uform Covergece of Some Sequeces of Fuctos Guy Degla Isttut
More informationAn Introduction to. Support Vector Machine
A Itroducto to Support Vector Mache Support Vector Mache (SVM) A classfer derved from statstcal learg theory by Vapk, et al. 99 SVM became famous whe, usg mages as put, t gave accuracy comparable to eural-etwork
More informationThe Selection Problem - Variable Size Decrease/Conquer (Practice with algorithm analysis)
We have covered: Selecto, Iserto, Mergesort, Bubblesort, Heapsort Next: Selecto the Qucksort The Selecto Problem - Varable Sze Decrease/Coquer (Practce wth algorthm aalyss) Cosder the problem of fdg the
More informationChapter 14 Logistic Regression Models
Chapter 4 Logstc Regresso Models I the lear regresso model X β + ε, there are two types of varables explaatory varables X, X,, X k ad study varable y These varables ca be measured o a cotuous scale as
More informationEstimation of Stress- Strength Reliability model using finite mixture of exponential distributions
Iteratoal Joural of Computatoal Egeerg Research Vol, 0 Issue, Estmato of Stress- Stregth Relablty model usg fte mxture of expoetal dstrbutos K.Sadhya, T.S.Umamaheswar Departmet of Mathematcs, Lal Bhadur
More informationLECTURE 24 LECTURE OUTLINE
LECTURE 24 LECTURE OUTLINE Gradet proxmal mmzato method Noquadratc proxmal algorthms Etropy mmzato algorthm Expoetal augmeted Lagraga mehod Etropc descet algorthm **************************************
More informationA New Family of Transformations for Lifetime Data
Proceedgs of the World Cogress o Egeerg 4 Vol I, WCE 4, July - 4, 4, Lodo, U.K. A New Famly of Trasformatos for Lfetme Data Lakhaa Watthaacheewakul Abstract A famly of trasformatos s the oe of several
More informationGenerative classification models
CS 75 Mache Learg Lecture Geeratve classfcato models Mlos Hauskrecht mlos@cs.ptt.edu 539 Seott Square Data: D { d, d,.., d} d, Classfcato represets a dscrete class value Goal: lear f : X Y Bar classfcato
More informationExtreme Value Theory: An Introduction
(correcto d Extreme Value Theory: A Itroducto by Laures de Haa ad Aa Ferrera Wth ths webpage the authors ted to form the readers of errors or mstakes foud the book after publcato. We also gve extesos for
More informationInvestigating Cellular Automata
Researcher: Taylor Dupuy Advsor: Aaro Wootto Semester: Fall 4 Ivestgatg Cellular Automata A Overvew of Cellular Automata: Cellular Automata are smple computer programs that geerate rows of black ad whte
More informationPTAS for Bin-Packing
CS 663: Patter Matchg Algorthms Scrbe: Che Jag /9/00. Itroducto PTAS for B-Packg The B-Packg problem s NP-hard. If we use approxmato algorthms, the B-Packg problem could be solved polyomal tme. For example,
More informationTaylor s Series and Interpolation. Interpolation & Curve-fitting. CIS Interpolation. Basic Scenario. Taylor Series interpolates at a specific
CIS 54 - Iterpolato Roger Crawfs Basc Scearo We are able to prod some fucto, but do ot kow what t really s. Ths gves us a lst of data pots: [x,f ] f(x) f f + x x + August 2, 25 OSU/CIS 54 3 Taylor s Seres
More informationOvercoming Limitations of Sampling for Aggregation Queries
CIS 6930 Approxmate Quer Processg Paper Presetato Sprg 2004 - Istructor: Dr Al Dobra Overcomg Lmtatos of Samplg for Aggregato Queres Authors: Surajt Chaudhur, Gautam Das, Maur Datar, Rajeev Motwa, ad Vvek
More informationBeam Warming Second-Order Upwind Method
Beam Warmg Secod-Order Upwd Method Petr Valeta Jauary 6, 015 Ths documet s a part of the assessmet work for the subject 1DRP Dfferetal Equatos o Computer lectured o FNSPE CTU Prague. Abstract Ths documet
More informationKLT Tracker. Alignment. 1. Detect Harris corners in the first frame. 2. For each Harris corner compute motion between consecutive frames
KLT Tracker Tracker. Detect Harrs corers the frst frame 2. For each Harrs corer compute moto betwee cosecutve frames (Algmet). 3. Lk moto vectors successve frames to get a track 4. Itroduce ew Harrs pots
More informationAN UPPER BOUND FOR THE PERMANENT VERSUS DETERMINANT PROBLEM BRUNO GRENET
AN UPPER BOUND FOR THE PERMANENT VERSUS DETERMINANT PROBLEM BRUNO GRENET Abstract. The Permaet versus Determat problem s the followg: Gve a matrx X of determates over a feld of characterstc dfferet from
More informationA tighter lower bound on the circuit size of the hardest Boolean functions
Electroc Colloquum o Computatoal Complexty, Report No. 86 2011) A tghter lower boud o the crcut sze of the hardest Boolea fuctos Masak Yamamoto Abstract I [IPL2005], Fradse ad Mlterse mproved bouds o the
More informationCS 1675 Introduction to Machine Learning Lecture 12 Support vector machines
CS 675 Itroducto to Mache Learg Lecture Support vector maches Mlos Hauskrecht mlos@cs.ptt.edu 539 Seott Square Mdterm eam October 9, 7 I-class eam Closed book Stud materal: Lecture otes Correspodg chapters
More information18.413: Error Correcting Codes Lab March 2, Lecture 8
18.413: Error Correctg Codes Lab March 2, 2004 Lecturer: Dael A. Spelma Lecture 8 8.1 Vector Spaces A set C {0, 1} s a vector space f for x all C ad y C, x + y C, where we take addto to be compoet wse
More informationOverview. Basic concepts of Bayesian learning. Most probable model given data Coin tosses Linear regression Logistic regression
Overvew Basc cocepts of Bayesa learg Most probable model gve data Co tosses Lear regresso Logstc regresso Bayesa predctos Co tosses Lear regresso 30 Recap: regresso problems Iput to learg problem: trag
More informationLinear Regression Linear Regression with Shrinkage. Some slides are due to Tommi Jaakkola, MIT AI Lab
Lear Regresso Lear Regresso th Shrkage Some sldes are due to Tomm Jaakkola, MIT AI Lab Itroducto The goal of regresso s to make quattatve real valued predctos o the bass of a vector of features or attrbutes.
More informationLecture Notes Types of economic variables
Lecture Notes 3 1. Types of ecoomc varables () Cotuous varable takes o a cotuum the sample space, such as all pots o a le or all real umbers Example: GDP, Polluto cocetrato, etc. () Dscrete varables fte
More information