Fast Rates for Regularized Objectives


Karthik Sridharan, Nathan Srebro, Shai Shalev-Shwartz
Toyota Technological Institute Chicago

Abstract

We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. We demonstrate how this is essential for obtaining certain types of oracle inequalities for SVMs. The results extend also to approximate minimization as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other l_p norms.

1 Introduction

We consider the problem of (approximately) minimizing a stochastic objective

    F(w) = E_θ[f(w; θ)]    (1)

where the optimization is with respect to w ∈ W, based on an i.i.d. sample θ_1, ..., θ_n. We focus on problems where f(w; θ) has a generalized linear form:

    f(w; θ) = l(⟨w, φ(θ)⟩; θ) + r(w).    (2)

The relevant special case is regularized linear prediction, where θ = (x, y), l(⟨w, φ(x)⟩; y) is the loss of predicting ⟨w, φ(x)⟩ when the true target is y, and r(w) is a regularizer. It is well known that when the domain W and the mapping φ(·) are bounded, and the function l(z; θ) is Lipschitz continuous in z, the empirical averages

    F̂(w) = Ê[f(w; θ)] = (1/n) Σ_{i=1}^n f(w; θ_i)    (3)

converge uniformly to their expectations F(w) with rate 1/√n. This justifies using the empirical minimizer

    ŵ = arg min_{w ∈ W} F̂(w),    (4)

and we can then establish convergence of F(ŵ) to the population optimum

    F(w*) = min_{w ∈ W} F(w)    (5)

with a rate of 1/√n.
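To make this setup concrete, here is a minimal Python sketch (our illustration, not code from the paper) of the empirical objective F̂(w) of (3) and an approximate empirical minimizer ŵ of (4) in the SVM case: hinge loss with the l_2 regularizer r(w) = (λ/2)‖w‖². The synthetic data, the value of λ, and the plain subgradient-descent solver are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1

# Synthetic i.i.d. sample theta_i = (x_i, y_i); purely illustrative.
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(n))

def F_hat(w):
    # Empirical objective (3): average hinge loss plus r(w) = (lam/2) * ||w||^2.
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean() + 0.5 * lam * (w @ w)

def subgradient(w):
    # A subgradient of F_hat at w (the hinge loss is 1-Lipschitz but not differentiable).
    active = (y * (X @ w) < 1.0).astype(float)
    return -(active * y) @ X / n + lam * w

# Approximate empirical minimizer w_hat of (4) via subgradient descent,
# with the 1/(lam * t) step size that is standard for strongly convex objectives.
w_hat = np.zeros(d)
for t in range(1, 2001):
    w_hat -= subgradient(w_hat) / (lam * t)

print("F_hat(w_hat) ~", F_hat(w_hat))
```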
Recently, Hazan et al [1] studied an online analogue to this problem, and established that if f(w; θ) is strongly convex in w, the average online regret diminishes with a much faster rate, namely (log n)/n. The function f(w; θ) becomes strongly convex when, for example, we have r(w) = (λ/2)‖w‖² as in SVMs and other regularized learning settings. In this paper we present an analogous fast rate for empirical minimization of a strongly convex stochastic objective. In fact, we do not need to assume that we perform the empirical minimization exactly: we provide uniform (over all w ∈ W) guarantees on the population sub-optimality F(w) − F(w*) in terms of the empirical sub-optimality F̂(w) − F̂(ŵ), with a rate of 1/n. This is a stronger type of result than what can be obtained with an online-to-batch conversion, as it applies to any possible solution w, and not only to some specific algorithmically defined solution. For example, it can be used to analyze the performance of approximate minimizers obtained through approximate optimization techniques. Specifically, consider f(w; θ) as in (2), where l(z; θ) is convex and L-Lipschitz in z, the norm of φ(θ) is bounded by B, and r is λ-strongly convex. We show that for any a > 0 and δ > 0, with probability at least 1 − δ, for all w (of arbitrary magnitude):

    F(w) − F(w*) ≤ (1 + a)(F̂(w) − F̂(ŵ)) + O((1 + 1/a) L²B² log(1/δ) / (λn)).    (6)

We emphasize that here and throughout the paper the big-O notation hides only fixed numeric constants.

It might not be surprising that requiring strong convexity yields a rate of 1/n. Indeed, the connection between strong convexity, variance bounds, and rates of 1/n is well known. However, it is interesting to note the generality of the result here, and the simplicity of the conditions. In particular, we do not require any low-noise conditions, nor that the loss function is strongly convex (it need only be weakly convex). In particular, (6) applies, under no additional conditions, to the SVM objective. We therefore obtain convergence with a rate of 1/n for the SVM objective.

This 1/n rate on the SVM objective is always valid, and does not depend on any low-noise conditions or on specific properties of the kernel function. Such a fast rate might seem surprising at first glance to a reader familiar with the 1/√n rate on the expected loss of the SVM optimum. There is no contradiction here: what we establish is that although the loss might converge at a rate of 1/√n, the SVM objective (the regularized loss) always converges at a rate of 1/n. In fact, in Section 3 we see how a rate of 1/n on the objective corresponds to a rate of 1/√n on the loss. Specifically, we perform an oracle analysis of the optimum of the SVM objective (rather than of empirical minimization subject to a norm constraint, as in other oracle analyses of regularized linear learning), based on the existence of some (unknown) low-norm, low-error predictor w_o.

Strong convexity is a concept that depends on a choice of norm. We state our results in a general form, for any choice of norm ‖·‖. Strong convexity of r(w) must hold with respect to the chosen norm, and the data φ(θ) must be bounded with respect to the dual norm ‖·‖*, i.e. we must have ‖φ(θ)‖* ≤ B. This allows us to apply our results also to more general forms of regularizers, including squared l_p norm regularizers, r(w) = (λ/2)‖w‖_p² for 1 < p ≤ 2 (see Corollary 2). However, the reader may choose to read the paper always thinking of the norm ‖w‖, and so also its dual norm ‖w‖*, as the standard l_2-norm.

2 Main Result

We consider a generalized linear function f : W × Θ → R that can be written as in (2), defined over a closed convex subset W of a Banach space equipped with norm ‖·‖.

Lipschitz continuity and boundedness. We require that the mapping φ(·) is bounded by B, i.e. ‖φ(θ)‖* ≤ B, and that the function l(z; θ) is L-Lipschitz in z ∈ R for every θ.

Strong convexity. We require that F(w) is λ-strongly convex w.r.t. the norm ‖w‖. That is, for all w_1, w_2 ∈ W and α ∈ [0, 1] we have:

    F(αw_1 + (1 − α)w_2) ≤ αF(w_1) + (1 − α)F(w_2) − (λ/2)α(1 − α)‖w_1 − w_2‖².

Recalling that w* = arg min_w F(w), this ensures (see for example [2, Lemma 3]):

    F(w) ≥ F(w*) + (λ/2)‖w − w*‖².    (7)

We require only that the expectation F(w) = E[f(w; θ)] is strongly convex. Of course, requiring that f(w; θ) is λ-strongly convex for all θ (with respect to w) is enough to ensure the condition.
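As a quick numeric illustration (ours, not the paper's), the sketch below checks the strong-convexity inequality above for a regularized hinge objective, using an empirical average as a stand-in for the expectation F(w); the data distribution and the value of λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 500, 4, 0.3
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

def F(w):
    # Regularized hinge objective; the empirical average stands in for the expectation.
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean() + 0.5 * lam * (w @ w)

# Check F(a*w1 + (1-a)*w2) <= a*F(w1) + (1-a)*F(w2) - (lam/2)*a*(1-a)*||w1 - w2||^2
for _ in range(1000):
    w1, w2 = rng.normal(size=d), rng.normal(size=d)
    a = rng.uniform()
    lhs = F(a * w1 + (1 - a) * w2)
    rhs = a * F(w1) + (1 - a) * F(w2) - 0.5 * lam * a * (1 - a) * np.dot(w1 - w2, w1 - w2)
    assert lhs <= rhs + 1e-9, "strong convexity inequality violated"

print("lambda-strong convexity inequality held on all random checks")
```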
In particular, for a generalized linear function of the form (2) it is enough to require that l(z; y) is convex in z and that r(w) is λ-strongly convex (w.r.t. the norm ‖w‖). We now provide a faster convergence rate using the above conditions.

Theorem 1. Let W be a closed convex subset of a Banach space with norm ‖·‖ and dual norm ‖·‖*, and consider f(w; θ) = l(⟨w, φ(θ)⟩; θ) + r(w) satisfying the Lipschitz continuity, boundedness, and strong convexity requirements with parameters B, L, and λ. Let w*, ŵ, F(w) and F̂(w) be as defined in (1)-(5). Then, for any δ > 0 and any a > 0, with probability at least 1 − δ over a sample of size n, we have that for all w ∈ W (where [x]_+ = max(x, 0)):

    F(w) − F(w*) ≤ (1 + a)[F̂(w) − F̂(w*)]_+ + 8(1 + 1/a) L²B² (32 + log(1/δ)) / (λn)
               ≤ (1 + a)(F̂(w) − F̂(ŵ)) + 8(1 + 1/a) L²B² (32 + log(1/δ)) / (λn).

It is particularly interesting to consider regularizers of the form r(w) = (λ/2)‖w‖_p², which are (p − 1)λ-strongly convex w.r.t. the corresponding l_p-norm [2]. Applying Theorem 1 to this case yields the following bound:

Corollary 2. Consider an l_p norm and its dual l_q, with 1 < p ≤ 2 and 1/q + 1/p = 1, and the objective f(w; θ) = l(⟨w, φ(θ)⟩; θ) + (λ/2)‖w‖_p², where ‖φ(θ)‖_q ≤ B and l(z; y) is convex and L-Lipschitz in z. The domain is the entire Banach space W = l_p. Then, for any δ > 0 and any a > 0, with probability at least 1 − δ over a sample of size n, we have that for all w ∈ W = l_p (of any magnitude):

    F(w) − F(w*) ≤ (1 + a)(F̂(w) − F̂(ŵ)) + O((1 + 1/a) L²B² log(1/δ) / ((p − 1)λn)).

Corollary 2 allows us to analyze the rate of convergence of the regularized risk for l_p-regularized linear learning, that is, training by minimizing the empirical average of

    f(w; x, y) = l(⟨w, x⟩, y) + (λ/2)‖w‖_p²    (8)

where l(z, y) is some convex loss function and ‖x‖_q ≤ B. For example, in SVMs we use the l_2 norm, and so bound ‖x‖_2 ≤ B, and the hinge loss l(z, y) = [1 − yz]_+, which is 1-Lipschitz. What we obtain is a bound on how quickly we can minimize the expectation F(w) = E[l(⟨w, x⟩, y)] + (λ/2)‖w‖_p², i.e. the regularized expected loss, or in other words, how quickly we converge to the infinite-data optimum of the objective. We see, then, that the SVM objective converges to its optimum value at a fast rate of 1/n, without any special assumptions.

This still does not mean that the expected loss L(ŵ) = E[l(⟨ŵ, x⟩, y)] converges at this rate. This behavior is demonstrated empirically on the left plot of Figure 1. For each data set size we plot the excess expected loss L(ŵ) − L(w*) and the sub-optimality of the regularized expected loss F(ŵ) − F(w*) (recall that F(ŵ) = L(ŵ) + (λ/2)‖ŵ‖²). Although the regularized expected loss converges to its infinite data limit, i.e. to the population minimizer, with rate roughly 1/n, the expected loss L(ŵ) converges at a slower rate of roughly 1/√n.

[Figure 1. Left: excess expected loss L(ŵ) − L(w*) and sub-optimality of the regularized expected loss F(ŵ) − F(w*) as a function of training set size, for a fixed λ = 0.8. Right: excess expected loss L(ŵ_λ) − L(w_o), relative to the overall optimum w_o = arg min_w L(w), with λ = √(300/n). Both plots are on a logarithmic scale and refer to a synthetic example with x uniform over [−1.5, 1.5]^300, and y = sign(x_1) when |x_1| > 1 but uniform otherwise.]

Studying the convergence rate of the SVM objective allows us to better understand and appreciate analyses of computational optimization approaches for this objective, as well as to obtain oracle inequalities on the generalization loss of ŵ, as we do in the following Section.
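The following sketch (our illustration; the paper specifies its experimental setup only through the caption of Figure 1, so the distribution, constants, and solver below are assumptions) shows the kind of experiment behind the left plot: estimate the objective sub-optimality F(ŵ) − F(w*) and the excess loss L(ŵ) − L(w*) at several sample sizes, using a very large held-out sample as a proxy for the population quantities.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam = 20, 0.5

def sample(m):
    X = rng.uniform(-1.5, 1.5, size=(m, d))
    y = np.where(np.abs(X[:, 0]) > 1.0, np.sign(X[:, 0]), rng.choice([-1.0, 1.0], size=m))
    return X, y

def hinge_risk(w, X, y):
    # Average hinge loss, i.e. the (empirical stand-in for the) expected loss L(w).
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

def erm(X, y, steps=2000):
    # Subgradient descent on the regularized empirical objective.
    w = np.zeros(d)
    for t in range(1, steps + 1):
        active = (y * (X @ w) < 1.0).astype(float)
        g = -(active * y) @ X / len(y) + lam * w
        w -= g / (lam * t)
    return w

X_big, y_big = sample(100_000)        # large sample used as a proxy for the population
w_star = erm(X_big, y_big)            # proxy for the population optimum w*
F_star = hinge_risk(w_star, X_big, y_big) + 0.5 * lam * (w_star @ w_star)

for n in [100, 400, 1600, 6400]:
    X, y = sample(n)
    w_hat = erm(X, y)
    F_gap = hinge_risk(w_hat, X_big, y_big) + 0.5 * lam * (w_hat @ w_hat) - F_star
    L_gap = hinge_risk(w_hat, X_big, y_big) - hinge_risk(w_star, X_big, y_big)
    print(f"n={n:5d}  objective sub-optimality={F_gap:.5f}  excess loss={L_gap:.5f}")
```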
Before moving on, we briefly provide an example of applying Theorem 1 with respect to the l_1-norm. The bound in Corollary 2 diverges when p → 1, and the Corollary is not applicable for l_1 regularization. This is because ‖w‖_1² is not strongly convex w.r.t. the l_1-norm. An example of a regularizer that is strongly convex with respect to the l_1 norm is the (unnormalized) entropy regularizer [3]: r(w) = Σ_{i=1}^d |w_i| log(|w_i|). This regularizer is 1/B_w²-strongly convex w.r.t. ‖w‖_1, as long as ‖w‖_1 ≤ B_w (see [2]), yielding:

Corollary 3. Consider a function f(w; θ) = l(⟨w, φ(θ)⟩; θ) + Σ_{i=1}^d |w_i| log(|w_i|), where ‖φ(θ)‖_∞ ≤ B and l(z; y) is convex and L-Lipschitz in z. Take the domain to be the l_1 ball W = {w ∈ R^d : ‖w‖_1 ≤ B_w}. Then, for any δ > 0 and any a > 0, with probability at least 1 − δ over a sample of size n, we have that for all w ∈ W:

    F(w) − F(w*) ≤ (1 + a)(F̂(w) − F̂(ŵ)) + O((1 + 1/a) L²B²B_w² log(1/δ) / n).

3 Oracle Inequalities for SVMs

In this Section we apply the results from the previous Section to obtain an oracle inequality on the expected loss L(w) = E[l(⟨w, x⟩, y)] of an approximate minimizer of the SVM training objective F̂_λ(w) = Ê[f_λ(w)], where

    f_λ(w; x, y) = l(⟨w, x⟩, y) + (λ/2)‖w‖²,    (9)

and l(z, y) is the hinge loss, or any other 1-Lipschitz loss function. As before we denote B = sup_x ‖x‖ (all norms in this Section are l_2 norms). We assume, as an oracle assumption, that there exists a good predictor w_o with low norm ‖w_o‖ which attains low expected loss L(w_o).

Consider an optimization algorithm for F̂_λ(w) that is guaranteed to find w̃ such that F̂_λ(w̃) ≤ min_w F̂_λ(w) + ε_opt. Using the results of Section 2, we can translate this approximate optimality of the empirical objective into approximate optimality of the expected objective F_λ(w) = E[f_λ(w)]. Specifically, applying Corollary 2 with a = 1 we have that with probability at least 1 − δ:

    F_λ(w̃) − F_λ(w*) ≤ 2ε_opt + O(B² log(1/δ) / (λn)).    (10)

Optimizing to within ε_opt = O(B² / (λn)) is then enough to ensure

    F_λ(w̃) − F_λ(w*) = O(B² log(1/δ) / (λn)).    (11)
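Read operationally (our reading, with the constants hidden by the O(·) notation replaced by arbitrary parameters), (10) and (11) say that an SVM solver only needs to reach empirical sub-optimality of order B²/(λn); a small illustrative helper:

```python
import numpy as np

def eps_opt_target(B, lam, n, c=1.0):
    # Empirical sub-optimality of order B^2 / (lam * n); the constant c is arbitrary,
    # standing in for the constant hidden by the paper's O(.) notation.
    return c * B ** 2 / (lam * n)

def expected_objective_gap(eps_opt, B, lam, n, delta, c=1.0):
    # Right-hand side of (10), again with an arbitrary constant c in place of the O(.).
    return 2.0 * eps_opt + c * B ** 2 * np.log(1.0 / delta) / (lam * n)

eps = eps_opt_target(B=1.0, lam=0.01, n=10_000)
print(eps, expected_objective_gap(eps, B=1.0, lam=0.01, n=10_000, delta=0.01))
```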
In order to translate this into a bound on the expected loss L(w̃) we consider the following decomposition:

    L(w̃) = L(w_o) + (F_λ(w̃) − F_λ(w*)) + (F_λ(w*) − F_λ(w_o)) + (λ/2)‖w_o‖² − (λ/2)‖w̃‖²
         ≤ L(w_o) + O(B² log(1/δ) / (λn)) + (λ/2)‖w_o‖²    (12)

where we used the bound (11) to bound the second term, used the optimality of w* to ensure the third term is non-positive, and dropped the last, non-positive, term. This might seem like a rate of 1/n on the generalization error, but we need to choose λ so as to balance the second and third terms. The optimal choice for λ is

    λ(n) = c B √(log(1/δ)) / (‖w_o‖ √n),    (13)

for some constant c. We can now formally state our oracle inequality, which is obtained by substituting (13) into (12):

Corollary 4. Consider an SVM-type objective as in (9). For any w_o and any δ > 0, with probability at least 1 − δ over a sample of size n, we have that for all w̃ s.t. F̂_λ(n)(w̃) ≤ min_w F̂_λ(n)(w) + O(B²/(λ(n) n)), where λ(n) is chosen as in (13), the following holds:

    L(w̃) ≤ L(w_o) + O(√(B² ‖w_o‖² log(1/δ) / n)).

Corollary 4 is demonstrated empirically on the right plot of Figure 1. The way we set λ(n) in Corollary 4 depends on ‖w_o‖. However, using

    λ(n) = B √(log(1/δ) / n)    (14)

we obtain:

Corollary 5. Consider an SVM-type objective as in (9) with λ(n) set as in (14). For any δ > 0, with probability at least 1 − δ over a sample of size n, we have that for all w̃ s.t. F̂_λ(n)(w̃) ≤ min_w F̂_λ(n)(w) + O(B²/(λ(n) n)), the following holds:

    L(w̃) ≤ inf_{w_o} [ L(w_o) + O(√(B² (‖w_o‖⁴ + 1) log(1/δ) / n)) ].

The price we pay here is that the bound of Corollary 5 is larger by a factor of ‖w_o‖ relative to the bound of Corollary 4. Nevertheless, this bound allows us to converge with a rate of 1/√n to the expected loss of any fixed predictor.

It is interesting to repeat the analysis of this Section using the more standard result:

    F_λ(w) − F_λ(w*) ≤ F̂_λ(w) − F̂_λ(w*) + O(B_w B / √n)   for ‖w‖ ≤ B_w    (15)

where we ignore the dependence on δ. Setting B_w = √(2/λ), as this is a bound on the norm of both the empirical and population optimums, and using (15) instead of Corollary 2 in our analysis yields the oracle inequality:

    L(w̃) ≤ L(w_o) + O((B² ‖w_o‖² log(1/δ) / n)^{1/3}).    (16)

The oracle analysis studied here is very simple: our oracle assumption involves only a single predictor w_o, and we make no assumptions about the kernel or the noise. We note that a more sophisticated analysis has been carried out by Steinwart et al [4], who showed that rates faster than 1/√n are possible under certain conditions on the noise and on the complexity of the kernel class. In Steinwart et al's analyses the estimation rates (i.e. rates for the expected regularized risk) are given in terms of the approximation error quantity (λ/2)‖w*‖² + L(w*) − L*, where L* is the Bayes risk. In our result we consider the estimation rate for the regularized objective independently of the approximation error.
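A small sketch (ours, with arbitrary constants) of the two regularization-parameter choices discussed above: the oracle-dependent λ(n) of (13) and the oracle-free λ(n) of (14).

```python
import numpy as np

def lam_oracle(n, B, delta, w_o_norm, c=1.0):
    # Choice (13): lambda(n) = c * B * sqrt(log(1/delta)) / (||w_o|| * sqrt(n)); c unspecified.
    return c * B * np.sqrt(np.log(1.0 / delta)) / (w_o_norm * np.sqrt(n))

def lam_free(n, B, delta):
    # Choice (14): lambda(n) = B * sqrt(log(1/delta) / n); needs no knowledge of w_o.
    return B * np.sqrt(np.log(1.0 / delta) / n)

for n in [1_000, 10_000, 100_000]:
    print(n, lam_oracle(n, B=1.0, delta=0.01, w_o_norm=5.0), lam_free(n, B=1.0, delta=0.01))
```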
4 Proof of Main Result

To prove Theorem 1 we use techniques of reweighting and peeling, following Bartlett et al [5]. For each w, we define g_w(θ) = f(w; θ) − f(w*; θ), and so our goal is to bound the expectation of g_w in terms of its empirical average. We denote G = {g_w : w ∈ W}. Since our desired bound is not exactly uniform, and we would like to pay different attention to functions depending on their expected sub-optimality, we will instead consider the following reweighted class. For any r > 0 define

    G_r = { g_w^r = g_w / 4^{k(w)} : w ∈ W, k(w) = min{k ∈ Z_+ : E[g_w] ≤ r 4^k} }    (17)

where Z_+ is the set of non-negative integers. In other words, g_w^r ∈ G_r is just a scaled version of g_w ∈ G, and the scaling factor ensures that E[g_w^r] ≤ r. We will begin by bounding the variation between expected and empirical average values of g^r ∈ G_r. This is typically done in terms of the complexity of the class G_r. However, we will instead use the complexity of a slightly different class of functions, which ignores the non-random (i.e. non-data-dependent) regularization terms r(w). Define:

    H_r = { h_w^r = h_w / 4^{k(w)} : w ∈ W, k(w) = min{k ∈ Z_+ : E[g_w] ≤ r 4^k} }    (18)

where

    h_w(θ) = g_w(θ) − (r(w) − r(w*)) = l(⟨w, φ(θ)⟩; θ) − l(⟨w*, φ(θ)⟩; θ).    (19)

That is, h_w^r(θ) is the data-dependent component of g_w^r, dropping the (scaled) regularization terms. With this definition we have E[g_w^r] − Ê[g_w^r] = E[h_w^r] − Ê[h_w^r] (the regularization terms on the left hand side cancel out), and so it is enough to bound the deviation of the empirical means in H_r. This can be done in terms of the Rademacher complexity of the class, R(H_r) [6, Theorem 5]: for any δ > 0, with probability at least 1 − δ,

    sup_{h^r ∈ H_r} ( E[h^r] − Ê[h^r] ) ≤ 2 R(H_r) + ( sup_{h^r ∈ H_r, θ} |h^r(θ)| ) √(2 log(1/δ) / n).    (20)

We will now proceed to bounding the two terms on the right hand side.

Lemma 6. sup_{h^r ∈ H_r, θ} |h^r(θ)| ≤ LB √(2r/λ).

Proof. From the definition of h_w^r given in (18)-(19), the Lipschitz continuity of l(·; θ), and the bound ‖φ(θ)‖* ≤ B, we have for all w, θ:

    |h_w^r(θ)| ≤ |h_w(θ)| / 4^{k(w)} ≤ LB ‖w − w*‖ / 4^{k(w)}.    (21)

We now use the strong convexity of F(w), and in particular eq. (7), as well as the definitions of g_w and k(w), and finally note that 4^{k(w)} ≥ 1, to get:

    ‖w − w*‖² ≤ (2/λ)(F(w) − F(w*)) = (2/λ) E[g_w] ≤ (2/λ) 4^{k(w)} r ≤ (2/λ) 16^{k(w)} r.    (22)

Substituting (22) into (21) yields the desired bound.

Lemma 7. R(H_r) ≤ 2LB √(2r/(λn)).

Proof. We will use the following generic bound on the Rademacher complexity of linear functionals [7, Theorem 1]: for any t(w) which is λ-strongly convex (w.r.t. a norm whose dual norm is ‖·‖*),

    R({ ⟨w, φ⟩ : t(w) ≤ a }) ≤ ( sup ‖φ‖* ) √(2a / (λn)).    (23)

For each a > 0, define H(a) = {h_w : w ∈ W, E[g_w] ≤ a}. First note that E[g_w] = F(w) − F(w*) is λ-strongly convex. Using (23) and the Lipschitz composition property we therefore have R(H(a)) ≤ LB √(2a/(λn)). Now:

    R(H_r) = R( ∪_{j≥0} 4^{−j} H(r 4^j) ) ≤ Σ_{j≥0} 4^{−j} R(H(r 4^j)) ≤ Σ_{j≥0} 4^{−j} LB √(2r 4^j/(λn)) = LB √(2r/(λn)) Σ_{j≥0} 2^{−j} = 2LB √(2r/(λn)).

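As an aside (ours, not part of the proof), the quantity controlled by Lemma 7 can be made concrete in the simplest l_2 case: for a norm ball {w : ‖w‖_2 ≤ W_bound} the supremum inside the Rademacher complexity has a closed form, and the complexity itself can be estimated by Monte Carlo over the sign vectors; the data and W_bound below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, W_bound = 500, 10, 2.0
X = rng.normal(size=(n, d))   # plays the role of the data points phi(theta_i); illustrative

# Empirical Rademacher complexity of {x -> <w, x> : ||w||_2 <= W_bound}:
#   E_sigma sup_w (1/n) sum_i sigma_i <w, x_i> = W_bound * E_sigma || (1/n) sum_i sigma_i x_i ||_2
estimates = []
for _ in range(2000):
    sigma = rng.choice([-1.0, 1.0], size=n)
    estimates.append(W_bound * np.linalg.norm(sigma @ X) / n)

print("Monte Carlo estimate:", np.mean(estimates))
print("B * W_bound / sqrt(n):", np.max(np.linalg.norm(X, axis=1)) * W_bound / np.sqrt(n))
```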
We now proceed to bounding E[g_w] = F(w) − F(w*), and thus proving Theorem 1. For any r > 0, with probability at least 1 − δ we have, for all w ∈ W:

    E[g_w] − Ê[g_w] = 4^{k(w)} ( E[g_w^r] − Ê[g_w^r] ) = 4^{k(w)} ( E[h_w^r] − Ê[h_w^r] ) ≤ 4^{k(w)} √r D    (24)

where D = LB (√32 + 2√(log(1/δ))) / √(λn) ≤ 2√2 LB √((32 + log(1/δ)) / (λn)) is obtained by substituting Lemmas 6 and 7 into (20). We now consider two possible cases: k(w) = 0 and k(w) > 0.

The case k(w) = 0 corresponds to functions with an expected value close to optimal: E[g_w] ≤ r, i.e. F(w) ≤ F(w*) + r. In this case (24) becomes:

    E[g_w] ≤ Ê[g_w] + √r D.    (25)

We now turn to functions for which k(w) > 0, i.e. with expected values further away from optimal. In this case, the definition of k(w) ensures 4^{k(w)−1} r < E[g_w], and substituting this into (24) we have E[g_w] − Ê[g_w] ≤ (4D/√r) E[g_w]. Rearranging terms yields:

    E[g_w] ≤ Ê[g_w] / (1 − 4D/√r).    (26)

Combining the two cases (25) and (26) (and requiring r ≥ (4D)², so that 4D/√r ≤ 1), we always have:

    E[g_w] ≤ [Ê[g_w]]_+ / (1 − 4D/√r) + √r D.    (27)

Setting r = (1 + 1/a)² (4D)² yields the bound in Theorem 1.

5 Comparison with Previous Fast Rate Guarantees

Rates faster than 1/√n for estimation have been previously explored under various conditions, where strong convexity has played a significant role. Lee et al [8] showed faster rates for the squared loss, exploiting the strong convexity of this loss function, but only under a finite pseudo-dimensionality assumption, which does not hold in SVM-like settings. Bousquet [9] provided similar guarantees when the spectrum of the kernel matrix (the covariance of the data) is exponentially decaying. Tsybakov [10] introduced a margin condition under which rates faster than 1/√n are shown to be possible. It is also possible to ensure rates of 1/n by relying on low noise conditions [9, 11], but here we make no such assumption.

Most methods for deriving fast rates first bound the variance of the functions in the class by some monotone function of their expectations. Then, using methods as in Bartlett et al [5], one can get bounds that have a localized complexity term and additional terms of order faster than 1/√n. However, it is important to note that the localized complexity term typically dominates the rate and still needs to be controlled. For example, Bartlett et al [12] show that strict convexity of the loss function implies a variance bound, and provide a general result that can enable obtaining faster rates as long as the complexity term is low. For instance, for classes with finite VC dimension V, the resulting rate is n^{−(V+2)/(2V+2)}, which indeed is better than 1/√n but is not quite 1/n. Thus we see that even for a strictly convex loss function, such as the squared loss, additional conditions are necessary in order to obtain fast rates. In this work we show that strong convexity not only implies a variance bound, but in fact can be used to bound the localized complexity.

An important distinction is that we require strong convexity of the function F(w) with respect to the norm ‖w‖. This is rather different than requiring the loss function z ↦ l(z, y) to be strongly convex on the reals. In particular, the loss of a linear predictor, w ↦ l(⟨w, x⟩, y), can never be strongly convex in a multi-dimensional space, even if l is strongly convex, since it is flat in directions orthogonal to x. As mentioned, f(w; x, y) = l(⟨w, x⟩, y) can never be strongly convex in a high-dimensional space. However, we actually only require strong convexity of the expected loss F(w). If the loss function l(z, y) is λ-strongly convex in z, and the eigenvalues of the covariance of x are bounded away from zero, strong convexity of F(w) can be ensured. In particular, F(w) would be cλ-strongly convex, where c is the minimal eigenvalue of COV[x]. This enables us to use Theorem 1 to obtain rates of 1/n on the expected loss itself.
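A small sketch (ours, not the paper's) of the observation above: when l(z, y) is λ-strongly convex in z, F(w) inherits cλ-strong convexity with c the minimal eigenvalue of COV[x]; the code simply estimates c from a sample. The data distribution and the loss's strong-convexity constant are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 10_000, 5
# Anisotropic data whose covariance eigenvalues stay away from zero; illustrative.
X = rng.normal(size=(n, d)) * np.array([2.0, 1.0, 0.5, 0.2, 0.05])

c = np.linalg.eigvalsh(np.cov(X, rowvar=False)).min()   # minimal eigenvalue of COV[x]
lam_loss = 1.0   # strong convexity of z -> l(z, y) in z, e.g. the squared loss (z - y)^2 / 2
print("implied strong convexity of F(w):", c * lam_loss)
```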
However, we cannot expect the eigenvalues to be bounded away from zero in very high dimensional spaces, limiting the applicability of this result to low-dimensional spaces where, as discussed above, other results also apply.

An interesting observation about our proof technique is that the only concentration inequality we invoked was McDiarmid's inequality (in [6, Theorem 5], to obtain (20), a bound on the deviations in terms of the Rademacher complexity). This was possible because we could make a localization argument for the l_∞ norm of the functions in our function class in terms of their expectation.

6 Summary

We believe this is the first demonstration that, without any additional requirements, the SVM objective converges to its infinite data limit with a rate of O(1/n). This improves on previous results that considered the SVM objective only under special additional conditions. The results extend also to other regularized objectives.

Although the quantity that is ultimately of interest to us is the expected loss, and not the regularized expected loss, it is still important to understand the statistical behavior of the regularized expected loss. This is the quantity that we actually optimize, track, and often provide bounds on (e.g. in approximate or stochastic optimization approaches). A better understanding of its behavior can allow us to theoretically explore the behavior of regularized learning methods, to better understand empirical behavior observed in practice, and to appreciate guarantees of stochastic optimization approaches for such regularized objectives. As we saw in Section 3, deriving such fast rates is also essential for obtaining simple and general oracle inequalities, which also help guide our choice of regularization parameters.

References

[1] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.
[2] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.
[3] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.
[4] I. Steinwart, D. Hush, and C. Scovel. A new concentration result for regularized risk minimizers. High Dimensional Probability IV, IMS Lecture Notes, vol. 51, 2006.
[5] P. L. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexities. In COLT '02: Proceedings of the 15th Annual Conference on Computational Learning Theory, pages 44-58, London, UK, 2002. Springer-Verlag.
[6] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures in Machine Learning, pages 169-207. Springer, 2004.
[7] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, 2008.
[8] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. In Computational Learning Theory, pages 140-146, 1996.
[9] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.
[10] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135-166, 2004.
[11] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35:575-607, 2007.
[12] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, March 2006.
