Sparsity in Multiple Kernel Learning


Vladimir Koltchinskii*
School of Mathematics, Georgia Institute of Technology, Atlanta, GA, USA
vlad@math.gatech.edu

and

Ming Yuan†
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA
myuan@isye.gatech.edu

April 28, 2010

* The research of this author was supported in part by NSF grants MPSA-MCS, DMS and CCF.
† The research of this author was supported in part by NSF grants MPSA-MCS and DMS.

Abstract

The problem of multiple kernel learning based on penalized empirical risk minimization is discussed. The complexity penalty is determined jointly by the empirical $L_2$ norms and the reproducing kernel Hilbert space (RKHS) norms induced by the kernels, with a data-driven choice of regularization parameters. The main focus is on the case when the total number of kernels is large, but only a relatively small number of them is needed to represent the target function, so that the problem is sparse. The goal is to establish oracle inequalities for the excess risk of the resulting prediction rule showing that the method is adaptive both to the unknown design distribution and to the sparsity of the problem.

1 Introduction

Let $(X_i, Y_i)$, $i = 1, \dots, n$, be independent copies of a random couple $(X, Y)$ with values in $S \times T$, where $S$ is a measurable space with $\sigma$-algebra $\mathcal{A}$ (typically, $S$ is a compact subset of a finite-dimensional Euclidean space) and $T$ is a Borel subset of $\mathbb{R}$. In what follows, $P$ will denote the distribution of $(X, Y)$ and $\Pi$ the distribution of $X$. The corresponding empirical distributions, based on $(X_1, Y_1), \dots, (X_n, Y_n)$ and on $X_1, \dots, X_n$, will be denoted by $P_n$ and $\Pi_n$, respectively. For a measurable function $g : S \times T \to \mathbb{R}$, we denote
\[
P g := \int_{S \times T} g \, dP = \mathbb{E}\, g(X, Y) \quad \text{and} \quad P_n g := \int_{S \times T} g \, dP_n = \frac{1}{n} \sum_{j=1}^{n} g(X_j, Y_j).
\]
Similarly, we use the notations $\Pi f$ and $\Pi_n f$ for the integrals of a function $f : S \to \mathbb{R}$ with respect to the measures $\Pi$ and $\Pi_n$.

The goal of prediction is to learn a reasonably good prediction rule $f : S \to \mathbb{R}$ from the empirical data $\{(X_i, Y_i) : i = 1, 2, \dots, n\}$. To be more specific, consider a loss function $\ell : T \times \mathbb{R} \to \mathbb{R}_+$ and define the risk of a prediction rule $f$ as $P(\ell \circ f) = \mathbb{E}\, \ell(Y, f(X))$, where $(\ell \circ f)(x, y) = \ell(y, f(x))$. An optimal prediction rule with respect to this loss is defined as
\[
f_* = \operatorname*{argmin}_{f : S \to \mathbb{R}} P(\ell \circ f),
\]

where the minimization is taken over all measurable functions and, for simplicity, it is assumed that the minimum is attained. The excess risk of a prediction rule $f$ is defined as
\[
\mathcal{E}(\ell \circ f) := P(\ell \circ f) - P(\ell \circ f_*).
\]
Throughout the paper, the notation $a \asymp b$ means that there exists a numerical constant $c > 0$ such that $c^{-1} \le a/b \le c$. By "numerical constants" we usually mean real numbers whose precise values are not necessarily specified, or, sometimes, constants that might depend on the characteristics of the problem that are of little interest to us (for instance, constants that depend only on the loss function).

1.1 Learning in Reproducing Kernel Hilbert Spaces

Let $H_K$ be a reproducing kernel Hilbert space (RKHS) associated with a symmetric nonnegatively definite kernel $K : S \times S \to \mathbb{R}$ such that for any $x \in S$, $K_x(\cdot) := K(\cdot, x) \in H_K$ and $f(x) = \langle f, K_x \rangle_{H_K}$ for all $f \in H_K$ (Aronszajn). If it is known that $f_* \in H_K$ and $\|f_*\|_{H_K} \le 1$, then it is natural to estimate $f_*$ by a solution $\hat f$ of the following empirical risk minimization problem:
\[
\hat f := \operatorname*{argmin}_{\|f\|_{H_K} \le 1} \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, f(X_i)). \tag{1}
\]
The size of the excess risk $\mathcal{E}(\ell \circ \hat f)$ of such an empirical solution depends on the "smoothness" of functions in the RKHS $H_K$. A natural notion of smoothness in this context is related to the unknown design distribution $\Pi$. Namely, let $T_K$ be the integral operator from $L_2(\Pi)$ into $L_2(\Pi)$ with kernel $K$. Under a standard assumption that the kernel $K$ is square integrable (in the theory of RKHS it is usually even assumed that $S$ is compact and $K$ is continuous), the operator $T_K$ is compact and its spectrum is discrete. If $\{\lambda_k\}$ is the sequence of the eigenvalues (arranged in decreasing order) of $T_K$ and $\{\phi_k\}$ is the corresponding $L_2(\Pi)$-orthonormal sequence of eigenfunctions, then it is well known that the RKHS-norms of functions from the linear span of $\{\phi_k\}$ can be written as
\[
\|f\|_{H_K}^2 = \sum_{k \ge 1} \frac{|\langle f, \phi_k \rangle_{L_2(\Pi)}|^2}{\lambda_k},
\]

which means that the "smoothness" of functions in $H_K$ depends on the rate of decay of eigenvalues $\lambda_k$ that, in turn, depends on the design distribution $\Pi$. It is also clear that the unit balls in the RKHS $H_K$ are ellipsoids in the space $L_2(\Pi)$ with "axes" $\sqrt{\lambda_k}$.

It was shown by Mendelson (2002) that the following function
\[
\breve\gamma_n(\delta) := \left( \frac{1}{n} \sum_{k \ge 1} (\lambda_k \wedge \delta^2) \right)^{1/2}, \quad \delta \in [0, 1],
\]
provides tight upper and lower bounds (up to constants) on localized Rademacher complexities of the unit ball in $H_K$ and plays an important role in the analysis of the empirical risk minimization problem (1). It is easy to see that the function $\breve\gamma_n^2(\sqrt{\delta})$ is concave, $\breve\gamma_n(0) = 0$ and, as a consequence, $\breve\gamma_n(\delta)/\delta$ is a decreasing function of $\delta$ and $\breve\gamma_n(\delta)/\delta^2$ is strictly decreasing. Hence, there exists a unique positive solution of the equation $\breve\gamma_n(\delta) = \delta^2$. If $\bar\delta_n$ denotes this solution, then the results of Mendelson (2002) imply that with some constant $C > 0$ and with probability at least $1 - e^{-t}$,
\[
\mathcal{E}(\ell \circ \hat f) \le C \left( \bar\delta_n^2 + \frac{t}{n} \right).
\]
The size of the quantity $\bar\delta_n^2$ involved in this upper bound on the excess risk depends on the rate of decay of the eigenvalues $\lambda_k$ as $k \to \infty$. In particular, if $\lambda_k \asymp k^{-2\beta}$ for some $\beta > 1/2$, then it is easy to see that $\breve\gamma_n(\delta) \asymp n^{-1/2} \delta^{1 - 1/(2\beta)}$ and $\bar\delta_n^2 \asymp n^{-2\beta/(2\beta+1)}$. Recall that unit balls in $H_K$ are ellipsoids in $L_2(\Pi)$ with "axes" of the order $k^{-\beta}$ and it is well known that, in a variety of estimation problems, $n^{-2\beta/(2\beta+1)}$ represents minimax convergence rates of the squared $L_2$-risk for functions from such ellipsoids (for instance, from Sobolev balls of smoothness $\beta$), as in the famous Pinsker's Theorem (see, e.g., Tsybakov 2009, Chapter 3).

Example. Sobolev spaces $W^{\alpha,2}(G)$, $G \subset \mathbb{R}^d$, of smoothness $\alpha > d/2$ are a well known class of concrete examples of RKHS. Let $\mathbb{T}^d$, $d \ge 1$, denote the $d$-dimensional torus and let $\Pi$ be the uniform distribution in $\mathbb{T}^d$. It is easy to check that, for all $\alpha > d/2$, the Sobolev space $W^{\alpha,2}(\mathbb{T}^d)$ is an RKHS generated by the kernel $K(x, y) = k(x - y)$, $x, y \in \mathbb{T}^d$, where the function $k \in L_2(\mathbb{T}^d)$ is defined by its Fourier coefficients
\[
\hat k_n = (|n|^2 + 1)^{-\alpha}, \quad n = (n_1, \dots, n_d) \in \mathbb{Z}^d, \quad |n|^2 := n_1^2 + \dots + n_d^2.
\]
In this case, the eigenfunctions of the operator $T_K$ are the functions of the Fourier basis and its eigenvalues are the numbers $\{(|n|^2 + 1)^{-\alpha} : n \in \mathbb{Z}^d\}$.
For $d = 1$ and $\alpha > 1/2$, we have

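$\lambda_k \asymp k^{-2\alpha}$ (recall that $\{\lambda_k\}$ are the eigenvalues arranged in decreasing order) so, $\beta = \alpha$ and $\bar\delta_n^2 \asymp n^{-2\alpha/(2\alpha+1)}$, which is a minimax nonparametric convergence rate for Sobolev balls in $W^{\alpha,2}(\mathbb{T})$ (see, e.g., Tsybakov 2009, Theorem 2.9). More generally, for arbitrary $d \ge 1$ and $\alpha > d/2$, we get $\beta = \alpha/d$ and $\bar\delta_n^2 \asymp n^{-2\alpha/(2\alpha+d)}$, which is also a minimax optimal convergence rate in this case.

Suppose now that the distribution $\Pi$ is uniform in a torus $\mathbb{T}^{d'} \subset \mathbb{T}^d$ of dimension $d' < d$. We will use the same kernel $K$, but restrict the RKHS $H_K$ to the torus $\mathbb{T}^{d'}$ of smaller dimension. Let $d'' = d - d'$. For $n \in \mathbb{Z}^d$, we will write $n = (n', n'')$ with $n' \in \mathbb{Z}^{d'}$, $n'' \in \mathbb{Z}^{d''}$. It is easy to prove that the eigenvalues of the operator $T_K$ become in this case
\[
\sum_{n'' \in \mathbb{Z}^{d''}} (|n'|^2 + |n''|^2 + 1)^{-\alpha} \asymp (|n'|^2 + 1)^{-(\alpha - d''/2)}.
\]
Due to this fact, the norm of the space $H_K$ (restricted to $\mathbb{T}^{d'}$) is equivalent to the norm of the Sobolev space $W^{\alpha - d''/2, 2}(\mathbb{T}^{d'})$. Since the eigenvalues of the operator $T_K$ coincide, up to a constant, with the numbers $\{(|n'|^2 + 1)^{-(\alpha - d''/2)} : n' \in \mathbb{Z}^{d'}\}$, we get
\[
\bar\delta_n^2 \asymp n^{-\frac{2\alpha - d''}{2\alpha - d'' + d'}},
\]
which is again the minimax convergence rate for Sobolev balls in $W^{\alpha - d''/2, 2}(\mathbb{T}^{d'})$.

In the case of more general design distributions $\Pi$, the rate of decay of the eigenvalues $\lambda_k$ and the corresponding size of the excess risk bound $\bar\delta_n^2$ depend on $\Pi$. If, for instance, $\Pi$ is supported in a submanifold $S \subset \mathbb{T}^d$ of dimension $\dim(S) < d$, the rate of convergence of $\bar\delta_n^2$ to $0$ depends on the dimension of the submanifold $S$ rather than on the dimension of the ambient space $\mathbb{T}^d$.

Using the properties of the function $\breve\gamma_n$, in particular, the fact that $\breve\gamma_n(\delta)/\delta$ is decreasing, it is easy to observe that
\[
\breve\gamma_n(\delta) \le \bar\delta_n \delta + \bar\delta_n^2, \quad \delta \in (0, 1].
\]
Moreover, if $\breve\epsilon = \breve\epsilon(K)$ denotes the smallest value of $\epsilon$ such that the linear function $\epsilon\delta + \epsilon^2$, $\delta \in (0, 1]$, provides an upper bound for the function $\breve\gamma_n(\delta)$, $\delta \in (0, 1]$, then $\breve\epsilon \le \bar\delta_n \le c\,\breve\epsilon$ with a numerical constant $c$. Note that $\breve\epsilon$ also depends on $n$, but we do not have to emphasize this dependence in the notations since, in what follows, $n$ is fixed. Based on the observations above, the quantity $\bar\delta_n$ coincides (up to a numerical constant) with the slope $\breve\epsilon$ of the "smallest linear majorant" of the form $\breve\epsilon\delta + \breve\epsilon^2$ of the function $\breve\gamma_n(\delta)$.
This interpretation of $\bar\delta_n$ is of some importance in the design of complexity penalties used in this paper.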
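
As a numerical illustration of the fixed point discussed in this section, the following minimal sketch evaluates the function $\delta \mapsto (n^{-1}\sum_k (\lambda_k \wedge \delta^2))^{1/2}$ and solves the equation "function value $= \delta^2$" by bisection, which is justified because the ratio of the function to $\delta^2$ is strictly decreasing. It is not part of the paper: the names `gamma_n`, `fixed_point` and `poly_eigs`, and the truncation level `k_max` of the eigenvalue sequence, are assumptions made for this example only.

```python
import math


def gamma_n(eigs, delta, n):
    """sqrt( (1/n) * sum_k min(lambda_k, delta^2) ) for a finite list of eigenvalues."""
    return math.sqrt(sum(min(lam, delta * delta) for lam in eigs) / n)


def fixed_point(eigs, n, tol=1e-12):
    """Bisection for the positive solution of gamma_n(delta) = delta^2 on (0, 1].

    gamma_n(delta) / delta^2 is strictly decreasing, so gamma_n(delta) > delta^2
    exactly to the left of the solution.
    """
    lo, hi = 0.0, 1.0
    if gamma_n(eigs, hi, n) > hi * hi:
        raise ValueError("no solution in (0, 1]: the eigenvalue sum is too large")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_n(eigs, mid, n) > mid * mid:
            lo = mid
        else:
            hi = mid
    return hi  # hi always satisfies gamma_n(hi) <= hi^2


def poly_eigs(beta, k_max=5000):
    """Hypothetical eigenvalue sequence lambda_k = k^(-2*beta), truncated at k_max."""
    return [k ** (-2.0 * beta) for k in range(1, k_max + 1)]


if __name__ == "__main__":
    beta, n = 1.0, 1000
    eigs = poly_eigs(beta)
    delta_bar = fixed_point(eigs, n)
    rate = n ** (-2.0 * beta / (2.0 * beta + 1.0))
    print("squared fixed point:", delta_bar ** 2)
    print("n^(-2*beta/(2*beta+1)):", rate)
```

For polynomially decaying eigenvalues, the squared fixed point should agree with $n^{-2\beta/(2\beta+1)}$ up to a constant factor, and the linear function with slope equal to the fixed point should dominate the curve on $(0, 1]$.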

1.2 Sparse Recovery via Regularization

Instead of minimizing the empirical risk over an RKHS-ball (as in problem (1)), it is very common to define the estimator $\hat f$ of the target function $f_*$ as a solution of the penalized empirical risk minimization problem of the form
\[
\hat f := \operatorname*{argmin}_{f \in H_K} \left[ \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, f(X_i)) + \epsilon \|f\|_{H_K}^{\alpha} \right], \tag{2}
\]
where $\epsilon > 0$ is a tuning parameter that balances the tradeoff between the empirical risk and the "smoothness" of the estimate and, most often, $\alpha = 2$ (sometimes, $\alpha = 1$). The properties of the estimator $\hat f$ have been studied extensively. In particular, it was possible to derive probabilistic bounds on the excess risk $\mathcal{E}(\ell \circ \hat f)$ (oracle inequalities) with the control of the random error in terms of the rate of decay of the eigenvalues $\{\lambda_k\}$, or, equivalently, in terms of the function $\breve\gamma_n$ (see, e.g., Blanchard, Bousquet and Massart).

In recent years, there has been a lot of interest in a data dependent choice of the kernel $K$ in this type of problem. In particular, given a finite (possibly large) dictionary $\{K_j : j = 1, 2, \dots, N\}$ of symmetric nonnegatively definite kernels on $S$, one can try to find a "good" kernel $K$ as a convex combination of the kernels from the dictionary:
\[
K \in \mathcal{K} := \left\{ \sum_{j=1}^{N} \theta_j K_j : \theta_j \ge 0, \ \theta_1 + \dots + \theta_N = 1 \right\}. \tag{3}
\]
The coefficients of $K$ need to be estimated from the training data along with the prediction rule. Using this approach for problem (2) with $\alpha = 1$ leads to the following optimization problem:
\[
\hat f := \operatorname*{argmin}_{f \in H_K, \, K \in \mathcal{K}} \left( P_n(\ell \circ f) + \epsilon \|f\|_{H_K} \right). \tag{4}
\]
This learning problem, often referred to as multiple kernel learning, has been studied recently by Bousquet and Herrmann (2003), Crammer, Keshet and Singer (2003), Lanckriet, Cristianini, Bartlett, Ghaoui and Jordan (2004), Micchelli and Pontil (2005), Lin and Zhang (2006), Srebro and Ben-David (2006), Bach (2008) and Koltchinskii and Yuan (2008), among others. In particular (see, e.g., Micchelli and Pontil 2005), problem (4) is equivalent to the following:
\[
(\hat f_1, \dots, \hat f_N) := \operatorname*{argmin}_{f_j \in H_{K_j}, \, j = 1, \dots, N} \left( P_n(\ell \circ (f_1 + \dots + f_N)) + \epsilon \sum_{j=1}^{N} \|f_j\|_{H_{K_j}} \right), \tag{5}
\]

which is an infinite-dimensional version of LASSO-type penalization. Koltchinskii and Yuan (2008) studied this method in the case when the dictionary is large, but the target function $f_*$ has a "sparse representation" in terms of a relatively small subset of kernels $\{K_j : j \in J\}$. It was shown that this method is adaptive to sparsity, extending well known properties of LASSO to this infinite dimensional framework.

In this paper, we study a different approach to multiple kernel learning. It is closer to the recent work on "sparse additive models" (see, e.g., Ravikumar, Liu, Lafferty and Wasserman 2008 and Meier, van de Geer and Bühlmann 2009) and it is based on a "double penalization" with a combination of empirical $L_2$-norms (used to enforce the sparsity of the solution) and RKHS-norms (used to enforce the "smoothness" of the components). Moreover, we suggest a data-driven method of choosing the values of regularization parameters that is adaptive to unknown smoothness of the components (determined by the behavior of distribution dependent eigenvalues of the kernels).

Let $H_j := H_{K_j}$, $j = 1, \dots, N$. Denote
\[
H := \mathrm{l.s.}\left( \bigcup_{j=1}^{N} H_j \right)
\]
("l.s." meaning "the linear span"), and
\[
H^{(N)} := \left\{ (h_1, \dots, h_N) : h_j \in H_j, \ j = 1, \dots, N \right\}.
\]
Note that $f \in H$ if and only if there exists an additive representation (possibly, non-unique) $f = f_1 + \dots + f_N$, where $f_j \in H_j$, $j = 1, \dots, N$. Also, $H^{(N)}$ has a natural structure of a linear space and it can be equipped with the following inner product
\[
\langle (f_1, \dots, f_N), (g_1, \dots, g_N) \rangle_{H^{(N)}} := \sum_{j=1}^{N} \langle f_j, g_j \rangle_{H_j}
\]
to become the direct sum of Hilbert spaces $H_j$, $j = 1, \dots, N$.

Given a convex subset $D \subset H^{(N)}$, consider the following penalized empirical risk minimization problem:
\[
(\hat f_1, \dots, \hat f_N) = \operatorname*{argmin}_{(f_1, \dots, f_N) \in D} \left[ P_n(\ell \circ (f_1 + \dots + f_N)) + \sum_{j=1}^{N} \left( \epsilon_j \|f_j\|_{L_2(\Pi_n)} + \epsilon_j^2 \|f_j\|_{H_j} \right) \right]. \tag{6}
\]
Note that for special choices of the set $D$, for instance, for $D := \{(f_1, \dots, f_N) : f_j \in H_j, \ \|f_j\|_{H_j} \le R_j\}$ for some $R_j > 0$, $j = 1, \dots, N$, one can replace each component $f_j$ involved in the optimization problem by its orthogonal projection in $H_j$ onto the linear span of the functions $\{K_j(\cdot, X_i), \ i = 1, \dots, n\}$ and reduce the problem to a convex optimization over a finite dimensional space (of dimension $nN$).
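
To make the finite dimensional reduction concrete, here is a minimal sketch (not taken from the paper) that evaluates the doubly penalized objective for the quadratic loss when each component is written as a kernel expansion over the sample with a coefficient vector `c`. By the reproducing property, the values of such a component at the sample points are the entries of `G c`, its squared RKHS norm is `c^T G c`, and its squared empirical $L_2$ norm is the average of the squared fitted values, where `G` is the unnormalized Gram matrix of the kernel on the sample. The function names and the list-of-lists matrix representation are assumptions of this example.

```python
import math


def matvec(G, c):
    """Matrix-vector product for a square matrix stored as a list of rows."""
    return [sum(g * x for g, x in zip(row, c)) for row in G]


def penalty_term(G, c, eps):
    """eps * (empirical L2 norm) + eps^2 * (RKHS norm) of f = sum_i c_i K(., X_i)."""
    n = len(c)
    fitted = matvec(G, c)  # values f(X_1), ..., f(X_n)
    emp_l2 = math.sqrt(sum(v * v for v in fitted) / n)
    # c^T G c is nonnegative in exact arithmetic; guard against rounding.
    rkhs = math.sqrt(max(sum(ci * vi for ci, vi in zip(c, fitted)), 0.0))
    return eps * emp_l2 + eps * eps * rkhs


def objective(grams, coefs, eps, y):
    """Empirical quadratic risk of f_1 + ... + f_N plus the double penalty."""
    n = len(y)
    fitted = [matvec(G, c) for G, c in zip(grams, coefs)]
    pred = [sum(f[i] for f in fitted) for i in range(n)]
    risk = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / n
    penalty = sum(penalty_term(G, c, e) for G, c, e in zip(grams, coefs, eps))
    return risk + penalty


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(objective([identity], [[3.0, 4.0]], [0.5], [3.0, 4.0]))
```

The objective is convex in the coefficient vectors, so it can be handed to a generic convex solver; the sketch only evaluates it and does not perform the minimization.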

The complexity penalty in the problem (6) is based on two norms of the components $f_j$ of an additive representation: the empirical $L_2$-norm, $\|f_j\|_{L_2(\Pi_n)}$, with regularization parameter $\epsilon_j$, and an RKHS-norm, $\|f_j\|_{H_j}$, with regularization parameter $\epsilon_j^2$. The empirical $L_2$-norm (the lighter norm) is used to enforce the sparsity of the solution whereas the RKHS norms (the heavier norms) are used to enforce the "smoothness" of the components. This is similar to the approach taken in Meier, van de Geer and Bühlmann (2009) in the context of classical additive models, i.e., in the case when $S := [0, 1]^N$, $H_j := W^{\alpha,2}([0, 1])$ for some smoothness $\alpha > 1/2$ and the space $H_j$ is a space of functions depending on the $j$-th variable. In this case, the regularization parameters $\epsilon_j$ are equal (up to a constant) to $n^{-\alpha/(2\alpha+1)}$. The quantity $\epsilon_j^2$, used in the "smoothness part" of the penalty, coincides with the minimax convergence rate in a one component smooth problem. At the same time, the quantity $\epsilon_j$, used in the "sparsity part" of the penalty, is equal to the square root of the minimax rate (which is similar to the choice of regularization parameter in standard sparse recovery methods such as LASSO). This choice of regularization parameters results in the excess risk of the order $d\, n^{-2\alpha/(2\alpha+1)}$, where $d$ is the number of components of the target function (the degree of sparsity of the problem).

The framework of multiple kernel learning considered in this paper includes many generalized versions of classical additive models. For instance, one can think of the case when $S := [0, 1]^{m_1} \times \dots \times [0, 1]^{m_N}$ and $H_j = W^{\alpha,2}([0, 1]^{m_j})$ is a space of functions depending on the $j$-th block of variables. In this case, a proper choice of regularization parameters (for uniform design distribution) would be $\epsilon_j = n^{-\alpha/(2\alpha+m_j)}$, $j = 1, \dots, N$ (so, these parameters and the error rates for different components of the model are different). It should be also clear from the discussion in Section 1.1 that, if the design distribution $\Pi$ is unknown, the minimax convergence rates for the one component problems are also unknown.
For instance, if the projections of design points on the cubes $[0, 1]^{m_j}$ are distributed in lower dimensional submanifolds of these cubes, then the unknown dimensions of the submanifolds rather than the dimensions $m_j$ would be involved in the minimax rates and in the regularization parameters $\epsilon_j$. Because of this, a data-driven choice of regularization parameters $\epsilon_j$ that provides adaptation to the unknown design distribution $\Pi$ and to the unknown "smoothness" of the components (related to this distribution) is a major issue in multiple kernel learning. From this point of view, even in the case of classical additive models, the choice of regularization

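parameters that is based only on Sobolev type smoothness and ignores the design distribution is not adaptive. Note that, in the infinite dimensional LASSO studied in Koltchinskii and Yuan (2008), the regularization parameter $\epsilon$ is chosen the same way as in the classical LASSO ($\epsilon \asymp \sqrt{\log N / n}$), so, it is not related to the smoothness of the components. However, the oracle inequalities proved in Koltchinskii and Yuan (2008) give the correct size of the excess risk only for special choices of kernels that depend on unknown "smoothness" of the components of the target function $f_*$, so, this method is not adaptive either.

1.3 Adaptive Choice of Regularization Parameters

Denote
\[
\hat K_j := \left( \frac{K_j(X_l, X_k)}{n} \right)_{l,k = 1, \dots, n}.
\]
This $n \times n$ Gram matrix can be viewed as an empirical version of the integral operator $T_{K_j}$ from $L_2(\Pi)$ into $L_2(\Pi)$ with kernel $K_j$. Denote by $\hat\lambda_k^{(j)}$, $k = 1, 2, \dots$, the eigenvalues of $\hat K_j$ arranged in decreasing order. We also use the notation $\lambda_k^{(j)}$, $k = 1, 2, \dots$, for the eigenvalues of the operator $T_{K_j} : L_2(\Pi) \to L_2(\Pi)$ with kernel $K_j$ arranged in decreasing order. Define functions $\breve\gamma_n^{(j)}$, $\hat\gamma_n^{(j)}$,
\[
\breve\gamma_n^{(j)}(\delta) := \left( \frac{1}{n} \sum_{k=1}^{\infty} \left( \lambda_k^{(j)} \wedge \delta^2 \right) \right)^{1/2} \quad \text{and} \quad \hat\gamma_n^{(j)}(\delta) := \left( \frac{1}{n} \sum_{k=1}^{n} \left( \hat\lambda_k^{(j)} \wedge \delta^2 \right) \right)^{1/2},
\]
and, for a fixed given $A \ge 1$, let
\[
\hat\epsilon_j := \inf \left\{ \epsilon \ge \sqrt{\frac{A \log N}{n}} : \hat\gamma_n^{(j)}(\delta) \le \epsilon\delta + \epsilon^2, \ \delta \in (0, 1] \right\}. \tag{7}
\]
One can view $\hat\epsilon_j$ as an empirical estimate of the quantity $\breve\epsilon_j = \breve\epsilon(K_j)$ that (as we have already pointed out) plays a crucial role in the bounds on the excess risk in empirical risk minimization problems in the RKHS context. In fact, since most often $\breve\epsilon_j \ge \sqrt{A \log N / n}$, we will redefine this quantity as
\[
\breve\epsilon_j := \inf \left\{ \epsilon \ge \sqrt{\frac{A \log N}{n}} : \breve\gamma_n^{(j)}(\delta) \le \epsilon\delta + \epsilon^2, \ \delta \in (0, 1] \right\}. \tag{8}
\]
We will use the following values of regularization parameters in problem (6): $\epsilon_j = \tau \hat\epsilon_j$, where $\tau$ is a sufficiently large constant.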
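
The data-driven parameter defined in (7) can be computed from the eigenvalues of the normalized Gram matrix. The sketch below is one possible implementation, not the authors' code. It assumes the eigenvalues have already been obtained with a numerical linear algebra routine, and the names `gamma_hat` and `eps_hat` are introduced here for illustration. It relies on the following observation: between two consecutive values $\sqrt{\hat\lambda_k}$ the empirical function is convex in $\delta$, so the constraint in (7) only needs to be checked at those breakpoints and at $\delta = 1$; feasibility is monotone in $\epsilon$, so bisection applies.

```python
import math


def gamma_hat(eigs, delta, n):
    """sqrt( (1/n) * sum_k min(lambda_hat_k, delta^2) )."""
    return math.sqrt(sum(min(lam, delta * delta) for lam in eigs) / n)


def eps_hat(eigs, n, A, N, tol=1e-12):
    """Smallest eps >= sqrt(A log N / n) with gamma_hat(delta) <= eps*delta + eps^2 on (0, 1].

    `eigs` are the eigenvalues of the n x n matrix (K(X_l, X_k) / n).
    """
    floor = math.sqrt(A * math.log(N) / n)
    # Breakpoints of the piecewise convex function gamma_hat, clipped to (0, 1].
    pts = sorted({min(math.sqrt(lam), 1.0) for lam in eigs if lam > 0} | {1.0})
    gam = [gamma_hat(eigs, d, n) for d in pts]

    def feasible(eps):
        return all(g <= eps * d + eps * eps for g, d in zip(gam, pts))

    if feasible(floor):
        return floor
    lo, hi = floor, max(1.0, 2.0 * floor)
    while not feasible(hi):
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Rank-one example: a constant kernel K = 1 gives one eigenvalue equal to 1.
    eigs = [1.0] + [0.0] * 99
    print(eps_hat(eigs, n=100, A=1.0, N=2))
```

The regularization parameter used in problem (6) is then obtained by multiplying the returned value by the constant $\tau$, whose choice is left to the user in this sketch.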

It should be emphasized that the structure of the complexity penalty and the choice of regularization parameters in (6) are closely related to the following bound on Rademacher processes indexed by functions from an RKHS $H_K$: with a high probability, for all $h \in H_K$,
\[
|R_n(h)| \le C \left[ \breve\epsilon(K) \|h\|_{L_2(\Pi)} + \breve\epsilon^2(K) \|h\|_{H_K} \right].
\]
Such bounds follow from the results of Section 3 and they provide a way to prove sparsity oracle inequalities for the estimators (6). The Rademacher process is defined as
\[
R_n(f) := \frac{1}{n} \sum_{j=1}^{n} \varepsilon_j f(X_j),
\]
where $\{\varepsilon_j\}$ is a sequence of i.i.d. Rademacher random variables (taking values $+1$ and $-1$ with probability $1/2$ each) independent of $\{X_j\}$.

We will use several basic facts of empirical processes theory throughout the paper. They include symmetrization inequalities and contraction (comparison) inequalities for Rademacher processes that can be found in the books of Ledoux and Talagrand (1991) and van der Vaart and Wellner. We also use Talagrand's concentration inequality for empirical processes (see Talagrand 1996, Bousquet).

The main goal of the paper is to establish oracle inequalities for the excess risk of the estimator $\hat f = \hat f_1 + \dots + \hat f_N$. In these inequalities, the excess risk of $\hat f$ is compared with the excess risk of an oracle $f := f_1 + \dots + f_N$, $(f_1, \dots, f_N) \in D$, with an error term depending on the degree of sparsity of the oracle, i.e., on the number of non-zero components $f_j \in H_j$ in its additive representation. The oracle inequalities will be stated in the next section. Their proof relies on probabilistic bounds for empirical $L_2$-norms and data dependent regularization parameters $\hat\epsilon_j$. The results of Section 3 show that they can be bounded by their respective population counterparts. Using these tools and some bounds on empirical processes derived in Section 5, we prove in Section 4 the oracle inequalities for the estimator $\hat f$.

2 Oracle Inequalities

Considering the problem in the case when the domain $D$ of (6) is not bounded, say, $D = H^{(N)}$, leads to additional technical complications and might require some changes in the estimation procedure.
To avoid this, we assume below that $D$ is a bounded convex subset of $H^{(N)}$. It

will be also assumed that, for all $j = 1, \dots, N$, $\sup_{x \in S} K_j(x, x) \le 1$, which, by elementary properties of RKHS, implies that $\|f_j\|_{L_\infty} \le \|f_j\|_{H_j}$, $j = 1, \dots, N$. Because of this,
\[
R_D := \sup_{(f_1, \dots, f_N) \in D} \|f_1 + \dots + f_N\|_{L_\infty} < +\infty.
\]
Denote $R_D^* := R_D \vee \|f_*\|_{L_\infty}$. We will allow the constants involved in the oracle inequalities stated and proved below to depend on the value of $R_D^*$ (so, implicitly, it is assumed that this value is not too large).

We shall also assume that $N$ is large enough, say, so that $\log N \ge 2 \log\log n$. This assumption is not essential to our development and is in place to avoid an extra term of the order $n^{-1} \log\log n$ in our risk bounds.

2.1 Loss Functions of Quadratic Type

We will formulate the assumptions on the loss function $\ell$. The main assumption is that, for all $y \in T$, $\ell(y, \cdot)$ is a nonnegative convex function. In addition, we will assume that $\ell(y, 0)$, $y \in T$, is uniformly bounded from above by a numerical constant. Moreover, suppose that, for all $y \in T$, $\ell(y, \cdot)$ is twice continuously differentiable and its first and second derivatives are uniformly bounded in $T \times [-R_D^*, R_D^*]$. Denote
\[
m(R) := \frac{1}{2} \inf_{y \in T} \inf_{|u| \le R} \frac{\partial^2 \ell(y, u)}{\partial u^2}, \qquad M(R) := \frac{1}{2} \sup_{y \in T} \sup_{|u| \le R} \frac{\partial^2 \ell(y, u)}{\partial u^2} \tag{9}
\]
and let $m_* := m(R_D^*)$, $M_* := M(R_D^*)$. We will assume that $m_* > 0$. Denote
\[
L_* := \sup_{|u| \le R_D^*, \ y \in T} \left| \frac{\partial \ell}{\partial u}(y, u) \right|.
\]
Clearly, for all $y \in T$, the function $\ell(y, \cdot)$ satisfies the Lipschitz condition with constant $L_*$. The constants $m_*$, $M_*$, $L_*$ will appear in a number of places in what follows. Without loss of generality, we can also assume that $m_* \le 1$ and $L_* \ge 1$ (otherwise, $m_*$ and $L_*$ can be replaced by a lower bound and an upper bound, respectively). The loss functions satisfying the assumptions stated above will be called the losses of quadratic type.

If $\ell$ is a loss of quadratic type and $f = f_1 + \dots + f_N$, $(f_1, \dots, f_N) \in D$, then
\[
m_* \|f - f_*\|_{L_2(\Pi)}^2 \le \mathcal{E}(\ell \circ f) \le M_* \|f - f_*\|_{L_2(\Pi)}^2. \tag{10}
\]

This bound easily follows from a simple argument based on Taylor expansion and it will be used later in the paper. If $H$ is dense in $L_2(\Pi)$, then (10) implies that
\[
\inf_{f \in H} P(\ell \circ f) = \inf_{f \in L_2(\Pi)} P(\ell \circ f) = P(\ell \circ f_*). \tag{11}
\]
The quadratic loss $\ell(y, u) := (y - u)^2$ in the case when $T \subset \mathbb{R}$ is a bounded set is one of the main examples of such loss functions. In this case, $m(R) = 1$ for all $R > 0$. In regression problems with a bounded response variable, more general loss functions of the form $\ell(y, u) := \phi(y - u)$ can be also used, where $\phi$ is an even nonnegative convex twice continuously differentiable function with $\phi''$ uniformly bounded in $\mathbb{R}$, $\phi(0) = 0$ and $\phi''(u) > 0$, $u \in \mathbb{R}$. In classification problems, the loss functions of the form $\ell(y, u) = \phi(yu)$ are commonly used, with $\phi$ being a nonnegative decreasing convex twice continuously differentiable function such that, again, $\phi''$ is uniformly bounded in $\mathbb{R}$ and $\phi''(u) > 0$, $u \in \mathbb{R}$. The loss function $\phi(u) = \log(1 + e^{-u})$ (often referred to as the logit loss) is a specific example.

2.2 Geometry of the Dictionary

Now we introduce several important geometric characteristics of dictionaries consisting of kernels (or, equivalently, of RKHS). These characteristics are related to the degree of "dependence" of spaces of random variables $H_j \subset L_2(\Pi)$, $j = 1, \dots, N$, and they will be involved in the oracle inequalities for the excess risk $\mathcal{E}(\ell \circ \hat f)$.

First, for $J \subset \{1, \dots, N\}$ and $b \in [0, +\infty]$, denote
\[
C_J^{(b)} := \left\{ (h_1, \dots, h_N) \in H^{(N)} : \sum_{j \notin J} \|h_j\|_{L_2(\Pi)} \le b \sum_{j \in J} \|h_j\|_{L_2(\Pi)} \right\}.
\]
Clearly, the set $C_J^{(b)}$ is a cone in the space $H^{(N)}$ that consists of vectors $(h_1, \dots, h_N)$ whose components corresponding to $j \in J$ "dominate" the rest of the components. This family of cones increases as $b$ increases. For $b = 0$, $C_J^{(b)}$ coincides with the linear subspace of vectors for which $h_j = 0$, $j \notin J$. For $b = +\infty$, $C_J^{(b)}$ is the whole space $H^{(N)}$.

The following quantity will play the most important role:
\[
\beta_{2,b}(J; \Pi) := \beta_{2,b}(J) := \inf \left\{ \beta > 0 : \left( \sum_{j \in J} \|h_j\|_{L_2(\Pi)}^2 \right)^{1/2} \le \beta \left\| \sum_{j=1}^{N} h_j \right\|_{L_2(\Pi)}, \ (h_1, \dots, h_N) \in C_J^{(b)} \right\}.
\]

Clearly, $\beta_{2,b}(J; \Pi)$ is a nondecreasing function of $b$. In the case of a "simple dictionary" that consists of one-dimensional spaces, similar quantities have been used in the literature on sparse recovery (see, e.g., Koltchinskii 2008, 2009a,b,c).

The quantity $\beta_{2,b}(J; \Pi)$ can be upper bounded in terms of some other geometric characteristics that describe how "dependent" the spaces of random variables $H_j \subset L_2(\Pi)$ are. These characteristics will be introduced below.

Given $h_j \in H_j$, $j = 1, \dots, N$, denote by $\kappa(\{h_j : j \in J\})$ the minimal eigenvalue of the Gram matrix $\left( \langle h_j, h_k \rangle_{L_2(\Pi)} \right)_{j,k \in J}$. Let
\[
\kappa(J) := \inf \left\{ \kappa(\{h_j : j \in J\}) : h_j \in H_j, \ \|h_j\|_{L_2(\Pi)} = 1 \right\}. \tag{13}
\]
We will also use the notation
\[
H_J = \mathrm{l.s.}\left( \bigcup_{j \in J} H_j \right).
\]
The following quantity is the maximal cosine of the angle in the space $L_2(\Pi)$ between the vectors in the subspaces $H_I$ and $H_J$ for some $I, J \subset \{1, \dots, N\}$:
\[
\rho(I, J) := \sup \left\{ \frac{\langle f, g \rangle_{L_2(\Pi)}}{\|f\|_{L_2(\Pi)} \|g\|_{L_2(\Pi)}} : f \in H_I, \ g \in H_J, \ f \ne 0, \ g \ne 0 \right\}.
\]
Denote $\rho(J) := \rho(J, J^c)$. The quantities $\rho(I, J)$ and $\rho(J)$ are very similar to the notion of canonical correlation in multivariate statistical analysis.

There are other important geometric characteristics, frequently used in the theory of sparse recovery, including the so called "restricted isometry constants" of Candes and Tao. Define $\delta_d(\Pi)$ to be the smallest $\delta > 0$ such that for all $(h_1, \dots, h_N) \in H^{(N)}$ and all $J \subset \{1, \dots, N\}$ with $\mathrm{card}(J) = d$,
\[
(1 - \delta) \left( \sum_{j \in J} \|h_j\|_{L_2(\Pi)}^2 \right)^{1/2} \le \left\| \sum_{j \in J} h_j \right\|_{L_2(\Pi)} \le (1 + \delta) \left( \sum_{j \in J} \|h_j\|_{L_2(\Pi)}^2 \right)^{1/2}.
\]
This condition with a sufficiently small value of $\delta_d(\Pi)$ means that for all choices of $J$ with $\mathrm{card}(J) = d$ the functions in the spaces $H_j$, $j \in J$, are "almost orthogonal" in $L_2(\Pi)$.

The following simple proposition easily follows from some statements in Koltchinskii (2009a,b, 2008), where the case of simple dictionaries consisting of one-dimensional spaces $H_j$ was considered.

Proposition 1 For all $J \subset \{1, \dots, N\}$,
\[
\beta_{2,\infty}(J; \Pi) \le \frac{1}{\sqrt{\kappa(J)(1 - \rho^2(J))}}.
\]
Also, if $\mathrm{card}(J) = d$ and $\delta_{3d}(\Pi) \le \frac{1}{8b}$, then $\beta_{2,b}(J; \Pi) \le 4$.

Thus, such quantities as $\beta_{2,\infty}(J; \Pi)$ or $\beta_{2,b}(J; \Pi)$, for finite values of $b$, are reasonably small provided that the spaces of random variables $H_j$, $j = 1, \dots, N$, satisfy proper conditions of "weakness of correlations".

2.3 Excess Risk Bounds

We are now in a position to formulate our main theorems that provide oracle inequalities for the excess risk $\mathcal{E}(\ell \circ \hat f)$. In these theorems, $\mathcal{E}(\ell \circ \hat f)$ will be compared with the excess risk $\mathcal{E}(\ell \circ f)$ of an oracle $(f_1, \dots, f_N) \in D$. Here and in what follows, $f := f_1 + \dots + f_N \in H$. This is a slight abuse of notation: we are ignoring the fact that such an additive representation of a function $f \in H$ is not necessarily unique. In some sense, $f$ denotes both the vector $(f_1, \dots, f_N) \in H^{(N)}$ and the function $f_1 + \dots + f_N \in H$. However, this is not going to cause confusion in what follows. We will also use the following notations:
\[
J_f := \{1 \le j \le N : f_j \ne 0\} \quad \text{and} \quad d(f) := \mathrm{card}(J_f).
\]
The error terms of the oracle inequalities will depend on the quantities $\breve\epsilon_j = \breve\epsilon(K_j)$ related to the "smoothness" properties of the RKHS and also on the geometric characteristics of the dictionary introduced above.

In the first theorem, we will use the quantity $\beta_{2,\infty}(J_f; \Pi)$ to characterize the properties of the dictionary. In this case, there will be no assumptions on the quantities $\breve\epsilon_j$: these quantities could be of different order for different kernel machines, so, different components of the additive representation could have different "smoothness". In the second theorem, we will use a smaller quantity $\beta_{2,b}(J; \Pi)$ for a proper choice of parameter $b < \infty$. In this case, we will have to make an additional assumption that $\breve\epsilon_j$, $j = 1, \dots, N$, are all of the same order (up to a constant).

In both cases, we consider the penalized empirical risk minimization problem (6) with data-dependent regularization parameters $\epsilon_j = \tau \hat\epsilon_j$, where $\hat\epsilon_j$, $j = 1, \dots, N$, are defined by (7) with some $A \ge 4$ and $\tau \ge B L_*$ for a numerical constant $B$.

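Theorem 2 There exist numerical constants $C_1, C_2 > 0$ such that, for all oracles $(f_1, \dots, f_N) \in D$, with probability at least $1 - 3N^{-A/2}$,
\[
\mathcal{E}(\ell \circ \hat f) + C_1 \left( \tau \sum_{j=1}^{N} \breve\epsilon_j \|\hat f_j - f_j\|_{L_2(\Pi)} + \tau^2 \sum_{j=1}^{N} \breve\epsilon_j^2 \|\hat f_j\|_{H_j} \right) \le 2\mathcal{E}(\ell \circ f) + C_2 \tau^2 \sum_{j \in J_f} \breve\epsilon_j^2 \left( \frac{\beta_{2,\infty}^2(J_f, \Pi)}{m_*} + \|f_j\|_{H_j} \right). \tag{15}
\]

This result means that if there exists an oracle $(f_1, \dots, f_N) \in D$ such that (a) the excess risk $\mathcal{E}(\ell \circ f)$ is small; (b) the spaces $H_j$, $j \in J_f$, are not strongly correlated with the spaces $H_j$, $j \notin J_f$; (c) $H_j$, $j \in J_f$, are "well posed" in the sense that $\kappa(J_f)$ is not too small; (d) $\|f_j\|_{H_j}$, $j \in J_f$, are all bounded by a reasonable constant, then the excess risk $\mathcal{E}(\ell \circ \hat f)$ is essentially controlled by $\sum_{j \in J_f} \breve\epsilon_j^2$. At the same time, the oracle inequality provides a bound on the $L_2(\Pi)$-distances between the estimated components $\hat f_j$ and the components of the oracle (of course, everything is under the assumption that the loss is of quadratic type and $m_*$ is bounded away from $0$).

Note also that the constant $2$ in front of the excess risk of the oracle $\mathcal{E}(\ell \circ f)$ can be replaced by $1 + \delta$ for any $\delta > 0$ with minor modifications of the proof (in this case, the constant $C_2$ depends on $\delta$ and is of the order $1/\delta$).

Suppose now that there exist $\breve\epsilon > 0$ and a constant $\Lambda > 0$ such that
\[
\Lambda^{-1} \le \frac{\breve\epsilon_j}{\breve\epsilon} \le \Lambda, \quad j = 1, \dots, N.
\]

Theorem 3 There exist numerical constants $C_1, C_2, b > 0$ such that, for all oracles $(f_1, \dots, f_N) \in D$, with probability at least $1 - 3N^{-A/2}$,
\[
\mathcal{E}(\ell \circ \hat f) + C_1 \left( \frac{\tau \breve\epsilon}{\Lambda} \sum_{j=1}^{N} \|\hat f_j - f_j\|_{L_2(\Pi)} + \tau^2 \breve\epsilon^2 \sum_{j=1}^{N} \|\hat f_j\|_{H_j} \right) \le 2\mathcal{E}(\ell \circ f) + C_2 \Lambda \tau^2 \breve\epsilon^2 \left( \frac{\beta_{2,b\Lambda^2}^2(J_f, \Pi)}{m_*} d(f) + \sum_{j \in J_f} \|f_j\|_{H_j} \right).
\]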

As before, the constant $2$ in the upper bound can be replaced by $1 + \delta$, but, in this case, the constants $C_2$ and $b$ would be of the order $1/\delta$. The meaning of this result is that if there exists an oracle $(f_1, \dots, f_N) \in D$ such that (a) the excess risk $\mathcal{E}(\ell \circ f)$ is small; (b) the "restricted isometry" constant $\delta_{3d}(\Pi)$ is small for $d = d(f)$; (c) $\|f_j\|_{H_j}$, $j \in J_f$, are all bounded by a reasonable constant, then the excess risk $\mathcal{E}(\ell \circ \hat f)$ is essentially controlled by $d(f) \breve\epsilon^2$. At the same time, the distance $\sum_{j=1}^{N} \|\hat f_j - f_j\|_{L_2(\Pi)}$ between the estimator and the oracle is controlled by $d(f) \breve\epsilon$. In particular, this implies that the empirical solution $(\hat f_1, \dots, \hat f_N)$ is "approximately sparse" in the sense that $\sum_{j \notin J_f} \|\hat f_j\|_{L_2(\Pi)}$ is of the order $d(f) \breve\epsilon$.

Remarks.

1. It is easy to check that Theorems 2 and 3 also hold if one replaces $N$ in the definitions (7) of $\hat\epsilon_j$ and (8) of $\breve\epsilon_j$ by an arbitrary $\bar N \ge N$ such that $\log \bar N \ge 2 \log\log n$ (a similar condition on $N$ introduced early in Section 2 is not needed here). In this case, the probability bounds in the theorems become $1 - 3\bar N^{-A/2}$. This change might be of interest if one uses the results for a dictionary consisting of just one RKHS ($N = 1$), which is not the focus of this paper.

2. If the distribution dependent quantities $\breve\epsilon_j$, $j = 1, \dots, N$, are known and used as regularization parameters in (6), the oracle inequalities of Theorems 2 and 3 also hold (with obvious simplifications of their proofs). For instance, in the case when $S = [0, 1]^N$, the design distribution $\Pi$ is uniform and, for each $j = 1, \dots, N$, $H_j$ is a Sobolev space of functions of smoothness $\alpha > 1/2$ depending only on the $j$-th variable, we have $\breve\epsilon_j \asymp n^{-\alpha/(2\alpha+1)}$. Taking in this case
\[
\epsilon_j = \tau \left( n^{-\alpha/(2\alpha+1)} \vee \sqrt{\frac{A \log N}{n}} \right)
\]
would lead to oracle inequalities for sparse additive models in the spirit of Meier, van de Geer and Bühlmann (2009). More precisely, if $H_j := \{h \in W^{\alpha,2}[0, 1] : \int_0^1 h(x)\,dx = 0\}$, then, for the uniform distribution $\Pi$, the spaces $H_j$ are orthogonal in $L_2(\Pi)$ (recall that $H_j$ is viewed as a space of functions depending on the $j$-th coordinate).
Assume, for simplicity, that $\ell$ is the quadratic loss and that the regression function $f_*$ can be represented as $f_* = \sum_{j \in J} f_{*,j}$, where $J$ is a subset of $\{1, \dots, N\}$ of cardinality $d$ and $\|f_{*,j}\|_{H_j} \le 1$. Then it easily follows

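from the bound of Theorem 3 that with probability at least $1 - 3N^{-A/2}$
\[
\mathcal{E}(\ell \circ \hat f) = \|\hat f - f_*\|_{L_2(\Pi)}^2 \le C \tau^2 d \left( n^{-2\alpha/(2\alpha+1)} \vee \frac{A \log N}{n} \right).
\]
Note that, up to a constant, this essentially coincides with the minimax lower bound in this type of problem obtained recently by Raskutti, Wainwright and Yu. Of course, if the design distribution is not necessarily uniform, an adaptive choice of regularization parameters might be needed even in such simple examples and the approach described above leads to minimax optimal rates.

3 Preliminary Bounds

In this section, the case of a single RKHS $H_K$ associated with a kernel $K$ is considered. We assume that $K(x, x) \le 1$, $x \in S$. This implies that, for all $h \in H_K$, $\|h\|_{L_2(\Pi)} \le \|h\|_{L_\infty} \le \|h\|_{H_K}$.

3.1 Comparison of $\|\cdot\|_{L_2(\Pi_n)}$ and $\|\cdot\|_{L_2(\Pi)}$

First, we study the relationship between the empirical and the population $L_2$ norms for functions in $H_K$.

Theorem 4 Assume that $A \ge 1$ and $\log N \ge 2 \log\log n$. Then there exists a numerical constant $C > 0$ such that with probability at least $1 - N^{-A}$ for all $h \in H_K$
\[
\|h\|_{L_2(\Pi)} \le C \left( \|h\|_{L_2(\Pi_n)} + \bar\epsilon \|h\|_{H_K} \right); \tag{17}
\]
\[
\|h\|_{L_2(\Pi_n)} \le C \left( \|h\|_{L_2(\Pi)} + \bar\epsilon \|h\|_{H_K} \right), \tag{18}
\]
where
\[
\bar\epsilon = \bar\epsilon(K) := \inf \left\{ \epsilon \ge \sqrt{\frac{A \log N}{n}} : \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ \|h\|_{L_2(\Pi)} \le \delta} |R_n(h)| \le \epsilon\delta + \epsilon^2, \ \delta \in (0, 1] \right\}. \tag{19}
\]

Proof. Observe that the inequalities hold trivially when $h = 0$. We shall therefore consider only the case when $h \ne 0$. By the symmetrization inequality,
\[
\mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| \le 2\, \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |R_n(h^2)|, \tag{20}
\]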

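and, by the contraction inequality, we further have
\[
\mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| \le 8\, \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |R_n(h)|. \tag{21}
\]
The definition of $\bar\epsilon$ implies that
\[
\mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| \le 8\, \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |R_n(h)| \le 8 \left( \bar\epsilon\, 2^{-j+1} + \bar\epsilon^2 \right). \tag{22}
\]
An application of Talagrand's concentration inequality yields
\[
\sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| \le 2 \left( \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| + 2^{-j+1} \sqrt{\frac{t + 2\log j}{n}} + \frac{t + 2\log j}{n} \right)
\]
\[
\le 32 \left( \bar\epsilon\, 2^{-j} + \bar\epsilon^2 + 2^{-j} \sqrt{\frac{t + 2\log j}{n}} + \frac{t + 2\log j}{n} \right)
\]
with probability at least $1 - \exp(-t - 2\log j)$ for any natural number $j$. Now, by the union bound, for all $j$ such that $2\log j \le t$,
\[
\sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| \le 32 \left( \bar\epsilon\, 2^{-j} + \bar\epsilon^2 + 2^{-j} \sqrt{\frac{t + 2\log j}{n}} + \frac{t + 2\log j}{n} \right) \tag{23}
\]
with probability at least
\[
1 - \sum_{j : 2\log j \le t} \exp(-t - 2\log j) = 1 - \exp(-t) \sum_{j : 2\log j \le t} j^{-2} \ge 1 - 2\exp(-t). \tag{24}
\]
Recall that $\bar\epsilon \ge (A \log N / n)^{1/2}$ and $\|h\|_{L_2(\Pi)} \le \|h\|_{H_K}$. Taking $t = A \log N + \log 4$, we easily get that, for all $h \in H_K$ such that $\|h\|_{H_K} = 1$ and $\|h\|_{L_2(\Pi)} \ge \exp\{-N^{A/2}\}$,
\[
|(\Pi_n - \Pi) h^2| \le C \left( \bar\epsilon \|h\|_{L_2(\Pi)} + \bar\epsilon^2 \right) \tag{25}
\]
with probability at least $1 - 0.5 N^{-A}$ and with a numerical constant $C > 0$. In other words, with the same probability, for all $h \in H_K$ such that $\frac{\|h\|_{L_2(\Pi)}}{\|h\|_{H_K}} \ge \exp\{-N^{A/2}\}$,
\[
|(\Pi_n - \Pi) h^2| \le C \left( \bar\epsilon \|h\|_{L_2(\Pi)} \|h\|_{H_K} + \bar\epsilon^2 \|h\|_{H_K}^2 \right). \tag{26}
\]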

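Therefore, for all $h \in H_K$ such that
\[
\frac{\|h\|_{L_2(\Pi)}}{\|h\|_{H_K}} > \exp(-N^{A/2}) \tag{27}
\]
we have
\[
\|h\|_{L_2(\Pi)}^2 = \Pi h^2 \le \|h\|_{L_2(\Pi_n)}^2 + C \left( \bar\epsilon \|h\|_{L_2(\Pi)} \|h\|_{H_K} + \bar\epsilon^2 \|h\|_{H_K}^2 \right),
\]
\[
\|h\|_{L_2(\Pi_n)}^2 = \Pi_n h^2 \le \|h\|_{L_2(\Pi)}^2 + C \left( \bar\epsilon \|h\|_{L_2(\Pi)} \|h\|_{H_K} + \bar\epsilon^2 \|h\|_{H_K}^2 \right).
\]
It can now be deduced that, for a proper value of the numerical constant $C$,
\[
\|h\|_{L_2(\Pi)} \le C \left( \|h\|_{L_2(\Pi_n)} + \bar\epsilon \|h\|_{H_K} \right) \quad \text{and} \quad \|h\|_{L_2(\Pi_n)} \le C \left( \|h\|_{L_2(\Pi)} + \bar\epsilon \|h\|_{H_K} \right). \tag{28}
\]
It remains to consider the case when
\[
\frac{\|h\|_{L_2(\Pi)}}{\|h\|_{H_K}} \le \exp(-N^{A/2}). \tag{29}
\]
Following a similar argument as before, with probability at least $1 - 0.5 N^{-A}$,
\[
\sup_{\|h\|_{H_K} = 1, \ \|h\|_{L_2(\Pi)} \le \exp(-N^{A/2})} |(\Pi_n - \Pi) h^2| \le 16 \left( \bar\epsilon \exp(-N^{A/2}) + \bar\epsilon^2 + \exp(-N^{A/2}) \sqrt{\frac{A \log N}{n}} + \frac{A \log N}{n} \right).
\]
Under the conditions $A \ge 1$, $\log N \ge 2 \log\log n$,
\[
\bar\epsilon \ge \left( \frac{A \log N}{n} \right)^{1/2} \ge \exp(-N^{A/2}). \tag{30}
\]
Then
\[
\sup_{\|h\|_{H_K} = 1, \ \|h\|_{L_2(\Pi)} \le \exp(-N^{A/2})} |(\Pi_n - \Pi) h^2| \le C \bar\epsilon^2
\]
with probability at least $1 - 0.5 N^{-A}$, which also implies (17) and (18), and the result follows.

Theorem 4 shows that the two norms $\|h\|_{L_2(\Pi_n)}$ and $\|h\|_{L_2(\Pi)}$ are of the same order up to an error term $\bar\epsilon \|h\|_{H_K}$.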

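3.2 Comparison of $\hat\epsilon(K)$, $\bar\epsilon(K)$, $\tilde\epsilon(K)$ and $\breve\epsilon(K)$

Recall the definitions
\[
\breve\gamma_n(\delta) := \left( \frac{1}{n} \sum_{k=1}^{\infty} (\lambda_k \wedge \delta^2) \right)^{1/2}, \quad \delta \in (0, 1],
\]
where $\{\lambda_k\}$ are the eigenvalues of the integral operator $T_K$ from $L_2(\Pi)$ into $L_2(\Pi)$ with kernel $K$, and, for some $A \ge 1$,
\[
\breve\epsilon(K) := \inf \left\{ \epsilon \ge \sqrt{\frac{A \log N}{n}} : \breve\gamma_n(\delta) \le \epsilon\delta + \epsilon^2, \ \delta \in (0, 1] \right\}.
\]
It follows from Lemma 42 of Mendelson (2002) (with an additional application of the Cauchy-Schwarz inequality for the upper bound and the Hoffmann-Jørgensen inequality for the lower bound, see also Koltchinskii 2008) that, for some numerical constants $C_1, C_2 > 0$,
\[
C_1 \left( \left( \frac{1}{n} \sum_{k=1}^{\infty} (\lambda_k \wedge \delta^2) \right)^{1/2} - n^{-1} \right) \le \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ \|h\|_{L_2(\Pi)} \le \delta} |R_n(h)| \le C_2 \left( \frac{1}{n} \sum_{k=1}^{\infty} (\lambda_k \wedge \delta^2) \right)^{1/2}. \tag{32}
\]
This fact and the definitions of $\breve\epsilon(K)$, $\bar\epsilon(K)$ easily imply the following result.

Proposition 5 Under the condition $K(x, x) \le 1$, $x \in S$, there exist numerical constants $C_1, C_2 > 0$ such that
\[
C_1 \breve\epsilon(K) \le \bar\epsilon(K) \le C_2 \breve\epsilon(K). \tag{33}
\]

If $K$ is the kernel of the projection operator onto a finite-dimensional subspace $H_K$ of $L_2(\Pi)$, it is easy to check that $\breve\epsilon(K) \asymp \sqrt{\dim(H_K)/n}$ (recall the notation $a \asymp b$, which means that there exists a numerical constant $c > 0$ such that $c^{-1} \le a/b \le c$). If the eigenvalues $\lambda_k$ decay at a polynomial rate, i.e., $\lambda_k \asymp k^{-2\beta}$ for some $\beta > 1/2$, then $\breve\epsilon(K) \asymp n^{-\beta/(2\beta+1)}$.

Recall the notation
\[
\hat\epsilon(K) := \inf \left\{ \epsilon \ge \sqrt{\frac{A \log N}{n}} : \left( \frac{1}{n} \sum_{k=1}^{n} (\hat\lambda_k \wedge \delta^2) \right)^{1/2} \le \epsilon\delta + \epsilon^2, \ \delta \in (0, 1] \right\}, \tag{34}
\]
where $\{\hat\lambda_k\}$ denote the eigenvalues of the Gram matrix $\hat K := \left( n^{-1} K(X_i, X_j) \right)_{i,j = 1, \dots, n}$. It follows again from the results of Mendelson (2002) [namely, one can follow the proof of Lemma 42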

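in the case when the RKHS $H_K$ is restricted to the sample $X_1,\dots,X_n$ and the expectations are conditional on the sample; then one uses the Cauchy-Schwarz and Hoffmann-Jørgensen inequalities as in the proof of (32)] that, for some numerical constants $C_1, C_2>0$,
$$C_1\Bigl(\frac1n\sum_{k=1}^{n}(\hat\lambda_k\wedge\delta^2)\Bigr)^{1/2} - \frac1n \le E_\varepsilon\sup_{\|h\|_{H_K}=1,\ \|h\|_{L_2(\Pi_n)}\le\delta}|R_n(h)| \le C_2\Bigl(\frac1n\sum_{k=1}^{n}(\hat\lambda_k\wedge\delta^2)\Bigr)^{1/2}, \tag{35}$$
where $E_\varepsilon$ indicates that the expectation is taken over the Rademacher random variables only (conditionally on $X_1,\dots,X_n$). Therefore, if we denote by
$$\tilde\epsilon(K) := \inf\Bigl\{\epsilon\ge\sqrt{\frac{A\log N}{n}} : E_\varepsilon\sup_{\|h\|_{H_K}=1,\ \|h\|_{L_2(\Pi_n)}\le\delta}|R_n(h)|\le\epsilon\delta+\epsilon^2,\ \forall\delta\in(0,1]\Bigr\} \tag{36}$$
the empirical version of $\breve\epsilon(K)$, then $\hat\epsilon(K)\asymp\tilde\epsilon(K)$. We will now show that $\tilde\epsilon(K)\asymp\breve\epsilon(K)$ with a high probability.

Theorem 6 Suppose that $A\ge 1$ and $\log N\ge 2\log\log n$. There exist numerical constants $C_1, C_2>0$ such that
$$C_1\breve\epsilon(K)\le\tilde\epsilon(K)\le C_2\breve\epsilon(K), \tag{37}$$
with probability at least $1-N^{-A}$.

Proof. Let $t := A\log N+\log 14$. It follows from Talagrand's concentration inequality that
$$E\sup_{\|h\|_{H_K}=1,\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}|R_n(h)| \le 2\Bigl(\sup_{\|h\|_{H_K}=1,\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}|R_n(h)| + 2^{-j+1}\sqrt{\frac{t+2\log j}{n}} + \frac{t+2\log j}{n}\Bigr)$$
with probability at least $1-\exp\{-t-2\log j\}$. On the other hand, as derived in the proof of Theorem 4 (see (23)),
$$\sup_{\|h\|_{H_K}=1,\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}|(\Pi_n-\Pi)h^2| \le C\Bigl(\breve\epsilon\,2^{-j}+\breve\epsilon^2 + 2^{-j}\sqrt{\frac{t+2\log j}{n}} + \frac{t+2\log j}{n}\Bigr) \tag{38}$$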

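with probability at least $1-\exp\{-t-2\log j\}$. We will use these bounds only for $j$ such that $2\log j\le t$. In this case, the second bound implies that, for some numerical constant $c>0$ and all $h$ satisfying the conditions $\|h\|_{H_K}=1$, $2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}$, we have $\|h\|_{L_2(\Pi_n)}\le c(2^{-j}+\breve\epsilon)$ (again, see the proof of Theorem 4). Combining these bounds, we get that with probability at least $1-2\exp\{-t-2\log j\}$,
$$E\sup_{\|h\|_{H_K}=1,\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}|R_n(h)| \le 2\Bigl(\sup_{\|h\|_{H_K}=1,\ \|h\|_{L_2(\Pi_n)}\le c\delta_j}|R_n(h)| + 2^{-j+1}\sqrt{\frac{t+2\log j}{n}} + \frac{t+2\log j}{n}\Bigr),$$
where $\delta_j=\breve\epsilon+2^{-j}$. Applying now Talagrand's concentration inequality to the Rademacher process conditionally on the observed data $X_1,\dots,X_n$ yields
$$\sup_{\|h\|_{H_K}=1,\ \|h\|_{L_2(\Pi_n)}\le c\delta_j}|R_n(h)| \le 2\Bigl(E_\varepsilon\sup_{\|h\|_{H_K}=1,\ \|h\|_{L_2(\Pi_n)}\le c\delta_j}|R_n(h)| + C\delta_j\sqrt{\frac{t+2\log j}{n}} + \frac{t+2\log j}{n}\Bigr),$$
with conditional probability at least $1-\exp\{-t-2\log j\}$. From this and from the previous bound it is not hard to deduce that, for some numerical constants $C, C'$ and for all $j$ such that $2\log j\le t$,
$$E\sup_{\|h\|_{H_K}=1,\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}|R_n(h)| \le C'\Bigl(E_\varepsilon\sup_{\|h\|_{H_K}=1,\ \|h\|_{L_2(\Pi_n)}\le c\delta_j}|R_n(h)| + \delta_j\sqrt{\frac{t+2\log j}{n}} + \frac{t+2\log j}{n}\Bigr)$$
$$\le C\bigl(\tilde\epsilon\,\delta_j+\tilde\epsilon^2\bigr) \le C\bigl(\tilde\epsilon\,2^{-j}+\tilde\epsilon\,\breve\epsilon+\tilde\epsilon^2\bigr)$$
with probability at least $1-3\exp\{-t-2\log j\}$. In obtaining the second inequality, we used the definition of $\tilde\epsilon$ and the fact that, for $t=A\log N+\log 14$ and $2\log j\le t$, $c_1\tilde\epsilon\ge\bigl((t+2\log j)/n\bigr)^{1/2}$, where $c_1$ is a numerical constant. Now, by the union bound, the above inequality holds with probability at least
$$1-3\sum_{j:\,2\log j\le t}\exp\{-t-2\log j\} \ge 1-6\exp\{-t\} \tag{39}$$
for all $j$ such that $2\log j\le t$ simultaneously. Similarly, it can be shown that
$$E\sup_{\|h\|_{H_K}=1,\ \|h\|_{L_2(\Pi)}\le\exp\{-N^{A/2}\}}|R_n(h)| \le C\bigl(\tilde\epsilon\exp\{-N^{A/2}\}+\tilde\epsilon\,\breve\epsilon+\tilde\epsilon^2\bigr)$$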

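with probability at least $1-\exp\{-t\}$. For $t=A\log N+\log 14$, we get
$$E\sup_{\|h\|_{H_K}=1,\ \|h\|_{L_2(\Pi)}\le\delta}|R_n(h)| \le C\bigl(\tilde\epsilon\,\delta+\tilde\epsilon\,\breve\epsilon+\tilde\epsilon^2\bigr), \tag{40}$$
for all $0<\delta\le 1$, with probability at least $1-7\exp\{-t\}=1-N^{-A}/2$. Now by the definition of $\breve\epsilon$, we obtain
$$\breve\epsilon \le C\max\bigl\{\tilde\epsilon,\ (\tilde\epsilon\,\breve\epsilon+\tilde\epsilon^2)^{1/2}\bigr\}, \tag{41}$$
which implies that $\breve\epsilon\le C\tilde\epsilon$ with probability at least $1-N^{-A}/2$. Similarly one can show that
$$E_\varepsilon\sup_{\|h\|_{H_K}=1,\ \|h\|_{L_2(\Pi_n)}\le\delta}|R_n(h)| \le C\bigl(\breve\epsilon\,\delta+\breve\epsilon\,\tilde\epsilon+\breve\epsilon^2\bigr), \tag{42}$$
for all $0<\delta\le 1$, with probability at least $1-N^{-A}/2$, which implies that $\tilde\epsilon\le C\breve\epsilon$ with probability at least $1-N^{-A}/2$. The proof can then be completed by the union bound.

Define
$$\check\epsilon := \check\epsilon(K) := \inf\Bigl\{\epsilon\ge\sqrt{\frac{A\log N}{n}} : \sup_{\|h\|_{H_K}=1,\ \|h\|_{L_2(\Pi)}\le\delta}|R_n(h)|\le\epsilon\delta+\epsilon^2,\ \forall\delta\in(0,1]\Bigr\}. \tag{43}$$
The next statement can be proved similarly to Theorem 6.

Theorem 7 There exist numerical constants $C_1, C_2>0$ such that
$$C_1\breve\epsilon(K)\le\check\epsilon(K)\le C_2\breve\epsilon(K), \tag{44}$$
with probability at least $1-N^{-A}$.

Suppose now that $\{K_1,\dots,K_N\}$ is a dictionary of kernels. Recall that $\bar\epsilon_j=\bar\epsilon(K_j)$, $\hat\epsilon_j=\hat\epsilon(K_j)$ and $\check\epsilon_j=\check\epsilon(K_j)$.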
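The quantity $\hat\epsilon_j=\hat\epsilon(K_j)$ depends on the data only through the eigenvalues of the Gram matrix, so it can be evaluated numerically. The following is a minimal sketch, not the authors' implementation. It assumes the Gram matrix is normalized by $1/n$, and it replaces the condition "for all $\delta\in(0,1]$" by a finite geometric grid of $\delta$ values, so the result slightly underestimates the infimum. For a fixed $\delta$, the constraint $g(\delta)\le\epsilon\delta+\epsilon^2$ is equivalent to $\epsilon\ge\bigl(-\delta+\sqrt{\delta^2+4g(\delta)}\bigr)/2$, which the code uses. The function names `gamma_hat` and `eps_hat` and the parameter `num_delta` are hypothetical.

```python
import numpy as np

def gamma_hat(lam, deltas, n):
    """Return sqrt((1/n) * sum_k min(lam_k, delta^2)) for each delta in deltas."""
    lam = np.asarray(lam, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    truncated = np.minimum(lam[None, :], deltas[:, None] ** 2)
    return np.sqrt(truncated.sum(axis=1) / n)

def eps_hat(gram, A, N, num_delta=512):
    """Grid approximation of the data-driven regularization level.

    gram : (n, n) symmetric kernel matrix with entries K(X_i, X_j)
    A, N : constant A >= 1 and dictionary size N from the definition
    """
    n = gram.shape[0]
    # Eigenvalues of the normalized Gram matrix (assumed 1/n scaling);
    # tiny negative values caused by round-off are clipped to zero.
    lam = np.clip(np.linalg.eigvalsh(gram / n), 0.0, None)
    floor = np.sqrt(A * np.log(N) / n)
    # Finite grid standing in for "all delta in (0, 1]" (an approximation).
    deltas = np.geomspace(1e-6, 1.0, num_delta)
    g = gamma_hat(lam, deltas, n)
    # Smallest eps with eps * delta + eps^2 >= g(delta), for each delta.
    needed = 0.5 * (-deltas + np.sqrt(deltas ** 2 + 4.0 * g))
    return max(floor, float(needed.max()))
```

A call such as `eps_hat(G, A=4.0, N=10)` with `G` a Gaussian-kernel Gram matrix returns a value no smaller than $\sqrt{A\log N/n}$; a finer grid can be requested through `num_delta`.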

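It follows from Theorems 4, 6, 7 and the union bound that with probability at least $1-3N^{-A+1}$, for all $j=1,\dots,N$,
$$\|h\|_{L_2(\Pi)}\le C\bigl(\|h\|_{L_2(\Pi_n)}+\bar\epsilon_j\|h\|_{H_j}\bigr),\quad \|h\|_{L_2(\Pi_n)}\le C\bigl(\|h\|_{L_2(\Pi)}+\bar\epsilon_j\|h\|_{H_j}\bigr),\quad h\in H_j, \tag{45}$$
$$C_1\bar\epsilon_j\le\hat\epsilon_j\le C_2\bar\epsilon_j \quad\text{and}\quad C_1\bar\epsilon_j\le\check\epsilon_j\le C_2\bar\epsilon_j. \tag{46}$$
Note also that
$$3N^{-A+1}=\exp\{-(A-1)\log N+\log 3\}\le\exp\{-(A/2)\log N\}=N^{-A/2},$$
provided that $A\ge 4$ and $N\ge 3$. Thus, under these additional constraints, (45) and (46) hold for all $j=1,\dots,N$ with probability at least $1-N^{-A/2}$.

4 Proofs of the Oracle Inequalities

For an arbitrary set $J\subset\{1,\dots,N\}$ and $b\in[0,+\infty]$, denote
$$K_J^{(b)} := \Bigl\{(f_1,\dots,f_N)\in H^{(N)} : \sum_{j\notin J}\bar\epsilon_j\|f_j\|_{L_2(\Pi)}\le b\sum_{j\in J}\bar\epsilon_j\|f_j\|_{L_2(\Pi)}\Bigr\} \tag{47}$$
and let
$$\beta_b(J) = \inf\Bigl\{\beta\ge 0 : \sum_{j\in J}\bar\epsilon_j\|f_j\|_{L_2(\Pi)}\le\beta\,\|f_1+\dots+f_N\|_{L_2(\Pi)},\ (f_1,\dots,f_N)\in K_J^{(b)}\Bigr\}. \tag{48}$$
It is easy to see that, for all nonempty sets $J$, $\beta_b(J)\ge\max_{j\in J}\bar\epsilon_j\ge\sqrt{A\log N/n}$. Theorems 2 and 3 will be easily deduced from the following technical result.

Theorem 8 There exist numerical constants $C_1, C_2, B>0$ and $b>0$ such that, for all $\tau\ge BL_*$ in the definition of $\epsilon_j=\tau\hat\epsilon_j$, $j=1,\dots,N$, and for all oracles $(f_1,\dots,f_N)\in D$,
$$\mathcal{E}(\ell\circ\hat f) + C_1\Bigl(\tau\sum_{j=1}^N\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} + \tau^2\sum_{j=1}^N\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \tag{49}$$
$$\le 2\mathcal{E}(\ell\circ f) + C_2\tau^2\Bigl(\sum_{j\in J_f}\bar\epsilon_j^2\|f_j\|_{H_j} + \frac{\beta_b^2(J_f)}{m_*}\Bigr) \tag{50}$$
with probability at least $1-3N^{-A/2}$. Here $A\ge 4$ is a constant involved in the definitions of $\bar\epsilon_j$, $\hat\epsilon_j$, $j=1,\dots,N$.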
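The probability bookkeeping behind the constraint $A\ge 4$, $N\ge 3$ rests on the elementary inequality $3N^{-A+1}\le N^{-A/2}$, which is equivalent to $3\le N^{A/2-1}$ and holds with equality at $A=4$, $N=3$. The short sketch below checks it on the log scale; the function name `union_bound_ok` and the small tolerance are illustrative choices, not part of the paper.

```python
import math

def union_bound_ok(A: float, N: int) -> bool:
    """Check 3 * N**(-A + 1) <= N**(-A / 2) by comparing logarithms."""
    lhs = math.log(3.0) - (A - 1.0) * math.log(N)
    rhs = -(A / 2.0) * math.log(N)
    # Small tolerance so the boundary case A = 4, N = 3 is not lost to round-off.
    return lhs <= rhs + 1e-12
```

For instance, `union_bound_ok(4, 3)` sits exactly on the boundary, while `union_bound_ok(4, 2)` fails, which is why $N\ge 3$ is required when $A=4$.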

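Proof. Recall that
$$(\hat f_1,\dots,\hat f_N) := \mathop{\rm argmin}_{(f_1,\dots,f_N)\in D}\Bigl[P_n\bigl(\ell\circ(f_1+\dots+f_N)\bigr) + \sum_{j=1}^N\bigl(\tau\hat\epsilon_j\|f_j\|_{L_2(\Pi_n)}+\tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr)\Bigr],$$
and that we write $f := f_1+\dots+f_N$, $\hat f := \hat f_1+\dots+\hat f_N$. Hence, for all $(f_1,\dots,f_N)\in D$,
$$P_n(\ell\circ\hat f) + \sum_{j=1}^N\bigl(\tau\hat\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)}+\tau^2\hat\epsilon_j^2\|\hat f_j\|_{H_j}\bigr) \le P_n(\ell\circ f) + \sum_{j=1}^N\bigl(\tau\hat\epsilon_j\|f_j\|_{L_2(\Pi_n)}+\tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr).$$
By a simple algebra,
$$\mathcal{E}(\ell\circ\hat f) + \sum_{j=1}^N\bigl(\tau\hat\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)}+\tau^2\hat\epsilon_j^2\|\hat f_j\|_{H_j}\bigr) \le \mathcal{E}(\ell\circ f) + \sum_{j=1}^N\bigl(\tau\hat\epsilon_j\|f_j\|_{L_2(\Pi_n)}+\tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr) + \bigl|(P_n-P)(\ell\circ\hat f-\ell\circ f)\bigr|$$
and, by the triangle inequality,
$$\mathcal{E}(\ell\circ\hat f) + \sum_{j\notin J_f}\tau\hat\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)} + \sum_{j=1}^N\tau^2\hat\epsilon_j^2\|\hat f_j\|_{H_j} \le \mathcal{E}(\ell\circ f) + \sum_{j\in J_f}\bigl(\tau\hat\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi_n)}+\tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr) + \bigl|(P_n-P)(\ell\circ\hat f-\ell\circ f)\bigr|.$$
We now take advantage of (45) and (46) to replace the $\hat\epsilon_j$'s by $\bar\epsilon_j$'s and $\|\cdot\|_{L_2(\Pi_n)}$ by $\|\cdot\|_{L_2(\Pi)}$. Specifically, there exists a numerical constant $C>1$ and an event $E$ of probability at least $1-N^{-A/2}$ such that
$$\frac1C \le \min\Bigl\{\frac{\hat\epsilon_j}{\bar\epsilon_j} : j=1,\dots,N\Bigr\} \le \max\Bigl\{\frac{\hat\epsilon_j}{\bar\epsilon_j} : j=1,\dots,N\Bigr\} \le C \tag{51}$$
and, for all $j=1,\dots,N$,
$$\frac1C\|\hat f_j\|_{L_2(\Pi)} - \bar\epsilon_j\|\hat f_j\|_{H_j} \le \|\hat f_j\|_{L_2(\Pi_n)} \le C\bigl(\|\hat f_j\|_{L_2(\Pi)}+\bar\epsilon_j\|\hat f_j\|_{H_j}\bigr). \tag{52}$$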

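Taking $\tau\ge C/(C-1)$, we have that, on the event $E$,
$$\mathcal{E}(\ell\circ\hat f) + \sum_{j\notin J_f}\tau\hat\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)} + \sum_{j=1}^N\tau^2\hat\epsilon_j^2\|\hat f_j\|_{H_j}$$
$$\ge \mathcal{E}(\ell\circ\hat f) + \frac{1}{C^2}\Bigl(\sum_{j\notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr)$$
$$\ge \mathcal{E}(\ell\circ\hat f) + \frac{1}{C^2}\Bigl(\sum_{j\notin J_f}\tau\bar\epsilon_j\Bigl(\frac1C\|\hat f_j\|_{L_2(\Pi)}-\bar\epsilon_j\|\hat f_j\|_{H_j}\Bigr) + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr)$$
$$\ge \mathcal{E}(\ell\circ\hat f) + \frac{1}{C^3}\Bigl(\sum_{j\notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr).$$
Similarly,
$$\mathcal{E}(\ell\circ f) + \sum_{j\in J_f}\bigl(\tau\hat\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi_n)}+\tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr)$$
$$\le \mathcal{E}(\ell\circ f) + C^2\sum_{j\in J_f}\bigl(\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi_n)}+\tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr)$$
$$\le \mathcal{E}(\ell\circ f) + C^3\sum_{j\in J_f}\tau\bar\epsilon_j\bigl(\|f_j-\hat f_j\|_{L_2(\Pi)}+\bar\epsilon_j\|f_j-\hat f_j\|_{H_j}\bigr) + C^2\sum_{j\in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j}$$
$$\le \mathcal{E}(\ell\circ f) + C^3\sum_{j\in J_f}\tau\bar\epsilon_j\bigl(\|f_j-\hat f_j\|_{L_2(\Pi)}+\bar\epsilon_j\|f_j\|_{H_j}+\bar\epsilon_j\|\hat f_j\|_{H_j}\bigr) + C^2\sum_{j\in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j}$$
$$\le \mathcal{E}(\ell\circ f) + 2C^3\sum_{j\in J_f}\bigl(\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi)}+\tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + C^3\tau\sum_{j\in J_f}\bar\epsilon_j^2\|\hat f_j\|_{H_j}.$$
Therefore, by taking $\tau$ large enough, namely $\tau\ge\frac{C}{C-1}\vee(2C^6)$, we can find numerical constants $0<C_1<1<C_2$ such that, on the event $E$,
$$\mathcal{E}(\ell\circ\hat f) + C_1\Bigl(\sum_{j\notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le \mathcal{E}(\ell\circ f) + C_2\sum_{j\in J_f}\bigl(\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi)}+\tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + \bigl|(P_n-P)(\ell\circ\hat f-\ell\circ f)\bigr|.$$
We now bound the empirical process $\bigl|(P_n-P)(\ell\circ\hat f-\ell\circ f)\bigr|$, where we use the following result that will be proved in the next section. Suppose that $f=\sum_{j=1}^N f_j$, $f_j\in H_j$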

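and $\|f\|_{L_\infty}\le R$ (we will need it with $R=R_D^*$). Denote
$$G(\Delta_-,\Delta_+,R) = \Bigl\{g : \sum_{j=1}^N\bar\epsilon_j\|g_j-f_j\|_{L_2(\Pi)}\le\Delta_-,\ \sum_{j=1}^N\bar\epsilon_j^2\|g_j-f_j\|_{H_j}\le\Delta_+,\ \Bigl\|\sum_{j=1}^N g_j\Bigr\|_{L_\infty}\le R\Bigr\}.$$

Lemma 9 There exists a numerical constant $C>0$ such that for an arbitrary $A\ge 1$ involved in the definition of $\bar\epsilon_j$, $j=1,\dots,N$, with probability at least $1-2N^{-A/2}$, for all
$$\Delta_-\le e^N,\quad \Delta_+\le e^N, \tag{53}$$
the following bound holds:
$$\sup_{g\in G(\Delta_-,\Delta_+,R_D^*)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr| \le CL_*\bigl(\Delta_-+\Delta_++e^{-N}\bigr). \tag{54}$$

Assuming that
$$\sum_{j=1}^N\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}\le e^N,\quad \sum_{j=1}^N\bar\epsilon_j^2\|\hat f_j-f_j\|_{H_j}\le e^N, \tag{55}$$
and using the lemma, we get
$$\mathcal{E}(\ell\circ\hat f) + C_1\Bigl(\sum_{j\notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr)$$
$$\le \mathcal{E}(\ell\circ f) + C_2\sum_{j\in J_f}\bigl(\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi)}+\tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + C_3L_*\sum_{j=1}^N\bigl(\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\bar\epsilon_j^2\|\hat f_j-f_j\|_{H_j}\bigr) + C_3L_*e^{-N}$$
$$\le \mathcal{E}(\ell\circ f) + C_2\sum_{j\in J_f}\bigl(\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi)}+\tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + C_3L_*\Bigl(\sum_{j=1}^N\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\sum_{j=1}^N\bar\epsilon_j^2\|\hat f_j\|_{H_j}+\sum_{j\in J_f}\bar\epsilon_j^2\|f_j\|_{H_j}\Bigr) + C_3L_*e^{-N}$$
for some numerical constant $C_3>0$. By choosing a numerical constant $B$ properly, $\tau$ can be made large enough so that $2C_3L_*\le\tau C_1\le\tau C_2$. Then, we have
$$\mathcal{E}(\ell\circ\hat f) + \frac12C_1\Bigl(\sum_{j\notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le \mathcal{E}(\ell\circ f) + 2C_2\sum_{j\in J_f}\bigl(\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi)}+\tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + (C_2/2)\tau e^{-N}, \tag{56}$$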

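which also implies
$$\mathcal{E}(\ell\circ\hat f) + \frac12C_1\Bigl(\tau\sum_{j=1}^N\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le \mathcal{E}(\ell\circ f) + \Bigl(2C_2+\frac{C_1}{2}\Bigr)\sum_{j\in J_f}\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi)} + 2C_2\sum_{j\in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j} + (C_2/2)\tau e^{-N}. \tag{57}$$
We first consider the case when
$$4C_2\sum_{j\in J_f}\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi)} \ge \mathcal{E}(\ell\circ f) + 2C_2\sum_{j\in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j} + (C_2/2)\tau e^{-N}. \tag{58}$$
Then (56) implies that
$$\mathcal{E}(\ell\circ\hat f) + \frac12C_1\Bigl(\sum_{j\notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le 6C_2\sum_{j\in J_f}\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi)}, \tag{59}$$
which yields
$$\sum_{j\notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} \le \frac{12C_2}{C_1}\sum_{j\in J_f}\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi)}. \tag{60}$$
Therefore, $(\hat f_1-f_1,\dots,\hat f_N-f_N)\in K_{J_f}^{(b)}$ with $b := 12C_2/C_1$. Using the definition of $\beta_b(J_f)$, it follows from (57), (58) and the assumption $C_1<1<C_2$ that
$$\mathcal{E}(\ell\circ\hat f) + \frac12C_1\Bigl(\tau\sum_{j=1}^N\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le \Bigl(6C_2+\frac{C_1}{2}\Bigr)\tau\beta_b(J_f)\|f-\hat f\|_{L_2(\Pi)} \le 7C_2\tau\beta_b(J_f)\bigl(\|f-f_*\|_{L_2(\Pi)}+\|f_*-\hat f\|_{L_2(\Pi)}\bigr). \tag{61}$$
Recall that for losses of quadratic type
$$\mathcal{E}(\ell\circ f)\ge m_*\|f-f_*\|_{L_2(\Pi)}^2 \quad\text{and}\quad \mathcal{E}(\ell\circ\hat f)\ge m_*\|\hat f-f_*\|_{L_2(\Pi)}^2.$$
Then
$$\mathcal{E}(\ell\circ\hat f) + \frac12C_1\Bigl(\tau\sum_{j=1}^N\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le 7\tau C_2m_*^{-1/2}\beta_b(J_f)\bigl(\mathcal{E}^{1/2}(\ell\circ f)+\mathcal{E}^{1/2}(\ell\circ\hat f)\bigr).$$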

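Using the fact that $ab\le(a^2+b^2)/2$, we get
$$7\tau C_2m_*^{-1/2}\beta_b(J_f)\,\mathcal{E}^{1/2}(\ell\circ f) \le \frac{49}{2}\tau^2C_2^2m_*^{-1}\beta_b^2(J_f) + \frac12\mathcal{E}(\ell\circ f), \tag{62}$$
and
$$7\tau C_2m_*^{-1/2}\beta_b(J_f)\,\mathcal{E}^{1/2}(\ell\circ\hat f) \le \frac{49}{2}\tau^2C_2^2m_*^{-1}\beta_b^2(J_f) + \frac12\mathcal{E}(\ell\circ\hat f). \tag{63}$$
Therefore,
$$\mathcal{E}(\ell\circ\hat f) + C_1\sum_{j=1}^N\tau\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} + C_1\sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j} \le \mathcal{E}(\ell\circ f) + 100\tau^2C_2^2m_*^{-1}\beta_b^2(J_f). \tag{64}$$
We now consider the case when
$$4C_2\sum_{j\in J_f}\tau\bar\epsilon_j\|f_j-\hat f_j\|_{L_2(\Pi)} < \mathcal{E}(\ell\circ f) + 2C_2\sum_{j\in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j} + (C_2/2)\tau e^{-N}. \tag{65}$$
It is easy to derive from (57) that in this case
$$\mathcal{E}(\ell\circ\hat f) + \frac12C_1\Bigl(\tau\sum_{j=1}^N\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le \Bigl(\frac32+\frac{C_1}{8C_2}\Bigr)\Bigl(\mathcal{E}(\ell\circ f) + 2C_2\sum_{j\in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j} + (C_2/2)\tau e^{-N}\Bigr). \tag{66}$$
Since $\beta_b(J_f)\ge\sqrt{A\log N/n}$ (see the comment after the definition of $\beta_b(J_f)$), we have
$$\tau e^{-N} \le \tau^2\frac{A\log N}{n} \le \tau^2\beta_b^2(J_f),$$
where we also used the assumptions that $\log N\ge 2\log\log n$ and $A\ge 4$. Substituting this in (66) and then combining the resulting bound with (64) concludes the proof of (49) in the case when conditions (55) hold.

It remains to consider the case when (55) does not hold. The main idea is to show that in this case the right-hand side of the oracle inequality is rather large while we can still control the left-hand side, so the inequality becomes trivial. To this end, note that, by the definition of $\hat f$, for some numerical constant $c_1$,
$$P_n(\ell\circ\hat f) + \sum_{j=1}^N\bigl(\tau\hat\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)}+\tau^2\hat\epsilon_j^2\|\hat f_j\|_{H_j}\bigr) \le \frac1n\sum_{j=1}^n\ell(Y_j;0) \le c_1$$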

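since the value of the penalized empirical risk at $\hat f$ is not larger than its value at $f=0$ and, by the assumptions on the loss, $\ell(y,0)$ is uniformly bounded by a numerical constant. The last equation implies that, on the event $E$ defined earlier in the proof (see (51), (52)), the following bound holds:
$$\sum_{j=1}^N\frac{\tau}{C}\bar\epsilon_j\Bigl(\frac1C\|\hat f_j\|_{L_2(\Pi)}-\bar\epsilon_j\|\hat f_j\|_{H_j}\Bigr) + \sum_{j=1}^N\frac{\tau^2}{C^2}\bar\epsilon_j^2\|\hat f_j\|_{H_j} \le c_1.$$
Equivalently,
$$\frac{\tau}{C^2}\sum_{j=1}^N\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \Bigl(\frac{\tau^2}{C^2}-\frac{\tau}{C}\Bigr)\sum_{j=1}^N\bar\epsilon_j^2\|\hat f_j\|_{H_j} \le c_1.$$
As soon as $\tau\ge 2C$, so that $\tau^2/C^2-\tau/C\ge\tau^2/(2C^2)$, we have
$$\tau\sum_{j=1}^N\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \tau^2\sum_{j=1}^N\bar\epsilon_j^2\|\hat f_j\|_{H_j} \le 2c_1C^2. \tag{67}$$
Note also that, by the assumptions on the loss function,
$$\mathcal{E}(\ell\circ\hat f) \le P(\ell\circ\hat f) \le E\ell(Y;0) + \bigl|P(\ell\circ\hat f)-P(\ell\circ 0)\bigr| \le c_1 + L_*\|\hat f\|_{L_2(\Pi)} \le c_1 + L_*\sum_{j=1}^N\|\hat f_j\|_{L_2(\Pi)} \le c_1 + 2c_1C^2\frac{L_*}{\tau}\sqrt{\frac{n}{A\log N}}, \tag{68}$$
where we used the Lipschitz condition on $\ell$, and also bound (67) and the fact that $\bar\epsilon_j\ge\sqrt{A\log N/n}$ by its definition.

Recall that we are considering the case when (55) does not hold. We will consider two cases: (a) when $e^N\le c_3$, where $c_3\ge c_1$ is a numerical constant, and (b) when $e^N>c_3$. The first case is very simple since $N$ and $n$ are both upper bounded by a numerical constant (recall the assumption $\log N\ge 2\log\log n$). In this case, $\beta_b(J_f)\ge\sqrt{A\log N/n}$ is bounded from below by a numerical constant. As a consequence of these observations, bounds (67) and (68) imply that
$$\mathcal{E}(\ell\circ\hat f) + C_1\Bigl(\sum_{j=1}^N\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le C_2\tau^2\beta_b^2(J_f)$$
for some numerical constant $C_2>0$. In the case (b), we have
$$\sum_{j=1}^N\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\bar\epsilon_j^2\|\hat f_j-f_j\|_{H_j} \ge e^N$$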

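and, in view of (67), this implies
$$\sum_{j=1}^N\bar\epsilon_j\|f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\bar\epsilon_j^2\|f_j\|_{H_j} \ge e^N - c_1/2 \ge e^N/2.$$
So, either we have
$$\sum_{j=1}^N\bar\epsilon_j^2\|f_j\|_{H_j} \ge e^N/4, \quad\text{or}\quad \sum_{j=1}^N\bar\epsilon_j\|f_j\|_{L_2(\Pi)} \ge e^N/4.$$
Moreover, in the second case, we also have
$$\sum_{j=1}^N\bar\epsilon_j^2\|f_j\|_{H_j} \ge \sqrt{\frac{A\log N}{n}}\sum_{j=1}^N\bar\epsilon_j\|f_j\|_{L_2(\Pi)} \ge \sqrt{\frac{A\log N}{n}}\,e^N/4.$$
In both cases we can conclude that, under the assumption that $\log N\ge 2\log\log n$ and $e^N>c_3$ for a sufficiently large numerical constant $c_3$,
$$\mathcal{E}(\ell\circ\hat f) + \sum_{j=1}^N\bigl(\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)}+\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\bigr) \le c_1 + 2c_1C^2\frac{L_*}{\tau}\sqrt{\frac{n}{A\log N}} + 2c_1C^2 \le \frac{\tau^2}{4}e^N\sqrt{\frac{A\log N}{n}} \le \tau^2\sum_{j\in J_f}\bar\epsilon_j^2\|f_j\|_{H_j}.$$
Thus, in both cases (a) and (b), the following bound holds:
$$\mathcal{E}(\ell\circ\hat f) + C_1\Bigl(\sum_{j=1}^N\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le C_2\tau^2\Bigl(\sum_{j\in J_f}\bar\epsilon_j^2\|f_j\|_{H_j} + \beta_b^2(J_f)\Bigr). \tag{69}$$
To complete the proof, observe that
$$\mathcal{E}(\ell\circ\hat f) + C_1\Bigl(\sum_{j=1}^N\tau\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr)$$
$$\le \mathcal{E}(\ell\circ\hat f) + C_1\Bigl(\sum_{j=1}^N\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) + C_1\tau\sum_{j\in J_f}\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}$$
$$\le C_2\tau^2\Bigl(\sum_{j\in J_f}\bar\epsilon_j^2\|f_j\|_{H_j} + \beta_b^2(J_f)\Bigr) + C_2\tau\sum_{j\in J_f}\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}. \tag{70}$$
Note also that, by the definition of $\beta_b(J_f)$, for all $b>0$,
$$\tau\sum_{j\in J_f}\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} \le \tau\beta_b(J_f)\Bigl\|\sum_{j\in J_f}(\hat f_j-f_j)\Bigr\|_{L_2(\Pi)} \le \tau\beta_b(J_f)\|\hat f-f\|_{L_2(\Pi)} + \tau\beta_b(J_f)\sqrt{\frac{n}{A\log N}}\sum_{j\notin J_f}\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)}$$
$$\le \tau\beta_b(J_f)\|\hat f-f\|_{L_2(\Pi)} + \tau\beta_b(J_f)\frac{2c_1C^2}{\tau}\sqrt{\frac{n}{A\log N}}, \tag{71}$$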

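where we used the fact that, for all $j$, $\bar\epsilon_j\ge\sqrt{A\log N/n}$, and also bound (67). By an argument similar to (61)-(64), it is easy to deduce from the last bound that
$$C_2\tau\sum_{j\in J_f}\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} \le \frac32C_2^2\tau^2\frac{\beta_b^2(J_f)}{m_*} + \frac12\mathcal{E}(\ell\circ\hat f) + \frac12\mathcal{E}(\ell\circ f) + \frac{2c_1^2C^4}{\tau^2}\frac{n}{A\log N}. \tag{72}$$
Substituting this in bound (70), we get
$$\frac12\mathcal{E}(\ell\circ\hat f) + C_1\Bigl(\sum_{j=1}^N\tau\bar\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)} + \sum_{j=1}^N\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr)$$
$$\le C_2\tau^2\Bigl(\sum_{j\in J_f}\bar\epsilon_j^2\|f_j\|_{H_j} + \beta_b^2(J_f)\Bigr) + \frac32C_2^2\tau^2\frac{\beta_b^2(J_f)}{m_*} + \frac12\mathcal{E}(\ell\circ f) + \frac{2c_1^2C^4}{\tau^2}\frac{n}{A\log N}$$
$$\le \frac12\mathcal{E}(\ell\circ f) + C_2'\tau^2\Bigl(\sum_{j\in J_f}\bar\epsilon_j^2\|f_j\|_{H_j} + \frac{\beta_b^2(J_f)}{m_*}\Bigr) + \frac{2c_1^2C^4}{\tau^2}\frac{n}{A\log N}, \tag{73}$$
with some numerical constant $C_2'$. It is enough now to observe (considering again the cases (a) and (b), as it was done before) that either the last term is upper bounded by $\sum_{j\in J_f}\bar\epsilon_j^2\|f_j\|_{H_j}$, or it is upper bounded by $\beta_b^2(J_f)$, to complete the proof.

Now, to derive Theorem 2, it is enough to check that, for a numerical constant $c>0$,
$$\beta_b(J_f) \le \Bigl(\sum_{j\in J_f}\bar\epsilon_j^2\Bigr)^{1/2}\beta_{2,\infty}(J_f) \le c\Bigl(\sum_{j\in J_f}\breve\epsilon_j^2\Bigr)^{1/2}\beta_{2,\infty}(J_f),$$
which easily follows from the definitions of $\beta_b$ and $\beta_{2,\infty}$. Similarly, the proof of Theorem 3 follows from the fact that, under the assumption that $\Lambda^{-1}\le\bar\epsilon_j/\bar\epsilon\le\Lambda$, we have $K_J^{(b)}\subset K_J^{(b')}$, where $b'=c\Lambda^2b$, $c$ being a numerical constant. This easily implies the bound $\beta_b(J_f)\le c_1\Lambda\,\beta_{2,b'}(J_f)\sqrt{d(f)}\,\bar\epsilon$, where $c_1$ is a numerical constant.

5 Bounding the Empirical Process

We now proceed to prove Lemma 9 that was used to bound $\bigl|(P_n-P)(\ell\circ\hat f-\ell\circ f)\bigr|$. To this end, we begin with a fixed pair $(\Delta_-,\Delta_+)$. Throughout the proof, we write $R := R_D^*$. By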

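Talagrand's concentration inequality, with probability at least $1-e^{-t}$,
$$\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr| \le 2\Bigl(E\Bigl[\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr|\Bigr] + \|\ell\circ g-\ell\circ f\|_{L_2(P)}\sqrt{\frac tn} + \|\ell\circ g-\ell\circ f\|_{L_\infty}\frac tn\Bigr).$$
Now note that
$$\|\ell\circ g-\ell\circ f\|_{L_2(P)} \le L_*\|g-f\|_{L_2(\Pi)} \le L_*\sum_{j=1}^N\|g_j-f_j\|_{L_2(\Pi)} \le L_*\frac{1}{\min_j\bar\epsilon_j}\sum_{j=1}^N\bar\epsilon_j\|g_j-f_j\|_{L_2(\Pi)},$$
where we used the fact that the Lipschitz constant of the loss $\ell$ on the range of functions from $G(\Delta_-,\Delta_+,R)$ is bounded by $L_*$. Together with the fact that $\bar\epsilon_j\ge(A\log N/n)^{1/2}$ for all $j$, this yields
$$\|\ell\circ g-\ell\circ f\|_{L_2(P)} \le L_*\sqrt{\frac{n}{A\log N}}\,\Delta_-. \tag{74}$$
Furthermore,
$$\|\ell\circ g-\ell\circ f\|_{L_\infty} \le L_*\|g-f\|_{L_\infty} \le L_*\sum_{j=1}^N\|g_j-f_j\|_{H_j} \le L_*\frac{n}{A\log N}\,\Delta_+. \tag{75}$$
In summary, we have
$$\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr| \le 2\Bigl(E\Bigl[\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr|\Bigr] + L_*\Delta_-\sqrt{\frac{t}{A\log N}} + L_*\Delta_+\frac{t}{A\log N}\Bigr).$$
Now, by the symmetrization inequality,
$$E\Bigl[\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr|\Bigr] \le 2E\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|R_n(\ell\circ g-\ell\circ f)\bigr|.$$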

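An application of the Rademacher contraction inequality further yields
$$E\Bigl[\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr|\Bigr] \le CL_*E\sup_{g\in G(\Delta_-,\Delta_+,R)}|R_n(g-f)|, \tag{76}$$
where $C>0$ is a numerical constant (again, it was used here that the Lipschitz constant of the loss $\ell$ on the range of functions from $G(\Delta_-,\Delta_+,R)$ is bounded by $L_*$). Applying Talagrand's concentration inequality another time, we get that with probability at least $1-e^{-t}$,
$$E\sup_{g\in G(\Delta_-,\Delta_+,R)}|R_n(g-f)| \le C\Bigl(\sup_{g\in G(\Delta_-,\Delta_+,R)}|R_n(g-f)| + \Delta_-\sqrt{\frac{t}{A\log N}} + \Delta_+\frac{t}{A\log N}\Bigr)$$
for some numerical constant $C>0$. Recalling the definition of $\check\epsilon_j := \check\epsilon(K_j)$, we get
$$|R_n(h_j)| \le \check\epsilon_j\|h_j\|_{L_2(\Pi)} + \check\epsilon_j^2\|h_j\|_{H_j},\quad h_j\in H_j. \tag{77}$$
Hence, with probability at least $1-2e^{-t}$ and with some numerical constant $C>0$,
$$\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr| \le CL_*\Bigl(\sup_{g\in G(\Delta_-,\Delta_+,R)}|R_n(g-f)| + \Delta_-\sqrt{\frac{t}{A\log N}} + \Delta_+\frac{t}{A\log N}\Bigr)$$
$$\le CL_*\Bigl(\sup_{g\in G(\Delta_-,\Delta_+,R)}\sum_{j=1}^N|R_n(g_j-f_j)| + \Delta_-\sqrt{\frac{t}{A\log N}} + \Delta_+\frac{t}{A\log N}\Bigr)$$
$$\le CL_*\Bigl(\sup_{g\in G(\Delta_-,\Delta_+,R)}\sum_{j=1}^N\bigl(\check\epsilon_j\|g_j-f_j\|_{L_2(\Pi)}+\check\epsilon_j^2\|g_j-f_j\|_{H_j}\bigr) + \Delta_-\sqrt{\frac{t}{A\log N}} + \Delta_+\frac{t}{A\log N}\Bigr).$$
Using (46), $\check\epsilon_j$ can be upper bounded by $c\bar\epsilon_j$ with some numerical constant $c>0$ on an event $E$ of probability at least $1-N^{-A/2}$. Therefore, the following bound is obtained:
$$\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr| \le CL_*\Bigl(\Delta_-+\Delta_+ + \Delta_-\sqrt{\frac{t}{A\log N}} + \Delta_+\frac{t}{A\log N}\Bigr).$$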

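It holds on the event $E\cap F(\Delta_-,\Delta_+,t)$, where $P(F(\Delta_-,\Delta_+,t))\ge 1-2e^{-t}$. We will now choose $t=A\log N+4\log N+4\log(2/\log 2)$ and obtain a bound that holds uniformly over
$$e^{-N}\le\Delta_-\le e^N \quad\text{and}\quad e^{-N}\le\Delta_+\le e^N. \tag{78}$$
To this end, consider
$$\Delta_-^j = \Delta_+^j := 2^{-j}. \tag{79}$$
For any $\Delta_-^j$ and $\Delta_+^k$ satisfying (78), we have
$$\sup_{g\in G(\Delta_-^j,\Delta_+^k,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr| \le CL_*\Bigl(\Delta_-^j+\Delta_+^k + \Delta_-^j\sqrt{\frac{t}{A\log N}} + \Delta_+^k\frac{t}{A\log N}\Bigr)$$
on the event $E\cap F(\Delta_-^j,\Delta_+^k,t)$. Therefore, simultaneously for all $\Delta_-^j$ and $\Delta_+^k$ satisfying (78), we have
$$\sup_{g\in G(\Delta_-^j,\Delta_+^k,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr| \le CL_*\Bigl(\Delta_-^j+\Delta_+^k + \Delta_-^j\sqrt{\frac{A\log N+4\log N+4\log(2/\log 2)}{A\log N}} + \Delta_+^k\frac{A\log N+4\log N+4\log(2/\log 2)}{A\log N}\Bigr)$$
on the event $E' := E\cap\bigl(\bigcap_{j,k}F(\Delta_-^j,\Delta_+^k,t)\bigr)$. The last intersection is over all $j,k$ such that conditions (78) hold for $\Delta_-^j$, $\Delta_+^k$. The number of the events in this intersection is bounded by $(2/\log 2)^2N^2$. Therefore, since each event $F(\Delta_-^j,\Delta_+^k,t)$ fails with probability at most $2e^{-t}$,
$$P(E') \ge 1 - 2(2/\log 2)^2N^2\exp\{-A\log N-4\log N-4\log(2/\log 2)\} - P(E^c) \ge 1-2N^{-A/2}. \tag{80}$$
Using monotonicity of the functions of $\Delta_-$, $\Delta_+$ involved in the inequalities, the bounds can be extended to the whole range of values of $\Delta_-$, $\Delta_+$ satisfying (78), so, with probability at least $1-2N^{-A/2}$, we have for all such $\Delta_-$, $\Delta_+$
$$\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr| \le CL_*(\Delta_-+\Delta_+). \tag{81}$$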

If $\Delta_-\le e^{-N}$, or $\Delta_+\le e^{-N}$, it follows by monotonicity of the left-hand side that with the same probability
$$\sup_{g\in G(\Delta_-,\Delta_+,R)}\bigl|(P_n-P)(\ell\circ g-\ell\circ f)\bigr| \le CL_*\bigl(\Delta_-+\Delta_++e^{-N}\bigr), \tag{82}$$
which completes the proof.

Acknowledgment. The authors are thankful to the referees for a number of helpful suggestions. The first author is thankful to Evarist Giné for useful conversations about the paper.

References

[1] Aronszajn, N. (1950), Theory of reproducing kernels, Trans. Am. Math. Soc., 68.
[2] Bach, F. (2008), Consistency of the group Lasso and multiple kernel learning, Journal of Machine Learning Research, 9.
[3] Bickel, P., Ritov, Y. and Tsybakov, A. (2009), Simultaneous analysis of Lasso and Dantzig selector, Annals of Statistics, 37, 4.
[4] Bousquet, O. and Herrmann, D. (2003), On the complexity of learning the kernel matrix, In: Advances in Neural Information Processing Systems 15.
[5] Blanchard, G., Bousquet, O. and Massart, P. (2008), Statistical performance of support vector machines, Annals of Statistics, 36.
[6] Bousquet, O. (2002), A Bennett concentration inequality and its applications to suprema of empirical processes, C.R. Acad. Sci. Paris, 334.
[7] Crammer, K., Keshet, J. and Singer, Y. (2003), Kernel design using boosting, In: Advances in Neural Information Processing Systems 15.
[8] Koltchinskii, V. (2008), Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, Lecture Notes for Ecole d'Eté de Probabilités de Saint-Flour.

SPARSITY IN MULTIPLE KERNEL LEARNING. By Vladimir Koltchinskii and Ming Yuan, Georgia Institute of Technology. The Annals of Statistics 2010, Vol. 38, No. 6, 3660-3695. DOI: 10.1214/10-AOS825. Institute of Mathematical Statistics, 2010.


More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et

More information

Linear Classifiers III

Linear Classifiers III Uiversität Potsdam Istitut für Iformatik Lehrstuhl Maschielles Lere Liear Classifiers III Blaie Nelso, Tobias Scheffer Cotets Classificatio Problem Bayesia Classifier Decisio Liear Classifiers, MAP Models

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

5 Birkhoff s Ergodic Theorem

5 Birkhoff s Ergodic Theorem 5 Birkhoff s Ergodic Theorem Amog the most useful of the various geeralizatios of KolmogorovâĂŹs strog law of large umbers are the ergodic theorems of Birkhoff ad Kigma, which exted the validity of the

More information

Technical Proofs for Homogeneity Pursuit

Technical Proofs for Homogeneity Pursuit Techical Proofs for Homogeeity Pursuit bstract This is the supplemetal material for the article Homogeeity Pursuit, submitted for publicatio i Joural of the merica Statistical ssociatio. B Proofs B. Proof

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Support Vector Machies ad Kerel Methods Daiel Khashabi Fall 202 Last Update: September 26, 206 Itroductio I Support Vector Machies the goal is to fid a separator betwee data which has the largest margi,

More information

lim za n n = z lim a n n.

lim za n n = z lim a n n. Lecture 6 Sequeces ad Series Defiitio 1 By a sequece i a set A, we mea a mappig f : N A. It is customary to deote a sequece f by {s } where, s := f(). A sequece {z } of (complex) umbers is said to be coverget

More information

10-701/ Machine Learning Mid-term Exam Solution

10-701/ Machine Learning Mid-term Exam Solution 0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it

More information

Dimensionality reduction in Hilbert spaces

Dimensionality reduction in Hilbert spaces Dimesioality reductio i Hilbert spaces Maxim Ragisky October 3, 014 Dimesioality reductio is a geeric ame for ay procedure that takes a complicated object livig i a high-dimesioal (or possibly eve ifiite-dimesioal)

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Lecture 12: February 28

Lecture 12: February 28 10-716: Advaced Machie Learig Sprig 2019 Lecture 12: February 28 Lecturer: Pradeep Ravikumar Scribes: Jacob Tyo, Rishub Jai, Ojash Neopae Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Apply change-of-basis formula to rewrite x as a linear combination of eigenvectors v j.

Apply change-of-basis formula to rewrite x as a linear combination of eigenvectors v j. Eigevalue-Eigevector Istructor: Nam Su Wag eigemcd Ay vector i real Euclidea space of dimesio ca be uiquely epressed as a liear combiatio of liearly idepedet vectors (ie, basis) g j, j,,, α g α g α g α

More information

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of

More information

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients. Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A. Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)

More information

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function. MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied

More information

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology Advaced Aalysis Mi Ya Departmet of Mathematics Hog Kog Uiversity of Sciece ad Techology September 3, 009 Cotets Limit ad Cotiuity 7 Limit of Sequece 8 Defiitio 8 Property 3 3 Ifiity ad Ifiitesimal 8 4

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

Ω ). Then the following inequality takes place:

Ω ). Then the following inequality takes place: Lecture 8 Lemma 5. Let f : R R be a cotiuously differetiable covex fuctio. Choose a costat δ > ad cosider the subset Ωδ = { R f δ } R. Let Ωδ ad assume that f < δ, i.e., is ot o the boudary of f = δ, i.e.,

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

Lecture 10: Bounded Linear Operators and Orthogonality in Hilbert Spaces

Lecture 10: Bounded Linear Operators and Orthogonality in Hilbert Spaces Lecture : Bouded Liear Operators ad Orthogoality i Hilbert Spaces 34 Bouded Liear Operator Let ( X, ), ( Y, ) i i be ored liear vector spaces ad { } X Y The, T is said to be bouded if a real uber c such

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

A Risk Comparison of Ordinary Least Squares vs Ridge Regression Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Precise Rates in Complete Moment Convergence for Negatively Associated Sequences

Precise Rates in Complete Moment Convergence for Negatively Associated Sequences Commuicatios of the Korea Statistical Society 29, Vol. 16, No. 5, 841 849 Precise Rates i Complete Momet Covergece for Negatively Associated Sequeces Dae-Hee Ryu 1,a a Departmet of Computer Sciece, ChugWoo

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

A Note on the Symmetric Powers of the Standard Representation of S n

A Note on the Symmetric Powers of the Standard Representation of S n A Note o the Symmetric Powers of the Stadard Represetatio of S David Savitt 1 Departmet of Mathematics, Harvard Uiversity Cambridge, MA 0138, USA dsavitt@mathharvardedu Richard P Staley Departmet of Mathematics,

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

Ma 4121: Introduction to Lebesgue Integration Solutions to Homework Assignment 5

Ma 4121: Introduction to Lebesgue Integration Solutions to Homework Assignment 5 Ma 42: Itroductio to Lebesgue Itegratio Solutios to Homework Assigmet 5 Prof. Wickerhauser Due Thursday, April th, 23 Please retur your solutios to the istructor by the ed of class o the due date. You

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

Glivenko-Cantelli Classes

Glivenko-Cantelli Classes CS28B/Stat24B (Sprig 2008 Statistical Learig Theory Lecture: 4 Gliveko-Catelli Classes Lecturer: Peter Bartlett Scribe: Michelle Besi Itroductio This lecture will cover Gliveko-Catelli (GC classes ad itroduce

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

Maximum Likelihood Estimation and Complexity Regularization

Maximum Likelihood Estimation and Complexity Regularization ECE90 Sprig 004 Statistical Regularizatio ad Learig Theory Lecture: 4 Maximum Likelihood Estimatio ad Complexity Regularizatio Lecturer: Rob Nowak Scribe: Pam Limpiti Review : Maximum Likelihood Estimatio

More information

The log-behavior of n p(n) and n p(n)/n

The log-behavior of n p(n) and n p(n)/n Ramauja J. 44 017, 81-99 The log-behavior of p ad p/ William Y.C. Che 1 ad Ke Y. Zheg 1 Ceter for Applied Mathematics Tiaji Uiversity Tiaji 0007, P. R. Chia Ceter for Combiatorics, LPMC Nakai Uivercity

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

INEQUALITIES BJORN POONEN

INEQUALITIES BJORN POONEN INEQUALITIES BJORN POONEN 1 The AM-GM iequality The most basic arithmetic mea-geometric mea (AM-GM) iequality states simply that if x ad y are oegative real umbers, the (x + y)/2 xy, with equality if ad

More information

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1 Solutio Sagchul Lee October 7, 017 1 Solutios of Homework 1 Problem 1.1 Let Ω,F,P) be a probability space. Show that if {A : N} F such that A := lim A exists, the PA) = lim PA ). Proof. Usig the cotiuity

More information

Questions and answers, kernel part

Questions and answers, kernel part Questios ad aswers, kerel part October 8, 205 Questios. Questio : properties of kerels, PCA, represeter theorem. [2 poits] Let F be a RK defied o some domai X, with feature map φ(x) x X ad reproducig kerel

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

The Boolean Ring of Intervals

The Boolean Ring of Intervals MATH 532 Lebesgue Measure Dr. Neal, WKU We ow shall apply the results obtaied about outer measure to the legth measure o the real lie. Throughout, our space X will be the set of real umbers R. Whe ecessary,

More information