Sparsity in Multiple Kernel Learning


Vladimir Koltchinskii
School of Mathematics, Georgia Institute of Technology, Atlanta, GA USA
vlad@math.gatech.edu

and

Ming Yuan
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA USA
myuan@isye.gatech.edu

April 28, 2010

The research of this author was supported in part by NSF grants MPSA-MCS , DMS and CCF

The research of this author was supported in part by NSF grants MPSA-MCS and DMS

Abstract

The problem of multiple kernel learning based on penalized empirical risk minimization is discussed. The complexity penalty is determined jointly by the empirical $L_2$ norms and the reproducing kernel Hilbert space (RKHS) norms induced by the kernels with a data-driven choice of regularization parameters. The main focus is on the case when the total number of kernels is large, but only a relatively small number of them is needed to represent the target function, so that the problem is sparse. The goal is to establish oracle inequalities for the excess risk of the resulting prediction rule showing that the method is adaptive both to the unknown design distribution and to the sparsity of the problem.

1 Introduction

Let $(X_i, Y_i)$, $i = 1, \dots, n$ be independent copies of a random couple $(X, Y)$ with values in $S \times T$, where $S$ is a measurable space with $\sigma$-algebra $\mathcal{A}$ (typically, $S$ is a compact subset of a finite-dimensional Euclidean space) and $T$ is a Borel subset of $\mathbb{R}$. In what follows, $P$ will denote the distribution of $(X, Y)$ and $\Pi$ the distribution of $X$. The corresponding empirical distributions, based on $(X_1, Y_1), \dots, (X_n, Y_n)$ and on $(X_1, \dots, X_n)$, will be denoted by $P_n$ and $\Pi_n$, respectively. For a measurable function $g : S \times T \to \mathbb{R}$, we denote

$$P g := \int_{S \times T} g \, dP = \mathbb{E} g(X, Y) \quad \text{and} \quad P_n g := \int_{S \times T} g \, dP_n = \frac{1}{n} \sum_{j=1}^n g(X_j, Y_j).$$

Similarly, we use the notations $\Pi f$ and $\Pi_n f$ for the integrals of a function $f : S \to \mathbb{R}$ with respect to the measures $\Pi$ and $\Pi_n$.

The goal of prediction is to learn a reasonably good prediction rule $f : S \to \mathbb{R}$ from the empirical data $\{(X_i, Y_i) : i = 1, 2, \dots, n\}$. To be more specific, consider a loss function $\ell : T \times \mathbb{R} \to \mathbb{R}_+$ and define the risk of a prediction rule $f$ as

$$P(\ell \circ f) = \mathbb{E} \ell(Y, f(X)),$$

where $(\ell \circ f)(x, y) = \ell(y, f(x))$. An optimal prediction rule with respect to this loss is defined as

$$f_* = \operatorname{argmin}_{f : S \to \mathbb{R}} P(\ell \circ f),$$

where the minimization is taken over all measurable functions and, for simplicity, it is assumed that the minimum is attained. The excess risk of a prediction rule $f$ is defined as

$$\mathcal{E}(\ell \circ f) := P(\ell \circ f) - P(\ell \circ f_*).$$

Throughout the paper, the notation $a \asymp b$ means that there exists a numerical constant $c > 0$ such that $c^{-1} \le a/b \le c$. By "numerical constants" we usually mean real numbers whose precise values are not necessarily specified, or, sometimes, constants that might depend on the characteristics of the problem that are of little interest to us (for instance, some constants that depend only on the loss function).

1.1 Learning in Reproducing Kernel Hilbert Spaces

Let $H_K$ be a reproducing kernel Hilbert space (RKHS) associated with a symmetric nonnegatively definite kernel $K : S \times S \to \mathbb{R}$ such that for any $x \in S$, $K_x(\cdot) := K(\cdot, x) \in H_K$ and $f(x) = \langle f, K_x \rangle_{H_K}$ for all $f \in H_K$ (Aronszajn). If it is known that $f_* \in H_K$ and $\|f_*\|_{H_K} \le 1$, then it is natural to estimate $f_*$ by a solution $\hat f$ of the following empirical risk minimization problem:

$$\hat f := \operatorname{argmin}_{\|f\|_{H_K} \le 1} \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i)). \tag{1}$$

The size of the excess risk $\mathcal{E}(\ell \circ \hat f)$ of such an empirical solution depends on the "smoothness" of functions in the RKHS $H_K$. A natural notion of smoothness in this context is related to the unknown design distribution $\Pi$. Namely, let $T_K$ be the integral operator from $L_2(\Pi)$ into $L_2(\Pi)$ with kernel $K$. Under a standard assumption that the kernel $K$ is square integrable (in the theory of RKHS it is usually even assumed that $S$ is compact and $K$ is continuous), the operator $T_K$ is compact and its spectrum is discrete. If $\{\lambda_k\}$ is the sequence of the eigenvalues (arranged in decreasing order) of $T_K$ and $\{\phi_k\}$ is the corresponding $L_2(\Pi)$-orthonormal sequence of eigenfunctions, then it is well known that the RKHS-norms of functions from the linear span of $\{\phi_k\}$ can be written as

$$\|f\|_{H_K}^2 = \sum_{k \ge 1} \frac{|\langle f, \phi_k \rangle_{L_2(\Pi)}|^2}{\lambda_k},$$

which means that the "smoothness" of functions in $H_K$ depends on the rate of decay of eigenvalues $\lambda_k$ that, in turn, depends on the design distribution $\Pi$. It is also clear that the unit balls in the RKHS $H_K$ are ellipsoids in the space $L_2(\Pi)$ with "axes" $\sqrt{\lambda_k}$.

It was shown by Mendelson (2002) that the following function

$$\gamma_n(\delta) := \Big( \frac{1}{n} \sum_{k \ge 1} (\lambda_k \wedge \delta^2) \Big)^{1/2}, \quad \delta \in [0, 1],$$

provides tight upper and lower bounds (up to constants) on localized Rademacher complexities of the unit ball in $H_K$ and plays an important role in the analysis of the empirical risk minimization problem (1). It is easy to see that the function $\gamma_n^2(\sqrt{\delta})$ is concave, $\gamma_n(0) = 0$ and, as a consequence, $\gamma_n(\delta)/\delta$ is a decreasing function of $\delta$ and $\gamma_n(\delta)/\delta^2$ is strictly decreasing. Hence, there exists a unique positive solution of the equation $\gamma_n(\delta) = \delta^2$. If $\bar\delta_n$ denotes this solution, then the results of Mendelson (2002) imply that with some constant $C > 0$ and with probability at least $1 - e^{-t}$

$$\mathcal{E}(\ell \circ \hat f) \le C \Big( \bar\delta_n^2 + \frac{t}{n} \Big).$$

The size of the quantity $\bar\delta_n^2$ involved in this upper bound on the excess risk depends on the rate of decay of the eigenvalues $\lambda_k$ as $k \to \infty$. In particular, if $\lambda_k \asymp k^{-2\beta}$ for some $\beta > 1/2$, then it is easy to see that $\gamma_n(\delta) \asymp n^{-1/2} \delta^{1 - 1/(2\beta)}$ and $\bar\delta_n^2 \asymp n^{-2\beta/(2\beta+1)}$. Recall that unit balls in $H_K$ are ellipsoids in $L_2(\Pi)$ with "axes" of the order $k^{-\beta}$ and it is well known that, in a variety of estimation problems, $n^{-2\beta/(2\beta+1)}$ represents minimax convergence rates of the squared $L_2$-risk for functions from such ellipsoids (for instance, from Sobolev balls of smoothness $\beta$), as in the famous Pinsker's Theorem (see, e.g., Tsybakov 2009, Chapter 3).

Example. Sobolev spaces $W^{\alpha,2}(G)$, $G \subset \mathbb{R}^d$ of smoothness $\alpha > d/2$ are a well known class of concrete examples of RKHS. Let $\mathbb{T}^d$, $d \ge 1$ denote the $d$-dimensional torus and let $\Pi$ be the uniform distribution in $\mathbb{T}^d$. It is easy to check that, for all $\alpha > d/2$, the Sobolev space $W^{\alpha,2}(\mathbb{T}^d)$ is an RKHS generated by the kernel $K(x, y) = k(x - y)$, $x, y \in \mathbb{T}^d$, where the function $k \in L_2(\mathbb{T}^d)$ is defined by its Fourier coefficients

$$\hat k_n = (|n|^2 + 1)^{-\alpha}, \quad n = (n_1, \dots, n_d) \in \mathbb{Z}^d, \quad |n|^2 := n_1^2 + \dots + n_d^2.$$

In this case, the eigenfunctions of the operator $T_K$ are the functions of the Fourier basis and its eigenvalues are the numbers $\{(|n|^2 + 1)^{-\alpha} : n \in \mathbb{Z}^d\}$.
For $d = 1$ and $\alpha > 1/2$, we have

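$\lambda_k \asymp k^{-2\alpha}$ (recall that $\{\lambda_k\}$ are the eigenvalues arranged in decreasing order) so, $\beta = \alpha$ and $\bar\delta_n^2 \asymp n^{-2\alpha/(2\alpha+1)}$, which is a minimax nonparametric convergence rate for Sobolev balls in $W^{\alpha,2}(\mathbb{T})$ (see, e.g., Tsybakov 2009, Theorem 2.9). More generally, for arbitrary $d \ge 1$ and $\alpha > d/2$, we get $\beta = \alpha/d$ and $\bar\delta_n^2 \asymp n^{-2\alpha/(2\alpha+d)}$, which is also a minimax optimal convergence rate in this case.

Suppose now that the distribution $\Pi$ is uniform in a torus $\mathbb{T}^{d'} \subset \mathbb{T}^d$ of dimension $d' < d$. We will use the same kernel $K$, but restrict the RKHS $H_K$ to the torus $\mathbb{T}^{d'}$ of smaller dimension. Let $d'' = d - d'$. For $n \in \mathbb{Z}^d$, we will write $n = (n', n'')$ with $n' \in \mathbb{Z}^{d'}$, $n'' \in \mathbb{Z}^{d''}$. It is easy to prove that the eigenvalues of the operator $T_K$ become in this case

$$\sum_{n'' \in \mathbb{Z}^{d''}} (|n'|^2 + |n''|^2 + 1)^{-\alpha} \asymp (|n'|^2 + 1)^{-(\alpha - d''/2)}.$$

Due to this fact, the norm of the space $H_K$ (restricted to $\mathbb{T}^{d'}$) is equivalent to the norm of the Sobolev space $W^{\alpha - d''/2, 2}(\mathbb{T}^{d'})$. Since the eigenvalues of the operator $T_K$ coincide, up to a constant, with the numbers $\{(|n'|^2 + 1)^{-(\alpha - d''/2)} : n' \in \mathbb{Z}^{d'}\}$, we get

$$\bar\delta_n^2 \asymp n^{-\frac{2\alpha - d''}{2\alpha - d'' + d'}},$$

which is again the minimax convergence rate for Sobolev balls in $W^{\alpha - d''/2, 2}(\mathbb{T}^{d'})$.

In the case of more general design distributions $\Pi$, the rate of decay of the eigenvalues $\lambda_k$ and the corresponding size of the excess risk bound $\bar\delta_n^2$ depend on $\Pi$. If, for instance, $\Pi$ is supported in a submanifold $S \subset \mathbb{T}^d$ of dimension $\dim(S) < d$, the rate of convergence of $\bar\delta_n^2$ to 0 depends on the dimension of the submanifold $S$ rather than on the dimension of the ambient space $\mathbb{T}^d$.

Using the properties of the function $\gamma_n$, in particular, the fact that $\gamma_n(\delta)/\delta$ is decreasing, it is easy to observe that

$$\gamma_n(\delta) \le \bar\delta_n \delta + \bar\delta_n^2, \quad \delta \in (0, 1].$$

Moreover, if $\breve\epsilon = \breve\epsilon(K)$ denotes the smallest value of $\epsilon$ such that the linear function $\epsilon\delta + \epsilon^2$, $\delta \in (0, 1]$ provides an upper bound for the function $\gamma_n(\delta)$, $\delta \in (0, 1]$, then

$$\breve\epsilon \le \bar\delta_n \le \frac{\sqrt{5} + 1}{2} \breve\epsilon.$$

Note that $\breve\epsilon$ also depends on $n$, but we do not have to emphasize this dependence in the notations since, in what follows, $n$ is fixed. Based on the observations above, the quantity $\bar\delta_n$ coincides (up to a numerical constant) with the slope $\breve\epsilon$ of the "smallest linear majorant" of the form $\epsilon\delta + \epsilon^2$ of the function $\gamma_n(\delta)$.
This interpretation of $\bar\delta_n$ is of some importance in the design of complexity penalties used in this paper.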
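As a numerical illustration of the fixed point $\bar\delta_n$ discussed in this section, here is a minimal Python sketch. It is not part of the paper: the function names `gamma_n` and `solve_delta_n`, the use of bisection, and the toy spectrum $\lambda_k = k^{-2\beta}$ truncated at $10^5$ terms are all assumptions made for illustration. It solves $\gamma_n(\delta) = \delta^2$ and compares $\bar\delta_n^2$ with the rate $n^{-2\beta/(2\beta+1)}$.

```python
import numpy as np

def gamma_n(delta, eigvals, n):
    """Evaluate gamma_n(delta) = sqrt((1/n) * sum_k min(lambda_k, delta^2))."""
    return np.sqrt(np.sum(np.minimum(eigvals, delta ** 2)) / n)

def solve_delta_n(eigvals, n, tol=1e-12):
    """Bisection for the positive root of gamma_n(delta) = delta^2."""
    lo, hi = 0.0, 1.0
    # gamma_n(delta) / delta^2 is decreasing, so enlarge hi until gamma_n(hi) <= hi^2.
    while gamma_n(hi, eigvals, n) > hi ** 2:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_n(mid, eigvals, n) > mid ** 2:
            lo = mid   # the root lies to the right of mid
        else:
            hi = mid   # the root lies at or to the left of mid
    return 0.5 * (lo + hi)

# Assumed toy spectrum: lambda_k = k^(-2*beta), truncated at 10^5 terms.
beta = 1.0
eigvals = np.arange(1, 100_001, dtype=float) ** (-2.0 * beta)
for n in (100, 1_000, 10_000):
    delta_sq = solve_delta_n(eigvals, n) ** 2
    rate = n ** (-2.0 * beta / (2.0 * beta + 1.0))
    print(n, delta_sq, delta_sq / rate)  # the ratio should stay bounded as n grows
```

Bisection is used only because $\gamma_n(\delta)/\delta^2$ is monotone; any one-dimensional root finder would serve equally well.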

1.2 Sparse Recovery via Regularization

Instead of minimizing the empirical risk over an RKHS-ball (as in problem (1)), it is very common to define the estimator $\hat f$ of the target function $f_*$ as a solution of the penalized empirical risk minimization problem of the form

$$\hat f := \operatorname{argmin}_{f \in H_K} \Big[ \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i)) + \epsilon \|f\|_{H_K}^{\alpha} \Big], \tag{2}$$

where $\epsilon > 0$ is a tuning parameter that balances the tradeoff between the empirical risk and the "smoothness" of the estimate and, most often, $\alpha = 2$ (sometimes, $\alpha = 1$). The properties of the estimator $\hat f$ have been studied extensively. In particular, it was possible to derive probabilistic bounds on the excess risk $\mathcal{E}(\ell \circ \hat f)$ (oracle inequalities) with the control of the random error in terms of the rate of decay of the eigenvalues $\{\lambda_k\}$, or, equivalently, in terms of the function $\gamma_n$ (see, e.g., Blanchard, Bousquet and Massart).

In recent years, there has been a lot of interest in a data dependent choice of kernel $K$ in this type of problems. In particular, given a finite (possibly large) dictionary $\{K_j : j = 1, 2, \dots, N\}$ of symmetric nonnegatively definite kernels on $S$, one can try to find a "good" kernel $K$ as a convex combination of the kernels from the dictionary:

$$K \in \mathcal{K} := \Big\{ \sum_{j=1}^N \theta_j K_j : \theta_j \ge 0, \ \theta_1 + \dots + \theta_N = 1 \Big\}. \tag{3}$$

The coefficients of $K$ need to be estimated from the training data along with the prediction rule. Using this approach for problem (2) with $\alpha = 1$ leads to the following optimization problem:

$$\hat f := \operatorname{argmin}_{f \in H_K, \ K \in \mathcal{K}} \Big( P_n(\ell \circ f) + \epsilon \|f\|_{H_K} \Big). \tag{4}$$

This learning problem, often referred to as the multiple kernel learning, has been studied recently by Bousquet and Herrmann (2003), Crammer, Keshet and Singer (2003), Lanckriet, Cristianini, Bartlett, Ghaoui and Jordan (2004), Micchelli and Pontil (2005), Lin and Zhang (2006), Srebro and Ben-David (2006), Bach (2008) and Koltchinskii and Yuan (2008) among others. In particular (see, e.g., Micchelli and Pontil 2005), problem (4) is equivalent to the following:

$$(\hat f_1, \dots, \hat f_N) := \operatorname{argmin}_{f_j \in H_{K_j}, \ j = 1, \dots, N} \Big( P_n(\ell \circ (f_1 + \dots + f_N)) + \epsilon \sum_{j=1}^N \|f_j\|_{H_{K_j}} \Big), \tag{5}$$

which is an infinite-dimensional version of LASSO-type penalization. Koltchinskii and Yuan (2008) studied this method in the case when the dictionary is large, but the target function $f_*$ has a "sparse representation" in terms of a relatively small subset of kernels $\{K_j : j \in J\}$. It was shown that this method is adaptive to sparsity, extending well known properties of LASSO to this infinite dimensional framework.

In this paper, we study a different approach to the multiple kernel learning. It is closer to the recent work on "sparse additive models" (see, e.g., Ravikumar, Liu, Lafferty and Wasserman 2008 and Meier, van de Geer and Bühlmann 2009) and it is based on a "double penalization" with a combination of empirical $L_2$-norms (used to enforce the sparsity of the solution) and RKHS-norms (used to enforce the "smoothness" of the components). Moreover, we suggest a data-driven method of choosing the values of regularization parameters that is adaptive to unknown smoothness of the components (determined by the behavior of distribution dependent eigenvalues of the kernels).

Let $H_j := H_{K_j}$, $j = 1, \dots, N$. Denote

$$H := \mathrm{l.s.}\Big( \bigcup_{j=1}^N H_j \Big)$$

("l.s." meaning "the linear span"), and

$$H^{(N)} := \big\{ (h_1, \dots, h_N) : h_j \in H_j, \ j = 1, \dots, N \big\}.$$

Note that $f \in H$ if and only if there exists an additive representation (possibly, non-unique) $f = f_1 + \dots + f_N$, where $f_j \in H_j$, $j = 1, \dots, N$. Also, $H^{(N)}$ has a natural structure of a linear space and it can be equipped with the following inner product

$$\big\langle (f_1, \dots, f_N), (g_1, \dots, g_N) \big\rangle_{H^{(N)}} := \sum_{j=1}^N \langle f_j, g_j \rangle_{H_j}$$

to become the direct sum of Hilbert spaces $H_j$, $j = 1, \dots, N$.

Given a convex subset $D \subset H^{(N)}$, consider the following penalized empirical risk minimization problem:

$$(\hat f_1, \dots, \hat f_N) = \operatorname{argmin}_{(f_1, \dots, f_N) \in D} \Big[ P_n(\ell \circ (f_1 + \dots + f_N)) + \sum_{j=1}^N \big( \epsilon_j \|f_j\|_{L_2(\Pi_n)} + \epsilon_j^2 \|f_j\|_{H_j} \big) \Big]. \tag{6}$$

Note that for special choices of set $D$, for instance, for $D := \{(f_1, \dots, f_N) : f_j \in H_j, \ \|f_j\|_{H_j} \le R_j\}$ for some $R_j > 0$, $j = 1, \dots, N$, one can replace each component $f_j$ involved in the optimization problem by its orthogonal projection in $H_j$ onto the linear span of the functions $\{K_j(\cdot, X_i), \ i = 1, \dots, n\}$ and reduce the problem to a convex optimization over a finite dimensional space (of dimension $Nn$).
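To make the finite-dimensional reduction concrete, the following Python sketch evaluates the penalty of problem (6) for components of the form $f_j = \sum_i c_{ji} K_j(\cdot, X_i)$. For such $f_j$, the values at the design points are $(K_j c_j)_i$ and $\|f_j\|_{H_j}^2 = c_j^\top K_j c_j$, where $K_j$ denotes the unnormalized Gram matrix. The Gaussian kernels, bandwidths, coefficient vectors and values of $\epsilon_j$ below are purely illustrative assumptions, and all function names are hypothetical.

```python
import numpy as np

def gaussian_gram(x, bandwidth):
    """Gram matrix (K(X_i, X_l)) of a Gaussian kernel on one-dimensional inputs."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2 / (2.0 * bandwidth ** 2))

def empirical_l2_norm(coef, gram):
    """||f||_{L2(Pi_n)} for f = sum_i coef[i] * K(., X_i): sqrt(mean of f(X_i)^2)."""
    fitted = gram @ coef  # values f(X_1), ..., f(X_n)
    return np.sqrt(np.mean(fitted ** 2))

def rkhs_norm(coef, gram):
    """||f||_H for the same f: sqrt(coef^T K coef), clipped at 0 against round-off."""
    return np.sqrt(max(float(coef @ gram @ coef), 0.0))

def penalty(coefs, grams, eps):
    """Sum over j of eps_j * ||f_j||_{L2(Pi_n)} + eps_j^2 * ||f_j||_{H_j}."""
    return sum(e * empirical_l2_norm(c, K) + e ** 2 * rkhs_norm(c, K)
               for c, K, e in zip(coefs, grams, eps))

rng = np.random.default_rng(0)
x = rng.uniform(size=30)
grams = [gaussian_gram(x, h) for h in (0.1, 0.5, 2.0)]  # dictionary of N = 3 kernels
coefs = [rng.normal(size=30), np.zeros(30), rng.normal(size=30)]  # second component is zero
eps = [0.2, 0.2, 0.1]  # illustrative values, not the data-driven choice
print(penalty(coefs, grams, eps))
```

A zero component contributes nothing to the penalty, which is the mechanism through which the empirical $L_2$ term can favor sparse solutions. Since the Gaussian kernel satisfies $K(x, x) = 1$, the empirical $L_2$-norm computed here never exceeds the RKHS-norm.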

The complexity penalty in the problem (6) is based on two norms of the components $f_j$ of an additive representation: the empirical $L_2$-norm, $\|f_j\|_{L_2(\Pi_n)}$, with regularization parameter $\epsilon_j$, and an RKHS-norm, $\|f_j\|_{H_j}$, with regularization parameter $\epsilon_j^2$. The empirical $L_2$-norm (the lighter norm) is used to enforce the sparsity of the solution whereas the RKHS norms (the heavier norms) are used to enforce the "smoothness" of the components. This is similar to the approach taken in Meier, van de Geer and Bühlmann (2009) in the context of classical additive models, i.e., in the case when $S := [0, 1]^N$, $H_j := W^{\alpha,2}([0, 1])$ for some smoothness $\alpha > 1/2$ and the space $H_j$ is a space of functions depending on the $j$-th variable. In this case, the regularization parameters $\epsilon_j$ are equal (up to a constant) to $n^{-\alpha/(2\alpha+1)}$. The quantity $\epsilon_j^2$, used in the "smoothness part" of the penalty, coincides with the minimax convergence rate in a one component smooth problem. At the same time, the quantity $\epsilon_j$, used in the "sparsity part" of the penalty, is equal to the square root of the minimax rate (which is similar to the choice of regularization parameter in standard sparse recovery methods such as LASSO). This choice of regularization parameters results in the excess risk of the order $d \, n^{-2\alpha/(2\alpha+1)}$, where $d$ is the number of components of the target function (the degree of sparsity of the problem).

The framework of multiple kernel learning considered in this paper includes many generalized versions of classical additive models. For instance, one can think of the case when $S := [0, 1]^{m_1} \times \dots \times [0, 1]^{m_N}$ and $H_j = W^{\alpha,2}([0, 1]^{m_j})$ is a space of functions depending on the $j$-th block of variables. In this case, a proper choice of regularization parameters (for uniform design distribution) would be $\epsilon_j = n^{-\alpha/(2\alpha+m_j)}$, $j = 1, \dots, N$ (so, these parameters and the error rates for different components of the model are different). It should be also clear from the discussion in Section 1.1 that, if the design distribution $\Pi$ is unknown, the minimax convergence rates for the one component problems are also unknown.
For instance, if the projections of design points on the cubes $[0, 1]^{m_j}$ are distributed in lower dimensional submanifolds of these cubes, then the unknown dimensions of the submanifolds (rather than the dimensions $m_j$) would be involved in the minimax rates and in the regularization parameters $\epsilon_j$. Because of this, a data driven choice of regularization parameters $\epsilon_j$ that provides adaptation to the unknown design distribution $\Pi$ and to the unknown "smoothness" of the components (related to this distribution) is a major issue in multiple kernel learning. From this point of view, even in the case of classical additive models, the choice of regularization

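parameters that is based only on Sobolev type smoothness and ignores the design distribution is not adaptive. Note that, in the infinite dimensional LASSO studied in Koltchinskii and Yuan (2008), the regularization parameter $\epsilon$ is chosen the same way as in the classical LASSO ($\epsilon \asymp \sqrt{\log N / n}$), so, it is not related to the smoothness of the components. However, the oracle inequalities proved in Koltchinskii and Yuan (2008) give correct size of the excess risk only for special choices of kernels that depend on unknown "smoothness" of the components of the target function $f_*$, so, this method is not adaptive either.

1.3 Adaptive Choice of Regularization Parameters

Denote

$$\hat K_j := \big( n^{-1} K_j(X_l, X_k) \big)_{l,k=1,\dots,n}.$$

This $n \times n$ Gram matrix can be viewed as an empirical version of the integral operator $T_{K_j}$ from $L_2(\Pi)$ into $L_2(\Pi)$ with kernel $K_j$. Denote by $\hat\lambda_k^{(j)}$, $k = 1, 2, \dots$ the eigenvalues of $\hat K_j$ arranged in decreasing order. We also use the notation $\lambda_k^{(j)}$, $k = 1, 2, \dots$ for the eigenvalues of the operator $T_{K_j} : L_2(\Pi) \to L_2(\Pi)$ with kernel $K_j$ arranged in decreasing order. Define functions $\gamma_n^{(j)}$, $\hat\gamma_n^{(j)}$,

$$\gamma_n^{(j)}(\delta) := \Big( \frac{1}{n} \sum_{k=1}^{\infty} \big( \lambda_k^{(j)} \wedge \delta^2 \big) \Big)^{1/2} \quad \text{and} \quad \hat\gamma_n^{(j)}(\delta) := \Big( \frac{1}{n} \sum_{k=1}^{n} \big( \hat\lambda_k^{(j)} \wedge \delta^2 \big) \Big)^{1/2},$$

and, for a fixed given $A \ge 1$, let

$$\hat\epsilon_j := \inf \Big\{ \epsilon \ge \sqrt{\frac{A \log N}{n}} : \hat\gamma_n^{(j)}(\delta) \le \epsilon\delta + \epsilon^2, \ \delta \in (0, 1] \Big\}. \tag{7}$$

One can view $\hat\epsilon_j$ as an empirical estimate of the quantity $\breve\epsilon_j = \breve\epsilon(K_j)$ that (as we have already pointed out) plays a crucial role in the bounds on the excess risk in empirical risk minimization problems in the RKHS context. In fact, since most often $\breve\epsilon_j \ge \sqrt{A \log N / n}$, we will redefine this quantity as

$$\breve\epsilon_j := \inf \Big\{ \epsilon \ge \sqrt{\frac{A \log N}{n}} : \gamma_n^{(j)}(\delta) \le \epsilon\delta + \epsilon^2, \ \delta \in (0, 1] \Big\}. \tag{8}$$

We will use the following values of regularization parameters in problem (6): $\epsilon_j = \tau \hat\epsilon_j$, where $\tau$ is a sufficiently large constant.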
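The data-driven quantity $\hat\epsilon_j$ of definition (7) can be approximated numerically. The sketch below is a rough illustration and not the authors' procedure: it checks the constraint $\hat\gamma_n^{(j)}(\delta) \le \epsilon\delta + \epsilon^2$ only on a finite grid of $\delta \in (0, 1]$ (an approximation) and locates the smallest feasible $\epsilon$ by bisection, which is valid because feasibility is monotone in $\epsilon$. The function names, the grid size, and the Gaussian kernel in the usage lines are assumptions.

```python
import numpy as np

def gamma_hat(deltas, eigvals, n):
    """gamma_hat(delta) = sqrt((1/n) * sum_k min(lambda_hat_k, delta^2)) for each delta."""
    clipped = np.minimum(eigvals[None, :], deltas[:, None] ** 2)
    return np.sqrt(clipped.sum(axis=1) / n)

def eps_hat(gram, A, N, grid_size=2000, tol=1e-10):
    """Approximate eps_hat_j of definition (7) for one kernel.

    gram is the n x n matrix (K_j(X_l, X_k)); the constraint over delta in (0, 1]
    is only checked on a finite grid, so the result is an approximation.
    """
    n = gram.shape[0]
    # Eigenvalues of hat K_j = gram / n, clipped at 0 against round-off.
    eigvals = np.clip(np.linalg.eigvalsh(gram / n), 0.0, None)
    deltas = np.linspace(1.0 / grid_size, 1.0, grid_size)
    gam = gamma_hat(deltas, eigvals, n)

    def feasible(eps):
        return bool(np.all(gam <= eps * deltas + eps ** 2))

    lo = np.sqrt(A * np.log(N) / n)  # lower limit imposed in (7)
    if feasible(lo):
        return lo
    hi = max(1.0, 2.0 * lo)
    while not feasible(hi):  # feasibility is monotone in eps
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(0)
x = rng.uniform(size=400)
gram = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * 0.05 ** 2))  # assumed Gaussian kernel
print(eps_hat(gram, A=1.0, N=2))
```

The regularization parameters of problem (6) would then be obtained as $\tau$ times this value for each kernel in the dictionary; when $A \log N$ is large relative to $\sqrt{n}$, the lower limit $\sqrt{A \log N / n}$ is already feasible and is returned unchanged.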

It should be emphasized that the structure of complexity penalty and the choice of regularization parameters in (6) are closely related to the following bound on Rademacher processes indexed by functions from an RKHS $H_K$: with a high probability, for all $h \in H_K$,

$$|R_n(h)| \le C \big[ \breve\epsilon(K) \|h\|_{L_2(\Pi)} + \breve\epsilon^2(K) \|h\|_{H_K} \big].$$

Such bounds follow from the results of Section 3 and they provide a way to prove sparsity oracle inequalities for the estimators (6). The Rademacher process is defined as

$$R_n(f) := \frac{1}{n} \sum_{j=1}^n \varepsilon_j f(X_j),$$

where $\{\varepsilon_j\}$ is a sequence of i.i.d. Rademacher random variables (taking values $+1$ and $-1$ with probability $1/2$ each) independent of $\{X_j\}$.

We will use several basic facts of the empirical processes theory throughout the paper. They include symmetrization inequalities and contraction (comparison) inequalities for Rademacher processes that can be found in the books of Ledoux and Talagrand (1991) and van der Vaart and Wellner. We also use Talagrand's concentration inequality for empirical processes (see Talagrand 1996, Bousquet).

The main goal of the paper is to establish oracle inequalities for the excess risk of the estimator $\hat f = \hat f_1 + \dots + \hat f_N$. In these inequalities, the excess risk of $\hat f$ is compared with the excess risk of an oracle $f := f_1 + \dots + f_N$, $(f_1, \dots, f_N) \in D$ with an error term depending on the degree of sparsity of the oracle, i.e., on the number of non-zero components $f_j \in H_j$ in its additive representation. The oracle inequalities will be stated in the next section. Their proof relies on probabilistic bounds for empirical $L_2$-norms and data dependent regularization parameters $\hat\epsilon_j$. The results of Section 3 show that they can be bounded by their respective population counterparts. Using these tools and some bounds on empirical processes derived in Section 5, we prove in Section 4 the oracle inequalities for the estimator $\hat f$.

2 Oracle Inequalities

Considering the problem in the case when the domain $D$ of (6) is not bounded, say, $D = H^{(N)}$, leads to additional technical complications and might require some changes in the estimation procedure.
To avoid this, we assume below that $D$ is a bounded convex subset of $H^{(N)}$. It

will be also assumed that, for all $j = 1, \dots, N$, $\sup_{x \in S} K_j(x, x) \le 1$, which, by elementary properties of RKHS, implies that $\|f_j\|_{L_\infty} \le \|f_j\|_{H_j}$, $j = 1, \dots, N$. Because of this,

$$R_D := \sup_{(f_1, \dots, f_N) \in D} \|f_1 + \dots + f_N\|_{L_\infty} < +\infty.$$

Denote $R_D^* := R_D \vee \|f_*\|_{L_\infty}$. We will allow the constants involved in the oracle inequalities stated and proved below to depend on the value of $R_D^*$ (so, implicitly, it is assumed that this value is not too large).

We shall also assume that $N$ is large enough, say, so that $\log N \ge 2 \log\log n$. This assumption is not essential to our development and is in place to avoid an extra term of the order $n^{-1} \log\log n$ in our risk bounds.

2.1 Loss Functions of Quadratic Type

We will formulate the assumptions on the loss function $\ell$. The main assumption is that, for all $y \in T$, $\ell(y, \cdot)$ is a nonnegative convex function. In addition, we will assume that $\ell(y, 0)$, $y \in T$ is uniformly bounded from above by a numerical constant. Moreover, suppose that, for all $y \in T$, $\ell(y, \cdot)$ is twice continuously differentiable and its first and second derivatives are uniformly bounded in $T \times [-R_D^*, R_D^*]$. Denote

$$m(R) := \frac{1}{2} \inf_{y \in T} \inf_{|u| \le R} \frac{\partial^2 \ell(y, u)}{\partial u^2}, \quad M(R) := \frac{1}{2} \sup_{y \in T} \sup_{|u| \le R} \frac{\partial^2 \ell(y, u)}{\partial u^2} \tag{9}$$

and let $m_* := m(R_D^*)$, $M_* := M(R_D^*)$. We will assume that $m_* > 0$. Denote

$$L_* := \sup_{|u| \le R_D^*, \ y \in T} \Big| \frac{\partial \ell}{\partial u}(y, u) \Big|.$$

Clearly, for all $y \in T$, the function $\ell(y, \cdot)$ satisfies Lipschitz condition with constant $L_*$. The constants $m_*$, $M_*$, $L_*$ will appear in a number of places in what follows. Without loss of generality, we can also assume that $m_* \le 1$ and $L_* \ge 1$ (otherwise, $m_*$ and $L_*$ can be replaced by a lower bound and an upper bound, respectively).

The loss functions satisfying the assumptions stated above will be called the losses of quadratic type. If $\ell$ is a loss of quadratic type and $f = f_1 + \dots + f_N$, $(f_1, \dots, f_N) \in D$, then

$$m_* \|f - f_*\|_{L_2(\Pi)}^2 \le \mathcal{E}(\ell \circ f) \le M_* \|f - f_*\|_{L_2(\Pi)}^2. \tag{10}$$

This bound easily follows from a simple argument based on Taylor expansion and it will be used later in the paper. If $H$ is dense in $L_2(\Pi)$, then (10) implies that

$$\inf_{f \in H} P(\ell \circ f) = \inf_{f \in L_2(\Pi)} P(\ell \circ f) = P(\ell \circ f_*). \tag{11}$$

The quadratic loss $\ell(y, u) := (y - u)^2$ in the case when $T \subset \mathbb{R}$ is a bounded set is one of the main examples of such loss functions. In this case, $m(R) = 1$ for all $R > 0$. In regression problems with a bounded response variable, more general loss functions of the form $\ell(y, u) := \phi(y - u)$ can be also used, where $\phi$ is an even nonnegative convex twice continuously differentiable function with $\phi''$ uniformly bounded in $\mathbb{R}$, $\phi(0) = 0$ and $\phi''(u) > 0$, $u \in \mathbb{R}$. In classification problems, the loss functions of the form $\ell(y, u) = \phi(yu)$ are commonly used, with $\phi$ being a nonnegative decreasing convex twice continuously differentiable function such that, again, $\phi''$ is uniformly bounded in $\mathbb{R}$ and $\phi''(u) > 0$, $u \in \mathbb{R}$. The loss function $\phi(u) = \log(1 + e^{-u})$ (often referred to as the logit loss) is a specific example.

2.2 Geometry of the Dictionary

Now we introduce several important geometric characteristics of dictionaries consisting of kernels (or, equivalently, of RKHS). These characteristics are related to the degree of "dependence" of spaces of random variables $H_j \subset L_2(\Pi)$, $j = 1, \dots, N$ and they will be involved in the oracle inequalities for the excess risk $\mathcal{E}(\ell \circ \hat f)$.

First, for $J \subset \{1, \dots, N\}$ and $b \in [0, +\infty]$, denote

$$C_J^{(b)} := \Big\{ (h_1, \dots, h_N) \in H^{(N)} : \sum_{j \notin J} \|h_j\|_{L_2(\Pi)} \le b \sum_{j \in J} \|h_j\|_{L_2(\Pi)} \Big\}.$$

Clearly, the set $C_J^{(b)}$ is a cone in the space $H^{(N)}$ that consists of vectors $(h_1, \dots, h_N)$ whose components corresponding to $j \in J$ "dominate" the rest of the components. This family of cones increases as $b$ increases. For $b = 0$, $C_J^{(b)}$ coincides with the linear subspace of vectors for which $h_j = 0$, $j \notin J$. For $b = +\infty$, $C_J^{(b)}$ is the whole space $H^{(N)}$.

The following quantity will play the most important role:

$$\beta_{2,b}(J; \Pi) := \beta_{2,b}(J) := \inf \Big\{ \beta > 0 : \Big( \sum_{j \in J} \|h_j\|_{L_2(\Pi)}^2 \Big)^{1/2} \le \beta \Big\| \sum_{j=1}^N h_j \Big\|_{L_2(\Pi)}, \ (h_1, \dots, h_N) \in C_J^{(b)} \Big\}. \tag{12}$$

Clearly, $\beta_{2,b}(J; \Pi)$ is a nondecreasing function of $b$. In the case of a "simple dictionary" that consists of one-dimensional spaces, similar quantities have been used in the literature on sparse recovery (see, e.g., Koltchinskii 2008, 2009a,b,c).

The quantity $\beta_{2,b}(J; \Pi)$ can be upper bounded in terms of some other geometric characteristics that describe how "dependent" the spaces of random variables $H_j \subset L_2(\Pi)$ are. These characteristics will be introduced below.

Given $h_j \in H_j$, $j = 1, \dots, N$, denote by $\kappa(\{h_j : j \in J\})$ the minimal eigenvalue of the Gram matrix $\big( \langle h_j, h_k \rangle_{L_2(\Pi)} \big)_{j,k \in J}$. Let

$$\kappa(J) := \inf \Big\{ \kappa(\{h_j : j \in J\}) : h_j \in H_j, \ \|h_j\|_{L_2(\Pi)} = 1 \Big\}. \tag{13}$$

We will also use the notation

$$H_J = \mathrm{l.s.}\Big( \bigcup_{j \in J} H_j \Big).$$

The following quantity is the maximal cosine of the angle in the space $L_2(\Pi)$ between the vectors in the subspaces $H_I$ and $H_J$ for some $I, J \subset \{1, \dots, N\}$:

$$\rho(I, J) := \sup \Big\{ \frac{\langle f, g \rangle_{L_2(\Pi)}}{\|f\|_{L_2(\Pi)} \|g\|_{L_2(\Pi)}} : f \in H_I, \ g \in H_J, \ f \ne 0, \ g \ne 0 \Big\}. \tag{14}$$

Denote $\rho(J) := \rho(J, J^c)$. The quantities $\rho(I, J)$ and $\rho(J)$ are very similar to the notion of canonical correlation in the multivariate statistical analysis.

There are other important geometric characteristics, frequently used in the theory of sparse recovery, including so called "restricted isometry constants" by Candes and Tao. Define $\delta_d(\Pi)$ to be the smallest $\delta > 0$ such that for all $(h_1, \dots, h_N) \in H^{(N)}$ and all $J \subset \{1, \dots, N\}$ with $\mathrm{card}(J) = d$,

$$(1 - \delta) \Big( \sum_{j \in J} \|h_j\|_{L_2(\Pi)}^2 \Big)^{1/2} \le \Big\| \sum_{j \in J} h_j \Big\|_{L_2(\Pi)} \le (1 + \delta) \Big( \sum_{j \in J} \|h_j\|_{L_2(\Pi)}^2 \Big)^{1/2}.$$

This condition with a sufficiently small value of $\delta_d(\Pi)$ means that for all choices of $J$ with $\mathrm{card}(J) = d$ the functions in the spaces $H_j$, $j \in J$ are "almost orthogonal" in $L_2(\Pi)$.

The following simple proposition easily follows from some statements in Koltchinskii (2009a,b, 2008) (where the case of simple dictionaries consisting of one-dimensional spaces $H_j$ was considered).

Proposition 1 For all $J \subset \{1, \dots, N\}$,

$$\beta_{2,\infty}(J; \Pi) \le \frac{1}{\sqrt{\kappa(J)(1 - \rho^2(J))}}.$$

Also, if $\mathrm{card}(J) = d$ and $\delta_{3d}(\Pi) \le \frac{1}{8b}$, then $\beta_{2,b}(J; \Pi) \le 4$.

Thus, such quantities as $\beta_{2,\infty}(J; \Pi)$ or $\beta_{2,b}(J; \Pi)$, for finite values of $b$, are reasonably small provided that the spaces of random variables $H_j$, $j = 1, \dots, N$ satisfy proper conditions of "weakness of correlations".

2.3 Excess Risk Bounds

We are now in a position to formulate our main theorems that provide oracle inequalities for the excess risk $\mathcal{E}(\ell \circ \hat f)$. In these theorems, $\mathcal{E}(\ell \circ \hat f)$ will be compared with the excess risk $\mathcal{E}(\ell \circ f)$ of an oracle $(f_1, \dots, f_N) \in D$. Here and in what follows, $f := f_1 + \dots + f_N \in H$. This is a little abuse of notation: we are ignoring the fact that such an additive representation of a function $f \in H$ is not necessarily unique. In some sense, $f$ denotes both the vector $(f_1, \dots, f_N) \in H^{(N)}$ and the function $f_1 + \dots + f_N \in H$. However, this is not going to cause confusion in what follows. We will also use the following notations:

$$J_f := \{1 \le j \le N : f_j \ne 0\} \quad \text{and} \quad d(f) := \mathrm{card}(J_f).$$

The error terms of the oracle inequalities will depend on the quantities $\breve\epsilon_j = \breve\epsilon(K_j)$ related to the "smoothness" properties of the RKHS and also on the geometric characteristics of the dictionary introduced above. In the first theorem, we will use the quantity $\beta_{2,\infty}(J_f; \Pi)$ to characterize the properties of the dictionary. In this case, there will be no assumptions on the quantities $\breve\epsilon_j$: these quantities could be of different order for different kernel machines, so, different components of the additive representation could have different "smoothness". In the second theorem, we will use a smaller quantity $\beta_{2,b}(J; \Pi)$ for a proper choice of parameter $b < \infty$. In this case, we will have to make an additional assumption that $\breve\epsilon_j$, $j = 1, \dots, N$ are all of the same order (up to a constant).

In both cases, we consider penalized empirical risk minimization problem (6) with data-dependent regularization parameters $\epsilon_j = \tau \hat\epsilon_j$, where $\hat\epsilon_j$, $j = 1, \dots, N$ are defined by (7) with some $A \ge 4$ and $\tau \ge B L_*$ for a numerical constant $B$.

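Theorem 2 There exist numerical constants $C_1, C_2 > 0$ such that, for all oracles $(f_1, \dots, f_N) \in D$, with probability at least $1 - 3N^{-A/2}$,

$$\mathcal{E}(\ell \circ \hat f) + C_1 \Big( \tau \sum_{j=1}^N \breve\epsilon_j \|\hat f_j - f_j\|_{L_2(\Pi)} + \tau^2 \sum_{j=1}^N \breve\epsilon_j^2 \|\hat f_j\|_{H_j} \Big) \le 2 \mathcal{E}(\ell \circ f) + C_2 \tau^2 \sum_{j \in J_f} \breve\epsilon_j^2 \Big( \frac{\beta_{2,\infty}^2(J_f, \Pi)}{m_*} + \|f_j\|_{H_j} \Big). \tag{15}$$

This result means that if there exists an oracle $(f_1, \dots, f_N) \in D$ such that (a) the excess risk $\mathcal{E}(\ell \circ f)$ is small; (b) the spaces $H_j$, $j \in J_f$ are not strongly correlated with the spaces $H_j$, $j \notin J_f$; (c) $H_j$, $j \in J_f$ are "well posed" in the sense that $\kappa(J_f)$ is not too small; (d) $\|f_j\|_{H_j}$, $j \in J_f$ are all bounded by a reasonable constant, then the excess risk $\mathcal{E}(\ell \circ \hat f)$ is essentially controlled by $\sum_{j \in J_f} \breve\epsilon_j^2$. At the same time, the oracle inequality provides a bound on the $L_2(\Pi)$-distances between the estimated components $\hat f_j$ and the components of the oracle (of course, everything is under the assumption that the loss is of quadratic type and $m_*$ is bounded away from 0).

Note also that the constant 2 in front of the excess risk of the oracle $\mathcal{E}(\ell \circ f)$ can be replaced by $1 + \delta$ for any $\delta > 0$ with minor modifications of the proof (in this case, the constant $C_2$ depends on $\delta$ and is of the order $1/\delta$).

Suppose now that there exist $\breve\epsilon > 0$ and a constant $\Lambda > 0$ such that

$$\Lambda^{-1} \le \frac{\breve\epsilon_j}{\breve\epsilon} \le \Lambda, \quad j = 1, \dots, N.$$

Theorem 3 There exist numerical constants $C_1, C_2, b > 0$ such that, for all oracles $(f_1, \dots, f_N) \in D$, with probability at least $1 - 3N^{-A/2}$,

$$\mathcal{E}(\ell \circ \hat f) + C_1 \Big( \frac{\tau \breve\epsilon}{\Lambda} \sum_{j=1}^N \|\hat f_j - f_j\|_{L_2(\Pi)} + \frac{\tau^2 \breve\epsilon^2}{\Lambda^2} \sum_{j=1}^N \|\hat f_j\|_{H_j} \Big) \le 2 \mathcal{E}(\ell \circ f) + C_2 \Lambda^2 \tau^2 \breve\epsilon^2 \Big( \frac{\beta_{2,b\Lambda^2}^2(J_f, \Pi)}{m_*} d(f) + \sum_{j \in J_f} \|f_j\|_{H_j} \Big). \tag{16}$$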

As before, the constant 2 in the upper bound can be replaced by $1 + \delta$, but, in this case, the constants $C_2$ and $b$ would be of the order $1/\delta$. The meaning of this result is that if there exists an oracle $(f_1, \dots, f_N) \in D$ such that (a) the excess risk $\mathcal{E}(\ell \circ f)$ is small; (b) the restricted isometry constant $\delta_{3d}(\Pi)$ is small for $d = d(f)$; (c) $\|f_j\|_{H_j}$, $j \in J_f$ are all bounded by a reasonable constant, then the excess risk $\mathcal{E}(\ell \circ \hat f)$ is essentially controlled by $d(f) \breve\epsilon^2$. At the same time, the distance $\sum_{j=1}^N \|\hat f_j - f_j\|_{L_2(\Pi)}$ between the estimator and the oracle is controlled by $d(f) \breve\epsilon$. In particular, this implies that the empirical solution $(\hat f_1, \dots, \hat f_N)$ is "approximately sparse" in the sense that $\sum_{j \notin J_f} \|\hat f_j\|_{L_2(\Pi)}$ is of the order $d(f) \breve\epsilon$.

Remarks.

1. It is easy to check that Theorems 2 and 3 hold also if one replaces $N$ in the definitions (7) of $\hat\epsilon_j$ and (8) of $\breve\epsilon_j$ by an arbitrary $\bar N \ge N$ such that $\log \bar N \ge 2 \log\log n$ (a similar condition on $N$ introduced early in Section 2 is not needed here). In this case, the probability bounds in the theorems become $1 - 3\bar N^{-A/2}$. This change might be of interest if one uses the results for a dictionary consisting of just one RKHS ($N = 1$), which is not the focus of this paper.

2. If the distribution dependent quantities $\breve\epsilon_j$, $j = 1, \dots, N$ are known and used as regularization parameters in (6), the oracle inequalities of Theorems 2 and 3 also hold (with obvious simplifications of their proofs). For instance, in the case when $S = [0, 1]^N$, the design distribution $\Pi$ is uniform and, for each $j = 1, \dots, N$, $H_j$ is a Sobolev space of functions of smoothness $\alpha > 1/2$ depending only on the $j$-th variable, we have $\breve\epsilon_j \asymp n^{-\alpha/(2\alpha+1)}$. Taking in this case

$$\epsilon_j = \tau \Big( n^{-\alpha/(2\alpha+1)} \vee \sqrt{\frac{A \log N}{n}} \Big)$$

would lead to oracle inequalities for sparse additive models in the spirit of Meier, van de Geer and Bühlmann (2009). More precisely, if $H_j := \{h \in W^{\alpha,2}[0, 1] : \int_0^1 h(x) \, dx = 0\}$, then, for uniform distribution $\Pi$, the spaces $H_j$ are orthogonal in $L_2(\Pi)$ (recall that $H_j$ is viewed as a space of functions depending on the $j$-th coordinate).
Assume, for simplicity, that $\ell$ is the quadratic loss and that the regression function $f_*$ can be represented as $f_* = \sum_{j \in J} f_{*,j}$, where $J$ is a subset of $\{1, \dots, N\}$ of cardinality $d$ and $\|f_{*,j}\|_{H_j} \le 1$. Then it easily follows

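from the bound of Theorem 3 that with probability at least $1 - 3N^{-A/2}$

$$\mathcal{E}(\ell \circ \hat f) = \|\hat f - f_*\|_{L_2(\Pi)}^2 \le C \tau^2 d \Big( n^{-2\alpha/(2\alpha+1)} \vee \frac{A \log N}{n} \Big).$$

Note that, up to a constant, this essentially coincides with the minimax lower bound in this type of problems obtained recently by Raskutti, Wainwright and Yu. Of course, if the design distribution is not necessarily uniform, an adaptive choice of regularization parameters might be needed even in such simple examples and the approach described above leads to minimax optimal rates.

3 Preliminary Bounds

In this section, the case of a single RKHS $H_K$ associated with a kernel $K$ is considered. We assume that $K(x, x) \le 1$, $x \in S$. This implies that, for all $h \in H_K$, $\|h\|_{L_2(\Pi)} \le \|h\|_{L_\infty} \le \|h\|_{H_K}$.

3.1 Comparison of $\|\cdot\|_{L_2(\Pi_n)}$ and $\|\cdot\|_{L_2(\Pi)}$

First, we study the relationship between the empirical and the population $L_2$ norms for functions in $H_K$.

Theorem 4 Assume that $A \ge 1$ and $\log N \ge 2 \log\log n$. Then there exists a numerical constant $C > 0$ such that with probability at least $1 - N^{-A}$ for all $h \in H_K$

$$\|h\|_{L_2(\Pi)} \le C \big( \|h\|_{L_2(\Pi_n)} + \bar\epsilon \|h\|_{H_K} \big); \tag{17}$$

$$\|h\|_{L_2(\Pi_n)} \le C \big( \|h\|_{L_2(\Pi)} + \bar\epsilon \|h\|_{H_K} \big), \tag{18}$$

where

$$\bar\epsilon = \bar\epsilon(K) := \inf \Big\{ \epsilon \ge \sqrt{\frac{A \log N}{n}} : \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ \|h\|_{L_2(\Pi)} \le \delta} |R_n(h)| \le \epsilon\delta + \epsilon^2, \ \delta \in (0, 1] \Big\}. \tag{19}$$

Proof. Observe that the inequalities hold trivially when $h = 0$. We shall therefore consider only the case when $h \ne 0$. By symmetrization inequality,

$$\mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| \le 2 \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |R_n(h^2)|, \tag{20}$$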

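and, by contraction inequality, we further have

$$\mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| \le 8 \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |R_n(h)|.$$

The definition of $\bar\epsilon$ implies that

$$\mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| \le 8 \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |R_n(h)| \le 8 \big( \bar\epsilon 2^{-j+1} + \bar\epsilon^2 \big).$$

An application of Talagrand's concentration inequality yields

$$\sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| \le 2 \Big( \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| + 2^{-j+1} \sqrt{\frac{t + 2 \log j}{n}} + \frac{t + 2 \log j}{n} \Big)$$

$$\le 32 \Big( \bar\epsilon 2^{-j} + \bar\epsilon^2 + 2^{-j} \sqrt{\frac{t + 2 \log j}{n}} + \frac{t + 2 \log j}{n} \Big)$$

with probability at least $1 - \exp(-t - 2 \log j)$ for any natural number $j$. Now, by the union bound, for all $j$ such that $2 \log j \le t$,

$$\sup_{\|h\|_{H_K} = 1, \ 2^{-j} < \|h\|_{L_2(\Pi)} \le 2^{-j+1}} |(\Pi_n - \Pi) h^2| \le 32 \Big( \bar\epsilon 2^{-j} + \bar\epsilon^2 + 2^{-j} \sqrt{\frac{t + 2 \log j}{n}} + \frac{t + 2 \log j}{n} \Big) \tag{23}$$

with probability at least

$$1 - \sum_{j : 2 \log j \le t} \exp(-t - 2 \log j) = 1 - \exp(-t) \sum_{j : 2 \log j \le t} j^{-2} \ge 1 - 2 \exp(-t). \tag{24}$$

Recall that $\bar\epsilon \ge (A \log N / n)^{1/2}$ and $\|h\|_{L_2(\Pi)} \le \|h\|_{H_K}$. Taking $t = A \log N + \log 4$, we easily get that, for all $h \in H_K$ such that $\|h\|_{H_K} = 1$ and $\|h\|_{L_2(\Pi)} \ge \exp\{-N^{A/2}\}$,

$$|(\Pi_n - \Pi) h^2| \le C \big( \bar\epsilon \|h\|_{L_2(\Pi)} + \bar\epsilon^2 \big) \tag{25}$$

with probability at least $1 - 0.5 N^{-A}$ and with a numerical constant $C > 0$. In other words, with the same probability, for all $h \in H_K$ such that $\|h\|_{L_2(\Pi)} / \|h\|_{H_K} \ge \exp\{-N^{A/2}\}$,

$$|(\Pi_n - \Pi) h^2| \le C \big( \bar\epsilon \|h\|_{L_2(\Pi)} \|h\|_{H_K} + \bar\epsilon^2 \|h\|_{H_K}^2 \big). \tag{26}$$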

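Therefore, for all $h \in H_K$ such that

$$\frac{\|h\|_{L_2(\Pi)}}{\|h\|_{H_K}} > \exp(-N^{A/2}) \tag{27}$$

we have

$$\|h\|_{L_2(\Pi)}^2 = \Pi h^2 \le \|h\|_{L_2(\Pi_n)}^2 + C \big( \bar\epsilon \|h\|_{L_2(\Pi)} \|h\|_{H_K} + \bar\epsilon^2 \|h\|_{H_K}^2 \big),$$

$$\|h\|_{L_2(\Pi_n)}^2 = \Pi_n h^2 \le \|h\|_{L_2(\Pi)}^2 + C \big( \bar\epsilon \|h\|_{L_2(\Pi)} \|h\|_{H_K} + \bar\epsilon^2 \|h\|_{H_K}^2 \big).$$

It can be now deduced that, for a proper value of numerical constant $C$,

$$\|h\|_{L_2(\Pi)} \le C \big( \|h\|_{L_2(\Pi_n)} + \bar\epsilon \|h\|_{H_K} \big) \quad \text{and} \quad \|h\|_{L_2(\Pi_n)} \le C \big( \|h\|_{L_2(\Pi)} + \bar\epsilon \|h\|_{H_K} \big). \tag{28}$$

It remains to consider the case when

$$\frac{\|h\|_{L_2(\Pi)}}{\|h\|_{H_K}} \le \exp(-N^{A/2}). \tag{29}$$

Following a similar argument as before, with probability at least $1 - 0.5 N^{-A}$,

$$\sup_{\|h\|_{H_K} = 1, \ \|h\|_{L_2(\Pi)} \le \exp(-N^{A/2})} |(\Pi_n - \Pi) h^2| \le 16 \Big( \bar\epsilon \exp(-N^{A/2}) + \bar\epsilon^2 + \exp(-N^{A/2}) \sqrt{\frac{A \log N}{n}} + \frac{A \log N}{n} \Big).$$

Under the conditions $A \ge 1$, $\log N \ge 2 \log\log n$,

$$\bar\epsilon \ge \Big( \frac{A \log N}{n} \Big)^{1/2} \ge \exp(-N^{A/2}). \tag{30}$$

Then

$$\sup_{\|h\|_{H_K} = 1, \ \|h\|_{L_2(\Pi)} \le \exp(-N^{A/2})} |(\Pi_n - \Pi) h^2| \le C \bar\epsilon^2$$

with probability at least $1 - 0.5 N^{-A}$, which also implies (17) and (18), and the result follows.

Theorem 4 shows that the two norms $\|h\|_{L_2(\Pi)}$ and $\|h\|_{L_2(\Pi_n)}$ are of the same order up to an error term $\bar\epsilon \|h\|_{H_K}$.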

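3.2 Comparison of $\hat\epsilon(K)$, $\tilde\epsilon(K)$, $\bar\epsilon(K)$ and $\breve\epsilon(K)$

Recall the definitions

$$\gamma_n(\delta) := \Big( \frac{1}{n} \sum_{k=1}^{\infty} (\lambda_k \wedge \delta^2) \Big)^{1/2}, \quad \delta \in (0, 1],$$

where $\{\lambda_k\}$ are the eigenvalues of the integral operator $T_K$ from $L_2(\Pi)$ into $L_2(\Pi)$ with kernel $K$, and, for some $A \ge 1$,

$$\breve\epsilon(K) := \inf \Big\{ \epsilon \ge \sqrt{\frac{A \log N}{n}} : \gamma_n(\delta) \le \epsilon\delta + \epsilon^2, \ \delta \in (0, 1] \Big\}.$$

It follows from Lemma 42 of Mendelson (2002) (with an additional application of Cauchy-Schwarz inequality for the upper bound and Hoffmann-Jørgensen inequality for the lower bound, see also Koltchinskii 2008) that, for some numerical constants $C_1, C_2 > 0$,

$$C_1 \Big( \frac{1}{n} \sum_{k=1}^{\infty} (\lambda_k \wedge \delta^2) \Big)^{1/2} - n^{-1} \le \mathbb{E} \sup_{\|h\|_{H_K} = 1, \ \|h\|_{L_2(\Pi)} \le \delta} |R_n(h)| \le C_2 \Big( \frac{1}{n} \sum_{k=1}^{\infty} (\lambda_k \wedge \delta^2) \Big)^{1/2}. \tag{32}$$

This fact and the definitions of $\breve\epsilon(K)$, $\bar\epsilon(K)$ easily imply the following result.

Proposition 5 Under the condition $K(x, x) \le 1$, $x \in S$, there exist numerical constants $C_1, C_2 > 0$ such that

$$C_1 \bar\epsilon(K) \le \breve\epsilon(K) \le C_2 \bar\epsilon(K). \tag{33}$$

If $K$ is the kernel of the projection operator onto a finite-dimensional subspace $H_K$ of $L_2(\Pi)$, it is easy to check that $\breve\epsilon(K) \asymp \sqrt{\dim(H_K)/n}$ (recall the notation $a \asymp b$, which means that there exists a numerical constant $c > 0$ such that $c^{-1} \le a/b \le c$). If the eigenvalues $\lambda_k$ decay at a polynomial rate, i.e., $\lambda_k \asymp k^{-2\beta}$ for some $\beta > 1/2$, then $\breve\epsilon(K) \asymp n^{-\beta/(2\beta+1)}$.

Recall the notation

$$\hat\epsilon(K) := \inf \Big\{ \epsilon \ge \sqrt{\frac{A \log N}{n}} : \Big( \frac{1}{n} \sum_{k=1}^{n} (\hat\lambda_k \wedge \delta^2) \Big)^{1/2} \le \epsilon\delta + \epsilon^2, \ \delta \in (0, 1] \Big\}, \tag{34}$$

where $\{\hat\lambda_k\}$ denote the eigenvalues of the Gram matrix $\hat K := \big( n^{-1} K(X_i, X_j) \big)_{i,j=1,\dots,n}$. It follows again from the results of Mendelson (2002) [namely, one can follow the proof of Lemma 42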

21 i the case whe the RKHS H K is restricted to the sample X 1,..., X ad the expectatios are coditioal o the sample; the oe uses Cauchy-Schwarz ad Hoffma-Jørgese iequalities as i the proof of 32] that for some umerical costats C 1, C 2 > 0 C 1 1 k=1 ˆλ k δ 2 1/2 1 E ε sup h L2 Π δ R h C 2 1 1/2 ˆλ k δ 2, 35 where E ε idicates that the expectatio is take over the Rademacher radom variables oly coditioally o X 1,..., X. Therefore, if we deote by { } A log N ɛk := if ɛ : E ε sup R h ɛδ + ɛ 2, δ 0, 1] h L2 Π δ the empirical versio of ɛk, the ˆɛK ɛk. We will ow show that ɛk ɛk with a high probability. Theorem 6 Suppose that A 1 ad log N 2 log log. There exist umerical costats C 1, C 2 > 0 such that with probability at least 1 N A. k=1 36 C 1 ɛk ɛk C 2 ɛk, 37 Proof. Let t := A log N + log 14. It follows from Talagrad cocetratio iequality that E sup R h 2 j < h L2 Π 2 j+1 2 sup R h + 2 j+1 t + 2 log j 2 j < h L2 Π 2 j+1 + t + 2 log j. with probability at least 1 exp t 2 log j. O the other had, as derived i the proof of Theorem 4 see 23 sup Π Πh j < h L2 Π 2 j+1 ɛ2 j + ɛ j t + 2 log j 21 + t + 2 log j 38

22 with probability at least 1 exp t 2 log j. We will use these bouds oly for j such that 2 log j t. I this case, the secod boud implies that, for some umerical costat c > 0 ad all h satisfyig the coditios h HK = 1, 2 j < h L2 Π 2 j+1, we have h L2 Π c2 j + ɛ agai, see the proof of Theorem 4. Combiig these bouds, we get that with probability at least 1 2 exp t 2 log j, E sup R h 2 sup R h + 2 j+1 t + 2 log j 2 j < h L2 Π 2 j+1 h L2 Π cδ j where δ j = ɛ + 2 j. + t + 2 log j. Applyig ow Talagrad cocetratio iequality to the Rademacher process coditioally o the observed data X 1,..., X yields E ε sup R h 2 h L2 Π cδ j sup h L2 Π cδ j R h + Cδ j t + 2 log j + t + 2 log j, with coditioal probability at least 1 exp t 2 log j. From this ad from the previous boud it is ot hard to deduce that, for some umerical costats C, C ad for all j such that 2 log j t, E sup R h 2 j < h L2 Π 2 j+1 C E ε sup h L2 Π cδ j R h + δ j C ɛδ j + ɛ 2 C ɛ2 j + ɛ ɛ + ɛ 2 t + 2 log j + t + 2 log j with probability at least 1 3 exp t 2 log j. I obtaiig the secod iequality, we used the defiitio of ɛ ad the fact that, for t = A log N + log 14, 2 log j t, c 1 ɛ t + 2 log j/ 1/2, where c 1 is a umerical costat. Now, by the uio boud, the above iequality holds with probability at least 1 3 j:2 log j t exp t 2 log j 1 6 exp t 39 for all j such that 2 log j t simultaeously. Similarly, it ca be show that E sup R h C ɛ exp N A/2 + ɛ ɛ + ɛ 2 h L2 Π exp N A/2 22

23 with probability at least 1 exp t. For t = A log N + log 14, we get E sup h L2 Π δ R h C ɛδ + ɛ ɛ + ɛ 2, 40 for all 0 < δ 1, with probability at least 1 7 exp t = 1 N A /2. Now by the defiitio of ɛ, we obtai ɛ C max{ ɛ, ɛ ɛ + ɛ 2 1/2 }, 41 which implies that ɛ C ɛ with probability at least 1 N A /2. Similarly oe ca show that E ɛ sup h L2 Π δ R h C ɛδ + ɛ ɛ + ɛ 2, 42 for all 0 < δ 1, with probability at least 1 N A /2, which implies that ɛ C ɛ with probability at least 1 N A /2. The proof ca the be completed by the uio boud. Defie A log N ˇɛ := ˇɛK := if ɛ : sup R h ɛδ + ɛ 2, δ 0, 1]. 43 h L2 Π δ The ext statemet ca be proved similarly to Theorem 6. Theorem 7 There exist umerical costats C 1, C 2 > 0 such that C 1 ɛk ˇɛK C 2 ɛk, 44 with probability at least 1 N A. Suppose ow that {K 1,..., K N } is a dictioary of kerels. Recall that ɛ j = ɛk j, ˆɛ j = ˆɛK j ad ˇɛ j = ˇɛK j. 23
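Of the quantities compared above, $\hat\epsilon(K)$ is the one that can be computed from data, since it depends only on the eigenvalues of the Gram matrix. The following is a minimal sketch of such a computation, not the authors' implementation. It assumes the Gram matrix is normalized by $1/n$, and it replaces "for all $\delta \in (0,1]$" by a finite grid, so it only approximates the infimum. The function name `eps_hat`, the grid size and the Gaussian-kernel example are illustrative assumptions.

```python
import numpy as np

def eps_hat(gram, A, N, num_delta=200):
    """Approximate the data-driven quantity defined through Gram eigenvalues.

    gram : (n, n) matrix with entries K(X_i, X_j) (unnormalized)
    A, N : constant A >= 1 and dictionary size N from the definition
    The constraint over delta in (0, 1] is checked on a finite grid (assumption).
    """
    n = gram.shape[0]
    # Eigenvalues of the normalized Gram matrix; clip tiny negative round-off.
    lam = np.clip(np.linalg.eigvalsh(gram / n), 0.0, None)
    deltas = np.linspace(1.0 / num_delta, 1.0, num_delta)
    # g(delta) = ( n^{-1} * sum_k min(lam_k, delta^2) )^{1/2}
    g = np.sqrt(np.minimum(lam[None, :], deltas[:, None] ** 2).sum(axis=1) / n)
    # Smallest eps >= 0 with eps*delta + eps^2 >= g(delta): root of a quadratic.
    eps_needed = 0.5 * (-deltas + np.sqrt(deltas ** 2 + 4.0 * g))
    floor = np.sqrt(A * np.log(N) / n)
    return max(floor, float(eps_needed.max()))

# Illustrative usage with a Gaussian kernel on random points (hypothetical data).
rng = np.random.default_rng(0)
x = rng.uniform(size=40)
gram = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)
eps = eps_hat(gram, A=4.0, N=10)
```

Because $\epsilon\delta + \epsilon^2$ is increasing in $\epsilon$, the constraint for each fixed $\delta$ holds exactly for $\epsilon$ above the positive root of $\epsilon^2 + \delta\epsilon - g(\delta) = 0$, which is why no search over $\epsilon$ is needed.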

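It follows from Theorems 4, 6, 7 and the union bound that, with probability at least $1 - 3N^{-A+1}$, for all $j = 1, \dots, N$,
$$\|h\|_{L_2(\Pi)} \le C\bigl(\|h\|_{L_2(\Pi_n)} + \bar\epsilon_j\|h\|_{H_j}\bigr), \quad \|h\|_{L_2(\Pi_n)} \le C\bigl(\|h\|_{L_2(\Pi)} + \bar\epsilon_j\|h\|_{H_j}\bigr), \quad h \in H_j, \tag{45}$$
$$C_1\bar\epsilon_j \le \hat\epsilon_j \le C_2\bar\epsilon_j \quad\text{and}\quad C_1\bar\epsilon_j \le \check\epsilon_j \le C_2\bar\epsilon_j. \tag{46}$$
Note also that
$$3N^{-A+1} = \exp\{-(A-1)\log N + \log 3\} \le \exp\{-(A/2)\log N\} = N^{-A/2},$$
provided that $A \ge 4$ and $N \ge 3$. Thus, under these additional constraints, (45) and (46) hold for all $j = 1, \dots, N$ with probability at least $1 - N^{-A/2}$.

4 Proofs of the Oracle Inequalities

For an arbitrary set $J \subset \{1, \dots, N\}$ and $b \in [0, +\infty]$, denote
$$K_J^{(b)} := \Bigl\{(f_1, \dots, f_N) \in H^{(N)} : \sum_{j \notin J}\bar\epsilon_j\|f_j\|_{L_2(\Pi)} \le b\sum_{j \in J}\bar\epsilon_j\|f_j\|_{L_2(\Pi)}\Bigr\} \tag{47}$$
and let
$$\beta_b(J) = \inf\Bigl\{\beta \ge 0 : \sum_{j \in J}\bar\epsilon_j\|f_j\|_{L_2(\Pi)} \le \beta\|f_1 + \dots + f_N\|_{L_2(\Pi)},\ (f_1, \dots, f_N) \in K_J^{(b)}\Bigr\}. \tag{48}$$
It is easy to see that, for all nonempty sets $J$, $\beta_b(J) \ge \max_{j \in J}\bar\epsilon_j \ge \sqrt{A\log N/n}$.

Theorems 2 and 3 will be easily deduced from the following technical result.

Theorem 8. There exist numerical constants $C_1, C_2, B > 0$ and $b > 0$ such that, for all $\tau \ge BL$ in the definition of $\epsilon_j = \tau\hat\epsilon_j$, $j = 1, \dots, N$, and for all oracles $(f_1, \dots, f_N) \in D$,
$$\mathcal{E}(\ell \circ \hat f) + C_1\Bigl(\tau\sum_{j=1}^{N}\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + \tau^2\sum_{j=1}^{N}\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \tag{49}$$
$$\le 2\mathcal{E}(\ell \circ f) + C_2\tau^2\Bigl(\sum_{j \in J_f}\bar\epsilon_j^2\|f_j\|_{H_j} + \frac{\beta_b^2(J_f)}{m}\Bigr) \tag{50}$$
with probability at least $1 - 3N^{-A/2}$. Here $A \ge 4$ is a constant involved in the definitions of $\bar\epsilon_j, \hat\epsilon_j$, $j = 1, \dots, N$.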
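Theorem 8 concerns the minimizer of an empirical risk penalized by $\sum_j \bigl(\tau\hat\epsilon_j\|f_j\|_{L_2(\Pi_n)} + \tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr)$. As a rough illustration of how the two norms in that penalty could be evaluated, the sketch below assumes each component has the kernel expansion $f_j = \sum_i c_{ji}K_j(X_i, \cdot)$, so that $\|f_j\|_{H_j}^2 = c_j^\top G_j c_j$ and $\|f_j\|_{L_2(\Pi_n)}^2 = n^{-1}\|G_j c_j\|^2$ with $G_j$ the unnormalized Gram matrix. The expansion, the function name `penalty` and its arguments are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def penalty(grams, coefs, eps_hats, tau):
    """Evaluate sum_j ( tau*eps_j*||f_j||_{L2(Pi_n)} + tau^2*eps_j^2*||f_j||_{H_j} ).

    grams    : list of (n, n) Gram matrices G_j = (K_j(X_i, X_l))
    coefs    : list of length-n coefficient vectors c_j (assumed kernel expansion)
    eps_hats : list of data-driven values, one per kernel
    tau      : regularization constant
    """
    total = 0.0
    for G, c, e in zip(grams, coefs, eps_hats):
        n = G.shape[0]
        fitted = G @ c                           # values f_j(X_1), ..., f_j(X_n)
        l2_emp = np.sqrt(fitted @ fitted / n)    # empirical L2 norm
        rkhs = np.sqrt(max(c @ G @ c, 0.0))      # RKHS norm, guarded against round-off
        total += tau * e * l2_emp + tau ** 2 * e ** 2 * rkhs
    return total

# Tiny worked example with an identity Gram matrix (hypothetical numbers).
G = np.eye(2)
c = np.array([3.0, 4.0])
value = penalty([G], [c], [0.5], tau=2.0)
```

Both norms enter the penalty without squares, which is what makes it behave like a group-Lasso penalty over the $N$ components and allows whole components to vanish.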

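Proof. Recall that
$$(\hat f_1, \dots, \hat f_N) := \operatorname*{argmin}_{(f_1, \dots, f_N) \in D}\Bigl[P_n(\ell \circ (f_1 + \dots + f_N)) + \sum_{j=1}^{N}\bigl(\tau\hat\epsilon_j\|f_j\|_{L_2(\Pi_n)} + \tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr)\Bigr],$$
and that we write $f := f_1 + \dots + f_N$, $\hat f := \hat f_1 + \dots + \hat f_N$. Hence, for all $(f_1, \dots, f_N) \in D$,
$$P_n(\ell \circ \hat f) + \sum_{j=1}^{N}\bigl(\tau\hat\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)} + \tau^2\hat\epsilon_j^2\|\hat f_j\|_{H_j}\bigr) \le P_n(\ell \circ f) + \sum_{j=1}^{N}\bigl(\tau\hat\epsilon_j\|f_j\|_{L_2(\Pi_n)} + \tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr).$$
By a simple algebra,
$$\mathcal{E}(\ell \circ \hat f) + \sum_{j=1}^{N}\bigl(\tau\hat\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)} + \tau^2\hat\epsilon_j^2\|\hat f_j\|_{H_j}\bigr) \le \mathcal{E}(\ell \circ f) + \sum_{j \in J_f}\bigl(\tau\hat\epsilon_j\|f_j\|_{L_2(\Pi_n)} + \tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr) + |(P_n - P)(\ell \circ \hat f - \ell \circ f)|$$
and, by the triangle inequality,
$$\mathcal{E}(\ell \circ \hat f) + \sum_{j \notin J_f}\tau\hat\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)} + \sum_{j=1}^{N}\tau^2\hat\epsilon_j^2\|\hat f_j\|_{H_j} \le \mathcal{E}(\ell \circ f) + \sum_{j \in J_f}\bigl(\tau\hat\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi_n)} + \tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr) + |(P_n - P)(\ell \circ \hat f - \ell \circ f)|.$$
We now take advantage of (45) and (46) to replace the $\hat\epsilon_j$ by $\bar\epsilon_j$ and $\|\cdot\|_{L_2(\Pi_n)}$ by $\|\cdot\|_{L_2(\Pi)}$. Specifically, there exists a numerical constant $C > 1$ and an event $E$ of probability at least $1 - N^{-A/2}$ such that
$$\frac{1}{C} \le \min\Bigl\{\frac{\hat\epsilon_j}{\bar\epsilon_j} : j = 1, \dots, N\Bigr\} \le \max\Bigl\{\frac{\hat\epsilon_j}{\bar\epsilon_j} : j = 1, \dots, N\Bigr\} \le C \tag{51}$$
and, for all $j = 1, \dots, N$,
$$\frac{1}{C}\|\hat f_j\|_{L_2(\Pi)} - \bar\epsilon_j\|\hat f_j\|_{H_j} \le \|\hat f_j\|_{L_2(\Pi_n)} \le C\bigl(\|\hat f_j\|_{L_2(\Pi)} + \bar\epsilon_j\|\hat f_j\|_{H_j}\bigr). \tag{52}$$
Taking $\tau \ge C/(C-1)$, we have that, on the event $E$,
$$\mathcal{E}(\ell \circ \hat f) + \sum_{j \notin J_f}\tau\hat\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)} + \sum_{j=1}^{N}\tau^2\hat\epsilon_j^2\|\hat f_j\|_{H_j}$$
$$\ge \mathcal{E}(\ell \circ \hat f) + \frac{1}{C^2}\Bigl(\sum_{j \notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr)$$
$$\ge \mathcal{E}(\ell \circ \hat f) + \frac{1}{C^2}\Bigl(\sum_{j \notin J_f}\tau\bar\epsilon_j\Bigl(\frac{1}{C}\|\hat f_j\|_{L_2(\Pi)} - \bar\epsilon_j\|\hat f_j\|_{H_j}\Bigr) + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr)$$
$$\ge \mathcal{E}(\ell \circ \hat f) + \frac{1}{C^3}\Bigl(\sum_{j \notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr).$$
Similarly,
$$\mathcal{E}(\ell \circ f) + \sum_{j \in J_f}\bigl(\tau\hat\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi_n)} + \tau^2\hat\epsilon_j^2\|f_j\|_{H_j}\bigr)$$
$$\le \mathcal{E}(\ell \circ f) + C^2\sum_{j \in J_f}\bigl(\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi_n)} + \tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr)$$
$$\le \mathcal{E}(\ell \circ f) + C^3\sum_{j \in J_f}\tau\bar\epsilon_j\bigl(\|f_j - \hat f_j\|_{L_2(\Pi)} + \bar\epsilon_j\|f_j - \hat f_j\|_{H_j}\bigr) + C^2\sum_{j \in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j}$$
$$\le \mathcal{E}(\ell \circ f) + C^3\sum_{j \in J_f}\tau\bar\epsilon_j\bigl(\|f_j - \hat f_j\|_{L_2(\Pi)} + \bar\epsilon_j\|f_j\|_{H_j} + \bar\epsilon_j\|\hat f_j\|_{H_j}\bigr) + C^2\sum_{j \in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j}$$
$$\le \mathcal{E}(\ell \circ f) + 2C^3\sum_{j \in J_f}\bigl(\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi)} + \tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + C^3\tau\sum_{j \in J_f}\bar\epsilon_j^2\|\hat f_j\|_{H_j}.$$
Therefore, by taking $\tau$ large enough, namely $\tau \ge \frac{C}{C-1} \vee (2C^6)$, we can find numerical constants $0 < C_1 < 1 < C_2$ such that, on the event $E$,
$$\mathcal{E}(\ell \circ \hat f) + C_1\Bigl(\sum_{j \notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le \mathcal{E}(\ell \circ f) + C_2\sum_{j \in J_f}\bigl(\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi)} + \tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + |(P_n - P)(\ell \circ \hat f - \ell \circ f)|.$$
We now bound the empirical process $|(P_n - P)(\ell \circ \hat f - \ell \circ f)|$, where we use the following result that will be proved in the next section. Suppose that $f = \sum_{j=1}^{N}f_j$, $f_j \in H_j$, and $\|f\|_{L_\infty} \le R$ (we will need it with $R = R_D$). Denote
$$G(\Delta_-, \Delta_+, R) = \Bigl\{g : \sum_{j=1}^{N}\bar\epsilon_j\|g_j - f_j\|_{L_2(\Pi)} \le \Delta_-,\ \sum_{j=1}^{N}\bar\epsilon_j^2\|g_j - f_j\|_{H_j} \le \Delta_+,\ \Bigl\|\sum_{j=1}^{N}g_j\Bigr\|_{L_\infty} \le R\Bigr\}.$$

Lemma 9. There exists a numerical constant $C > 0$ such that, for an arbitrary $A \ge 1$ involved in the definition of $\bar\epsilon_j$, $j = 1, \dots, N$, with probability at least $1 - 2N^{-A/2}$, for all
$$\Delta_- \le e^N, \quad \Delta_+ \le e^N, \tag{53}$$
the following bound holds:
$$\sup_{g \in G(\Delta_-, \Delta_+, R_D)} |(P_n - P)(\ell \circ g - \ell \circ f)| \le CL\bigl(\Delta_- + \Delta_+ + e^{-N}\bigr). \tag{54}$$

Assuming that
$$\sum_{j=1}^{N}\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} \le e^N, \quad \sum_{j=1}^{N}\bar\epsilon_j^2\|\hat f_j - f_j\|_{H_j} \le e^N \tag{55}$$
and using the lemma, we get
$$\mathcal{E}(\ell \circ \hat f) + C_1\Bigl(\sum_{j \notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr)$$
$$\le \mathcal{E}(\ell \circ f) + C_2\sum_{j \in J_f}\bigl(\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi)} + \tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + C_3L\sum_{j=1}^{N}\bigl(\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + \bar\epsilon_j^2\|\hat f_j - f_j\|_{H_j}\bigr) + C_3Le^{-N}$$
$$\le \mathcal{E}(\ell \circ f) + C_2\sum_{j \in J_f}\bigl(\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi)} + \tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + C_3L\sum_{j=1}^{N}\bigl(\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + \bar\epsilon_j^2\|\hat f_j\|_{H_j} + \bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + C_3Le^{-N}$$
for some numerical constant $C_3 > 0$. By choosing a numerical constant $B$ properly, $\tau$ can be made large enough so that $2C_3L \le \tau C_1 \le \tau C_2$. Then, we have
$$\mathcal{E}(\ell \circ \hat f) + \frac{C_1}{2}\Bigl(\sum_{j \notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le \mathcal{E}(\ell \circ f) + 2C_2\sum_{j \in J_f}\bigl(\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi)} + \tau^2\bar\epsilon_j^2\|f_j\|_{H_j}\bigr) + (C_2/2)\tau e^{-N}, \tag{56}$$
which also implies
$$\mathcal{E}(\ell \circ \hat f) + \frac{C_1}{2}\Bigl(\sum_{j=1}^{N}\tau\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le \mathcal{E}(\ell \circ f) + \Bigl(2C_2 + \frac{C_1}{2}\Bigr)\sum_{j \in J_f}\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi)} + 2C_2\sum_{j \in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j} + (C_2/2)\tau e^{-N}. \tag{57}$$
We first consider the case when
$$4C_2\sum_{j \in J_f}\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi)} \ge \mathcal{E}(\ell \circ f) + 2C_2\sum_{j \in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j} + (C_2/2)\tau e^{-N}. \tag{58}$$
Then (56) implies that
$$\mathcal{E}(\ell \circ \hat f) + \frac{C_1}{2}\Bigl(\sum_{j \notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le 6C_2\sum_{j \in J_f}\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi)}, \tag{59}$$
which yields
$$\sum_{j \notin J_f}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} \le \frac{12C_2}{C_1}\sum_{j \in J_f}\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi)}. \tag{60}$$
Therefore, $(\hat f_1 - f_1, \dots, \hat f_N - f_N) \in K_{J_f}^{(b)}$ with $b := 12C_2/C_1$. Using the definition of $\beta_b(J_f)$, it follows from (57), (58) and the assumption $C_1 < 1 < C_2$ that
$$\mathcal{E}(\ell \circ \hat f) + \frac{C_1}{2}\Bigl(\sum_{j=1}^{N}\tau\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le \Bigl(6C_2 + \frac{C_1}{2}\Bigr)\tau\beta_b(J_f)\|f - \hat f\|_{L_2(\Pi)} \le 7C_2\tau\beta_b(J_f)\bigl(\|f - f_*\|_{L_2(\Pi)} + \|f_* - \hat f\|_{L_2(\Pi)}\bigr). \tag{61}$$
Recall that for losses of quadratic type
$$\mathcal{E}(\ell \circ f) \ge m\|f - f_*\|_{L_2(\Pi)}^2 \quad\text{and}\quad \mathcal{E}(\ell \circ \hat f) \ge m\|\hat f - f_*\|_{L_2(\Pi)}^2.$$
Then
$$\mathcal{E}(\ell \circ \hat f) + \frac{C_1}{2}\Bigl(\sum_{j=1}^{N}\tau\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le 7\tau C_2m^{-1/2}\beta_b(J_f)\bigl(\mathcal{E}^{1/2}(\ell \circ f) + \mathcal{E}^{1/2}(\ell \circ \hat f)\bigr).$$
Using the fact that $ab \le (a^2 + b^2)/2$, we get
$$7\tau C_2m^{-1/2}\beta_b(J_f)\mathcal{E}^{1/2}(\ell \circ f) \le \frac{49}{2}\tau^2C_2^2m^{-1}\beta_b^2(J_f) + \frac{1}{2}\mathcal{E}(\ell \circ f), \tag{62}$$
and
$$7\tau C_2m^{-1/2}\beta_b(J_f)\mathcal{E}^{1/2}(\ell \circ \hat f) \le \frac{49}{2}\tau^2C_2^2m^{-1}\beta_b^2(J_f) + \frac{1}{2}\mathcal{E}(\ell \circ \hat f). \tag{63}$$
Therefore,
$$\mathcal{E}(\ell \circ \hat f) + C_1\sum_{j=1}^{N}\tau\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + C_1\sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j} \le \mathcal{E}(\ell \circ f) + 100\tau^2C_2^2m^{-1}\beta_b^2(J_f). \tag{64}$$
We now consider the case when
$$4C_2\sum_{j \in J_f}\tau\bar\epsilon_j\|f_j - \hat f_j\|_{L_2(\Pi)} < \mathcal{E}(\ell \circ f) + 2C_2\sum_{j \in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j} + (C_2/2)\tau e^{-N}. \tag{65}$$
It is easy to derive from (57) that in this case
$$\mathcal{E}(\ell \circ \hat f) + \frac{C_1}{2}\Bigl(\sum_{j=1}^{N}\tau\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le \Bigl(\frac{3}{2} + \frac{C_1}{8C_2}\Bigr)\Bigl(\mathcal{E}(\ell \circ f) + 2C_2\sum_{j \in J_f}\tau^2\bar\epsilon_j^2\|f_j\|_{H_j} + (C_2/2)\tau e^{-N}\Bigr). \tag{66}$$
Since $\beta_b(J_f) \ge \sqrt{A\log N/n}$ (see the comment after the definition of $\beta_b(J_f)$), we have
$$\tau e^{-N} \le \tau^2\frac{A\log N}{n} \le \tau^2\beta_b^2(J_f),$$
where we also used the assumptions that $\log N \ge 2\log\log n$ and $A \ge 4$. Substituting this in (66) and then combining the resulting bound with (64) concludes the proof of (49) in the case when conditions (55) hold.

It remains to consider the case when (55) does not hold. The main idea is to show that in this case the right hand side of the oracle inequality is rather large while we can still control the left hand side, so the inequality becomes trivial. To this end, note that, by the definition of $\hat f$, for some numerical constant $c_1$,
$$P_n(\ell \circ \hat f) + \sum_{j=1}^{N}\bigl(\tau\hat\epsilon_j\|\hat f_j\|_{L_2(\Pi_n)} + \tau^2\hat\epsilon_j^2\|\hat f_j\|_{H_j}\bigr) \le n^{-1}\sum_{j=1}^{n}\ell(Y_j; 0) \le c_1,$$
since the value of the penalized empirical risk at $\hat f$ is not larger than its value at $f = 0$ and, by the assumptions on the loss, $\ell(y, 0)$ is uniformly bounded by a numerical constant. The last equation implies that, on the event $E$ defined earlier in the proof (see (51), (52)), the following bound holds:
$$\sum_{j=1}^{N}\frac{\tau}{C}\bar\epsilon_j\Bigl(\frac{1}{C}\|\hat f_j\|_{L_2(\Pi)} - \bar\epsilon_j\|\hat f_j\|_{H_j}\Bigr) + \sum_{j=1}^{N}\frac{\tau^2}{C^2}\bar\epsilon_j^2\|\hat f_j\|_{H_j} \le c_1.$$
Equivalently,
$$\frac{\tau}{C^2}\sum_{j=1}^{N}\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \Bigl(\frac{\tau^2}{C^2} - \frac{\tau}{C}\Bigr)\sum_{j=1}^{N}\bar\epsilon_j^2\|\hat f_j\|_{H_j} \le c_1.$$
As soon as $\tau \ge 2C$, so that $\tau^2/C^2 - \tau/C \ge \tau^2/(2C^2)$, we have
$$\tau\sum_{j=1}^{N}\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \tau^2\sum_{j=1}^{N}\bar\epsilon_j^2\|\hat f_j\|_{H_j} \le 2c_1C^2. \tag{67}$$
Note also that, by the assumptions on the loss function,
$$\mathcal{E}(\ell \circ \hat f) \le P(\ell \circ \hat f) \le E\ell(Y; 0) + |P(\ell \circ \hat f) - P(\ell \circ 0)| \le c_1 + L\|\hat f\|_{L_2(\Pi)} \le c_1 + L\sum_{j=1}^{N}\|\hat f_j\|_{L_2(\Pi)} \le c_1 + 2c_1C^2\frac{L}{\tau}\sqrt{\frac{n}{A\log N}}, \tag{68}$$
where we used the Lipschitz condition on $\ell$, and also bound (67) and the fact that $\bar\epsilon_j \ge \sqrt{A\log N/n}$ by its definition.

Recall that we are considering the case when (55) does not hold. We will consider two cases: (a) when $e^N \le c_3$, where $c_3 \ge c_1$ is a numerical constant, and (b) when $e^N > c_3$. The first case is very simple since $N$ and $n$ are both upper bounded by a numerical constant (recall the assumption $\log N \ge 2\log\log n$). In this case, $\beta_b(J_f) \ge \sqrt{A\log N/n}$ is bounded from below by a numerical constant. As a consequence of these observations, bounds (67) and (68) imply that
$$\mathcal{E}(\ell \circ \hat f) + C_1\Bigl(\sum_{j=1}^{N}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le C_2\tau^2\beta_b^2(J_f)$$
for some numerical constant $C_2 > 0$. In the case (b), we have
$$\sum_{j=1}^{N}\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\bar\epsilon_j^2\|\hat f_j - f_j\|_{H_j} \ge e^N$$
and, in view of (67), this implies
$$\sum_{j=1}^{N}\bar\epsilon_j\|f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\bar\epsilon_j^2\|f_j\|_{H_j} \ge e^N - c_1/2 \ge e^N/2.$$
So, either we have
$$\sum_{j=1}^{N}\bar\epsilon_j^2\|f_j\|_{H_j} \ge e^N/4, \quad\text{or}\quad \sum_{j=1}^{N}\bar\epsilon_j\|f_j\|_{L_2(\Pi)} \ge e^N/4.$$
Moreover, in the second case, we also have
$$\sum_{j=1}^{N}\bar\epsilon_j^2\|f_j\|_{H_j} \ge \sqrt{\frac{A\log N}{n}}\sum_{j=1}^{N}\bar\epsilon_j\|f_j\|_{L_2(\Pi)} \ge \sqrt{\frac{A\log N}{n}}\,\frac{e^N}{4}.$$
In both cases we can conclude that, under the assumption that $\log N \ge 2\log\log n$ and $e^N > c_3$ for a sufficiently large numerical constant $c_3$,
$$\mathcal{E}(\ell \circ \hat f) + \sum_{j=1}^{N}\bigl(\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\bigr) \le c_1 + 2c_1C^2\frac{L}{\tau}\sqrt{\frac{n}{A\log N}} + 2c_1C^2 \le \frac{\tau^2}{4}e^N\sqrt{\frac{A\log N}{n}} \le \tau^2\sum_{j \in J_f}\bar\epsilon_j^2\|f_j\|_{H_j}.$$
Thus, in both cases (a) and (b), the following bound holds:
$$\mathcal{E}(\ell \circ \hat f) + C_1\Bigl(\sum_{j=1}^{N}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) \le C_2\tau^2\Bigl(\sum_{j \in J_f}\bar\epsilon_j^2\|f_j\|_{H_j} + \beta_b^2(J_f)\Bigr). \tag{69}$$
To complete the proof, observe that
$$\mathcal{E}(\ell \circ \hat f) + C_1\Bigl(\sum_{j=1}^{N}\tau\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr)$$
$$\le \mathcal{E}(\ell \circ \hat f) + C_1\Bigl(\sum_{j=1}^{N}\tau\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr) + C_1\tau\sum_{j \in J_f}\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)}$$
$$\le C_2\tau^2\Bigl(\sum_{j \in J_f}\bar\epsilon_j^2\|f_j\|_{H_j} + \beta_b^2(J_f)\Bigr) + C_2\tau\sum_{j \in J_f}\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)}. \tag{70}$$
Note also that, by the definition of $\beta_b(J_f)$, for all $b > 0$,
$$\tau\sum_{j \in J_f}\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} \le \tau\beta_b(J_f)\Bigl\|\sum_{j \in J_f}(\hat f_j - f_j)\Bigr\|_{L_2(\Pi)}$$
$$\le \tau\beta_b(J_f)\|\hat f - f\|_{L_2(\Pi)} + \tau\beta_b(J_f)\sqrt{\frac{n}{A\log N}}\sum_{j \notin J_f}\bar\epsilon_j\|\hat f_j\|_{L_2(\Pi)}$$
$$\le \tau\beta_b(J_f)\|\hat f - f\|_{L_2(\Pi)} + \tau\beta_b(J_f)\frac{2c_1C^2}{\tau}\sqrt{\frac{n}{A\log N}}, \tag{71}$$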

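where we used the fact that, for all $j$, $\bar\epsilon_j \ge \sqrt{A\log N/n}$ and also bound (67). By an argument similar to (61)-(64), it is easy to deduce from the last bound that
$$C_2\tau\sum_{j \in J_f}\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} \le \frac{3}{2}\,\frac{C_2^2\tau^2\beta_b^2(J_f)}{m} + \frac{1}{2}\mathcal{E}(\ell \circ \hat f) + \frac{1}{2}\mathcal{E}(\ell \circ f) + \frac{2c_1^2C^4}{\tau^2}\,\frac{n}{A\log N}. \tag{72}$$
Substituting this in bound (70), we get
$$\frac{1}{2}\mathcal{E}(\ell \circ \hat f) + C_1\Bigl(\sum_{j=1}^{N}\tau\bar\epsilon_j\|\hat f_j - f_j\|_{L_2(\Pi)} + \sum_{j=1}^{N}\tau^2\bar\epsilon_j^2\|\hat f_j\|_{H_j}\Bigr)$$
$$\le C_2\tau^2\Bigl(\sum_{j \in J_f}\bar\epsilon_j^2\|f_j\|_{H_j} + \beta_b^2(J_f)\Bigr) + \frac{3}{2}\,\frac{C_2^2\tau^2\beta_b^2(J_f)}{m} + \frac{1}{2}\mathcal{E}(\ell \circ f) + \frac{2c_1^2C^4}{\tau^2}\,\frac{n}{A\log N}$$
$$\le \frac{1}{2}\mathcal{E}(\ell \circ f) + C_2'\tau^2\Bigl(\sum_{j \in J_f}\bar\epsilon_j^2\|f_j\|_{H_j} + \frac{\beta_b^2(J_f)}{m}\Bigr) + \frac{2c_1^2C^4}{\tau^2}\,\frac{n}{A\log N}, \tag{73}$$
with some numerical constant $C_2'$. It is enough now to observe (considering again the cases (a) and (b), as it was done before) that either the last term is upper bounded by $\sum_{j \in J_f}\bar\epsilon_j^2\|f_j\|_{H_j}$, or it is upper bounded by $\beta_b^2(J_f)$, to complete the proof.

Now, to derive Theorem 2, it is enough to check that, for a numerical constant $c > 0$,
$$\beta_b(J_f) \le \Bigl(\sum_{j \in J_f}\bar\epsilon_j^2\Bigr)^{1/2}\beta_{2,\infty}(J_f) \le c\Bigl(\sum_{j \in J_f}\tilde\epsilon_j^2\Bigr)^{1/2}\beta_{2,\infty}(J_f),$$
which easily follows from the definitions of $\beta_b$ and $\beta_{2,\infty}$. Similarly, the proof of Theorem 3 follows from the fact that, under the assumption that $\Lambda^{-1} \le \bar\epsilon_j/\bar\epsilon \le \Lambda$, we have $K_J^{(b)} \subset K_J^{(2,b')}$, where $b' = c\Lambda^2b$, $c$ being a numerical constant. This easily implies the bound $\beta_b(J_f) \le c_1\Lambda\beta_{2,b'}(J_f)\sqrt{d(f)}\,\bar\epsilon$, where $c_1$ is a numerical constant.

5 Bounding the Empirical Process

We now proceed to prove Lemma 9, which was used to bound $|(P_n - P)(\ell \circ \hat f - \ell \circ f)|$. To this end, we begin with a fixed pair $\Delta_-, \Delta_+$. Throughout the proof, we write $R := R_D$. By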

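Talagrand's concentration inequality, with probability at least $1 - e^{-t}$,
$$\sup_{g \in G(\Delta_-, \Delta_+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)| \le 2\Bigl(E\Bigl[\sup_{g \in G(\Delta_-, \Delta_+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)|\Bigr] + \sup_{g}\|\ell \circ g - \ell \circ f\|_{L_2(P)}\sqrt{\frac{t}{n}} + \sup_{g}\|\ell \circ g - \ell \circ f\|_{L_\infty}\frac{t}{n}\Bigr).$$
Now note that
$$\|\ell \circ g - \ell \circ f\|_{L_2(P)} \le L\|g - f\|_{L_2(\Pi)} \le L\sum_{j=1}^{N}\|g_j - f_j\|_{L_2(\Pi)} \le L\bigl(\min_j\bar\epsilon_j\bigr)^{-1}\sum_{j=1}^{N}\bar\epsilon_j\|g_j - f_j\|_{L_2(\Pi)},$$
where we used the fact that the Lipschitz constant of the loss $\ell$ on the range of functions from $G(\Delta_-, \Delta_+, R)$ is bounded by $L$. Together with the fact that $\bar\epsilon_j \ge (A\log N/n)^{1/2}$ for all $j$, this yields
$$\|\ell \circ g - \ell \circ f\|_{L_2(P)} \le L\sqrt{\frac{n}{A\log N}}\,\Delta_-. \tag{74}$$
Furthermore,
$$\|\ell \circ g - \ell \circ f\|_{L_\infty} \le L\|g - f\|_{L_\infty} \le L\sum_{j=1}^{N}\|g_j - f_j\|_{H_j} \le L\frac{n}{A\log N}\,\Delta_+. \tag{75}$$
In summary, we have
$$\sup_{g \in G(\Delta_-, \Delta_+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)| \le 2\Bigl(E\Bigl[\sup_{g \in G(\Delta_-, \Delta_+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)|\Bigr] + L\Delta_-\sqrt{\frac{t}{A\log N}} + L\Delta_+\frac{t}{A\log N}\Bigr).$$
Now, by the symmetrization inequality,
$$E\Bigl[\sup_{g \in G(\Delta_-, \Delta_+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)|\Bigr] \le 2E\sup_{g \in G(\Delta_-, \Delta_+, R)} |R_n(\ell \circ g - \ell \circ f)|.$$
An application of the Rademacher contraction inequality further yields
$$E\Bigl[\sup_{g \in G(\Delta_-, \Delta_+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)|\Bigr] \le CL\,E\sup_{g \in G(\Delta_-, \Delta_+, R)} |R_n(g - f)|, \tag{76}$$
where $C > 0$ is a numerical constant (again, it was used here that the Lipschitz constant of the loss $\ell$ on the range of functions from $G(\Delta_-, \Delta_+, R)$ is bounded by $L$). Applying Talagrand's concentration inequality another time, we get that with probability at least $1 - e^{-t}$
$$E\sup_{g \in G(\Delta_-, \Delta_+, R)} |R_n(g - f)| \le C\Bigl(\sup_{g \in G(\Delta_-, \Delta_+, R)} |R_n(g - f)| + \Delta_-\sqrt{\frac{t}{A\log N}} + \Delta_+\frac{t}{A\log N}\Bigr)$$
for some numerical constant $C > 0$. Recalling the definition of $\check\epsilon_j := \check\epsilon(K_j)$, we get
$$|R_n(h_j)| \le \check\epsilon_j\|h_j\|_{L_2(\Pi)} + \check\epsilon_j^2\|h_j\|_{H_j}, \quad h_j \in H_j. \tag{77}$$
Hence, with probability at least $1 - 2e^{-t}$ and with some numerical constant $C > 0$,
$$\sup_{g \in G(\Delta_-, \Delta_+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)| \le CL\Bigl(\sup_{g \in G(\Delta_-, \Delta_+, R)} |R_n(g - f)| + \Delta_-\sqrt{\frac{t}{A\log N}} + \Delta_+\frac{t}{A\log N}\Bigr)$$
$$\le CL\Bigl(\sup_{g \in G(\Delta_-, \Delta_+, R)}\sum_{j=1}^{N}|R_n(g_j - f_j)| + \Delta_-\sqrt{\frac{t}{A\log N}} + \Delta_+\frac{t}{A\log N}\Bigr)$$
$$\le CL\Bigl(\sup_{g \in G(\Delta_-, \Delta_+, R)}\sum_{j=1}^{N}\bigl(\check\epsilon_j\|g_j - f_j\|_{L_2(\Pi)} + \check\epsilon_j^2\|g_j - f_j\|_{H_j}\bigr) + \Delta_-\sqrt{\frac{t}{A\log N}} + \Delta_+\frac{t}{A\log N}\Bigr).$$
Using (46), $\check\epsilon_j$ can be upper bounded by $c\bar\epsilon_j$ with some numerical constant $c > 0$ on an event $E$ of probability at least $1 - N^{-A/2}$. Therefore, the following bound is obtained:
$$\sup_{g \in G(\Delta_-, \Delta_+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)| \le CL\Bigl(\Delta_- + \Delta_+ + \Delta_-\sqrt{\frac{t}{A\log N}} + \Delta_+\frac{t}{A\log N}\Bigr).$$
It holds on the event $E \cap F(\Delta_-, \Delta_+, t)$, where $P(F(\Delta_-, \Delta_+, t)) \ge 1 - 2e^{-t}$. We will now choose $t = A\log N + 4\log N + 4\log(2/\log 2)$ and obtain a bound that holds uniformly over
$$e^{-N} \le \Delta_- \le e^N \quad\text{and}\quad e^{-N} \le \Delta_+ \le e^N. \tag{78}$$
To this end, consider
$$\Delta_j^- = \Delta_j^+ := 2^{-j}. \tag{79}$$
For any $\Delta_j^-$ and $\Delta_k^+$ satisfying (78), we have
$$\sup_{g \in G(\Delta_j^-, \Delta_k^+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)| \le CL\Bigl(\Delta_j^- + \Delta_k^+ + \Delta_j^-\sqrt{\frac{t}{A\log N}} + \Delta_k^+\frac{t}{A\log N}\Bigr)$$
on the event $E \cap F(\Delta_j^-, \Delta_k^+, t)$. Therefore, simultaneously for all $\Delta_j^-$ and $\Delta_k^+$ satisfying (78), we have
$$\sup_{g \in G(\Delta_j^-, \Delta_k^+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)| \le CL\Bigl(\Delta_j^- + \Delta_k^+ + \Delta_j^-\sqrt{\frac{A\log N + 4\log N + 4\log(2/\log 2)}{A\log N}} + \Delta_k^+\frac{A\log N + 4\log N + 4\log(2/\log 2)}{A\log N}\Bigr)$$
on the event $E' := E \cap \bigcap_{j,k}F(\Delta_j^-, \Delta_k^+, t)$. The last intersection is over all $j, k$ such that conditions (78) hold for $\Delta_j^-, \Delta_k^+$. The number of the events in this intersection is bounded by $(2/\log 2)^2N^2$. Therefore,
$$P(E') \ge 1 - (2/\log 2)^2N^2\exp\{-A\log N - 4\log N - 4\log(2/\log 2)\} - P(E^c) \ge 1 - 2N^{-A/2}. \tag{80}$$
Using monotonicity of the functions of $\Delta_-, \Delta_+$ involved in the inequalities, the bounds can be extended to the whole range of values of $\Delta_-, \Delta_+$ satisfying (78), so, with probability at least $1 - 2N^{-A/2}$, we have for all such $\Delta_-, \Delta_+$
$$\sup_{g \in G(\Delta_-, \Delta_+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)| \le CL(\Delta_- + \Delta_+). \tag{81}$$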

If $\Delta_- \le e^{-N}$, or $\Delta_+ \le e^{-N}$, it follows by monotonicity of the left hand side that with the same probability
$$\sup_{g \in G(\Delta_-, \Delta_+, R)} |(P_n - P)(\ell \circ g - \ell \circ f)| \le CL\bigl(\Delta_- + \Delta_+ + e^{-N}\bigr), \tag{82}$$
which completes the proof.

Acknowledgment. The authors are thankful to the referees for a number of helpful suggestions. The first author is thankful to Evarist Giné for useful conversations about the paper.

References

[1] Aronszajn, N. (1950), Theory of reproducing kernels, Trans. Am. Math. Soc., 68.
[2] Bach, F. (2008), Consistency of the group Lasso and multiple kernel learning, Journal of Machine Learning Research, 9.
[3] Bickel, P., Ritov, Y. and Tsybakov, A. (2009), Simultaneous analysis of Lasso and Dantzig selector, Annals of Statistics, 37, 4.
[4] Bousquet, O. and Herrmann, D. (2003), On the complexity of learning the kernel matrix, In: Advances in Neural Information Processing Systems 15.
[5] Blanchard, G., Bousquet, O. and Massart, P. (2008), Statistical performance of support vector machines, Annals of Statistics, 36.
[6] Bousquet, O. (2002), A Bennett concentration inequality and its applications to suprema of empirical processes, C.R. Acad. Sci. Paris, 334.
[7] Crammer, K., Keshet, J. and Singer, Y. (2003), Kernel design using boosting, In: Advances in Neural Information Processing Systems 15.
[8] Koltchinskii, V. (2008), Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, Lecture Notes for Ecole d'Eté de Probabilités de Saint-Flour.


The Annals of Statistics 2010, Vol. 38, No. 6, 3660-3695. DOI: 10.1214/10-AOS825. Institute of Mathematical Statistics, 2010.