Learning Bounds for Support Vector Machines with Learned Kernels


Nathan Srebro¹ and Shai Ben-David²

¹ University of Toronto, Department of Computer Science, Toronto ON, Canada
² University of Waterloo, School of Computer Science, Waterloo ON, Canada
nati@cs.toronto.edu, shai@cs.uwaterloo.ca

Abstract. Consider the problem of learning a kernel for use in SVM classification. We bound the estimation error of a large margin classifier when the kernel, relative to which this margin is defined, is chosen from a family of kernels based on the training sample. For a kernel family with pseudodimension $d_\phi$, we present a bound of $\sqrt{\tilde O(d_\phi + 1/\gamma^2)/n}$ on the estimation error for SVMs with margin $\gamma$. This is the first bound in which the relation between the margin term and the family-of-kernels term is additive rather than multiplicative. The pseudodimension of families of linear combinations of base kernels is the number of base kernels. Unlike in previous (multiplicative) bounds, there is no non-negativity requirement on the coefficients of the linear combinations. We also give simple bounds on the pseudodimension for families of Gaussian kernels.

1 Introduction

In support vector machines (SVMs), as well as other similar methods, prior knowledge is represented through a kernel function specifying the inner products between implicit representations of input points in some Hilbert space. A large margin linear classifier is then sought in this implicit Hilbert space. Using a good kernel function, appropriate for the problem, is crucial for successful learning: the kernel function essentially specifies the permitted hypothesis class, or at least which hypotheses are preferred.

In the standard SVM framework, one commits to a fixed kernel function a priori, and then searches for a large margin classifier with respect to this kernel. If it turns out that this fixed kernel is inappropriate for the data, it might be impossible to find a good large margin classifier. Instead, one can search for a data-appropriate kernel function, from some class of allowed kernels, permitting large margin classification. That is, one can search for both a kernel and a large margin classifier with respect to that kernel. In this paper we develop bounds for the sample complexity cost of allowing such kernel adaptation.

1.1 Learning the Kernel

As in standard hypothesis learning, the process of learning a kernel is guided by some family of potential kernels. A popular type of kernel family consists of kernels that are linear, or convex, combinations of several base kernels [1-3]³:

$$\mathcal K_{\mathrm{linear}}(K_1,\dots,K_k) \overset{\mathrm{def}}{=} \Big\{\, K_\lambda = \sum_{i=1}^k \lambda_i K_i \;\Big|\; K_\lambda \succeq 0 \text{ and } \sum_{i=1}^k \lambda_i = 1 \,\Big\} \qquad (1)$$

$$\mathcal K_{\mathrm{convex}}(K_1,\dots,K_k) \overset{\mathrm{def}}{=} \Big\{\, K_\lambda = \sum_{i=1}^k \lambda_i K_i \;\Big|\; \lambda_i \ge 0 \text{ and } \sum_{i=1}^k \lambda_i = 1 \,\Big\} \qquad (2)$$

Such kernel families are useful for integrating several sources of information, each encoded in a different kernel, and are especially popular in bioinformatics applications [4-6, and others]. Another common approach is learning (or "tuning") parameters of a parameterized kernel, such as the covariance matrix of a Gaussian kernel, based on training data [7-10, and others]. This amounts to learning a kernel from a parametric family, such as the family of Gaussian kernels:

$$\mathcal K^l_{\mathrm{Gaussian}} \overset{\mathrm{def}}{=} \Big\{\, K_A : (x_1,x_2) \mapsto e^{-(x_1-x_2)^\top A (x_1-x_2)} \;\Big|\; A \in \mathbb R^{l\times l},\ A \succeq 0 \,\Big\} \qquad (3)$$

Infinite-dimensional kernel families have also been considered, either through hyperkernels [11] or as convex combinations of a continuum of base kernels (e.g. convex combinations of Gaussian kernels) [12, 13]. In this paper we focus on finite-dimensional kernel families, such as those defined by equations (1)-(3).

Learning the kernel matrix allows for greater flexibility in matching the target function, but this of course comes at the cost of higher estimation error, i.e. a looser bound on the expected error of the learned classifier in terms of its empirical error. Bounding this estimation gap is essential for building theoretical support for kernel learning, and it is the focus of this paper.

1.2 Learning Bounds with Learned Kernels: Previous Work

For standard SVM learning, with a fixed kernel, one can show that, with high probability, the estimation error (the gap between the expected error and the empirical error) of a learned classifier with margin $\gamma$ is bounded by $\sqrt{\tilde O(1/\gamma^2)/n}$, where $n$ is the sample size and the $\tilde O(\cdot)$ notation hides logarithmic factors in its argument, the sample size and the allowed failure probability. That is, the number of samples needed for learning is $\tilde O(1/\gamma^2)$. Lanckriet et al. [1] showed that when a kernel is chosen from a convex combination of $k$ base kernels, the estimation error of the learned classifier is bounded by $\sqrt{\tilde O(k/\gamma^2)/n}$, where $\gamma$ is the margin of the learned classifier under the learned kernel. Note the multiplicative interaction between the margin complexity term $1/\gamma^2$ and the number of base kernels $k$. Recently, Micchelli et al. [14] derived bounds for the family of Gaussian kernels of equation (3).

³ Lanckriet et al. [1] impose a bound on the trace of the Gram matrix of $K_\lambda$; this is equivalent to bounding $\sum_i \lambda_i$ when the base kernels are normalized.
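To make the families (1)-(3) concrete, here is a minimal NumPy sketch (ours, not from the paper; all function names are hypothetical) that assembles the Gram matrix of a combined kernel $K_\lambda$ from base Gram matrices, and evaluates a Gaussian kernel $K_A$ with a learned p.s.d. matrix $A$:

```python
import numpy as np

def combined_gram(base_grams, lam, convex=False):
    """Gram matrix of K_lambda = sum_i lam_i K_i (equations 1 and 2).

    base_grams: list of k (n x n) base Gram matrices evaluated on a sample.
    lam: length-k coefficient vector with sum(lam) == 1.
    """
    lam = np.asarray(lam, dtype=float)
    assert np.isclose(lam.sum(), 1.0), "coefficients must sum to one"
    if convex:
        assert (lam >= 0).all(), "K_convex requires non-negative coefficients"
    K = sum(c * G for c, G in zip(lam, base_grams))
    # A convex combination of p.s.d. matrices is automatically p.s.d.;
    # for K_linear, with possibly negative coefficients, positive
    # semi-definiteness of K_lambda must be checked explicitly.
    if not convex:
        assert np.linalg.eigvalsh(K).min() >= -1e-9, "K_lambda must be p.s.d."
    return K

def gaussian_kernel(x1, x2, A):
    """K_A(x1, x2) = exp(-(x1 - x2)^T A (x1 - x2)), A p.s.d. (equation 3)."""
    d = x1 - x2
    return np.exp(-d @ A @ d)
```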

The dependence of these bounds on the margin and on the complexity of the kernel family is also multiplicative: the estimation error is bounded by $\sqrt{\tilde O(C_l/\gamma^2)/n}$, where $C_l$ is a constant that depends on the input dimensionality $l$.

The multiplicative interaction between the margin and the complexity measure of the kernel class is disappointing. It suggests that learning even a few kernel parameters (e.g. the coefficients $\lambda$) leads to a multiplicative increase in the required sample size. It is important to understand whether such a multiplicative increase in the number of training samples is in fact necessary.

Bousquet and Herrmann [2, Theorem 2] and Lanckriet et al. [1] also discuss bounds for families of convex and linear combinations of kernels that appear to be independent of the number of base kernels. However, we show in the Appendix that these bounds are meaningless: the bound on the expected error is never less than one. We are not aware of any previous work describing meaningful explicit bounds for the family of linear combinations of kernels given in equation (1).

1.3 New, Additive, Learning Bounds

In this paper, we bound the estimation error, when the kernel is chosen from a kernel family $\mathcal K$, by $\sqrt{\tilde O(d_\phi + 1/\gamma^2)/n}$, where $d_\phi$ is the pseudodimension of the family $\mathcal K$ (Theorem 2; the pseudodimension is defined in Definition 5). This establishes that the bound on the required sample size, $\tilde O(d_\phi + 1/\gamma^2)$, grows only additively with the dimensionality of the allowed kernel family (up to logarithmic factors). This is a much more reasonable price to pay for not committing to a single kernel a priori. The pseudodimension of most kernel families matches our intuitive notion of the dimensionality of the family; in particular:

- The pseudodimension of a family of linear, or convex, combinations of $k$ base kernels (equations 1, 2) is at most $k$ (Lemma 7).
- The pseudodimension of the family $\mathcal K^l_{\mathrm{Gaussian}}$ of Gaussian kernels (equation 3) for inputs $x \in \mathbb R^l$ is at most $l(l+1)/2$ (Lemma 9). If only diagonal covariances are allowed, the pseudodimension is at most $l$ (Lemma 10). If the covariances (and therefore $A$) are constrained to be of rank at most $k$, the pseudodimension is at most $kl\log_2(22kl)$ (Lemma 11).
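To see what the additive interaction buys, consider an illustrative instance (the numbers are ours, chosen only for scale): $k = 100$ base kernels and a margin with $1/\gamma^2 = 10^4$. The required sample sizes then compare as

$$\underbrace{\tilde O(k/\gamma^2)}_{\text{multiplicative [1]}} = \tilde O(10^6) \qquad\text{vs.}\qquad \underbrace{\tilde O(k + 1/\gamma^2)}_{\text{this paper}} = \tilde O\big(1.01\times 10^4\big),$$

so learning the kernel costs roughly $k$ extra samples rather than a factor of $k$ more samples.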

1.4 Plan of Attack

For a fixed kernel, it is well known that, with probability at least $1-\delta$, the estimation error of all margin-$\gamma$ classifiers is at most $\sqrt{O(1/\gamma^2 - \log\delta)/n}$ [15]. To obtain a bound that holds for all margin-$\gamma$ classifiers with respect to any kernel $K$ in some finite kernel family $\mathcal K$, consider a union bound over the $|\mathcal K|$ events "the estimation error is large for some margin-$\gamma$ classifier with respect to $K$", one for each $K \in \mathcal K$. Using the above bound with $\delta$ scaled by the cardinality $|\mathcal K|$, the union bound ensures that with probability at least $1-\delta$, the estimation error will be bounded by $\sqrt{O(\log|\mathcal K| + 1/\gamma^2 - \log\delta)/n}$ for all margin-$\gamma$ classifiers with respect to any kernel in the family.

In order to extend this type of result to infinite-cardinality families, we employ the standard notion of $\epsilon$-nets: roughly speaking, even though a continuous family $\mathcal K$ might be infinite, many kernels in it will be very similar and it will not matter which one we use. Instead of taking a union bound over all kernels in $\mathcal K$, we take a union bound only over essentially different kernels. In Section 4 we use standard results to show that the number of essentially different kernels in a family grows exponentially only with the dimensionality of the family, yielding an additive term (almost) proportional to that dimensionality.

As is standard in obtaining such bounds, our notion of "essentially different" refers to a specific sample, and so symmetrization arguments are required in order to make the above conceptual arguments concrete. To do so cleanly and cheaply, we use an $\epsilon$-net of kernels to construct an $\epsilon$-net of classifiers with respect to the kernels, noting that the size of the $\epsilon$-net increases only multiplicatively relative to the size of an $\epsilon$-net for any one kernel (Section 3). An important component of this construction is the observation that kernels that are close as real-valued functions also yield similar classes of classifiers (Lemma 2). Using our constructed $\epsilon$-net, we can apply standard results bounding the estimation error in terms of the log-size of $\epsilon$-nets, without needing to invoke symmetrization arguments directly.

For the sake of simplicity and conciseness of presentation, the results in this paper are stated for binary classification using a homogeneous large-margin classifier, i.e. not allowing a bias term, and refer to zero-one error. The results can easily be extended to other loss functions and to allow a bias term.

2 Preliminaries

Notation: We use $\|v\|$ to denote the norm of a vector in an abstract Hilbert space. For a vector $v \in \mathbb R^n$, $\|v\|$ is the Euclidean norm of $v$. For a matrix $A \in \mathbb R^{n\times n}$, $\|A\|_2 = \max_{\|v\|=1}\|Av\|$ is the $L_2$ operator norm of $A$, $\|A\|_\infty = \max_{ij}|A_{ij}|$ is the $\ell_\infty$ norm of $A$, and $A \succeq 0$ indicates that $A$ is positive semi-definite (p.s.d.) and symmetric. We use boldface $\mathbf x$ for samples (multisets, though we refer to them simply as sets) of points, where $|\mathbf x|$ is the number of points in the sample.

2.1 Support Vector Machines

Let $(x_1,y_1),\dots,(x_n,y_n)$ be a training set of pairs of input points $x_i \in \mathcal X$ and target labels $y_i \in \{\pm1\}$. Let $\phi : \mathcal X \to \mathcal H$ be a mapping of input points into a Hilbert space $\mathcal H$ with inner product $\langle\cdot,\cdot\rangle$. A vector $w \in \mathcal H$ can be used as a predictor for points in $\mathcal X$, predicting the label $\mathrm{sign}(\langle w, \phi(x)\rangle)$ for input $x$. Consider learning by seeking a unit-norm predictor $w$ achieving low empirical hinge loss $\hat h_\gamma(w) = \frac1n\sum_{i=1}^n \max(\gamma - y_i\langle w,\phi(x_i)\rangle,\, 0)$, relative to a margin $\gamma > 0$. The Representer Theorem [16, Theorem 4.2] guarantees that the predictor $w$ minimizing $\hat h_\gamma(w)$ can be written as $w = \sum_{i=1}^n \alpha_i \phi(x_i)$.
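As a concrete reading of $\hat h_\gamma(w)$, the following sketch (ours; the feature map is taken to be finite-dimensional so that $\phi(x_i)$ can be stored as a row of a matrix) evaluates the empirical hinge loss of a unit-norm predictor:

```python
import numpy as np

def empirical_hinge_loss(Phi, w, y, gamma):
    """h_gamma(w) = (1/n) * sum_i max(gamma - y_i <w, phi(x_i)>, 0).

    Phi: (n, dim) array whose i-th row is phi(x_i).
    w:   unit-norm predictor in the (here finite-dimensional) Hilbert space.
    """
    margins = y * (Phi @ w)                # y_i <w, phi(x_i)>
    return float(np.mean(np.maximum(gamma - margins, 0.0)))

# Toy usage: n = 4 mapped points in a 2-dimensional feature space.
Phi = np.array([[1.0, 0.2], [0.9, -0.1], [-1.1, 0.3], [-0.8, -0.4]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])                   # already unit norm
print(empirical_hinge_loss(Phi, w, y, gamma=0.95))   # 0.05
```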

For such $w$, the predictions $\langle w,\phi(x)\rangle = \sum_i \alpha_i \langle\phi(x_i),\phi(x)\rangle$ and the norm $\|w\|^2 = \sum_{ij}\alpha_i\alpha_j\langle\phi(x_i),\phi(x_j)\rangle$ depend only on inner products between mappings of input points. The Hilbert space $\mathcal H$ and mapping $\phi$ can therefore be represented implicitly by a kernel function $K : \mathcal X\times\mathcal X \to \mathbb R$ specifying these inner products: $K(x,x') = \langle\phi(x),\phi(x')\rangle$.

Definition 1. A function $K : \mathcal X\times\mathcal X \to \mathbb R$ is a kernel function if for some Hilbert space $\mathcal H$ and mapping $\phi : \mathcal X\to\mathcal H$, $K(x,x') = \langle\phi(x),\phi(x')\rangle$ for all $x, x'$.

For a set $\mathbf x = \{x_1,\dots,x_n\} \subset \mathcal X$ of $n$ points, it will be useful to consider the Gram matrix $K_{\mathbf x} \in \mathbb R^{n\times n}$, $K_{\mathbf x}[i,j] = K(x_i,x_j)$. A function $K : \mathcal X\times\mathcal X\to\mathbb R$ is a kernel function iff for any finite $\mathbf x \subset \mathcal X$ the Gram matrix $K_{\mathbf x}$ is p.s.d. [16].

When specifying the mapping $\phi$ implicitly through a kernel function, it is useful to think about a predictor as a function $f : \mathcal X\to\mathbb R$ instead of considering $w$ explicitly. Given a kernel $K$, learning can then be phrased as choosing a predictor from the class

$$\mathcal F_K \overset{\mathrm{def}}{=} \big\{\, x \mapsto \langle w,\phi(x)\rangle \;\big|\; \|w\|\le 1,\ K(x,x') = \langle\phi(x),\phi(x')\rangle \,\big\} \qquad (4)$$

minimizing

$$\hat h_\gamma(f) \overset{\mathrm{def}}{=} \frac1n \sum_{i=1}^n \max(\gamma - y_i f(x_i),\, 0). \qquad (5)$$

For a set of points $\mathbf x = \{x_1,\dots,x_n\}$, let $f(\mathbf x)\in\mathbb R^n$ be the vector whose entries are $f(x_i)$. The following restricted variant of the Representer Theorem characterizes the possible prediction vectors $f(\mathbf x)$ by suggesting the matrix square root of the Gram matrix ($K_{\mathbf x}^{1/2} \succeq 0$ such that $K_{\mathbf x} = K_{\mathbf x}^{1/2}K_{\mathbf x}^{1/2}$) as a possible feature mapping for the points in $\mathbf x$:

Lemma 1. For any kernel function $K$ and set $\mathbf x = \{x_1,\dots,x_n\}$ of $n$ points: $\{f(\mathbf x) \mid f\in\mathcal F_K\} = \{K_{\mathbf x}^{1/2}w \mid w\in\mathbb R^n,\ \|w\|\le 1\}$.

Proof. For any $f\in\mathcal F_K$ we can write $f(x) = \langle w,\phi(x)\rangle$ with $\|w\|\le 1$ (equation 4). Consider the projection $\tilde w = \sum_i\alpha_i\phi(x_i)$ of $w$ onto $\mathrm{span}(\phi(x_1),\dots,\phi(x_n))$. We have $f(x_i) = \langle w,\phi(x_i)\rangle = \langle\tilde w,\phi(x_i)\rangle = \sum_j \alpha_j K(x_j,x_i)$ and $1 \ge \|w\|^2 \ge \|\tilde w\|^2 = \sum_{ij}\alpha_i\alpha_j K(x_i,x_j)$. In matrix form: $f(\mathbf x) = K_{\mathbf x}\alpha$ with $\alpha^\top K_{\mathbf x}\alpha \le 1$. Setting $w' = K_{\mathbf x}^{1/2}\alpha$ we have $f(\mathbf x) = K_{\mathbf x}\alpha = K_{\mathbf x}^{1/2}K_{\mathbf x}^{1/2}\alpha = K_{\mathbf x}^{1/2}w'$ while $\|w'\|^2 = \alpha^\top K_{\mathbf x}^{1/2}K_{\mathbf x}^{1/2}\alpha = \alpha^\top K_{\mathbf x}\alpha \le 1$. This establishes that the left-hand side is a subset of the right-hand side.

For any $w\in\mathbb R^n$ with $\|w\|\le 1$ we would like to define $w' = \sum_i\alpha_i\phi(x_i)$ with $\alpha = K_{\mathbf x}^{-1/2}w$ and get $\langle w',\phi(x_i)\rangle = \sum_j\alpha_j\langle\phi(x_j),\phi(x_i)\rangle$, i.e. the prediction vector $K_{\mathbf x}\alpha = K_{\mathbf x}K_{\mathbf x}^{-1/2}w = K_{\mathbf x}^{1/2}w$. However, $K_{\mathbf x}$ might be singular. Instead, consider the singular value decomposition $K_{\mathbf x} = USU^\top$ with $U^\top U = I$, where zero singular values have been removed, i.e. $S$ is an all-positive diagonal matrix and $U$ might be rectangular. Set $\alpha = US^{-1/2}U^\top w$ and consider $w' = \sum_i\alpha_i\phi(x_i)$. We can now calculate

$$\big(\langle w',\phi(x_i)\rangle\big)_{i=1\dots n} = K_{\mathbf x}\alpha = USU^\top US^{-1/2}U^\top w = US^{1/2}U^\top w = K_{\mathbf x}^{1/2}w \qquad (6)$$

while $\|w'\|^2 = \alpha^\top K_{\mathbf x}\alpha = w^\top US^{-1/2}U^\top USU^\top US^{-1/2}U^\top w = w^\top UU^\top w \le \|w\|^2 \le 1$.
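Lemma 1 is easy to check numerically. The sketch below (ours) builds $K_{\mathbf x}^{1/2}$ from an eigendecomposition of a deliberately singular Gram matrix, generates a prediction vector $K_{\mathbf x}^{1/2}w$ with $\|w\| \le 1$, and verifies that it is of the form $K_{\mathbf x}\alpha$ with $\alpha^\top K_{\mathbf x}\alpha \le 1$, as in the first part of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately singular Gram matrix: K_x = P P^T has rank 3 < n = 5.
P = rng.standard_normal((5, 3))
K_x = P @ P.T

# Matrix square root via eigendecomposition (valid since K_x is p.s.d.).
evals, evecs = np.linalg.eigh(K_x)
evals = np.clip(evals, 0.0, None)          # clip tiny negatives from round-off
K_sqrt = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

w = rng.standard_normal(5)
w /= np.linalg.norm(w)                     # unit-norm w
f_x = K_sqrt @ w                           # a prediction vector, per Lemma 1

# lstsq returns the minimum-norm solution alpha of K_x alpha = f_x.
alpha, *_ = np.linalg.lstsq(K_x, f_x, rcond=None)
print(np.allclose(K_x @ alpha, f_x))       # True: f_x is in the range of K_x
print(alpha @ K_x @ alpha <= 1 + 1e-9)     # True: the norm constraint holds
```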

To remove confusion, we note some differences between the presentation here and other common, and equivalent, presentations of SVMs. Instead of fixing the margin $\gamma$ and minimizing the empirical hinge loss, it is common to try to maximize $\gamma$ while minimizing the loss. The most common combined objective, in our notation, is to minimize $\frac{1}{\gamma^2} + C\,\frac{1}{\gamma}\,\hat h_\gamma(w)$ for some trade-off parameter $C$. This is usually done with a change of variable to $\tilde w = w/\gamma$, which results in an equivalent problem where the margin is fixed to one and the norm of $\tilde w$ varies. Expressed in terms of $\tilde w$, the objective is $\|\tilde w\|^2 + C\,\hat h_1(\tilde w)$. Varying the trade-off parameter $C$ is equivalent to varying the margin and minimizing the loss. The variant of the Representer Theorem given in Lemma 1 applies to any predictor in $\mathcal F_K$, but only describes the behavior of the predictor on the set $\mathbf x$. This will be sufficient for our purposes.

2.2 Learning Bounds and Covering Numbers

We derive generalization error bounds in the standard agnostic learning setting. That is, we assume the data are generated by some unknown joint distribution $P(X,Y)$ over input points in $\mathcal X$ and labels in $\pm1$. The training set consists of $n$ i.i.d. samples $(x_i,y_i)$ from this joint distribution. We would like to bound the difference $\mathrm{est}_\gamma(f) = \mathrm{err}(f) - \widehat{\mathrm{err}}_\gamma(f)$ (the estimation error) between the expected error rate

$$\mathrm{err}(f) = \Pr_{X,Y}\big(Y f(X) \le 0\big), \qquad (7)$$

and the empirical margin error rate

$$\widehat{\mathrm{err}}_\gamma(f) = \big|\{i \mid y_i f(x_i) < \gamma\}\big| \,/\, n. \qquad (8)$$

The main challenge in deriving such bounds is bounding the estimation error uniformly over all predictors in a class. The technique we employ in this paper to obtain such uniform bounds is bounding the covering numbers of classes.

Definition 2. A subset $\tilde A \subseteq A$ is an $\epsilon$-net of $A$ under the metric $d$ if for any $a\in A$ there exists $\tilde a\in\tilde A$ with $d(a,\tilde a)\le\epsilon$. The covering number $N_d(A,\epsilon)$ is the size of the smallest $\epsilon$-net of $A$.

We will study coverings of classes of predictors under the sample-based $\ell_\infty$ metric, which depends on a sample $\mathbf x = \{x_1,\dots,x_n\}$:

$$d^{\mathbf x}_\infty(f_1,f_2) = \max_{i=1\dots n}\,|f_1(x_i) - f_2(x_i)| \qquad (9)$$

Definition 3. The uniform $\ell_\infty$ covering number $\mathcal N_n(\mathcal F,\epsilon)$ of a predictor class $\mathcal F$ is given by considering all possible samples $\mathbf x$ of size $n$:

$$\mathcal N_n(\mathcal F,\epsilon) = \sup_{|\mathbf x|=n} N_{d^{\mathbf x}_\infty}(\mathcal F,\epsilon)$$
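Definitions 2 and 3 can be made tangible with a tiny greedy construction (ours): given finitely many predictors evaluated on a fixed sample $\mathbf x$, keep every function that is not already covered within $\epsilon$ in the $d^{\mathbf x}_\infty$ metric. The greedy net is an $\epsilon$-net but not necessarily the smallest one, so its size upper-bounds $N_{d^{\mathbf x}_\infty}(\mathcal F,\epsilon)$:

```python
import numpy as np

def greedy_eps_net(values, eps):
    """Greedy eps-net under d_inf^x (equation 9).

    values: (m, n) array; row j holds (f_j(x_1), ..., f_j(x_n)).
    Returns indices of rows such that every row of `values` is within
    eps, in the max-norm, of some returned row.
    """
    net = []
    for j, v in enumerate(values):
        if not any(np.max(np.abs(v - values[i])) <= eps for i in net):
            net.append(j)
    return net

# Example: 1000 random predictors evaluated on a sample of n = 10 points.
rng = np.random.default_rng(1)
F = rng.uniform(-1.0, 1.0, size=(1000, 10))
print(len(greedy_eps_net(F, eps=0.5)))     # size of one particular 0.5-net
```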

The uniform $\ell_\infty$ covering number can be used to bound the estimation error uniformly: for a predictor class $\mathcal F$ and fixed $\gamma>0$, with probability at least $1-\delta$ over the choice of a training set of size $n$ [17, Theorem 10.1]:

$$\sup_{f\in\mathcal F}\mathrm{est}_\gamma(f) \;\le\; \sqrt{8\,\frac{\log 2\mathcal N_{2n}(\mathcal F,\gamma/2) - \log\delta}{n}} \qquad (10)$$

The uniform covering number of the class $\mathcal F_K$ (unit-norm predictors corresponding to a kernel function $K$; recall eq. (4)), with $K(x,x)\le B$ for all $x$, can be bounded by applying Theorems 14.21 and 12.8 of Anthony and Bartlett [17]:

$$\mathcal N_n(\mathcal F_K,\epsilon) \;\le\; 2\left(\frac{4Bn}{\epsilon^2}\right)^{\frac{16B}{\epsilon^2}\log_2\left(\frac{e\epsilon n}{4\sqrt B}\right)} \qquad (11)$$

yielding $\sup_{f\in\mathcal F_K}\mathrm{est}_\gamma(f) = \sqrt{\tilde O(B/\gamma^2)/n}$ and implying that $\tilde O(B/\gamma^2)$ training examples are enough to guarantee that the estimation error diminishes.
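Under this reconstruction of (10) and (11), the $\sqrt{\tilde O(B/\gamma^2)/n}$ claim follows by direct substitution (a worked step, ours): applying (11) with $2n$ points and $\epsilon = \gamma/2$ gives

$$\log \mathcal N_{2n}(\mathcal F_K,\gamma/2) \;\le\; \log 2 + \frac{64B}{\gamma^2}\log_2\!\Big(\frac{e\gamma n}{4\sqrt B}\Big)\,\log\Big(\frac{32Bn}{\gamma^2}\Big),$$

and plugging this into (10) yields $\sup_{f\in\mathcal F_K}\mathrm{est}_\gamma(f) \le \sqrt{\tilde O(B/\gamma^2)/n}$, with the log factors made explicit.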

2.3 Learning the Kernel

Instead of committing to a fixed kernel, we consider a family $\mathcal K \subseteq \{K : \mathcal X\times\mathcal X\to\mathbb R\}$ of allowed kernels and the corresponding predictor class:

$$\mathcal F_{\mathcal K} = \bigcup_{K\in\mathcal K} \mathcal F_K \qquad (12)$$

The learning problem is now one of minimizing $\hat h_\gamma(f)$ for $f\in\mathcal F_{\mathcal K}$. We are interested in bounding the estimation error uniformly over the class $\mathcal F_{\mathcal K}$, and will do so by bounding the covering numbers of the class. The bounds will depend on the dimensionality of $\mathcal K$, which we will define later, the margin $\gamma$, and a bound $B$ such that $K(x,x)\le B$ for all $K\in\mathcal K$ and all $x$. We will say that such a kernel family is bounded by $B$. Note that $\sqrt B$ is the radius of a ball (around the origin) containing $\phi(x)$ in the implied Hilbert space, and scaling $\phi$ scales both $\sqrt B$ and $\gamma$ linearly. Our bounds will therefore depend on the relative margin $\gamma/\sqrt B$.

3 Covering Numbers with Multiple Kernels

In this section, we show how to use bounds on the covering numbers of a family $\mathcal K$ of kernels to obtain bounds on the covering number of the class $\mathcal F_{\mathcal K}$ of predictors that are low-norm linear predictors under some kernel $K\in\mathcal K$. We show how to combine an $\epsilon$-net of $\mathcal K$ with $\epsilon$-nets for the classes $\mathcal F_K$ to obtain an $\epsilon$-net for the class $\mathcal F_{\mathcal K}$. In the next section, we will see how to bound the covering numbers of a kernel family $\mathcal K$, and will then be able to apply the main result of this section to get a bound on the covering number of $\mathcal F_{\mathcal K}$.

In order to state the main result of this section, we need to consider covering numbers of kernel families. We use the following sample-based metric between kernels. For a sample $\mathbf x = \{x_1,\dots,x_n\}$:

$$D^{\mathbf x}_\infty(K,\tilde K) \overset{\mathrm{def}}{=} \max_{i,j=1\dots n}\,\big|K(x_i,x_j) - \tilde K(x_i,x_j)\big| \;=\; \|K_{\mathbf x} - \tilde K_{\mathbf x}\|_\infty \qquad (13)$$

Definition 4. The uniform $\ell_\infty$ kernel covering number $\mathcal N^D_n(\mathcal K,\epsilon)$ of a kernel class $\mathcal K$ is given by considering all possible samples $\mathbf x$ of size $n$:

$$\mathcal N^D_n(\mathcal K,\epsilon) = \sup_{|\mathbf x|=n} N_{D^{\mathbf x}_\infty}(\mathcal K,\epsilon)$$

Theorem 1. For a family $\mathcal K$ of kernels bounded by $B$ and any $\epsilon<1$:

$$\mathcal N_n(\mathcal F_{\mathcal K},\epsilon) \;\le\; 2\,\mathcal N^D_n\!\Big(\mathcal K,\frac{\epsilon^2}{4n}\Big)\left(\frac{16Bn}{\epsilon^2}\right)^{\frac{64B}{\epsilon^2}\log_2\left(\frac{e\epsilon n}{8\sqrt B}\right)}$$

In order to prove Theorem 1, we first show how all the predictors of one kernel can be approximated by predictors of a nearby kernel. Roughly speaking, we do so by showing that the possible feature mapping $K_{\mathbf x}^{1/2}$ of Lemma 1 does not change too much:

Lemma 2. Let $K$, $\tilde K$ be two kernel functions. Then for any predictor $f\in\mathcal F_K$ there exists a predictor $\tilde f\in\mathcal F_{\tilde K}$ with $d^{\mathbf x}_\infty(f,\tilde f) \le \sqrt{n\,D^{\mathbf x}_\infty(K,\tilde K)}$.

Proof. Let $w\in\mathbb R^n$, $\|w\|\le 1$, be such that $f(\mathbf x) = K_{\mathbf x}^{1/2}w$, as guaranteed by Lemma 1. Consider the predictor $\tilde f\in\mathcal F_{\tilde K}$ such that $\tilde f(\mathbf x) = \tilde K_{\mathbf x}^{1/2}w$, guaranteed by the reverse direction of Lemma 1:

$$d^{\mathbf x}_\infty(f,\tilde f) = \max_i |f(x_i)-\tilde f(x_i)| \le \|f(\mathbf x)-\tilde f(\mathbf x)\| \qquad (14)$$
$$= \|K_{\mathbf x}^{1/2}w - \tilde K_{\mathbf x}^{1/2}w\| \le \|K_{\mathbf x}^{1/2} - \tilde K_{\mathbf x}^{1/2}\|_2\,\|w\| \le \|K_{\mathbf x} - \tilde K_{\mathbf x}\|_2^{1/2} \qquad (15)$$
$$\le \sqrt{n\,D^{\mathbf x}_\infty(K,\tilde K)} \qquad (16)$$

See, e.g., Theorem X.1.1 of Bhatia [18] for the third inequality in (15).

Proof of Theorem 1: Set $\epsilon_k = \frac{\epsilon^2}{4n}$ and $\epsilon_f = \epsilon/2$. Let $\tilde{\mathcal K}$ be an $\epsilon_k$-net of $\mathcal K$. For each $\tilde K\in\tilde{\mathcal K}$, let $\tilde{\mathcal F}_{\tilde K}$ be an $\epsilon_f$-net of $\mathcal F_{\tilde K}$. We will show that

$$\tilde{\mathcal F}_{\mathcal K} \overset{\mathrm{def}}{=} \bigcup_{\tilde K\in\tilde{\mathcal K}} \tilde{\mathcal F}_{\tilde K} \qquad (17)$$

is an $\epsilon$-net of $\mathcal F_{\mathcal K}$. For any $f\in\mathcal F_{\mathcal K}$ we have $f\in\mathcal F_K$ for some $K\in\mathcal K$. The kernel $K$ is covered by some $\tilde K\in\tilde{\mathcal K}$ with $D^{\mathbf x}_\infty(K,\tilde K)\le\epsilon_k$. Let $\tilde f\in\mathcal F_{\tilde K}$ be a predictor with $d^{\mathbf x}_\infty(f,\tilde f) \le \sqrt{n\,D^{\mathbf x}_\infty(K,\tilde K)} \le \sqrt{n\epsilon_k}$, guaranteed by Lemma 2, and $\hat f\in\tilde{\mathcal F}_{\tilde K}$ such that $d^{\mathbf x}_\infty(\tilde f,\hat f)\le\epsilon_f$. Then $\hat f\in\tilde{\mathcal F}_{\mathcal K}$ is a predictor with

$$d^{\mathbf x}_\infty(f,\hat f) \le d^{\mathbf x}_\infty(f,\tilde f) + d^{\mathbf x}_\infty(\tilde f,\hat f) \le \sqrt{n\epsilon_k} + \epsilon_f = \epsilon. \qquad (18)$$

This establishes that $\tilde{\mathcal F}_{\mathcal K}$ is indeed an $\epsilon$-net. Its size is bounded by

$$|\tilde{\mathcal F}_{\mathcal K}| \le |\tilde{\mathcal K}|\cdot\max_{\tilde K\in\tilde{\mathcal K}}|\tilde{\mathcal F}_{\tilde K}| \le \mathcal N^D_n\Big(\mathcal K,\frac{\epsilon^2}{4n}\Big)\cdot\max_{K\in\mathcal K}\mathcal N_n(\mathcal F_K,\epsilon/2). \qquad (19)$$

Substituting in (11) yields the desired bound.
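The chain (14)-(16) can be verified numerically. A sketch (ours), relying on the p.s.d. perturbation bound $\|K_{\mathbf x}^{1/2}-\tilde K_{\mathbf x}^{1/2}\|_2 \le \|K_{\mathbf x}-\tilde K_{\mathbf x}\|_2^{1/2}$ cited from Bhatia [18]:

```python
import numpy as np

def psd_sqrt(K):
    evals, evecs = np.linalg.eigh(K)
    return evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

rng = np.random.default_rng(2)
n = 8
M = rng.standard_normal((n, n)); K = M @ M.T                  # Gram matrix K_x
Q = rng.standard_normal((n, n)); Kt = K + 1e-3 * (Q @ Q.T)    # nearby kernel

w = rng.standard_normal(n)
w /= np.linalg.norm(w)
f, ft = psd_sqrt(K) @ w, psd_sqrt(Kt) @ w      # the two predictors of Lemma 2

d_inf = np.max(np.abs(f - ft))                 # d_inf^x(f, f~)
D_inf = np.max(np.abs(K - Kt))                 # D_inf^x(K, K~)
print(d_inf <= np.sqrt(n * D_inf))             # True, as Lemma 2 predicts
```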

4 Learning Bounds in terms of the Pseudodimension

We saw that if we can bound the covering numbers of a kernel family $\mathcal K$, we can use Theorem 1 to obtain a bound on the covering numbers of the class $\mathcal F_{\mathcal K}$ of predictors that are low-norm linear predictors under some kernel $K\in\mathcal K$. We can then use (10) to establish a learning bound. In this section, we show how to bound the covering numbers of a kernel family by its pseudodimension, and use this to state learning bounds in terms of this measure. To do so, we use well-known results bounding covering numbers in terms of the pseudodimension, paying a bit of attention to the subtleties of the differences between Definition 4 of uniform kernel covering numbers and the standard Definition 3 of uniform covering numbers.

To define the pseudodimension of a kernel family, we treat kernels as functions from pairs of points to the reals:

Definition 5. Let $\mathcal K = \{K : \mathcal X\times\mathcal X\to\mathbb R\}$ be a kernel family. The class $\mathcal K$ pseudo-shatters a set of $n$ pairs of points $(x_1,x'_1),\dots,(x_n,x'_n)$ if there exist thresholds $t_1,\dots,t_n\in\mathbb R$ such that for any $b_1,\dots,b_n\in\{\pm1\}$ there exists $K\in\mathcal K$ with $\mathrm{sign}(K(x_i,x'_i)-t_i) = b_i$. The pseudodimension $d_\phi(\mathcal K)$ is the largest $n$ such that there exists a set of $n$ pairs of points that is pseudo-shattered by $\mathcal K$.

The uniform $\ell_\infty$ covering numbers of a class $\mathcal G$ of real-valued functions taking values in $[-B,B]$ can be bounded in terms of its pseudodimension. Let $d_\phi$ be the pseudodimension of $\mathcal G$; then for any $n > d_\phi$ and $\epsilon>0$ [17, Theorem 12.2]:

$$\mathcal N_n(\mathcal G,\epsilon) \le \left(\frac{e\,n\,B}{\epsilon\,d_\phi}\right)^{d_\phi} \qquad (20)$$

We should be careful here, since the covering numbers $\mathcal N_n(\mathcal K,\epsilon)$ are defined relative to the metrics

$$d^{\bar{\mathbf x}}_\infty(K,\tilde K) = \max_{i=1\dots n}\,|K(x_i,x'_i) - \tilde K(x_i,x'_i)| \qquad (21)$$

defined for a sample $\bar{\mathbf x} \subset \mathcal X\times\mathcal X$ of $n$ pairs of points $(x_i,x'_i)$. The supremum in Definition 3 of $\mathcal N_n(\mathcal K,\epsilon)$ should then be taken over all samples of $n$ pairs of points. Compare with (13), where the kernels are evaluated over the $n^2$ pairs of points $(x_i,x_j)$ arising from a sample of $n$ points. However, for any sample of $n$ points $\mathbf x = \{x_1,\dots,x_n\} \subset \mathcal X$, we can always consider the $n^2$ point pairs $\mathbf x^2 = \{(x_i,x_j) \mid i,j = 1\dots n\}$ and observe that $D^{\mathbf x}_\infty(K,\tilde K) = d^{\mathbf x^2}_\infty(K,\tilde K)$, and so $N_{D^{\mathbf x}_\infty}(\mathcal K,\epsilon) = N_{d^{\mathbf x^2}_\infty}(\mathcal K,\epsilon)$. Although such sets of point pairs do not account for all sets of $n^2$ point pairs in the supremum of Definition 3, we can still conclude that for any $\mathcal K$, $n$, $\epsilon>0$:

$$\mathcal N^D_n(\mathcal K,\epsilon) \le \mathcal N_{n^2}(\mathcal K,\epsilon) \qquad (22)$$
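Definition 5 can be explored by brute force on a grid. The sketch below (ours) checks whether the single-parameter scale family $K_\lambda(x,x') = e^{-\lambda d(x,x')}$, analyzed in Section 5.3 below, realizes all sign patterns on a given set of point pairs. Consistent with Lemma 8 ($d_\phi \le 1$), one pair is shattered, while for two pairs no thresholds on the grid realize all four patterns (both exponentials are monotone in $\lambda$):

```python
import numpy as np
from itertools import product

def realizes_all_patterns(dists, thresholds, lambdas):
    """True if {e^(-lam*d) : lam in lambdas} realizes every sign pattern
    on the point pairs with distances `dists`, w.r.t. `thresholds`.
    A True answer certifies pseudo-shattering on the grid; a False
    answer is only evidence against it."""
    realized = {tuple(np.sign(np.exp(-lam * dists) - thresholds))
                for lam in lambdas}
    wanted = set(product((-1.0, 1.0), repeat=len(dists)))
    return wanted <= realized

lambdas = np.linspace(0.01, 10.0, 2000)

# One pair (distance 1): the threshold e^{-1} is crossed as lambda varies.
print(realizes_all_patterns(np.array([1.0]), np.array([np.exp(-1.0)]), lambdas))

# Two pairs: search a grid of threshold pairs; all four patterns are
# never realized, in line with a pseudodimension of one.
found = False
for t1 in np.linspace(0.05, 0.95, 19):
    for t2 in np.linspace(0.05, 0.95, 19):
        if realizes_all_patterns(np.array([1.0, 2.0]),
                                 np.array([t1, t2]), lambdas):
            found = True
print(found)                                # False on this grid
```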

Combining (22) and (20):

Lemma 3. For any kernel family $\mathcal K$ bounded by $B$ and with pseudodimension $d_\phi$:

$$\mathcal N^D_n(\mathcal K,\epsilon) \le \left(\frac{e\,n^2 B}{\epsilon\,d_\phi}\right)^{d_\phi}$$

Using Lemma 3 and relying on (10) and Theorem 1, we have:

Theorem 2. For any kernel family $\mathcal K$, bounded by $B$ and with pseudodimension $d_\phi$, and any fixed $\gamma>0$, with probability at least $1-\delta$ over the choice of a training set of size $n$:

$$\sup_{f\in\mathcal F_{\mathcal K}}\mathrm{est}_\gamma(f) \;\le\; \sqrt{\frac{8}{n}\left(2 + d_\phi\log\frac{128\,e^3 n^3 B}{\gamma^2 d_\phi} + \frac{256\,B}{\gamma^2}\log\frac{\gamma e n}{8\sqrt B}\,\log\frac{128\,n B}{\gamma^2} - \log\delta\right)}$$

Theorem 2 is stated for a fixed margin $\gamma$, but it can also be stated uniformly over all margins, at the price of an additional $\log\gamma$ term (e.g. [15]). Also, instead of bounding $K(x,x)$ for all $x$, it is enough to bound it only on average, i.e. to require $\mathbb E[K(X,X)]\le B$. This corresponds to bounding the trace of the Gram matrix, as was done by Lanckriet et al. In any case, we can set $B=1$ without loss of generality and scale the kernel and margin appropriately. The learning setting investigated here differs slightly from that of Lanckriet et al., who studied transduction, but learning bounds can easily be translated between the two settings.

5 The Pseudodimension of Common Kernel Families

In this section, we analyze the pseudodimension of several kernel families in common use. Most pseudodimension bounds we present follow easily from well-known properties of the pseudodimension of function families, which we review at the beginning of the section. The analyses in this section also serve as examples of how the pseudodimension of other kernel families can be bounded.

5.1 Preliminaries

We review some basic properties of the pseudodimension of a class of functions:

Fact 4. If $\mathcal G' \subseteq \mathcal G$ then $d_\phi(\mathcal G') \le d_\phi(\mathcal G)$.

Fact 5 ([17, Theorem 11.3]). Let $\mathcal G$ be a class of real-valued functions and $\sigma:\mathbb R\to\mathbb R$ a monotone function. Then $d_\phi(\{\sigma\circ g \mid g\in\mathcal G\}) \le d_\phi(\mathcal G)$.

Fact 6 ([17, Theorem 11.4]). The pseudodimension of a $k$-dimensional vector space of real-valued functions is $k$.

We will also use a classic result of Warren that is useful, among other things, for bounding the pseudodimension of classes involving low-rank matrices. We say that the real-valued functions $(g_1,g_2,\dots,g_m)$ realize a sign vector $b\in\{\pm1\}^m$ iff there exists an input $x$ for which $b_i = \mathrm{sign}\,g_i(x)$ for all $i$. The number of sign vectors realizable by $m$ polynomials of degree at most $d$ over $\mathbb R^n$, where $m\ge n$, is at most $(4edm/n)^n$ [19].
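Warren's bound can be probed empirically (our sketch): counting, by dense sampling, the sign vectors realized by $m=5$ random quadratics ($d=2$) over $\mathbb R$ ($n=1$ variable). Each quadratic changes sign at most twice, so only a handful of patterns appear, comfortably below $(4edm/n)^n = 40e \approx 109$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, n_vars = 5, 2, 1

coefs = rng.standard_normal((m, 3))     # g_i(x) = a_i x^2 + b_i x + c_i

xs = np.linspace(-20.0, 20.0, 200001)
G = (coefs[:, 0][:, None] * xs**2 + coefs[:, 1][:, None] * xs
     + coefs[:, 2][:, None])            # shape (m, number of sampled inputs)
patterns = {tuple(col) for col in np.sign(G).T.astype(int)}

warren = (4 * np.e * d * m / n_vars) ** n_vars
print(len(patterns), "realized sign vectors; Warren's bound:", warren)
```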

5.2 Combination of Base Kernels

Since families of linear or convex combinations of $k$ base kernels are subsets of $k$-dimensional vector spaces of functions, we can easily bound their pseudodimension by $k$. Note that the pseudodimension depends only on the number of base kernels, and not on the particular choice of base kernels.

Lemma 7. For any finite set of kernels $S = \{K_1,\dots,K_k\}$: $d_\phi(\mathcal K_{\mathrm{convex}}(S)) \le d_\phi(\mathcal K_{\mathrm{linear}}(S)) \le k$.

Proof. We have $\mathcal K_{\mathrm{convex}} \subseteq \mathcal K_{\mathrm{linear}} \subseteq \mathrm{span}\,S$, where $\mathrm{span}\,S = \{\sum_i \lambda_i K_i \mid \lambda_i\in\mathbb R\}$ is a vector space of dimensionality at most $k$. The bounds follow from Facts 4 and 6.

5.3 Gaussian Kernels with a Learned Covariance Matrix

Before considering the family $\mathcal K_{\mathrm{Gaussian}}$ of Gaussian kernels, let us consider a single-parameter family that generalizes tuning a single scale parameter (i.e. the variance) of a Gaussian kernel. For a function $d:\mathcal X\times\mathcal X\to\mathbb R_+$, consider the class

$$\mathcal K_{\mathrm{scale}}(d) \overset{\mathrm{def}}{=} \big\{\, K^d_\lambda : (x_1,x_2) \mapsto e^{-\lambda d(x_1,x_2)} \;\big|\; \lambda\in\mathbb R_+ \,\big\}. \qquad (23)$$

The family of spherical Gaussian kernels is obtained with $d(x_1,x_2) = \|x_1-x_2\|^2$.

Lemma 8. For any function $d$, $d_\phi(\mathcal K_{\mathrm{scale}}(d)) \le 1$.

Proof. The set $\{-\lambda d \mid \lambda\in\mathbb R_+\}$ of functions over $\mathcal X\times\mathcal X$ is a subset of a one-dimensional vector space and so has pseudodimension at most one. Composing these functions with the monotone exponentiation function and using Fact 5 yields the desired bound.

In order to analyze the pseudodimension of more general families of Gaussian kernels, we use the same technique of analyzing the functions in the exponent and then composing them with the exponentiation function. Recall the class $\mathcal K^l_{\mathrm{Gaussian}}$ of Gaussian kernels over $\mathbb R^l$ defined in (3).

Lemma 9. $d_\phi(\mathcal K^l_{\mathrm{Gaussian}}) \le l(l+1)/2$.

Proof. Consider the functions in the exponent:

$$\big\{(x_1,x_2)\mapsto -(x_1-x_2)^\top A(x_1-x_2) \mid A\in\mathbb R^{l\times l},\ A\succeq 0\big\} \subseteq \mathrm{span}\big\{(x_1,x_2)\mapsto (x_1-x_2)[i]\,(x_1-x_2)[j] \mid i\le j\le l\big\}$$

where $v[i]$ denotes the $i$-th coordinate of a vector in $\mathbb R^l$. This is a vector space of dimensionality $l(l+1)/2$, and the result follows by composition with the exponentiation function.
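The span argument in the proof of Lemma 9 can be checked directly (our sketch): the exponent $-(x_1-x_2)^\top A(x_1-x_2)$ is a fixed linear combination of the $l(l+1)/2$ monomials $(x_1-x_2)[i]\,(x_1-x_2)[j]$, $i \le j$:

```python
import numpy as np

rng = np.random.default_rng(4)
l = 4
M = rng.standard_normal((l, l))
A = M @ M.T                                  # a p.s.d. parameter matrix A

# Coefficients of -(x1-x2)^T A (x1-x2) in the basis delta[i]*delta[j], i <= j.
coef = {(i, j): -(2 * A[i, j] if i < j else A[i, i])
        for i in range(l) for j in range(i, l)}
print(len(coef))                             # l(l+1)/2 = 10 basis functions

delta = rng.standard_normal(l)               # delta = x1 - x2
lhs = -delta @ A @ delta
rhs = sum(c * delta[i] * delta[j] for (i, j), c in coef.items())
print(np.isclose(lhs, rhs))                  # True: the exponent lies in the span
```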

We next analyze the pseudodimension of the family of Gaussian kernels with a diagonal covariance matrix, i.e. when we apply an arbitrary scaling to the input coordinates:

$$\mathcal K^{(l\ \mathrm{diag})}_{\mathrm{Gaussian}} = \big\{\, K_\lambda : (x_1,x_2)\mapsto e^{-\|\lambda\odot(x_1-x_2)\|^2} \;\big|\; \lambda\in\mathbb R^l \,\big\} \qquad (24)$$

(where $\odot$ denotes the element-wise product).

Lemma 10. $d_\phi(\mathcal K^{(l\ \mathrm{diag})}_{\mathrm{Gaussian}}) \le l$.

Proof. We use the same arguments: the exponents are spanned by the $l$ functions $(x_1,x_2)\mapsto \big((x_1-x_2)[i]\big)^2$.

As a final example, we analyze the pseudodimension of the family of Gaussian kernels with a low-rank covariance matrix, corresponding to a low-rank $A$ in our notation:

$$\mathcal K^{l,k}_{\mathrm{Gaussian}} = \big\{\, (x_1,x_2)\mapsto e^{-(x_1-x_2)^\top A(x_1-x_2)} \;\big|\; A\in\mathbb R^{l\times l},\ A\succeq 0,\ \mathrm{rank}\,A\le k \,\big\}$$

This family corresponds to learning a dimensionality-reducing linear transformation of the inputs that is applied before calculating the Gaussian kernel.

Lemma 11. $d_\phi(\mathcal K^{l,k}_{\mathrm{Gaussian}}) \le kl\log_2(8ekl)$.

Proof. Any $A\succeq 0$ of rank at most $k$ can be written as $A = U^\top U$ with $U\in\mathbb R^{k\times l}$. Consider the set $G = \{(x,x')\mapsto (x-x')^\top U^\top U(x-x') \mid U\in\mathbb R^{k\times l}\}$ of functions in the exponent. Assume $G$ pseudo-shatters a set of $m$ point pairs $S = \{(x_1,x'_1),\dots,(x_m,x'_m)\}$. By the definition of pseudo-shattering, there exist $t_1,\dots,t_m\in\mathbb R$ so that for every $b\in\{\pm1\}^m$ there exists $U_b\in\mathbb R^{k\times l}$ with $b_i = \mathrm{sign}\big((x_i-x'_i)^\top U_b^\top U_b(x_i-x'_i) - t_i\big)$ for all $i\le m$. Viewing each $p_i(U) \overset{\mathrm{def}}{=} (x_i-x'_i)^\top U^\top U(x_i-x'_i) - t_i$ as a quadratic polynomial in the $kl$ entries of $U$, where $x_i-x'_i$ and $t_i$ determine the coefficients of $p_i$, we get a set of $m$ quadratic polynomials over $kl$ variables which realize all $2^m$ sign vectors. Applying Warren's bound [19] discussed above, we get $2^m \le (8em/(kl))^{kl}$, which implies $m \le kl\log_2(8ekl)$. This is a bound on the number of point pairs that can be pseudo-shattered by $G$, and hence on the pseudodimension of $G$, and by composition with exponentiation we get the desired bound.
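The dimensionality-reduction reading of $\mathcal K^{l,k}_{\mathrm{Gaussian}}$ noted above can be seen directly: with $A = U^\top U$, the exponent is the squared Euclidean distance after projecting by $U$ (our sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
l, k = 6, 2
U = rng.standard_normal((k, l))        # a k x l linear map, A = U^T U

def low_rank_gaussian(x1, x2, U):
    """K(x1, x2) = exp(-(x1-x2)^T U^T U (x1-x2)) = exp(-||U x1 - U x2||^2):
    a spherical Gaussian kernel applied after the linear map U."""
    z = U @ (x1 - x2)
    return np.exp(-z @ z)

x1, x2 = rng.standard_normal(l), rng.standard_normal(l)
A = U.T @ U
direct = np.exp(-(x1 - x2) @ A @ (x1 - x2))
print(np.isclose(direct, low_rank_gaussian(x1, x2, U)))   # True
```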

6 Conclusion and Discussion

Learning with a family of allowed kernel matrices has been a topic of significant interest and the focus of a considerable body of research in recent years, and several attempts have been made to establish learning bounds for this setting. In this paper we establish the first generalization error bounds for kernel-learning SVMs in which the margin complexity term and the dimensionality of the kernel family interact additively rather than multiplicatively (up to log factors). The additive interaction yields stronger bounds. We believe that the implied additive bounds on the sample complexity represent its correct behavior (up to log factors), although this remains to be proved.

The results we present significantly improve on previous results for convex combinations of base kernels, for which the only previously known bound had a multiplicative interaction [1], and for Gaussian kernels with a learned covariance matrix, for which only a bound with a multiplicative interaction and an unspecified dependence on the input dimensionality was previously shown [14]. We also provide the first explicit non-trivial bound for linear combinations of base kernels: a bound that depends only on the (relative) margin and the number of base kernels. The techniques we introduce for obtaining bounds based on the pseudodimension of the class of kernels should readily apply to the straightforward derivation of bounds for many other classes.

We note that previous attempts at establishing bounds for this setting [1, 2, 14] relied on bounding the Rademacher complexity [15] of the class $\mathcal F_{\mathcal K}$. However, generalization error bounds derived solely from the Rademacher complexity $R[\mathcal F_{\mathcal K}]$ of the class $\mathcal F_{\mathcal K}$ must have a multiplicative dependence on $\sqrt B/\gamma$: the Rademacher complexity $R[\mathcal F_{\mathcal K}]$ scales linearly with the scale $\sqrt B$ of the functions in $\mathcal F_{\mathcal K}$, and to obtain an estimation error bound it is multiplied by the Lipschitz constant $1/\gamma$ [15]. This might be avoidable by clipping predictors in $\mathcal F_{\mathcal K}$ to the range $[-\gamma,\gamma]$:

$$\mathcal F^\gamma_{\mathcal K} \overset{\mathrm{def}}{=} \{ f^{[\pm\gamma]} \mid f\in\mathcal F_{\mathcal K} \}, \qquad f^{[\pm\gamma]}(x) = \begin{cases} -\gamma & \text{if } f(x)\le-\gamma \\ f(x) & \text{if } -\gamma\le f(x)\le\gamma \\ \gamma & \text{if } \gamma\le f(x) \end{cases} \qquad (25)$$

When the Rademacher complexity $R[\mathcal F_{\mathcal K}]$ is used to obtain generalization error bounds in terms of the margin error, the class is implicitly clipped, and only the Rademacher complexity of $\mathcal F^\gamma_{\mathcal K}$ is actually relevant. This Rademacher complexity $R[\mathcal F^\gamma_{\mathcal K}]$ is bounded by $R[\mathcal F_{\mathcal K}]$. In our case, it seems that this last bound is loose. It is possible, though, that covering numbers of $\mathcal K$ can be used to bound $R[\mathcal F^\gamma_{\mathcal K}]$ by $O\big(\big(\gamma\sqrt{\log\mathcal N^D_{2n}(\mathcal K,\,4B/n^2)} + \sqrt B\big)/\sqrt n\big)$, yielding a generalization error bound with an additive interaction, and perhaps avoiding the log factors in the margin complexity term $\tilde O(B/\gamma^2)$ of Theorem 2.

References

1. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., El Ghaoui, L., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5 (2004)
2. Bousquet, O., Herrmann, D.J.L.: On the complexity of learning the kernel matrix. In: Advances in Neural Information Processing Systems 15 (2003)
3. Crammer, K., Keshet, J., Singer, Y.: Kernel design using boosting. In: Advances in Neural Information Processing Systems 15 (2003)
4. Lanckriet, G.R.G., De Bie, T., Cristianini, N., Jordan, M.I., Noble, W.S.: A statistical framework for genomic data fusion. Bioinformatics 20 (2004)
5. Sonnenburg, S., Rätsch, G., Schäfer, C.: Learning interpretable SVMs for biological sequence classification. In: Research in Computational Molecular Biology (2005)
6. Ben-Hur, A., Noble, W.S.: Kernel methods for predicting protein-protein interactions. Bioinformatics 21 (2005)
7. Cristianini, N., Campbell, C., Shawe-Taylor, J.: Dynamically adapting kernels in support vector machines. In: Advances in Neural Information Processing Systems 11 (1999)
8. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Machine Learning 46 (2002)

9. Keerthi, S.S.: Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Trans. on Neural Networks 13 (2002)
10. Glasmachers, T., Igel, C.: Gradient-based adaptation of general Gaussian kernels. Neural Comput. 17 (2005)
11. Ong, C.S., Smola, A.J., Williamson, R.C.: Learning the kernel with hyperkernels. J. Mach. Learn. Res. 6 (2005)
12. Micchelli, C.A., Pontil, M.: Learning the kernel function via regularization. J. Mach. Learn. Res. 6 (2005)
13. Argyriou, A., Micchelli, C.A., Pontil, M.: Learning convex combinations of continuously parameterized basic kernels. In: 18th Annual Conf. on Learning Theory (2005)
14. Micchelli, C.A., Pontil, M., Wu, Q., Zhou, D.X.: Error bounds for learning the kernel. Research Note RN/05/09, University College London Dept. of Computer Science (2005)
15. Koltchinskii, V., Panchenko, D.: Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 (2002)
16. Smola, A.J., Schölkopf, B.: Learning with Kernels. MIT Press (2002)
17. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press (1999)
18. Bhatia, R.: Matrix Analysis. Springer (1997)
19. Warren, H.E.: Lower bounds for approximation by nonlinear manifolds. Trans. Am. Math. Soc. 133 (1968)

A Analysis of Previous Bounds

We show that some of the previously suggested bounds for SVM kernel learning can never lead to meaningful bounds on the expected error. Lanckriet et al. [1, Theorem 24] show that for any class $\mathcal K$ and margin $\gamma$, with probability at least $1-\delta$, every $f\in\mathcal F_{\mathcal K}$ satisfies:

$$\mathrm{err}(f) \le \widehat{\mathrm{err}}_\gamma(f) + \frac{1}{\sqrt n}\left(4 + \sqrt{2\log(1/\delta)} + \frac{\sqrt{C(\mathcal K)}}{\gamma}\right) \qquad (26)$$

where $C(\mathcal K) = \mathbb E_\sigma[\max_{K\in\mathcal K}\sigma^\top K_{\mathbf x}\sigma]$, with $\sigma$ chosen uniformly from $\{\pm1\}^{2n}$ and $\mathbf x$ being the set of the $n$ training and $n$ test points. The bound is for a transductive setting, and the Gram matrix of both the training and the test data is considered. We continue to denote the empirical margin error, on the $n$ training points, by $\widehat{\mathrm{err}}_\gamma(f)$, but now $\mathrm{err}(f)$ is the test error on the specific $n$ test points. The expectation $C(\mathcal K)$ is not easy to compute in general, and Lanckriet et al. provide specific bounds for families of linear, and of convex, combinations of base kernels.

A.1 Bound for linear combinations of base kernels

For the family $\mathcal K = \mathcal K_{\mathrm{linear}}$ of linear combinations of base kernels (equation (1)), Lanckriet et al. note that $C(\mathcal K) \le c$, where $c = \max_{K\in\mathcal K}\mathrm{tr}\,K_{\mathbf x}$ is an upper bound on the trace of the possible Gram matrices.

Substituting this explicit bound on $C(\mathcal K)$ in (26) results in:

$$\mathrm{err}(f) \le \widehat{\mathrm{err}}_\gamma(f) + \frac{1}{\sqrt n}\left(4 + \sqrt{2\log(1/\delta)} + \frac{\sqrt c}{\gamma}\right) \qquad (27)$$

However, the following lemma shows that if a kernel allows classifying much of the training points within a large margin, then the trace of its Gram matrix cannot be too small:

Lemma 12. For all $f\in\mathcal F_K$: $\mathrm{tr}\,K_{\mathbf x} \ge \gamma^2\big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n$.

Proof. Let $f(x) = \langle w,\phi(x)\rangle$, $\|w\| = 1$. Then for any $i$ for which $y_i f(x_i) \ge \gamma$ we must have $\sqrt{K(x_i,x_i)} = \|\phi(x_i)\| \ge \gamma$. Hence $\mathrm{tr}\,K_{\mathbf x} \ge \sum_{i:\,y_i f(x_i)\ge\gamma} K(x_i,x_i) \ge |\{i \mid y_i f(x_i)\ge\gamma\}|\,\gamma^2 = \big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n\,\gamma^2$.

Using Lemma 12 we get that the right-hand side of (27) is at least

$$\widehat{\mathrm{err}}_\gamma(f) + \frac{4+\sqrt{2\log(1/\delta)}}{\sqrt n} + \frac{\sqrt{\gamma^2\big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n}}{\gamma\sqrt n} \;>\; \widehat{\mathrm{err}}_\gamma(f) + \sqrt{1-\widehat{\mathrm{err}}_\gamma(f)} \;\ge\; 1. \qquad (28)$$

A.2 Bound for convex combinations of base kernels

For the family $\mathcal K = \mathcal K_{\mathrm{convex}}$ of convex combinations of base kernels (equation (2)), Lanckriet et al. bound $C(\mathcal K) \le c\,\min\big(m,\ n\max_{K_i}\|(K_i)_{\mathbf x}\|_2/\mathrm{tr}((K_i)_{\mathbf x})\big)$, where $m$ is the number of base kernels, $c = \max_{K\in\mathcal K}\mathrm{tr}(K_{\mathbf x})$ as before, and the maximum is over the base kernels $K_i$. The first argument of the minimization yields a non-trivial generalization bound that is multiplicative in the number of base kernels, and is discussed in Section 1.2. The second argument yields the following bound, which was also obtained by Bousquet and Herrmann [2]:

$$\mathrm{err}(f) \le \widehat{\mathrm{err}}_\gamma(f) + \frac{1}{\sqrt n}\left(4 + \sqrt{2\log(1/\delta)} + \frac{\sqrt{c\,b\,n}}{\gamma}\right) \qquad (29)$$

where $b = \max_{K_i}\|(K_i)_{\mathbf x}\|_2/\mathrm{tr}((K_i)_{\mathbf x})$. This implies $\|K_{\mathbf x}\|_2 \le b\,\mathrm{tr}(K_{\mathbf x}) \le b\,c$ for all base kernels, and so (by convexity) also for all $K\in\mathcal K$. However, similarly to the bound on the trace of Gram matrices in Lemma 12, we can also bound the $L_2$ operator norm required for classification of most points with a margin:

Lemma 13. For all $f\in\mathcal F_K$: $\|K_{\mathbf x}\|_2 \ge \gamma^2\big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n$.

Proof. From Lemma 1 we have $f(\mathbf x) = K_{\mathbf x}^{1/2}w$ for some $w$ such that $\|w\|\le 1$, and so $\|K_{\mathbf x}\|_2 = \|K_{\mathbf x}^{1/2}\|_2^2 \ge \|K_{\mathbf x}^{1/2}w\|^2 = \|f(\mathbf x)\|^2$. To bound the right-hand side, consider that for $\big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n$ of the points in $\mathbf x$ we have $|f(x_i)| \ge y_i f(x_i) \ge \gamma$, and so $\|f(\mathbf x)\|^2 = \sum_i f(x_i)^2 \ge \big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n\,\gamma^2$.

Lemma 13 implies $b\,c \ge \gamma^2\big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n$, and a calculation similar to (28) reveals that the right-hand side of (29) is always greater than one.


b i u x i U a i j u x i u x j M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here

More information

Apply change-of-basis formula to rewrite x as a linear combination of eigenvectors v j.

Apply change-of-basis formula to rewrite x as a linear combination of eigenvectors v j. Eigevalue-Eigevector Istructor: Nam Su Wag eigemcd Ay vector i real Euclidea space of dimesio ca be uiquely epressed as a liear combiatio of liearly idepedet vectors (ie, basis) g j, j,,, α g α g α g α

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

Mixtures of Gaussians and the EM Algorithm

Mixtures of Gaussians and the EM Algorithm Mixtures of Gaussias ad the EM Algorithm CSE 6363 Machie Learig Vassilis Athitsos Computer Sciece ad Egieerig Departmet Uiversity of Texas at Arligto 1 Gaussias A popular way to estimate probability desity

More information

Lecture 9: Expanders Part 2, Extractors

Lecture 9: Expanders Part 2, Extractors Lecture 9: Expaders Part, Extractors Topics i Complexity Theory ad Pseudoradomess Sprig 013 Rutgers Uiversity Swastik Kopparty Scribes: Jaso Perry, Joh Kim I this lecture, we will discuss further the pseudoradomess

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

5.1 Review of Singular Value Decomposition (SVD)

5.1 Review of Singular Value Decomposition (SVD) MGMT 69000: Topics i High-dimesioal Data Aalysis Falll 06 Lecture 5: Spectral Clusterig: Overview (cotd) ad Aalysis Lecturer: Jiamig Xu Scribe: Adarsh Barik, Taotao He, September 3, 06 Outlie Review of

More information

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology Advaced Aalysis Mi Ya Departmet of Mathematics Hog Kog Uiversity of Sciece ad Techology September 3, 009 Cotets Limit ad Cotiuity 7 Limit of Sequece 8 Defiitio 8 Property 3 3 Ifiity ad Ifiitesimal 8 4

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

The random version of Dvoretzky s theorem in l n

The random version of Dvoretzky s theorem in l n The radom versio of Dvoretzky s theorem i l Gideo Schechtma Abstract We show that with high probability a sectio of the l ball of dimesio k cε log c > 0 a uiversal costat) is ε close to a multiple of the

More information

Principle Of Superposition

Principle Of Superposition ecture 5: PREIMINRY CONCEP O RUCUR NYI Priciple Of uperpositio Mathematically, the priciple of superpositio is stated as ( a ) G( a ) G( ) G a a or for a liear structural system, the respose at a give

More information

Polynomials with Rational Roots that Differ by a Non-zero Constant. Generalities

Polynomials with Rational Roots that Differ by a Non-zero Constant. Generalities Polyomials with Ratioal Roots that Differ by a No-zero Costat Philip Gibbs The problem of fidig two polyomials P(x) ad Q(x) of a give degree i a sigle variable x that have all ratioal roots ad differ by

More information

Study the bias (due to the nite dimensional approximation) and variance of the estimators

Study the bias (due to the nite dimensional approximation) and variance of the estimators 2 Series Methods 2. Geeral Approach A model has parameters (; ) where is ite-dimesioal ad is oparametric. (Sometimes, there is o :) We will focus o regressio. The fuctio is approximated by a series a ite

More information

Lecture 11: Decision Trees

Lecture 11: Decision Trees ECE9 Sprig 7 Statistical Learig Theory Istructor: R. Nowak Lecture : Decisio Trees Miimum Complexity Pealized Fuctio Recall the basic results of the last lectures: let X ad Y deote the iput ad output spaces

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

A REMARK ON A PROBLEM OF KLEE

A REMARK ON A PROBLEM OF KLEE C O L L O Q U I U M M A T H E M A T I C U M VOL. 71 1996 NO. 1 A REMARK ON A PROBLEM OF KLEE BY N. J. K A L T O N (COLUMBIA, MISSOURI) AND N. T. P E C K (URBANA, ILLINOIS) This paper treats a property

More information

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 7

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 7 Statistical Machie Learig II Sprig 2017, Learig Theory, Lecture 7 1 Itroductio Jea Hoorio jhoorio@purdue.edu So far we have see some techiques for provig geeralizatio for coutably fiite hypothesis classes

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information