Learning Bounds for Support Vector Machines with Learned Kernels


Nathan Srebro¹ and Shai Ben-David²

¹ University of Toronto, Department of Computer Science, Toronto ON, Canada
² University of Waterloo, School of Computer Science, Waterloo ON, Canada
nati@cs.toronto.edu, shai@cs.uwaterloo.ca

Abstract. Consider the problem of learning a kernel for use in SVM classification. We bound the estimation error of a large margin classifier when the kernel, relative to which this margin is defined, is chosen from a family of kernels based on the training sample. For a kernel family with pseudodimension $d_\phi$, we present a bound of $\sqrt{\tilde O(d_\phi + 1/\gamma^2)/n}$ on the estimation error for SVMs with margin $\gamma$. This is the first bound in which the relation between the margin term and the family-of-kernels term is additive rather than multiplicative. The pseudodimension of families of linear combinations of base kernels is the number of base kernels. Unlike in previous (multiplicative) bounds, there is no non-negativity requirement on the coefficients of the linear combinations. We also give simple bounds on the pseudodimension for families of Gaussian kernels.

1 Introduction

In support vector machines (SVMs), as well as other similar methods, prior knowledge is represented through a kernel function specifying the inner products between implicit representations of input points in some Hilbert space. A large margin linear classifier is then sought in this implicit Hilbert space. Using a good kernel function, appropriate for the problem, is crucial for successful learning: the kernel function essentially specifies the permitted hypothesis class, or at least which hypotheses are preferred.

In the standard SVM framework, one commits to a fixed kernel function a priori, and then searches for a large margin classifier with respect to this kernel. If it turns out that this fixed kernel is inappropriate for the data, it might be impossible to find a good large margin classifier. Instead, one can search for a data-appropriate kernel function, from some class of allowed kernels, permitting large margin classification. That is, one can search for both a kernel and a large margin classifier with respect to that kernel. In this paper we develop bounds for the sample complexity cost of allowing such kernel adaptation.

1.1 Learning the Kernel

As in standard hypothesis learning, the process of learning a kernel is guided by some family of potential kernels. A popular type of kernel family consists of kernels that are linear, or convex, combinations of several base kernels [1-3]³:

$$\mathcal K_{\mathrm{linear}}(K_1,\dots,K_k) \overset{\mathrm{def}}{=} \Big\{\, K_\lambda = \sum_{i=1}^k \lambda_i K_i \;\Big|\; K_\lambda \succeq 0 \text{ and } \sum_{i=1}^k \lambda_i = 1 \,\Big\} \qquad (1)$$

$$\mathcal K_{\mathrm{convex}}(K_1,\dots,K_k) \overset{\mathrm{def}}{=} \Big\{\, K_\lambda = \sum_{i=1}^k \lambda_i K_i \;\Big|\; \lambda_i \ge 0 \text{ and } \sum_{i=1}^k \lambda_i = 1 \,\Big\} \qquad (2)$$

Such kernel families are useful for integrating several sources of information, each encoded in a different kernel, and are especially popular in bioinformatics applications [4-6, and others]. Another common approach is learning (or "tuning") parameters of a parameterized kernel, such as the covariance matrix of a Gaussian kernel, based on training data [7-10, and others]. This amounts to learning a kernel from a parametric family, such as the family of Gaussian kernels:

$$\mathcal K^l_{\mathrm{Gaussian}} \overset{\mathrm{def}}{=} \Big\{\, K_A : (x_1,x_2) \mapsto e^{-(x_1-x_2)^\top A (x_1-x_2)} \;\Big|\; A \in \mathbb R^{l\times l},\ A \succeq 0 \,\Big\} \qquad (3)$$

Infinite-dimensional kernel families have also been considered, either through hyperkernels [11] or as convex combinations of a continuum of base kernels (e.g. convex combinations of Gaussian kernels) [12, 13]. In this paper we focus on finite-dimensional kernel families, such as those defined by equations (1)-(3).

Learning the kernel matrix allows for greater flexibility in matching the target function, but this of course comes at the cost of higher estimation error, i.e. a looser bound on the expected error of the learned classifier in terms of its empirical error. Bounding this estimation gap is essential for building theoretical support for kernel learning, and it is the focus of this paper.

1.2 Learning Bounds with Learned Kernels: Previous Work

For standard SVM learning, with a fixed kernel, one can show that, with high probability, the estimation error (the gap between the expected error and the empirical error) of a learned classifier with margin $\gamma$ is bounded by $\sqrt{\tilde O(1/\gamma^2)/n}$, where $n$ is the sample size and the $\tilde O(\cdot)$ notation hides logarithmic factors in its argument, the sample size and the allowed failure probability. That is, the number of samples needed for learning is $\tilde O(1/\gamma^2)$. Lanckriet et al. [1] showed that when a kernel is chosen from a convex combination of $k$ base kernels, the estimation error of the learned classifier is bounded by $\sqrt{\tilde O(k/\gamma^2)/n}$, where $\gamma$ is the margin of the learned classifier under the learned kernel. Note the multiplicative interaction between the margin complexity term $1/\gamma^2$ and the number of base kernels $k$. Recently, Micchelli et al. [14] derived bounds for the family of Gaussian kernels of equation (3).

³ Lanckriet et al. [1] impose a bound on the trace of the Gram matrix of $K_\lambda$; this is equivalent to bounding $\sum_i \lambda_i$ when the base kernels are normalized.
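To make the families (1)-(3) concrete, here is a minimal NumPy sketch (ours, not from the paper; all function names are hypothetical) that assembles the Gram matrix of a combined kernel $K_\lambda$ from base Gram matrices, and evaluates a Gaussian kernel $K_A$ with a learned p.s.d. matrix $A$:

```python
import numpy as np

def combined_gram(base_grams, lam, convex=False):
    """Gram matrix of K_lambda = sum_i lam_i K_i (equations 1 and 2).

    base_grams: list of k (n x n) base Gram matrices evaluated on a sample.
    lam: length-k coefficient vector with sum(lam) == 1.
    """
    lam = np.asarray(lam, dtype=float)
    assert np.isclose(lam.sum(), 1.0), "coefficients must sum to one"
    if convex:
        assert (lam >= 0).all(), "K_convex requires non-negative coefficients"
    K = sum(c * G for c, G in zip(lam, base_grams))
    # A convex combination of p.s.d. matrices is automatically p.s.d.;
    # for K_linear, with possibly negative coefficients, positive
    # semi-definiteness of K_lambda must be checked explicitly.
    if not convex:
        assert np.linalg.eigvalsh(K).min() >= -1e-9, "K_lambda must be p.s.d."
    return K

def gaussian_kernel(x1, x2, A):
    """K_A(x1, x2) = exp(-(x1 - x2)^T A (x1 - x2)), A p.s.d. (equation 3)."""
    d = x1 - x2
    return np.exp(-d @ A @ d)
```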

The dependence of these bounds on the margin and on the complexity of the kernel family is also multiplicative: the estimation error is bounded by $\sqrt{\tilde O(C_l/\gamma^2)/n}$, where $C_l$ is a constant that depends on the input dimensionality $l$.

The multiplicative interaction between the margin and the complexity measure of the kernel class is disappointing. It suggests that learning even a few kernel parameters (e.g. the coefficients $\lambda$) leads to a multiplicative increase in the required sample size. It is important to understand whether such a multiplicative increase in the number of training samples is in fact necessary.

Bousquet and Herrmann [2, Theorem 2] and Lanckriet et al. [1] also discuss bounds for families of convex and linear combinations of kernels that appear to be independent of the number of base kernels. However, we show in the Appendix that these bounds are meaningless: the bound on the expected error is never less than one. We are not aware of any previous work describing meaningful explicit bounds for the family of linear combinations of kernels given in equation (1).

1.3 New, Additive, Learning Bounds

In this paper, we bound the estimation error, when the kernel is chosen from a kernel family $\mathcal K$, by $\sqrt{\tilde O(d_\phi + 1/\gamma^2)/n}$, where $d_\phi$ is the pseudodimension of the family $\mathcal K$ (Theorem 2; the pseudodimension is defined in Definition 5). This establishes that the bound on the required sample size, $\tilde O(d_\phi + 1/\gamma^2)$, grows only additively with the dimensionality of the allowed kernel family (up to logarithmic factors). This is a much more reasonable price to pay for not committing to a single kernel a priori. The pseudodimension of most kernel families matches our intuitive notion of the dimensionality of the family; in particular:

- The pseudodimension of a family of linear, or convex, combinations of $k$ base kernels (equations 1, 2) is at most $k$ (Lemma 7).
- The pseudodimension of the family $\mathcal K^l_{\mathrm{Gaussian}}$ of Gaussian kernels (equation 3) for inputs $x \in \mathbb R^l$ is at most $l(l+1)/2$ (Lemma 9). If only diagonal covariances are allowed, the pseudodimension is at most $l$ (Lemma 10). If the covariances (and therefore $A$) are constrained to be of rank at most $k$, the pseudodimension is at most $kl\log_2(22kl)$ (Lemma 11).
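To see what the additive interaction buys, consider an illustrative instance (the numbers are ours, chosen only for scale): $k = 100$ base kernels and a margin with $1/\gamma^2 = 10^4$. The required sample sizes then compare as

$$\underbrace{\tilde O(k/\gamma^2)}_{\text{multiplicative [1]}} = \tilde O(10^6) \qquad\text{vs.}\qquad \underbrace{\tilde O(k + 1/\gamma^2)}_{\text{this paper}} = \tilde O\big(1.01\times 10^4\big),$$

so learning the kernel costs roughly $k$ extra samples rather than a factor of $k$ more samples.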

1.4 Plan of Attack

For a fixed kernel, it is well known that, with probability at least $1-\delta$, the estimation error of all margin-$\gamma$ classifiers is at most $\sqrt{O(1/\gamma^2 - \log\delta)/n}$ [15]. To obtain a bound that holds for all margin-$\gamma$ classifiers with respect to any kernel $K$ in some finite kernel family $\mathcal K$, consider a union bound over the $|\mathcal K|$ events "the estimation error is large for some margin-$\gamma$ classifier with respect to $K$", one for each $K \in \mathcal K$. Using the above bound with $\delta$ scaled by the cardinality $|\mathcal K|$, the union bound ensures that with probability at least $1-\delta$, the estimation error will be bounded by $\sqrt{O(\log|\mathcal K| + 1/\gamma^2 - \log\delta)/n}$ for all margin-$\gamma$ classifiers with respect to any kernel in the family.

In order to extend this type of result to infinite-cardinality families, we employ the standard notion of $\epsilon$-nets: roughly speaking, even though a continuous family $\mathcal K$ might be infinite, many kernels in it will be very similar and it will not matter which one we use. Instead of taking a union bound over all kernels in $\mathcal K$, we take a union bound only over essentially different kernels. In Section 4 we use standard results to show that the number of essentially different kernels in a family grows exponentially only with the dimensionality of the family, yielding an additive term (almost) proportional to that dimensionality.

As is standard in obtaining such bounds, our notion of "essentially different" refers to a specific sample, and so symmetrization arguments are required in order to make the above conceptual arguments concrete. To do so cleanly and cheaply, we use an $\epsilon$-net of kernels to construct an $\epsilon$-net of classifiers with respect to the kernels, noting that the size of the $\epsilon$-net increases only multiplicatively relative to the size of an $\epsilon$-net for any one kernel (Section 3). An important component of this construction is the observation that kernels that are close as real-valued functions also yield similar classes of classifiers (Lemma 2). Using our constructed $\epsilon$-net, we can apply standard results bounding the estimation error in terms of the log-size of $\epsilon$-nets, without needing to invoke symmetrization arguments directly.

For the sake of simplicity and conciseness of presentation, the results in this paper are stated for binary classification using a homogeneous large-margin classifier, i.e. not allowing a bias term, and refer to zero-one error. The results can easily be extended to other loss functions and to allow a bias term.

2 Preliminaries

Notation: We use $\|v\|$ to denote the norm of a vector in an abstract Hilbert space. For a vector $v \in \mathbb R^n$, $\|v\|$ is the Euclidean norm of $v$. For a matrix $A \in \mathbb R^{n\times n}$, $\|A\|_2 = \max_{\|v\|=1}\|Av\|$ is the $L_2$ operator norm of $A$, $\|A\|_\infty = \max_{ij}|A_{ij}|$ is the $\ell_\infty$ norm of $A$, and $A \succeq 0$ indicates that $A$ is positive semi-definite (p.s.d.) and symmetric. We use boldface $\mathbf x$ for samples (multisets, though we refer to them simply as sets) of points, where $|\mathbf x|$ is the number of points in the sample.

2.1 Support Vector Machines

Let $(x_1,y_1),\dots,(x_n,y_n)$ be a training set of pairs of input points $x_i \in \mathcal X$ and target labels $y_i \in \{\pm1\}$. Let $\phi : \mathcal X \to \mathcal H$ be a mapping of input points into a Hilbert space $\mathcal H$ with inner product $\langle\cdot,\cdot\rangle$. A vector $w \in \mathcal H$ can be used as a predictor for points in $\mathcal X$, predicting the label $\mathrm{sign}(\langle w, \phi(x)\rangle)$ for input $x$. Consider learning by seeking a unit-norm predictor $w$ achieving low empirical hinge loss $\hat h_\gamma(w) = \frac1n\sum_{i=1}^n \max(\gamma - y_i\langle w,\phi(x_i)\rangle,\, 0)$, relative to a margin $\gamma > 0$. The Representer Theorem [16, Theorem 4.2] guarantees that the predictor $w$ minimizing $\hat h_\gamma(w)$ can be written as $w = \sum_{i=1}^n \alpha_i \phi(x_i)$.
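As a concrete reading of $\hat h_\gamma(w)$, the following sketch (ours; the feature map is taken to be finite-dimensional so that $\phi(x_i)$ can be stored as a row of a matrix) evaluates the empirical hinge loss of a unit-norm predictor:

```python
import numpy as np

def empirical_hinge_loss(Phi, w, y, gamma):
    """h_gamma(w) = (1/n) * sum_i max(gamma - y_i <w, phi(x_i)>, 0).

    Phi: (n, dim) array whose i-th row is phi(x_i).
    w:   unit-norm predictor in the (here finite-dimensional) Hilbert space.
    """
    margins = y * (Phi @ w)                # y_i <w, phi(x_i)>
    return float(np.mean(np.maximum(gamma - margins, 0.0)))

# Toy usage: n = 4 mapped points in a 2-dimensional feature space.
Phi = np.array([[1.0, 0.2], [0.9, -0.1], [-1.1, 0.3], [-0.8, -0.4]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])                   # already unit norm
print(empirical_hinge_loss(Phi, w, y, gamma=0.95))   # 0.05
```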

For such $w$, the predictions $\langle w,\phi(x)\rangle = \sum_i \alpha_i \langle\phi(x_i),\phi(x)\rangle$ and the norm $\|w\|^2 = \sum_{ij}\alpha_i\alpha_j\langle\phi(x_i),\phi(x_j)\rangle$ depend only on inner products between mappings of input points. The Hilbert space $\mathcal H$ and mapping $\phi$ can therefore be represented implicitly by a kernel function $K : \mathcal X\times\mathcal X \to \mathbb R$ specifying these inner products: $K(x,x') = \langle\phi(x),\phi(x')\rangle$.

Definition 1. A function $K : \mathcal X\times\mathcal X \to \mathbb R$ is a kernel function if for some Hilbert space $\mathcal H$ and mapping $\phi : \mathcal X\to\mathcal H$, $K(x,x') = \langle\phi(x),\phi(x')\rangle$ for all $x, x'$.

For a set $\mathbf x = \{x_1,\dots,x_n\} \subset \mathcal X$ of $n$ points, it will be useful to consider the Gram matrix $K_{\mathbf x} \in \mathbb R^{n\times n}$, $K_{\mathbf x}[i,j] = K(x_i,x_j)$. A function $K : \mathcal X\times\mathcal X\to\mathbb R$ is a kernel function iff for any finite $\mathbf x \subset \mathcal X$ the Gram matrix $K_{\mathbf x}$ is p.s.d. [16].

When specifying the mapping $\phi$ implicitly through a kernel function, it is useful to think about a predictor as a function $f : \mathcal X\to\mathbb R$ instead of considering $w$ explicitly. Given a kernel $K$, learning can then be phrased as choosing a predictor from the class

$$\mathcal F_K \overset{\mathrm{def}}{=} \big\{\, x \mapsto \langle w,\phi(x)\rangle \;\big|\; \|w\|\le 1,\ K(x,x') = \langle\phi(x),\phi(x')\rangle \,\big\} \qquad (4)$$

minimizing

$$\hat h_\gamma(f) \overset{\mathrm{def}}{=} \frac1n \sum_{i=1}^n \max(\gamma - y_i f(x_i),\, 0). \qquad (5)$$

For a set of points $\mathbf x = \{x_1,\dots,x_n\}$, let $f(\mathbf x)\in\mathbb R^n$ be the vector whose entries are $f(x_i)$. The following restricted variant of the Representer Theorem characterizes the possible prediction vectors $f(\mathbf x)$ by suggesting the matrix square root of the Gram matrix ($K_{\mathbf x}^{1/2} \succeq 0$ such that $K_{\mathbf x} = K_{\mathbf x}^{1/2}K_{\mathbf x}^{1/2}$) as a possible feature mapping for the points in $\mathbf x$:

Lemma 1. For any kernel function $K$ and set $\mathbf x = \{x_1,\dots,x_n\}$ of $n$ points: $\{f(\mathbf x) \mid f\in\mathcal F_K\} = \{K_{\mathbf x}^{1/2}w \mid w\in\mathbb R^n,\ \|w\|\le 1\}$.

Proof. For any $f\in\mathcal F_K$ we can write $f(x) = \langle w,\phi(x)\rangle$ with $\|w\|\le 1$ (equation 4). Consider the projection $\tilde w = \sum_i\alpha_i\phi(x_i)$ of $w$ onto $\mathrm{span}(\phi(x_1),\dots,\phi(x_n))$. We have $f(x_i) = \langle w,\phi(x_i)\rangle = \langle\tilde w,\phi(x_i)\rangle = \sum_j \alpha_j K(x_j,x_i)$ and $1 \ge \|w\|^2 \ge \|\tilde w\|^2 = \sum_{ij}\alpha_i\alpha_j K(x_i,x_j)$. In matrix form: $f(\mathbf x) = K_{\mathbf x}\alpha$ with $\alpha^\top K_{\mathbf x}\alpha \le 1$. Setting $w' = K_{\mathbf x}^{1/2}\alpha$ we have $f(\mathbf x) = K_{\mathbf x}\alpha = K_{\mathbf x}^{1/2}K_{\mathbf x}^{1/2}\alpha = K_{\mathbf x}^{1/2}w'$ while $\|w'\|^2 = \alpha^\top K_{\mathbf x}^{1/2}K_{\mathbf x}^{1/2}\alpha = \alpha^\top K_{\mathbf x}\alpha \le 1$. This establishes that the left-hand side is a subset of the right-hand side.

For any $w\in\mathbb R^n$ with $\|w\|\le 1$ we would like to define $w' = \sum_i\alpha_i\phi(x_i)$ with $\alpha = K_{\mathbf x}^{-1/2}w$ and get $\langle w',\phi(x_i)\rangle = \sum_j\alpha_j\langle\phi(x_j),\phi(x_i)\rangle$, i.e. the prediction vector $K_{\mathbf x}\alpha = K_{\mathbf x}K_{\mathbf x}^{-1/2}w = K_{\mathbf x}^{1/2}w$. However, $K_{\mathbf x}$ might be singular. Instead, consider the singular value decomposition $K_{\mathbf x} = USU^\top$ with $U^\top U = I$, where zero singular values have been removed, i.e. $S$ is an all-positive diagonal matrix and $U$ might be rectangular. Set $\alpha = US^{-1/2}U^\top w$ and consider $w' = \sum_i\alpha_i\phi(x_i)$. We can now calculate

$$\big(\langle w',\phi(x_i)\rangle\big)_{i=1\dots n} = K_{\mathbf x}\alpha = USU^\top US^{-1/2}U^\top w = US^{1/2}U^\top w = K_{\mathbf x}^{1/2}w \qquad (6)$$

while $\|w'\|^2 = \alpha^\top K_{\mathbf x}\alpha = w^\top US^{-1/2}U^\top USU^\top US^{-1/2}U^\top w = w^\top UU^\top w \le \|w\|^2 \le 1$.
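Lemma 1 is easy to check numerically. The sketch below (ours) builds $K_{\mathbf x}^{1/2}$ from an eigendecomposition of a deliberately singular Gram matrix, generates a prediction vector $K_{\mathbf x}^{1/2}w$ with $\|w\| \le 1$, and verifies that it is of the form $K_{\mathbf x}\alpha$ with $\alpha^\top K_{\mathbf x}\alpha \le 1$, as in the first part of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately singular Gram matrix: K_x = P P^T has rank 3 < n = 5.
P = rng.standard_normal((5, 3))
K_x = P @ P.T

# Matrix square root via eigendecomposition (valid since K_x is p.s.d.).
evals, evecs = np.linalg.eigh(K_x)
evals = np.clip(evals, 0.0, None)          # clip tiny negatives from round-off
K_sqrt = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

w = rng.standard_normal(5)
w /= np.linalg.norm(w)                     # unit-norm w
f_x = K_sqrt @ w                           # a prediction vector, per Lemma 1

# lstsq returns the minimum-norm solution alpha of K_x alpha = f_x.
alpha, *_ = np.linalg.lstsq(K_x, f_x, rcond=None)
print(np.allclose(K_x @ alpha, f_x))       # True: f_x is in the range of K_x
print(alpha @ K_x @ alpha <= 1 + 1e-9)     # True: the norm constraint holds
```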

To remove confusion, we note some differences between the presentation here and other common, and equivalent, presentations of SVMs. Instead of fixing the margin $\gamma$ and minimizing the empirical hinge loss, it is common to try to maximize $\gamma$ while minimizing the loss. The most common combined objective, in our notation, is to minimize $\frac{1}{\gamma^2} + C\,\frac{1}{\gamma}\,\hat h_\gamma(w)$ for some trade-off parameter $C$. This is usually done with a change of variable to $\tilde w = w/\gamma$, which results in an equivalent problem where the margin is fixed to one and the norm of $\tilde w$ varies. Expressed in terms of $\tilde w$, the objective is $\|\tilde w\|^2 + C\,\hat h_1(\tilde w)$. Varying the trade-off parameter $C$ is equivalent to varying the margin and minimizing the loss. The variant of the Representer Theorem given in Lemma 1 applies to any predictor in $\mathcal F_K$, but only describes the behavior of the predictor on the set $\mathbf x$. This will be sufficient for our purposes.

2.2 Learning Bounds and Covering Numbers

We derive generalization error bounds in the standard agnostic learning setting. That is, we assume the data are generated by some unknown joint distribution $P(X,Y)$ over input points in $\mathcal X$ and labels in $\pm1$. The training set consists of $n$ i.i.d. samples $(x_i,y_i)$ from this joint distribution. We would like to bound the difference $\mathrm{est}_\gamma(f) = \mathrm{err}(f) - \widehat{\mathrm{err}}_\gamma(f)$ (the estimation error) between the expected error rate

$$\mathrm{err}(f) = \Pr_{X,Y}\big(Y f(X) \le 0\big), \qquad (7)$$

and the empirical margin error rate

$$\widehat{\mathrm{err}}_\gamma(f) = \big|\{i \mid y_i f(x_i) < \gamma\}\big| \,/\, n. \qquad (8)$$

The main challenge in deriving such bounds is bounding the estimation error uniformly over all predictors in a class. The technique we employ in this paper to obtain such uniform bounds is bounding the covering numbers of classes.

Definition 2. A subset $\tilde A \subseteq A$ is an $\epsilon$-net of $A$ under the metric $d$ if for any $a\in A$ there exists $\tilde a\in\tilde A$ with $d(a,\tilde a)\le\epsilon$. The covering number $N_d(A,\epsilon)$ is the size of the smallest $\epsilon$-net of $A$.

We will study coverings of classes of predictors under the sample-based $\ell_\infty$ metric, which depends on a sample $\mathbf x = \{x_1,\dots,x_n\}$:

$$d^{\mathbf x}_\infty(f_1,f_2) = \max_{i=1\dots n}\,|f_1(x_i) - f_2(x_i)| \qquad (9)$$

Definition 3. The uniform $\ell_\infty$ covering number $\mathcal N_n(\mathcal F,\epsilon)$ of a predictor class $\mathcal F$ is given by considering all possible samples $\mathbf x$ of size $n$:

$$\mathcal N_n(\mathcal F,\epsilon) = \sup_{|\mathbf x|=n} N_{d^{\mathbf x}_\infty}(\mathcal F,\epsilon)$$
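Definitions 2 and 3 can be made tangible with a tiny greedy construction (ours): given finitely many predictors evaluated on a fixed sample $\mathbf x$, keep every function that is not already covered within $\epsilon$ in the $d^{\mathbf x}_\infty$ metric. The greedy net is an $\epsilon$-net but not necessarily the smallest one, so its size upper-bounds $N_{d^{\mathbf x}_\infty}(\mathcal F,\epsilon)$:

```python
import numpy as np

def greedy_eps_net(values, eps):
    """Greedy eps-net under d_inf^x (equation 9).

    values: (m, n) array; row j holds (f_j(x_1), ..., f_j(x_n)).
    Returns indices of rows such that every row of `values` is within
    eps, in the max-norm, of some returned row.
    """
    net = []
    for j, v in enumerate(values):
        if not any(np.max(np.abs(v - values[i])) <= eps for i in net):
            net.append(j)
    return net

# Example: 1000 random predictors evaluated on a sample of n = 10 points.
rng = np.random.default_rng(1)
F = rng.uniform(-1.0, 1.0, size=(1000, 10))
print(len(greedy_eps_net(F, eps=0.5)))     # size of one particular 0.5-net
```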

The uniform $\ell_\infty$ covering number can be used to bound the estimation error uniformly: for a predictor class $\mathcal F$ and fixed $\gamma>0$, with probability at least $1-\delta$ over the choice of a training set of size $n$ [17, Theorem 10.1]:

$$\sup_{f\in\mathcal F}\mathrm{est}_\gamma(f) \;\le\; \sqrt{8\,\frac{\log 2\mathcal N_{2n}(\mathcal F,\gamma/2) - \log\delta}{n}} \qquad (10)$$

The uniform covering number of the class $\mathcal F_K$ (unit-norm predictors corresponding to a kernel function $K$; recall eq. (4)), with $K(x,x)\le B$ for all $x$, can be bounded by applying Theorems 14.21 and 12.8 of Anthony and Bartlett [17]:

$$\mathcal N_n(\mathcal F_K,\epsilon) \;\le\; 2\left(\frac{4Bn}{\epsilon^2}\right)^{\frac{16B}{\epsilon^2}\log_2\left(\frac{e\epsilon n}{4\sqrt B}\right)} \qquad (11)$$

yielding $\sup_{f\in\mathcal F_K}\mathrm{est}_\gamma(f) = \sqrt{\tilde O(B/\gamma^2)/n}$ and implying that $\tilde O(B/\gamma^2)$ training examples are enough to guarantee that the estimation error diminishes.
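Under this reconstruction of (10) and (11), the $\sqrt{\tilde O(B/\gamma^2)/n}$ claim follows by direct substitution (a worked step, ours): applying (11) with $2n$ points and $\epsilon = \gamma/2$ gives

$$\log \mathcal N_{2n}(\mathcal F_K,\gamma/2) \;\le\; \log 2 + \frac{64B}{\gamma^2}\log_2\!\Big(\frac{e\gamma n}{4\sqrt B}\Big)\,\log\Big(\frac{32Bn}{\gamma^2}\Big),$$

and plugging this into (10) yields $\sup_{f\in\mathcal F_K}\mathrm{est}_\gamma(f) \le \sqrt{\tilde O(B/\gamma^2)/n}$, with the log factors made explicit.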

2.3 Learning the Kernel

Instead of committing to a fixed kernel, we consider a family $\mathcal K \subseteq \{K : \mathcal X\times\mathcal X\to\mathbb R\}$ of allowed kernels and the corresponding predictor class:

$$\mathcal F_{\mathcal K} = \bigcup_{K\in\mathcal K} \mathcal F_K \qquad (12)$$

The learning problem is now one of minimizing $\hat h_\gamma(f)$ for $f\in\mathcal F_{\mathcal K}$. We are interested in bounding the estimation error uniformly over the class $\mathcal F_{\mathcal K}$, and will do so by bounding the covering numbers of the class. The bounds will depend on the dimensionality of $\mathcal K$, which we will define later, the margin $\gamma$, and a bound $B$ such that $K(x,x)\le B$ for all $K\in\mathcal K$ and all $x$. We will say that such a kernel family is bounded by $B$. Note that $\sqrt B$ is the radius of a ball (around the origin) containing $\phi(x)$ in the implied Hilbert space, and scaling $\phi$ scales both $\sqrt B$ and $\gamma$ linearly. Our bounds will therefore depend on the relative margin $\gamma/\sqrt B$.

3 Covering Numbers with Multiple Kernels

In this section, we show how to use bounds on the covering numbers of a family $\mathcal K$ of kernels to obtain bounds on the covering number of the class $\mathcal F_{\mathcal K}$ of predictors that are low-norm linear predictors under some kernel $K\in\mathcal K$. We show how to combine an $\epsilon$-net of $\mathcal K$ with $\epsilon$-nets for the classes $\mathcal F_K$ to obtain an $\epsilon$-net for the class $\mathcal F_{\mathcal K}$. In the next section, we will see how to bound the covering numbers of a kernel family $\mathcal K$, and will then be able to apply the main result of this section to get a bound on the covering number of $\mathcal F_{\mathcal K}$.

In order to state the main result of this section, we need to consider covering numbers of kernel families. We use the following sample-based metric between kernels. For a sample $\mathbf x = \{x_1,\dots,x_n\}$:

$$D^{\mathbf x}_\infty(K,\tilde K) \overset{\mathrm{def}}{=} \max_{i,j=1\dots n}\,\big|K(x_i,x_j) - \tilde K(x_i,x_j)\big| \;=\; \|K_{\mathbf x} - \tilde K_{\mathbf x}\|_\infty \qquad (13)$$

Definition 4. The uniform $\ell_\infty$ kernel covering number $\mathcal N^D_n(\mathcal K,\epsilon)$ of a kernel class $\mathcal K$ is given by considering all possible samples $\mathbf x$ of size $n$:

$$\mathcal N^D_n(\mathcal K,\epsilon) = \sup_{|\mathbf x|=n} N_{D^{\mathbf x}_\infty}(\mathcal K,\epsilon)$$

Theorem 1. For a family $\mathcal K$ of kernels bounded by $B$ and any $\epsilon<1$:

$$\mathcal N_n(\mathcal F_{\mathcal K},\epsilon) \;\le\; 2\,\mathcal N^D_n\!\Big(\mathcal K,\frac{\epsilon^2}{4n}\Big)\left(\frac{16Bn}{\epsilon^2}\right)^{\frac{64B}{\epsilon^2}\log_2\left(\frac{e\epsilon n}{8\sqrt B}\right)}$$

In order to prove Theorem 1, we first show how all the predictors of one kernel can be approximated by predictors of a nearby kernel. Roughly speaking, we do so by showing that the possible feature mapping $K_{\mathbf x}^{1/2}$ of Lemma 1 does not change too much:

Lemma 2. Let $K$, $\tilde K$ be two kernel functions. Then for any predictor $f\in\mathcal F_K$ there exists a predictor $\tilde f\in\mathcal F_{\tilde K}$ with $d^{\mathbf x}_\infty(f,\tilde f) \le \sqrt{n\,D^{\mathbf x}_\infty(K,\tilde K)}$.

Proof. Let $w\in\mathbb R^n$, $\|w\|\le 1$, be such that $f(\mathbf x) = K_{\mathbf x}^{1/2}w$, as guaranteed by Lemma 1. Consider the predictor $\tilde f\in\mathcal F_{\tilde K}$ such that $\tilde f(\mathbf x) = \tilde K_{\mathbf x}^{1/2}w$, guaranteed by the reverse direction of Lemma 1:

$$d^{\mathbf x}_\infty(f,\tilde f) = \max_i |f(x_i)-\tilde f(x_i)| \le \|f(\mathbf x)-\tilde f(\mathbf x)\| \qquad (14)$$
$$= \|K_{\mathbf x}^{1/2}w - \tilde K_{\mathbf x}^{1/2}w\| \le \|K_{\mathbf x}^{1/2} - \tilde K_{\mathbf x}^{1/2}\|_2\,\|w\| \le \|K_{\mathbf x} - \tilde K_{\mathbf x}\|_2^{1/2} \qquad (15)$$
$$\le \sqrt{n\,D^{\mathbf x}_\infty(K,\tilde K)} \qquad (16)$$

See, e.g., Theorem X.1.1 of Bhatia [18] for the third inequality in (15).

Proof of Theorem 1: Set $\epsilon_k = \frac{\epsilon^2}{4n}$ and $\epsilon_f = \epsilon/2$. Let $\tilde{\mathcal K}$ be an $\epsilon_k$-net of $\mathcal K$. For each $\tilde K\in\tilde{\mathcal K}$, let $\tilde{\mathcal F}_{\tilde K}$ be an $\epsilon_f$-net of $\mathcal F_{\tilde K}$. We will show that

$$\tilde{\mathcal F}_{\mathcal K} \overset{\mathrm{def}}{=} \bigcup_{\tilde K\in\tilde{\mathcal K}} \tilde{\mathcal F}_{\tilde K} \qquad (17)$$

is an $\epsilon$-net of $\mathcal F_{\mathcal K}$. For any $f\in\mathcal F_{\mathcal K}$ we have $f\in\mathcal F_K$ for some $K\in\mathcal K$. The kernel $K$ is covered by some $\tilde K\in\tilde{\mathcal K}$ with $D^{\mathbf x}_\infty(K,\tilde K)\le\epsilon_k$. Let $\tilde f\in\mathcal F_{\tilde K}$ be a predictor with $d^{\mathbf x}_\infty(f,\tilde f) \le \sqrt{n\,D^{\mathbf x}_\infty(K,\tilde K)} \le \sqrt{n\epsilon_k}$, guaranteed by Lemma 2, and $\hat f\in\tilde{\mathcal F}_{\tilde K}$ such that $d^{\mathbf x}_\infty(\tilde f,\hat f)\le\epsilon_f$. Then $\hat f\in\tilde{\mathcal F}_{\mathcal K}$ is a predictor with

$$d^{\mathbf x}_\infty(f,\hat f) \le d^{\mathbf x}_\infty(f,\tilde f) + d^{\mathbf x}_\infty(\tilde f,\hat f) \le \sqrt{n\epsilon_k} + \epsilon_f = \epsilon. \qquad (18)$$

This establishes that $\tilde{\mathcal F}_{\mathcal K}$ is indeed an $\epsilon$-net. Its size is bounded by

$$|\tilde{\mathcal F}_{\mathcal K}| \le |\tilde{\mathcal K}|\cdot\max_{\tilde K\in\tilde{\mathcal K}}|\tilde{\mathcal F}_{\tilde K}| \le \mathcal N^D_n\Big(\mathcal K,\frac{\epsilon^2}{4n}\Big)\cdot\max_{K\in\mathcal K}\mathcal N_n(\mathcal F_K,\epsilon/2). \qquad (19)$$

Substituting in (11) yields the desired bound.
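The chain (14)-(16) can be verified numerically. A sketch (ours), relying on the p.s.d. perturbation bound $\|K_{\mathbf x}^{1/2}-\tilde K_{\mathbf x}^{1/2}\|_2 \le \|K_{\mathbf x}-\tilde K_{\mathbf x}\|_2^{1/2}$ cited from Bhatia [18]:

```python
import numpy as np

def psd_sqrt(K):
    evals, evecs = np.linalg.eigh(K)
    return evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

rng = np.random.default_rng(2)
n = 8
M = rng.standard_normal((n, n)); K = M @ M.T                  # Gram matrix K_x
Q = rng.standard_normal((n, n)); Kt = K + 1e-3 * (Q @ Q.T)    # nearby kernel

w = rng.standard_normal(n)
w /= np.linalg.norm(w)
f, ft = psd_sqrt(K) @ w, psd_sqrt(Kt) @ w      # the two predictors of Lemma 2

d_inf = np.max(np.abs(f - ft))                 # d_inf^x(f, f~)
D_inf = np.max(np.abs(K - Kt))                 # D_inf^x(K, K~)
print(d_inf <= np.sqrt(n * D_inf))             # True, as Lemma 2 predicts
```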

4 Learning Bounds in terms of the Pseudodimension

We saw that if we can bound the covering numbers of a kernel family $\mathcal K$, we can use Theorem 1 to obtain a bound on the covering numbers of the class $\mathcal F_{\mathcal K}$ of predictors that are low-norm linear predictors under some kernel $K\in\mathcal K$. We can then use (10) to establish a learning bound. In this section, we show how to bound the covering numbers of a kernel family by its pseudodimension, and use this to state learning bounds in terms of this measure. To do so, we use well-known results bounding covering numbers in terms of the pseudodimension, paying a bit of attention to the subtleties of the differences between Definition 4 of uniform kernel covering numbers and the standard Definition 3 of uniform covering numbers.

To define the pseudodimension of a kernel family, we treat kernels as functions from pairs of points to the reals:

Definition 5. Let $\mathcal K = \{K : \mathcal X\times\mathcal X\to\mathbb R\}$ be a kernel family. The class $\mathcal K$ pseudo-shatters a set of $n$ pairs of points $(x_1,x'_1),\dots,(x_n,x'_n)$ if there exist thresholds $t_1,\dots,t_n\in\mathbb R$ such that for any $b_1,\dots,b_n\in\{\pm1\}$ there exists $K\in\mathcal K$ with $\mathrm{sign}(K(x_i,x'_i)-t_i) = b_i$. The pseudodimension $d_\phi(\mathcal K)$ is the largest $n$ such that there exists a set of $n$ pairs of points that is pseudo-shattered by $\mathcal K$.

The uniform $\ell_\infty$ covering numbers of a class $\mathcal G$ of real-valued functions taking values in $[-B,B]$ can be bounded in terms of its pseudodimension. Let $d_\phi$ be the pseudodimension of $\mathcal G$; then for any $n > d_\phi$ and $\epsilon>0$ [17, Theorem 12.2]:

$$\mathcal N_n(\mathcal G,\epsilon) \le \left(\frac{e\,n\,B}{\epsilon\,d_\phi}\right)^{d_\phi} \qquad (20)$$

We should be careful here, since the covering numbers $\mathcal N_n(\mathcal K,\epsilon)$ are defined relative to the metrics

$$d^{\bar{\mathbf x}}_\infty(K,\tilde K) = \max_{i=1\dots n}\,|K(x_i,x'_i) - \tilde K(x_i,x'_i)| \qquad (21)$$

defined for a sample $\bar{\mathbf x} \subset \mathcal X\times\mathcal X$ of $n$ pairs of points $(x_i,x'_i)$. The supremum in Definition 3 of $\mathcal N_n(\mathcal K,\epsilon)$ should then be taken over all samples of $n$ pairs of points. Compare with (13), where the kernels are evaluated over the $n^2$ pairs of points $(x_i,x_j)$ arising from a sample of $n$ points. However, for any sample of $n$ points $\mathbf x = \{x_1,\dots,x_n\} \subset \mathcal X$, we can always consider the $n^2$ point pairs $\mathbf x^2 = \{(x_i,x_j) \mid i,j = 1\dots n\}$ and observe that $D^{\mathbf x}_\infty(K,\tilde K) = d^{\mathbf x^2}_\infty(K,\tilde K)$, and so $N_{D^{\mathbf x}_\infty}(\mathcal K,\epsilon) = N_{d^{\mathbf x^2}_\infty}(\mathcal K,\epsilon)$. Although such sets of point pairs do not account for all sets of $n^2$ point pairs in the supremum of Definition 3, we can still conclude that for any $\mathcal K$, $n$, $\epsilon>0$:

$$\mathcal N^D_n(\mathcal K,\epsilon) \le \mathcal N_{n^2}(\mathcal K,\epsilon) \qquad (22)$$
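Definition 5 can be explored by brute force on a grid. The sketch below (ours) checks whether the single-parameter scale family $K_\lambda(x,x') = e^{-\lambda d(x,x')}$, analyzed in Section 5.3 below, realizes all sign patterns on a given set of point pairs. Consistent with Lemma 8 ($d_\phi \le 1$), one pair is shattered, while for two pairs no thresholds on the grid realize all four patterns (both exponentials are monotone in $\lambda$):

```python
import numpy as np
from itertools import product

def realizes_all_patterns(dists, thresholds, lambdas):
    """True if {e^(-lam*d) : lam in lambdas} realizes every sign pattern
    on the point pairs with distances `dists`, w.r.t. `thresholds`.
    A True answer certifies pseudo-shattering on the grid; a False
    answer is only evidence against it."""
    realized = {tuple(np.sign(np.exp(-lam * dists) - thresholds))
                for lam in lambdas}
    wanted = set(product((-1.0, 1.0), repeat=len(dists)))
    return wanted <= realized

lambdas = np.linspace(0.01, 10.0, 2000)

# One pair (distance 1): the threshold e^{-1} is crossed as lambda varies.
print(realizes_all_patterns(np.array([1.0]), np.array([np.exp(-1.0)]), lambdas))

# Two pairs: search a grid of threshold pairs; all four patterns are
# never realized, in line with a pseudodimension of one.
found = False
for t1 in np.linspace(0.05, 0.95, 19):
    for t2 in np.linspace(0.05, 0.95, 19):
        if realizes_all_patterns(np.array([1.0, 2.0]),
                                 np.array([t1, t2]), lambdas):
            found = True
print(found)                                # False on this grid
```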

Combining (22) and (20):

Lemma 3. For any kernel family $\mathcal K$ bounded by $B$ and with pseudodimension $d_\phi$:

$$\mathcal N^D_n(\mathcal K,\epsilon) \le \left(\frac{e\,n^2 B}{\epsilon\,d_\phi}\right)^{d_\phi}$$

Using Lemma 3 and relying on (10) and Theorem 1, we have:

Theorem 2. For any kernel family $\mathcal K$, bounded by $B$ and with pseudodimension $d_\phi$, and any fixed $\gamma>0$, with probability at least $1-\delta$ over the choice of a training set of size $n$:

$$\sup_{f\in\mathcal F_{\mathcal K}}\mathrm{est}_\gamma(f) \;\le\; \sqrt{\frac{8}{n}\left(2 + d_\phi\log\frac{128\,e^3 n^3 B}{\gamma^2 d_\phi} + \frac{256\,B}{\gamma^2}\log\frac{\gamma e n}{8\sqrt B}\,\log\frac{128\,n B}{\gamma^2} - \log\delta\right)}$$

Theorem 2 is stated for a fixed margin $\gamma$, but it can also be stated uniformly over all margins, at the price of an additional $\log\gamma$ term (e.g. [15]). Also, instead of bounding $K(x,x)$ for all $x$, it is enough to bound it only on average, i.e. to require $\mathbb E[K(X,X)]\le B$. This corresponds to bounding the trace of the Gram matrix, as was done by Lanckriet et al. In any case, we can set $B=1$ without loss of generality and scale the kernel and margin appropriately. The learning setting investigated here differs slightly from that of Lanckriet et al., who studied transduction, but learning bounds can easily be translated between the two settings.

5 The Pseudodimension of Common Kernel Families

In this section, we analyze the pseudodimension of several kernel families in common use. Most pseudodimension bounds we present follow easily from well-known properties of the pseudodimension of function families, which we review at the beginning of the section. The analyses in this section also serve as examples of how the pseudodimension of other kernel families can be bounded.

5.1 Preliminaries

We review some basic properties of the pseudodimension of a class of functions:

Fact 4. If $\mathcal G' \subseteq \mathcal G$ then $d_\phi(\mathcal G') \le d_\phi(\mathcal G)$.

Fact 5 ([17, Theorem 11.3]). Let $\mathcal G$ be a class of real-valued functions and $\sigma:\mathbb R\to\mathbb R$ a monotone function. Then $d_\phi(\{\sigma\circ g \mid g\in\mathcal G\}) \le d_\phi(\mathcal G)$.

Fact 6 ([17, Theorem 11.4]). The pseudodimension of a $k$-dimensional vector space of real-valued functions is $k$.

We will also use a classic result of Warren that is useful, among other things, for bounding the pseudodimension of classes involving low-rank matrices. We say that the real-valued functions $(g_1,g_2,\dots,g_m)$ realize a sign vector $b\in\{\pm1\}^m$ iff there exists an input $x$ for which $b_i = \mathrm{sign}\,g_i(x)$ for all $i$. The number of sign vectors realizable by $m$ polynomials of degree at most $d$ over $\mathbb R^n$, where $m\ge n$, is at most $(4edm/n)^n$ [19].
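Warren's bound can be probed empirically (our sketch): counting, by dense sampling, the sign vectors realized by $m=5$ random quadratics ($d=2$) over $\mathbb R$ ($n=1$ variable). Each quadratic changes sign at most twice, so only a handful of patterns appear, comfortably below $(4edm/n)^n = 40e \approx 109$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, n_vars = 5, 2, 1

coefs = rng.standard_normal((m, 3))     # g_i(x) = a_i x^2 + b_i x + c_i

xs = np.linspace(-20.0, 20.0, 200001)
G = (coefs[:, 0][:, None] * xs**2 + coefs[:, 1][:, None] * xs
     + coefs[:, 2][:, None])            # shape (m, number of sampled inputs)
patterns = {tuple(col) for col in np.sign(G).T.astype(int)}

warren = (4 * np.e * d * m / n_vars) ** n_vars
print(len(patterns), "realized sign vectors; Warren's bound:", warren)
```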

5.2 Combination of Base Kernels

Since families of linear or convex combinations of $k$ base kernels are subsets of $k$-dimensional vector spaces of functions, we can easily bound their pseudodimension by $k$. Note that the pseudodimension depends only on the number of base kernels, and not on the particular choice of base kernels.

Lemma 7. For any finite set of kernels $S = \{K_1,\dots,K_k\}$: $d_\phi(\mathcal K_{\mathrm{convex}}(S)) \le d_\phi(\mathcal K_{\mathrm{linear}}(S)) \le k$.

Proof. We have $\mathcal K_{\mathrm{convex}} \subseteq \mathcal K_{\mathrm{linear}} \subseteq \mathrm{span}\,S$, where $\mathrm{span}\,S = \{\sum_i \lambda_i K_i \mid \lambda_i\in\mathbb R\}$ is a vector space of dimensionality at most $k$. The bounds follow from Facts 4 and 6.

5.3 Gaussian Kernels with a Learned Covariance Matrix

Before considering the family $\mathcal K_{\mathrm{Gaussian}}$ of Gaussian kernels, let us consider a single-parameter family that generalizes tuning a single scale parameter (i.e. the variance) of a Gaussian kernel. For a function $d:\mathcal X\times\mathcal X\to\mathbb R_+$, consider the class

$$\mathcal K_{\mathrm{scale}}(d) \overset{\mathrm{def}}{=} \big\{\, K^d_\lambda : (x_1,x_2) \mapsto e^{-\lambda d(x_1,x_2)} \;\big|\; \lambda\in\mathbb R_+ \,\big\}. \qquad (23)$$

The family of spherical Gaussian kernels is obtained with $d(x_1,x_2) = \|x_1-x_2\|^2$.

Lemma 8. For any function $d$, $d_\phi(\mathcal K_{\mathrm{scale}}(d)) \le 1$.

Proof. The set $\{-\lambda d \mid \lambda\in\mathbb R_+\}$ of functions over $\mathcal X\times\mathcal X$ is a subset of a one-dimensional vector space and so has pseudodimension at most one. Composing these functions with the monotone exponentiation function and using Fact 5 yields the desired bound.

In order to analyze the pseudodimension of more general families of Gaussian kernels, we use the same technique of analyzing the functions in the exponent and then composing them with the exponentiation function. Recall the class $\mathcal K^l_{\mathrm{Gaussian}}$ of Gaussian kernels over $\mathbb R^l$ defined in (3).

Lemma 9. $d_\phi(\mathcal K^l_{\mathrm{Gaussian}}) \le l(l+1)/2$.

Proof. Consider the functions in the exponent:

$$\big\{(x_1,x_2)\mapsto -(x_1-x_2)^\top A(x_1-x_2) \mid A\in\mathbb R^{l\times l},\ A\succeq 0\big\} \subseteq \mathrm{span}\big\{(x_1,x_2)\mapsto (x_1-x_2)[i]\,(x_1-x_2)[j] \mid i\le j\le l\big\}$$

where $v[i]$ denotes the $i$-th coordinate of a vector in $\mathbb R^l$. This is a vector space of dimensionality $l(l+1)/2$, and the result follows by composition with the exponentiation function.
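The span argument in the proof of Lemma 9 can be checked directly (our sketch): the exponent $-(x_1-x_2)^\top A(x_1-x_2)$ is a fixed linear combination of the $l(l+1)/2$ monomials $(x_1-x_2)[i]\,(x_1-x_2)[j]$, $i \le j$:

```python
import numpy as np

rng = np.random.default_rng(4)
l = 4
M = rng.standard_normal((l, l))
A = M @ M.T                                  # a p.s.d. parameter matrix A

# Coefficients of -(x1-x2)^T A (x1-x2) in the basis delta[i]*delta[j], i <= j.
coef = {(i, j): -(2 * A[i, j] if i < j else A[i, i])
        for i in range(l) for j in range(i, l)}
print(len(coef))                             # l(l+1)/2 = 10 basis functions

delta = rng.standard_normal(l)               # delta = x1 - x2
lhs = -delta @ A @ delta
rhs = sum(c * delta[i] * delta[j] for (i, j), c in coef.items())
print(np.isclose(lhs, rhs))                  # True: the exponent lies in the span
```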

We next analyze the pseudodimension of the family of Gaussian kernels with a diagonal covariance matrix, i.e. when we apply an arbitrary scaling to the input coordinates:

$$\mathcal K^{(l\ \mathrm{diag})}_{\mathrm{Gaussian}} = \big\{\, K_\lambda : (x_1,x_2)\mapsto e^{-\|\lambda\odot(x_1-x_2)\|^2} \;\big|\; \lambda\in\mathbb R^l \,\big\} \qquad (24)$$

(where $\odot$ denotes the element-wise product).

Lemma 10. $d_\phi(\mathcal K^{(l\ \mathrm{diag})}_{\mathrm{Gaussian}}) \le l$.

Proof. We use the same arguments: the exponents are spanned by the $l$ functions $(x_1,x_2)\mapsto \big((x_1-x_2)[i]\big)^2$.

As a final example, we analyze the pseudodimension of the family of Gaussian kernels with a low-rank covariance matrix, corresponding to a low-rank $A$ in our notation:

$$\mathcal K^{l,k}_{\mathrm{Gaussian}} = \big\{\, (x_1,x_2)\mapsto e^{-(x_1-x_2)^\top A(x_1-x_2)} \;\big|\; A\in\mathbb R^{l\times l},\ A\succeq 0,\ \mathrm{rank}\,A\le k \,\big\}$$

This family corresponds to learning a dimensionality-reducing linear transformation of the inputs that is applied before calculating the Gaussian kernel.

Lemma 11. $d_\phi(\mathcal K^{l,k}_{\mathrm{Gaussian}}) \le kl\log_2(8ekl)$.

Proof. Any $A\succeq 0$ of rank at most $k$ can be written as $A = U^\top U$ with $U\in\mathbb R^{k\times l}$. Consider the set $G = \{(x,x')\mapsto (x-x')^\top U^\top U(x-x') \mid U\in\mathbb R^{k\times l}\}$ of functions in the exponent. Assume $G$ pseudo-shatters a set of $m$ point pairs $S = \{(x_1,x'_1),\dots,(x_m,x'_m)\}$. By the definition of pseudo-shattering, there exist $t_1,\dots,t_m\in\mathbb R$ so that for every $b\in\{\pm1\}^m$ there exists $U_b\in\mathbb R^{k\times l}$ with $b_i = \mathrm{sign}\big((x_i-x'_i)^\top U_b^\top U_b(x_i-x'_i) - t_i\big)$ for all $i\le m$. Viewing each $p_i(U) \overset{\mathrm{def}}{=} (x_i-x'_i)^\top U^\top U(x_i-x'_i) - t_i$ as a quadratic polynomial in the $kl$ entries of $U$, where $x_i-x'_i$ and $t_i$ determine the coefficients of $p_i$, we get a set of $m$ quadratic polynomials over $kl$ variables which realize all $2^m$ sign vectors. Applying Warren's bound [19] discussed above, we get $2^m \le (8em/(kl))^{kl}$, which implies $m \le kl\log_2(8ekl)$. This is a bound on the number of point pairs that can be pseudo-shattered by $G$, and hence on the pseudodimension of $G$, and by composition with exponentiation we get the desired bound.
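The dimensionality-reduction reading of $\mathcal K^{l,k}_{\mathrm{Gaussian}}$ noted above can be seen directly: with $A = U^\top U$, the exponent is the squared Euclidean distance after projecting by $U$ (our sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
l, k = 6, 2
U = rng.standard_normal((k, l))        # a k x l linear map, A = U^T U

def low_rank_gaussian(x1, x2, U):
    """K(x1, x2) = exp(-(x1-x2)^T U^T U (x1-x2)) = exp(-||U x1 - U x2||^2):
    a spherical Gaussian kernel applied after the linear map U."""
    z = U @ (x1 - x2)
    return np.exp(-z @ z)

x1, x2 = rng.standard_normal(l), rng.standard_normal(l)
A = U.T @ U
direct = np.exp(-(x1 - x2) @ A @ (x1 - x2))
print(np.isclose(direct, low_rank_gaussian(x1, x2, U)))   # True
```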

6 Conclusion and Discussion

Learning with a family of allowed kernel matrices has been a topic of significant interest and the focus of a considerable body of research in recent years, and several attempts have been made to establish learning bounds for this setting. In this paper we establish the first generalization error bounds for kernel-learning SVMs in which the margin complexity term and the dimensionality of the kernel family interact additively rather than multiplicatively (up to log factors). The additive interaction yields stronger bounds. We believe that the implied additive bounds on the sample complexity represent its correct behavior (up to log factors), although this remains to be proved.

The results we present significantly improve on previous results for convex combinations of base kernels, for which the only previously known bound had a multiplicative interaction [1], and for Gaussian kernels with a learned covariance matrix, for which only a bound with a multiplicative interaction and an unspecified dependence on the input dimensionality was previously shown [14]. We also provide the first explicit non-trivial bound for linear combinations of base kernels: a bound that depends only on the (relative) margin and the number of base kernels. The techniques we introduce for obtaining bounds based on the pseudodimension of the class of kernels should readily apply to the straightforward derivation of bounds for many other classes.

We note that previous attempts at establishing bounds for this setting [1, 2, 14] relied on bounding the Rademacher complexity [15] of the class $\mathcal F_{\mathcal K}$. However, generalization error bounds derived solely from the Rademacher complexity $R[\mathcal F_{\mathcal K}]$ of the class $\mathcal F_{\mathcal K}$ must have a multiplicative dependence on $\sqrt B/\gamma$: the Rademacher complexity $R[\mathcal F_{\mathcal K}]$ scales linearly with the scale $\sqrt B$ of the functions in $\mathcal F_{\mathcal K}$, and to obtain an estimation error bound it is multiplied by the Lipschitz constant $1/\gamma$ [15]. This might be avoidable by clipping predictors in $\mathcal F_{\mathcal K}$ to the range $[-\gamma,\gamma]$:

$$\mathcal F^\gamma_{\mathcal K} \overset{\mathrm{def}}{=} \{ f^{[\pm\gamma]} \mid f\in\mathcal F_{\mathcal K} \}, \qquad f^{[\pm\gamma]}(x) = \begin{cases} -\gamma & \text{if } f(x)\le-\gamma \\ f(x) & \text{if } -\gamma\le f(x)\le\gamma \\ \gamma & \text{if } \gamma\le f(x) \end{cases} \qquad (25)$$

When the Rademacher complexity $R[\mathcal F_{\mathcal K}]$ is used to obtain generalization error bounds in terms of the margin error, the class is implicitly clipped, and only the Rademacher complexity of $\mathcal F^\gamma_{\mathcal K}$ is actually relevant. This Rademacher complexity $R[\mathcal F^\gamma_{\mathcal K}]$ is bounded by $R[\mathcal F_{\mathcal K}]$. In our case, it seems that this last bound is loose. It is possible, though, that covering numbers of $\mathcal K$ can be used to bound $R[\mathcal F^\gamma_{\mathcal K}]$ by $O\big(\big(\gamma\sqrt{\log\mathcal N^D_{2n}(\mathcal K,\,4B/n^2)} + \sqrt B\big)/\sqrt n\big)$, yielding a generalization error bound with an additive interaction, and perhaps avoiding the log factors in the margin complexity term $\tilde O(B/\gamma^2)$ of Theorem 2.

References

1. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., El Ghaoui, L., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5 (2004)
2. Bousquet, O., Herrmann, D.J.L.: On the complexity of learning the kernel matrix. In: Advances in Neural Information Processing Systems 15 (2003)
3. Crammer, K., Keshet, J., Singer, Y.: Kernel design using boosting. In: Advances in Neural Information Processing Systems 15 (2003)
4. Lanckriet, G.R.G., De Bie, T., Cristianini, N., Jordan, M.I., Noble, W.S.: A statistical framework for genomic data fusion. Bioinformatics 20 (2004)
5. Sonnenburg, S., Rätsch, G., Schäfer, C.: Learning interpretable SVMs for biological sequence classification. In: Research in Computational Molecular Biology (2005)
6. Ben-Hur, A., Noble, W.S.: Kernel methods for predicting protein-protein interactions. Bioinformatics 21 (2005)
7. Cristianini, N., Campbell, C., Shawe-Taylor, J.: Dynamically adapting kernels in support vector machines. In: Advances in Neural Information Processing Systems 11 (1999)
8. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Machine Learning 46 (2002)

9. Keerthi, S.S.: Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Trans. on Neural Networks 13 (2002)
10. Glasmachers, T., Igel, C.: Gradient-based adaptation of general Gaussian kernels. Neural Comput. 17 (2005)
11. Ong, C.S., Smola, A.J., Williamson, R.C.: Learning the kernel with hyperkernels. J. Mach. Learn. Res. 6 (2005)
12. Micchelli, C.A., Pontil, M.: Learning the kernel function via regularization. J. Mach. Learn. Res. 6 (2005)
13. Argyriou, A., Micchelli, C.A., Pontil, M.: Learning convex combinations of continuously parameterized basic kernels. In: 18th Annual Conf. on Learning Theory (2005)
14. Micchelli, C.A., Pontil, M., Wu, Q., Zhou, D.X.: Error bounds for learning the kernel. Research Note RN/05/09, University College London Dept. of Computer Science (2005)
15. Koltchinskii, V., Panchenko, D.: Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 (2002)
16. Smola, A.J., Schölkopf, B.: Learning with Kernels. MIT Press (2002)
17. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press (1999)
18. Bhatia, R.: Matrix Analysis. Springer (1997)
19. Warren, H.E.: Lower bounds for approximation by nonlinear manifolds. Trans. Am. Math. Soc. 133 (1968)

A Analysis of Previous Bounds

We show that some of the previously suggested bounds for SVM kernel learning can never lead to meaningful bounds on the expected error. Lanckriet et al. [1, Theorem 24] show that for any class $\mathcal K$ and margin $\gamma$, with probability at least $1-\delta$, every $f\in\mathcal F_{\mathcal K}$ satisfies:

$$\mathrm{err}(f) \le \widehat{\mathrm{err}}_\gamma(f) + \frac{1}{\sqrt n}\left(4 + \sqrt{2\log(1/\delta)} + \frac{\sqrt{C(\mathcal K)}}{\gamma}\right) \qquad (26)$$

where $C(\mathcal K) = \mathbb E_\sigma[\max_{K\in\mathcal K}\sigma^\top K_{\mathbf x}\sigma]$, with $\sigma$ chosen uniformly from $\{\pm1\}^{2n}$ and $\mathbf x$ being the set of the $n$ training and $n$ test points. The bound is for a transductive setting, and the Gram matrix of both the training and the test data is considered. We continue to denote the empirical margin error, on the $n$ training points, by $\widehat{\mathrm{err}}_\gamma(f)$, but now $\mathrm{err}(f)$ is the test error on the specific $n$ test points. The expectation $C(\mathcal K)$ is not easy to compute in general, and Lanckriet et al. provide specific bounds for families of linear, and of convex, combinations of base kernels.

A.1 Bound for linear combinations of base kernels

For the family $\mathcal K = \mathcal K_{\mathrm{linear}}$ of linear combinations of base kernels (equation (1)), Lanckriet et al. note that $C(\mathcal K) \le c$, where $c = \max_{K\in\mathcal K}\mathrm{tr}\,K_{\mathbf x}$ is an upper bound on the trace of the possible Gram matrices.

Substituting this explicit bound on $C(\mathcal K)$ in (26) results in:

$$\mathrm{err}(f) \le \widehat{\mathrm{err}}_\gamma(f) + \frac{1}{\sqrt n}\left(4 + \sqrt{2\log(1/\delta)} + \frac{\sqrt c}{\gamma}\right) \qquad (27)$$

However, the following lemma shows that if a kernel allows classifying much of the training points within a large margin, then the trace of its Gram matrix cannot be too small:

Lemma 12. For all $f\in\mathcal F_K$: $\mathrm{tr}\,K_{\mathbf x} \ge \gamma^2\big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n$.

Proof. Let $f(x) = \langle w,\phi(x)\rangle$, $\|w\| = 1$. Then for any $i$ for which $y_i f(x_i) \ge \gamma$ we must have $\sqrt{K(x_i,x_i)} = \|\phi(x_i)\| \ge \gamma$. Hence $\mathrm{tr}\,K_{\mathbf x} \ge \sum_{i:\,y_i f(x_i)\ge\gamma} K(x_i,x_i) \ge |\{i \mid y_i f(x_i)\ge\gamma\}|\,\gamma^2 = \big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n\,\gamma^2$.

Using Lemma 12 we get that the right-hand side of (27) is at least

$$\widehat{\mathrm{err}}_\gamma(f) + \frac{4+\sqrt{2\log(1/\delta)}}{\sqrt n} + \frac{\sqrt{\gamma^2\big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n}}{\gamma\sqrt n} \;>\; \widehat{\mathrm{err}}_\gamma(f) + \sqrt{1-\widehat{\mathrm{err}}_\gamma(f)} \;\ge\; 1. \qquad (28)$$

A.2 Bound for convex combinations of base kernels

For the family $\mathcal K = \mathcal K_{\mathrm{convex}}$ of convex combinations of base kernels (equation (2)), Lanckriet et al. bound $C(\mathcal K) \le c\,\min\big(m,\ n\max_{K_i}\|(K_i)_{\mathbf x}\|_2/\mathrm{tr}((K_i)_{\mathbf x})\big)$, where $m$ is the number of base kernels, $c = \max_{K\in\mathcal K}\mathrm{tr}(K_{\mathbf x})$ as before, and the maximum is over the base kernels $K_i$. The first argument of the minimization yields a non-trivial generalization bound that is multiplicative in the number of base kernels, and is discussed in Section 1.2. The second argument yields the following bound, which was also obtained by Bousquet and Herrmann [2]:

$$\mathrm{err}(f) \le \widehat{\mathrm{err}}_\gamma(f) + \frac{1}{\sqrt n}\left(4 + \sqrt{2\log(1/\delta)} + \frac{\sqrt{c\,b\,n}}{\gamma}\right) \qquad (29)$$

where $b = \max_{K_i}\|(K_i)_{\mathbf x}\|_2/\mathrm{tr}((K_i)_{\mathbf x})$. This implies $\|K_{\mathbf x}\|_2 \le b\,\mathrm{tr}(K_{\mathbf x}) \le b\,c$ for all base kernels, and so (by convexity) also for all $K\in\mathcal K$. However, similarly to the bound on the trace of Gram matrices in Lemma 12, we can also bound the $L_2$ operator norm required for classification of most points with a margin:

Lemma 13. For all $f\in\mathcal F_K$: $\|K_{\mathbf x}\|_2 \ge \gamma^2\big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n$.

Proof. From Lemma 1 we have $f(\mathbf x) = K_{\mathbf x}^{1/2}w$ for some $w$ such that $\|w\|\le 1$, and so $\|K_{\mathbf x}\|_2 = \|K_{\mathbf x}^{1/2}\|_2^2 \ge \|K_{\mathbf x}^{1/2}w\|^2 = \|f(\mathbf x)\|^2$. To bound the right-hand side, consider that for $\big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n$ of the points in $\mathbf x$ we have $|f(x_i)| \ge y_i f(x_i) \ge \gamma$, and so $\|f(\mathbf x)\|^2 = \sum_i f(x_i)^2 \ge \big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n\,\gamma^2$.

Lemma 13 implies $b\,c \ge \gamma^2\big(1-\widehat{\mathrm{err}}_\gamma(f)\big)\,n$, and a calculation similar to (28) reveals that the right-hand side of (29) is always greater than one.


b i u x i U a i j u x i u x j M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here

More information

Apply change-of-basis formula to rewrite x as a linear combination of eigenvectors v j.

Apply change-of-basis formula to rewrite x as a linear combination of eigenvectors v j. Eigevalue-Eigevector Istructor: Nam Su Wag eigemcd Ay vector i real Euclidea space of dimesio ca be uiquely epressed as a liear combiatio of liearly idepedet vectors (ie, basis) g j, j,,, α g α g α g α

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

Mixtures of Gaussians and the EM Algorithm

Mixtures of Gaussians and the EM Algorithm Mixtures of Gaussias ad the EM Algorithm CSE 6363 Machie Learig Vassilis Athitsos Computer Sciece ad Egieerig Departmet Uiversity of Texas at Arligto 1 Gaussias A popular way to estimate probability desity

More information

Lecture 9: Expanders Part 2, Extractors

Lecture 9: Expanders Part 2, Extractors Lecture 9: Expaders Part, Extractors Topics i Complexity Theory ad Pseudoradomess Sprig 013 Rutgers Uiversity Swastik Kopparty Scribes: Jaso Perry, Joh Kim I this lecture, we will discuss further the pseudoradomess

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

5.1 Review of Singular Value Decomposition (SVD)

5.1 Review of Singular Value Decomposition (SVD) MGMT 69000: Topics i High-dimesioal Data Aalysis Falll 06 Lecture 5: Spectral Clusterig: Overview (cotd) ad Aalysis Lecturer: Jiamig Xu Scribe: Adarsh Barik, Taotao He, September 3, 06 Outlie Review of

More information

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology Advaced Aalysis Mi Ya Departmet of Mathematics Hog Kog Uiversity of Sciece ad Techology September 3, 009 Cotets Limit ad Cotiuity 7 Limit of Sequece 8 Defiitio 8 Property 3 3 Ifiity ad Ifiitesimal 8 4

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

The random version of Dvoretzky s theorem in l n

The random version of Dvoretzky s theorem in l n The radom versio of Dvoretzky s theorem i l Gideo Schechtma Abstract We show that with high probability a sectio of the l ball of dimesio k cε log c > 0 a uiversal costat) is ε close to a multiple of the

More information

Principle Of Superposition

Principle Of Superposition ecture 5: PREIMINRY CONCEP O RUCUR NYI Priciple Of uperpositio Mathematically, the priciple of superpositio is stated as ( a ) G( a ) G( ) G a a or for a liear structural system, the respose at a give

More information

Polynomials with Rational Roots that Differ by a Non-zero Constant. Generalities

Polynomials with Rational Roots that Differ by a Non-zero Constant. Generalities Polyomials with Ratioal Roots that Differ by a No-zero Costat Philip Gibbs The problem of fidig two polyomials P(x) ad Q(x) of a give degree i a sigle variable x that have all ratioal roots ad differ by

More information

Study the bias (due to the nite dimensional approximation) and variance of the estimators

Study the bias (due to the nite dimensional approximation) and variance of the estimators 2 Series Methods 2. Geeral Approach A model has parameters (; ) where is ite-dimesioal ad is oparametric. (Sometimes, there is o :) We will focus o regressio. The fuctio is approximated by a series a ite

More information

Lecture 11: Decision Trees

Lecture 11: Decision Trees ECE9 Sprig 7 Statistical Learig Theory Istructor: R. Nowak Lecture : Decisio Trees Miimum Complexity Pealized Fuctio Recall the basic results of the last lectures: let X ad Y deote the iput ad output spaces

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

A REMARK ON A PROBLEM OF KLEE

A REMARK ON A PROBLEM OF KLEE C O L L O Q U I U M M A T H E M A T I C U M VOL. 71 1996 NO. 1 A REMARK ON A PROBLEM OF KLEE BY N. J. K A L T O N (COLUMBIA, MISSOURI) AND N. T. P E C K (URBANA, ILLINOIS) This paper treats a property

More information

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 7

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 7 Statistical Machie Learig II Sprig 2017, Learig Theory, Lecture 7 1 Itroductio Jea Hoorio jhoorio@purdue.edu So far we have see some techiques for provig geeralizatio for coutably fiite hypothesis classes

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information