Learning Bounds for Support Vector Machines with Learned Kernels

1 Learning Bounds for Support Vector Machines with Learned Kernels
Nati Srebro (TTI-Chicago), Shai Ben-David (University of Waterloo)
Mostly based on a paper presented at COLT 06

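2 Kernelized Large-Margin Linear Classification
Feature map $\phi(x)$ with $\|\phi(x)\| \le B$, separated with margin $\gamma$.
$K(x_1,x_2) = \langle \phi(x_1), \phi(x_2) \rangle$, so $K(x,x) \le B^2$.
The kernel implicitly defines a Hilbert space in which we seek large-margin separation. It represents our prior knowledge, or bias.
estimation error = E[error] - training error $\le \sqrt{O\big((B/\gamma)^2 - \log\delta\big)/n}$
Here $\delta$ is the failure probability, $n$ is the sample size, and $(B/\gamma)^2$ is the sample complexity.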
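
To make the identity $K(x_1,x_2) = \langle \phi(x_1), \phi(x_2) \rangle$ concrete, here is a minimal sketch. It assumes a homogeneous quadratic kernel on $\mathbb{R}^2$ as a stand-in (the slide does not name a specific kernel), and the sample points are made up for illustration.

```python
import math

def phi(x):
    """Explicit feature map for the homogeneous quadratic kernel on R^2 (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def inner(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x, y):
    """K(x, y) = <x, y>^2, computed without forming phi."""
    return inner(x, y) ** 2

# Hypothetical sample points, each of Euclidean norm at most 1.
points = [(1.0, 0.0), (0.6, 0.8), (-0.5, 0.5)]

# The kernel value matches the inner product in the implied feature space.
for x in points:
    for y in points:
        assert math.isclose(quadratic_kernel(x, y), inner(phi(x), phi(y)), abs_tol=1e-12)

# B^2 is the largest value of K(x, x) over the points considered.
B_squared = max(quadratic_kernel(x, x) for x in points)
```

The kernel is evaluated without ever constructing $\phi$, and $B^2$ is read off from the diagonal values $K(x,x)$.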

3 Learning the Kernel
Success of learning rests on the choice of a good kernel, appropriate for the task. How can we know which kernel is good for the task at hand?
Jointly learn the classifier and the kernel, using the training data: search for a kernel from some family $\mathcal{K}$ of allowed kernels.
Learn the bandwidth, or covariance matrix, of a Gaussian kernel; other kernel parameters [Cristianini+98][Chapelle+02][Keerthi02] etc.
Linear, or convex, combination of base kernels [Lanckriet+02,04][Crammer+03]; applications, esp. in Bioinformatics [Sonnenburg+05][Ben-Hur&Noble05] etc.
More flexibility: lower approximation error, but higher estimation error.
What is the sample complexity cost of this flexibility?

4 Outline
With a fixed kernel: estimation error $\le \sqrt{O\big((B/\gamma)^2 - \log\delta\big)/n}$
How does this change when the kernel is learned from some family $\mathcal{K}$? What is the cost of learning the kernel?
Main result: learning bound for general kernel families; additive increase to the sample complexity.
Examples: bounds for specific families.
Learn $\sum_i \alpha_i K_i$ or just use $\sum_i K_i$? Group Lasso (block-$L_1$).
On demand: proof technique (very simple) and why using the Rademacher complexity can't work.

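5 Previous Bounds: Specific Kernel Families
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \{ \sum_i \lambda_i K_i \mid \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \}$
estimation error $\le \sqrt{O\big(k\,(B/\gamma)^2 - \log\delta\big)/n}$ [Lanckriet+ JMLR 2004]
$\mathcal{K}^{\ell}_{\mathrm{Gaussian}} \stackrel{\mathrm{def}}{=} \{ (x_1,x_2) \mapsto e^{-(x_1-x_2)^\top A (x_1-x_2)} \mid \text{psd } A \in \mathbb{R}^{\ell \times \ell} \}$
estimation error $\le \sqrt{O\big(C_\ell\,(B/\gamma)^2 - \log\delta\big)/n}$ [Micchelli+ 2005], where $C_\ell$ is an unspecified function of the input dimensionality.
Both suggest a multiplicative increase in the required sample size.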

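6 Finite Cardinality
$\mathcal{K} = \{K_1, K_2, \dots, K_{|\mathcal{K}|}\}$
For a single kernel $K$ (the "bad event" for kernel $K$):
$\Pr\big[\exists \text{ margin-}\gamma \text{ classifier w.r.t. } K:\ \text{estimation error} > \sqrt{O\big((B/\gamma)^2 - \log\delta\big)/n}\,\big] < \delta$
For a finite kernel family $\mathcal{K}$, set $\delta \to \delta/|\mathcal{K}|$, and take a union bound over bad events:
$\Pr\big[\exists K \in \mathcal{K},\ \exists \text{ margin-}\gamma \text{ classifier w.r.t. } K:\ \text{estimation error} > \sqrt{O\big((B/\gamma)^2 - \log(\delta/|\mathcal{K}|)\big)/n}\,\big] < |\mathcal{K}| \cdot \frac{\delta}{|\mathcal{K}|}$
$\Pr\big[\exists K \in \mathcal{K},\ \exists \text{ margin-}\gamma \text{ classifier w.r.t. } K:\ \text{estimation error} > \sqrt{O\big((B/\gamma)^2 + \log|\mathcal{K}| - \log\delta\big)/n}\,\big] < \delta$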
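
The union-bound step for a finite family can be written out in two lines. This is a sketch of the standard argument; "bad event for $K$" is shorthand for the event that some margin-$\gamma$ classifier w.r.t. $K$ exceeds the single-kernel bound run at confidence $\delta/|\mathcal{K}|$.

```latex
\begin{align*}
\Pr\big[\exists K \in \mathcal{K}:\ \text{bad event for } K\big]
  &\le \sum_{K \in \mathcal{K}} \Pr\big[\text{bad event for } K\big]
   < |\mathcal{K}| \cdot \frac{\delta}{|\mathcal{K}|} = \delta, \\
-\log\frac{\delta}{|\mathcal{K}|} &= \log|\mathcal{K}| - \log\delta .
\end{align*}
```

The second line is why the cost of choosing among $|\mathcal{K}|$ kernels shows up as an additive $\log|\mathcal{K}|$ term next to $(B/\gamma)^2$ rather than as a multiplicative factor.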

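7 Main Result
An additive bound for general kernel families, in terms of their pseudo-dimension. For any $K$ chosen from $\mathcal{K}$, and any classifier with margin $\gamma$ with respect to $K$:
estimation error $\le \sqrt{\tilde{O}\big((B/\gamma)^2 + d_\phi(\mathcal{K}) - \log\delta\big)/n}$
Explicitly, the quantity under the square root is, up to the numerical constant on the $(B/\gamma)^2$ term:
$\Big(16 + 8\,d_\phi \log\frac{128\,e\,n^3 B^2}{\gamma^2 d_\phi} + (B/\gamma)^2 \log\frac{\gamma e n}{8B}\,\log\frac{128\,n B^2}{\gamma^2} - \log\delta\Big)/n$
Sample complexity: $(B/\gamma)^2 + d_\phi(\mathcal{K})$
$d_\phi(\mathcal{K})$ = pseudo-dimension of $\mathcal{K}$ = VC-dimension of the subgraphs of $K \in \mathcal{K}$: $\{ (x_1,x_2,t) \mid K(x_1,x_2) < t \}$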
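
To get a feel for what "additive" means here, the sketch below evaluates a simplified form of the bound. It is an illustration only: the hidden constants are set to 1 and all logarithmic factors inside the $\tilde{O}$ are dropped, and the function name and example values are hypothetical.

```python
import math

def estimation_bound(B, gamma, n, delta, d_phi=0.0):
    """Simplified bound sqrt(((B/gamma)^2 + d_phi - log(delta)) / n).

    Constants and log factors are dropped (an assumption made for illustration),
    so only the relative size of the terms is meaningful.
    """
    return math.sqrt(((B / gamma) ** 2 + d_phi - math.log(delta)) / n)

# Hypothetical setting: B = 1, margin 0.1, 100,000 samples, 95% confidence.
fixed = estimation_bound(B=1.0, gamma=0.1, n=100_000, delta=0.05)
learned = estimation_bound(B=1.0, gamma=0.1, n=100_000, delta=0.05, d_phi=10)
```

With `d_phi=0` the expression reduces to the fixed-kernel bound; learning the kernel adds `d_phi / n` under the square root instead of multiplying the $(B/\gamma)^2$ term.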

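8 Bounds for Specific Kernel Families
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \{ \sum_i \lambda_i K_i \mid \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \}$
Previous result: estimation error $\le \sqrt{O\big(k\,(B/\gamma)^2 - \log\delta\big)/n}$ [Lanckriet+ JMLR 2004]
$\mathcal{K}_{\mathrm{linear}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \{ K_\lambda = \sum_i \lambda_i K_i \mid K_\lambda \text{ is psd and } \sum_i \lambda_i = 1 \}$
No previous bounds.
Applying our result: $d_\phi(\mathcal{K}_{\mathrm{linear}}),\ d_\phi(\mathcal{K}_{\mathrm{convex}}) \le k$, so
estimation error $\le \sqrt{\tilde{O}\big((B/\gamma)^2 + k - \log\delta\big)/n}$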
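
A member of $\mathcal{K}_{\mathrm{convex}}$ is easy to build in code. The sketch below is illustrative: the two base kernels (linear and RBF) and the weights are arbitrary choices, not taken from the talk, and `convex_combination` is a hypothetical helper name.

```python
import math

def linear_kernel(x, y):
    """K_1(x, y) = <x, y>."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, width=1.0):
    """K_2(x, y) = exp(-||x - y||^2 / (2 width^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def convex_combination(kernels, weights):
    """Return K = sum_i lambda_i K_i for weights on the probability simplex."""
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")

    def combined(x, y):
        return sum(w * k(x, y) for w, k in zip(weights, kernels))

    return combined

base = [linear_kernel, rbf_kernel]       # illustrative base kernels
K = convex_combination(base, [0.3, 0.7])  # illustrative weights
x = (0.6, 0.8)
```

Because the weights are non-negative and sum to one, $K(x,x)$ never exceeds the largest $K_i(x,x)$, so the combined kernel inherits the same bound $B^2$ as the base kernels.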

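9 Bounds for Specific Kernel Families
$\mathcal{K}^{\ell}_{\mathrm{Gaussian}} \stackrel{\mathrm{def}}{=} \{ (x_1,x_2) \mapsto e^{-(x_1-x_2)^\top A (x_1-x_2)} \mid \text{psd } A \in \mathbb{R}^{\ell \times \ell} \}$, where $\ell$ is the input dimensionality.
Previous result: estimation error $\le \sqrt{O\big(C_\ell\,(B/\gamma)^2 - \log\delta\big)/n}$ [Micchelli+ 2005], where $C_\ell$ is an unspecified function of the input dimensionality.
Applying our result: $d_\phi(\mathcal{K}^{\ell}_{\mathrm{Gaussian}}) \le \ell(\ell+1)/2$, so
estimation error $\le \sqrt{\tilde{O}\big((B/\gamma)^2 + \ell^2 - \log\delta\big)/n}$
Only diagonal $A$: $d_\phi \le \ell$
Only $\mathrm{rank}(A) \le k$: $d_\phi \le k\ell \log_2(8ek\ell)$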
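
The Gaussian family and its three pseudodimension bounds can be sketched as follows. The factorization $A = M^\top M$ is one convenient way to guarantee a psd matrix and is an assumption of this example, as are the function names; the three bounds are the ones quoted on the slide.

```python
import math

def gaussian_kernel(x1, x2, A):
    """exp(-(x1 - x2)^T A (x1 - x2)) for a psd matrix A given as a list of rows."""
    d = [a - b for a, b in zip(x1, x2)]
    Ad = [sum(A[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.exp(-sum(di * adi for di, adi in zip(d, Ad)))

def psd_from_factor(M):
    """A = M^T M is psd for any real M; rank(A) is at most the number of rows of M."""
    cols = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(cols)] for i in range(cols)]

def pseudodim_bounds(l, k):
    """Pseudodimension bounds from the slide, for input dimension l and rank limit k."""
    return {
        "full psd A": l * (l + 1) // 2,
        "diagonal A": l,
        "rank(A) <= k": k * l * math.log2(8 * math.e * k * l),
    }

A = psd_from_factor([[1.0, 2.0]])  # rank-1 psd matrix [[1, 2], [2, 4]]
bounds = pseudodim_bounds(l=3, k=1)
```

Every kernel in the family satisfies $K(x,x) = 1$, so $B = 1$ here and only the parameter count of $A$ enters the bound through $d_\phi$.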

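10 Additive vs. Multiplicative
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \{ \sum_i \lambda_i K_i \mid \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \}$
Sample complexity analysis: if there is a predictor with error err at margin $\gamma$ relative to some $K \in \mathcal{K}$, how many samples are needed to get error err + $\epsilon$?
Answer according to the multiplicative bound: $O\big(k\,(B/\gamma)^2 / \epsilon^2\big)$
Answer according to our (additive) bound: $\tilde{O}\big(((B/\gamma)^2 + k) / \epsilon^2\big)$
Relaxed approach: just use $\sum_i K_i$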
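
The gap between the two answers grows quickly with $k$. The sketch below compares them numerically under strong simplifications: hidden constants are set to 1 and the log factors of the $\tilde{O}$ are ignored, so the numbers indicate scaling only. The function names and parameter values are hypothetical.

```python
import math

def samples_multiplicative(B, gamma, k, eps):
    """k (B/gamma)^2 / eps^2, with the hidden constant set to 1 (assumption)."""
    return math.ceil(k * (B / gamma) ** 2 / eps ** 2)

def samples_additive(B, gamma, k, eps):
    """((B/gamma)^2 + k) / eps^2, with constants and log factors dropped (assumption)."""
    return math.ceil(((B / gamma) ** 2 + k) / eps ** 2)

# Hypothetical setting: B = 1, margin 0.1, target excess error 0.1.
rows = [
    (k, samples_multiplicative(1.0, 0.1, k, 0.1), samples_additive(1.0, 0.1, k, 0.1))
    for k in (1, 10, 100)
]
for k, mult, add in rows:
    print(f"k={k:>3}  multiplicative={mult:>9}  additive={add:>9}")
```

For a single base kernel the two expressions are essentially the same; with many base kernels the additive form is smaller by roughly a factor of $\min(k, (B/\gamma)^2)$.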

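11 Feature Space View
Instead of multiple kernels $K_i$, we can think of the implied feature spaces directly, with $K_i(x,x') = \langle \phi_i(x), \phi_i(x') \rangle$:
$\phi(x) = \big(\sqrt{\alpha_1}\,\phi_1(x),\ \sqrt{\alpha_2}\,\phi_2(x),\ \dots,\ \sqrt{\alpha_k}\,\phi_k(x)\big)$, $w = (w_1, w_2, \dots, w_k)$
Weighting each feature space by $\sqrt{\alpha_i}$ gives $K = \sum_i \alpha_i K_i$.
Relaxed approach: use the unweighted feature space $\phi(x)$, giving $K = \sum_i K_i$ and $\|w\|^2 = \sum_i \|w_i\|^2$.
$\|w\|^2$ required in the unweighted space $\le$ $\|w\|^2$ in any weighted space.
$B^2_K = k B^2$
Estimation error bound: $\sqrt{O\big(k B^2 \|w\|^2\big)/n}$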
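
The claim that weighting the $i$-th feature space by $\sqrt{\alpha_i}$ yields the kernel $\sum_i \alpha_i K_i$ can be checked directly. This is a minimal sketch with two illustrative base feature maps (linear and quadratic on $\mathbb{R}^2$); the maps, weights, and points are assumptions chosen for the example.

```python
import math

def phi_1(x):
    """Feature map of the linear kernel K_1(x, y) = <x, y>."""
    return list(x)

def phi_2(x):
    """Feature map of the quadratic kernel K_2(x, y) = <x, y>^2 on R^2."""
    a, b = x
    return [a * a, math.sqrt(2.0) * a * b, b * b]

def inner(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))

def weighted_phi(x, alphas, maps):
    """Concatenate sqrt(alpha_i) * phi_i(x) over all base feature maps."""
    out = []
    for alpha, phi in zip(alphas, maps):
        out.extend(math.sqrt(alpha) * f for f in phi(x))
    return out

maps = [phi_1, phi_2]
alphas = [0.25, 0.75]           # illustrative weights
x, y = (1.0, 2.0), (0.5, -1.0)  # illustrative points

# Inner product in the concatenated, weighted feature space ...
k_from_features = inner(weighted_phi(x, alphas, maps), weighted_phi(y, alphas, maps))
# ... equals the weighted sum of the base kernels.
k_from_kernels = alphas[0] * inner(x, y) + alphas[1] * inner(x, y) ** 2
```

Setting all weights to 1 in `weighted_phi` gives the unweighted feature space, whose kernel is the plain sum $\sum_i K_i$.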

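12 Additive vs. Multiplicative
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \{ \sum_i \lambda_i K_i \mid \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \}$
Sample complexity analysis: if there is a predictor with error err at margin $\gamma$ relative to some $K \in \mathcal{K}$, how many samples are needed to get error err + $\epsilon$?
Answer according to the multiplicative bound: $O\big(k\,(B/\gamma)^2 / \epsilon^2\big)$
Answer according to our (additive) bound: $\tilde{O}\big(((B/\gamma)^2 + k) / \epsilon^2\big)$
Relaxed approach: just use $\sum_i K_i$
Margin $\gamma$ relative to some $K \in \mathcal{K}$ implies margin $\gamma$ relative to $\sum_i K_i$.
$B^2_{\sum_i K_i} = \sup_x \sum_i K_i(x,x) \le k B^2$
Sample complexity: $O\big(k\,(B/\gamma)^2 / \epsilon^2\big)$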
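
The step from "margin $\gamma$ relative to some $K \in \mathcal{K}$" to "margin $\gamma$ relative to $\sum_i K_i$" can be spelled out as follows. This is a sketch of one way to see it; the notation $\phi_\alpha(x) = (\sqrt{\alpha_i}\,\phi_i(x))_i$ for the weighted feature map and $\phi(x) = (\phi_i(x))_i$ for the unweighted one is introduced here for the derivation.

```latex
\begin{align*}
w'_i &= \sqrt{\alpha_i}\, w_i
  \quad\Longrightarrow\quad
  \langle w', \phi(x) \rangle
  = \sum_i \sqrt{\alpha_i}\, \langle w_i, \phi_i(x) \rangle
  = \langle w, \phi_\alpha(x) \rangle, \\
\|w'\|^2 &= \sum_i \alpha_i \|w_i\|^2 \le \sum_i \|w_i\|^2 = \|w\|^2
  \quad (\text{since } 0 \le \alpha_i \le 1), \\
\sup_x \sum_i K_i(x,x) &\le \sum_i \sup_x K_i(x,x) \le k B^2 .
\end{align*}
```

So the same predictions are available in the unweighted space with no larger norm, and the only price is that the radius bound grows from $B^2$ to $kB^2$, which is where the factor of $k$ in the sample complexity comes from.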

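13 Learn $\sum_i \alpha_i K_i$ or use $\sum_i K_i$?
Relative to margin $\gamma$ for some $\sum_i \alpha_i K_i$:
Learn $\sum_i \alpha_i K_i$: error of learned predictor $\le$ error of best margin-$\gamma$ predictor with some $\sum_i \alpha_i K_i$ + $\sqrt{\tilde{O}\big((B/\gamma)^2 + k\big)/n}$
Use $\sum_i K_i$: error of learned predictor $\le$ error of best margin-$\gamma$ predictor with some $\sum_i \alpha_i K_i$ + $\sqrt{O\big(k\,(B/\gamma)^2\big)/n}$
Do we have enough samples to afford the factor of $k$? Is the decrease in estimation error worth the computational cost? (Maybe not, if we have enough data and the estimation error is small anyway.)
Relative to margin $\gamma$ for $\sum_i (1/k) K_i$:
Use $\sum_i K_i$: error of learned predictor $\le$ error of best margin-$\gamma$ predictor with $\sum_i (1/k) K_i$ + $\sqrt{O\big((B/\gamma)^2\big)/n}$
Learn $\sum_i \alpha_i K_i$: flexibility in setting the weights gives lower approximation error, but a $k/n$ increase to the estimation error. Is the decrease in approximation error worth the increase in estimation error (and the extra computational cost)?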

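14 Alternate View: Group Lasso
Instead of multiple kernels $K_i$, we can think of the implied feature spaces directly, with $K_i(x,x') = \langle \phi_i(x), \phi_i(x') \rangle$:
$\phi(x) = \big(\sqrt{\alpha_1}\,\phi_1(x),\ \sqrt{\alpha_2}\,\phi_2(x),\ \dots,\ \sqrt{\alpha_k}\,\phi_k(x)\big)$, $w = (w_1, w_2, \dots, w_k)$
Weighting each feature space by $\sqrt{\alpha_i}$ gives $K = \sum_i \alpha_i K_i$.
Relaxed approach: use the unweighted feature space $\phi(x)$, giving $K = \sum_i K_i$, $B^2_K = k B^2$, and $\|w\|^2 = \sum_i \|w_i\|^2$.
$\|w\|^2$ required in the unweighted space $\le$ $\|w\|^2$ in any weighted space.
Estimation error bound: $\sqrt{O\big(k B^2 \sum_i \|w_i\|^2\big)/n}$
[Bach et al 04] Learning with $\mathcal{K}_{\mathrm{convex}}$ is equivalent to using the unweighted feature space $\phi(x)$ and the block-$L_1$ regularizer $\sum_i \|w_i\|$.
Estimation error for group lasso: $\sqrt{\tilde{O}\big(B^2 (\sum_i \|w_i\|)^2 + k\big)/n}$
$\|w\|^2 = \sum_i \|w_i\|^2 \le (\sum_i \|w_i\|)^2$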
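
The two regularizers compared on this slide differ only in how the per-block norms are aggregated. The sketch below computes both for a made-up weight vector split into $k = 3$ blocks and checks the standard sandwich $\sum_i \|w_i\|^2 \le (\sum_i \|w_i\|)^2 \le k \sum_i \|w_i\|^2$; the function names and the example vector are hypothetical.

```python
import math

def block_norms(blocks):
    """Euclidean norm ||w_i|| of each block."""
    return [math.sqrt(sum(v * v for v in block)) for block in blocks]

def squared_l2(blocks):
    """||w||^2 = sum_i ||w_i||^2, the regularizer paired with the unweighted sum of kernels."""
    return sum(n * n for n in block_norms(blocks))

def squared_block_l1(blocks):
    """(sum_i ||w_i||)^2, the squared block-L1 (group lasso) regularizer."""
    return sum(block_norms(blocks)) ** 2

# Illustrative weight vector with three blocks, one of them entirely zero.
w = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]
k = len(w)

l2_value = squared_l2(w)              # 25 + 0 + 1
block_l1_value = squared_block_l1(w)  # (5 + 0 + 1)^2
```

The block-$L_1$ quantity is never smaller than $\|w\|^2$ and is at most $k$ times larger, with the factor $k$ reached only when all blocks have equal norm; sparse block structure, as in this example, keeps the two close.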

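15 Proof Sketch
Bound on the pseudodimension $d_\phi(\mathcal{K})$, plus a standard result on covering numbers in terms of $d_\phi$: covering of $\mathcal{K}$ of size $(L)^{d_\phi(\mathcal{K})}$.
Standard results on covering numbers of the unit sphere: covering of $\mathcal{F}_K$ of size $(L)^{(B/\varepsilon)^2}$.
Construct a covering for $\mathcal{F}_{\mathcal{K}}$ as a "cross-product": for each kernel $\tilde{K}$ in the covering of $\mathcal{K}$, take the covering of $\mathcal{F}_{\tilde{K}}$. This gives a covering of $\mathcal{F}_{\mathcal{K}}$ of size $(L)^{d_\phi(\mathcal{K})} \cdot (L)^{(B/\varepsilon)^2}$.
Lemma: if $K$, $\tilde{K}$ are similar as real-valued functions, every $K$-classifier can be approximated by a $\tilde{K}$-classifier.
Generalization bounds then follow in terms of log(covering number).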
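
The reason the cross-product construction yields an additive bound is that generalization bounds depend on the logarithm of the covering number, which turns the product into a sum. As a sketch, writing $\mathcal{N}(\mathcal{F}_{\mathcal{K}})$ for the covering number (notation assumed here) and keeping $(L)$ as the base quantity exactly as it appears on the slide:

```latex
\begin{align*}
\log \mathcal{N}(\mathcal{F}_{\mathcal{K}})
  &\le \log\!\Big( (L)^{d_\phi(\mathcal{K})} \cdot (L)^{(B/\varepsilon)^2} \Big) \\
  &= \big( d_\phi(\mathcal{K}) + (B/\varepsilon)^2 \big) \log (L) .
\end{align*}
```

With $\varepsilon$ taken proportional to the margin $\gamma$, the two exponents become the $d_\phi(\mathcal{K})$ and $(B/\gamma)^2$ terms of the main result, and $\log (L)$ accounts for the extra log factors hidden in the $\tilde{O}$.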

16 Rademacher vs. Covering Numbers
Other bounds rely on calculating the Rademacher complexity $R[\mathcal{F}_{\mathcal{K}}]$ of the class of (unit-norm) classifiers with respect to any $K \in \mathcal{K}$.
$R[\mathcal{F}_{\mathcal{K}}]$ scales with the scale of the functions in $\mathcal{F}_{\mathcal{K}}$, i.e. with $B$. Generalization bounds depend on $R[\mathcal{F}_{\mathcal{K}}]/\gamma$.
Hence bounds based on the Rademacher complexity necessarily have a multiplicative dependence on $B/\gamma$.
Covering numbers allow us to combine scale-sensitive and finite-dimensionality (scale-insensitive) arguments (at the cost of messier log-factors).

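17 Learning Bounds for SVMs with Learned Kernels
Nati Srebro, Shai Ben-David
Bound on the estimation error for a large-margin classifier with respect to a kernel which is chosen, from a family $\mathcal{K}$, based on the training data:
estimation error $\le \sqrt{\tilde{O}\big(d_\phi(\mathcal{K}) + (B/\gamma)^2 - \log\delta\big)/n}$
where $d_\phi(\mathcal{K})$ is the pseudodimension of $\mathcal{K}$, as a family of real-valued functions.
Valid for generic kernelized $L_2$-regularized learning.
Easy to obtain bounds for further kernel families.
For $\mathcal{K}_{\mathrm{convex}}$: using $\sum_i K_i$ may require $k$ times more data.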
