Learning Bounds for Support Vector Machines with Learned Kernels


Transcription:

Learning Bounds for Support Vector Machines with Learned Kernels
Nati Srebro (TTI-Chicago), Shai Ben-David (University of Waterloo)
Mostly based on a paper presented at COLT 06

Kernelized Large-Margin Linear Classification
[Figure: feature vectors $\phi(x)$ inside a ball of radius $B$, separated with margin $\gamma$]
$K(x_1,x_2) = \langle \phi(x_1), \phi(x_2) \rangle$ implicitly defines a Hilbert space in which we seek large-margin separation; it represents our prior knowledge, or bias. With $K(x,x) \le B^2$:
estimation error = E[error] $-$ training error $\le \sqrt{\dfrac{O\big((B/\gamma)^2\big) - \log\delta}{n}}$
where $\delta$ is the failure probability and $n$ the sample size; the sample complexity scales as $(B/\gamma)^2$.
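
As a concrete illustration of these quantities, here is a minimal sketch (my own, not from the talk): it builds a Gaussian-kernel Gram matrix, reads off $B = \sup_x \sqrt{K(x,x)}$, and evaluates the $(B/\gamma)^2$-style estimation term with constants and log factors omitted. The function names `gaussian_kernel_matrix` and `margin_estimation_term` are illustrative.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def margin_estimation_term(B, gamma, n, delta=0.05):
    """Rough sqrt(((B/gamma)^2 + log(1/delta)) / n) term, constants omitted."""
    return np.sqrt(((B / gamma) ** 2 + np.log(1.0 / delta)) / n)

X = np.random.randn(100, 5)
K = gaussian_kernel_matrix(X, sigma=2.0)
B = np.sqrt(K.diagonal().max())   # sup_x sqrt(K(x, x)); equals 1 for a Gaussian kernel
print(B, margin_estimation_term(B, gamma=0.1, n=len(X)))
```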

Learning the Kernel
Success of learning rests on the choice of a good kernel, appropriate for the task. How can we know which kernel is good for the task at hand?
Jointly learn the classifier and the kernel, using the training data: search for a kernel from some family $\mathcal{K}$ of allowed kernels.
Learn the bandwidth, or covariance matrix, of a Gaussian kernel, or other kernel parameters [Cristianini+98][Chapelle+02][Keerthi02] etc.
Linear, or convex, combination of base kernels [Lanckriet+02,04][Crammer+03]; applications, especially in Bioinformatics [Sonnenburg+05][Ben-Hur&Noble05] etc.
More flexibility means lower approximation error, but higher estimation error. What is the sample complexity cost of this flexibility?
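
A minimal sketch (illustrative, not from the talk) of the convex-combination family: given base Gram matrices $K_1,\dots,K_k$, any simplex weighting $\lambda$ yields a valid kernel $\sum_i \lambda_i K_i$. Here the weights are fixed by hand; actual multiple-kernel-learning methods choose them by optimizing a margin-based objective on the training data.

```python
import numpy as np

def convex_combination(kernels, lam):
    """Return sum_i lam[i] * K_i for simplex weights (lam >= 0, sum(lam) = 1)."""
    lam = np.asarray(lam, dtype=float)
    assert np.all(lam >= 0) and np.isclose(lam.sum(), 1.0)
    return sum(l * K for l, K in zip(lam, kernels))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
# Base kernels: Gaussian kernels at a few bandwidths.
base = [np.exp(-sq_dists / (2.0 * s ** 2)) for s in (0.5, 1.0, 2.0)]
lam = [0.2, 0.3, 0.5]                  # fixed here; learned from data in MKL methods
K = convex_combination(base, lam)      # still a valid (psd) kernel matrix
```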

Outline
With a fixed kernel: estimation error $\le \sqrt{\big(O((B/\gamma)^2) - \log\delta\big)/n}$.
How does this change when the kernel is learned from some family $\mathcal{K}$? What is the cost of learning the kernel?
Main result: a learning bound for general kernel families, with an additive increase to the sample complexity.
Examples: bounds for specific families. Learn $\sum_i \alpha_i K_i$, or just use $\sum_i K_i$? Group Lasso (block-$\ell_1$).
On demand: the proof technique (very simple), and why using the Rademacher complexity can't work.

Previous Bounds: Specific Kernel Families
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \big\{ \sum_i \lambda_i K_i \,\big|\, \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \big\}$:
estimation error $\le \sqrt{\dfrac{2\,k\,(B/\gamma)^2 - \log\delta}{n}}$  [Lanckriet+ JMLR 2004]
$\mathcal{K}^{l}_{\mathrm{Gaussian}} \stackrel{\mathrm{def}}{=} \big\{ (x_1,x_2) \mapsto e^{-(x_1-x_2)^\top A (x_1-x_2)} \,\big|\, \text{psd } A \in \mathbb{R}^{l\times l} \big\}$ ($l$ = input dimensionality):
estimation error $\le \sqrt{\dfrac{2\,C_l\,(B/\gamma)^2 - \log\delta}{n}}$, where $C_l$ is an unspecified function of the input dimensionality  [Micchelli+ 2005]
Both suggest a multiplicative increase in the required sample size.

Finite Cardinality: $\mathcal{K} = \{K_1, K_2, \dots, K_{|\mathcal{K}|}\}$
For a single kernel $K$:
$\Pr\big[\,\exists\,\text{margin-}\gamma\text{ classifier w.r.t. }K\text{ with estimation error} > \sqrt{\big(O((B/\gamma)^2) - \log\delta\big)/n}\,\big] < \delta$  (the "bad event" for kernel $K$)
For a finite kernel family $\mathcal{K}$, set $\delta \mapsto \delta/|\mathcal{K}|$ and take a union bound over the bad events:
$\Pr\big[\,\exists K \in \mathcal{K},\ \text{margin-}\gamma\text{ classifier w.r.t. }K\text{ with estimation error} > \sqrt{\big(O((B/\gamma)^2) - \log(\delta/|\mathcal{K}|)\big)/n}\,\big] < |\mathcal{K}|\cdot\tfrac{\delta}{|\mathcal{K}|} = \delta$
that is,
$\Pr\big[\,\exists K \in \mathcal{K},\ \text{margin-}\gamma\text{ classifier w.r.t. }K\text{ with estimation error} > \sqrt{\big(O((B/\gamma)^2 + \log|\mathcal{K}|) - \log\delta\big)/n}\,\big] < \delta$
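
To see how mild the union-bound cost is, here is a tiny arithmetic sketch (constants and log factors dropped, as in the $O(\cdot)$ statements above) comparing the single-kernel estimation term with the finite-family term for several family sizes; the numbers are placeholders.

```python
import numpy as np

def single_kernel_term(B, gamma, n, delta):
    return np.sqrt(((B / gamma) ** 2 + np.log(1 / delta)) / n)

def finite_family_term(B, gamma, n, delta, card):
    # Union bound: replace delta by delta / |K|, i.e. add log|K| under the root.
    return np.sqrt(((B / gamma) ** 2 + np.log(card) + np.log(1 / delta)) / n)

B, gamma, n, delta = 1.0, 0.1, 10_000, 0.05
for card in (1, 10, 1_000, 10 ** 6):
    ratio = finite_family_term(B, gamma, n, delta, card) / single_kernel_term(B, gamma, n, delta)
    print(card, round(ratio, 3))
```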

Main Result
An additive bound for general kernel families, in terms of their pseudo-dimension. For any $K$ chosen from $\mathcal{K}$, and any classifier with margin $\gamma$ with respect to $K$:
estimation error $\le \sqrt{\dfrac{16 + 8\,d_\phi \log\frac{128\,e^3 n B^2}{\gamma^2 d_\phi} + 2048\big(\frac{B}{\gamma}\big)^2 \log\frac{\gamma e n}{8B}\,\log\frac{128\,n B^2}{\gamma^2} - \log\delta}{n}} = \sqrt{\dfrac{\tilde O\big((B/\gamma)^2 + d_\phi(\mathcal{K})\big) - \log\delta}{n}}$
Sample complexity: $(B/\gamma)^2 + d_\phi(\mathcal{K})$.
$d_\phi(\mathcal{K})$ = pseudo-dimension of $\mathcal{K}$ = VC-dimension of the subgraphs of the kernels, $K \mapsto \{(x_1,x_2,t) \mid K(x_1,x_2) < t\}$.
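
To make the subgraph definition concrete, here is a small, purely illustrative sketch (not from the talk): for a one-parameter Gaussian-bandwidth family it labels triples $(x_1, x_2, t)$ by the indicator $K_\sigma(x_1,x_2) < t$ and brute-force checks, over a parameter grid, how many of the $2^m$ sign patterns are realized; $d_\phi$ is the largest $m$ for which some set of triples realizes all of them. The helper names are made up for this sketch.

```python
import numpy as np

def gaussian_k(x1, x2, sigma):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

def realized_patterns(triples, sigmas):
    """Sign patterns of [K_sigma(x1, x2) < t] over the triples, hit by the bandwidth family."""
    patterns = set()
    for s in sigmas:
        patterns.add(tuple(bool(gaussian_k(x1, x2, s) < t) for (x1, x2, t) in triples))
    return patterns

rng = np.random.default_rng(1)
triples = [(rng.standard_normal(3), rng.standard_normal(3), rng.uniform(0.0, 1.0))
           for _ in range(2)]
hit = realized_patterns(triples, sigmas=np.logspace(-2, 2, 400))
print(len(hit), "of", 2 ** len(triples), "patterns realized")   # shattered iff all are hit
```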

Bounds for Specific Kernel Families
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \big\{ \sum_i \lambda_i K_i \,\big|\, \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \big\}$
Previous result: estimation error $\le \sqrt{\big(2\,k\,(B/\gamma)^2 - \log\delta\big)/n}$  [Lanckriet+ JMLR 2004]
$\mathcal{K}_{\mathrm{linear}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \big\{ \sum_i \lambda_i K_i \,\big|\, K_\lambda \text{ is psd and } \sum_i \lambda_i = 1 \big\}$
No previous bounds.
Applying our result: $d_\phi(\mathcal{K}_{\mathrm{linear}}),\ d_\phi(\mathcal{K}_{\mathrm{convex}}) \le k$, so for both families
estimation error $\le \sqrt{\big(\tilde O((B/\gamma)^2 + k) - \log\delta\big)/n}$

Bounds for Specific Kernel Families
$\mathcal{K}^{l}_{\mathrm{Gaussian}} \stackrel{\mathrm{def}}{=} \big\{ (x_1,x_2) \mapsto e^{-(x_1-x_2)^\top A (x_1-x_2)} \,\big|\, \text{psd } A \in \mathbb{R}^{l\times l} \big\}$ ($l$ = input dimensionality)
Previous result: estimation error $\le \sqrt{\big(2\,C_l\,(B/\gamma)^2 - \log\delta\big)/n}$, with $C_l$ an unspecified function of the input dimensionality  [Micchelli+ 2005]
Applying our result: $d_\phi(\mathcal{K}_{\mathrm{Gaussian}}) \le l(l+1)/2$, so
estimation error $\le \sqrt{\big(\tilde O((B/\gamma)^2 + l^2) - \log\delta\big)/n}$
Only diagonal $A$: $d_\phi \le l$. Only $\mathrm{rank}(A) \le k$: $d_\phi \le kl\log_2(8ekl)$.
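
A minimal sketch (illustrative only) of this kernel family and its restricted variants, with the slide's pseudo-dimension bounds noted in the comments; the helper name `mahalanobis_gaussian` is made up.

```python
import numpy as np

def mahalanobis_gaussian(X, A):
    """Gram matrix of K_A(x1, x2) = exp(-(x1 - x2)^T A (x1 - x2)); A symmetric psd."""
    D = X[:, None, :] - X[None, :, :]                    # pairwise differences
    return np.exp(-np.einsum('ija,ab,ijb->ij', D, A, D))

l, k = 4, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((30, l))
L = rng.standard_normal((l, l))
A_full = L @ L.T                                         # full psd A:   d_phi <= l(l+1)/2
A_diag = np.diag(rng.uniform(size=l))                    # diagonal A:   d_phi <= l
R = rng.standard_normal((l, k))
A_lowrank = R @ R.T                                      # rank <= k A:  d_phi <= k*l*log2(8*e*k*l)
K = mahalanobis_gaussian(X, A_full)
```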

Additive vs. Multiplicative
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \big\{ \sum_i \lambda_i K_i \,\big|\, \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \big\}$
Sample complexity analysis: if a predictor attains error err at margin $\gamma$ relative to some $K \in \mathcal{K}$, how many samples are needed to get error err$+\epsilon$?
Answer according to the multiplicative bound: $O\big(k(B/\gamma)^2/\epsilon^2\big)$
Answer according to our (additive) bound: $\tilde O\big(((B/\gamma)^2 + k)/\epsilon^2\big)$
Relaxed approach: just use $\sum_i K_i$.

Feature Space View
Instead of multiple kernels $K_i$, we can think of the implied feature spaces directly: $\phi(x) = \big(\sqrt{\alpha_1}\,\phi_1(x), \sqrt{\alpha_2}\,\phi_2(x), \dots, \sqrt{\alpha_k}\,\phi_k(x)\big)$ and $w = (w_1, w_2, \dots, w_k)$, where $K_i(x,x') = \langle \phi_i(x), \phi_i(x') \rangle$. Weighting each feature space by $\alpha_i$ gives $K = \sum_i \alpha_i K_i$.
Relaxed approach: use the unweighted feature space $\phi(x)$, i.e. $K = \sum_i K_i$, with $\|w\|^2 = \sum_i \|w_i\|^2$. The norm required in the unweighted space is at most the norm in any weighted space, while $B_K^2 = kB^2$.
Estimation bound: $O\big(\sqrt{kB^2\|w\|^2/n}\big)$.
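
A small, self-contained check (illustrative; explicit linear feature maps $\phi_i(x) = M_i x$ are chosen only so the inner products can be computed exactly) that concatenating the weighted feature maps reproduces $\sum_i \alpha_i K_i(x,x')$:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 5
maps = [rng.standard_normal((d, d)) for _ in range(k)]   # explicit feature maps phi_i(x) = M_i x
alpha = np.array([0.2, 0.3, 0.5])
x, xp = rng.standard_normal(d), rng.standard_normal(d)

phi_x  = np.concatenate([np.sqrt(a) * (M @ x)  for a, M in zip(alpha, maps)])
phi_xp = np.concatenate([np.sqrt(a) * (M @ xp) for a, M in zip(alpha, maps)])
lhs = phi_x @ phi_xp                                              # <phi(x), phi(x')>
rhs = sum(a * ((M @ x) @ (M @ xp)) for a, M in zip(alpha, maps))  # sum_i alpha_i K_i(x, x')
print(np.isclose(lhs, rhs))                                       # True
```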

Additive vs. Multiplicative
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \big\{ \sum_i \lambda_i K_i \,\big|\, \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \big\}$
Sample complexity analysis: if a predictor attains error err at margin $\gamma$ relative to some $K \in \mathcal{K}$, how many samples are needed to get error err$+\epsilon$?
Answer according to the multiplicative bound: $O\big(k(B/\gamma)^2/\epsilon^2\big)$
Answer according to our (additive) bound: $\tilde O\big(((B/\gamma)^2 + k)/\epsilon^2\big)$
Relaxed approach: just use $\sum_i K_i$. Margin $\gamma$ relative to some $K \in \mathcal{K}$ implies margin $\gamma$ relative to $\sum_i K_i$, and $B^2_{\sum_i K_i} = \sup_x \sum_i K_i(x,x) \le kB^2$, so the sample complexity is $O\big(k(B/\gamma)^2/\epsilon^2\big)$.
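
A tiny arithmetic sketch (constants and log factors suppressed; the numbers are placeholders) of the two sample-complexity estimates above, learning the combination versus using the unweighted sum:

```python
def learn_combination(B, gamma, k, eps):
    """Additive bound for learning over K_convex: ((B/gamma)^2 + k) / eps^2."""
    return ((B / gamma) ** 2 + k) / eps ** 2

def use_unweighted_sum(B, gamma, k, eps):
    """Using sum_i K_i: B^2 grows to k * B^2, giving k * (B/gamma)^2 / eps^2."""
    return k * (B / gamma) ** 2 / eps ** 2

B, gamma, eps = 1.0, 0.1, 0.05
for k in (2, 10, 100):
    print(k, learn_combination(B, gamma, k, eps), use_unweighted_sum(B, gamma, k, eps))
```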

Learn $\sum_i \alpha_i K_i$ or use $\sum_i K_i$?
Relative to margin $\gamma$ for some $\sum_i \alpha_i K_i$:
Learn $\sum_i \alpha_i K_i$: error of learned predictor $\le$ error of best margin-$\gamma$ predictor with some $\sum_i \alpha_i K_i$ $+\ \sqrt{\tilde O((B/\gamma)^2 + k)/n}$
Use $\sum_i K_i$: error of learned predictor $\le$ error of best margin-$\gamma$ predictor with some $\sum_i \alpha_i K_i$ $+\ \sqrt{O(k(B/\gamma)^2)/n}$
Do we have enough samples to afford the factor of $k$? Is the decrease in estimation error worth the computational cost? (Maybe not, if we have enough data and the estimation error is small anyway.)
Relative to margin $\gamma$ for $\sum_i (1/k) K_i$:
Use $\sum_i K_i$: error of learned predictor $\le$ error of best margin-$\gamma$ predictor with $\sum_i (1/k)K_i$ $+\ \sqrt{O((B/\gamma)^2)/n}$
Learning the weights gives more flexibility, hence lower approximation error, but a $\sqrt{k/n}$ increase to the estimation error. Is the decrease in approximation worth the increase in estimation (and the extra computational cost)?

Alternate View: Group Lasso
Instead of multiple kernels $K_i$, we can think of the implied feature spaces directly: $\phi(x) = \big(\sqrt{\alpha_1}\,\phi_1(x), \dots, \sqrt{\alpha_k}\,\phi_k(x)\big)$ and $w = (w_1,\dots,w_k)$, where $K_i(x,x') = \langle \phi_i(x), \phi_i(x') \rangle$; weighting each feature space by $\alpha_i$ gives $K = \sum_i \alpha_i K_i$.
Relaxed approach: use the unweighted feature space $\phi(x)$, i.e. $K = \sum_i K_i$, with $\|w\|^2 = \sum_i \|w_i\|^2$ and $B_K^2 = kB^2$; the norm required in the unweighted space is at most the norm in any weighted space. Estimation bound: $O\big(\sqrt{kB^2 \sum_i \|w_i\|^2/n}\big)$.
[Bach et al 04] Learning with $\mathcal{K}_{\mathrm{convex}}$ is equivalent to using the unweighted feature space $\phi(x)$ with the block-$\ell_1$ regularizer $\sum_i \|w_i\|$.
Estimation bound for group lasso: $\tilde O\Big(\sqrt{\big(B^2 (\sum_i \|w_i\|)^2 + k\big)/n}\Big)$, where $\|w\|^2 = \sum_i \|w_i\|^2 \le \big(\sum_i \|w_i\|\big)^2$.
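
A short sketch (illustrative) of the two penalties being compared, block-$\ell_1$ versus plain $\ell_2$ over the blocks $w_1,\dots,w_k$, together with the inequality $\|w\| \le \sum_i \|w_i\|$ used above:

```python
import numpy as np

def block_l1(blocks):
    """Group-lasso penalty: sum_i ||w_i||."""
    return sum(np.linalg.norm(w) for w in blocks)

def l2(blocks):
    """Plain L2 norm of the concatenated w: sqrt(sum_i ||w_i||^2)."""
    return np.sqrt(sum(np.linalg.norm(w) ** 2 for w in blocks))

rng = np.random.default_rng(0)
blocks = [rng.standard_normal(4) for _ in range(5)]       # w_1, ..., w_k
print(l2(blocks) <= block_l1(blocks))                     # True: ||w|| <= sum_i ||w_i||
```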

Proof Sketch
Bound the pseudo-dimension $d_\phi(\mathcal{K})$. A standard result on covering numbers in terms of $d_\phi$ gives a covering of $\mathcal{K}$ of size $(L)^{d_\phi(\mathcal{K})}$; standard results on covering numbers of the unit sphere give a covering of $F_K$ of size $(L)^{(B/\epsilon)^2}$.
Construct a covering for $F_{\mathcal{K}}$ as a cross-product: for each kernel $K$ in the covering of $\mathcal{K}$, take the covering of $F_K$. This gives a covering of $F_{\mathcal{K}}$ of size $(L)^{d_\phi(\mathcal{K})} \cdot (L)^{(B/\epsilon)^2}$.
Lemma: if $K$, $K'$ are similar as real-valued functions, every $K$-classifier can be approximated by a $K'$-classifier.
Conclude with generalization bounds in terms of $\log(\text{covering number})$.
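
The cross-product step just multiplies the two covering sizes, so their logs add; a trivial sketch of the resulting log covering number ($L$ stands for the slide's unspecified base factor; the numbers are placeholders):

```python
import numpy as np

def log_cover_size(d_phi, B, eps, L):
    """log of (L)^{d_phi} * (L)^{(B/eps)^2}: the logs of the two cover sizes add."""
    return d_phi * np.log(L) + (B / eps) ** 2 * np.log(L)

print(log_cover_size(d_phi=10, B=1.0, eps=0.1, L=50.0))
```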

Rademacher vs. Covering Numbers
Other bounds rely on calculating the Rademacher complexity $R[F_{\mathcal{K}}]$ of the class of (unit-norm) classifiers with respect to any $K \in \mathcal{K}$. $R[F_{\mathcal{K}}]$ scales with the scale of the functions in $F_{\mathcal{K}}$, i.e. with $B$, and the generalization bounds depend on $R[F_{\mathcal{K}}]/\gamma$. Bounds based on the Rademacher complexity therefore necessarily have a multiplicative dependence on $B/\gamma$.
Covering numbers allow us to combine scale-sensitive and finite-dimensionality (scale-insensitive) arguments, at the cost of messier log factors.

Learning Bounds for SVMs with Learned Kernels
Nati Srebro, Shai Ben-David
Bound on the estimation error for a large-margin classifier with respect to a kernel which is chosen, from a family $\mathcal{K}$, based on the training data:
estimation error $\le \sqrt{\big(\tilde O\big(d_\phi(\mathcal{K}) + (B/\gamma)^2\big) - \log\delta\big)/n}$, where $d_\phi(\mathcal{K})$ is the pseudo-dimension of $\mathcal{K}$ as a family of real-valued functions.
Valid for generic kernelized $L_2$-regularized learning. Easy to obtain bounds for further kernel families. For $\mathcal{K}_{\mathrm{convex}}$: using $\sum_i K_i$ may require $k$ times more data.