Learning Kernels - Tutorial Part III: Theoretical Guarantees.


1 Learning Kernels - Tutorial Part III: Theoretical Guarantees. Corinna Cortes (Google Research), Mehryar Mohri (Courant Institute & Google Research), Afshin Rostamizadeh (UC Berkeley).

2 Standard Learning with Kernels (diagram: the user supplies a kernel $K$; the learning algorithm receives the sample and $K$ and returns a hypothesis $h$.)

3 Learning Kernel Framework (diagram: the user supplies a kernel family $\mathcal{K}$; the learning algorithm receives the sample and $\mathcal{K}$ and returns a pair $(K, h)$, both a kernel and a hypothesis.)

4 Learning Kernels Theoretical questions: what is the price to pay for relaxing the requirement that the user specify a single kernel? how does the choice of the kernel family affect generalization?

5 Part III Non-negative combinations. General case.

6 Kernel Families Most frequently used kernel families, for $q \ge 1$:
$$\mathcal{K}_q = \Big\{ K_\mu = \sum_{k=1}^p \mu_k K_k : \mu \in \Delta_q \Big\}, \quad \text{with } \Delta_q = \big\{ \mu : \mu \ge 0,\ \|\mu\|_q = 1 \big\}.$$
Hypothesis sets:
$$H_q = \big\{ h \in \mathbb{H}_K : K \in \mathcal{K}_q,\ \|h\|_{\mathbb{H}_K} \le 1 \big\}.$$
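
As a concrete illustration (a minimal NumPy sketch; the data, kernel widths, and helper names are ours, not from the tutorial), here is how a kernel $K_\mu \in \mathcal{K}_1$ is formed from base Gaussian Gram matrices; a convex combination of PSD Gram matrices is again PSD:

```python
import numpy as np

def gaussian_gram(X, gamma):
    """Gram matrix of the Gaussian kernel exp(-gamma ||x - x'||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def combined_kernel(base_kernels, mu):
    """K_mu = sum_k mu_k K_k for mu >= 0 (here on the L1 simplex)."""
    return sum(m_k * K for m_k, K in zip(mu, base_kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Ks = [gaussian_gram(X, g) for g in (0.1, 1.0, 10.0)]  # p = 3 base kernels
mu = np.array([0.2, 0.5, 0.3])                        # mu >= 0, ||mu||_1 = 1
K_mu = combined_kernel(Ks, mu)                        # a member of K_1
```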

7 Rademacher Complexity Empirical Rademacher complexity of $H$ for a sample $S = (x_1, \ldots, x_m)$:
$$\widehat{\mathfrak{R}}_S(H) = \mathbb{E}_\sigma\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i)\Big],$$
where the $\sigma_i$s are independent uniform random variables taking values in $\{-1, +1\}$. Rademacher complexity of $H$:
$$\mathfrak{R}_m(H) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}_S(H)\big].$$
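
For the kernel hypothesis sets used here, the inner supremum has a closed form, $\sup_{\|h\|_{\mathbb{H}_K} \le 1} \sum_i \sigma_i h(x_i) = \sqrt{\sigma^\top K \sigma}$ (this is the Cauchy-Schwarz step used in the proof on slide 13), which makes a Monte-Carlo estimate of $\widehat{\mathfrak{R}}_S$ straightforward. A small sketch (our own illustration; function and argument names are assumptions):

```python
import numpy as np

def empirical_rademacher(K, n_draws=2000, seed=0):
    """Monte-Carlo estimate of hat{R}_S for {h : ||h||_K <= 1}, using
    sup_{||h|| <= 1} sum_i sigma_i h(x_i) = sqrt(sigma^T K sigma)."""
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, m))
    sups = np.sqrt(np.einsum("ni,ij,nj->n", sigmas, K, sigmas))
    return sups.mean() / m
```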

8 Single Kernel Margin Bound Theorem (Koltchinskii and Panchenko, 2002): fix $\rho > 0$. Assume that $K(x, x) \le R^2$ for all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$$R(h) \le \widehat{R}_\rho(h) + 2\sqrt{\frac{R^2/\rho^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

9 Early Learning Kernel Bounds (Bousquet and Herrmann, 2003; Lanckriet et al., 2004): for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$$R(h) \le \widehat{R}_\rho(h) + \frac{1}{\rho}\sqrt{\frac{\max_{k=1}^p \mathrm{Tr}(K_k)\,\max_{k=1}^p \frac{\|K_k\|}{\mathrm{Tr}(K_k)}}{m}\,\log\frac{1}{\delta}}.$$
but, bound always greater than one (Srebro and Ben-David, 2006)! other bound of (Lanckriet et al., 2004) for the linear combination case also always greater than one!

10 Multiplicative Learning Bound (Lanckriet et al., 2004) Assume that $K_k(x, x) \le R^2$ for all $k \in [1, p]$ and all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$$R(h) \le \widehat{R}_\rho(h) + O\bigg(\sqrt{\frac{p\,R^2/\rho^2}{m}}\bigg).$$
bound multiplicative in $p$ (number of kernels).

11 Additive Learning Bound (Srebro and Ben-David, 2006) Assume that $K_k(x, x) \le R^2$ for all $k \in [1, p]$ and all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$$R(h) \le \widehat{R}_\rho(h) + \sqrt{\frac{8\Big(2 + p \log\frac{128 e m^3 R^2}{\rho^2 p} + 256\,\frac{R^2}{\rho^2}\log\frac{\rho e m}{8R}\log\frac{128 m R^2}{\rho^2}\Big) + \log(1/\delta)}{m}}.$$
bound additive in $p$ (modulo log terms). not informative for $p > m$. similar guarantees for other families, based on the pseudo-dimension of the kernel family.

12 New Data-Dependent Bound (CC, MM, and AR, 2010) Theorem: for any sample $S$ of size $m$ and any positive integer $r$,
$$\widehat{\mathfrak{R}}_S(H_1) \le \frac{\sqrt{r\,\|u\|_r}}{m}, \quad \text{with } u = \big(\mathrm{Tr}[K_1], \ldots, \mathrm{Tr}[K_p]\big).$$
similarity with single kernel bound. can be used directly to derive an algorithm.
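
Since the bound holds for every positive integer $r$, it can be minimized over $r$ in practice. A sketch of that computation (our helper; it plugs directly into the displayed bound):

```python
import numpy as np

def l1_rademacher_bound(base_kernels, r_max=50):
    """min over integer r of sqrt(r * ||u||_r) / m,
    with u = (Tr[K_1], ..., Tr[K_p])."""
    m = base_kernels[0].shape[0]
    u = np.array([np.trace(K) for K in base_kernels])
    return min(np.sqrt(r * np.linalg.norm(u, ord=r)) / m
               for r in range(1, r_max + 1))
```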

13 New Data-Dependent Bound Proof: let $q, r \ge 1$ with $\frac{1}{q} + \frac{1}{r} = 1$.
$$\begin{aligned}
\widehat{\mathfrak{R}}_S(H_q)
&= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{h \in H_q} \sum_{i=1}^m \sigma_i h(x_i)\Big]
 = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q,\ \alpha^\top K_\mu \alpha \le 1} \sum_{i,j=1}^m \sigma_i \alpha_j K_\mu(x_i, x_j)\Big] \\
&= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q,\ \|K_\mu^{1/2}\alpha\| \le 1} \big\langle K_\mu^{1/2}\sigma,\, K_\mu^{1/2}\alpha \big\rangle\Big]
 = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q} \sqrt{\sigma^\top K_\mu \sigma}\Big] && \text{(Cauchy-Schwarz)} \\
&= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q} \sqrt{\mu \cdot u_\sigma}\Big]
 = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sqrt{\|u_\sigma\|_r}\Big], && \text{(definition of dual norm)}
\end{aligned}$$
with $u_\sigma = \big(\sigma^\top K_1 \sigma, \ldots, \sigma^\top K_p \sigma\big)$.
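
The dual-norm step can be checked numerically: for $u \ge 0$, $\sup_{\mu \ge 0,\,\|\mu\|_q \le 1} \mu \cdot u = \|u\|_r$, attained at $\mu_k \propto u_k^{r-1}$. A quick sanity check (our illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.random(5)                    # u >= 0, as in the proof (u_sigma >= 0)
q, r = 1.5, 3.0                      # conjugate exponents: 1/1.5 + 1/3 = 1
mu = u ** (r - 1)
mu /= np.linalg.norm(mu, ord=q)      # place mu on the Lq sphere
print(mu @ u, np.linalg.norm(u, ord=r))   # the two values coincide
```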

14 New Data-Dependent Bound Proof: in the following, $r \ge 1$ is an arbitrary integer.
$$\begin{aligned}
\widehat{\mathfrak{R}}_S(H_1) = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sqrt{\|u_\sigma\|_\infty}\Big]
&\le \frac{1}{m}\,\mathbb{E}_\sigma\bigg[\Big(\sum_{k=1}^p (\sigma^\top K_k \sigma)^r\Big)^{\frac{1}{2r}}\bigg] && \big(\forall r \ge 1,\ \|u_\sigma\|_\infty \le \|u_\sigma\|_r\big) \\
&\le \frac{1}{m}\,\Big(\sum_{k=1}^p \mathbb{E}_\sigma\big[(\sigma^\top K_k \sigma)^r\big]\Big)^{\frac{1}{2r}} && \text{(Jensen's inequality)} \\
&\le \frac{1}{m}\,\Big(\sum_{k=1}^p \big(r\,\mathrm{Tr}[K_k]\big)^r\Big)^{\frac{1}{2r}} = \frac{\sqrt{r\,\|u\|_r}}{m}. && \text{(lemma)}
\end{aligned}$$

15 Key Lemma Lemma: let $K$ be a kernel matrix for a finite sample. Then, for any integer $r \ge 1$,
$$\mathbb{E}_\sigma\big[(\sigma^\top K \sigma)^r\big] \le \big(r\,\mathrm{Tr}[K]\big)^r.$$
proof based on a combinatorial argument.
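
The lemma is easy to probe by simulation (our own check, not a substitute for the combinatorial proof):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(20, 20))
K = B @ B.T                                        # a PSD kernel matrix
sigma = rng.choice([-1.0, 1.0], size=(100_000, 20))
quad = np.einsum("ni,ij,nj->n", sigma, K, sigma)   # sigma^T K sigma
for r in (1, 2, 3):
    lhs = np.mean(quad ** r)
    rhs = (r * np.trace(K)) ** r
    print(f"r={r}: E[(s^T K s)^r] = {lhs:.3e} <= (r Tr[K])^r = {rhs:.3e}")
```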

16 New Learning Bound - L1 (CC, MM, and AR, 2010) Theorem: assume that $K_k(x, x) \le R^2$ for all $k \in [1, p]$ and all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$$R(h) \le \widehat{R}_\rho(h) + 2\sqrt{\frac{e \lceil \log p \rceil\, R^2/\rho^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
very weak dependency on $p$, no extra log terms. analysis based on Rademacher complexity. bound valid even for $p \ge m$. see also (Kakade et al., 2010).
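
A sketch evaluating the slack of this bound (our helper with illustrative defaults for $R$, $\rho$, $\delta$; it simply plugs into the displayed formula) shows how mild the growth in $p$ is:

```python
import numpy as np

def l1_bound_slack(p, m, R=1.0, rho=0.2, delta=0.01):
    """2 sqrt(e ceil(log p) (R/rho)^2 / m) + sqrt(log(1/delta) / (2m))."""
    complexity = 2 * np.sqrt(np.e * np.ceil(np.log(p)) * (R / rho) ** 2 / m)
    confidence = np.sqrt(np.log(1 / delta) / (2 * m))
    return complexity + confidence

for p in (10, 10**3, 10**6):                 # a 10^5-fold increase in p...
    print(p, l1_bound_slack(p, m=10**6))     # ...barely moves the slack
```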

17 Comparison (figure: the bound of (Srebro & Ben-David, 2006) vs. our 2010 bound as a function of $m$ (in millions), for $p = m$, $m^{5/6}$, $m^{4/5}$, $m^{3/5}$, $m^{1/3}$, and a constant value of $p$, with $\rho/R = 0.2$.)

18 Lower Bound Tight bound: the $\log p$ dependency cannot be improved. argument based on the VC dimension of the following example. Observations: case $X = \{-1, +1\}^p$, canonical projection kernels $K_k(x, x') = x_k x'_k$. $H_1$ contains $J_p = \{x \mapsto s\,x_k : k \in [1, p],\ s \in \{-1, +1\}\}$. $\mathrm{VCdim}(J_p) = \Omega(\log p)$. for $\rho = 1$ and $h \in J_p$, $\widehat{R}_\rho(h) = \widehat{R}(h)$. VC lower bound: $R(h) - \widehat{R}(h) = \Omega\big(\sqrt{\mathrm{VCdim}(J_p)/m}\big)$.

19 New Paper Recent claim (Hussain and Shawe-Taylor, AISTATS 2011): additive bound in terms of $\log p$, instead of multiplicative. main proof incorrect: probabilistic bound on the Rademacher complexity, but slack term left out of the proof of theorem 8; adding it back yields a multiplicative bound. however: the authors are preparing a new version (private communication: J. Shawe-Taylor).

20 New Learning Bound - Lq (CC, MM, and AR, 2010) Theorem: let $q, r \ge 1$ with $\frac{1}{q} + \frac{1}{r} = 1$ and $r$ an integer. Assume that $K_k(x, x) \le R^2$ for all $k \in [1, p]$ and all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_q$,
$$R(h) \le \widehat{R}_\rho(h) + 2\,p^{\frac{1}{2r}}\sqrt{\frac{r\,R^2/\rho^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
mild dependency on $p$. analysis based on Rademacher complexity.
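
A companion sketch for the $L_q$ case (again our helper with illustrative defaults, plugging into the displayed formula); for $q = 2$ the conjugate exponent is $r = 2$ and the dependency is $p^{1/4}$:

```python
import numpy as np

def lq_bound_slack(p, m, r, R=1.0, rho=0.2, delta=0.01):
    """2 p^(1/(2r)) sqrt(r (R/rho)^2 / m) + sqrt(log(1/delta) / (2m))."""
    complexity = 2 * p ** (1 / (2 * r)) * np.sqrt(r * (R / rho) ** 2 / m)
    return complexity + np.sqrt(np.log(1 / delta) / (2 * m))
```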

21 Lower Bound Tight bound: the $p^{\frac{1}{2r}}$ dependency cannot be improved. in particular, $p^{\frac{1}{4}}$ is tight for $L_2$ regularization. Observations: take all kernels equal: $\sum_{k=1}^p \mu_k K_k = \big(\sum_{k=1}^p \mu_k\big) K_1$, so that $\|h\|^2_{\mathbb{H}_{K_1}} = \big(\sum_{k=1}^p \mu_k\big)\,\|h\|^2_{\mathbb{H}_{K_\mu}}$ for $\sum_{k=1}^p \mu_k \ne 0$. $\sum_{k=1}^p \mu_k \le p^{\frac{1}{r}}\,\|\mu\|_q = p^{\frac{1}{r}}$ (Hölder's inequality). thus, $H_q$ coincides with $\big\{h \in \mathbb{H}_{K_1} : \|h\|_{\mathbb{H}_{K_1}} \le p^{\frac{1}{2r}}\big\}$.

22 Comparison L1 vs Lq (figure: the $L_1$ and $L_q$ bounds as a function of $m$ (in millions), for $p = m$, $p = m^{1/3}$, and $p = 20$.)

23 Conclusion Theory: tight generalization bounds for learning kernels with $L_1$ or $L_q$ regularization ($\log p$ and $p^{\frac{1}{2r}}$ dependency). mild dependency on $p$. similar proof and analysis for other regularizations. Applications: results suggest using a large number of kernels. recent results show significant improvements (CC, MM, AR, ICML 2010).

24 Part III Non-negative combinations. General case.

25 Kernel Family General case: $\mathcal{K}$ a family of kernels bounded by $R$. finite pseudo-dimension: $\mathrm{Pdim}(\mathcal{K}) < \infty$. general hypothesis set: $H_{\mathcal{K}} = \big\{h \in \mathbb{H}_K : K \in \mathcal{K},\ \|h\|_{\mathbb{H}_K} \le 1\big\}$.

26 Shattering Definition: let $H$ be a hypothesis set of functions from $X$ to $\mathbb{R}$. $A = \{x_1, \ldots, x_m\}$ is shattered by $H$ if there exist $t_1, \ldots, t_m \in \mathbb{R}$ such that
$$\Bigg|\Bigg\{\begin{pmatrix} \mathrm{sgn}\big(L(h(x_1), f(x_1)) - t_1\big) \\ \vdots \\ \mathrm{sgn}\big(L(h(x_m), f(x_m)) - t_m\big) \end{pmatrix} : h \in H \Bigg\}\Bigg| = 2^m.$$
(figure: thresholds $t_1, t_2$ at points $x_1, x_2$.)
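
For a finite pool of hypotheses and fixed thresholds, the condition can be checked by brute force. A small sketch (ours; it tests the sign-pattern condition on real values, i.e. the Pollard form of slide 27, with the losses $L(h(x_i), f(x_i))$ playing the role of the values on this slide):

```python
import numpy as np

def is_pseudo_shattered(values, t):
    """values[j, i] = j-th hypothesis evaluated at x_i; t = thresholds.
    Shattered iff all 2^m above/below-threshold patterns are realized."""
    m = values.shape[1]
    patterns = {tuple(v > t) for v in np.asarray(values)}
    return len(patterns) == 2 ** m

# Example: four functions on two points realize all 2^2 patterns.
vals = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
print(is_pseudo_shattered(vals, t=np.array([0.5, 0.5])))   # True
```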

27 Pseudo-Dimension Definition: let $H$ be a hypothesis set of functions from $X$ to $\mathbb{R}$. The pseudo-dimension of $H$, $\mathrm{Pdim}(H)$, is the size of the largest set shattered by $H$. Definition (Pollard, 1984; equivalent, see also (Vapnik, 1995)):
$$\mathrm{Pdim}(H) = \mathrm{VCdim}\big(\big\{(x, t) \mapsto 1_{(h(x) - t) > 0} : h \in H\big\}\big).$$

28 Pseudo-Dimension - Properties Theorem: pseudo-dimension of hyperplanes:
$$\mathrm{Pdim}\big(\{x \mapsto w \cdot x + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}\big) = N + 1.$$
Theorem: pseudo-dimension of a vector space of real-valued functions $H$:
$$\mathrm{Pdim}(H) = \dim(H).$$
Theorem: pseudo-dimension of $\varphi(H) = \{\varphi \circ h : h \in H\}$, where $\varphi$ is a monotone function:
$$\mathrm{Pdim}(\varphi(H)) \le \mathrm{Pdim}(H).$$

29 General Pdim Learning Bound (Srebro and Ben-David, 2006) Let $\mathcal{K}$ be a family of kernel functions bounded by $R$ and let $d = \mathrm{Pdim}(\mathcal{K})$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_{\mathcal{K}}$,
$$R(h) \le \widehat{R}_\rho(h) + \sqrt{\frac{8\Big(2 + d \log\frac{128 e m^3 R^2}{\rho^2 d} + 256\,\frac{R^2}{\rho^2}\log\frac{\rho e m}{8R}\log\frac{128 m R^2}{\rho^2}\Big) + \log(1/\delta)}{m}}.$$
bound additive in $d$ (modulo log terms). not informative for $d > m$.

30 Application: Linear Combinations Linear and non-negative combinations of base kernels (previous section):
$$\mathcal{K}_{\mathrm{lin}} = \Big\{ K_\mu = \sum_{k=1}^p \mu_k K_k : \sum_{k=1}^p \mu_k = 1,\ K_\mu \succeq 0 \Big\}, \qquad \mathcal{K}_1 = \Big\{ K_\mu = \sum_{k=1}^p \mu_k K_k : \sum_{k=1}^p \mu_k = 1,\ \mu \ge 0 \Big\}.$$
Since $\mathcal{K}_1 \subseteq \mathcal{K}_{\mathrm{lin}}$:
$$\mathrm{Pdim}(\mathcal{K}_1) \le \mathrm{Pdim}(\mathcal{K}_{\mathrm{lin}}) \le \dim\Big(\Big\{\sum_{k=1}^p \mu_k K_k : \mu \in \mathbb{R}^p\Big\}\Big) = p.$$

31 Application: Gaussian Kernels Gaussian kernels with a fixed covariance matrix:
$$\mathcal{K}_{\mathrm{Gaussian}} = \big\{(x_1, x_2) \mapsto \exp\big(-(x_2 - x_1)^\top A (x_2 - x_1)\big) : A \in S^N_+\big\}.$$
since $\exp$ is monotone and
$$\big\{(x_1, x_2) \mapsto (x_2 - x_1)^\top A (x_2 - x_1) : A \in S^N_+\big\} = \Big\{(x_1, x_2) \mapsto \sum_{i,j=1}^N A_{ij} (x_2 - x_1)_i (x_2 - x_1)_j : A \in S^N_+\Big\}$$
$$\subseteq \mathrm{span}\big\{(x_1, x_2) \mapsto (x_2 - x_1)_i (x_2 - x_1)_j : 1 \le i \le j \le N\big\},$$
we obtain $\mathrm{Pdim}(\mathcal{K}_{\mathrm{Gaussian}}) \le \frac{N(N+1)}{2}$. similarly, for diagonal $A$, $\mathrm{Pdim}(\mathcal{K}_{\mathrm{Gaussian}}) \le N$.
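
A member of $\mathcal{K}_{\mathrm{Gaussian}}$ is easy to instantiate (a minimal sketch; names are ours), which makes the parameterization by $A \in S^N_+$ concrete:

```python
import numpy as np

def gaussian_kernel(x1, x2, A):
    """K(x1, x2) = exp(-(x2 - x1)^T A (x2 - x1)), A PSD."""
    d = x2 - x1
    return np.exp(-d @ A @ d)

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
A = B @ B.T                          # A in S^N_+ by construction
x1, x2 = rng.normal(size=4), rng.normal(size=4)
print(gaussian_kernel(x1, x2, A))
```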

32 References Bousquet, Olivier and Herrmann, Daniel J. L. On the complexity of learning the kernel matrix. In NIPS, 2003. Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Generalization bounds for learning kernels. In ICML, 2010. Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Two-stage learning kernel methods. In ICML, 2010. Hussain, Zakria and Shawe-Taylor, John. Improved loss bounds for multiple kernel learning. In AISTATS, 2011. Kakade, Sham M., Shalev-Shwartz, Shai, and Tewari, Ambuj. Applications of strong convexity - strong smoothness duality to learning with matrices. arXiv, v1. Koltchinskii, V. and Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002. Koltchinskii, Vladimir and Yuan, Ming. Sparse recovery in large ensembles of kernel machines. In COLT, 2008.

33 References Lanckriet, Gert, Cristianini, Nello, Bartlett, Peter, El Ghaoui, Laurent, and Jordan, Michael. Learning the kernel matrix with semidefinite programming. JMLR, 5, 2004. Srebro, Nathan and Ben-David, Shai. Learning bounds for support vector machines with learned kernels. In COLT, 2006. Ying, Yiming and Campbell, Colin. Generalization bounds for learning the kernel problem. In COLT, 2009.
