Advanced Machine Learning: Learning Kernels. MEHRYAR MOHRI (MOHRI@), Courant Institute & Google Research.
Outline: Kernel methods. Learning kernels scenario. Learning bounds. Algorithms. page 2
Machine Learning Components: user → features → sample → algorithm → hypothesis h. page 3
Machine Learning Components: user → features (critical task) → sample → algorithm (main focus of the ML literature) → hypothesis h. page 4
Kernel Methods. Features implicitly defined via the choice of a PDS kernel $K\colon X \times X \to \mathbb{R}$, interpreted as a similarity measure: there exists a feature map $\Phi\colon X \to \mathbb{H}$ such that for all $x, y \in X$, $\langle \Phi(x), \Phi(y) \rangle = K(x, y)$. Flexibility: the PDS kernel can be chosen arbitrarily. Helps extend a variety of algorithms to non-linear predictors, e.g., SVMs, KRR, SVR, KPCA. The PDS condition is directly related to the convexity of the optimization problem. page 5
Example - Polynomial Kernels. Definition: $\forall x, y \in \mathbb{R}^N$, $K(x, y) = (x \cdot y + c)^d$, $c > 0$. Example: for $N = 2$ and $d = 2$, $K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 = \big(x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c\big) \cdot \big(y_1^2,\ y_2^2,\ \sqrt{2}\,y_1 y_2,\ \sqrt{2c}\,y_1,\ \sqrt{2c}\,y_2,\ c\big)$. page 6
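The identity above can be checked numerically; the following is a minimal sketch (the helper names poly_kernel and phi are illustrative, not from the slides):

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x . y + c)^d."""
    return (np.dot(x, y) + c) ** d

def phi(x, c=1.0):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0 * c) * x1,
                     np.sqrt(2.0 * c) * x2,
                     c])

x, y = np.array([1.0, -2.0]), np.array([0.5, 3.0])
# K(x, y) and <phi(x), phi(y)> agree up to numerical precision.
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))
```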
XOR Problem. Use a second-degree polynomial kernel with $c = 1$: the four XOR points $(-1,-1)$, $(1,1)$, $(-1,1)$, $(1,-1)$ are not linearly separable in the input space $(x_1, x_2)$, but they map to $(1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1)$, $(1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1)$, $(1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1)$, $(1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$ in feature space, where they are linearly separable by the hyperplane defined by the coordinate $x_1 x_2 = 0$. page 7
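A small numerical illustration of this slide (my own sketch; the variable names are assumptions): in the original coordinates no linear classifier separates the XOR labels, but the single feature-map coordinate $\sqrt{2}\,x_1 x_2$ already does.

```python
import numpy as np

# XOR points and labels: label = sign of x1 * x2.
X = np.array([[-1, -1], [1, 1], [-1, 1], [1, -1]], dtype=float)
y = np.array([1, 1, -1, -1])

# Third coordinate of the degree-2 polynomial feature map with c = 1.
feature = np.sqrt(2.0) * X[:, 0] * X[:, 1]

# The hyperplane "x1 * x2 = 0" in feature space classifies all four points.
assert np.array_equal(np.sign(feature).astype(int), y)
```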
Other Standard PDS Kernels. Gaussian kernels: $K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$, $\sigma \neq 0$; this is the normalized kernel of $(x, x') \mapsto \exp\left(\frac{x \cdot x'}{\sigma^2}\right)$. Sigmoid kernels: $K(x, y) = \tanh(a (x \cdot y) + b)$, $a, b \geq 0$. page 8
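For concreteness, here is a minimal sketch of Gram-matrix computations for these two kernels (helper names and parameter values are my own, not from the slides):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix of the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def sigmoid_gram(X, a=0.5, b=0.0):
    """Gram matrix of the sigmoid kernel tanh(a (x . y) + b)."""
    return np.tanh(a * (X @ X.T) + b)

X = np.random.RandomState(0).randn(5, 3)
K = gaussian_gram(X, sigma=2.0)   # symmetric, with ones on the diagonal
```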
SVM (Cortes and Vapnik, 1995; Boser, Guyon, and Vapnik, 1992). Primal: $\min_{w, b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \max\big(0,\ 1 - y_i (w \cdot \Phi(x_i) + b)\big)$. Dual: $\max_{\alpha} \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j)$ subject to: $0 \leq \alpha_i \leq C \wedge \sum_{i=1}^m \alpha_i y_i = 0,\ i \in [1, m]$. page 9
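Any PDS kernel can be plugged into an off-the-shelf SVM solver through its Gram matrix; one way to do this is scikit-learn's SVC with kernel="precomputed", sketched below (the data and the helper poly_gram are illustrative assumptions, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train, X_test = rng.randn(40, 5), rng.randn(10, 5)
y_train = np.sign(X_train[:, 0] + 0.1 * rng.randn(40))

def poly_gram(A, B, c=1.0, d=2):
    """Gram matrix of the polynomial kernel between rows of A and rows of B."""
    return (A @ B.T + c) ** d

# SVC with a precomputed Gram matrix solves the dual shown above.
clf = SVC(C=1.0, kernel="precomputed")
clf.fit(poly_gram(X_train, X_train), y_train)
pred = clf.predict(poly_gram(X_test, X_train))
```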
Kernel Ridge Regression (Hoerl and Kennard, 1970; Saunders et al., 1998). Primal: $\min_{w} \lambda \|w\|^2 + \sum_{i=1}^m (w \cdot \Phi(x_i) - y_i)^2$. Dual: $\max_{\alpha \in \mathbb{R}^m} -\alpha^\top (K + \lambda I)\alpha + 2\,\alpha^\top y$, whose solution is $\alpha = (K + \lambda I)^{-1} y$. page 10
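A minimal sketch of the dual solution (the helper names are mine): compute $\alpha = (K + \lambda I)^{-1} y$ and predict with $h(x) = \sum_i \alpha_i K(x_i, x)$.

```python
import numpy as np

def krr_fit(K, y, lam=1.0):
    """Dual KRR solution alpha = (K + lam I)^(-1) y."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), y)

def krr_predict(K_test_train, alpha):
    """Predictions h(x) = sum_i alpha_i K(x_i, x) for each test point."""
    return K_test_train @ alpha

rng = np.random.RandomState(1)
X = rng.randn(30, 4)
y = X[:, 0] + 0.1 * rng.randn(30)
K = (X @ X.T + 1.0) ** 2          # degree-2 polynomial kernel
alpha = krr_fit(K, y, lam=0.5)
y_hat = krr_predict(K, alpha)     # in-sample predictions
```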
Questions. How should the user choose the kernel? The problem is similar to that of selecting features for other learning algorithms: with a poor choice, learning is made very difficult; with a good choice, even poor learners can succeed. This requirement on the user is thus critical. Can this requirement be lessened? Is a more automatic selection of features possible? page 11
Outline: Kernel methods. Learning kernels scenario. Learning bounds. Algorithms. page 12
Standard Learning with Kernels: the user selects a single kernel K; the algorithm receives the sample and K and returns a hypothesis h. page 13
Learning Kernel Framework: the user selects a family of kernels $\mathcal{K}$; the algorithm receives the sample and $\mathcal{K}$ and returns both a kernel and a hypothesis, i.e., a pair (K, h). page 14
Kernel Families. Most frequently used kernel families, $q \geq 1$: $\mathcal{K}_q = \big\{K_\mu : K_\mu = \sum_{k=1}^p \mu_k K_k,\ \mu = (\mu_1, \ldots, \mu_p)^\top \in \Delta_q\big\}$ with $\Delta_q = \{\mu : \mu \geq 0,\ \|\mu\|_q = 1\}$. Hypothesis sets: $H_q = \{h \in H_K : K \in \mathcal{K}_q,\ \|h\|_{H_K} \leq 1\}$. page 15
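A sketch of how such a combination kernel can be assembled from base Gram matrices (the normalization step and all names are my own illustration):

```python
import numpy as np

def combination_gram(base_grams, mu, q=1.0):
    """Gram matrix K_mu = sum_k mu_k K_k with mu >= 0 normalized to ||mu||_q = 1."""
    mu = np.maximum(np.asarray(mu, dtype=float), 0.0)
    mu = mu / np.linalg.norm(mu, ord=q)          # project onto the L_q sphere
    return sum(m_k * K_k for m_k, K_k in zip(mu, base_grams))

rng = np.random.RandomState(0)
X = rng.randn(20, 6)
# Base kernels built from single features: K_k(x, x') = x_k * x'_k.
base_grams = [np.outer(X[:, k], X[:, k]) for k in range(X.shape[1])]
K_mu = combination_gram(base_grams, mu=np.ones(len(base_grams)), q=1.0)
```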
Relation between Norms. Lemma: for $p, q \in (0, +\infty]$ with $p \leq q$, the following holds: $\forall x \in \mathbb{R}^N$, $\|x\|_q \leq \|x\|_p \leq N^{\frac{1}{p} - \frac{1}{q}} \|x\|_q$.
Proof: for the left inequality, observe that for $x \neq 0$, $\Big[\frac{\|x\|_p}{\|x\|_q}\Big]^p = \sum_{i=1}^N \Big[\frac{|x_i|}{\|x\|_q}\Big]^p \geq \sum_{i=1}^N \Big[\frac{|x_i|}{\|x\|_q}\Big]^q = 1$, since each ratio is at most 1 and $p \leq q$. The right inequality follows from Hölder's inequality: $\|x\|_p = \Big[\sum_{i=1}^N |x_i|^p\Big]^{\frac{1}{p}} \leq \Big[\sum_{i=1}^N (|x_i|^p)^{\frac{q}{p}}\Big]^{\frac{1}{q}} \Big[\sum_{i=1}^N 1^{\frac{q}{q-p}}\Big]^{\frac{1}{p} - \frac{1}{q}} = \|x\|_q\, N^{\frac{1}{p} - \frac{1}{q}}$. page 16
Single Kernel Guarantee (Koltchinskii and Panchenko, 2002). Theorem: fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_1$: $R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\,\frac{\sqrt{\mathrm{Tr}[K]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$. page 17
Multiple Kernel Guarantee (Cortes, MM, and Rostamizadeh, 2010). Theorem: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$ and any integer $1 \leq s \leq r$: $R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\,\frac{\sqrt{s\,\|u\|_s}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$, with $u = (\mathrm{Tr}[K_1], \ldots, \mathrm{Tr}[K_p])$. page 18
Proof. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then,
$\widehat{\mathfrak{R}}_S(H_q) = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{h \in H_q} \sum_{i=1}^m \sigma_i h(x_i)\Big] = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q,\ \alpha^\top K_\mu \alpha \leq 1} \sum_{i,j=1}^m \sigma_i \alpha_j K_\mu(x_i, x_j)\Big] = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q,\ \alpha^\top K_\mu \alpha \leq 1} \sigma^\top K_\mu \alpha\Big]$
$= \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q,\ \|K_\mu^{1/2}\alpha\| \leq 1} \big(K_\mu^{1/2}\sigma\big) \cdot \big(K_\mu^{1/2}\alpha\big)\Big] = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q} \|K_\mu^{1/2}\sigma\|\Big]$ (Cauchy-Schwarz)
$= \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q} \sqrt{\sigma^\top K_\mu \sigma}\Big] = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q} \sqrt{\mu \cdot u'}\Big]$ with $u' = (\sigma^\top K_1 \sigma, \ldots, \sigma^\top K_p \sigma)^\top$
$= \frac{1}{m}\mathbb{E}_\sigma\Big[\sqrt{\|u'\|_r}\Big]$. (definition of the dual norm) page 19
Lemma (Cortes, MM, and Rostamizadeh, 2010): let $K$ be a kernel matrix for a finite sample. Then, for any integer $r$, $\mathbb{E}_\sigma\big[(\sigma^\top K \sigma)^r\big] \leq \big(r\,\mathrm{Tr}[K]\big)^r$. Proof: combinatorial argument. page 20
Proof. For any integer $1 \leq s \leq r$,
$\widehat{\mathfrak{R}}_S(H_q) = \frac{1}{m}\mathbb{E}_\sigma\big[\sqrt{\|u'\|_r}\big] \leq \frac{1}{m}\mathbb{E}_\sigma\big[\sqrt{\|u'\|_s}\big] = \frac{1}{m}\mathbb{E}_\sigma\Big[\Big(\sum_{k=1}^p (\sigma^\top K_k \sigma)^s\Big)^{\frac{1}{2s}}\Big]$
$\leq \frac{1}{m}\Big[\sum_{k=1}^p \mathbb{E}_\sigma\big[(\sigma^\top K_k \sigma)^s\big]\Big]^{\frac{1}{2s}}$ (Jensen's inequality)
$\leq \frac{1}{m}\Big[\sum_{k=1}^p \big(s\,\mathrm{Tr}[K_k]\big)^s\Big]^{\frac{1}{2s}}$ (lemma)
$= \frac{\sqrt{s\,\|u\|_s}}{m}$. page 21
$L_1$ Learning Bound (Cortes, MM, and Rostamizadeh, 2010). Corollary: fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_1$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\,\frac{\sqrt{e \lceil \log p \rceil \max_{k=1}^p \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$.
Observations: weak dependency on $p$: only $\sqrt{\lceil \log p \rceil}$. Bound valid for $p \gg m$: since $\mathrm{Tr}[K_k] \leq m \max_x K_k(x, x)$, the middle term is at most $\frac{2}{\rho}\sqrt{\frac{e \lceil \log p \rceil}{m}} \max_{k, x} \sqrt{K_k(x, x)}$. page 22
Proof. For $q = 1$, the bound of the previous theorem holds for any integer $s \geq 1$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\,\frac{\sqrt{s\,\|u\|_s}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$,
with $s\,\|u\|_s = s\Big[\sum_{k=1}^p \mathrm{Tr}[K_k]^s\Big]^{\frac{1}{s}} \leq s\, p^{\frac{1}{s}} \max_{k=1}^p \mathrm{Tr}[K_k]$.
The function $s \mapsto s\, p^{\frac{1}{s}}$ reaches its minimum at $s = \log p$; choosing the integer $s = \lceil \log p \rceil$ gives $s\, p^{\frac{1}{s}} \leq e \lceil \log p \rceil$, which yields the corollary. page 23
Lower Bound. Tight bound: the $\sqrt{\log p}$ dependency cannot be improved; argument based on the VC dimension, for example. Observations: consider the case $\rho = 1$, $X = \{-1, +1\}^p$, and the canonical projection kernels $K_k(x, x') = x_k x'_k$. Then $H_1$ contains $J_p = \{x \mapsto s\,x_k : k \in [1, p],\ s \in \{-1, +1\}\}$ and $\mathrm{VCdim}(J_p) = \Theta(\log p)$. For $h \in J_p$ and $\rho = 1$, $\widehat{R}_\rho(h) = \widehat{R}(h)$, and the VC lower bound gives a term in $\sqrt{\mathrm{VCdim}(J_p)/m}$. page 24
Pseudo-Dimension Bound (Srebro and Ben-David, 2006). Assume that $K_k(x, x) \leq R^2$ for all $k \in [1, p]$. Then, for any $\rho > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$R(h) \leq \widehat{R}_\rho(h) + \sqrt{\frac{8}{m}\Big[2 + p \log \frac{128\, e\, m^3 R^2}{\rho^2 p} + 256 \frac{R^2}{\rho^2} \log \frac{\rho e m}{8R} \log \frac{128\, m R^2}{\rho^2} + \log\frac{1}{\delta}\Big]}$.
Observations: bound additive in $p$ (modulo log terms); not informative for $p > m$; based on the pseudo-dimension of the kernel family; similar guarantees for other families. page 25
Comparison of the bounds [plot; $\rho / R = .2$]. page 26
$L_q$ Learning Bound (Cortes, MM, and Rostamizadeh, 2010). Corollary: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\,\frac{\sqrt{r\, p^{\frac{1}{r}} \max_{k=1}^p \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$.
Observations: mild dependency on $p$: only $p^{\frac{1}{2r}}$. Note that $\mathrm{Tr}[K_k] \leq m \max_x K_k(x, x)$. page 27
Lower Bound. Tight bound: the $p^{\frac{1}{2r}}$ dependency cannot be improved; in particular, the bound is tight for $L_2$ regularization, where it is $p^{\frac{1}{4}}$. Observations: consider $p$ equal base kernels, $K_k = K_1$ for all $k$, so that $\sum_{k=1}^p \mu_k K_k = \big(\sum_{k=1}^p \mu_k\big) K_1$. For $h \in H_{K_1}$, $\|h\|^2_{H_{K_\mu}} = \|h\|^2_{H_{K_1}} / \sum_{k=1}^p \mu_k$, and $\sum_{k=1}^p \mu_k \leq p^{\frac{1}{r}} \|\mu\|_q = p^{\frac{1}{r}}$ (Hölder's inequality). Thus, $H_q$ coincides with $\{h \in H_{K_1} : \|h\|_{H_{K_1}} \leq p^{\frac{1}{2r}}\}$. page 28
Outline: Kernel methods. Learning kernels scenario. Learning bounds. Algorithms. page 29
General LK Formulation - SVMs. Notation: $\mathcal{K}$: set of PDS kernel functions; $\mathbf{K}$: set of kernel matrices associated to $\mathcal{K}$, assumed convex; $Y \in \mathbb{R}^{m \times m}$: diagonal matrix with $Y_{ii} = y_i$. Optimization problem:
$\min_{K \in \mathbf{K}} \max_{\alpha} 2\,\mathbf{1}^\top \alpha - \alpha^\top Y K Y \alpha$ subject to: $0 \leq \alpha \leq C \wedge \alpha^\top y = 0$.
Convex problem: the function is linear in $K$, and convexity is preserved by the pointwise maximum over $\alpha$. page 30
Parameterized LK Formulation. Notation: $(K_\mu)_{\mu \in M}$: parameterized set of PDS kernel functions, with $M$ a convex set and $\mu \mapsto K_\mu$ a concave function; $Y \in \mathbb{R}^{m \times m}$: diagonal matrix with $Y_{ii} = y_i$. Optimization problem:
$\min_{\mu} \max_{\alpha} 2\,\mathbf{1}^\top \alpha - \alpha^\top Y K_\mu Y \alpha$ subject to: $0 \leq \alpha \leq C \wedge \alpha^\top y = 0$.
Convex problem: the function is convex in $\mu$, and convexity is preserved by the pointwise maximum over $\alpha$. page 31
Non-Negative Combinations. $K_\mu = \sum_{k=1}^p \mu_k K_k$, $\mu \in \Delta_1$. By von Neumann's generalized minimax theorem (convexity w.r.t. $\mu$, concavity w.r.t. $\alpha$, $\Delta_1$ convex and compact, $A$ convex and compact):
$\min_{\|\mu\|_1 = 1} \max_{\alpha \in A} 2\,\mathbf{1}^\top \alpha - \alpha^\top Y K_\mu Y \alpha = \max_{\alpha \in A} \min_{\|\mu\|_1 = 1} 2\,\mathbf{1}^\top \alpha - \alpha^\top Y K_\mu Y \alpha$
$= \max_{\alpha \in A} 2\,\mathbf{1}^\top \alpha - \max_{\|\mu\|_1 = 1} \alpha^\top Y K_\mu Y \alpha = \max_{\alpha \in A} 2\,\mathbf{1}^\top \alpha - \max_{k \in [1, p]} \alpha^\top Y K_k Y \alpha$. page 32
Non-Negative Combinations (Lanckriet et al., 2004). Optimization problem: in view of the previous analysis, the problem can be rewritten as the following QCQP:
$\max_{\alpha, t} 2\,\mathbf{1}^\top \alpha - t$ subject to: $\forall k \in [1, p],\ t \geq \alpha^\top Y K_k Y \alpha$; $0 \leq \alpha \leq C \wedge \alpha^\top y = 0$.
Complexity (interior-point methods): $O(p m^3)$. page 33
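As an illustration only (not the original implementation), the QCQP can be written with the generic convex solver CVXPY; each quadratic constraint is expressed through a Cholesky factor of the corresponding base Gram matrix, and all names are assumptions of this sketch:

```python
import numpy as np
import cvxpy as cp

def learn_kernel_svm_qcqp(base_grams, y, C=1.0):
    """QCQP for learning a non-negative combination of kernels with SVMs
    (formulation in the style of Lanckriet et al., 2004)."""
    m = len(y)
    alpha = cp.Variable(m)
    t = cp.Variable()
    beta = cp.multiply(y, alpha)          # beta = Y alpha, so alpha' Y K Y alpha = beta' K beta
    constraints = [alpha >= 0, alpha <= C, cp.sum(beta) == 0]
    for K_k in base_grams:
        # Write beta' K_k beta as ||L_k' beta||^2 via a Cholesky factor of K_k.
        L_k = np.linalg.cholesky(K_k + 1e-9 * np.eye(m))
        constraints.append(t >= cp.sum_squares(L_k.T @ beta))
    prob = cp.Problem(cp.Maximize(2 * cp.sum(alpha) - t), constraints)
    prob.solve()
    return alpha.value, t.value
```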
Equivalent Primal Formulation. Optimization problem:
$\min_{w, \mu \in \Delta_q} \frac{1}{2} \sum_{k=1}^p \frac{\|w_k\|^2}{\mu_k} + C \sum_{i=1}^m \max\Big(0,\ 1 - y_i \sum_{k=1}^p w_k \cdot \Phi_k(x_i)\Big)$. page 34
Lots of Optimization Solutions. QCQP (Lanckriet et al., 2004). Wrapper methods interleaving calls to an SVM solver with updates of $\mu$: SILP (Sonnenburg et al., 2006); reduced gradient (SimpleMKL) (Rakotomamonjy et al., 2008); Newton's method (Kloft et al., 2009); mirror descent (Nath et al., 2009); on-line method (Orabona & Jie, 2011). Many other methods proposed. page 35
Does It Work? Experiments: this algorithm and its different optimization solutions often do not significantly outperform the simple uniform combination kernel in practice! observations corroborated by NIPS workshops. Alternative algorithms: significant improvement (see empirical results of (Gönen and Alpaydin, 2011)). centered alignment-based LK algorithms (Cortes, MM, and Rostamizadeh, 2010 and 2012). non-linear combination of kernels (Cortes, MM, and Rostamizadeh, 2009). page 36
LK Formulation - KRR (Cortes, MM, and Rostamizadeh, 2009). Kernel family: non-negative combinations, $L_q$ regularization. Optimization problem:
$\min_{\mu} \max_{\alpha} -\alpha^\top \Big(\sum_{k=1}^p \mu_k K_k + \lambda I\Big) \alpha + 2\,\alpha^\top y$ subject to: $\mu \geq 0 \wedge \|\mu - \mu_0\|_q \leq \Lambda$.
Convex optimization: the function is linear in $\mu$, and convexity is preserved by the pointwise maximum over $\alpha$. page 37
Projected Gradient. The maximization problem in $\alpha$ has the closed-form solution $\alpha = (K_\mu + \lambda I)^{-1} y$, which reduces the problem to
$\min_{\mu} y^\top (K_\mu + \lambda I)^{-1} y$ subject to: $\mu \geq 0 \wedge \|\mu - \mu_0\|_2 \leq \Lambda$.
This is a convex optimization problem; one solution is projection-based gradient descent, using
$\frac{\partial F}{\partial \mu_k} = \mathrm{Tr}\Big[\frac{\partial\, y^\top (K_\mu + \lambda I)^{-1} y}{\partial (K_\mu + \lambda I)} \frac{\partial (K_\mu + \lambda I)}{\partial \mu_k}\Big] = -\mathrm{Tr}\Big[(K_\mu + \lambda I)^{-1} y y^\top (K_\mu + \lambda I)^{-1} \frac{\partial (K_\mu + \lambda I)}{\partial \mu_k}\Big]$
$= -\mathrm{Tr}\big[(K_\mu + \lambda I)^{-1} y y^\top (K_\mu + \lambda I)^{-1} K_k\big] = -y^\top (K_\mu + \lambda I)^{-1} K_k (K_\mu + \lambda I)^{-1} y = -\alpha^\top K_k \alpha$. page 38
Proj. Grad. KRR - $L_2$ Reg.
ProjectionBasedGradientDescent((K_k)_{k in [1,p]}, mu_0)
 1  mu' <- mu_0
 2  mu <- 0   (any value with ||mu' - mu|| > epsilon)
 3  while ||mu' - mu|| > epsilon do
 4      mu <- mu'
 5      alpha <- (K_mu + lambda I)^{-1} y
 6      mu' <- mu + eta (alpha^T K_1 alpha, ..., alpha^T K_p alpha)
 7      for k <- 1 to p do
 8          mu'_k <- max(0, mu'_k)
 9      mu' <- mu_0 + Lambda (mu' - mu_0) / ||mu' - mu_0||
10  return mu'
page 39
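A runnable Python sketch of this procedure, under the slide's assumptions (the hyperparameters lam, eta, Lambda, tol, and max_iter are mine, introduced for illustration):

```python
import numpy as np

def projected_gradient_lk_krr(base_grams, y, mu0, lam=1.0, eta=0.1,
                              Lambda=1.0, tol=1e-4, max_iter=100):
    """Projection-based gradient descent for L2-regularized learning kernel KRR.

    Minimizes y' (K_mu + lam I)^(-1) y over mu >= 0 with ||mu - mu0|| <= Lambda,
    where K_mu = sum_k mu_k K_k; gradient coordinate k is -alpha' K_k alpha."""
    m = len(y)
    mu = np.asarray(mu0, dtype=float).copy()
    for _ in range(max_iter):
        mu_prev = mu.copy()
        K_mu = sum(m_k * K_k for m_k, K_k in zip(mu, base_grams))
        alpha = np.linalg.solve(K_mu + lam * np.eye(m), y)
        # Descent step: the gradient is -(alpha' K_k alpha)_k, so add eta * alpha' K_k alpha.
        mu = mu + eta * np.array([alpha @ K_k @ alpha for K_k in base_grams])
        mu = np.maximum(mu, 0.0)                       # non-negativity (lines 7-8)
        diff = mu - mu0
        norm = np.linalg.norm(diff)
        if norm > 0:
            # Rescale onto the sphere ||mu - mu0|| = Lambda, as in line 9 of the pseudocode.
            mu = mu0 + Lambda * diff / norm
        if np.linalg.norm(mu - mu_prev) < tol:
            break
    return mu
```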
Interpolated Step KRR - $L_2$ Reg.
InterpolatedIterativeAlgorithm((K_k)_{k in [1,p]}, mu_0)
 1  alpha' <- 0
 2  alpha <- (K_{mu_0} + lambda I)^{-1} y
 3  while ||alpha' - alpha|| > epsilon do
 4      alpha' <- alpha
 5      v <- (alpha'^T K_1 alpha', ..., alpha'^T K_p alpha')
 6      mu <- mu_0 + Lambda v / ||v||
 7      alpha <- eta alpha' + (1 - eta) (K_mu + lambda I)^{-1} y
 8  return alpha
Simple and very efficient: few iterations (fewer than 15). page 40
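A corresponding Python sketch of the interpolated update (again my own implementation; eta is the interpolation parameter and tol the stopping threshold, both assumed):

```python
import numpy as np

def interpolated_lk_krr(base_grams, y, mu0, lam=1.0, eta=0.5,
                        Lambda=1.0, tol=1e-4, max_iter=50):
    """Interpolated iterative algorithm for L2-regularized learning kernel KRR."""
    m = len(y)
    K_mu0 = sum(m_k * K_k for m_k, K_k in zip(mu0, base_grams))
    alpha = np.linalg.solve(K_mu0 + lam * np.eye(m), y)
    mu = np.asarray(mu0, dtype=float).copy()
    for _ in range(max_iter):
        alpha_prev = alpha.copy()
        v = np.array([alpha_prev @ K_k @ alpha_prev for K_k in base_grams])
        mu = mu0 + Lambda * v / np.linalg.norm(v)
        K_mu = sum(m_k * K_k for m_k, K_k in zip(mu, base_grams))
        # Interpolated step: mix the previous alpha with the exact KRR solution for K_mu.
        alpha = eta * alpha_prev + (1.0 - eta) * np.linalg.solve(K_mu + lam * np.eye(m), y)
        if np.linalg.norm(alpha - alpha_prev) < tol:
            break
    return alpha, mu
```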
$L_2$-Regularized Combinations (Cortes, MM, and Rostamizadeh, 2009). Dense combinations are beneficial when using many kernels. Combining kernels based on single features can be viewed as principled feature weighting. [Plots: performance as a function of the number of kernels on the DVD and Reuters (acq) datasets, for the uniform baseline and the $L_1$- and $L_2$-regularized combinations.] page 41
Conclusion Solid theoretical guarantees suggesting the use of a large number of base kernels. Broad literature on optimization techniques but often no significant improvement over uniform combination. Recent algorithms with significant improvements, in particular non-linear combinations. Still many theoretical and algorithmic questions left to explore. page 42
References
Olivier Bousquet and Daniel J. L. Herrmann. On the complexity of learning the kernel matrix. In NIPS, 2002.
Corinna Cortes, Marius Kloft, and Mehryar Mohri. Learning kernels using local Rademacher complexity. In NIPS, 2013.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In NIPS, 2009.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In ICML, 2010.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Two-stage learning kernel methods. In ICML, 2010.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. JMLR, 13:795-828, 2012.
page 43
References
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Tutorial: Learning kernels. ICML 2011, Bellevue, Washington, July 2011.
Zakria Hussain and John Shawe-Taylor. Improved loss bounds for multiple kernel learning. In AISTATS, 2011 [see arXiv for corrected version].
Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. JMLR, 13:1865-1890, 2012.
Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
Vladimir Koltchinskii and Ming Yuan. Sparse recovery in large ensembles of kernel machines. In COLT, 2008.
Gert Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5, 2004.
page 44
References
Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. JMLR, 12:2211-2268, 2011.
Nathan Srebro and Shai Ben-David. Learning bounds for support vector machines with learned kernels. In COLT, 2006.
Yiming Ying and Colin Campbell. Generalization bounds for learning the kernel problem. In COLT, 2009.
page 45