Advanced Machine Learning


1 Advanced Machine Learning: Learning Kernels. Mehryar Mohri, Courant Institute & Google Research.

2 Outline
- Kernel methods.
- Learning kernels scenario.
- Learning bounds.
- Algorithms.

3 Machine Learning Components
[diagram: user → features → sample → algorithm → hypothesis h]

4 Machine Learning Components
[diagram as above, annotated: choosing the features is the critical task; the algorithm is the main focus of the ML literature]

5 Kernel Methods
- Features $\Phi \colon X \to \mathbb{H}_K$ implicitly defined via the choice of a PDS kernel $K$, interpreted as a similarity measure: $\forall x, y \in X$, $\langle \Phi(x), \Phi(y) \rangle = K(x, y)$.
- Flexibility: the PDS kernel can be chosen arbitrarily.
- Helps extend a variety of algorithms to non-linear predictors, e.g., SVMs, KRR, SVR, KPCA.
- The PDS condition is directly related to the convexity of the optimization problem.

6 Example - Polynomial Kernels
- Definition: $\forall x, y \in \mathbb{R}^N$, $K(x, y) = (x \cdot y + c)^d$, with $c > 0$.
- Example: for $N = 2$ and $d = 2$,
$K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 = \big(x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c\big) \cdot \big(y_1^2,\ y_2^2,\ \sqrt{2}\,y_1 y_2,\ \sqrt{2c}\,y_1,\ \sqrt{2c}\,y_2,\ c\big).$
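
As a quick numeric check of this identity, here is a minimal NumPy sketch (the function names poly_kernel and phi and the test points are our choices, not from the slides): the kernel value coincides with the inner product of the explicit six-dimensional features.

import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    # Polynomial kernel K(x, y) = (x . y + c)^d.
    return (np.dot(x, y) + c) ** d

def phi(x, c=1.0):
    # Explicit feature map for N = 2, d = 2 (six dimensions).
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0 * c) * x1,
                     np.sqrt(2.0 * c) * x2,
                     c])

x, y = np.array([1.0, -2.0]), np.array([0.5, 3.0])
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))  # kernel = inner product of features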

7 XOR Problem
Use a second-degree polynomial kernel with $c = 1$. The four XOR points map to the six-dimensional features $(x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ 1)$:
$(1, 1) \mapsto (1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1)$, $(-1, 1) \mapsto (1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1)$,
$(-1, -1) \mapsto (1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1)$, $(1, -1) \mapsto (1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$.
[figure: the four points plotted in the input coordinates $(x_1, x_2)$ and in the feature coordinates $(x_1^2, \sqrt{2}\,x_1 x_2)$]
Linearly non-separable in the input space; linearly separable in the feature space by the hyperplane $x_1 x_2 = 0$.

8 Other Standard PDS Kernels
- Gaussian kernels: $K(x, y) = \exp\Big(-\frac{\|x - y\|^2}{2\sigma^2}\Big)$, $\sigma \neq 0$; normalized kernel of $(x, x') \mapsto \exp\big(\frac{x \cdot x'}{\sigma^2}\big)$.
- Sigmoid kernels: $K(x, y) = \tanh(a (x \cdot y) + b)$, $a, b \geq 0$.
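
A minimal NumPy sketch of how a Gaussian Gram matrix might be computed (an illustration; the function name and the squared-distance expansion are our choices):

import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows of X.
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))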

9 SVM (Cortes and Vapnik, 1995; Boser, Guyon, and Vapnik, 1992)
Primal:
$\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \max\big(0,\ 1 - y_i (w \cdot \Phi(x_i) + b)\big).$
Dual:
$\max_{\alpha} \ \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to: $0 \leq \alpha_i \leq C \wedge \sum_{i=1}^m \alpha_i y_i = 0,\ i \in [1, m]$.
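
For illustration, one way to solve this dual numerically is to hand a precomputed Gram matrix to an off-the-shelf solver; a minimal sketch using scikit-learn's SVC with kernel="precomputed" on toy XOR-like data (the data and parameter values are arbitrary choices of ours):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like labels

K = rbf_kernel(X, X, gamma=0.5)              # Gram matrix K(x_i, x_j)
clf = SVC(C=1.0, kernel="precomputed")       # maximizes the dual above internally
clf.fit(K, y)
print(clf.score(K, y))                       # training accuracy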

10 Kernel Ridge Regression (Hoerl and Kennard, 1970; Saunders et al., 1998)
Primal:
$\min_{w} \ \lambda \|w\|^2 + \sum_{i=1}^m (w \cdot \Phi(x_i) - y_i)^2.$
Dual:
$\max_{\alpha \in \mathbb{R}^m} \ -\alpha^\top (K + \lambda I)\alpha + 2\,\alpha^\top y.$
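
Since the dual maximizer is available in closed form, $\alpha = (K + \lambda I)^{-1} y$, a few lines of NumPy suffice; a minimal sketch (the function names and the lam parameter standing for $\lambda$ are our notation):

import numpy as np

def krr_dual_fit(K, y, lam=0.1):
    # alpha = (K + lam * I)^{-1} y, the maximizer of the dual objective.
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def krr_predict(K_cross, alpha):
    # h(x') = sum_i alpha_i K(x_i, x'); K_cross[j, i] = K(x_i, x'_j) for test points x'_j.
    return K_cross @ alpha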

11 Questions
How should the user choose the kernel?
- problem similar to that of selecting features for other learning algorithms.
- poor choice: learning made very difficult.
- good choice: even poor learners could succeed.
The requirement from the user is thus critical:
- can this requirement be lessened?
- is a more automatic selection of features possible?

12 Outline
- Kernel methods.
- Learning kernels scenario.
- Learning bounds.
- Algorithms.

13 Standard Learning with Kernels
[diagram: user supplies the kernel K; sample → algorithm → hypothesis h]

14 Learning Kernel Framework
[diagram: user supplies a kernel family $\mathcal{K}$; sample → algorithm → pair $(K, h)$]

15 Kernel Families
Most frequently used kernel families, $q \geq 1$:
$\mathcal{K}_q = \Big\{ K_\mu : K_\mu = \sum_{k=1}^p \mu_k K_k,\ \mu = (\mu_1, \ldots, \mu_p)^\top \in \Delta_q \Big\}$ with $\Delta_q = \{\mu : \mu \geq 0,\ \|\mu\|_q = 1\}$.
Hypothesis sets:
$H_q = \big\{ h \in \mathbb{H}_K : K \in \mathcal{K}_q,\ \|h\|_{\mathbb{H}_K} \leq 1 \big\}.$

16 Relation between Norms
Lemma: for $p, q \in (0, +\infty]$ with $p \leq q$, the following holds: $\forall x \in \mathbb{R}^N$,
$\|x\|_q \leq \|x\|_p \leq N^{\frac{1}{p} - \frac{1}{q}} \|x\|_q.$
Proof: for the left inequality, observe that for $x \neq 0$,
$\Big[\frac{\|x\|_p}{\|x\|_q}\Big]^p = \sum_{i=1}^N \Big[\frac{|x_i|}{\|x\|_q}\Big]^p \geq \sum_{i=1}^N \Big[\frac{|x_i|}{\|x\|_q}\Big]^q = 1,$
since each ratio $|x_i|/\|x\|_q$ is at most $1$ and $p \leq q$. The right inequality follows from Hölder's inequality:
$\|x\|_p = \Big[\sum_{i=1}^N |x_i|^p\Big]^{\frac{1}{p}} \leq \Big[\sum_{i=1}^N \big(|x_i|^p\big)^{\frac{q}{p}}\Big]^{\frac{1}{p}\cdot\frac{p}{q}} \Big[\sum_{i=1}^N 1^{\frac{q}{q-p}}\Big]^{\frac{1}{p}\big(1 - \frac{p}{q}\big)} = \|x\|_q\, N^{\frac{1}{p} - \frac{1}{q}}.$
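
The lemma is easy to check numerically; a small NumPy sketch (an illustration, with arbitrary choices of $p$, $q$, and $N$):

import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(size=N)
for p, q in [(1.0, 2.0), (2.0, 3.0), (1.0, 4.0)]:
    norm_p = np.linalg.norm(x, p)
    norm_q = np.linalg.norm(x, q)
    # ||x||_q <= ||x||_p <= N^(1/p - 1/q) ||x||_q
    assert norm_q <= norm_p <= N ** (1.0 / p - 1.0 / q) * norm_q + 1e-9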

17 Single Kernel Guarantee (Koltchinskii and Panchenko, 2002)
Theorem: fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_1$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{\mathrm{Tr}[K]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$

18 Multiple Kernel Guarantee (Cortes, MM, and Rostamizadeh, 2010)
Theorem: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$ and any integer $1 \leq s \leq r$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{s\,\|u\|_s}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$
with $u = \big(\mathrm{Tr}[K_1], \ldots, \mathrm{Tr}[K_p]\big)^\top$.

19 Proof
Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then:
$\widehat{\mathfrak{R}}_S(H_q) = \frac{1}{m} E_\sigma\Big[\sup_{h \in H_q} \sum_{i=1}^m \sigma_i h(x_i)\Big]$
$= \frac{1}{m} E_\sigma\Big[\sup_{\mu \in \Delta_q,\ \alpha^\top K_\mu \alpha \leq 1} \sum_{i,j=1}^m \sigma_i \alpha_j K_\mu(x_i, x_j)\Big]$
$= \frac{1}{m} E_\sigma\Big[\sup_{\mu \in \Delta_q,\ \|K_\mu^{1/2}\alpha\| \leq 1} \big(K_\mu^{1/2}\sigma\big) \cdot \big(K_\mu^{1/2}\alpha\big)\Big]$
$= \frac{1}{m} E_\sigma\Big[\sup_{\mu \in \Delta_q} \sqrt{\sigma^\top K_\mu \sigma}\Big]$ (Cauchy-Schwarz)
$= \frac{1}{m} E_\sigma\Big[\sup_{\mu \in \Delta_q} \sqrt{\mu \cdot u_\sigma}\Big]$, with $u_\sigma = \big(\sigma^\top K_1 \sigma, \ldots, \sigma^\top K_p \sigma\big)^\top$
$= \frac{1}{m} E_\sigma\big[\sqrt{\|u_\sigma\|_r}\big].$ (definition of the dual norm)

20 Lemma (Cortes, MM, and Rostamizadeh, 2010)
Lemma: let $K$ be a kernel matrix for a finite sample. Then, for any integer $r$,
$E_\sigma\big[(\sigma^\top K \sigma)^r\big] \leq \big(r\,\mathrm{Tr}[K]\big)^r.$
Proof: combinatorial argument.

21 Proof
For any $1 \leq s \leq r$:
$\widehat{\mathfrak{R}}_S(H_q) = \frac{1}{m} E_\sigma\big[\sqrt{\|u_\sigma\|_r}\big] \leq \frac{1}{m} E_\sigma\big[\sqrt{\|u_\sigma\|_s}\big]$
$= \frac{1}{m} E_\sigma\Big[\Big(\sum_{k=1}^p (\sigma^\top K_k \sigma)^s\Big)^{\frac{1}{2s}}\Big]$
$\leq \frac{1}{m} \Big[\sum_{k=1}^p E_\sigma\big[(\sigma^\top K_k \sigma)^s\big]\Big]^{\frac{1}{2s}}$ (Jensen's inequality)
$\leq \frac{1}{m} \Big[\sum_{k=1}^p \big(s\,\mathrm{Tr}[K_k]\big)^s\Big]^{\frac{1}{2s}}$ (lemma)
$= \frac{\sqrt{s\,\|u\|_s}}{m}.$

22 $L_1$ Learning Bound (Cortes, MM, and Rostamizadeh, 2010)
Corollary: fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_1$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{e\lceil \log p \rceil \max_{k=1}^p \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$
- weak dependency on $p$.
- bound valid for $p \gg m$: since $\mathrm{Tr}[K_k] \leq m \max_x K_k(x, x)$, the middle term is at most $\frac{2}{\rho}\sqrt{\frac{e\lceil \log p \rceil}{m}}\,\max_x \sqrt{K_k(x, x)}$.

23 Proof
For $q = 1$, the bound holds for any integer $s \geq 1$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{s\,\|u\|_s}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$
with $s\,\|u\|_s = s\Big[\sum_{k=1}^p \mathrm{Tr}[K_k]^s\Big]^{\frac{1}{s}} \leq s\,p^{\frac{1}{s}} \max_{k=1}^p \mathrm{Tr}[K_k].$
The function $s \mapsto s\,p^{1/s}$ reaches its minimum at $s = \log p$; choosing the integer $s = \lceil \log p \rceil$ gives $s\,p^{1/s} \leq e\lceil \log p \rceil$, which yields the corollary.

24 Lower Bound
Tight bound: the $\sqrt{\log p}$ dependency cannot be improved.
- argument based on VC dimension, or by example.
Observations:
- take $X = \{-1, +1\}^p$ and the canonical projection kernels $K_k(x, x') = x_k x'_k$; then $H_1$ contains $J_p = \{x \mapsto s\,x_k : k \in [1, p],\ s \in \{-1, +1\}\}$.
- $\mathrm{VCdim}(J_p) = \Theta(\log p)$.
- for $\rho = 1$ and $h \in J_p$, the margin loss coincides with the classification error: $\widehat{R}_\rho(h) = \widehat{R}(h)$.
- VC lower bound: $\Omega\big(\sqrt{\mathrm{VCdim}(J_p)/m}\big)$.

25 Pseudo-Dimension Bound (Srebro and Ben-David, 2006)
Assume that $K_k(x, x) \leq R^2$ for all $k \in [1, p]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$R(h) \leq \widehat{R}_\rho(h) + \sqrt{\frac{8}{m}\Big[2 + p \log\frac{128\,e\,m^3 R^2}{\rho^2\,p} + 256\,\frac{R^2}{\rho^2}\log\frac{\rho\,e\,m}{8R}\,\log\frac{128\,m R^2}{\rho^2}\Big]}$, plus the usual $\sqrt{\log(1/\delta)/m}$-type confidence term.
- bound additive in $p$ (modulo log terms); not informative for $p > m$.
- based on the pseudo-dimension of the kernel family.
- similar guarantees for other families.

26 Comparison
[figure: comparison of the $L_1$ and pseudo-dimension bound complexity terms as a function of $p$, for $\rho/R = 0.2$]

27 $L_q$ Learning Bound (Cortes, MM, and Rostamizadeh, 2010)
Corollary: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{r\,p^{\frac{1}{r}} \max_{k=1}^p \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$
- mild dependency on $p$.
- $\mathrm{Tr}[K_k] \leq m \max_x K_k(x, x)$.

28 Lower Bound
Tight bound: the $p^{\frac{1}{2r}}$ dependency cannot be improved; in particular, $p^{\frac{1}{4}}$ is tight for $L_2$ regularization.
Observations: take equal kernels, $K_1 = \cdots = K_p$. Then:
- $K_\mu = \sum_{k=1}^p \mu_k K_k = \big(\sum_{k=1}^p \mu_k\big) K_1$, and $\|h\|^2_{\mathbb{H}_{K_\mu}} = \|h\|^2_{\mathbb{H}_{K_1}} \big/ \sum_{k=1}^p \mu_k$ for $\sum_{k=1}^p \mu_k \neq 0$.
- $\sum_{k=1}^p \mu_k \leq p^{\frac{1}{r}} \|\mu\|_q = p^{\frac{1}{r}}$ (Hölder's inequality).
- thus, $H_q$ coincides with $\{h \in \mathbb{H}_{K_1} : \|h\|_{\mathbb{H}_{K_1}} \leq p^{\frac{1}{2r}}\}$.

29 Outline
- Kernel methods.
- Learning kernels scenario.
- Learning bounds.
- Algorithms.

30 General LK Formulation - SVMs
Notation:
- $\mathcal{K}$: set of PDS kernel functions.
- $\mathbf{K}$: set of kernel matrices associated to $\mathcal{K}$, assumed convex.
- $Y \in \mathbb{R}^{m \times m}$: diagonal matrix with $Y_{ii} = y_i$.
Optimization problem:
$\min_{K \in \mathbf{K}} \max_{\alpha} \ 2\,\mathbf{1}^\top\alpha - \alpha^\top Y K Y \alpha$ subject to: $0 \leq \alpha \leq C \wedge \alpha^\top y = 0$.
Convex problem: the function is linear in $K$, and a pointwise maximum of convex functions is convex.

31 Parameterized LK Formulation
Notation:
- $(K_\mu)_{\mu \in M}$: parameterized set of PDS kernel functions, with $M$ a convex set and $\mu \mapsto K_\mu$ a concave function.
- $Y \in \mathbb{R}^{m \times m}$: diagonal matrix with $Y_{ii} = y_i$.
Optimization problem:
$\min_{\mu \in M} \max_{\alpha} \ 2\,\mathbf{1}^\top\alpha - \alpha^\top Y K_\mu Y \alpha$ subject to: $0 \leq \alpha \leq C \wedge \alpha^\top y = 0$.
Convex problem: the function is convex in $\mu$, and a pointwise maximum of convex functions is convex.

32 Non-Negative Combinations
Let $K_\mu = \sum_{k=1}^p \mu_k K_k$ with $\mu \in \Delta_1$. By von Neumann's generalized minimax theorem (convexity w.r.t. $\mu$, concavity w.r.t. $\alpha$, $\Delta_1$ convex and compact, $A$ convex and compact):
$\min_{\mu \in \Delta_1} \max_{\alpha \in A} \ 2\,\mathbf{1}^\top\alpha - \alpha^\top Y K_\mu Y \alpha = \max_{\alpha \in A} \min_{\mu \in \Delta_1} \ 2\,\mathbf{1}^\top\alpha - \alpha^\top Y K_\mu Y \alpha$
$= \max_{\alpha \in A} \ 2\,\mathbf{1}^\top\alpha - \max_{\mu \in \Delta_1} \alpha^\top Y K_\mu Y \alpha$
$= \max_{\alpha \in A} \ 2\,\mathbf{1}^\top\alpha - \max_{k \in [1, p]} \alpha^\top Y K_k Y \alpha.$

33 Non-Negative Combinations (Lanckriet et al., 2004)
Optimization problem: in view of the previous analysis, the problem can be rewritten as the following QCQP:
$\max_{\alpha, t} \ 2\,\mathbf{1}^\top\alpha - t$
subject to: $\forall k \in [1, p],\ t \geq \alpha^\top Y K_k Y \alpha$; $0 \leq \alpha \leq C \wedge \alpha^\top y = 0$.
Complexity (interior-point methods): $O(p\,m^3)$.
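
As an illustration of this QCQP (a sketch of ours, not the original authors' code), one can express each quadratic constraint through a Cholesky-type factor of $K_k$ and hand the problem to cvxpy; the function name and the small diagonal jitter are our choices:

import cvxpy as cp
import numpy as np

def lk_svm_qcqp(Ks, y, C=1.0):
    # Ks: list of p PSD kernel matrices (m x m); y: labels in {-1, +1} as floats.
    m = len(y)
    alpha, t = cp.Variable(m), cp.Variable()
    Ya = cp.multiply(y, alpha)                    # Y alpha, with Y = diag(y)
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    for K in Ks:
        # t >= alpha' Y K Y alpha, via ||L' Y alpha||^2 with K ~= L L'
        L = np.linalg.cholesky(K + 1e-9 * np.eye(m))
        constraints.append(cp.sum_squares(L.T @ Ya) <= t)
    problem = cp.Problem(cp.Maximize(2 * cp.sum(alpha) - t), constraints)
    problem.solve()
    return alpha.value, t.value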

34 Equivalent Primal Formulation
Optimization problem:
$\min_{w,\ \mu \in \Delta_q} \ \frac{1}{2} \sum_{k=1}^p \frac{\|w_k\|^2}{\mu_k} + C \sum_{i=1}^m \max\Big(0,\ 1 - y_i \sum_{k=1}^p w_k \cdot \Phi_k(x_i)\Big).$

35 Lots of Optimization Solutions
- QCQP (Lanckriet et al., 2004).
- Wrapper methods interleaving calls to an SVM solver with updates of $\mu$:
  - SILP (Sonnenburg et al., 2006).
  - reduced gradient (SimpleMKL) (Rakotomamonjy et al., 2008).
  - Newton's method (Kloft et al., 2009).
  - mirror descent (Nath et al., 2009).
  - on-line method (Orabona and Jie, 2011).
- Many other methods proposed.

36 Does It Work?
Experiments:
- this algorithm and its different optimization solutions often do not significantly outperform the simple uniform combination kernel in practice!
- observations corroborated by NIPS workshops.
Alternative algorithms with significant improvement (see the empirical results of (Gönen and Alpaydin, 2011)):
- centered alignment-based LK algorithms (Cortes, MM, and Rostamizadeh, 2010 and 2012).
- non-linear combinations of kernels (Cortes, MM, and Rostamizadeh, 2009).

37 LK Formulation - KRR (Cortes, MM, and Rostamizadeh, 2009)
Kernel family: non-negative combinations with $L_q$ regularization.
Optimization problem:
$\min_{\mu} \max_{\alpha} \ -\alpha^\top\Big(\sum_{k=1}^p \mu_k K_k + \lambda I\Big)\alpha + 2\,\alpha^\top y$
subject to: $\mu \geq 0 \wedge \|\mu - \mu_0\|_q \leq \Lambda$.
Convex optimization: linearity in $\mu$ and convexity of the pointwise maximum.

38 Projected Gradient
Solving the maximization problem in $\alpha$ (closed-form solution $\alpha = (K_\mu + \lambda I)^{-1} y$) reduces the problem to
$\min_{\mu} \ y^\top (K_\mu + \lambda I)^{-1} y$ subject to: $\mu \geq 0 \wedge \|\mu - \mu_0\|_2 \leq \Lambda$.
Convex optimization problem; one solution uses projection-based gradient descent, with
$\frac{\partial F}{\partial \mu_k} = -\mathrm{Tr}\Big[(K_\mu + \lambda I)^{-1} y y^\top (K_\mu + \lambda I)^{-1} \frac{\partial (K_\mu + \lambda I)}{\partial \mu_k}\Big]$
$= -\mathrm{Tr}\big[(K_\mu + \lambda I)^{-1} y y^\top (K_\mu + \lambda I)^{-1} K_k\big]$
$= -y^\top (K_\mu + \lambda I)^{-1} K_k (K_\mu + \lambda I)^{-1} y = -\alpha^\top K_k \alpha.$

39 Proj. Grad. KRR - $L_2$ Reg.
ProjectionBasedGradientDescent($(K_k)_{k \in [1,p]},\ \mu_0$)
   $\mu \leftarrow \mu_0$; $\mu' \leftarrow \mu_0 + 2\epsilon\,\mathbf{1}$  (so that the loop is entered)
   while $\|\mu' - \mu\| > \epsilon$ do
      $\mu' \leftarrow \mu$
      $\alpha \leftarrow (K_\mu + \lambda I)^{-1} y$
      $\mu \leftarrow \mu + \eta\,(\alpha^\top K_1 \alpha, \ldots, \alpha^\top K_p \alpha)^\top$
      for $k \leftarrow 1$ to $p$ do $\mu_k \leftarrow \max(0, \mu_k)$
      $\mu \leftarrow \mu_0 + \Lambda\,\frac{\mu - \mu_0}{\|\mu - \mu_0\|}$
   return $\mu$
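
A runnable NumPy transcription of this pseudocode might look as follows (a sketch: the step size eta, the tolerance eps, and the choice to project back onto the ball only when the constraint is violated are our assumptions):

import numpy as np

def projected_gradient_krr(Ks, y, mu0, lam=0.1, eta=0.01, Lam=1.0, eps=1e-4, max_iter=1000):
    # Ks: array of shape (p, m, m); minimizes F(mu) = y' (K_mu + lam I)^{-1} y
    # over mu >= 0, ||mu - mu0||_2 <= Lam.
    p, m, _ = Ks.shape
    mu = mu0.astype(float).copy()
    for _ in range(max_iter):
        mu_prev = mu.copy()
        K_mu = np.tensordot(mu, Ks, axes=1)               # sum_k mu_k K_k
        alpha = np.linalg.solve(K_mu + lam * np.eye(m), y)
        grad_neg = np.array([alpha @ K @ alpha for K in Ks])  # = -dF/dmu_k
        mu = mu + eta * grad_neg                          # gradient descent step on F
        mu = np.maximum(mu, 0.0)                          # enforce mu >= 0
        diff = mu - mu0
        norm = np.linalg.norm(diff)
        if norm > Lam:                                    # project onto the L2 ball
            mu = mu0 + Lam * diff / norm
        if np.linalg.norm(mu - mu_prev) <= eps:
            break
    return mu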

40 Interpolated Step KRR - $L_2$ Reg.
InterpolatedIterativeAlgorithm($(K_k)_{k \in [1,p]},\ \mu_0$)
   $\alpha \leftarrow (K_{\mu_0} + \lambda I)^{-1} y$; $\alpha' \leftarrow \alpha + 2\epsilon\,\mathbf{1}$
   while $\|\alpha' - \alpha\| > \epsilon$ do
      $\alpha' \leftarrow \alpha$
      $v \leftarrow (\alpha^\top K_1 \alpha, \ldots, \alpha^\top K_p \alpha)^\top$
      $\mu \leftarrow \mu_0 + \Lambda\,\frac{v}{\|v\|}$
      $\alpha \leftarrow \eta\,\alpha + (1 - \eta)\,(K_\mu + \lambda I)^{-1} y$
   return $\alpha$
Simple and very efficient: few iterations (fewer than 15).
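
Similarly, a NumPy sketch of the interpolated iterative algorithm (the interpolation parameter eta and the stopping rule are our assumptions):

import numpy as np

def interpolated_krr(Ks, y, mu0, lam=0.1, eta=0.5, Lam=1.0, eps=1e-6, max_iter=50):
    # Ks: array of shape (p, m, m); returns the dual vector alpha.
    p, m, _ = Ks.shape
    K0 = np.tensordot(mu0, Ks, axes=1)
    alpha = np.linalg.solve(K0 + lam * np.eye(m), y)
    for _ in range(max_iter):
        alpha_prev = alpha.copy()
        v = np.array([alpha @ K @ alpha for K in Ks])
        mu = mu0 + Lam * v / np.linalg.norm(v)            # step along the gradient direction
        K_mu = np.tensordot(mu, Ks, axes=1)
        alpha = eta * alpha + (1.0 - eta) * np.linalg.solve(K_mu + lam * np.eye(m), y)
        if np.linalg.norm(alpha - alpha_prev) <= eps:     # typically converges in few iterations
            break
    return alpha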

41 $L_2$-Regularized Combinations (Cortes, MM, and Rostamizadeh, 2009)
- Dense combinations are beneficial when using many kernels.
- Combining kernels based on single features can be viewed as principled feature weighting.
[figure: performance vs. number of kernels on the DVD and Reuters (acq) datasets, comparing the uniform baseline with $L_1$- and $L_2$-regularized combinations]

42 Conclusion
- Solid theoretical guarantees suggesting the use of a large number of base kernels.
- Broad literature on optimization techniques, but often no significant improvement over the uniform combination.
- Recent algorithms with significant improvements, in particular non-linear combinations.
- Still many theoretical and algorithmic questions left to explore.

43 References
- Olivier Bousquet and Daniel J. L. Herrmann. On the complexity of learning the kernel matrix. In NIPS, 2002.
- Corinna Cortes, Marius Kloft, and Mehryar Mohri. Learning kernels using local Rademacher complexity. In NIPS, 2013.
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In NIPS, 2009.
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In ICML, 2010.
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Two-stage learning kernel methods. In ICML, 2010.
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. JMLR 13:795-828, 2012.

44 References
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Tutorial: Learning kernels. ICML 2011, Bellevue, Washington, July 2011.
- Zakria Hussain and John Shawe-Taylor. Improved loss bounds for multiple kernel learning. In AISTATS, 2011 [see arXiv for corrected version].
- Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. JMLR 13:1865-1890, 2012.
- Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1), 2002.
- Vladimir Koltchinskii and Ming Yuan. Sparse recovery in large ensembles of kernel machines. In COLT, 2008.
- Gert Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael Jordan. Learning the kernel matrix with semidefinite programming. JMLR 5:27-72, 2004.

45 References
- Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. JMLR 12:2211-2268, 2011.
- Nathan Srebro and Shai Ben-David. Learning bounds for support vector machines with learned kernels. In COLT, 2006.
- Yiming Ying and Colin Campbell. Generalization bounds for learning the kernel problem. In COLT, 2009.
