10-701/15-781 Recitation: Kernels


1 10-701/15-781 Recitation: Kernels. Manojit Nandi, February 27, 2014

2 Outline
- Mathematical Theory
  - Banach Spaces and Hilbert Spaces
  - Kernels
  - Commonly Used Kernels
- Kernel Theory
  - One Weird Kernel Trick
  - Representer Theorem
  - Do You Want to Build a Snowman Kernel? (It doesn't have to be a Snowman Kernel.)
  - Operations that Preserve/Create Kernels

3 Outline (continued)
- Current Research in Kernel Theory
  - Flaxman Test
  - Fastfood
- References


5 IMPORTANT: The Kernels used in Support Vector Machines are different from the Kernels used in Kernel Regression. To avoid confusion, I'll refer to the Support Vector Machine kernels as Mercer Kernels and the Kernel Regression kernels as Smoothing Kernels.

6 Recall: A square matrix A ∈ R^(N×N) is positive semi-definite if u^T A u ≥ 0 for all vectors u ∈ R^N. For a kernel function K : X × X → R defined over a set X with |X| = n, the Gram matrix G is the n×n matrix with entries G_ij = K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.
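
To make this concrete, here is a minimal numerical sketch (not from the slides; `rbf_kernel` is a helper we define for illustration) that builds a Gram matrix for a small sample and checks the two defining properties:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Illustrative kernel: K(x, z) = exp(-gamma * ||x - z||^2).
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # five points in R^3

# Gram matrix G_ij = K(x_i, x_j)
G = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# A valid kernel's Gram matrix is symmetric and positive semi-definite:
# symmetric entries, and all eigenvalues >= 0 (up to floating-point error).
print(np.allclose(G, G.T))                    # True
print(np.linalg.eigvalsh(G).min() >= -1e-10)  # True
```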

7 Mathematical Theory

8 Banach Spaces and Hilbert Spaces
A Banach Space is a complete vector space V equipped with a norm. Examples:
- ℓ_p^m spaces: R^m equipped with the norm ‖x‖_p = (Σ_{i=1}^m |x_i|^p)^(1/p)
- Function spaces: ‖f‖_p := (∫_X |f(x)|^p dx)^(1/p)

9 A Hilbert Space is a complete vector space V equipped with an inner product ⟨·,·⟩. The inner product induces a norm (‖x‖ = √⟨x, x⟩), so Hilbert Spaces are a special case of Banach Spaces. Example: Euclidean space with the standard inner product: for x, y ∈ R^n, ⟨x, y⟩ = Σ_{i=1}^n x_i y_i = x^T y. Because kernel functions are inner products in some higher-dimensional space, each kernel function corresponds to some Hilbert Space H.

10 Kernels
A Mercer Kernel is a function K defined over a set X such that K : X × X → R. The Mercer Kernel represents an inner product in some higher-dimensional space. A Mercer Kernel is symmetric and positive semi-definite. By Mercer's theorem, we know that any symmetric positive semi-definite kernel represents the inner product in some higher-dimensional Hilbert Space H:
K symmetric and positive semi-definite ⟺ K(x, x′) = ⟨φ(x), φ(x′)⟩ for some φ(·).

11 Why Symmetric? Inner products are symmetric by definition, so if the kernel function represents an inner product in some Hilbert Space, then the kernel function must be symmetric as well: K(x, z) = ⟨φ(x), φ(z)⟩ = ⟨φ(z), φ(x)⟩ = K(z, x).

12 Why Positive Semi-definite? In order to be positive semi-definite, the Gram Matrix G ∈ R^(n×n) must satisfy u^T G u ≥ 0 for all u ∈ R^n. Using functional analysis, one can show that u^T G u corresponds to ⟨h_u, h_u⟩ for some h_u in some Hilbert Space H. Because ⟨h_u, h_u⟩ ≥ 0 by the definition of an inner product, and ⟨h_u, h_u⟩ = u^T G u, it follows that u^T G u ≥ 0.

13 Why are Kernels useful? Let x = (x_1, x_2) and z = (z_1, z_2). Then
K(x, z) = ⟨x, z⟩² = (x_1 z_1 + x_2 z_2)² = x_1² z_1² + 2 x_1 z_1 x_2 z_2 + x_2² z_2² = ⟨(x_1², √2 x_1 x_2, x_2²), (z_1², √2 z_1 z_2, z_2²)⟩ = ⟨φ(x), φ(z)⟩,
where φ(x) = (x_1², √2 x_1 x_2, x_2²).
What if x = (x_1, x_2, x_3, x_4), z = (z_1, z_2, z_3, z_4), and K(x, z) = ⟨x, z⟩^50? In this case, φ(x) and φ(z) are 23,426-dimensional vectors (from combinatorics, the number of ways to place 50 unlabeled balls into 4 labeled boxes is C(53, 3) = 23,426).
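
A quick numerical check of this identity (a sketch we add here; note the √2 on the cross term, which is easy to lose):

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = <x, z>^2 on R^2:
    # phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2).
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2     # kernel computed in R^2
rhs = phi(x) @ phi(z)  # inner product computed in R^3
print(np.isclose(lhs, rhs))  # True
```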

14 We can either: 1. calculate ⟨x, z⟩ (the inner product of two four-dimensional vectors) and raise this value (a single float) to the 50th power, OR 2. map x and z to φ(x) and φ(z), and calculate the inner product of two 23,426-dimensional vectors. Which is faster?
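
The kernel evaluation, of course. Here is a sketch of the gap, using degree 2 so the explicit map is still feasible to materialize (for degree 50 in R^4 it would already be a 23,426-dimensional vector per point):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 1000
x, z = rng.normal(size=d), rng.normal(size=d)

# Option 1: one O(d) dot product, then a scalar power.
k_direct = (x @ z) ** 2

# Option 2: materialize all d^2 pairwise products x_i * x_j
# (a 1,000,000-dimensional feature vector here), then take the inner product.
phi_x = np.outer(x, x).ravel()
phi_z = np.outer(z, z).ravel()
k_explicit = phi_x @ phi_z

print(np.isclose(k_direct, k_explicit))  # True; option 1 is vastly cheaper
```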

15 Commonly Used Kernels
Some commonly used kernels are (one-line sketches follow the list):
- Linear Kernel: K(x, z) = ⟨x, z⟩
- Polynomial Kernel: K(x, z) = ⟨x, z⟩^d, where d is the degree of the polynomial
- Gaussian RBF: K(x, z) = exp(−‖x − z‖² / (2σ²)) = exp(−γ‖x − z‖²)
- Sigmoid Kernel: K(x, z) = tanh(α x^T z + β)
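
For reference, one-line implementations of these four kernels (a sketch; the parameter defaults are ours, chosen only for illustration):

```python
import numpy as np

linear  = lambda x, z: x @ z
poly    = lambda x, z, d=3: (x @ z) ** d
rbf     = lambda x, z, gamma=0.5: np.exp(-gamma * np.sum((x - z) ** 2))
sigmoid = lambda x, z, alpha=0.1, beta=0.0: np.tanh(alpha * (x @ z) + beta)
```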

16 Other Kernels: 1. Laplacian Kernel 2. ANOVA Kernel 3. Circular Kernel 4. Spherical Kernel 5. Wave Kernel 6. Power Kernel 7. Log Kernel 8. B-Spline Kernel 9. Bessel Kernel 10. Cauchy Kernel 11. χ² Kernel 12. Wavelet Kernel 13. Bayesian Kernel 14. Histogram Kernel. As you can see, there are a lot of kernels.

17 The Gaussian RBF Kernel is infinite-dimensional. In class, Barnabás mentioned that the Gaussian RBF Kernel corresponds to an infinite-dimensional vector space. From the Moore-Aronszajn theorem, we know there is a unique Hilbert space of functions for which the Gaussian RBF is a reproducing kernel. One can show that this Hilbert Space has an infinite basis, and so the Gaussian RBF Kernel corresponds to an infinite-dimensional vector space. A YouTube video of Abu-Mostafa explaining this in more detail is included in the references at the end of this presentation.

18 Kernel Theory

19 One Weird Kernel Trick
The kernel trick is one of the reasons kernel methods are so popular in machine learning. With the kernel trick, we can substitute any dot product with a kernel function. This means any linear dot product in the formulation of a machine learning algorithm can be replaced with a non-linear kernel function. The non-linear kernel function may allow us to find separation or structure in the data that was not present under the linear dot product.
Kernel Trick: ⟨x_i, x_j⟩ can be replaced with K(x_i, x_j) for any valid kernel function K. Furthermore, we can swap any kernel function for any other kernel function.
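
For example, any expression built from dot products can be kernelized. The squared distance between two points in feature space expands into three inner products, so it can be computed without ever constructing φ (a sketch; `rbf` is the helper defined above):

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def feature_space_sq_dist(x, z, k):
    # ||phi(x) - phi(z)||^2 = <phi(x),phi(x)> - 2<phi(x),phi(z)> + <phi(z),phi(z)>
    #                       = k(x, x) - 2 k(x, z) + k(z, z)
    return k(x, x) - 2 * k(x, z) + k(z, z)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(feature_space_sq_dist(x, z, rbf))
```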

20 Some examples of algorithms that take advantage of the kernel trick (a usage sketch follows the list):
- Support Vector Machines (the poster child for kernel methods)
- Kernel Principal Component Analysis
- Kernel Independent Component Analysis
- Kernel K-Means
- Kernel Gaussian Process Regression
- Kernel Deep Learning
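
In practice, libraries expose the kernel as a plug-in argument. A minimal sketch with scikit-learn (assuming it is installed; the dataset and settings are ours): the same SVM separates concentric circles with an RBF kernel but not with a linear one.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in R^2.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kern in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kern).fit(X, y)
    print(kern, clf.score(X, y))  # rbf should be near 1.0; linear near chance
```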

21 Representer Theorem
Theorem. Let X be a non-empty set and k a positive-definite real-valued kernel on X × X with corresponding reproducing kernel Hilbert space H_k. Given a training sample (x_1, y_1), ..., (x_n, y_n) ∈ X × R, a strictly monotonically increasing real-valued function g : [0, ∞) → R, and an arbitrary empirical loss function E : (X × R²)^n → R, any f̂ satisfying
f̂ = argmin_{f ∈ H_k} E[(x_1, y_1, f(x_1)), ..., (x_n, y_n, f(x_n))] + g(‖f‖)
can be represented in the form
f̂(·) = Σ_{i=1}^n α_i k(·, x_i), where α_i ∈ R for all i.

22 Example: Let H_k be some RKHS function space with kernel k(·,·). Let (x_1, y_1), ..., (x_n, y_n) be a training input-output set; our task is to find the f that minimizes the following regularized empirical loss:
f̂ = argmin_{f ∈ H_k} (Σ_{i=1}^n f(x_i)^6) Σ_{i=1}^n [sin(x_i^{y_i f(x_i)})^{25} + |y_i − f(x_i)|^{249}] + exp(‖f‖_F)

23 f̂ = argmin_{f ∈ H_k} (Σ_{i=1}^n f(x_i)^6) Σ_{i=1}^n [sin(x_i^{y_i f(x_i)})^{25} + |y_i − f(x_i)|^{249}] + exp(‖f‖_F)
Well, exp(·) is a strictly monotonically increasing function, so exp(‖f‖_F) fits the g(‖f‖) part. (Σ_{i=1}^n f(x_i)^6) Σ_{i=1}^n [sin(x_i^{y_i f(x_i)})^{25} + |y_i − f(x_i)|^{249}] is some arbitrary empirical loss function of the (x_i, y_i, f(x_i)). So this optimization can be expressed as
f̂ = argmin_{f ∈ H_k} E[(x_1, y_1, f(x_1)), ..., (x_n, y_n, f(x_n))] + g(‖f‖).
Therefore, by the Representer Theorem, f̂(·) = Σ_{i=1}^n α_i K(x_i, ·).
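
For a friendlier loss, the α's even have a closed form. With squared loss and penalty g(‖f‖) = λ‖f‖², the minimizer is kernel ridge regression, and solving for the representer coefficients gives α = (G + λI)^(−1) y. A sketch under those assumptions (the data and parameters are ours):

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

# Representer theorem: f_hat(.) = sum_i alpha_i k(., x_i).
# For squared loss + ridge penalty, alpha = (G + lam * I)^{-1} y.
G = np.array([[rbf(xi, xj) for xj in X] for xi in X])
lam = 0.1
alpha = np.linalg.solve(G + lam * np.eye(len(X)), y)

def f_hat(x_new):
    # Prediction is exactly the finite kernel expansion the theorem guarantees.
    return sum(a * rbf(x_new, xi) for a, xi in zip(alpha, X))

print(f_hat(np.array([1.0])))  # should be near sin(1.0) ~ 0.84
```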

24 Do You Want to Build a Snowman Kernel? It doesn't have to be a Snowman Kernel.

25 Operations that Preserve/Create Kernels
Let k_1, ..., k_m be valid kernel functions defined over some set X with |X| = n, and let α_1, ..., α_m be non-negative coefficients. The following are valid kernels (see the sketch after this list):
- K(x, z) = Σ_{i=1}^m α_i k_i(x, z) (closed under non-negative linear combination)
- K(x, z) = Π_{i=1}^m k_i(x, z) (closed under multiplication)
- K(x, z) = k_1(f(x), f(z)) for any function f : X → X
- K(x, z) = g(x) g(z) for any function g : X → R
- K(x, z) = x^T A^T A z for any matrix A ∈ R^(m×n)
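
These closure rules compose mechanically, so they are easy to mirror in code. A sketch of higher-order helpers (the names are ours, not from the slides) that build new kernels from old ones:

```python
import numpy as np

def k_sum(kernels, alphas):
    # Non-negative linear combination of kernels.
    return lambda x, z: sum(a * k(x, z) for a, k in zip(alphas, kernels))

def k_prod(kernels):
    # Product of kernels.
    return lambda x, z: np.prod([k(x, z) for k in kernels])

def k_warp(k1, f):
    # k1(f(x), f(z)) for any f: X -> X.
    return lambda x, z: k1(f(x), f(z))

def k_outer(g):
    # g(x) * g(z) for any g: X -> R.
    return lambda x, z: g(x) * g(z)

lin = lambda x, z: x @ z
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))
k_new = k_sum([lin, rbf], [0.5, 2.0])  # still a valid kernel
```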

26 Proof: K(x, z) = Σ_{i=1}^m α_i k_i(x, z) is a valid kernel.
Symmetry: K(z, x) = Σ_{i=1}^m α_i k_i(z, x). Because each k_i is a kernel function, it is symmetric, so k_i(z, x) = k_i(x, z). Therefore K(z, x) = Σ_{i=1}^m α_i k_i(z, x) = Σ_{i=1}^m α_i k_i(x, z) = K(x, z).
Positive semi-definiteness: Let u ∈ R^n be arbitrary. The Gram matrix of K, denoted G, has entries G_{jℓ} = K(x_j, x_ℓ) = Σ_{i=1}^m α_i k_i(x_j, x_ℓ), so G = α_1 G_1 + ... + α_m G_m. Now u^T G u = u^T (α_1 G_1 + ... + α_m G_m) u = α_1 u^T G_1 u + ... + α_m u^T G_m u = Σ_{i=1}^m α_i u^T G_i u. Each u^T G_i u ≥ 0 (each k_i is a valid kernel) and each α_i ≥ 0, so u^T G u ≥ 0.

27 Proof: K(x, z) = x^T A^T A z for any matrix A ∈ R^(m×n) is a valid kernel.
For this proof, we show that K(x, z) is an inner product on some Hilbert Space. Let φ(x) = Ax. Then ⟨φ(x), φ(z)⟩ = φ(x)^T φ(z) = (Ax)^T (Az) = x^T A^T A z = K(x, z). Therefore, K(x, z) is an inner product on some Hilbert Space.
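
A short numerical confirmation of the identity in this proof (a sketch with arbitrary random inputs):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))                    # any A in R^{m x n}
x, z = rng.normal(size=3), rng.normal(size=3)

# phi(x) = Ax, so x^T A^T A z = <Ax, Az>.
print(np.isclose(x @ A.T @ A @ z, (A @ x) @ (A @ z)))  # True
```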

28 Just because it is a valid kernel does not mean it is a good kernel.


30 Some Exercises
Prove the following are not valid kernels (a numerical sanity check for exercise 2 follows below):
1. K(x, z) = k_1(x, z) − k_2(x, z) for k_1, k_2 valid kernels
2. K(x, z) = exp(γ‖x − z‖²) for some γ > 0
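
A numerical experiment is not a proof, but it can point you toward a counterexample: if the Gram matrix of a candidate function has a negative eigenvalue, the function cannot be a valid kernel. A sketch for exercise 2 (`bad_k` is our name for the candidate):

```python
import numpy as np

def bad_k(x, z, gamma=1.0):
    # Candidate from exercise 2: note the positive exponent.
    return np.exp(gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 2))
G = np.array([[bad_k(xi, xj) for xj in X] for xi in X])

# Diagonal entries are 1 while off-diagonals exceed 1, so some 2x2 principal
# minor is negative and G must have a negative eigenvalue.
print(np.linalg.eigvalsh(G).min())  # negative => not positive semi-definite
```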

31 Current Research in Kernel Theory

32 Flaxman Test
Correlates of homicide: New space/time interaction tests for spatiotemporal point processes.
Goal: Develop a statistical test for the strength of interaction between time and space for different types of homicides.
Result: Using Reproducing Kernel Hilbert spaces, one can project the space-time data into a higher-dimensional space and test for a significant interaction between the kernelized distances in space and time using the Hilbert-Schmidt Independence Criterion.
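
The slides don't give the estimator, but for reference a standard (biased) empirical HSIC is tr(K H L H)/(n−1)², with H the centering matrix; larger values indicate stronger dependence. A sketch under those assumptions:

```python
import numpy as np

def gram(X, gamma=1.0):
    # Pairwise RBF Gram matrix for a sample of shape (n, d).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def hsic(X, Y, gamma=1.0):
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    K, L = gram(X, gamma), gram(Y, gamma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(4)
t = rng.uniform(size=(100, 1))
print(hsic(t, np.sin(4 * t)))               # dependent: noticeably larger
print(hsic(t, rng.uniform(size=(100, 1))))  # independent: near zero
```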

33 Fastfood
Fastfood: Approximating Kernel Expansions in Loglinear Time.
Problem: Kernel methods do not scale well to large datasets, especially at prediction time, because the algorithm has to compute the kernel distance between the new vector and all of the data in the training set.
Solution: Using advanced mathematical black magic, dense random Gaussian matrices can be approximated using Hadamard matrices and diagonal Gaussian matrices:
V = (1/(σ√d)) S H G Π H B

34 Here Π ∈ {0, 1}^(d×d) is a permutation matrix, H is a Hadamard matrix, S is a random scaling matrix with elements along the diagonal, B has random ±1 entries along the diagonal, and G has random Gaussian entries along the diagonal. This algorithm performs 100x faster than the previous leading algorithm (Random Kitchen Sinks) and uses 1000x less memory.
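
A schematic sketch of one Fastfood block, materializing every matrix explicitly (which defeats the loglinear cost; real implementations apply H via the fast Walsh-Hadamard transform and never form it). Two assumptions of ours: the scaling matrix S is simplified to Gaussian entries here, where the paper draws it from a chi-based law, and the cos/sin featurization is the usual random-Fourier-features recipe rather than anything specified on the slide.

```python
import numpy as np
from scipy.linalg import hadamard

d, sigma = 8, 1.0  # d must be a power of two
rng = np.random.default_rng(5)

H = hadamard(d).astype(float)                 # Hadamard matrix
B = np.diag(rng.choice([-1.0, 1.0], size=d))  # random +/-1 diagonal
P = np.eye(d)[rng.permutation(d)]             # permutation matrix (Pi)
G = np.diag(rng.normal(size=d))               # Gaussian diagonal
S = np.diag(rng.normal(size=d))               # scaling diagonal (simplified)

V = (S @ H @ G @ P @ H @ B) / (sigma * np.sqrt(d))

x = rng.normal(size=d)
features = np.concatenate([np.cos(V @ x), np.sin(V @ x)])  # RFF-style features
print(features.shape)  # (16,)
```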

35 References
- Representer Theorem and Kernel Methods
- Predicting Structured Data by Alex Smola
- Kernel Machines
- String and Tree Kernels
- Why Gaussian Kernel is Infinite (the Abu-Mostafa YouTube video mentioned on slide 17)

36 Questions?
