The Kernel Trick

Carlos C. Rodríguez
http://omega.albany.edu:8008/
October 25, 2004

Why don't we do it in higher dimensions?

If SVMs were able to handle only linearly separable data, their usefulness would be quite limited. But someone around 1992 (correct?) had a moment of true enlightenment and realized that if, instead of the Euclidean inner product $\langle x_i, x_j \rangle$, one fed the QP solver a function $K(x_i, x_j)$, the boundary between the two classes would then be
$$K(x, w) + b = 0,$$
and the set of $x \in \mathbb{R}^d$ on that boundary becomes a curved surface embedded in $\mathbb{R}^d$ when the function $K(x, w)$ is non-linear. I don't really know how the history went (Vladimir?), but it could perfectly well have been the case that a courageous soul just fed the QP solver with
$$K(x, w) = \exp(-\|x - w\|^2)$$
and waited for the picture of the classification boundary to appear on the screen, and saw the first non-linear boundary computed with just linear methods. Then, depending on that person's mathematical training, s/he either said "aha! $K(x, w)$ is a kernel in a reproducing kernel Hilbert space" or rushed to the library or to the guy next door to find out, and probably very soon after that said "aha! $K(x, w)$ is the kernel of a RKHS." After knowing about RKHSs that conclusion is inescapable, but that, I believe, does not diminish the importance of this discovery. No. Mathematicians did not know about the kernel trick then, and most of them don't know about the kernel trick now.

Let's go back to the function $K(x, w)$ and try to understand why, even when it is non-linear in its arguments, it still makes sense as a proxy for $\langle x, w \rangle$. The way to think about it is to consider $K(x, w)$ to be the inner product not of the coordinate vectors $x$ and $w$ in $\mathbb{R}^d$ but of vectors $\phi(x)$ and $\phi(w)$ in higher dimensions. The map
$$\phi : X \to H$$
is called a feature map from the data space $X$ into the feature space $H$. The feature space is assumed to be a Hilbert space of real valued functions defined on $X$.
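Here is a minimal sketch of that experiment in code, assuming numpy and scikit-learn are available: the same off-the-shelf QP machinery is run twice, once with plain Euclidean inner products and once fed the $n^2$ numbers $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2)$. The toy dataset (two concentric circles) and the implicit bandwidth are my own illustrative choices, not anything from the original notes.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def gram(X, W):
    """Matrix of K(x, w) = exp(-||x - w||^2) for all pairs of rows of X and W."""
    sq_dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists)

X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

# Linear SVM: the boundary <x, w> + b = 0 is a straight line and fails here.
linear = SVC(kernel="linear").fit(X, y)

# Same QP solver, but it only ever sees the numbers k_ij = K(x_i, x_j);
# the resulting boundary is a curved surface in the original data space.
kernelized = SVC(kernel="precomputed").fit(gram(X, X), y)

print("linear accuracy:    ", linear.score(X, y))               # roughly chance level
print("kernelized accuracy:", kernelized.score(gram(X, X), y))  # close to 1.0
```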

The data space is often $\mathbb{R}^d$, but most of the interesting results hold when $X$ is a compact Riemannian manifold. A particularly simple example is the feature map
$$\phi(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2),$$
which maps data in $\mathbb{R}^2$ into $\mathbb{R}^3$. There are many ways of mapping points in $\mathbb{R}^2$ to points in $\mathbb{R}^3$, but the above has the extremely useful property of allowing the computation of the inner products of feature vectors $\phi(x)$ and $\phi(w)$ in $\mathbb{R}^3$ by just squaring the inner product of the data vectors $x$ and $w$ in $\mathbb{R}^2$! (To appreciate the exclamation point just replace 3 by 10 or by infinity!) That is, in this case
$$K(x, w) = \langle \phi(x), \phi(w) \rangle = x_1^2 w_1^2 + 2 x_1 x_2 w_1 w_2 + x_2^2 w_2^2 = (x_1 w_1 + x_2 w_2)^2 = (\langle x, w \rangle)^2.$$
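A two-line numerical check of this identity (plain numpy; the two test vectors are arbitrary picks of mine):

```python
import numpy as np

def phi(v):
    """The explicit feature map (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([0.3, -1.2])
w = np.array([2.0, 0.7])

lhs = phi(x) @ phi(w)   # inner product of the feature vectors in R^3
rhs = (x @ w) ** 2      # squared inner product of the data vectors in R^2

print(lhs, rhs)         # both equal (0.3*2.0 - 1.2*0.7)^2 = 0.0576
assert np.isclose(lhs, rhs)
```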

OK. But are we really entitled to just Humpty-Dumpty our way up to higher, even infinite, dimensions?

Sure, why not? Just remember that even the coordinate vectors $x$ and $w$ are just that, coordinates, arbitrary labels that stand as proxies for the objects that they label, perhaps images or speech. What is important is the inter-relation of these objects as measured, for example, by the Gram matrix of abstract inner products $(\langle x_i, x_j \rangle)$, and not the labels themselves with the Euclidean dot product. If we think of the $x_i$ as abstract labels (coordinate free), as pointers to the objects that they represent, then we are free to choose values for $\langle x_i, x_j \rangle$ representing the similarity between $x_i$ and $x_j$, and those values can be provided by a non-linear function $K(x, w)$ producing the $n^2$ numbers $k_{ij} = K(x_i, x_j)$ when the abstract labels $x_i$ are replaced with the values of the observed data. In fact, the QP solver sees the data only through the $n^2$ numbers $k_{ij}$. These could, in principle, be chosen without even a kernel function, and the QP solver would still deliver $(w, b)$. The kernel function becomes useful for choosing the classification boundary, but even that could be empirically approximated. Now, of course, arbitrary $n^2$ numbers $k_{ij}$ that disregard the observed data completely will not be of much help, unless they happen to contain cogent prior information about the problem.

So, what's the point that I am trying to make? The point is that, obviously, a choice of kernel function is an ad-hoc way of sweeping prior information into the problem under the rug, indutransductibly (!) ducking the holy Bayes Theorem. There is a kosher path that flies under the banner of Gaussian processes and RVMs, but they are not cheap. We'll look at the way of the Laplacians (formerly a.k.a. Bayesians) later, but let's stick with the SVMs for now.

The question is: What are the constraints on the function $K(x, w)$ so that there exists a Hilbert space $H$ of abstract labels $\phi(x) \in H$ such that $\langle \phi(x_i), \phi(x_j) \rangle = k_{ij}$?

The answer: It is sufficient for $K(x, w)$ to be continuous, symmetric and positive definite. That is, for $X$ either $\mathbb{R}^d$ or a $d$-dimensional compact Riemannian manifold, $K : X \times X \to \mathbb{R}$ satisfies:

1. $K(x, w)$ is continuous.
2. $K(x, w) = K(w, x)$ (symmetric).
3. $K(x, w)$ is p.d. (positive definite), i.e., for any set $\{z_1, \ldots, z_m\} \subset X$ the $m \times m$ matrix $K[z] = (K(z_i, z_j))_{ij}$ is positive definite.

Such a function is said to be a Mercer kernel.
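Continuity has to be checked by inspection, but the other two conditions can at least be sanity-checked on a finite sample. The sketch below (my own illustration, plain numpy) builds the matrix $K[z]$ for the Gaussian kernel used earlier on a random sample and verifies symmetry and positive definiteness through its eigenvalues.

```python
import numpy as np

def K(x, w):
    """The Gaussian kernel K(x, w) = exp(-||x - w||^2)."""
    return np.exp(-np.sum((x - w) ** 2))

rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 2))      # a sample {z_1, ..., z_m} in X = R^2

m = len(Z)
Kz = np.array([[K(Z[i], Z[j]) for j in range(m)] for i in range(m)])  # K[z]

print("symmetric:", np.allclose(Kz, Kz.T))               # condition 2: K(x, w) = K(w, x)
print("min eigenvalue:", np.linalg.eigvalsh(Kz).min())   # > 0: K[z] is positive definite
```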

We have the following theorem.

Theorem. If $K(x, w)$ is a Mercer kernel then there exists a Hilbert space $H_K$ of real valued functions defined on $X$ and a feature map $\phi : X \to H_K$ such that
$$\langle \phi(x), \phi(w) \rangle_K = K(x, w),$$
where $\langle \cdot, \cdot \rangle_K$ is the inner product on $H_K$.

Proof: Let $L$ be the vector space containing all the real valued functions $f$ defined on $X$ of the form
$$f(x) = \sum_{j=1}^{m} \alpha_j K(x_j, x),$$
where $m$ is a positive integer, the $\alpha_j$'s are real numbers, and $\{x_1, \ldots, x_m\} \subset X$. Define on $L$ the inner product
$$\Big\langle \sum_{i=1}^{m} \alpha_i K(x_i, \cdot),\ \sum_{j=1}^{n} \beta_j K(w_j, \cdot) \Big\rangle = \sum_{i=1}^{m} \sum_{j=1}^{n} \alpha_i \beta_j K(x_i, w_j).$$
Since $K$ is a Mercer kernel, the above definition makes $L$ a well defined inner product space. The Hilbert space $H_K$ is obtained when we add to $L$ the limits of all the Cauchy sequences (w.r.t. $\langle \cdot, \cdot \rangle$) in $L$. Notice that the inner product in $H_K$ was defined so that $\langle K(x, \cdot), K(w, \cdot) \rangle_K = K(x, w)$. We can then take the feature map to be
$$\phi(x) = K(x, \cdot).$$
The space $H_K$ is said to be a Reproducing Kernel Hilbert Space (RKHS). Moreover, for all $f \in H_K$,
$$f(x) = \langle K(x, \cdot), f \rangle_K.$$
This reproducing property follows directly from the definition of the inner product when $f \in L$, and by continuity for all $f \in H_K$. It is also easy to show that the reproducing property is equivalent to the continuity of the evaluation functionals $\delta_x(f) = f(x)$.
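To see the construction of $L$ concretely, the sketch below (my own illustration, with the Gaussian kernel and random centers) stores an element $f = \sum_j \alpha_j K(x_j, \cdot)$ as its centers and coefficients, computes the inner product of the proof as a double sum over kernel values, and checks the reproducing property $f(x) = \langle K(x, \cdot), f \rangle_K$ at a random point, using the fact that $K(x, \cdot)$ itself lies in $L$ with a single center $x$ and coefficient 1.

```python
import numpy as np

def K(x, w):
    """The Gaussian kernel K(x, w) = exp(-||x - w||^2)."""
    return np.exp(-np.sum((x - w) ** 2))

def evaluate(centers, alphas, x):
    """f(x) = sum_j alpha_j K(x_j, x) for f represented by (centers, alphas)."""
    return sum(a * K(c, x) for c, a in zip(centers, alphas))

def inner(cf, af, cg, bg):
    """<sum_i a_i K(x_i, .), sum_j b_j K(w_j, .)> = sum_ij a_i b_j K(x_i, w_j)."""
    return sum(a * b * K(xi, wj) for xi, a in zip(cf, af) for wj, b in zip(cg, bg))

rng = np.random.default_rng(1)
centers, alphas = rng.normal(size=(4, 2)), rng.normal(size=4)  # an f in L
x = rng.normal(size=2)

print(evaluate(centers, alphas, x))        # f(x) by direct evaluation
print(inner([x], [1.0], centers, alphas))  # <K(x, .), f>_K gives the same number
```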

The Mercer kernel $K(x, w)$ naturally defines an integral linear operator that (abusing notation a bit) we also denote by $K : H_K \to H_K$, where
$$(Kf)(x) = \int_X K(x, y) f(y)\, dy,$$
and since $K$ is symmetric and positive definite, it is orthogonally diagonalizable (just as in the finite dimensional case). Thus, there is an ordered orthonormal basis $\{\phi_1, \phi_2, \ldots\}$ of eigenvectors of $K$, i.e., for $j = 1, 2, \ldots$
$$K \phi_j = \lambda_j \phi_j,$$
with $\langle \phi_i, \phi_j \rangle_K = \delta_{ij}$ and $\lambda_1 \ge \lambda_2 \ge \cdots > 0$, and such that
$$K(x, w) = \sum_j \lambda_j \phi_j(x) \phi_j(w),$$
from which it follows that the feature map
$$\phi(x) = \sum_j \lambda_j^{1/2} \phi_j(x)\, \phi_j$$
produces
$$K(x, w) = \langle \phi(x), \phi(w) \rangle_K.$$
Notice also that any continuous map $\phi : X \to H$, where $H$ is a Hilbert space, defines $K(x, w) = \langle \phi(x), \phi(w) \rangle$, which is a Mercer kernel.
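A finite-sample analogue of this expansion can be computed directly (my own illustration, plain numpy): diagonalize the Gram matrix of the Gaussian kernel on a sample, take $\sqrt{\lambda_j}\, v_j$ as empirical feature coordinates, and check that their ordinary inner products reproduce the $n^2$ numbers $k_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
Kmat = np.exp(-sq)                       # k_ij = K(x_i, x_j), Gaussian kernel

lam, V = np.linalg.eigh(Kmat)            # orthogonal diagonalization of K[x]
lam = np.clip(lam, 0.0, None)            # guard against tiny negative round-off

Phi = V * np.sqrt(lam)                   # row i = coordinates of phi(x_i)
print(np.allclose(Phi @ Phi.T, Kmat))    # <phi(x_i), phi(x_j)> = k_ij
```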