Kernel Methods. Charles Elkan (elkan@cs.ucsd.edu), October 17, 2007

Remember the xor example of a classification problem that is not linearly separable. If we map every example into a new representation, then the problem becomes linearly separable. Specifically, ... (one such remapping is sketched in the short code example below).

The major disadvantage of mapping points into a new space is that the new space may have very high dimension. For example, if points lie in d-dimensional Euclidean space and we include the product of every pair of dimensions, then we have quadratic blowup with the mapping f : R^d → R^(d^2). We can avoid this explosion if we can achieve two objectives:

1. Rewrite our learning algorithm so that, instead of using f(x) and f(y) directly, it only uses the dot product K(x, y) = f(x) · f(y).

2. Compute the dot products K(x, y) in some indirect way, that is, without computing f(x) and f(y) explicitly.

These two objectives together are called the kernel trick. Achieving the first one is called kernelizing the learning algorithm. This means rewriting the algorithm so that it uses training examples solely by storing them and by computing dot products involving them. When we map examples into a higher-dimensional space, linear classification in the new space is typically equivalent to nonlinear classification in the original space. Thus the kernel trick lets us extend a linear classification method such as the perceptron to a nonlinear method. The kernel trick was first published in 1964 by Aizerman et al., but it was not explored further until the 1990s, when it was investigated extensively in the context of support vector machines; more recently it has been applied to many other learning methods.
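To make the remapping idea concrete, here is a small Python sketch. It is an illustration only: the mapping shown (appending the product of the two inputs as a third feature) is one standard choice, not necessarily the exact mapping used in the earlier lecture.

# Illustrative example: xor-style labels with +/-1 encoding. The class is +1
# exactly when the two inputs agree, which is not linearly separable in the
# original two features.
points = [(-1, -1, +1), (-1, +1, -1), (+1, -1, -1), (+1, +1, +1)]  # (x1, x2, label)

def remap(x1, x2):
    """Map (x1, x2) into R^3 by appending the product feature x1 * x2."""
    return (x1, x2, x1 * x2)

# In the new space the weight vector w = (0, 0, 1) separates the classes,
# since sign(w . f(x)) = sign(x1 * x2).
w = (0.0, 0.0, 1.0)
for x1, x2, label in points:
    f = remap(x1, x2)
    score = sum(wi * fi for wi, fi in zip(w, f))
    prediction = +1 if score > 0 else -1
    assert prediction == label
print("linearly separable after remapping")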

For a simple example, consider kernelizing the perceptron. Remember the basic algorithm:

    w := 0
    repeat for T epochs:
        for i = 1 to m:
            if y_i ≠ sign(w · x_i) then w := w + y_i x_i

At each point in the execution of the algorithm, the vector w is a sum of data points, w = Σ_{j ∈ J} y_j x_j for some subset J of {1, 2, ..., m}. So w · x = Σ_{j ∈ J} y_j (x_j · x). Now suppose we have a kernel function K. The prediction for an example x is just sign(Σ_{j ∈ J} y_j K(x_j, x)). Given m training points, how many kernel calculations do we need in order to make a prediction? Because J is a subset of {1, 2, ..., m}, we may need up to m kernel calculations. This is true regardless of how many epochs of training we do, since there are still only m different possible values of K(x_j, x). Every training example x_j that is actually used in the final classifier is called a support vector. The kernelized perceptron is thus a type of support vector machine, although conventionally that name is reserved for a different learning approach. (A short code sketch of the kernelized perceptron is given below.)

Intuitively, the perceptron algorithm is a good one to kernelize because the proof of its convergence is independent of the dimensionality of the data. This means that the dimensionality of f(x) is irrelevant as a determinant of the generalization ability of the learned classifier. Instead, what is relevant is the ratio R/δ of the radius of the data to the margin, as measured in the new space.

Now that we have seen how to kernelize at least one learning algorithm, we can turn to the second part of the kernel trick, which is to compute interesting kernel functions efficiently. Suppose data points x live in d-dimensional Euclidean space, and suppose we want to re-represent them with every quadratic combination of features. If we can do this, we can learn separating surfaces that are quadratic in the original space R^d. (Quadratic surfaces include circles, ellipses, parabolas, and their analogs in high dimensions.) Suppose we wish to re-represent a point x in a quadratic way, that is, as the vector of products f(x) = ⟨..., x_i x_j, ...⟩ for all i and j, including i = j. This vector has length d^2. The dot product after re-representation is the sum of all d^2 terms x_i x_j y_i y_j. How can we compute this efficiently, that is, indirectly? Consider (x · y)^2 = (x_1 y_1 + x_2 y_2 + ... + x_d y_d)^2. This is a sum of all d^2 possible terms of the form x_i y_i x_j y_j. By commutativity x_i y_i x_j y_j = x_i x_j y_i y_j, and it follows that f(x) · f(y) = (x · y)^2.
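The following is a minimal Python sketch of the kernelized perceptron described above. The function names, the default number of epochs, and the use of plain Python sequences are illustrative choices, not part of the original notes; the coefficient alpha[j] records how many times example j was added to w, so w is never formed explicitly.

def linear_kernel(a, b):
    """Plain dot product; any kernel function K(x, y) can be substituted here."""
    return sum(ai * bi for ai, bi in zip(a, b))

def kernel_perceptron_train(X, y, kernel, epochs=10):
    """Dual (kernelized) perceptron sketch. epochs=10 is an arbitrary default."""
    m = len(X)
    alpha = [0] * m          # alpha[j] = number of times example j was added to w
    for _ in range(epochs):
        for i in range(m):
            # The score uses only kernel values involving stored examples.
            score = sum(alpha[j] * y[j] * kernel(X[j], X[i])
                        for j in range(m) if alpha[j] > 0)
            if score * y[i] <= 0:      # mistake (a score of exactly 0 counts as one)
                alpha[i] += 1
    return alpha

def kernel_perceptron_predict(X, y, alpha, kernel, x_new):
    score = sum(alpha[j] * y[j] * kernel(X[j], x_new)
                for j in range(len(X)) if alpha[j] > 0)
    return +1 if score >= 0 else -1

The examples with alpha[j] > 0 are exactly the support vectors, so, as noted above, at most m kernel evaluations are needed per prediction, no matter how many epochs were run.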

Now consider (1 + x · y)^2 = 1 + 2 x · y + f(x) · f(y). This equals g(x) · g(y) where g(x) = ⟨1, ..., √2 x_i, ..., x_i x_j, ...⟩, which is a re-representation of x that includes all products of degree 0, 1, and 2 of elements of x. By definition, then, every quadratic function of x can be written as w · g(x) for some weight vector w; the weight vector w can adjust for the fact that some components of g(x) are scaled by √2.

In general, the function K(x, y) = (1 + x · y)^n is called the polynomial kernel of degree n. The binomial theorem says that

    (a + b)^n = Σ_{k=0}^{n} (n choose k) a^(n−k) b^k.

Let a = 1 and b = x · y. We obtain

    K(x, y) = (1 + x · y)^n = Σ_{k=0}^{n} (n choose k) (x · y)^k.

It is not hard to show that K(x, y) = h(x) · h(y) where h(x) is a vector that includes all products of degree n and lower of components of x, with scaling factors on the components.

Kernels can also be defined for data points that are not real-valued vectors. As long as f(x) is a vector of the same length regardless of x, x itself need not be real-valued and need not be of fixed size; for example, x may be a sequence of arbitrary length. Let A be an alphabet, and consider some total ordering of all strings over this alphabet, for example a lexicographic ordering over pairs ⟨m, s⟩ where m is the length of the string s. Let s_n be string number n in this ordering. Now let x be any string and define f_n(x) to be the number of times s_n appears as a substring of x, including overlapping appearances. Let f(x) be the representation ⟨f_0(x), f_1(x), ...⟩. Note that the vector f(x) is always of infinite length, but for any string x of length m there exists a threshold T ≤ Σ_{i=0}^{m} |A|^i such that all components f_t(x) are zero for t > T. Hence for any two strings the dot product f(x) · f(y) is a finite integer.

Given two strings, how do we compute K(x, y) = f(x) · f(y)? For each occurrence of each substring of x, count how often that substring appears in y, and sum up these counts. This can be done in O(|x| |y|) time using dynamic programming, where |y| is the length of string y.
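As a sanity check of this definition, here is a brute-force Python sketch that computes K(x, y) = Σ_s f_s(x) f_s(y) by explicitly counting substring occurrences. It is not the O(|x| |y|) dynamic-programming algorithm mentioned above, and the function names are illustrative.

from collections import Counter

def substring_counts(s):
    """Count every occurrence of every non-empty substring of s,
    including overlapping occurrences (brute force, not the DP method)."""
    counts = Counter()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            counts[s[i:j]] += 1
    return counts

def substring_kernel(x, y):
    """K(x, y) = sum over substrings s of f_s(x) * f_s(y)."""
    cx, cy = substring_counts(x), substring_counts(y)
    return sum(n * cy.get(sub, 0) for sub, n in cx.items())

print(substring_kernel("abab", "ab"))   # 6: "a", "b", "ab" each occur twice in "abab" and once in "ab"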

Intuitively, a kernel is a similarity function. However, we cannot use just any heuristic similarity function as a genuine kernel. To be a genuine kernel, a function must be a dot product in some space. Mathematically, Mercer's theorem says when a function K has this property. First we need two definitions.

Definition: A function K : X × X → R is symmetric if and only if K(x, y) = K(y, x) for all x and y in X.

Definition: A function K : X × X → R is non-negative definite if and only if

    Σ_{i,j} K(x_i, x_j) c_i c_j ≥ 0

for every finite subset {x_1, ..., x_n} of X and all real numbers c_1, ..., c_n. The sum above is equal to c^T K c, where c is the column vector ⟨c_1, ..., c_n⟩^T and K is the matrix whose ij-th entry is K(x_i, x_j). A symmetric non-negative definite matrix has non-negative eigenvalues.

Now we can state Mercer's theorem.

Theorem [James Mercer, 1909]: There exist a d and a function f : X → R^d such that K(x, y) = f(x) · f(y) if and only if K(x, y) is symmetric and non-negative definite.

One of the most widely used kernels is based on Euclidean distance. The Gaussian kernel is defined to be K(x, y) = exp(−‖x − y‖^2 / s^2), where s is a parameter. This is also called the radial basis function kernel with parameter s. It can be proved that this function K is positive semi-definite, so we can use it as a kernel without knowing explicitly what space it corresponds to. If we learn a linear classifier using a Gaussian kernel, the result is similar to a nearest-neighbor classifier. Given a test example z, we compute its distance to every support vector selected by the training algorithm. Support vectors that are not near z contribute essentially nothing to the prediction for z. With a Gaussian kernel, the support vectors are the centers of the radial basis functions. A remarkable feature of the kernelized perceptron with a Gaussian kernel is that the training algorithm automatically chooses how many centers to use. However, which centers are chosen, and how many, depends on the width s^2 of the kernel, which is not chosen automatically.

Kernels possess useful closure properties. Given two kernels, we can make new ones by adding and multiplying them with each other, and by adding or multiplying by non-negative scalar constants. We can also raise a kernel to a positive integer power, and hence also exponentiate a kernel, because the exponential function is the limit of a series of polynomials.
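As a small numerical illustration (not a proof) of the two conditions in Mercer's theorem, the following Python sketch builds the Gram matrix of the Gaussian kernel defined above on a handful of random points and checks that it is symmetric with non-negative eigenvalues up to floating-point tolerance. The sample size, the parameter s, and the tolerance are arbitrary choices.

import numpy as np

def gaussian_kernel(x, y, s=1.0):
    """K(x, y) = exp(-||x - y||^2 / s^2), the Gaussian (RBF) kernel defined above."""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / s ** 2)

rng = np.random.default_rng(0)            # arbitrary seed and sample for illustration
points = [rng.standard_normal(3) for _ in range(6)]

# Gram matrix: the ij-th entry is K(x_i, x_j).
K = np.array([[gaussian_kernel(a, b) for b in points] for a in points])

print(np.allclose(K, K.T))                        # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))    # eigenvalues non-negative (within tolerance)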

We may also normalize a kernel, i.e., use N(x, y) = K(x, y) / √(K(x, x) K(y, y)). The closure properties for kernels let us use, for example, the general polynomial kernel K(x, y) = (x · y + R)^d for any constant R ≥ 0.

The great flexibility in defining kernels invites the question of which kernels are better. This question has no definitive answer. At one extreme, a kernel that gives a diagonal kernel matrix is useless, because each example then appears similar only to itself. At the other extreme, a perfect kernel divides the data into disjoint subspaces, each of which has the same label.
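To make these last two constructions concrete, here is a short Python sketch of the normalized kernel N and of the general polynomial kernel (x · y + R)^d. The function names and parameter defaults are illustrative.

import math

def polynomial_kernel(R=1.0, d=2):
    """Return the general polynomial kernel K(x, y) = (x . y + R)^d with R >= 0.
    The defaults R=1.0 and d=2 are illustrative."""
    def K(x, y):
        return (sum(a * b for a, b in zip(x, y)) + R) ** d
    return K

def normalize(kernel):
    """Return the normalized kernel N(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y))."""
    def N(x, y):
        return kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))
    return N

K = polynomial_kernel(R=1.0, d=3)
N = normalize(K)
x, y = (1.0, 2.0), (3.0, 0.5)
print(K(x, y))     # (1*3 + 2*0.5 + 1)^3 = 125.0
print(N(x, x))     # 1.0: a normalized kernel always has N(x, x) = 1

The same normalize wrapper can be applied to any kernel obtained from the closure properties above.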