CIS 520: Machine Learning, Oct 09, 2017
Kernel Methods
Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
- Non-linear models via basis functions
- Closer look at the SVM dual: kernel functions, kernel SVM
- RKHSs and Representer Theorem
- Kernel logistic regression
- Kernel ridge regression

1  Non-linear Models via Basis Functions

Let X = R^d. We have seen methods for learning linear models of the form h(x) = sign(w^T x + b) for binary classification (such as logistic regression and SVMs) and f(x) = w^T x + b for regression (such as linear least squares regression and SVR). What if we want to learn a non-linear model? What would be a simple way to achieve this using the methods we have seen so far?

One way to achieve this is to map instances x ∈ R^d to new feature vectors φ(x) ∈ R^n via some non-linear feature mapping φ : R^d → R^n, and then to learn a linear model in this transformed space. For example, if one maps instances x ∈ R^d to the n = (1 + 2d + (d choose 2))-dimensional feature vectors

    φ(x) = (1, x_1, …, x_d, x_1^2, …, x_d^2, x_1 x_2, …, x_{d-1} x_d)^T,

then learning a linear model in the transformed space is equivalent to learning a quadratic model in the original instance space.
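For concreteness, here is a minimal Python sketch of the quadratic feature map above, together with a check of the dimension count 1 + 2d + (d choose 2). The code is not part of the original notes; the function name and the toy input are purely illustrative.

    # Quadratic feature map: x in R^d  ->  (1, x_1..x_d, x_1^2..x_d^2, cross terms x_i x_j, i<j)
    import numpy as np
    from itertools import combinations

    def quadratic_features(x):
        x = np.asarray(x, dtype=float)
        cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
        return np.concatenate(([1.0], x, x ** 2, cross))

    x = np.array([2.0, -1.0, 3.0])            # d = 3
    phi_x = quadratic_features(x)
    assert len(phi_x) == 1 + 2 * 3 + 3        # n = 1 + 2d + C(d,2) = 10
    # A linear model w^T phi(x) + b in this space is a quadratic model in x.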

In general, one can choose any basis functions φ_1, …, φ_n : X → R, and learn a linear model over these: w^T φ(x) + b, where w ∈ R^n (in fact, one can do this for X ≠ R^d as well). For example, in least squares regression applied to a training sample S = ((x_1, y_1), …, (x_m, y_m)) ∈ (R^d × R)^m, one would simply replace the matrix X ∈ R^{m×d} with the design matrix Φ ∈ R^{m×n}, where Φ_{ij} = φ_j(x_i). What is a potential difficulty in doing this?

If n is large (e.g. as would be the case if the feature mapping φ corresponded to a high-degree polynomial), then the above approach can be computationally expensive. In this lecture we look at a technique that allows one to implement the above idea efficiently for many algorithms. We start by taking a closer look at the SVM dual which we derived in the last lecture.

2  Closer Look at the SVM Dual: Kernel Functions, Kernel SVM

Recall the form of the dual we derived for the (soft-margin) linear SVM:

    max_α       −(1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i^T x_j) + Σ_{i=1}^m α_i        (1)
    subject to  Σ_{i=1}^m α_i y_i = 0                                                          (2)
                0 ≤ α_i ≤ C,  i = 1, …, m                                                      (3)

If we implement this on feature vectors φ(x_i) ∈ R^n in place of x_i ∈ R^d, we get the following optimization problem:

    max_α       −(1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (φ(x_i)^T φ(x_j)) + Σ_{i=1}^m α_i   (4)
    subject to  Σ_{i=1}^m α_i y_i = 0                                                          (5)
                0 ≤ α_i ≤ C,  i = 1, …, m                                                      (6)

This involves computing dot products between vectors φ(x_i), φ(x_j) in R^n. Similarly, using the learned model to make predictions on a new test point x ∈ R^d also involves computing dot products between vectors in R^n:

    h(x) = sign( Σ_{i∈SV} α_i y_i φ(x_i)^T φ(x) + b ).

For example, as we saw above, one can learn a quadratic classifier in X = R^2 by learning a linear classifier in φ(R^2) ⊆ R^6, where

    φ((x_1, x_2)) = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2)^T;

clearly, a straightforward approach to learning an SVM classifier in this space (and applying it to a new test point) will involve computing dot products in R^6 (more generally, when learning a degree-q polynomial classifier in R^d, such a straightforward approach will involve computing dot products in R^n for n = O(d^q)).
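To connect this back to the least squares example at the start of this section, the following short numpy sketch builds the design matrix Φ with Φ_{ij} = φ_j(x_i) for the quadratic basis functions in R^2 and solves the resulting linear least squares problem. The synthetic data and the helper name design_matrix are assumptions made for illustration, not part of the notes.

    # Least squares with a design matrix Phi in place of X (Phi_ij = phi_j(x_i))
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))                        # m = 50 instances in R^2
    y = 1.0 + X[:, 0] - 2.0 * X[:, 1] ** 2 + 0.1 * rng.normal(size=50)

    def design_matrix(X):
        # basis functions phi_1, ..., phi_6 : the quadratic map for d = 2
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1*x2])

    Phi = design_matrix(X)                              # m x n design matrix
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # linear least squares in feature space
    # Prediction on a new point x: w_hat @ design_matrix(x[None, :])[0]

The point of the rest of the lecture is that for high-degree polynomial (or richer) feature maps, one can avoid ever forming Φ explicitly.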

Now, consider replacing the dot products φ(x)^T φ(x′) in the above example with K(x, x′), where for x, x′ ∈ R^2, K(x, x′) = (x^T x′ + 1)^2. It can be verified (exercise!) that K(x, x′) = φ_K(x)^T φ_K(x′), where

    φ_K((x_1, x_2)) = (1, x_1^2, x_2^2, √2 x_1, √2 x_2, √2 x_1 x_2)^T.

Thus, using K(x, x′) above instead of φ(x)^T φ(x′) implicitly computes dot products in R^6, with explicit computation required only in R^2!

In fact, one can use any symmetric, positive semi-definite kernel function K : X × X → R (also called a Mercer kernel function) in the SVM algorithm directly, even if the feature space implemented by the kernel function cannot be described explicitly. Any such kernel function yields a convex dual problem; if K is positive definite, then K also corresponds to inner products in some inner product space V (i.e. K(x, x′) = ⟨φ(x), φ(x′)⟩ for some φ : X → V).

For Euclidean instance spaces X = R^d, examples of commonly used kernel functions include the polynomial kernel K(x, x′) = (x^T x′ + 1)^q, which results in learning a degree-q polynomial threshold classifier, and the Gaussian kernel, also known as the radial basis function (RBF) kernel, K(x, x′) = exp(−‖x − x′‖_2^2 / (2σ^2)) (where σ > 0 is a parameter of the kernel), which effectively implements dot products in an infinite-dimensional inner product space; in both cases, evaluating the kernel K(x, x′) at any two points x, x′ requires only O(d) computation time. Kernel functions can also be used for non-vectorial data (X ≠ R^d); for example, kernel functions are often used to implicitly embed instance spaces containing strings, trees, etc. into an inner product space, and to implicitly learn a linear classifier in this space. Intuitively, it is helpful to think of kernel functions as capturing some sort of similarity between pairs of instances in X.

To summarize, given a training sample S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × {±1})^m, in order to learn a kernel SVM classifier using a kernel function K : X × X → R, one simply solves the kernel SVM dual given by

    max_α       −(1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j K(x_i, x_j) + Σ_{i=1}^m α_i     (7)
    subject to  Σ_{i=1}^m α_i y_i = 0                                                      (8)
                0 ≤ α_i ≤ C,  i = 1, …, m,                                                 (9)

and then predicts the label of a new instance x ∈ X according to

    h(x) = sign( Σ_{i∈SV} α_i y_i K(x_i, x) + b ),   where   b = (1/|SV|) Σ_{i∈SV} ( y_i − Σ_{j∈SV} α_j y_j K(x_i, x_j) ).
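The following Python sketch (not from the notes; all names are illustrative) writes out the two kernels above, numerically checks the claimed identity K(x, x′) = φ_K(x)^T φ_K(x′) for the degree-2 polynomial kernel on R^2, and implements the kernel SVM prediction rule and intercept formula assuming the dual variables α have already been obtained by solving (7)-(9) with some solver.

    import numpy as np

    def poly_kernel(x, z, q=2):
        return (x @ z + 1.0) ** q

    def rbf_kernel(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def phi_K(x):                    # explicit map for the q = 2, d = 2 case above
        x1, x2 = x
        return np.array([1.0, x1**2, x2**2,
                         np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2])

    x, z = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
    assert np.isclose(poly_kernel(x, z), phi_K(x) @ phi_K(z))   # kernel trick identity

    def intercept(X_sv, y_sv, alpha_sv, kernel):
        # b = (1/|SV|) sum_i ( y_i - sum_j alpha_j y_j K(x_i, x_j) )
        K_sv = np.array([[kernel(xi, xj) for xj in X_sv] for xi in X_sv])
        return np.mean(y_sv - K_sv @ (alpha_sv * y_sv))

    def kernel_svm_predict(x_new, X_sv, y_sv, alpha_sv, kernel, b):
        # h(x) = sign( sum_{i in SV} alpha_i y_i K(x_i, x) + b )
        scores = np.array([kernel(xi, x_new) for xi in X_sv])
        return np.sign((alpha_sv * y_sv) @ scores + b)

Note that training and prediction only ever touch the data through kernel evaluations, which is exactly what makes the implicit feature space affordable.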

3  RKHSs and Representer Theorem

Let K : X × X → R be a symmetric positive definite kernel function. Let

    F_K^0 = { f : X → R | f(x) = Σ_{i=1}^r α_i K(x_i, x) for some r ∈ Z_+, α_i ∈ R, x_i ∈ X }.

For f, g ∈ F_K^0 with f(x) = Σ_{i=1}^r α_i K(x_i, x) and g(x) = Σ_{j=1}^s β_j K(x′_j, x), define

    ⟨f, g⟩_K = Σ_{i=1}^r Σ_{j=1}^s α_i β_j K(x_i, x′_j)     (10)
    ‖f‖_K = √⟨f, f⟩_K                                       (11)

Let F_K be the completion of F_K^0 under the metric induced by the above norm.[1] Then F_K is called the reproducing kernel Hilbert space (RKHS) associated with K.[2]

Note that the SVM classifier learned using kernel K is of the form h(x) = sign(f(x) + b), where f(x) = Σ_{i∈SV} α_i y_i K(x_i, x), i.e. where f ∈ F_K. In fact, consider the following optimization problem:

    min_{f∈F_K, b∈R}  (1/m) Σ_{i=1}^m ( 1 − y_i(f(x_i) + b) )_+ + λ ‖f‖_K^2.

It turns out that the above SVM solution (with C = 1/(2λm)) is a solution to this problem, i.e. the kernel SVM solution minimizes the RKHS-norm regularized hinge loss over all functions of the form f(x) + b for f ∈ F_K, b ∈ R. More generally, we have the following result:

Theorem 1 (Representer Theorem). Let K : X × X → R be a positive definite kernel function. Let Y ⊆ R. Let S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × Y)^m. Let L : R^m × Y^m → R. Let Ω : R_+ → R_+ be a monotonically increasing function. Then for λ > 0, there is a solution to the optimization problem

    min_{f∈F_K, b∈R}  L( (f(x_1) + b, …, f(x_m) + b), (y_1, …, y_m) ) + λ Ω(‖f‖_K^2)

of the form

    f(x) = Σ_{i=1}^m α_i K(x_i, x)   for some α_1, …, α_m ∈ R.

If Ω is strictly increasing, then all solutions have this form.

The above result tells us that even if F_K is an infinite-dimensional space, any optimization problem resulting from minimizing a loss over a finite training sample, regularized by some increasing function of the RKHS-norm, is effectively a finite-dimensional optimization problem; moreover, the solution to this problem can be written as a kernel expansion over the training points. In particular, minimizing any other loss over F_K (regularized by the RKHS-norm) will also yield a solution of this form!

Exercise. Show that linear functions f : R^d → R of the form f(x) = w^T x form an RKHS with linear kernel K : R^d × R^d → R given by K(x, x′) = x^T x′ and with ‖f‖_K^2 = ‖w‖_2^2.

[1] The metric induced by the norm ‖·‖_K is given by d_K(f, g) = ‖f − g‖_K. The completion of F_K^0 is simply F_K^0 plus any limit points of Cauchy sequences in F_K^0 under this metric.
[2] The name "reproducing kernel Hilbert space" comes from the following reproducing property: for any x ∈ X, define K_x : X → R as K_x(x′) = K(x, x′); then for any f ∈ F_K, we have ⟨f, K_x⟩_K = f(x).
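Before moving on, here is a small numerical sketch (toy data and names are assumptions for illustration) of the RKHS inner product (10)-(11) for two functions in F_K^0 under the RBF kernel, together with a check of the reproducing property ⟨f, K_x⟩_K = f(x) from footnote [2].

    import numpy as np

    def K(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    rng = np.random.default_rng(1)
    Xf, alpha = rng.normal(size=(4, 2)), rng.normal(size=4)   # f = sum_i alpha_i K(x_i, .)
    Xg, beta  = rng.normal(size=(3, 2)), rng.normal(size=3)   # g = sum_j beta_j K(x'_j, .)

    def f(x):
        return sum(a * K(xi, x) for a, xi in zip(alpha, Xf))

    # <f, g>_K = sum_i sum_j alpha_i beta_j K(x_i, x'_j)
    inner_fg = sum(a * b * K(xi, xj) for a, xi in zip(alpha, Xf)
                                     for b, xj in zip(beta, Xg))

    # Reproducing property: K_x = K(x, .) lies in F_K^0 with a single coefficient 1,
    # so <f, K_x>_K = sum_i alpha_i K(x_i, x) = f(x).
    x0 = rng.normal(size=2)
    inner_f_Kx = sum(a * K(xi, x0) for a, xi in zip(alpha, Xf))
    assert np.isclose(inner_f_Kx, f(x0))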

4  Kernel Logistic Regression

Given a training sample S ∈ (X × {±1})^m and kernel function K : X × X → R, the kernel logistic regression classifier is given by the solution to the following optimization problem:

    min_{f∈F_K, b∈R}  (1/m) Σ_{i=1}^m ln( 1 + e^{−y_i(f(x_i)+b)} ) + λ ‖f‖_K^2.

Since we know from the Representer Theorem that the solution has the form f(x) = Σ_{i=1}^m α_i K(x_i, x), we can write the above as an optimization problem over α, b:

    min_{α∈R^m, b∈R}  (1/m) Σ_{i=1}^m ln( 1 + e^{−y_i(Σ_{j=1}^m α_j K(x_j, x_i) + b)} ) + λ Σ_{i=1}^m Σ_{j=1}^m α_i α_j K(x_i, x_j).

This is of a similar form as in standard logistic regression, with m basis functions φ_j(x) = K(x_j, x) for j ∈ [m] (and w ≡ α)! In particular, define K ∈ R^{m×m} as K_{ij} = K(x_i, x_j) (this is often called the Gram matrix), and let k_i denote the i-th column of this matrix. Then we can write the above as simply

    min_{α∈R^m, b∈R}  (1/m) Σ_{i=1}^m ln( 1 + e^{−y_i(α^T k_i + b)} ) + λ α^T K α,

which is similar in form to standard linear logistic regression (with feature vectors k_i), except that the regularizer is α^T K α rather than ‖α‖_2^2, and can be solved similarly as before, using similar numerical optimization methods. We note that, unlike for SVMs, here in general the solution has α_i ≠ 0 for all i ∈ [m]. A variant of logistic regression called the import vector machine (IVM) adopts a greedy approach to find a subset IV ⊆ [m] such that the function f(x) + b = Σ_{i∈IV} α_i K(x_i, x) + b gives good performance. Compared to SVMs, IVMs can provide more natural class probability estimates, as well as more natural extensions to multiclass classification.
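As an illustration, the sketch below minimizes the finite-dimensional kernel logistic regression objective over (α, b) by plain gradient descent. The toy data, RBF bandwidth, constant step size, and iteration count are all assumptions for the example; in practice one would use a more careful numerical optimizer.

    # Objective: (1/m) sum_i ln(1 + exp(-y_i (alpha^T k_i + b))) + lam * alpha^T K alpha
    import numpy as np

    def gram(X, sigma=1.0):
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))          # Gram matrix, K_ij = K(x_i, x_j)

    def kernel_logreg(X, y, lam=0.1, sigma=1.0, lr=0.2, n_iters=3000):
        m = len(y)
        Kmat = gram(X, sigma)
        alpha, b = np.zeros(m), 0.0
        for _ in range(n_iters):
            s = Kmat @ alpha + b                         # s_i = alpha^T k_i + b
            p = 1.0 / (1.0 + np.exp(y * s))              # sigmoid(-y_i s_i)
            grad_alpha = -(Kmat @ (y * p)) / m + 2.0 * lam * (Kmat @ alpha)
            grad_b = -np.mean(y * p)
            alpha -= lr * grad_alpha
            b -= lr * grad_b
        return alpha, b, Kmat

    rng = np.random.default_rng(2)
    X = rng.normal(size=(40, 2))
    y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)       # labels in {-1, +1}, non-linear boundary
    alpha, b, Kmat = kernel_logreg(X, y)
    train_preds = np.sign(Kmat @ alpha + b)              # predict sign(sum_j alpha_j K(x_j, x) + b)

Consistent with the remark above, the learned α typically has all entries non-zero, unlike the sparse dual solution of an SVM.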

5  Kernel Ridge Regression

Given a training sample S ∈ (X × R)^m and kernel function K : X × X → R, consider first a kernel ridge regression formulation for learning a function f ∈ F_K:

    min_{f∈F_K}  (1/m) Σ_{i=1}^m ( y_i − f(x_i) )^2 + λ ‖f‖_K^2.

Again, since we know from the Representer Theorem that the solution has the form f(x) = Σ_{i=1}^m α_i K(x_i, x), we can write the above as an optimization problem over α:

    min_{α∈R^m}  (1/m) Σ_{i=1}^m ( y_i − Σ_{j=1}^m α_j K(x_j, x_i) )^2 + λ Σ_{i=1}^m Σ_{j=1}^m α_i α_j K(x_i, x_j),

or in matrix notation,

    min_{α∈R^m}  (1/m) Σ_{i=1}^m ( y_i − α^T k_i )^2 + λ α^T K α.

Again, this is of the same form as standard linear ridge regression, with feature vectors k_i and with regularizer α^T K α rather than ‖α‖_2^2. If K is positive definite, in which case the Gram matrix K is invertible, then setting the gradient of the objective above w.r.t. α to zero can be seen to yield

    α = ( K + λm I_m )^{-1} y,

where as before I_m is the m × m identity matrix and y = (y_1, …, y_m)^T ∈ R^m.

Exercise. Show that if X = R^d and one wants to explicitly include a bias term b in the linear ridge regression solution that is not included in the regularization, then defining

    X̃ = [ x_1^T 1 ; … ; x_m^T 1 ] ∈ R^{m×(d+1)},   w̃ = ( w ; b ) ∈ R^{d+1},   L = [ I_d 0 ; 0 0 ] ∈ R^{(d+1)×(d+1)},

one gets the solution w̃ = ( X̃^T X̃ + λm L )^{-1} X̃^T y. How would you extend this to learning a function of the form f(x) + b for f ∈ F_K, b ∈ R in the kernel ridge regression setting?
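Finally, here is a short numpy sketch (toy one-dimensional data assumed) of the closed-form kernel ridge regression solution α = (K + λm I_m)^{-1} y derived above, with predictions f(x) = Σ_i α_i K(x_i, x). It solves the linear system directly rather than forming an explicit matrix inverse, which is the usual numerically preferable choice.

    import numpy as np

    def rbf(A, B, sigma=1.0):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    rng = np.random.default_rng(3)
    X = rng.uniform(-3, 3, size=(60, 1))                      # m = 60 training points
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

    lam, m = 0.01, len(y)
    K = rbf(X, X)                                             # Gram matrix
    alpha = np.linalg.solve(K + lam * m * np.eye(m), y)       # (K + lambda*m*I_m)^{-1} y

    X_test = np.linspace(-3, 3, 200)[:, None]
    f_test = rbf(X_test, X) @ alpha                           # f(x) = sum_i alpha_i K(x_i, x)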