Kernel Methods Machine Learning A 708.063 07W VO

Outline
1. Dual representation
2. The kernel concept
3. Properties of kernels
4. Examples of kernel machines: Kernel PCA, Support vector regression, (Relevance vector machine)

Introduction Many linear parametric models can be re-cast into an equivalent dual representation in which the prediction is based on linear combinations of kernel functions evaluated at the training points. The kernel concept was introduced into the field of pattern recognition by Aizerman et al. (1964). It was reintroduced into machine learning in the context of large margin classifiers by Boser et al. (1992), giving rise to the technique of Support Vector Machines.

Dual representation Models which are based on a fixed nonlinear feature space mapping $\Phi : X \to H$ can be reformulated in terms of a dual representation in which the kernel function arises naturally. Consider the linear (regression) model
$$y(x_n) = w^T \Phi(x_n), \qquad x_n \in X,\; w \in \mathbb{R}^M,\; H \subseteq \mathbb{R}^M,$$
with a regularized sum-of-squares error function
$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \bigl(w^T \Phi(x_n) - t_n\bigr)^2 + \frac{\lambda}{2}\, w^T w,$$
whose minimizer is
$$w_{\mathrm{opt}} = \bigl(\Phi^T \Phi + \lambda I_M\bigr)^{-1} \Phi^T t, \qquad t = (t_1, \ldots, t_N)^T,$$
where $\Phi \in \mathbb{R}^{N \times M}$ is the design matrix.

Gram matrix Introduce a kernel function $k(x, x') = \Phi(x)^T \Phi(x')$ and the Gram matrix $K$, defined as
$$K = \Phi \Phi^T, \qquad K_{nm} = k(x_n, x_m) = \Phi(x_n)^T \Phi(x_m),$$
where $\Phi$ denotes the design matrix with rows $\Phi(x_n)^T$.

Dual formulation The dual formulation allows the solution of the least-squares problem to be expressed entirely in terms of the kernel function, in an $N$-dimensional space:
$$y(x) = w^T \Phi(x) = a^T k(x), \qquad k_n(x) = k(x_n, x), \qquad a = (K + \lambda I_N)^{-1} t, \qquad t = (t_1, \ldots, t_N)^T.$$
It is known as the dual formulation because, by noting that $a$ can in turn be expressed in terms of $\Phi(x)$, we recover the original formulation (for $t = K t'/\mathrm{const}$ and $\lambda = \lambda'/\mathrm{const}$ one obtains the same $E_{\mathrm{SSE}}$ in kernel space).
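To make the dual solution concrete, here is a minimal numerical sketch of kernel ridge regression (the function names, the toy data, and the use of NumPy are my own illustrative choices, not part of the lecture); it also checks that the dual and primal solutions give identical predictions for the linear kernel.

```python
import numpy as np

def linear_kernel(X1, X2):
    """k(x, x') = x^T x', evaluated for all pairs of rows."""
    return X1 @ X2.T

def dual_ridge_fit(X, t, lam, kernel=linear_kernel):
    """Dual (kernel) ridge regression: a = (K + lambda I_N)^{-1} t."""
    K = kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(X.shape[0]), t)

def dual_ridge_predict(X_train, a, X_test, kernel=linear_kernel):
    """y(x) = a^T k(x), with k_n(x) = k(x_n, x)."""
    return kernel(X_test, X_train) @ a

# Sanity check against the primal solution w_opt = (Phi^T Phi + lambda I_M)^{-1} Phi^T t,
# using the identity feature map Phi(x) = x.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1

a = dual_ridge_fit(X, t, lam)
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ t)
assert np.allclose(dual_ridge_predict(X, a, X), X @ w)   # both formulations agree
```

Note that the dual solution requires inverting an $N \times N$ matrix rather than an $M \times M$ one, which only pays off once the kernel replaces an expensive or infinite-dimensional feature map.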

The kernel concept In learning we want to be able to generalize to unseen data points. For training data $(x_1, y_1), \ldots, (x_N, y_N) \in X \times Y$ and a new test sample $x$ we want to choose $y$ so that $(x, y)$ is in some sense similar to the training samples. Therefore we need notions of similarity: for the outputs $y$, an error or loss function; for the inputs $x$, a symmetric kernel function
$$k : X \times X \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x').$$

Similarities in the input space We further focus on a simple similarity measure: the dot product, i.e. the canonical dot product $x^T x'$. But $X$ may not be a dot-product space. We therefore introduce a mapping
$$\Phi : X \to H, \qquad x \mapsto \mathbf{x} := \Phi(x)$$
into a dot product space $H$. Benefits: 1. $k(x, x') := \langle \Phi(x), \Phi(x') \rangle$ defines a similarity measure on $X$. 2. We can deal with the patterns geometrically. 3. Freedom to choose $\Phi$.

Remark: Linear models Linear models can be completely formulated in terms of dot products:
$$y(x) = \mathrm{sign}\bigl(\langle w, x \rangle + b\bigr).$$

Questions What is the benefit of feature maps $\Phi$? What is the relationship between kernels and feature maps? What are the advantages and disadvantages of the kernel approach?

Benefit of nonlinear feature maps Data that are not linearly separable in the input space can become linearly separable after a nonlinear feature map; for example, 2D points labeled by whether they lie inside a circle can be classified in 3D with a single hyperplane.
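As a small illustrative sketch (the circular toy data and the quadratic map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ are my own choices, not the lecture's figure): points that are not linearly separable in 2D become separable by a single plane after the map, whose induced kernel is $\langle \Phi(x), \Phi(x') \rangle = (x^T x')^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(int)    # label: inside / outside a circle

def phi(X):
    """Quadratic feature map: 2D input -> 3D feature space."""
    return np.column_stack([X[:, 0]**2,
                            np.sqrt(2) * X[:, 0] * X[:, 1],
                            X[:, 1]**2])

Z = phi(X)
# In feature space the classes are separated by the plane z_1 + z_3 = 0.5,
# i.e. the linear classifier w = (1, 0, 1), b = -0.5 is exact on this data.
pred = (Z[:, 0] + Z[:, 2] < 0.5).astype(int)
print("accuracy of a single hyperplane in feature space:", (pred == y).mean())  # 1.0
```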

Relationship between kernels and feature maps Question: What kind of kernels $k(x, x')$ admit a representation as a dot product in a feature space? I.e., given a kernel $k(x, x')$, can we always construct a dot product space $H$ and a map $\Phi$ mapping into it? And whenever we have a map $\Phi$ into a dot product space $H$, can we always construct a kernel $k(x, x')$? Answer: Yes; a map $\Phi$ always yields a kernel, and a kernel admits such a representation if it is positive semidefinite.

Positive semidefinite kernels Definition (positive semidefinite matrix): A complex $m \times m$ matrix $K$ satisfying
$$\sum_{i,j} c_i \bar{c}_j K_{ij} \ge 0 \quad \text{for all } c_i \in \mathbb{C}$$
is called positive semidefinite. Similarly, a real symmetric $m \times m$ matrix $K$ satisfying this relation for all $c_i \in \mathbb{R}$ is called positive semidefinite. Definition (positive semidefinite kernel): Let $X$ be a nonempty set. A function $k$ on $X \times X$ which for all $N \in \mathbb{N}$ and all $x_1, \ldots, x_N \in X$ gives rise to a positive semidefinite Gram matrix is called a positive semidefinite kernel. The definitions for positive semidefinite kernels and positive semidefinite matrices differ in the fact that in the former case we are free to choose the points on which the kernel is evaluated.

Properties of the Gram matrix 1. Positivity on the diagonal: $K_{ii} = k(x_i, x_i) \ge 0$ for all $x_i \in X$. 2. Symmetry: $K_{ij} = k(x_i, x_j) = k(x_j, x_i) = K_{ji}$. 3. All eigenvalues are nonnegative. 4. The Cauchy-Schwarz inequality is fulfilled: $K_{ij}^2 \le K_{ii} K_{jj}$. Proof (of 4, in the $2 \times 2$ case): the eigenvalues of $K$ are nonnegative and so is their product, the determinant:
$$0 \le \det K = K_{11} K_{22} - K_{12} K_{21} = K_{11} K_{22} - K_{12}^2.$$
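These properties are easy to verify numerically for a concrete Gram matrix; a short sketch (the Gaussian kernel, its bandwidth, and the random data are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.default_rng(2).normal(size=(20, 4))
K = gaussian_kernel(X, X)

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())      # >= 0 up to round-off
# Cauchy-Schwarz: K_ij^2 <= K_ii K_jj for all pairs i, j
print("Cauchy-Schwarz holds:",
      np.all(K**2 <= np.outer(np.diag(K), np.diag(K)) + 1e-12))
```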

Kernels from feature maps Whenever we have a map $\Phi$ into a dot product space $H$, we obtain a positive semidefinite kernel via $k(x, x') := \langle \Phi(x), \Phi(x') \rangle$: for all $c_i \in \mathbb{R}$, $x_i \in X$, $i = 1, \ldots, m$,
$$\sum_{i,j} c_i c_j k(x_i, x_j) = \Bigl\langle \sum_i c_i \Phi(x_i), \sum_j c_j \Phi(x_j) \Bigr\rangle = \Bigl\| \sum_i c_i \Phi(x_i) \Bigr\|^2 \ge 0.$$

Feature maps from kernels Define a map from $X$ into the space of functions mapping $X$ into $\mathbb{R}$, $\mathbb{R}^X := \{ f : X \to \mathbb{R} \}$:
$$\Phi : X \to \mathbb{R}^X, \qquad x \mapsto k(\cdot, x).$$
We construct a feature space as follows: 1. Turn the image of $\Phi$ into a vector space. 2. Define a dot product. 3. Show that the dot product satisfies $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$. 4. For convenience, turn the dot product space into a Hilbert space.

1. Create a vector space First we define a vector space by taking linear combinations of the form
$$f(\cdot) = \sum_{i=1}^{N} \alpha_i k(\cdot, x_i), \qquad N \in \mathbb{N},\; \alpha_i \in \mathbb{R},\; x_1, \ldots, x_N \in X,\; f : X \to \mathbb{R}.$$

2. Define a dot product Next we define a dot product between $f$ and another function
$$g(\cdot) = \sum_{j=1}^{N'} \beta_j k(\cdot, x_j'), \qquad N' \in \mathbb{N},\; \beta_j \in \mathbb{R},\; x_1', \ldots, x_{N'}' \in X,$$
as
$$\langle f, g \rangle = \sum_{i=1}^{N} \sum_{j=1}^{N'} \alpha_i \beta_j \, k(x_i, x_j').$$
Note that the expansion coefficients need not be unique.

Properties of a dot product 1. A dot product $\langle \cdot, \cdot \rangle : H \times H \to \mathbb{R}$, $(x, x') \mapsto \langle x, x' \rangle$, is a symmetric bilinear form:
$$\langle a x + b x', x'' \rangle = a \langle x, x'' \rangle + b \langle x', x'' \rangle, \qquad \langle x'', a x + b x' \rangle = a \langle x'', x \rangle + b \langle x'', x' \rangle.$$
Proof (for the dot product defined above):
$$\langle f, g \rangle = \sum_{i=1}^{N} \sum_{j=1}^{N'} \alpha_i \beta_j \, k(x_i, x_j') = \langle g, f \rangle, \qquad \langle f, g \rangle = \sum_{j=1}^{N'} \beta_j f(x_j') = \sum_{i=1}^{N} \alpha_i g(x_i).$$

Properties of a dot product 2. A dot product is positive semidefinite, i.e. $\langle f, f \rangle \ge 0$, with equality only for $f = 0$. Proof: From the definition of the dot product and the positive semidefiniteness of the kernel it follows that
$$\langle f, f \rangle = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, k(x_i, x_j) \ge 0,$$
and from the Cauchy-Schwarz inequality for kernels it follows that
$$|f(x)|^2 = |\langle k(\cdot, x), f \rangle|^2 \le k(x, x)\, \langle f, f \rangle,$$
so $\langle f, f \rangle = 0$ implies $f = 0$.

3. Kernel is the dot product The kernel $k$ is the representer of evaluation,
$$\langle f, k(\cdot, x) \rangle = \sum_{i=1}^{N} \alpha_i k(x_i, x) = f(x), \qquad \text{in particular} \qquad \langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x'),$$
and is therefore also called a reproducing kernel. Therefore we get $\langle \Phi(x), \Phi(x') \rangle = k(x, x')$.
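The reproducing property can be checked numerically by representing functions through their expansion points and coefficients; a minimal sketch (the Gaussian kernel and all variable names are illustrative assumptions):

```python
import numpy as np

def k(x, xp, sigma=1.0):
    """Gaussian kernel on scalar inputs."""
    return np.exp(-(x - xp)**2 / (2 * sigma**2))

# f = sum_i alpha_i k(., x_i)
x_f, alpha = np.array([0.0, 1.0, 2.0]), np.array([0.5, -1.0, 2.0])

def dot(xa, a, xb, b):
    """<f, g> = sum_i sum_j a_i b_j k(x_i, x_j')."""
    return a @ k(xa[:, None], xb[None, :]) @ b

def f(x):
    return (alpha * k(x_f, x)).sum()

x0, x1 = 0.7, 1.3
# <f, k(., x0)> = f(x0)
print(dot(x_f, alpha, np.array([x0]), np.array([1.0])), f(x0))
# <k(., x0), k(., x1)> = k(x0, x1)
print(dot(np.array([x0]), np.array([1.0]), np.array([x1]), np.array([1.0])), k(x0, x1))
```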

4. RKHS We finally turn the dot product space into a Hilbert space by a fairly simple mathematical trick, resulting in a Reproducing Kernel Hilbert Space (RKHS): we complete the dot product space in the norm $\|f\| = \sqrt{\langle f, f \rangle}$ by adding the limit points of Cauchy sequences that are convergent in that norm (a Cauchy sequence is a sequence $(f_n)$ with $\|f_n - f_m\| \to 0$ as $n, m \to \infty$). Reason: this has some mathematical advantages, e.g. it is always possible to define projections.

Mercer theorem Mercer's theorem is the traditional way to introduce the kernel trick. Mercer's theorem uses the $L_2$ norm, in contrast to the RKHS norm. But any two separable Hilbert spaces are isometrically isomorphic, i.e. it is possible to define a one-to-one linear map between the spaces which preserves the dot product.

Mercer theorem 1. Mercer kernels are positive definite kernels; therefore they are also reproducing kernels. 2. Different feature spaces can be constructed for the same kernel. 3. As long as only dot products are considered, these spaces can be regarded as identical. 4. In practice we never make explicit use of the RKHS or Mercer maps, but only deal with kernel functions.

Representer theorem Theorem (Representer theorem): Denote by $\Omega : [0, \infty) \to \mathbb{R}$ a strictly monotonically increasing function, by $X$ a set, and by $c : (X \times \mathbb{R}^2)^N \to \mathbb{R} \cup \{\infty\}$ an arbitrary loss function. Then each minimizer $f \in H$ of the regularized risk
$$c\bigl((x_1, y_1, f(x_1)), \ldots, (x_N, y_N, f(x_N))\bigr) + \Omega\bigl(\|f\|_H\bigr)$$
admits a representation of the form
$$f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x).$$
Although we might be trying to solve an optimization problem in an infinite-dimensional space $H$, containing linear combinations of kernels centered on arbitrary points of $X$, the solution lies in the span of $N$ particular kernels: those centered on the training points.

Representer theorem Proof: We decompose any $f$ into a part contained in the span of the kernels centered on the training samples and a part $f_\perp$ that is orthogonal to this span:
$$f(x) = f_\parallel(x) + f_\perp(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x) + f_\perp(x), \qquad \langle f_\perp, k(x_j, \cdot) \rangle = 0 \;\; \forall j \in \{1, \ldots, N\}.$$
At every training point the orthogonal part does not contribute,
$$f(x_j) = \langle f, k(x_j, \cdot) \rangle = \sum_{i=1}^{N} \alpha_i k(x_i, x_j) + \langle f_\perp, k(x_j, \cdot) \rangle = \sum_{i=1}^{N} \alpha_i k(x_i, x_j),$$
so the loss term is unaffected by $f_\perp$, while
$$\|f\|^2 = \Bigl\| \sum_i \alpha_i k(x_i, \cdot) \Bigr\|^2 + \|f_\perp\|^2 \ge \Bigl\| \sum_i \alpha_i k(x_i, \cdot) \Bigr\|^2.$$
Therefore the regularized risk is minimized with $f_\perp = 0$.

Examples of kernels Polynomial kernels, inhomogeneous polynomial kernels, Gaussian kernels, and the sigmoid kernel (not p.s.d.). Prior knowledge of the problem helps in designing the right kernel for sophisticated problems (e.g. bioinformatics, text categorization, etc.).
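For concreteness, a sketch of the usual forms of these kernels (the parameterizations below follow standard conventions and are not taken verbatim from the slide):

```python
import numpy as np

def polynomial(X1, X2, d=3):
    """Homogeneous polynomial kernel: k(x, x') = (x^T x')^d."""
    return (X1 @ X2.T) ** d

def inhomogeneous_polynomial(X1, X2, d=3, c=1.0):
    """Inhomogeneous polynomial kernel: k(x, x') = (x^T x' + c)^d."""
    return (X1 @ X2.T + c) ** d

def gaussian(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def sigmoid(X1, X2, kappa=1.0, theta=-1.0):
    """Sigmoid kernel: k(x, x') = tanh(kappa x^T x' + theta); not p.s.d. in general."""
    return np.tanh(kappa * (X1 @ X2.T) + theta)
```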

Advantages of the kernel approach 1. Kernel trick 1: simple computation of the dot product in a potentially infinite-dimensional feature space (see the RBF kernel) by means of the kernel function. 2. Kernel trick 2: given an algorithm formulated in terms of a p.s.d. kernel $k$, one can formulate another algorithm by replacing $k$ with another p.s.d. kernel. 3. Simple construction of p.s.d. kernels from other p.s.d. kernels $k_1$, $k_2$:
$$k(x, x') = k_1(x, x') + k_2(x, x'), \qquad k(x, x') = k_1(x, x')\, k_2(x, x'), \qquad k(x, x') = \exp\bigl(k_1(x, x')\bigr),$$
$$k(x, x') = f(x)\, k_1(x, x')\, f(x'), \qquad k(x, x') = c\, k_1(x, x'),$$
where $f$ is any function and $c \ge 0$.

Disadvantages of the kernel approach In the dual formulation the solution of a regression model is obtained by inverting an $N \times N$ matrix, which for standard methods requires $O(N^3)$ operations. This limits the number of training samples to fewer than about 10000. Moreover, predicting the output for a new test sample requires the evaluation of $N$ kernel functions. A solution to the latter problem are sparse kernel machines, for which the prediction only depends on a subset of the training data points.

Summary Kernel: a similarity measure of inputs that is computed as a dot product in a high-dimensional feature space. Linear models for regression and classification can be formulated purely in terms of dot products. Any p.s.d. kernel corresponds to a dot product in some feature space. Kernel trick: compute high-dimensional dot products without ever computing the feature map.

Examples of kernel machines Non-sparse kernel machines: Gaussian processes, nonlinear (kernel) PCA. Sparse kernel machines: support vector machines, support vector regression, relevance vector machines.

Nonlinear PCA Linear Principal Component Analysis (PCA): linear PCA is an orthogonal transformation of the coordinate system. The new coordinate system is obtained by projecting the data onto the principal components, the orthogonal axes in the directions of largest variance.

Kernel PCA Nonlinear PCA is its generalization to nonlinear transformations.

Linear PCA Assume that the observations $x_n \in \mathbb{R}^d$ are centered, i.e. have mean 0. PCA finds the principal axes by diagonalizing the covariance matrix
$$C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T.$$
$C$ is positive semidefinite, and can thus be diagonalized with nonnegative eigenvalues. This is done by solving the eigenvalue equation
$$\lambda_i v_i = C v_i, \qquad i = 1, \ldots, d,$$
for eigenvalues $\lambda_1 \ge \ldots \ge \lambda_d \ge 0$ and nonzero eigenvectors $v_i \in \mathbb{R}^d \setminus \{0\}$.

Kernelizing PCA All eigenvectors $v_i$ with $\lambda_i \ne 0$ lie in the span of $x_1, \ldots, x_N$:
$$\lambda_i v_i = C v_i = \frac{1}{N} \sum_{j=1}^{N} \langle x_j, v_i \rangle\, x_j \quad \Longrightarrow \quad v_i = \sum_{j=1}^{N} \alpha_j^i x_j.$$
The eigenvalue equation is therefore equivalent to
$$\lambda_i \langle x_j, v_i \rangle = \langle x_j, C v_i \rangle \qquad \text{for all } j = 1, \ldots, N.$$
Substituting the expansion of $v_i$ into this equation one obtains the eigenvalue problem for the $\alpha^i$:
$$N \lambda_i \alpha^i = K \alpha^i, \qquad \alpha^i = (\alpha_1^i, \ldots, \alpha_N^i)^T, \qquad i = 1, \ldots, N.$$

Kernelizing PCA Requiring $v_i$ to be normalized, $\langle v_i, v_i \rangle = 1$, we obtain the condition $\lambda_i N \langle \alpha^i, \alpha^i \rangle = 1$. Projections onto the principal axes are obtained via
$$\langle v_i, x \rangle = \sum_{j=1}^{N} \alpha_j^i \langle x_j, x \rangle.$$
Because only dot products are involved, they can be replaced by nonlinear kernels, corresponding to linear PCA in a high-dimensional feature space.

Kernel PCA Algorithm: 1. Calculate the Gram matrix $K$ and diagonalize it to obtain its eigenvalues $N\lambda_i$ and eigenvectors $\alpha^i$. 2. Normalize the principal axes $v_i$ by imposing $\lambda_i N \langle \alpha^i, \alpha^i \rangle = 1$. 3. Extract the principal components by projecting onto the principal axes:
$$\langle v_i, \Phi(x) \rangle = \sum_{j=1}^{N} \alpha_j^i \, k(x_j, x), \qquad i = 1, \ldots, N.$$

Centering For the sake of simplicity we have assumed that the data points in the feature space are centered. Centering in the feature space is not as easy as in the input space, but can be achieved by
$$\tilde{K}_{ij} = \bigl(K - 1_N K - K 1_N + 1_N K 1_N\bigr)_{ij},$$
where $1_N$ denotes the $N \times N$ matrix with all elements equal to $1/N$.
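Putting the algorithm and the centering formula together, a minimal kernel PCA sketch (the Gaussian kernel, the concentric-rings toy data, and all names are my own illustrative choices):

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA from a precomputed N x N Gram matrix K."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    Kc = K - one_N @ K - K @ one_N + one_N @ K @ one_N     # centering in feature space

    eigvals, eigvecs = np.linalg.eigh(Kc)                   # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]      # K alpha = (N lambda) alpha

    # Enforce lambda_i N <alpha^i, alpha^i> = 1: divide unit eigenvectors by sqrt(N lambda_i)
    A = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    return A, Kc

# Toy data: two concentric rings
rng = np.random.default_rng(3)
angles = rng.uniform(0, 2 * np.pi, 100)
radii = np.r_[np.full(50, 1.0), np.full(50, 3.0)] + 0.1 * rng.normal(size=100)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)                                 # Gaussian kernel, sigma = 1

A, Kc = kernel_pca(K, n_components=2)
projections = Kc @ A    # <v_i, Phi(x_n)> = sum_j alpha_j^i k~(x_j, x_n), one row per point
```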

Example: contour lines (constant principal component values) of the first three principal components $v_1$, $v_2$, $v_3$ for polynomial kernels $k = (x^T x')^d$ with $d = 1, 2, 3, 4$.

Contour lines for RBF kernels

Support vector regression Extension of Support Vector Machines (SVM) to regression problems while preserving the property of sparseness. Basic idea behind SVMs: Maximize the margin defined by the support vectors in a feature space H

SVM revisited Maximize the distance of the closest data point to the hyperplane,
$$\frac{t_n\, y(x_n)}{\|w\|} = \frac{t_n \bigl(w^T \Phi(x_n) + b\bigr)}{\|w\|},$$
by solving
$$\arg\max_{w, b} \Bigl\{ \frac{1}{\|w\|} \min_n \bigl[ t_n \bigl(w^T \Phi(x_n) + b\bigr) \bigr] \Bigr\}.$$

The canonical representation The optimization problem
$$\arg\max_{w, b} \Bigl\{ \frac{1}{\|w\|} \min_n \bigl[ t_n \bigl(w^T \Phi(x_n) + b\bigr) \bigr] \Bigr\}$$
can be reformulated by rescaling $w \to \kappa w$, $b \to \kappa b$, which leaves the distance of a data point to the hyperplane unchanged, and imposing the constraint
$$t_n \bigl(w^T \Phi(x_n) + b\bigr) \ge 1, \qquad \text{with equality only for support vectors.}$$
The resulting constrained minimization problem,
$$\arg\max_{w, b} \frac{1}{\|w\|} \quad \Longleftrightarrow \quad \arg\min_{w, b} \frac{1}{2}\|w\|^2,$$
is solved by quadratic programming.

Overlapping class distributions For overlapping class distributions we define slack variables
$$\xi_n = \begin{cases} 0, & \text{if } x_n \text{ is on the correct side of the margin}, \\ |y(x_n) - t_n|, & \text{otherwise}, \end{cases}$$
and perform the constrained optimization
$$t_n \bigl(w^T \Phi(x_n) + b\bigr) \ge 1 - \xi_n, \qquad \text{with equality only for support vectors,}$$
$$\arg\min_{w, b} E_{\mathrm{reg}}(w), \qquad E_{\mathrm{reg}}(w) = C \sum_{n=1}^{N} \xi_n + \frac{1}{2}\|w\|^2.$$
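As a usage sketch, the soft-margin problem with a kernel can be solved with an off-the-shelf quadratic-programming-based solver; here scikit-learn's SVC is used purely for illustration (the library choice, the toy data, and the parameter values are my assumptions, not the lecture's), with C playing the role of the slack penalty in $E_{\mathrm{reg}}$ above.

```python
import numpy as np
from sklearn.svm import SVC

# Non-separable toy data: two overlapping Gaussian blobs with labels -1 / +1
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=-1.0, size=(100, 2)),
               rng.normal(loc=+1.0, size=(100, 2))])
t = np.r_[np.full(100, -1), np.full(100, +1)]

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)    # C = slack penalty, Gaussian kernel
clf.fit(X, t)

# Only the support vectors (points on, inside, or beyond the margin) enter the solution.
print("support vectors:", clf.support_vectors_.shape[0], "of", len(X))
print("training accuracy:", clf.score(X, t))
```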

Support vector regression Basic idea behind Support Vector Regression (SVR): for SVM, support vectors are data points on or within the margin, or data points on the wrong side of the hyperplane. For SVR, support vectors are data points far away from the target value.

Regularized error function For support vector regression we introduce an $\epsilon$-insensitive error function
$$E_\epsilon\bigl(y(x) - t\bigr) = \begin{cases} 0, & \text{if } |y(x) - t| < \epsilon, \\ |y(x) - t| - \epsilon, & \text{otherwise}, \end{cases}$$
and minimize the regularized error function obtained by replacing $E_{\mathrm{SSE}}$ with $E_\epsilon$:
$$E_{\mathrm{reg}} = C \sum_{n=1}^{N} E_\epsilon\bigl(y(x_n) - t_n\bigr) + \frac{1}{2}\|w\|^2.$$

Points outside the $\epsilon$-tube For points outside the $\epsilon$-tube we define two kinds of slack variables,
$$\xi_n = \begin{cases} t_n - y(x_n) - \epsilon, & \text{if the point lies above the tube}, \\ 0, & \text{otherwise}, \end{cases} \qquad \hat{\xi}_n = \begin{cases} y(x_n) - t_n - \epsilon, & \text{if the point lies below the tube}, \\ 0, & \text{otherwise}, \end{cases}$$
so that
$$E_{\mathrm{reg}} = C \sum_{n=1}^{N} E_\epsilon\bigl(y(x_n) - t_n\bigr) + \frac{1}{2}\|w\|^2 = C \sum_{n=1}^{N} \bigl(\xi_n + \hat{\xi}_n\bigr) + \frac{1}{2}\|w\|^2.$$

Reminder: Lagrangian multipliers To solve the optimization problem of maximizing a function $f(x)$ under the constraint $g(x) = 0$, one introduces a Lagrangian multiplier $\lambda$ and maximizes the Lagrangian function
$$L(x, \lambda) = f(x) + \lambda g(x), \qquad \nabla_x L = \nabla f + \lambda \nabla g = 0, \qquad \partial_\lambda L = g = 0.$$
At the solution the two gradients of $f$ and $g$ must be parallel, and a parameter $\lambda$ must exist such that they cancel.

Reminder: Lagrangian multipliers To solve the optimization problem of maximizing a function $f(x)$ under the constraint $g(x) \ge 0$, one distinguishes two cases: 1. The maximum of $f$ lies within the region $g(x) > 0$: then $\nabla f = 0$ and $\lambda = 0$ (inactive constraint). 2. The maximum of $f$ lies on the boundary $g(x) = 0$: then $\lambda > 0$ and $\nabla f = -\lambda \nabla g$ (active constraint). The solution is therefore obtained by maximizing $L$ subject to the Karush-Kuhn-Tucker (KKT) conditions:
$$\nabla_x L = \nabla f + \lambda \nabla g = 0, \qquad g(x) \ge 0, \qquad \lambda \ge 0, \qquad \lambda\, g(x) = 0.$$
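A small worked example of these conditions (my own illustration, not from the slides): maximize $f(x) = 1 - x_1^2 - x_2^2$ subject to $g(x) = x_1 + x_2 - 1 \ge 0$. The Lagrangian and its stationarity condition are
$$L(x, \lambda) = 1 - x_1^2 - x_2^2 + \lambda\,(x_1 + x_2 - 1), \qquad \nabla_x L = 0 \;\Rightarrow\; x_1 = x_2 = \tfrac{\lambda}{2}.$$
The unconstrained maximum $x = (0, 0)$ violates $g(x) \ge 0$, so the constraint is active, $g(x) = 0$, giving $\lambda = 1$ and $x_1 = x_2 = \tfrac{1}{2}$. All KKT conditions hold: $g(x) = 0 \ge 0$, $\lambda = 1 \ge 0$, $\lambda\, g(x) = 0$, and the constrained maximum is $f(\tfrac{1}{2}, \tfrac{1}{2}) = \tfrac{1}{2}$.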

Lagrangian for SVR The constraints for SVR,
$$\xi_n \ge 0, \qquad \hat{\xi}_n \ge 0, \qquad t_n \le y(x_n) + \epsilon + \xi_n, \qquad t_n \ge y(x_n) - \epsilon - \hat{\xi}_n,$$
lead to the Lagrangian
$$L = C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N} (\mu_n \xi_n + \hat{\mu}_n \hat{\xi}_n) - \sum_{n=1}^{N} a_n \bigl(\epsilon + \xi_n + y(x_n) - t_n\bigr) - \sum_{n=1}^{N} \hat{a}_n \bigl(\epsilon + \hat{\xi}_n - y(x_n) + t_n\bigr),$$
where $a_n \ge 0$, $\hat{a}_n \ge 0$, $\mu_n \ge 0$, $\hat{\mu}_n \ge 0$ are Lagrangian multipliers.

Dual problem Setting the derivatives of $L$ with respect to $w$, $b$, $\xi_n$ and $\hat{\xi}_n$ to zero, one obtains after elimination of these variables the Lagrangian $\tilde{L}$ in the dual formulation, with the KKT conditions
$$\xi_n \ge 0, \qquad \hat{\xi}_n \ge 0, \qquad 0 \le a_n \le C, \qquad 0 \le \hat{a}_n \le C,$$
$$a_n \bigl(\epsilon + \xi_n + y(x_n) - t_n\bigr) = 0, \qquad \hat{a}_n \bigl(\epsilon + \hat{\xi}_n - y(x_n) + t_n\bigr) = 0,$$
$$(C - a_n)\,\xi_n = 0, \qquad (C - \hat{a}_n)\,\hat{\xi}_n = 0 \qquad \text{(eliminating } \mu_n, \hat{\mu}_n).$$

Support vectors for SVR Solving for $w$ we see that new predictions can be made using
$$y(x) = \sum_{n=1}^{N} (a_n - \hat{a}_n)\, k(x, x_n) + b.$$
From the KKT conditions,
$$a_n \bigl(\epsilon + \xi_n + y(x_n) - t_n\bigr) = 0, \qquad \hat{a}_n \bigl(\epsilon + \hat{\xi}_n - y(x_n) + t_n\bigr) = 0,$$
we obtain the support vectors: 1. $a_n$ / $\hat{a}_n$ is only nonzero for points above / below the tube (or on its boundary). 2. The two bracketed terms cannot both vanish, therefore either $a_n = 0$ or $\hat{a}_n = 0$.
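As a usage sketch of the resulting sparse predictor (scikit-learn's SVR is used here only as an illustrative implementation; the toy data and parameter values are my assumptions), where epsilon corresponds to the width of the $\epsilon$-tube and C to the slack penalty:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 2 * np.pi, 100))[:, None]
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
reg.fit(X, t)

# y(x) = sum_n (a_n - a_hat_n) k(x, x_n) + b: only points on or outside the
# epsilon-tube receive nonzero coefficients, so the expansion is sparse.
print("support vectors:", len(reg.support_), "of", len(X))
print("dual coefficients (a_n - a_hat_n):", reg.dual_coef_.shape)   # (1, n_SV)
y_new = reg.predict(np.linspace(0, 2 * np.pi, 5)[:, None])
```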