SVMs: nonlinearity through kernels


Reading: Chapter 3.4 and e-Chapter 8. (Several of the following slides are reproduced from the Learning From Data e-Chapter 8 / Nonlinear Transforms slides by Abu-Mostafa, Magdon-Ismail, and Lin; slide creator: Malik Magdon-Ismail.)

Non-separable data

What if the data is not linearly separable? Consider the following two datasets: (a) a few noisy data points, and (b) a nonlinearly separable target. Figure 8.6 (a reproduction of Figure 3.1) illustrates these two types of non-separability. Both datasets are not linearly separable, but there is a difference!

In Figure 8.6(a), two noisy data points render the data non-separable. In Figure 8.6(b), the target function is inherently nonlinear. For the learning problem in Figure 8.6(a), we prefer the linear separator and need to tolerate the few noisy data points. In Chapter 3, we modified the PLA into the pocket algorithm to handle this situation. Similarly, for SVMs, we will modify the hard-margin SVM into the soft-margin SVM in Section 8.4. Unlike the hard-margin SVM, the soft-margin SVM allows data points to violate the cushion, or even be misclassified.

To address the other situation, in Figure 8.6(b), we introduced the nonlinear transform in Chapter 3: transform your features! There is nothing to stop us from using the nonlinear transform with the optimal hyperplane, which we will do here.

Mechanics of the feature transform I: map to Z-space

Transform the data to a Z-space in which the data is separable. To render the data separable, we would typically transform into a higher dimension: consider a transform $\Phi : \mathbb{R}^d \to \mathbb{R}^{\tilde d}$, with transformed data $z_n = \Phi(x_n)$. After transforming the data, we solve the hard-margin SVM problem in the Z-space, which is just (8.4) written with $z_n$ instead of $x_n$:

$$\min_{\tilde b,\,\tilde w}\ \tfrac{1}{2}\,\tilde w^{t}\tilde w \quad \text{subject to} \quad y_n(\tilde w^{t} z_n + \tilde b) \ge 1 \quad (n = 1, \dots, N), \tag{8.9}$$

where $\tilde w$ is now in $\mathbb{R}^{\tilde d}$ instead of $\mathbb{R}^d$ (recall that we use tildes for objects in the Z-space). The optimization problem in (8.9) is a QP problem with $\tilde d + 1$ variables.
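As a concrete illustration of solving the linear problem in the Z-space, here is a minimal sketch that applies an explicit 2nd-order polynomial transform and then fits a linear SVM on the transformed features. It assumes scikit-learn is available and uses a very large C as a stand-in for the hard margin; the dataset and all parameter values are made up for illustration.

```python
# Sketch: hard-margin SVM in Z-space via an explicit 2nd-order polynomial transform.
# Assumes scikit-learn is available; a very large C approximates the hard margin.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 0.5)           # nonlinear (circular) target

Phi = PolynomialFeatures(degree=2, include_bias=False)   # z_n = Phi(x_n)
Z = Phi.fit_transform(X)

clf = SVC(kernel="linear", C=1e6).fit(Z, y)              # linear separator in Z-space
print("training accuracy in Z-space:", clf.score(Z, y))
```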

Mechanics of the feature transform II: classify in Z-space

In the Z-space the data can be linearly separated:

$$\tilde g(z) = \mathrm{sign}(\tilde w^{t} z).$$

Mechanics of the feature transform III: bring back to X-space

To classify a new $x$, first transform $x$ to $\Phi(x)$ in the Z-space and classify there with $\tilde g$:

$$g(x) = \tilde g(\Phi(x)) = \mathrm{sign}(\tilde w^{t}\Phi(x)).$$

Summary of the nonlinear transform

You must choose $\Phi$ BEFORE you look at the data. After constructing features carefully, before seeing the data, if you think linear is not enough, try the 2nd-order polynomial transform:

$$\Phi(x) = (\Phi_1(x), \Phi_2(x), \Phi_3(x), \Phi_4(x), \Phi_5(x)) = (x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2).$$

What can we say about the dimensionality of the Z-space as a function of the dimensionality of the data?

The general polynomial transform $\Phi_k$

We can get even fancier and choose the degree-$k$ polynomial transform:

$$\Phi_1(x) = (1, x_1, x_2),$$
$$\Phi_2(x) = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2),$$
$$\Phi_3(x) = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2, x_1^3, x_1^2x_2, x_1x_2^2, x_2^3),$$
$$\Phi_4(x) = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2, x_1^3, x_1^2x_2, x_1x_2^2, x_2^3, x_1^4, x_1^3x_2, x_1^2x_2^2, x_1x_2^3, x_2^4),$$
$$\vdots$$

What are the potential effects of increasing the order of the polynomial? The dimensionality of the feature space, and with it $d_{\mathrm{vc}}$, increases rapidly. Similar transforms exist for a $d$-dimensional original space. This is the approximation-generalization tradeoff: a higher degree gives a lower (even zero) $E_{\mathrm{in}}$ but worse generalization. Be careful with nonlinear transforms. A dimensionality count appears in the sketch below.
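To make the dimensionality question concrete, here is a small sketch (using only the Python standard library; the function name is my own) that counts the monomials of degree at most k in d variables, which is the dimension of the degree-k polynomial Z-space including the constant term.

```python
# Sketch: how quickly the polynomial Z-space dimension grows with degree k and input dim d.
# The count below (including the constant term) is C(d + k, k).
from math import comb

def poly_zspace_dim(d: int, k: int) -> int:
    """Number of monomials of degree <= k in d variables (including the constant term)."""
    return comb(d + k, k)

for d in (2, 10, 100):
    print(d, [poly_zspace_dim(d, k) for k in range(1, 6)])
```

For d = 2 and k = 2 this gives 6, matching $\Phi_2$ above; for d = 100 the 5th-order transform already has close to 100 million features, which motivates the implicit approach of the next slides.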

The polynomial Z-space: a few potential issues

[Figure: a linear model versus a fourth-order polynomial fit on the digits data; a high-order polynomial transform leads to nonsense.]

Transforming the data seems like a good idea: there is a better chance of obtaining linear separability. But this comes at the expense of potential overfitting (the feature-space dimensionality increases rapidly, and with it the complexity of the model), and it is computationally expensive in memory and time. Kernels avoid the computational expense through an implicit mapping.

Achieving nonlinear discriminant functions

Consider two-dimensional data and the mapping

$$\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2).$$

Let's plug that into the discriminant function:

$$w \cdot \phi(x) = w_1 x_1^2 + \sqrt{2}\,w_2\, x_1 x_2 + w_3 x_2^2.$$

The resulting decision boundary is a conic section: a parabola, a circle/ellipse, or a hyperbola.

How to avoid the overhead of explicit mapping

Suppose the weight vector can be expressed as

$$w = \sum_{i=1}^{n} \alpha_i x_i.$$

The discriminant function is then

$$f(x) = \sum_{i=1}^{n} \alpha_i\, x_i \cdot x + b,$$

and using our nonlinear mapping:

$$f(x) = \sum_{i=1}^{n} \alpha_i\, \phi(x_i) \cdot \phi(x) + b.$$

It turns out we can often compute the dot product without explicitly mapping the data into a high-dimensional feature space!
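The key step above is that the same discriminant can be evaluated either from the explicit weight vector or purely from dot products between mapped examples. A minimal numerical check of that equivalence, with arbitrary made-up coefficients:

```python
# Sketch: the dual form of the discriminant. If w = sum_i alpha_i * phi(x_i), then
# w . phi(x) + b == sum_i alpha_i * (phi(x_i) . phi(x)) + b. All values here are made up.
import numpy as np

def phi(x):
    """Explicit quadratic map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))          # five training points
alpha = rng.normal(size=5)           # arbitrary dual coefficients
b = 0.3
x_new = rng.normal(size=2)

w = sum(a * phi(xi) for a, xi in zip(alpha, X))                       # primal weights
primal = w @ phi(x_new) + b
dual = sum(a * (phi(xi) @ phi(x_new)) for a, xi in zip(alpha, X)) + b
print(np.isclose(primal, dual))      # True
```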

Example: computing the dot product in feature space

Let's go back to the example and compute the dot product. With

$$\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2),$$

we have

$$\phi(x) \cdot \phi(z) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2) \cdot (z_1^2,\ \sqrt{2}\,z_1 z_2,\ z_2^2) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x \cdot z)^2.$$

Do we need to perform the mapping explicitly? NO! Squaring the dot product in the original space has the same effect as computing the dot product in feature space.

Kernels

Definition: a function $k(x, z)$ that can be expressed as a dot product in some feature space is called a kernel. In other words, $k(x, z)$ is a kernel if there exists $\phi : \mathcal{X} \to \mathcal{F}$ such that

$$k(x, z) = \phi(x) \cdot \phi(z).$$

Why is this interesting? If the algorithm can be expressed in terms of dot products, we can work in the feature space without performing the mapping explicitly!

The dual SVM problem

The dual SVM formulation depends on the data only through dot products, and so can be expressed using kernels:

$$\text{maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j \quad \text{subject to } 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n} \alpha_i y_i = 0.$$

Replace $x_i \cdot x_j$ with $k(x_i, x_j)$:

$$\text{maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) \quad \text{subject to } 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n} \alpha_i y_i = 0.$$
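The identity $\phi(x) \cdot \phi(z) = (x \cdot z)^2$ is easy to confirm numerically; the following sketch checks it for random vectors:

```python
# Sketch: check numerically that phi(x) . phi(z) == (x . z)^2 for the quadratic map above.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(2)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(z), (x @ z) ** 2))   # True: the kernel trick in miniature
```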

Standard kernel functions

- The linear kernel: $k(x, z) = x \cdot z$. Feature space: the original features.
- The homogeneous polynomial kernel: $k(x, z) = (x \cdot z)^d$. Feature space: all monomials of degree $d$.
- The polynomial kernel: $k(x, z) = (x \cdot z + 1)^d$. Feature space: all monomials of degree at most $d$.
- The Gaussian kernel: $k(x, z) = \exp(-\gamma\,\|x - z\|^2)$. Feature space: infinite-dimensional.

Demo: decision boundaries using the polynomial kernel $k(x, z) = (x \cdot z + 1)^d$ and the Gaussian kernel $k(x, z) = \exp(-\gamma\,\|x - z\|^2)$.
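For reference, here is a sketch of the standard kernels written out as plain functions; the gamma and degree defaults are arbitrary choices, not prescribed by the slides.

```python
# Sketch: the standard kernels as plain functions (gamma and degree values are arbitrary).
import numpy as np

def linear_kernel(x, z):
    return x @ z

def homogeneous_poly_kernel(x, z, d=2):
    return (x @ z) ** d

def poly_kernel(x, z, d=2):
    return (x @ z + 1.0) ** d

def gaussian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, homogeneous_poly_kernel, poly_kernel, gaussian_kernel):
    print(k.__name__, k(x, z))
```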

Kernelizing the perceptron algorithm

Recall the primal version of the perceptron algorithm:

Input: labeled data D in homogeneous coordinates
Output: weight vector w

w = 0
converged = false
while not converged:
    converged = true
    for i in 1, ..., |D|:
        if x_i is misclassified: update w = w + y_i x_i and set converged = false

What do you need to change to express it in the dual, i.e. in terms of the alpha coefficients, where $w = \sum_{i=1}^{n} \alpha_i x_i$?

Input: labeled data D in homogeneous coordinates
Output: coefficient vector α

α = 0
converged = false
while not converged:
    converged = true
    for i in 1, ..., |D|:
        if x_i is misclassified: update α and set converged = false

The primal update $w \leftarrow w + y_i x_i$ is equivalent to the dual update $\alpha_i \leftarrow \alpha_i + y_i$. A full sketch of the dual algorithm follows below.

Linear regression revisited

The sum-squared cost function is

$$\sum_i (y_i - w \cdot x_i)^2 = (y - Xw)^{\top}(y - Xw).$$

The optimal solution satisfies

$$X^{\top} X w = X^{\top} y.$$

If we express $w$ as $w = \sum_{i=1}^{n} \alpha_i x_i = X^{\top}\alpha$, we get

$$X^{\top} X X^{\top} \alpha = X^{\top} y.$$
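A sketch of the resulting kernelized (dual) perceptron is shown here; the data representation, kernel choice, and epoch cap are illustrative assumptions rather than part of the slides.

```python
# Sketch of a kernelized (dual) perceptron, assuming data in homogeneous coordinates
# and a user-supplied kernel function; variable names are illustrative.
import numpy as np

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Dual perceptron: w = sum_i alpha_i x_i is never formed explicitly."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(max_epochs):
        converged = True
        for i in range(n):
            # prediction uses only kernel values: sum_j alpha_j K(x_j, x_i)
            if y[i] * (alpha @ K[:, i]) <= 0:
                alpha[i] += y[i]          # dual counterpart of w <- w + y_i x_i
                converged = False
        if converged:
            break
    return alpha

# Tiny usage example with the quadratic kernel (x . z)^2 on a circularly separable toy set.
rng = np.random.default_rng(3)
X = np.c_[np.ones(40), rng.uniform(-1, 1, size=(40, 2))]   # homogeneous coordinates
y = np.where(X[:, 1] ** 2 + X[:, 2] ** 2 > 0.5, 1.0, -1.0)
alpha = kernel_perceptron(X, y, kernel=lambda a, b: (a @ b) ** 2)
print("nonzero alphas:", np.count_nonzero(alpha))
```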

Kernel linear regression

We now get that $\alpha$ satisfies

$$X X^{\top} \alpha = y.$$

Compare with

$$X^{\top} X w = X^{\top} y.$$

Which is harder to find? What have we gained? $X^{\top} X$ is the covariance matrix ($d \times d$), while $X X^{\top}$ is the matrix of dot products associated with the dataset ($n \times n$). We can replace the latter with a matrix $K$ such that

$$K_{ij} = \phi(x_i) \cdot \phi(x_j) = k(x_i, x_j).$$

This is the kernel matrix associated with the dataset, a.k.a. the Gram matrix.

What does the kernel matrix look like? [Figure: kernel matrix for gene expression data in yeast.]

Properties of the kernel matrix

The kernel matrix $K_{ij} = \phi(x_i) \cdot \phi(x_j) = k(x_i, x_j)$ is:

- symmetric (and therefore has real eigenvalues);
- has non-negative diagonal elements, since $K_{ii} = \|\phi(x_i)\|^2$;
- positive semi-definite, i.e. $x^{\top} K x \ge 0$ for all $x$.

Corollary: the eigenvalues of a kernel matrix are non-negative.
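As a sanity check, the following sketch recovers the same regression weights from the n-by-n Gram matrix as from the usual d-by-d normal equations; the pseudo-inverse is my own choice here, to handle the rank-deficient Gram matrix when n > d.

```python
# Sketch: solving linear regression through the n x n Gram matrix instead of the d x d
# covariance matrix. The pseudo-inverse handles X X^T being rank-deficient when n > d
# (a numerical choice on my part, not something stated on the slides).
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=n)

K = X @ X.T                                    # Gram matrix, K_ij = x_i . x_j  (n x n)
alpha = np.linalg.pinv(K) @ y                  # alpha with X X^T alpha ~= y
w_dual = X.T @ alpha                           # w = sum_i alpha_i x_i
w_primal = np.linalg.solve(X.T @ X, X.T @ y)   # usual normal equations (d x d)
print(np.allclose(w_dual, w_primal))           # the two routes agree
```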

Recall the standard kernel functions: the linear kernel $k(x, z) = x \cdot z$, the homogeneous polynomial kernel $k(x, z) = (x \cdot z)^d$, the polynomial kernel $k(x, z) = (x \cdot z + 1)^d$, and the Gaussian kernel (a.k.a. the RBF kernel) $k(x, z) = \exp(-\gamma\,\|x - z\|^2)$, with feature spaces ranging from the original features to an infinite-dimensional space. How do we even know that the Gaussian is a valid kernel?

Some tricks for constructing kernels

Scaling: let $K(x, z)$ be a kernel function. If $a > 0$, then $a\,K(x, z)$ is a kernel.

Sums of kernels are kernels: let $K_1$ and $K_2$ be kernel functions; then $K_1 + K_2$ is a kernel. What is the feature map that shows this? The concatenation of the underlying feature maps.
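A quick numerical way to convince yourself of the scaling and sum rules is to check that the resulting Gram matrices have no negative eigenvalues; the kernels and data below are arbitrary choices for illustration.

```python
# Sketch: a numerical sanity check that positive scaling and sums of kernels give Gram
# matrices with no negative eigenvalues (i.e. they stay positive semi-definite).
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 4))

def gram(kernel):
    return np.array([[kernel(a, b) for b in X] for a in X])

k1 = lambda a, b: a @ b                             # linear kernel
k2 = lambda a, b: (a @ b + 1.0) ** 3                # polynomial kernel
for K in (gram(lambda a, b: 2.5 * k1(a, b)),        # a * K1 with a > 0
          gram(lambda a, b: k1(a, b) + k2(a, b))):  # K1 + K2
    print(np.min(np.linalg.eigvalsh(K)) >= -1e-9)   # True up to round-off
```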

Products of kernels are kernels: let $K_1$ and $K_2$ be kernel functions; then $K_1 K_2$ is a kernel. To see this, construct a feature map that contains all products of pairs of features, one from each underlying feature map.

The cosine kernel

If $K(x, z)$ is a kernel, then

$$K'(x, z) = \frac{K(x, z)}{\sqrt{K(x, x)\,K(z, z)}}$$

is a kernel. This kernel is equivalent to normalizing each example to have unit norm in the feature space associated with the kernel; it is the cosine in the feature space associated with $K$:

$$\cos(\phi(x), \phi(z)) = \frac{\phi(x) \cdot \phi(z)}{\|\phi(x)\|\,\|\phi(z)\|}.$$
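The claim that cosine normalization equals working with unit-norm feature vectors can be checked directly for the quadratic map used earlier; this sketch does exactly that.

```python
# Sketch: cosine normalization of a kernel, checked against explicit unit-norm feature
# vectors for the quadratic map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):                                   # homogeneous quadratic kernel
    return (x @ z) ** 2

def k_cos(x, z):                               # cosine-normalized kernel
    return k(x, z) / np.sqrt(k(x, x) * k(z, z))

rng = np.random.default_rng(6)
x, z = rng.normal(size=2), rng.normal(size=2)
u, v = phi(x) / np.linalg.norm(phi(x)), phi(z) / np.linalg.norm(phi(z))
print(np.isclose(k_cos(x, z), u @ v))          # True: same as the cosine in feature space
```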

Infinite sums of kernels

Theorem: a function $K(x, z) = K(x \cdot z)$ with a series expansion

$$K(t) = \sum_{n=0}^{\infty} a_n t^n$$

is a kernel if and only if $a_n \ge 0$ for all $n$.

Corollary: $K(x, z) = \exp(x \cdot z)$ is a kernel.

Corollary: $k(x, z) = \exp(-\gamma\,\|x - z\|^2)$ is a kernel, since

$$\exp(-\gamma\,\|x - z\|^2) = \exp\bigl(-\gamma\,(x \cdot x + z \cdot z) + 2\gamma\,(x \cdot z)\bigr),$$

i.e. the Gaussian kernel is the cosine kernel of the exponential kernel:

$$K'(x, z) = \frac{K(x, z)}{\sqrt{K(x, x)\,K(z, z)}} \quad \text{with } K(x, z) = \exp(2\gamma\, x \cdot z).$$
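This last identity is easy to verify numerically: cosine-normalizing the scaled exponential kernel reproduces the Gaussian kernel. A small sketch, with an arbitrary choice of gamma:

```python
# Sketch: the Gaussian kernel as the cosine-normalized exponential kernel,
# checked numerically for an arbitrary gamma.
import numpy as np

gamma = 0.7
k_exp = lambda x, z: np.exp(2.0 * gamma * (x @ z))          # (scaled) exponential kernel
k_gauss = lambda x, z: np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(7)
x, z = rng.normal(size=3), rng.normal(size=3)
cosine_normalized = k_exp(x, z) / np.sqrt(k_exp(x, x) * k_exp(z, z))
print(np.isclose(cosine_normalized, k_gauss(x, z)))         # True
```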