Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Non-linearly separable problems
[Figure panels: AND, OR, NOT, XOR]
The XOR problem cannot be solved with a perceptron.
Pier Luigi Martelli - Systems and In Silico Biology 2015-2016 - University of Bologna

With NN: multi-layer feed-forward neural networks
Neurons are organized into hierarchical layers. Each layer receives its inputs from the previous one and transmits its output to the next one:
$a_j^{(1)} = \sum_i w_{ij}^{(1)} x_i - \theta_j^{(1)}$, $z_j^{(1)} = g(a_j^{(1)})$
$a_j^{(2)} = \sum_i w_{ij}^{(2)} z_i^{(1)} - \theta_j^{(2)}$, $z_j^{(2)} = g(a_j^{(2)})$

XOR with a two-layer network
Hidden unit 1: weights 0.7 and 0.7, threshold $\theta = 0.5$
Hidden unit 2: weights 0.3 and 0.3, threshold $\theta = 0.5$
Output unit: weights 0.7 (from hidden unit 1) and -0.7 (from hidden unit 2), threshold $\theta = 0.5$
Each unit computes $a = \sum_i w_i x_i - \theta$ and outputs $z = 1$ if $a > 0$, otherwise $z = 0$.

x1  x2   hidden unit 1      hidden unit 2      output unit
0   0    a = -0.5, z = 0    a = -0.5, z = 0    a = -0.5, z = 0
1   0    a = 0.2,  z = 1    a = -0.2, z = 0    a = 0.2,  z = 1
0   1    a = 0.2,  z = 1    a = -0.2, z = 0    a = 0.2,  z = 1
1   1    a = 0.9,  z = 1    a = 0.1,  z = 1    a = -0.5, z = 0

The hidden layer REMAPS the input into a new representation that is linearly separable.

Input   Desired output   Activation of hidden neurons
0 0     0                0 0
0 1     1                1 0
1 0     1                1 0
1 1     0                1 1
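The forward pass traced in the preceding slides can be reproduced in a few lines of Python. This is a minimal sketch: the weights and the 0.5 thresholds are read off the slides, while the step activation and the names `step` and `xor_net` are assumptions introduced here for illustration.

```python
def step(a):
    """Threshold activation (assumed): fire when the net input is positive."""
    return 1 if a > 0 else 0


def xor_net(x1, x2):
    """Forward pass of the 2-2-1 network; returns (hidden activations, output)."""
    z1 = step(0.7 * x1 + 0.7 * x2 - 0.5)   # hidden unit 1 behaves like OR
    z2 = step(0.3 * x1 + 0.3 * x2 - 0.5)   # hidden unit 2 behaves like AND
    out = step(0.7 * z1 - 0.7 * z2 - 0.5)  # output fires for OR and not AND
    return (z1, z2), out


for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hidden, out = xor_net(*x)
    print(x, "hidden:", hidden, "output:", out)
```

In the hidden representation the inputs (0, 1) and (1, 0) collapse onto the same point (1, 0), which is what makes a single output unit sufficient.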

Extension to non-linear decision boundary
So far, we have only considered large-margin classifiers with a linear decision boundary. How can this be generalized to become nonlinear?
Key idea: transform $x_i$ to a higher dimensional space to "make life easier".
Input space: the space where the points $x_i$ are located.
Feature space: the space of $\phi(x_i)$ after transformation.
Why transform? A linear operation in the feature space is equivalent to a nonlinear operation in the input space, and classification can become easier with a proper transformation. In the XOR problem, for example, adding the new feature $x_1 x_2$ makes the problem linearly separable.

XOR

X  Y  XOR
0  0  0
0  1  1
1  0  1
1  1  0
In the (X, Y) plane the problem is not linearly separable.

X  Y  XY  XOR
0  0  0   0
0  1  0   1
1  0  0   1
1  1  1   0
In the (X, Y, XY) space the problem is linearly separable.
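As a quick check of the second table, here is a hand-picked linear rule in the extended space (X, Y, XY). The weights (1, 1, -2) and the threshold 0.5 are illustrative assumptions, not values from the slides; any separating plane would do.

```python
def linear_rule(x, y):
    """Linear threshold unit on the extended features (X, Y, XY)."""
    features = (x, y, x * y)
    weights = (1.0, 1.0, -2.0)   # hypothetical separating hyperplane
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score > 0.5 else 0


predictions = {(x, y): linear_rule(x, y) for x in (0, 1) for y in (0, 1)}
print(predictions)
```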

Find a feature space

Transforming the data
[Figure: the map $\phi(\cdot)$ sends each point of the input space to a point $\phi(x)$ of the feature space.]
Note: in practice the feature space is of higher dimension than the input space.
Computation in the feature space can be costly because it is high dimensional. The feature space can even be infinite-dimensional!
The kernel trick comes to the rescue.

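The kernel trick
Recall the SVM optimization problem:
$\max_\alpha \; W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$, subject to $C \ge \alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$
The data points only appear as scalar products. As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. Many common geometric operations (angles, distances) can be expressed by inner products.
Define the kernel function $K$ by $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$.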

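An example for $\phi(\cdot)$ and $K(\cdot,\cdot)$
Suppose $\phi(\cdot)$ is given as follows:
$\phi(x) = (1, \; x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2)^T$
An inner product in the feature space is
$\phi(x)^T \phi(y) = (1 + x_1 y_1 + x_2 y_2)^2$
So, if we define the kernel function as
$K(x, y) = (1 + x_1 y_1 + x_2 y_2)^2$
there is no need to carry out $\phi(\cdot)$ explicitly. This use of a kernel function to avoid carrying out $\phi(\cdot)$ explicitly is known as the kernel trick.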
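The kernel trick described above can be checked numerically. The sketch below assumes the standard 6-dimensional feature map of the degree-2 polynomial kernel on 2-D inputs; the function names and the two test points are illustrative choices.

```python
import math


def phi(x):
    """Explicit degree-2 feature map for a 2-D point (6-D feature space)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]


def poly_kernel(x, y):
    """K(x, y) = (x . y + 1)^2, computed directly in the 2-D input space."""
    return (x[0] * y[0] + x[1] * y[1] + 1.0) ** 2


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.5, -2.0), (0.5, 3.0)   # arbitrary test points
explicit = dot(phi(x), phi(y))   # inner product computed in feature space
implicit = poly_kernel(x, y)     # same quantity, never leaving input space
print(explicit, implicit)
```

The two numbers agree up to floating-point rounding, but the kernel needs only one 2-D dot product instead of two 6-D feature vectors.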

Kernels
Given a mapping $x \to \phi(x)$, a kernel is represented as the inner product
$K(x, y) = \sum_i \phi_i(x)\, \phi_i(y)$
A kernel must satisfy Mercer's condition: for every $g(x)$ such that $\int g(x)^2 \, dx$ is finite,
$\int\!\!\int K(x, y)\, g(x)\, g(y) \, dx \, dy \ge 0$
This is analogous to positive-semidefinite matrices $M$, for which $z^T M z \ge 0$ for every $z \ne 0$.
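The analogy with positive-semidefinite matrices can be made concrete with a short derivation. The Gram matrix $G$ with entries $G_{ij} = K(x_i, x_j)$ and the coefficient vector $c$ are notation introduced here, not taken from the slides. Whenever $K$ is an inner product of feature vectors, every finite Gram matrix is positive semidefinite:

```latex
c^{T} G c
  = \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j \, K(x_i, x_j)
  = \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j \, \phi(x_i)^{T} \phi(x_j)
  = \Big\| \sum_{i=1}^{N} c_i \, \phi(x_i) \Big\|^{2} \;\ge\; 0 .
```

Mercer's condition is the continuous counterpart of this inequality, with the sums replaced by integrals and the coefficients $c_i$ by the function $g$.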

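Modification due to kernel function
Change all inner products to kernel functions. For training:
Original: $\max_\alpha \; W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$, subject to $C \ge \alpha_i \ge 0$, $\sum_{i=1}^{N} \alpha_i y_i = 0$
With kernel function: $\max_\alpha \; W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j)$, subject to $C \ge \alpha_i \ge 0$, $\sum_{i=1}^{N} \alpha_i y_i = 0$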

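Modification due to kernel function
For testing, the new data point $z$ is classified as class 1 if $f > 0$, and as class 2 if $f < 0$ (the sums run over the support vectors):
Original: $w = \sum_i \alpha_i y_i x_i$, $f = w^T z + b = \sum_i \alpha_i y_i \, x_i^T z + b$
With kernel function: $w = \sum_i \alpha_i y_i \phi(x_i)$, $f = w^T \phi(z) + b = \sum_i \alpha_i y_i \, K(x_i, z) + b$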

More on kernel functions
Since the training of an SVM only requires the values $K(x_i, x_j)$, there is no restriction on the form of $x_i$ and $x_j$: $x_i$ can be a sequence or a tree, instead of a feature vector.
$K(x_i, x_j)$ is just a similarity measure comparing $x_i$ and $x_j$.
For a test object $z$, the discriminant function is essentially a weighted sum of the similarities between $z$ and a preselected set of objects (the support vectors).

Example
Suppose we have five 1-D data points $x_1 = 1$, $x_2 = 2$, $x_3 = 4$, $x_4 = 5$, $x_5 = 6$, with 1, 2, 6 as class 1 and 4, 5 as class 2, so that $y_1 = 1$, $y_2 = 1$, $y_3 = -1$, $y_4 = -1$, $y_5 = 1$.
[Figure: the points on the real line, with 1 and 2 in class 1, 4 and 5 in class 2, and 6 in class 1.]

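Example
We use the polynomial kernel of degree 2, $K(x, y) = (xy + 1)^2$. $C$ is set to 100.
We first find $\alpha_i$ ($i = 1, \dots, 5$) by
$\max_\alpha \; \sum_{i=1}^{5} \alpha_i - \frac{1}{2} \sum_{i=1}^{5} \sum_{j=1}^{5} \alpha_i \alpha_j y_i y_j (x_i x_j + 1)^2$, subject to $100 \ge \alpha_i \ge 0$, $\sum_{i=1}^{5} \alpha_i y_i = 0$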

Example
By using a QP solver, we get $\alpha_1 = 0$, $\alpha_2 = 2.5$, $\alpha_3 = 0$, $\alpha_4 = 7.333$, $\alpha_5 = 4.833$.
Note that the constraints are indeed satisfied. The support vectors are $\{x_2 = 2, x_4 = 5, x_5 = 6\}$.
The discriminant function is
$f(z) = 2.5\,(1)(2z + 1)^2 + 7.333\,(-1)(5z + 1)^2 + 4.833\,(1)(6z + 1)^2 + b = 0.6667\, z^2 - 5.333\, z + b$
$b$ is recovered by solving $f(2) = 1$, or $f(5) = -1$, or $f(6) = 1$. All three give $b = 9$.
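The numbers in this example can be checked without a QP solver by plugging the quoted multipliers into the kernelized discriminant function. This is a minimal sketch: the multipliers and b = 9 are the rounded values quoted above, so the margins come out only approximately equal to +1 and -1; the variable and function names are assumptions.

```python
def kernel(x, y):
    """Degree-2 polynomial kernel for scalars: K(x, y) = (x*y + 1)^2."""
    return (x * y + 1.0) ** 2


xs = [1.0, 2.0, 4.0, 5.0, 6.0]
ys = [1, 1, -1, -1, 1]
alphas = [0.0, 2.5, 0.0, 7.333, 4.833]   # rounded values quoted from the QP solver
b = 9.0


def f(z):
    """Discriminant: weighted sum of kernel similarities to the training points."""
    return sum(a * y * kernel(x, z) for a, y, x in zip(alphas, ys, xs)) + b


# Equality constraint of the dual problem: sum_i alpha_i * y_i = 0
constraint = sum(a * y for a, y in zip(alphas, ys))
print("sum alpha_i y_i =", constraint)
for x, y in zip(xs, ys):
    print(f"x = {x:.0f}  f(x) = {f(x):+.3f}  label = {y:+d}")
```

Points with a zero multiplier contribute nothing to the sum, so only the three support vectors matter at prediction time.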

Example
[Figure: value of the discriminant function $f(z)$ along the real line. $f(z) > 0$ in the two class 1 regions (around 1, 2 and around 6) and $f(z) < 0$ in the class 2 region (around 4, 5).]

Kernel functions
In practical use of SVM, the user specifies the kernel function; the transformation $\phi(\cdot)$ is not explicitly stated.
Given a kernel function $K(x_i, x_j)$, the transformation $\phi(\cdot)$ is given by its eigenfunctions (a concept in functional analysis). Eigenfunctions can be difficult to construct explicitly. This is why people only specify the kernel function without worrying about the exact transformation.
Another view: the kernel function, being a scalar product, is really a similarity measure between the objects.

A kernel is associated to a transformation
Given a kernel, in principle it should be possible to recover the transformation into the feature space that originates it.
$K(x, y) = (xy + 1)^2 = x^2 y^2 + 2xy + 1$
If $x$ and $y$ are numbers, it corresponds to the transformation
$\phi(x) = (x^2, \; \sqrt{2}\, x, \; 1)^T$
What if $x$ and $y$ are 2-dimensional vectors?

A kernel is associated to a transformation
$K(x, y) = (x_1 y_1 + x_2 y_2 + 1)^2 = 1 + x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 + 2 x_1 y_1 + 2 x_2 y_2$
$\phi(x) = (1, \; x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2)^T$

XOR: simple example
$L(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$, with $K(x_i, x_j) = (x_i^T x_j + 1)^2$

Input vector   Y
[-1, -1]       -1
[-1, +1]       +1
[+1, -1]       +1
[+1, +1]       -1

Kernel matrix $K(x_i, x_j)$:
           (-1,-1)  (-1,+1)  (+1,-1)  (+1,+1)
(-1,-1)    9        1        1        1
(-1,+1)    1        9        1        1
(+1,-1)    1        1        9        1
(+1,+1)    1        1        1        9

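XOR
$L(\alpha) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2}\left(9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2\right)$
Setting the partial derivatives to zero:
$\partial L / \partial \alpha_1 = 0 \;\Rightarrow\; 9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1$
$\partial L / \partial \alpha_2 = 0 \;\Rightarrow\; -\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1$
$\partial L / \partial \alpha_3 = 0 \;\Rightarrow\; -\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1$
$\partial L / \partial \alpha_4 = 0 \;\Rightarrow\; \alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1$
The solution is $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \frac{1}{8}$, with $L(\alpha) = \frac{1}{4}$.
The four input vectors are all support vectors.
$w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i) = [0, \; 0, \; -1/\sqrt{2}, \; 0, \; 0, \; 0]^T$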

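XOR
$\phi(x) = (1, \; x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2)^T$
$w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i) = [0, \; 0, \; -1/\sqrt{2}, \; 0, \; 0, \; 0]^T$
Only the third feature, $\sqrt{2}\, x_1 x_2$, carries weight, so $f(x) = w^T \phi(x) = -x_1 x_2$.

Input vector   Y    x1 x2
[-1, -1]       -1   +1
[-1, +1]       +1   -1
[+1, -1]       +1   -1
[+1, +1]       -1   +1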
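The XOR solution can be verified numerically. The sketch below takes the multipliers quoted in the slides as given (it does not solve the QP) and checks the stationarity conditions, the weight vector and the resulting decision values. The ordering of the feature map and all names are assumptions chosen to match the quoted weight vector.

```python
import math

X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
Y = [-1, 1, 1, -1]
alpha = [1 / 8] * 4           # solution quoted in the slides


def K(u, v):
    """Degree-2 polynomial kernel K(u, v) = (u . v + 1)^2."""
    return (u[0] * v[0] + u[1] * v[1] + 1) ** 2


def phi(x):
    """Explicit 6-D feature map corresponding to K (assumed ordering)."""
    r2 = math.sqrt(2)
    return [1, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]


gram = [[K(u, v) for v in X] for u in X]

# Stationarity of the dual: dL/d(alpha_i) = 1 - y_i * sum_j alpha_j y_j K_ij
grad = [1 - Y[i] * sum(alpha[j] * Y[j] * gram[i][j] for j in range(4))
        for i in range(4)]

# Weight vector in feature space: w = sum_i alpha_i y_i phi(x_i)
w = [sum(alpha[i] * Y[i] * phi(X[i])[k] for i in range(4)) for k in range(6)]

# Decision values f(x) = sum_i alpha_i y_i K(x_i, x); the bias is zero here
f = [sum(alpha[i] * Y[i] * K(X[i], x) for i in range(4)) for x in X]

print(gram)
print(grad)
print(w)
print(f)
```

Every decision value equals the label exactly, which is why all four points sit on the margin and all are support vectors.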

Examples of kernel functions
Polynomial kernel up to degree $d$: $K(x, y) = (x^T y + 1)^d$
Radial basis function kernel with width $\sigma$: $K(x, y) = \exp\left(-\|x - y\|^2 / (2\sigma^2)\right)$
Sigmoid with parameters $\kappa$ and $\theta$: $K(x, y) = \tanh(\kappa\, x^T y + \theta)$. It does not satisfy the Mercer condition for all $\kappa$ and $\theta$.

Polynomial kernel
[Figure from Bishop C, Pattern Recognition and Machine Learning, Springer]

Source: http://www.vancuc.org/

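Examples of kernel functions
Radial basis function (or Gaussian) kernel with width $\sigma$:
$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) = \exp\left(-\frac{x \cdot x}{2\sigma^2}\right) \exp\left(\frac{x \cdot y}{\sigma^2}\right) \exp\left(-\frac{y \cdot y}{2\sigma^2}\right)$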

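Examples of kernel functions
With 1-dimensional vectors:
$K(x, y) = \exp\left(-\frac{x^2}{2\sigma^2}\right) \exp\left(\frac{xy}{\sigma^2}\right) \exp\left(-\frac{y^2}{2\sigma^2}\right)$
It corresponds to the scalar product in the infinite-dimensional feature space:
$\phi(x) = \exp\left(-\frac{x^2}{2\sigma^2}\right) \left(1, \; \frac{x}{\sigma}, \; \frac{x^2}{\sqrt{2!}\,\sigma^2}, \; \frac{x^3}{\sqrt{3!}\,\sigma^3}, \; \dots, \; \frac{x^n}{\sqrt{n!}\,\sigma^n}, \; \dots\right)^T$
For vectors in $m$ dimensions the feature space is more complicated.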
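The infinite-dimensional expansion can be illustrated by truncating it. The sketch below is a numerical check for scalar inputs only; the function names, the test points and the numbers of retained terms are arbitrary assumptions.

```python
import math


def rbf(x, y, sigma):
    """Gaussian kernel for scalars: exp(-(x - y)^2 / (2 sigma^2))."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def phi_truncated(x, sigma, n_terms):
    """First n_terms coordinates of the infinite-dimensional feature map."""
    scale = math.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return [scale * x ** n / (math.sqrt(math.factorial(n)) * sigma ** n)
            for n in range(n_terms)]


x, y, sigma = 0.8, -0.3, 1.0
exact = rbf(x, y, sigma)
for n_terms in (2, 5, 10, 20):
    # Inner product of the truncated feature vectors
    approx = sum(a * b for a, b in zip(phi_truncated(x, sigma, n_terms),
                                       phi_truncated(y, sigma, n_terms)))
    print(n_terms, approx, exact)
```

The truncated inner product converges quickly to the kernel value because the terms of the exponential series decay factorially.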

Without slack variables
[Figure from Bishop C, Pattern Recognition and Machine Learning, Springer]

With slack variables
[Figure from Bishop C, Pattern Recognition and Machine Learning, Springer]

Gaussian RBF kernel
[Figure from Bishop C, Pattern Recognition and Machine Learning, Springer]

Building new kernels
If $k_1(x, y)$ and $k_2(x, y)$ are two valid kernels, then the following kernels are valid:
Linear combination (with $c_1, c_2 \ge 0$): $k(x, y) = c_1 k_1(x, y) + c_2 k_2(x, y)$
Exponential: $k(x, y) = \exp\left(k_1(x, y)\right)$
Product: $k(x, y) = k_1(x, y)\, k_2(x, y)$
Polynomial transformation ($Q$: polynomial with non-negative coefficients): $k(x, y) = Q\left(k_1(x, y)\right)$
Function product ($f$: any function): $k(x, y) = f(x)\, k_1(x, y)\, f(y)$
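These closure rules map naturally onto higher-order functions. The sketch below builds three composite kernels and spot-checks that random quadratic forms of their Gram matrices are non-negative. This is a sanity check on a few random points, not a proof of validity; the helper names, the sample points and the coefficients are all assumptions.

```python
import math
import random


def linear(x, y):
    """Linear kernel: plain inner product."""
    return sum(a * b for a, b in zip(x, y))


def rbf(x, y, sigma=1.0):
    """Gaussian kernel with width sigma."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def combine(k1, k2, c1=1.0, c2=1.0):
    """Linear combination with non-negative coefficients c1, c2."""
    return lambda x, y: c1 * k1(x, y) + c2 * k2(x, y)


def product(k1, k2):
    """Pointwise product of two kernels."""
    return lambda x, y: k1(x, y) * k2(x, y)


def exponential(k1):
    """Exponential of a kernel."""
    return lambda x, y: math.exp(k1(x, y))


def min_quadratic_form(kernel, points, trials=200, seed=0):
    """Smallest value of c^T G c over random vectors c (G = Gram matrix)."""
    rng = random.Random(seed)
    gram = [[kernel(p, q) for q in points] for p in points]
    n = len(points)
    worst = float("inf")
    for _ in range(trials):
        c = [rng.uniform(-1, 1) for _ in range(n)]
        value = sum(c[i] * c[j] * gram[i][j] for i in range(n) for j in range(n))
        worst = min(worst, value)
    return worst


rng = random.Random(1)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(8)]
candidates = {
    "linear combination": combine(linear, rbf, 0.5, 2.0),
    "product": product(linear, rbf),
    "exponential": exponential(linear),
}
results = {name: min_quadratic_form(k, points) for name, k in candidates.items()}
print(results)
```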

Choosing the kernel function
Probably the trickiest part of using SVM. The kernel function is important because it creates the kernel matrix, which summarizes all the data.
Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...). There is even research on estimating the kernel matrix from available information.
In practice, a low degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try.
Note that SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen for SVM.

Kernels can also be defined for structures other than vectors
Computational biology often deals with structures different from vectors:
Sequences (DNA, RNA, proteins)
Trees (phylogenetic relationships)
Graphs (interaction networks)
3-D structures (proteins)
Is it possible to build kernels for such structures? Either transform the data onto a feature space made of n-dimensional real vectors and then compute the scalar product, or write a kernel without explicitly writing the feature space (but... what is a kernel?).

Defining kernels without defining the feature transformation
What does a kernel represent? Distance in feature space.
A kernel is a SIMILARITY measure. Moreover, it has to fulfil a "positivity" condition.
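The link between a kernel and distance in feature space follows from expanding the squared norm. This is a standard identity, written here in the $\phi$ and $K$ notation of the earlier slides:

```latex
\|\phi(x) - \phi(y)\|^{2}
  = \phi(x)^{T}\phi(x) - 2\,\phi(x)^{T}\phi(y) + \phi(y)^{T}\phi(y)
  = K(x, x) - 2\,K(x, y) + K(y, y),
\qquad
\cos\angle\big(\phi(x), \phi(y)\big) = \frac{K(x, y)}{\sqrt{K(x, x)\,K(y, y)}} .
```

Both quantities are computed from kernel values alone, so distances and angles in feature space are available even when $\phi$ is never written down.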

Spectral kernel for sequences
Given a DNA sequence $x$ we can count the number of bases (4-D feature space):
$\phi_1(x) = (n_A, n_C, n_G, n_T)$
Or the number of dimers (16-D space):
$\phi_2(x) = (n_{AA}, n_{AC}, n_{AG}, n_{AT}, n_{CA}, n_{CC}, n_{CG}, n_{CT}, \dots)$
Or $l$-mers ($4^l$-D space).
The spectral kernel is
$k_l(x, y) = \phi_l(x) \cdot \phi_l(y)$

l is usually lower than
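The spectral kernel above can be sketched with a sparse count of l-mers, which avoids materialising the full 4^l-dimensional vector. Counting overlapping l-mers, the function names and the two toy sequences are assumptions made for illustration.

```python
from collections import Counter


def spectrum(seq, l):
    """Count every overlapping l-mer in the sequence (sparse feature vector)."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def spectrum_kernel(x, y, l):
    """k_l(x, y) = phi_l(x) . phi_l(y), summed over l-mers present in x."""
    cx, cy = spectrum(x, l), spectrum(y, l)
    return sum(count * cy[kmer] for kmer, count in cx.items())


a, b = "ACGTAC", "TACGGA"        # hypothetical toy sequences
print(spectrum_kernel(a, b, 1))  # base composition, 4-D space
print(spectrum_kernel(a, b, 2))  # dimers, 16-D space
```

Only l-mers that occur in both sequences contribute, so the cost depends on the sequence lengths rather than on 4^l.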

Kernels out of generative models
Given a generative model associating a probability $p(x \mid \theta)$ to a given input, we define:
$K(x, y) = p(x \mid \theta)\, p(y \mid \theta)$
Fisher kernel:
$g(\theta, x) = \nabla_\theta \ln p(x \mid \theta)$
$F = E_x\left[g(\theta, x)\, g(\theta, x)^T\right] \approx \frac{1}{N} \sum_{n=1}^{N} g(\theta, x_n)\, g(\theta, x_n)^T$
$K(x, y) = g(\theta, x)^T F^{-1} g(\theta, y)$

Other aspects of SVM
How to use SVM for multi-class classification?
One can change the QP formulation to become multi-class. More often, multiple binary classifiers are combined: one can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers "intelligently".

Other aspects of SVM
How to interpret the SVM discriminant function value as a probability?
By performing logistic regression on the SVM output of a set of data (validation set) that is not used for training.
Some SVM software (like libsvm) has these features built in.
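The idea of fitting a logistic regression to SVM outputs can be sketched in plain Python. This is a simplified illustration under stated assumptions: it uses plain gradient descent on the log-loss, the validation scores and labels are made up, and it is not the procedure implemented inside libsvm or any other package.

```python
import math


def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(scores, labels, lr=0.1, steps=2000):
    """Fit P(y=1 | f) = sigmoid(a*f + b) by gradient descent on the log-loss.

    scores: SVM outputs f(x) on held-out data; labels: 1 for class 1, else 0.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for f, y in zip(scores, labels):
            err = sigmoid(a * f + b) - y   # derivative of the log-loss
            grad_a += err * f / n
            grad_b += err / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Hypothetical validation-set outputs (not from the slides)
val_scores = [-2.1, -1.3, -0.4, 0.2, -0.1, 0.9, 1.5, 2.4]
val_labels = [0, 0, 0, 0, 1, 1, 1, 1]
a, b = fit_logistic(val_scores, val_labels)


def prob(f):
    """Estimated probability of class 1 for an SVM output f."""
    return sigmoid(a * f + b)


print(a, b, prob(-1.0), prob(0.0), prob(1.0))
```

Using held-out data matters here: the SVM outputs on its own training set are biased towards the margin, which would distort the fitted probabilities.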

Software
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification.
SVMLight is among the earliest implementations of SVM.
Several Matlab toolboxes for SVM are also available.

Summary: steps for classification
1. Prepare the pattern matrix.
2. Select the kernel function to use.
3. Select the parameter of the kernel function and the value of $C$. You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters.
4. Execute the training algorithm and obtain the $\alpha_i$.
5. Unseen data can be classified using the $\alpha_i$ and the support vectors.

Strengths and weaknesses of SVM
Strengths:
Training is relatively easy: there are no local optima, unlike in neural networks.
It scales relatively well to high dimensional data.
The tradeoff between classifier complexity and error can be controlled explicitly.
Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors.
Weaknesses:
Need to choose a "good" kernel function.

Other types of kernel methods
A lesson learnt in SVM: a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space.
Standard linear algorithms can be generalized to their nonlinear versions by going to the feature space.
Kernel principal component analysis, kernel independent component analysis, kernel canonical correlation analysis, kernel k-means and 1-class SVM are some examples.