Chapter 6: Support Vector Machines (French: séparateurs à vaste marge)

A binary classification method based on supervised learning, introduced by Vladimir Vapnik in 1995. It relies on the existence of a linear classifier and is efficient in terms of both computation time and accuracy.

BASIC IDEA: Find a linear classifier (a hyperplane) separating the data of the plane into two categories: Class 1 (+) for points with y > 0, Class 2 (-) for points with y < 0. Maximize the separation distance between these two classes.

Discriminant Function. It can be an arbitrary function of x, such as: Nearest Neighbor, Decision Tree, a linear function $g(x) = w^\top x + b$, or a nonlinear function.

Linear Discriminant Function. $g(x)$ is a linear function, $g(x) = w^\top x + b$, i.e. a hyperplane in the feature space. Its (unit-length) normal vector is $n = w / \|w\|$. Points with $w^\top x + b > 0$ lie on one side of the hyperplane, points with $w^\top x + b < 0$ on the other.
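
As a small illustration (a sketch, not from the original slides; the weights, bias, and points below are arbitrary), evaluating such a linear discriminant in MATLAB:

% Evaluate a linear discriminant g(x) = w'*x + b for a few 2-D points.
% w, b and the points are arbitrary illustrative values.
w = [2; -1];             % normal vector of the hyperplane
b = 0.5;                 % bias term
X = [1 0.5; -1 2; 3 1];  % one point per row
g = X * w + b;           % g(x) for every point
labels = sign(g);        % predicted side of the hyperplane: +1 or -1
dist = g / norm(w);      % signed distance of each point to the hyperplane
disp([g labels dist])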

Linear Discriminant Function. How would you classify these points (some labeled +1, some labeled -1) using a linear discriminant function in order to minimize the error rate? There are infinitely many answers. Which one is the best?

Large Margin Linear Classifier. The linear discriminant function (classifier) with the maximum margin is the best. The margin is defined as the width by which the decision boundary could be increased before hitting a data point (the "safe zone"). Why is it the best? It is robust to outliers and therefore has strong generalization ability.

Large Margin Linear Classifier. Given a set of data points $\{(x_i, y_i)\}$, $i = 1, \dots, n$, with $y_i \in \{+1, -1\}$: for $y_i = +1$, $w^\top x_i + b > 0$; for $y_i = -1$, $w^\top x_i + b < 0$. With a scale transformation on both $w$ and $b$, this is equivalent to: for $y_i = +1$, $w^\top x_i + b \ge 1$; for $y_i = -1$, $w^\top x_i + b \le -1$.

Large Margin Linear Classifier. Formulation: maximize the margin $\frac{2}{\|w\|}$ such that for $y_i = +1$, $w^\top x_i + b \ge 1$, and for $y_i = -1$, $w^\top x_i + b \le -1$.

Large Margin Linear Classifier. Equivalent formulation: minimize $\frac{1}{2}\|w\|^2$ such that for $y_i = +1$, $w^\top x_i + b \ge 1$, and for $y_i = -1$, $w^\top x_i + b \le -1$.
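
A step the slide does not spell out: why the margin equals $2/\|w\|$. Take any point $x^+$ on the plane $w^\top x + b = +1$ and any point $x^-$ on the plane $w^\top x + b = -1$; projecting their difference onto the unit normal $w/\|w\|$ gives the distance between the two planes, $m = \frac{w^\top (x^+ - x^-)}{\|w\|} = \frac{(1 - b) - (-1 - b)}{\|w\|} = \frac{2}{\|w\|}$. Maximizing $2/\|w\|$ is therefore equivalent to minimizing $\frac{1}{2}\|w\|^2$.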

Large Margin Linear Classifier. Formulation: minimize $\frac{1}{2}\|w\|^2$ such that $y_i (w^\top x_i + b) \ge 1$ for all $i$.

Solving the Optimization Problem. Quadratic programming with linear constraints: minimize $\frac{1}{2}\|w\|^2$ s.t. $y_i (w^\top x_i + b) \ge 1$. Lagrangian function: minimize $L_p(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]$ s.t. $\alpha_i \ge 0$.

Solving the Optimization Problem. Setting the derivatives of $L_p$ to zero: $\frac{\partial L_p}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n} \alpha_i y_i x_i$, and $\frac{\partial L_p}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0$.

Solving the Optimization Problem. Substituting back gives the Lagrangian dual problem: maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$ s.t. $\alpha_i \ge 0$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.
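
As an illustration of how this quadratic program could be solved numerically, here is a minimal MATLAB sketch (an assumption, not part of the original slides) using quadprog from the Optimization Toolbox on a tiny separable toy set; all data and variable names are illustrative:

% Hard-margin dual: maximize sum(alpha) - 1/2*alpha'*H*alpha
% with H(i,j) = y_i*y_j*x_i'*x_j. quadprog minimizes, so the objective is negated.
X = [1 1; 2 2; 2 0; 0 0; 1 0; 0 1];   % toy data, one point per row
y = [1; 1; 1; -1; -1; -1];            % class labels
n = size(X,1);
H = (y*y') .* (X*X');                 % quadratic term of the dual
f = -ones(n,1);                       % linear term (negated for minimization)
Aeq = y'; beq = 0;                    % constraint sum(alpha_i*y_i) = 0
lb = zeros(n,1);                      % alpha_i >= 0 (hard margin: no upper bound)
alpha = quadprog(H, f, [], [], Aeq, beq, lb, []);
sv = alpha > 1e-6;                    % support vectors have nonzero alpha
w = X' * (alpha .* y);                % w = sum_i alpha_i*y_i*x_i
b = mean(y(sv) - X(sv,:)*w);          % b from y_s*(w'*x_s + b) = 1, averaged over SVs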

Solving the Optimization Problem. From the KKT condition we know that $\alpha_i \left[ y_i (w^\top x_i + b) - 1 \right] = 0$. Thus only the support vectors have $\alpha_i \neq 0$. The solution has the form $w = \sum_{i=1}^{n} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$; $b$ is obtained from $y_i (w^\top x_i + b) - 1 = 0$, where $x_i$ is any support vector.

Solving the Optimization Problem. The linear discriminant function is $g(x) = w^\top x + b = \sum_{i \in SV} \alpha_i y_i\, x_i^\top x + b$. Notice that it relies on a dot product between the test point $x$ and the support vectors $x_i$. Also keep in mind that solving the optimization problem involved computing the dot products $x_i^\top x_j$ between all pairs of training points.

Solution of the optimization problem: $w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i$, $w_0^* = y_s - w^* \cdot x_s$, and $D(x) = w^* \cdot x + w_0^*$, where the $\alpha_i^*$ are the estimated multipliers and $(x_s, y_s)$ is any support vector. Only the $\alpha_i$ corresponding to the closest points are nonzero; these points are called support vectors, and they determine the optimal hyperplane.

Geometric interpretation (figure): two classes separated by the optimal hyperplane; most points have $\alpha_i = 0$, and only the support vectors carry nonzero multipliers (in the example, $\alpha_8 = 0.6$, $\alpha_6 = 1.4$, $\alpha_1 = 0.8$, all other $\alpha_i = 0$).

Large Margin Linear Classifier. What if the data are not linearly separable (noisy data, outliers, etc.)? Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy data points.

Large Margin Linear Classifier. Formulation (soft margin): minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$ such that $y_i (w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$. The parameter $C$ can be viewed as a way to control over-fitting.

Large Margin Linear Classifier. Formulation (Lagrangian dual problem): maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$ such that $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.
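
Continuing the earlier quadprog sketch (still an illustrative assumption, reusing the variables H, f, Aeq, beq, lb and n defined there), the soft-margin dual changes only the box constraint on alpha:

% Soft margin: same dual as before, but with 0 <= alpha_i <= C.
C = 1;                              % illustrative value of the regularization parameter
ub = C * ones(n,1);                 % upper bound C replaces the unbounded hard-margin case
alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);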

Non-linear SVMs. Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Non-linear SVMs: Feature Space. General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: $\Phi: x \mapsto \varphi(x)$. This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Nonlinear SVMs: The Kernel Trick. With this mapping, our discriminant function is now $g(x) = w^\top \varphi(x) + b = \sum_{i \in SV} \alpha_i y_i\, \varphi(x_i)^\top \varphi(x) + b$. There is no need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: $K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j)$.

Nonlinear SVMs: The Kernel Trick. An example: 2-dimensional vectors $x = [x_1\ x_2]$; let $K(x_i, x_j) = (1 + x_i^\top x_j)^2$. We need to show that $K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j)$: $(1 + x_i^\top x_j)^2 = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2} = [1\ \ x_{i1}^2\ \ \sqrt{2}\, x_{i1} x_{i2}\ \ x_{i2}^2\ \ \sqrt{2}\, x_{i1}\ \ \sqrt{2}\, x_{i2}]^\top [1\ \ x_{j1}^2\ \ \sqrt{2}\, x_{j1} x_{j2}\ \ x_{j2}^2\ \ \sqrt{2}\, x_{j1}\ \ \sqrt{2}\, x_{j2}] = \varphi(x_i)^\top \varphi(x_j)$, where $\varphi(x) = [1\ \ x_1^2\ \ \sqrt{2}\, x_1 x_2\ \ x_2^2\ \ \sqrt{2}\, x_1\ \ \sqrt{2}\, x_2]$. This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
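
This identity is easy to check numerically; a quick MATLAB sketch (not from the slides, using two arbitrary vectors):

% Verify that (1 + xi'*xj)^2 equals phi(xi)'*phi(xj) for
% phi(x) = [1, x1^2, sqrt(2)*x1*x2, x2^2, sqrt(2)*x1, sqrt(2)*x2].
xi = [0.5; -1.2];  xj = [2.0; 0.3];   % arbitrary 2-D column vectors
phi = @(x) [1; x(1)^2; sqrt(2)*x(1)*x(2); x(2)^2; sqrt(2)*x(1); sqrt(2)*x(2)];
k_direct  = (1 + xi'*xj)^2;           % kernel evaluated directly
k_feature = phi(xi)' * phi(xj);       % dot product in the expanded feature space
disp([k_direct k_feature])            % the two values coincide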

Nonlinear SVMs: The Kernel Trick. Examples of commonly used kernel functions: Linear kernel: $K(x_i, x_j) = x_i^\top x_j$. Polynomial kernel: $K(x_i, x_j) = (1 + x_i^\top x_j)^p$. Gaussian (Radial Basis Function, RBF) kernel: $K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$. Sigmoid: $K(x_i, x_j) = \tanh(\beta_0\, x_i^\top x_j + \beta_1)$. In general, functions that satisfy Mercer's condition can be kernel functions.
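
For instance, these kernels could be written as MATLAB anonymous functions (a sketch; the parameter values p, sigma, beta0, beta1 below are illustrative assumptions):

% Kernels between two column vectors xi and xj.
p = 3;  sigma = 1.0;  beta0 = 0.1;  beta1 = -1;
k_lin  = @(xi,xj) xi' * xj;                               % linear
k_poly = @(xi,xj) (1 + xi' * xj)^p;                       % polynomial of degree p
k_rbf  = @(xi,xj) exp(-norm(xi - xj)^2 / (2*sigma^2));    % Gaussian (RBF)
k_sig  = @(xi,xj) tanh(beta0 * (xi' * xj) + beta1);       % sigmoid
k_rbf([1; 2], [1.5; 1.8])                                 % example evaluation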

Nonlinear SVM: Optimization. Formulation (Lagrangian dual problem): maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j)$ such that $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$. The solution of the discriminant function is $g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$. The optimization technique is the same.
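
As a minimal sketch (an assumption, not part of the slides) of evaluating this kernelized discriminant once the dual has been solved, with placeholder support vectors, multipliers, and bias:

% g(x) = sum_i alpha_i*y_i*K(x_i, x) + b, using only the stored support vectors.
k_rbf = @(xi,xj) exp(-norm(xi - xj)^2 / 2);   % RBF kernel with sigma = 1 (assumed)
Xsv   = [1 1; 0 0];                           % support vectors, one per row (placeholder)
alpha = [0.7; 0.7];  ysv = [1; -1];  b = 0;   % placeholder dual solution
x = [0.8; 0.9];                               % test point (column vector)
g = b;
for i = 1:size(Xsv,1)
    g = g + alpha(i) * ysv(i) * k_rbf(Xsv(i,:)', x);
end
predicted_class = sign(g)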

Support Vector Machine: Algorithm. 1. Choose a kernel function. 2. Choose a value for C. 3. Solve the quadratic programming problem (many software packages are available). 4. Construct the discriminant function from the support vectors.

Some Issues. Choice of kernel: a Gaussian or polynomial kernel is the default; if ineffective, more elaborate kernels are needed; domain experts can give assistance in formulating appropriate similarity measures. Choice of kernel parameters: e.g. $\sigma$ in the Gaussian kernel; a common heuristic takes $\sigma$ as the distance between the closest points with different classifications; in the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters. Optimization criterion, hard margin vs. soft margin: typically a lengthy series of experiments in which various parameters are tested. This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Summary: Support Vector Machine. 1. Large Margin Classifier: better generalization ability and less over-fitting. 2. The Kernel Trick: map data points to a higher-dimensional space in order to make them linearly separable. Since only the dot product is used, we do not need to represent the mapping explicitly.

Solution of the new optimization problem. The decision function then becomes $D(x) = \sum_{i=1}^{m_S} \alpha_i u_i K(x_i, x) + w_0$, where $u_i$ denotes the class label of $x_i$ and $m_S$ is the number of support vectors.

OPERATING SCHEME OF SVMs (figure): the input vector $x$ is compared with the stored samples $x_1, x_2, x_3, \dots$ through kernel evaluations $K(x_i, x)$; the output is $\mathrm{sign}\!\left(\sum_i \alpha_i u_i K(x_i, x) + w_0\right)$.

Architecture of SVMs. Nonlinear classifier (using a kernel). The decision function is computed as the solution of a quadratic program: $f(x) = \mathrm{sign}\!\left(\sum_{i=1}^{l} v_i\, \varphi(x_i)^\top \varphi(x) + b\right) = \mathrm{sign}\!\left(\sum_{i=1}^{l} v_i\, k(x_i, x) + b\right)$, where $v_i = \alpha_i y_i$ is substituted for each training example $x_i$.

Matlab example:
load fisheriris
data = [meas(:,1), meas(:,2)];
% Extract the Setosa class
groups = ismember(species,'setosa');
% Randomly select training and test sets
[train, test] = crossvalind('holdout',groups);
% Use a linear support vector machine classifier
svmStruct = svmtrain(data(train,:),groups(train),'showplot',true);
classes = svmclassify(svmStruct,data(test,:),'showplot',true);
% See how well the classifier performed
cp = classperf(groups);
classperf(cp,classes,test);
cp.CorrectRate
Other performance measures: sensitivity or true positive rate (TPR); specificity (SPC) or true negative rate.
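
Note that svmtrain and svmclassify were removed from recent MATLAB releases; a rough modern equivalent of the same example (a sketch, assuming the Statistics and Machine Learning Toolbox for fitcsvm and the Bioinformatics Toolbox for crossvalind) is:

% Modern equivalent of the slide's example.
load fisheriris
data = [meas(:,1), meas(:,2)];
groups = ismember(species,'setosa');           % Setosa vs. the rest
[train, test] = crossvalind('holdout',groups);
model = fitcsvm(data(train,:), groups(train)); % linear kernel by default
classes = predict(model, data(test,:));
correctRate = mean(classes == groups(test))    % fraction classified correctly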

[Figure: scatter plot produced by the example, showing points labeled 0 (training), 0 (classified), 1 (training), 1 (classified), and the Support Vectors.]

Additional Resource: http://www.kernel-machines.org/

Demo of LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/