Support Vector Machines. Vibhav Gogate, The University of Texas at Dallas


What Have We Learned So Far?
1. Decision Trees
2. Naïve Bayes
3. Linear Regression
4. Logistic Regression
5. Perceptron
6. Neural Networks
7. K-Nearest Neighbors
Which of the above are linear and which are not? (1), (6) and (7) are non-linear; (2) is linear under certain restrictions.

Decision Surfaces: Decision Tree; Linear Functions, g(x) = w^T x + b; Nonlinear Functions (Neural nets).

Today: Support Vector Machine (SVM). A classifier derived from statistical learning theory by Vapnik et al. in 1992. SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features on a handwriting recognition task. Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc. Also used for regression (will not cover today). Reading: Chapter 5.1, 5.2, 5.3, 5.11 (5.4*) in Bishop; SVM tutorial (start reading from Section 3). (Photo: V. Vapnik)

Outline: Linear Discriminant Function; Large Margin Linear Classifier; Nonlinear SVM: The Kernel Trick; Demo of SVM.

Linear Discriminant Function or a Linear Classifier. Given data from two classes (denoted +1 and -1), learn a function of the form g(x) = w^T x + b, a hyperplane in the feature space. Decide class = +1 if g(x) > 0 (the side where w^T x + b > 0) and class = -1 otherwise (w^T x + b < 0). (Figure: points from the two classes in the (x1, x2) plane, separated by the hyperplane.)
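As a tiny illustration of this decision rule (the weight vector, bias, and test points below are made up for the example, not taken from the slides):

```python
import numpy as np

# Hypothetical weights and bias; in practice these are learned from data.
w = np.array([1.5, -0.5])
b = -1.0

def classify(x):
    """Decide class +1 if g(x) = w.x + b > 0, else -1."""
    g = w @ x + b
    return 1 if g > 0 else -1

print(classify(np.array([2.0, 1.0])))   # falls on the +1 side of the hyperplane
print(classify(np.array([0.0, 0.0])))   # falls on the -1 side
```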

Linear Discriminant Function. How would you classify these points using a linear discriminant function in order to minimize the error rate? There is an infinite number of answers! Which one is the best? (Figure: points from classes +1 and -1 in the (x1, x2) plane, with several candidate separating lines.)

Large Margin Linear Classifier. The linear discriminant function (classifier) with the maximum margin is the best. The margin is defined as the width by which the boundary could be increased before hitting a data point. Why is it the best? The larger the margin, the better the generalization, and the classifier is more robust to outliers. (Figure: the margin and "safe zone" around the separating boundary.)

Large Margin Linear Classifier. Aim: learn a large margin classifier. Given a set of data points, define: for y_i = +1, w^T x_i + b >= 1; for y_i = -1, w^T x_i + b <= -1. Give an algebraic expression for the width of the margin.

Algebraic Expression for the Width of the Margin. (Figure: the margin is the distance between the hyperplanes w^T x + b = 1 and w^T x + b = -1, which works out to 2/||w||.)

Large Margin Linear Classifier. Aim: learn a large margin classifier. Mathematical formulation: maximize 2/||w|| such that for y_i = +1, w^T x_i + b >= 1; for y_i = -1, w^T x_i + b <= -1. Common theme in machine learning: LEARNING IS OPTIMIZATION.

Large Margin Linear Classifier. Equivalent formulation: minimize (1/2)||w||^2 such that for y_i = +1, w^T x_i + b >= 1; for y_i = -1, w^T x_i + b <= -1.

Large Margin Linear Classifier. Formulation: minimize (1/2)||w||^2 such that y_i(w^T x_i + b) >= 1.

Large Margin Linear Classifier. Formulation: minimize (1/2)||w||^2 such that y_i(w^T x_i + b) >= 1. This is a quadratic programming problem with linear constraints, so it can be solved with off-the-shelf software. However, we will convert it to the Lagrangian dual in order to use the kernel trick!
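To make the quadratic program concrete, here is a minimal sketch of the hard-margin primal using the cvxpy modeling library; the toy dataset and the choice of cvxpy are illustrative assumptions, not part of the slides:

```python
import numpy as np
import cvxpy as cp

# Assumed toy linearly separable data (two points per class).
X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# Primal hard-margin SVM: minimize (1/2)||w||^2 s.t. y_i (w^T x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()
print("w =", w.value, "b =", b.value)
```

Any quadratic programming solver would do here; the dual form derived next is what makes kernels possible.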

Solving the Optimization Problem. Quadratic programming with linear constraints: minimize (1/2)||w||^2 s.t. y_i(w^T x_i + b) >= 1. Introducing Lagrange multipliers α_i gives the Lagrangian function: minimize L_p(w, b, α) = (1/2)||w||^2 - Σ_{i=1}^{n} α_i [ y_i(w^T x_i + b) - 1 ], s.t. α_i >= 0.

Solving the Optimization Problem. minimize L_p(w, b, α) = (1/2)||w||^2 - Σ_{i=1}^{n} α_i [ y_i(w^T x_i + b) - 1 ], s.t. α_i >= 0. Setting ∂L_p/∂w = 0 gives w = Σ_{i=1}^{n} α_i y_i x_i; setting ∂L_p/∂b = 0 gives Σ_{i=1}^{n} α_i y_i = 0.

Solving the Optimization Problem. Substituting these back into L_p gives the Lagrangian dual problem: maximize Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j, such that α_i >= 0 and Σ_{i=1}^{n} α_i y_i = 0.

Solving the Optimization Problem. From these equations we can prove the KKT conditions: α_i [ y_i(w^T x_i + b) - 1 ] = 0. Thus, only the support vectors have α_i != 0. The solution has the form w = Σ_{i=1}^{n} α_i y_i x_i = Σ_{i in SV} α_i y_i x_i; get b from y_i(w^T x_i + b) - 1 = 0, where x_i is any support vector. (Figure: the support vectors are the points lying on the margin boundaries.)

Solving the Optimization Problem. The linear discriminant function is: g(x) = w^T x + b = Σ_{i in SV} α_i y_i x_i^T x + b. Notice that it relies on a dot product between the test point x and the support vectors x_i. Also keep in mind that solving the optimization problem involved computing the dot products x_i^T x_j between all pairs of training points.
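As a small illustration of how the solution is assembled from the dual variables, the sketch below recovers w and b and returns g(x). It assumes the α_i have already been produced by some QP solver (not shown); the function and variable names are hypothetical:

```python
import numpy as np

def build_discriminant(alphas, X, y, tol=1e-8):
    """Recover w and b from dual variables and return g(x) = w.x + b."""
    sv = alphas > tol                      # only support vectors have alpha_i > 0
    w = (alphas[sv] * y[sv]) @ X[sv]       # w = sum_{i in SV} alpha_i y_i x_i
    b = np.mean(y[sv] - X[sv] @ w)         # from y_i (w.x_i + b) = 1 on the margin
    return lambda x: x @ w + b
```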

Large Margin Linear Classifier. What if the data is not linearly separable? (noisy data, outliers, etc.) Slack variables ξ_i can be added to allow misclassification of difficult or noisy data points. (Figure: points with slack ξ_1, ξ_2 on the wrong side of the margin.)

Large Margin Linear Classifier. Formulation with slack variables: minimize (1/2)||w||^2 + C Σ_{i=1}^{n} ξ_i such that y_i(w^T x_i + b) >= 1 - ξ_i and ξ_i >= 0. (Without slack variables: minimize (1/2)||w||^2 s.t. y_i(w^T x_i + b) >= 1.) The parameter C can be viewed as a way to control over-fitting.

Large Margin Linear Classifier. Formulation (Lagrangian dual problem): maximize Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j such that 0 <= α_i <= C and Σ_{i=1}^{n} α_i y_i = 0.
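The box constraint 0 <= α_i <= C is the only change from the hard-margin dual. A minimal, illustrative way to solve this dual with a general-purpose solver (scipy's SLSQP here; dedicated QP or SMO solvers are far more efficient in practice) might look like this:

```python
import numpy as np
from scipy.optimize import minimize

def solve_soft_margin_dual(X, y, C=1.0):
    """Sketch: maximize sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j x_i.x_j
    subject to 0 <= a_i <= C and sum_i a_i y_i = 0."""
    n = len(y)
    G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j

    def neg_dual(a):                            # minimize the negative dual objective
        return 0.5 * a @ G @ a - a.sum()

    res = minimize(neg_dual, np.zeros(n), method="SLSQP",
                   bounds=[(0, C)] * n,
                   constraints={"type": "eq", "fun": lambda a: a @ y})
    return res.x                                # the dual variables alpha_i
```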

Non-linear SVMs. Datasets that are linearly separable with noise work out great. But what are we going to do if the dataset is just too hard? The kernel trick!!! SVM = Linear SVM + Kernel Trick. (Figure: a 1-D dataset separable by a threshold at 0 despite noise, versus one that no threshold can separate.) This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Kernel Trick Motivation. Linear classifiers are well understood, widely used and efficient. How can we use linear classifiers to build non-linear ones? Neural networks: construct non-linear classifiers by using a network of linear classifiers (perceptrons). Kernels: map the problem from the input space to a new higher-dimensional space (called the feature space) by doing a non-linear transformation using a special function called the kernel, then use a linear model in this new high-dimensional feature space. The linear model in the feature space corresponds to a non-linear model in the input space.

Non-linear SVMs: Feature Space. General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x). This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Nonlinear SVMs: The Kernel Trick. With this mapping, our discriminant function is now: g(x) = w^T φ(x) + b = Σ_{i in SV} α_i y_i φ(x_i)^T φ(x) + b. There is no need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(x_i, x_j) = φ(x_i)^T φ(x_j).

Nonlinear SVMs: The Kernel Trick. An example with 2-dimensional vectors x = [x_1, x_2]: let K(x_i, x_j) = (1 + x_i^T x_j)^2. We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):
K(x_i, x_j) = (1 + x_i^T x_j)^2 = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
= [1, x_{i1}^2, √2 x_{i1} x_{i2}, x_{i2}^2, √2 x_{i1}, √2 x_{i2}]^T [1, x_{j1}^2, √2 x_{j1} x_{j2}, x_{j2}^2, √2 x_{j1}, √2 x_{j2}]
= φ(x_i)^T φ(x_j), where φ(x) = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2].
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
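A quick numerical check of this identity (the two sample vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel K(u, v) = (1 + u.v)**2 in 2-D."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

u = np.array([0.7, -1.2])
v = np.array([2.0, 0.5])
print((1 + u @ v) ** 2)        # kernel value computed in the input space
print(phi(u) @ phi(v))         # identical dot product in the feature space
```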

Nonlinear SVMs: The Kernel Trick. Examples of commonly used kernel functions:
Linear kernel: K(x_i, x_j) = x_i^T x_j
Polynomial kernel: K(x_i, x_j) = (1 + x_i^T x_j)^p
Gaussian (Radial Basis Function, RBF) kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
Sigmoid: K(x_i, x_j) = tanh(β_0 x_i^T x_j + β_1)
In general, functions that satisfy Mercer's condition can be kernel functions: the kernel matrix should be positive semidefinite.
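For reference, these four kernels can be written directly in numpy; the hyperparameter names (p, sigma, beta0, beta1) just mirror the formulas above and are not fixed by the slides:

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```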

Nonlinear SVM: Optimization. Formulation (Lagrangian dual problem): maximize Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j K(x_i, x_j) such that 0 <= α_i <= C and Σ_{i=1}^{n} α_i y_i = 0. The solution for the discriminant function is g(x) = Σ_{i in SV} α_i y_i K(x_i, x) + b. The optimization technique is the same.

Support Vector Machine: Algorithm.
1. Choose a kernel function.
2. Choose a value for C.
3. Solve the quadratic programming problem (many software packages available).
4. Construct the discriminant function from the support vectors.
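A sketch of these four steps with one widely used package (scikit-learn); the toy data and hyperparameter values are illustrative assumptions, not taken from the slides:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data, assumed for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Steps 1-4: pick a kernel, pick C, solve the QP, build the discriminant.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)
print("number of support vectors:", clf.n_support_)
print("prediction for [0, 3]:", clf.predict([[0.0, 3.0]]))
```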

Some Issues.
Choice of kernel: the Gaussian or polynomial kernel is the default; if ineffective, more elaborate kernels are needed; domain experts can give assistance in formulating appropriate similarity measures.
Choice of kernel parameters: e.g. σ in the Gaussian kernel; σ is the distance between the closest points with different classifications; in the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.
Optimization criterion (hard margin vs. soft margin): a lengthy series of experiments in which various parameters are tested.
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
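A minimal sketch of the cross-validation approach mentioned above, using scikit-learn's grid search; the synthetic dataset, the grid values, and the use of scikit-learn are all assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for a real task.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Cross-validate over C and the RBF width (scikit-learn's gamma = 1 / (2 * sigma^2)).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```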

Summary: Support Vector Machine.
1. Large Margin Classifier: better generalization ability & less over-fitting.
2. The Kernel Trick: map data points to a higher-dimensional space in order to make them linearly separable; since only the dot product is used, we do not need to represent the mapping explicitly.

Additional Resource: http://www.kernel-machines.org/