Introduction to Support Vector Machines


Introduction to Support Vector Machines
Hsuan-Tien Lin
Learning Systems Group, California Institute of Technology
Talk at the NTU EE/CS Speech Lab, November 16, 2005

Learning Problem: Setup
a fixed-length example: a D-dimensional vector $x$, where each component is a feature
- raw digital sampling of a 0.5 sec. wave file
- DFT of the raw sampling
- a fixed-length feature vector extracted from the wave file
a label: a number $y \in \mathcal{Y}$
- binary classification: is there a man speaking in the wave file? ($y = +1$ if a man is speaking, $y = -1$ if not)
- multi-class classification: which speaker is speaking? ($y \in \{1, 2, \ldots, K\}$)
- regression: how excited is the speaker? ($y \in \mathbb{R}$)

Learning Problem: Binary Classification Problem
learning problem: given training examples and labels $\{(x_i, y_i)\}_{i=1}^{N}$, find a function $g(x): \mathcal{X} \to \mathcal{Y}$ that predicts the label of an unseen $x$ well
vowel identification: given training wave files and their vowel labels, find a function $g(x)$ that maps wave files to vowels well
we will focus on the binary classification problem: $\mathcal{Y} = \{+1, -1\}$
the most basic learning problem, but very useful and extensible to other problems
illustrative demo: for examples of two different colors in a D-dimensional space, how can we separate the examples?

Support Vector Machine: Hyperplane Classifier
use a hyperplane to separate the two colors: $g(x) = \mathrm{sign}(w^T x + b)$
if $w^T x + b \ge 0$, the classifier returns +1; otherwise it returns -1
possibly lots of hyperplanes satisfy our needs; which one should we choose?
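A minimal sketch of such a hyperplane classifier (not from the talk; the weights $w$ and bias $b$ below are arbitrary illustrative values):

```python
# A hyperplane classifier g(x) = sign(w^T x + b); w and b are illustrative only.
import numpy as np

def hyperplane_classify(x, w, b):
    return 1 if w @ x + b >= 0 else -1

print(hyperplane_classify(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))
```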

Support Vector Machine: SVM, a Large-Margin Hyperplane Classifier
margin $\rho_i = y_i (w^T x_i + b) / \|w\|_2$: does $y_i$ agree with $w^T x_i + b$ in sign? how large is the distance between the example and the separating hyperplane?
large positive margin ⇒ clear separation ⇒ low-risk classification
idea of SVM: maximize the minimum margin
$$\max_{w,b} \min_i \rho_i \quad \text{s.t.} \quad \rho_i = y_i (w^T x_i + b) / \|w\|_2 \ge 0$$

Support Vector Machine: Hard-Margin Linear SVM
maximize the minimum margin:
$$\max_{w,b} \min_i \rho_i \quad \text{s.t.} \quad \rho_i = y_i (w^T x_i + b) / \|w\|_2 \ge 0, \; i = 1, \ldots, N$$
equivalent to
$$\min_{w,b} \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \; i = 1, \ldots, N$$
the hard-margin linear SVM: a quadratic programming problem with D + 1 variables, well studied in optimization
is the hard-margin linear SVM good enough?
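A minimal sketch (not from the talk) of this quadratic program handed to a generic QP solver; it assumes the cvxopt package and the toy data below, whereas real SVM packages such as LIBSVM solve the dual instead:

```python
# Hard-margin primal QP over z = [w; b], a vector of D + 1 variables.
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """X: (N, D) data matrix, y: (N,) labels in {+1, -1}."""
    N, D = X.shape
    # objective: (1/2) z^T P z with P = diag(1, ..., 1, 0) -- no penalty on b
    P = np.zeros((D + 1, D + 1))
    P[:D, :D] = np.eye(D)
    q = np.zeros(D + 1)
    # constraints y_i (w^T x_i + b) >= 1, rewritten as G z <= h
    G = -np.hstack([y[:, None] * X, y[:, None]])
    h = -np.ones(N)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:D], z[D]          # w, b

# toy usage: two linearly separable clusters
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b)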

Support Vector Machine: Soft-Margin Linear SVM
hard-margin: hard constraints on separation:
$$\min_{w,b} \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \; i = 1, \ldots, N$$
no feasible solution if some noisy outliers exist
soft-margin: soft constraints as a cost:
$$\min_{w,b,\xi} \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, N$$
allow the noisy examples to have $\xi_i > 0$, at a cost
is the linear SVM good enough?
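A minimal sketch (not from the talk) of the soft-margin trade-off, assuming scikit-learn (whose SVC wraps LIBSVM) and the synthetic data below; the parameter C plays exactly the role of the violation cost above:

```python
# Larger C penalizes margin violations more heavily.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.n_support_)   # larger C -> fewer margin violations tolerated
```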

Support Vector Machine: Soft-Margin Nonlinear SVM
what if we want a boundary $g(x) = \mathrm{sign}(x^T x - 1)$? it can never be constructed with a hyperplane classifier $\mathrm{sign}(w^T x + b)$
however, we can use a more complex feature transform:
$$\phi(x) = [(x)_1, (x)_2, \ldots, (x)_D, (x)_1 (x)_1, (x)_1 (x)_2, \ldots, (x)_D (x)_D]$$
then there is a classifier $\mathrm{sign}(w^T \phi(x) + b)$ that describes the boundary
soft-margin nonlinear SVM: with a nonlinear $\phi(\cdot)$,
$$\min_{w,b,\xi} \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, N$$
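A minimal sketch (not from the talk) of the explicit degree-2 feature transform above; a circular boundary such as $\mathrm{sign}(x^T x - 1)$ becomes linear in the transformed space:

```python
# phi maps x in R^D to [x_1, ..., x_D, x_1*x_1, x_1*x_2, ..., x_D*x_D].
import numpy as np

def phi(x):
    return np.concatenate([x, np.outer(x, x).ravel()])

x = np.array([0.5, -1.2])
print(phi(x))            # 2 linear features + 4 quadratic features
```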

Support Vector Machine: Feature Transformation
what feature transforms $\phi(\cdot)$ should we use?
we can only extract a finite, small number of features, but we can use an unlimited number of feature transforms
traditionally, domain knowledge is used to design the feature transformation:
- use only useful feature transforms
- use a small number of feature transforms
- control the goodness of fit by a suitable choice of feature transforms
what if we use an infinite number of feature transforms and let the algorithm decide a good $w$ automatically?
- would an infinite number of transforms introduce overfitting?
- are we able to solve the optimization problem?

Support Vector Machine: Dual Problem
an infinite-dimensional quadratic program if $\phi(\cdot)$ is infinite-dimensional:
$$\min_{w,b,\xi} \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, N$$
luckily, we can solve its associated dual problem over an N-dimensional vector $\alpha$:
$$\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \; 0 \le \alpha_i \le C, \quad \text{where } Q_{ij} = y_i y_j \phi^T(x_i) \phi(x_j)$$
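A minimal sketch (not from the talk) of the soft-margin dual as a box-constrained QP, again assuming the cvxopt package; note that the data enter only through the kernel matrix, never through $\phi(x)$ itself, which is what makes infinite-dimensional transforms tractable:

```python
# Dual QP: minimize (1/2) a^T Q a - e^T a  s.t.  y^T a = 0, 0 <= a_i <= C.
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(K, y, C):
    """K: (N, N) kernel matrix, y: (N,) labels in {+1, -1}, C: violation cost."""
    N = len(y)
    Q = (y[:, None] * y[None, :]) * K            # Q_ij = y_i y_j K(x_i, x_j)
    G = np.vstack([-np.eye(N), np.eye(N)])       # encodes 0 <= alpha_i <= C
    h = np.hstack([np.zeros(N), C * np.ones(N)])
    A = y.reshape(1, N).astype(float)            # encodes y^T alpha = 0
    sol = solvers.qp(matrix(Q), matrix(-np.ones(N)), matrix(G), matrix(h),
                     matrix(A), matrix(np.zeros(1)))
    return np.array(sol['x']).ravel()            # alpha
```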

Support Vector Machine: Solution of the Dual Problem
associated dual problem:
$$\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \; 0 \le \alpha_i \le C, \quad \text{where } Q_{ij} = y_i y_j \phi^T(x_i) \phi(x_j)$$
equivalent solution:
$$g(x) = \mathrm{sign}\left(w^T \phi(x) + b\right) = \mathrm{sign}\left(\sum_i y_i \alpha_i \phi^T(x_i) \phi(x) + b\right)$$
no need for $w$ and $\phi(x)$ explicitly if we can compute $K(x, x') = \phi^T(x) \phi(x')$ efficiently
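A minimal sketch (not from the talk) of evaluating this dual-form decision function without ever forming $w$; the `kernel` argument is any function computing $K(x_i, x)$:

```python
# g(x) = sign(sum_i y_i alpha_i K(x_i, x) + b), computed purely from kernels.
import numpy as np

def decision(x, X_train, y_train, alpha, b, kernel):
    scores = np.array([kernel(x_i, x) for x_i in X_train])
    return np.sign(np.dot(y_train * alpha, scores) + b)
```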

Support Vector Machine: Kernel Trick
let the kernel be $K(x, x') = \phi^T(x) \phi(x')$
revisit: can we compute the kernel of
$$\phi(x) = [(x)_1, (x)_2, \ldots, (x)_D, (x)_1 (x)_1, (x)_1 (x)_2, \ldots, (x)_D (x)_D]$$
efficiently? well, not really
how about this one?
$$\phi(x) = \left[\sqrt{2}(x)_1, \sqrt{2}(x)_2, \ldots, \sqrt{2}(x)_D, (x)_1 (x)_1, \ldots, (x)_D (x)_D\right], \qquad K(x, x') = (1 + x^T x')^2 - 1$$
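A minimal sketch (not from the talk) numerically checking that the rescaled degree-2 transform above has the closed-form kernel $(1 + x^T x')^2 - 1$:

```python
# Explicit feature-space inner product vs. the O(D) kernel evaluation.
import numpy as np

def phi(x):
    return np.concatenate([np.sqrt(2) * x, np.outer(x, x).ravel()])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)
print(phi(x) @ phi(xp))                 # inner product in the transformed space
print((1 + x @ xp) ** 2 - 1)            # kernel evaluation: the same value
```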

Support Vector Machine: Different Kernels
types of kernels:
- linear: $K(x, x') = x^T x'$
- polynomial: $K(x, x') = (a x^T x' + r)^d$
- Gaussian RBF: $K(x, x') = \exp(-\gamma \|x - x'\|_2^2)$
- Laplacian RBF: $K(x, x') = \exp(-\gamma \|x - x'\|_1)$
the last two equivalently have feature transforms in an infinite-dimensional space!
a new paradigm for machine learning: use many, many feature transforms, and control the goodness of fit by the large margin (clear separation) and the violation cost (amount of outliers allowed)
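A minimal sketch (not from the talk) of the four kernels above as plain numpy functions, to make their interchangeable role in $K(x, x')$ explicit:

```python
# Each function takes two vectors and returns a scalar kernel value.
import numpy as np

linear    = lambda x, xp: x @ xp
poly      = lambda x, xp, a=1.0, r=1.0, d=2: (a * (x @ xp) + r) ** d
gauss_rbf = lambda x, xp, gamma=1.0: np.exp(-gamma * np.sum((x - xp) ** 2))
lapl_rbf  = lambda x, xp, gamma=1.0: np.exp(-gamma * np.sum(np.abs(x - xp)))
```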

Properties of SVM: Support Vectors as a Meaningful Representation
$$\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \; 0 \le \alpha_i \le C$$
equivalent solution: $g(x) = \mathrm{sign}\left(\sum_i y_i \alpha_i K(x_i, x) + b\right)$
only the examples with $\alpha_i > 0$ are needed for classification: the support vectors
from the optimality conditions:
- $\alpha_i = 0$: not needed in constructing the decision function; away from the boundary or on the boundary
- $0 < \alpha_i < C$: free support vector, on the boundary
- $\alpha_i = C$: bounded support vector, violating the boundary ($\xi_i > 0$) or on the boundary
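A minimal sketch (not from the talk) of inspecting which training examples end up as support vectors, assuming scikit-learn's SVC (which wraps LIBSVM) and the synthetic data below:

```python
# Fit an RBF-kernel SVM and look at the support vectors it keeps.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.5, 1, (100, 2)), rng.normal(-1.5, 1, (100, 2))])
y = np.array([+1] * 100 + [-1] * 100)

clf = SVC(kernel='rbf', gamma=0.5, C=1.0).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X))
print(clf.dual_coef_[0, :5])   # the first few y_i * alpha_i values
```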

Properties of SVM: Why is SVM Successful?
- infinite number of feature transforms: suitable for conquering nonlinear classification tasks
- large-margin concept: theoretically promising
- soft-margin trade-off: controls regularization well
- convex optimization problems: good optimization algorithms are possible (compared to neural networks and some other learning algorithms)
- support vectors: useful in data analysis and interpretation

Properties of SVM: Why is SVM Not Successful?
- SVM can be sensitive to scaling and to its parameters
- the standard SVM is only a discriminative classification algorithm
- SVM training can be time-consuming when N is large and the solver is not carefully implemented
- an infinite number of feature transforms can make the classifier seem mysterious

Using SVM: Useful Extensions of SVM
multiclass SVM: use the one-vs-one approach to combine binary SVMs into a multiclass classifier; the label that gets the most votes from the classifiers is the prediction
probability output: transform the raw output $w^T \phi(x) + b$ to a value in $[0, 1]$ that represents $P(+1 \mid x)$, using a sigmoid function to map $\mathbb{R} \to [0, 1]$ (see the sketch below)
infinite ensemble learning (Lin and Li 2005): if the kernel $K(x, x') = -\|x - x'\|_1$ is used in a standard SVM, the classifier is equivalently
$$g(x) = \mathrm{sign}\left(\int w_\theta \, s_\theta(x) \, d\theta + b\right)$$
where $s_\theta(x)$ is a thresholding rule on one feature of $x$
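A minimal sketch (not from the talk) of a Platt-style sigmoid for the probability-output extension; the parameters A and B below are illustrative only and would normally be fitted on held-out data:

```python
# Map the raw SVM output f(x) = w^T phi(x) + b to an estimate of P(+1 | x).
import numpy as np

def probability_output(f, A=-1.0, B=0.0):
    return 1.0 / (1.0 + np.exp(A * f + B))

print(probability_output(np.array([-2.0, 0.0, 2.0])))   # increases with f
```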

Using SVM: Basic Use of SVM
- scale each feature of your data to a suitable range (say, [-1, 1])
- use a Gaussian RBF kernel $K(x, x') = \exp(-\gamma \|x - x'\|_2^2)$
- use cross-validation and grid search to determine a good $(\gamma, C)$ pair
- train on your training set with the best $(\gamma, C)$
- do testing with the resulting SVM classifier
all of these steps are included in LIBSVM (from the lab of Prof. Chih-Jen Lin)
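A minimal sketch (not from the talk) of the same recipe expressed with scikit-learn, whose SVC wraps LIBSVM; the talk itself refers to the LIBSVM tools, and this is just one convenient way to follow the same steps on synthetic data:

```python
# Scale to [-1, 1], grid-search (gamma, C) with cross-validation, then test.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = make_pipeline(MinMaxScaler(feature_range=(-1, 1)),   # scale features
                     SVC(kernel='rbf'))
grid = {'svc__gamma': [2.0 ** k for k in range(-8, 2)],     # grid over gamma
        'svc__C':     [2.0 ** k for k in range(-2, 8)]}     # and C
search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)     # cross-validation
print(search.best_params_, search.score(X_te, y_te))        # test with best pair
```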

Using SVM: Advanced Use of SVM
- include domain knowledge by specific kernel design (e.g., train a generative model for feature extraction, and use the extracted features in an SVM to get discriminative power)
- combine SVM with your favorite tools (e.g., HMM + SVM for speech recognition)
- fine-tune SVM parameters with specific knowledge of your problem (e.g., different costs for different examples?)
- interpret the SVM results you get (e.g., are the SVs meaningful?)
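A minimal sketch (not from the talk) of plugging a custom kernel into scikit-learn's SVC, one way to inject domain knowledge through kernel design; the kernel here is just the Laplacian RBF from earlier, standing in for a domain-specific similarity measure:

```python
# SVC accepts a callable kernel that returns the Gram matrix between two sets.
import numpy as np
from sklearn.svm import SVC

def my_kernel(A, B, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||_1)."""
    d = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-gamma * d)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1, 1, (40, 3)), rng.normal(-1, 1, (40, 3))])
y = np.array([+1] * 40 + [-1] * 40)
clf = SVC(kernel=my_kernel).fit(X, y)
print(clf.score(X, y))
```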

Using SVM: Resources
- LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
- LIBSVM Tools: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools
- Kernel Machines Forum: http://www.kernel-machines.org
- Hsu, Chang, and Lin: A Practical Guide to Support Vector Classification
- my email: htlin@caltech.edu
acknowledgment: some figures were obtained from Prof. Chih-Jen Lin