Support Vector Machines


Here we approach the two-class classification problem in a direct way: we try to find a plane that separates the classes in feature space. If we cannot, we get creative in two ways: we soften what we mean by "separates", and we enrich and enlarge the feature space so that separation is possible.

What is a Hyperplane?

A hyperplane in p dimensions is a flat affine subspace of dimension p − 1. In general the equation for a hyperplane has the form
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0.$$
In p = 2 dimensions a hyperplane is a line. If $\beta_0 = 0$, the hyperplane goes through the origin, otherwise not. The vector $\beta = (\beta_1, \beta_2, \dots, \beta_p)$ is called the normal vector; it points in a direction orthogonal to the surface of the hyperplane.

Hyperplane in 2 Dimensions

[Figure: the hyperplane $\beta_1 X_1 + \beta_2 X_2 - 6 = 0$ in the $(X_1, X_2)$ plane, with $\beta_1 = 0.8$ and $\beta_2 = 0.6$; the normal vector $\beta = (\beta_1, \beta_2)$ is shown, together with the level sets $\beta_1 X_1 + \beta_2 X_2 - 6 = 1.6$ and $\beta_1 X_1 + \beta_2 X_2 - 6 = -4$.]

Separating Hyperplanes

If $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, then $f(X) > 0$ for points on one side of the hyperplane, and $f(X) < 0$ for points on the other. If we code the colored points as $Y_i = +1$ for blue, say, and $Y_i = -1$ for mauve, then if $Y_i f(X_i) > 0$ for all $i$, $f(X) = 0$ defines a separating hyperplane.
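A minimal sketch of this condition in Python with NumPy (the helper separates() and the toy points are made up for illustration, not from the slides): code the labels as ±1 and check $y_i f(x_i) > 0$ for every point.

```python
# Sketch: check whether a candidate hyperplane f(x) = beta0 + beta . x
# separates +1/-1 labeled points, i.e. y_i * f(x_i) > 0 for every i.
import numpy as np

def separates(beta0, beta, X, y):
    """Return True if beta0 + X @ beta = 0 separates the labeled rows of X."""
    f = beta0 + X @ beta
    return np.all(y * f > 0)

# Toy points on either side of the line x1 + x2 = 1 (invented data).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, 0.0], [0.0, -0.5]])
y = np.array([+1, +1, -1, -1])
print(separates(beta0=-1.0, beta=np.array([1.0, 1.0]), X=X, y=y))  # True
```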

Maximal Margin Classifier

Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes. This is a constrained optimization problem:
$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\text{maximize}}\; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \quad y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M \;\text{ for all } i = 1, \ldots, N.$$
This can be rephrased as a convex quadratic program and solved efficiently. The function svm() in package e1071 solves this problem.
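The slide names svm() from the R package e1071; as an illustrative stand-in (an assumption, not the slides' code), here is a scikit-learn sketch in Python: on separable toy data, a linear SVC with a very large cost parameter approximates the maximal margin classifier.

```python
# Sketch only: scikit-learn's SVC (not the e1071::svm() named on the slide),
# fit with a huge cost so margin violations are effectively forbidden.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),     # class -1, centered at (0, 0)
               rng.normal(4, 1, (20, 2))])    # class +1, centered at (4, 4)
y = np.array([-1] * 20 + [+1] * 20)

hard_margin = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C ~ hard margin
print(hard_margin.coef_, hard_margin.intercept_)     # fitted beta and beta_0 (up to scaling)
print(hard_margin.support_)                          # indices of the support vectors
```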

Non-separable Data

[Figure: a two-class data set in the $(X_1, X_2)$ plane with overlapping classes.]

These data are not separable by a linear boundary. This is often the case, unless N < p.

Noisy Data

Sometimes the data are separable, but noisy. This can lead to a poor solution for the maximal margin classifier. The support vector classifier maximizes a soft margin.

Support Vector Classifier

[Figure: two soft-margin fits to the same data, with the observations numbered; points on or violating the margin are highlighted.]

$$\underset{\beta_0, \ldots, \beta_p,\, \epsilon_1, \ldots, \epsilon_n}{\text{maximize}}\; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \quad y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \quad \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C.$$

C is a regularization parameter

[Figure: the support vector classifier fit to the same data for four different values of C.]
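A sketch of the effect of this parameter, using scikit-learn on synthetic data (an assumption; the slides reference the R package e1071). One caution on conventions: the C above is a budget for the total slack $\sum_i \epsilon_i$, so larger C means a softer margin, whereas scikit-learn's C multiplies the violation penalty, so larger C means a harder margin.

```python
# Sketch: how the number of support vectors changes with scikit-learn's C.
# Note the convention flip relative to the slides: small scikit-learn C
# tolerates more margin violations (soft margin), large C tolerates few.
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # More tolerated violations => more points on or inside the margin
    # => more support vectors.
    print(f"C={C}: {len(clf.support_)} support vectors")
```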

Linear boundary can fail

[Figure: a two-class data set for which no linear boundary works.]

Sometimes a linear boundary simply won't work, no matter what value of C. The example shown here is such a case. What to do?

Feature Expansion

Enlarge the space of features by including transformations, e.g. $X_1^2,\, X_1^3,\, X_1 X_2,\, X_1 X_2^2, \ldots$ Hence go from a p-dimensional space to an M > p dimensional space. Fit a support-vector classifier in the enlarged space. This results in non-linear decision boundaries in the original space.

Example: Suppose we use $(X_1, X_2, X_1^2, X_2^2, X_1 X_2)$ instead of just $(X_1, X_2)$. Then the decision boundary would be of the form
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 = 0.$$
This leads to nonlinear decision boundaries in the original space (quadratic conic sections).
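A minimal sketch of this idea with scikit-learn on toy data (both assumptions, not from the slides): expand $(X_1, X_2)$ to the quadratic features and fit an ordinary linear support vector classifier in the enlarged space; the boundary it induces in the original two variables is then quadratic.

```python
# Sketch: explicit feature expansion + a linear SVC in the enlarged space.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC
from sklearn.datasets import make_circles

# Concentric rings: not linearly separable in (X1, X2).
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

quad_svc = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds X1^2, X2^2, X1*X2
    LinearSVC(C=1.0, max_iter=10000),
)
quad_svc.fit(X, y)
print(quad_svc.score(X, y))  # near 1: linear in the enlarged space, quadratic in the original
```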

Cubic Polynomials

Here we use a basis expansion of cubic polynomials: from 2 variables to 9. The support-vector classifier in the enlarged space solves the problem in the lower-dimensional space.
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \beta_6 X_1^3 + \beta_7 X_2^3 + \beta_8 X_1 X_2^2 + \beta_9 X_1^2 X_2 = 0$$

Nonlinearities and Kernels

Polynomials (especially high-dimensional ones) get wild rather fast. There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers, through the use of kernels. Before we discuss these, we must understand the role of inner products in support-vector classifiers.

Inner products and support vectors

The inner product between two observation vectors is
$$\langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j}.$$
The linear support vector classifier can be represented as
$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle \qquad (n \text{ parameters}).$$
To estimate the parameters $\alpha_1, \ldots, \alpha_n$ and $\beta_0$, all we need are the $\binom{n}{2}$ inner products $\langle x_i, x_{i'} \rangle$ between all pairs of training observations.

It turns out that most of the $\hat{\alpha}_i$ can be zero:
$$f(x) = \beta_0 + \sum_{i \in S} \hat{\alpha}_i \langle x, x_i \rangle,$$
where $S$ is the support set of indices $i$ such that $\hat{\alpha}_i > 0$. [See the Support Vector Classifier slide above.]
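A small check of this representation, sketched with scikit-learn on synthetic data (both assumptions): after fitting a linear SVC, the stored support vectors and dual coefficients reproduce the decision function $\beta_0 + \sum_{i \in S} \hat{\alpha}_i \langle x, x_i \rangle$ exactly.

```python
# Sketch: rebuild the decision function from support vectors and dual
# coefficients and compare with the classifier's own decision_function.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=80, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha = clf.dual_coef_.ravel()      # y_i * alpha_i, stored only for i in S
S = clf.support_vectors_            # the support vectors themselves
f_manual = X @ S.T @ alpha + clf.intercept_   # beta_0 + sum_{i in S} alpha_i <x, x_i>
f_sklearn = clf.decision_function(X)
print(np.allclose(f_manual, f_sklearn))       # True
```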

Kernels and Support Vector Machines

If we can compute inner products between observations, we can fit an SV classifier. Can be quite abstract! Some special kernel functions can do this for us. For example,
$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^{d}$$
computes the inner products needed for degree-d polynomials: $\binom{p+d}{d}$ basis functions! Try it for p = 2 and d = 2. The solution has the form
$$f(x) = \beta_0 + \sum_{i \in S} \hat{\alpha}_i K(x, x_i).$$
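For illustration (scikit-learn and toy data assumed), this polynomial kernel corresponds to kernel="poly" with gamma=1 and coef0=1, since that implementation uses $(\gamma \langle x, x' \rangle + c_0)^d$.

```python
# Sketch: an SVM with the slide's polynomial kernel (1 + <x, x'>)^d.
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
print(poly_svm.score(X, y))
```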

Radial Kernel

$$K(x_i, x_{i'}) = \exp\!\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right), \qquad f(x) = \beta_0 + \sum_{i \in S} \hat{\alpha}_i K(x, x_i).$$
Implicit feature space; very high dimensional. Controls variance by squashing down most dimensions severely.
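A short sketch (scikit-learn and toy data assumed): kernel="rbf" implements exactly this radial kernel, with gamma as above.

```python
# Sketch: radial-kernel SVMs for a few values of gamma; larger gamma gives
# a more local, more flexible fit.
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for gamma in [0.1, 1.0, 10.0]:
    rbf_svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma}: training accuracy {rbf_svm.score(X, y):.2f}")
```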

Example: Heart Data

[Figure: ROC curves on the Heart training data. Left panel: the support vector classifier versus LDA. Right panel: the support vector classifier versus SVMs with a radial kernel, $\gamma = 10^{-3}, 10^{-2}, 10^{-1}$.]

The ROC curve is obtained by changing the threshold 0 to threshold t in $\hat{f}(X) > t$, and recording the false positive and true positive rates as t varies. Here we see ROC curves on training data.

Example continued: Heart Test Data

[Figure: the same ROC comparisons, support vector classifier versus LDA and versus radial-kernel SVMs with $\gamma = 10^{-3}, 10^{-2}, 10^{-1}$, now on the Heart test data.]
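A sketch of how such ROC curves are computed, using scikit-learn on synthetic data (the Heart data themselves are not reproduced here): threshold the decision function $\hat f(x) > t$ and trace the false and true positive rates as t varies.

```python
# Sketch: ROC construction on held-out data for radial-kernel SVMs.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for gamma in [1e-3, 1e-2, 1e-1]:
    svm = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    scores = svm.decision_function(X_te)              # f_hat(x) for each test point
    fpr, tpr, thresholds = roc_curve(y_te, scores)    # sweep the threshold t
    print(f"gamma={gamma}: test AUC = {roc_auc_score(y_te, scores):.2f}")
```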

SVMs: more than 2 classes?

The SVM as defined works for K = 2 classes. What do we do if we have K > 2 classes?

OVA (one versus all): Fit K different 2-class SVM classifiers $\hat{f}_k(x)$, $k = 1, \ldots, K$, each class versus the rest. Classify $x^*$ to the class for which $\hat{f}_k(x^*)$ is largest.

OVO (one versus one): Fit all $\binom{K}{2}$ pairwise classifiers $\hat{f}_{k\ell}(x)$. Classify $x^*$ to the class that wins the most pairwise competitions.

Which to choose? If K is not too large, use OVO.
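A sketch of the two strategies with scikit-learn (assumed; note that its SVC already uses one-versus-one internally for multiclass problems), using the built-in iris data as a stand-in for a K = 3 problem.

```python
# Sketch: explicit OVA and OVO wrappers around a linear SVC.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # K = 3 classes

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # K classifiers, each class vs. rest
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # K*(K-1)/2 pairwise classifiers
print(len(ova.estimators_), len(ovo.estimators_))          # 3 and 3, since C(3,2) = 3
print(ova.score(X, y), ovo.score(X, y))
```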

Support Vector versus Logistic Regression?

With $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, we can rephrase the support-vector classifier optimization as
$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\text{minimize}}\; \sum_{i=1}^{n} \max\left[0,\, 1 - y_i f(x_i)\right] + \lambda \sum_{j=1}^{p} \beta_j^2.$$
This has the form "loss plus penalty". The loss is known as the hinge loss. It is very similar to the loss in logistic regression (the negative log-likelihood).

[Figure: the SVM hinge loss and the logistic regression loss plotted against the margin $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$.]
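A tiny numerical illustration of the two losses as functions of the margin $y_i f(x_i)$ (pure NumPy; the grid of margin values is made up):

```python
# Sketch: hinge loss vs. logistic loss over a grid of margins y*f(x).
import numpy as np

margins = np.linspace(-4, 3, 8)
hinge = np.maximum(0.0, 1.0 - margins)       # SVM hinge loss: max(0, 1 - y*f(x))
logistic = np.log(1.0 + np.exp(-margins))    # logistic (negative log-likelihood) loss
for m, h, l in zip(margins, hinge, logistic):
    print(f"y*f(x) = {m:5.2f}   hinge = {h:5.2f}   logistic = {l:5.2f}")
```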

Which to use: SVM or Logistic Regression?

When the classes are (nearly) separable, SVM does better than LR. So does LDA. When they are not, LR (with a ridge penalty) and SVM are very similar. If you wish to estimate probabilities, LR is the choice. For nonlinear boundaries, kernel SVMs are popular. You can use kernels with LR and LDA as well, but the computations are more expensive.
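To make the probability point concrete, a sketch with scikit-learn on synthetic data (both assumptions): ridge-penalized logistic regression returns class probabilities directly, while an SVC only returns a decision function unless it is refit with an extra calibration step (probability=True).

```python
# Sketch: probabilities from logistic regression vs. raw SVC decision values.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

lr = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print(lr.predict_proba(X[:3]))       # class probabilities out of the box

svc = SVC(kernel="linear").fit(X, y)
print(svc.decision_function(X[:3]))  # signed scores, not probabilities
```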