A Tutorial on Support Vector Machine

A Tutorial on Support Vector Machine. School of Computing, National University of Singapore.

Contents: Transforming Attributes; Theory on SVM; Using SVM; Comparison with Other Classifiers; Combination of Classifiers.

What is a classifier? A function that maps instances to classes. An instance is usually expressed as a vector of n attributes.
Example: Fit-And-Trim club. Attributes: Gender, Weight, Height. Class (Member): Yes, No.
Is Garvin a member of the Fit-And-Trim club?

Person  Gender  Weight  Height  Member
Garvin  Male    78      179     ?

Training a classifier: Given: training data (a set of instances whose classes are known). Output: a function, selected from a predefined set of functions.
Example: Fit-And-Trim club. Training instances:

Person   Gender  Weight  Height  Member
Alex     Male    79      189     Yes
Betty    Female  76      170     Yes
Charlie  Male    77      155     Yes
Daisy    Female  72      163     No
Eric     Male    73      195     No
Fiona    Female  70      182     No

Possible classification rule: Weight more than 74.5?

Training a classifier: Aim to minimize both training error (errors on seen instances) and generalization error (errors on unseen instances). A classifier that has low training error but high generalization error is said to overfit the training data.
Example: Fit-And-Trim club. Suppose we use the person's name as an attribute, and the trained classifier uses the following classification rules: Alex → Yes, Betty → Yes, Charlie → Yes, Daisy → No, Eric → No, Fiona → No. This classifier severely overfits: it achieves 100% accuracy on the training data, but is unable to classify any unseen test instance.

Attributes are real numbers, so each instance x = (x_1, x_2, ..., x_n)^T ∈ R^n is an n-dimensional vector. Classes are +1 (positive class) and -1 (negative class).
Classification rule: y = sign[f(x)] = +1 if f(x) >= 0, and -1 if f(x) < 0, where the decision function f(·) is
f(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b = w^T x + b
for some weight vector w = (w_1, w_2, ..., w_n)^T ∈ R^n and bias b ∈ R. Training a linear classifier means tuning w and b.
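
As a concrete illustration (not part of the original slides), the decision rule above takes only a few lines of NumPy; the weights, bias, and test instance below are made up for the example.

```python
import numpy as np

def linear_classify(x, w, b):
    """Classify instance x with the linear decision function f(x) = w^T x + b."""
    f = np.dot(w, x) + b          # decision value
    return +1 if f >= 0 else -1   # sign[f(x)]

# Hypothetical weights and bias for a 3-attribute problem.
w = np.array([0.4, -1.2, 0.7])
b = 0.5
x = np.array([1.0, 0.3, 2.0])
print(linear_classify(x, w, b))   # prints +1 or -1
```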

Example: A course is graded based on two written tests. Weightage: Test 1 is 30%, Test 2 is 70%. Students pass if the total weighted score is at least 50%.
Attributes: x_1 = Test 1 score, x_2 = Test 2 score. To pass, we need: 0.3 x_1 + 0.7 x_2 >= 50.
Decision function of the linear classifier: f(x) = 0.3 x_1 + 0.7 x_2 - 50.
Positive class = pass, negative class = fail.

f(x) = w^T x + b = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b.
f(x) = 0 is a hyperplane in the n-dimensional real space R^n.
Training: find a hyperplane that separates positive and negative instances.
Exercise 1: Show that w is orthogonal to the hyperplane.

Given a linear classifier with f(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b, the following are equivalent classifiers:
f(x) = (a w_1) x_1 + (a w_2) x_2 + ... + (a w_n) x_n + (a b) for any constant a > 0.
f(x) = w_1 x_1 + ... + w'_i (a x_i) + ... + w_n x_n + b for any constant a > 0, where w'_i = w_i / a.
f(x) = w_1 x_1 + ... + w_i (x_i + a) + ... + w_n x_n + b' for any constant a, where b' = b - a w_i.
In other words, it is possible to scale the whole problem, and to scale and shift individual attributes.

Using the training data, we can normalize each attribute x_i so that either 0 <= x_i <= 1, or x_i has mean 0 and standard deviation 1. Normalization makes training easier by avoiding numerical difficulties.
For a linear classifier with normalized attributes, f(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b, a larger |w_i| (or w_i^2) means x_i is more important or relevant. This can be used for attribute ranking or selection. If x_i is missing, it can be used to decide whether to acquire it.
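
A minimal sketch of the two normalizations mentioned above, using NumPy; the data matrix is invented for illustration (loosely following the Weight/Height attributes of the earlier example).

```python
import numpy as np

# Rows are instances, columns are attributes (here: Weight, Height).
X = np.array([[79., 189.], [76., 170.], [77., 155.],
              [72., 163.], [73., 195.], [70., 182.]])

# Min-max scaling: each attribute mapped into [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score scaling: each attribute has mean 0 and standard deviation 1.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))   # [0 0] and [1 1]
print(X_zscore.mean(axis=0), X_zscore.std(axis=0))  # approximately [0 0] and [1 1]
```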

Transforming non-numeric attributes: If attribute values are ordered, use the ordering. Consider attribute values {Small, Medium, Large}: map Small to 1, Medium to 2, Large to 3.
If attribute values have no ordering information, create one numeric attribute with values {0, 1} for each discrete attribute value. Consider attribute values {Red, Green, Blue}: create three attributes x_Red, x_Green, x_Blue. Map Red to x_Red = 1, x_Green = 0, x_Blue = 0; Green to x_Red = 0, x_Green = 1, x_Blue = 0; Blue to x_Red = 0, x_Green = 0, x_Blue = 1.
If attribute values are finite sets, use the above method.
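
The mapping for unordered values is one-hot encoding; here is a short sketch of both transformations, following the Small/Medium/Large and Red/Green/Blue examples above.

```python
def one_hot(value, categories):
    """Map a discrete value to a 0/1 indicator vector, one position per category."""
    return [1 if value == c else 0 for c in categories]

colours = ["Red", "Green", "Blue"]
print(one_hot("Green", colours))   # [0, 1, 0]

# Ordered values can simply be mapped to integers respecting the order.
sizes = {"Small": 1, "Medium": 2, "Large": 3}
print(sizes["Medium"])             # 2
```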

Theory on SVM

[Figure: a narrow margin versus a wide margin, each of width ρ.] A wide margin gives better generalization performance.

A support vector machine (SVM) is a maximum margin linear classifier. Training data: D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}. Assume D is linearly separable, i.e., the positive and negative instances can be separated exactly by a hyperplane.
The SVM requires:
For a training instance x_i with class y_i = +1: f(x_i) = w^T x_i + b >= +1.
For a training instance x_i with class y_i = -1: f(x_i) = w^T x_i + b <= -1.
Combined, we have y_i f(x_i) = y_i (w^T x_i + b) >= 1.
Support vector: a training instance x_i with f(x_i) = w^T x_i + b = ±1.

[Figure: the separating hyperplane f(x) = 0 with the margin hyperplanes f(x) = +1 and f(x) = -1; the support vectors lie on the margin hyperplanes; the margin width is ρ = 2/||w||, with normal vector w.]
Test instance x: the larger |f(x)|, the greater the classification confidence.
Maximizing ρ is equivalent to minimizing ||w||_2, i.e., minimizing w^T w. (||w||_2 = sqrt(w_1^2 + w_2^2 + ... + w_n^2) is the length, or 2-norm, of the vector w.)
Exercise 2: Show that ρ = 2/||w||_2.

Primal problem:
Minimize (1/2) w^T w
Subject to y_i (w^T x_i + b) >= 1 for all i

The primal problem is a convex optimization problem, i.e., any local minimum is also a global minimum. However, it is more convenient to solve the dual problem instead.

Let f(u_1, u_2, ..., u_k) be a function of the variables u_1, u_2, ..., u_k.
Partial derivatives: the partial derivative ∂f/∂u_i means differentiating with respect to u_i while holding all other u_j's constant. Example: f(x, y) = sin x + x y^2, ∂f/∂x = cos x + y^2, ∂f/∂y = 2xy. A stationary point occurs when ∂f/∂u_i = 0 for all i.
Partial derivatives with respect to vectors: let u = [u_1, u_2, ..., u_n]^T. Then ∇f = ∂f/∂u = [∂f/∂u_1, ∂f/∂u_2, ..., ∂f/∂u_n]^T. A stationary point occurs when ∇f = ∂f/∂u = 0.
Exercise 3: Verify that ∂(u^T a)/∂u = a and ∂(u^T u)/∂u = 2u.
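
The two identities in Exercise 3 can be spot-checked numerically with central finite differences; this NumPy sketch uses random vectors and is only an illustration, not a proof.

```python
import numpy as np

def num_grad(f, u, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at u."""
    g = np.zeros_like(u)
    for i in range(len(u)):
        d = np.zeros_like(u)
        d[i] = eps
        g[i] = (f(u + d) - f(u - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
a = rng.normal(size=4)
u = rng.normal(size=4)

print(np.allclose(num_grad(lambda v: a @ v, u), a))      # d(u^T a)/du = a
print(np.allclose(num_grad(lambda v: v @ v, u), 2 * u))  # d(u^T u)/du = 2u
```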

Primal problem:
Minimize f(u)
Subject to g_1(u) <= 0, g_2(u) <= 0, ..., g_m(u) <= 0
           h_1(u) = 0, h_2(u) = 0, ..., h_n(u) = 0

Lagrangian function:
L(u, α, β) = f(u) + Σ_{i=1}^m α_i g_i(u) + Σ_{i=1}^n β_i h_i(u)
The α_i's and β_i's are Lagrange multipliers.

An optimal solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions:
∂L/∂u = 0,  g_i(u) <= 0,  h_i(u) = 0,  α_i g_i(u) = 0,  α_i >= 0

Dual problem:
Maximize f(α, β) = inf_u L(u, α, β)
Subject to the KKT conditions
(inf means infimum, or greatest lower bound)

Relation between the primal and dual objective functions: f(u) >= f(α, β). Under certain conditions (satisfied by the SVM), equality occurs at the optimal solution.

Primal problem:
Minimize (1/2) w^T w
Subject to 1 - y_i (w^T x_i + b) <= 0

Lagrangian:
L(w, b, α) = (1/2) w^T w + Σ_{i=1}^N α_i (1 - y_i (w^T x_i + b))
           = (1/2) w^T w + Σ_{i=1}^N α_i - Σ_{i=1}^N α_i y_i w^T x_i - b Σ_{i=1}^N α_i y_i

KKT conditions:
∂L/∂w = w - Σ_{i=1}^N α_i y_i x_i = 0
∂L/∂b = -Σ_{i=1}^N α_i y_i = 0
1 - y_i (w^T x_i + b) <= 0
α_i (1 - y_i (w^T x_i + b)) = 0
α_i >= 0

Since w = Σ_{i=1}^N α_i y_i x_i, we have:
w^T w = Σ_{i=1}^N α_i y_i w^T x_i = Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j
Note also Σ_{i=1}^N α_i y_i = 0. Substituting into the Lagrangian:
L(w, b, α) = (1/2) w^T w + Σ_{i=1}^N α_i - Σ_{i=1}^N α_i y_i w^T x_i - b Σ_{i=1}^N α_i y_i
           = Σ_{i=1}^N α_i - (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j

Dual problem:
Maximize Σ_{i=1}^N α_i - (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j
Subject to α_i >= 0 and Σ_{i=1}^N α_i y_i = 0

The dual problem is also a convex optimization problem. Many software packages are available and tailored to solve this form of optimization problem efficiently.
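
In practice the dual is rarely solved by hand. Here is a minimal sketch using scikit-learn (an assumption about tooling; LIBSVM, referenced later, is another option) on a tiny separable toy dataset invented for illustration; a very large C approximates the hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy data (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

print(clf.support_)               # indices of the support vectors
print(clf.dual_coef_)             # alpha_i * y_i for the support vectors
print(clf.coef_, clf.intercept_)  # w and b (available for the linear kernel)
```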

Solving for w and b: w = Σ_{i=1}^N α_i y_i x_i. Choose any support vector x_i, then b = y_i - w^T x_i.
Decision function: f(x) = Σ_{i=1}^N α_i y_i x_i^T x + b.
Support vectors: by the KKT condition α_i (1 - y_i (w^T x_i + b)) = 0, if α_i > 0 then y_i (w^T x_i + b) = 1, i.e., x_i is a support vector. Usually there are relatively few support vectors; many α_i's vanish.
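
Continuing the scikit-learn sketch above (again an assumption about tooling), w and b can be recovered directly from the dual solution, since dual_coef_ stores the products α_i y_i for the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([+1, +1, +1, -1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)   # same toy problem as above

alpha_y = clf.dual_coef_.ravel()    # alpha_i * y_i, support vectors only
sv      = clf.support_vectors_

w = alpha_y @ sv                    # w = sum_i alpha_i y_i x_i
b = y[clf.support_[0]] - w @ sv[0]  # b = y_i - w^T x_i for any support vector x_i

print(w, clf.coef_.ravel())         # the two should match
print(b, clf.intercept_[0])         # should agree up to numerical tolerance
```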

[Figure: two cases where the hard margin fails: a very narrow margin, and data that are not linearly separable.]
Solution: allow training instances inside the margin, or on the other side of the separating hyperplane (misclassified).

Soft margin SVM. [Figure: slack variables ε_i for instances inside the margin or misclassified.]
Allow a training instance (x_i, y_i) to have y_i (w^T x_i + b) < 1, i.e., y_i (w^T x_i + b) = 1 - ε_i for some ε_i > 0, but penalize it with a constant factor C > 0.

Primal problem:
Minimize (1/2) w^T w + C Σ_{i=1}^N ε_i
Subject to y_i (w^T x_i + b) >= 1 - ε_i and ε_i >= 0

Dual problem:
Maximize Σ_{i=1}^N α_i - (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j
Subject to 0 <= α_i <= C and Σ_{i=1}^N α_i y_i = 0

Exercise 4: Derive the dual problem for the soft margin SVM.
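
A quick way to see the effect of C is to train on overlapping (not perfectly separable) data with several values and count the support vectors; this is a scikit-learn sketch on generated data, purely illustrative.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two overlapping clusters, so the data are not perfectly separable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6} support vectors: {len(clf.support_)}")
# Smaller C tolerates more margin violations, so more alpha_i reach the bound C
# and more training instances become (bounded) support vectors.
```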

Recall: 0 <= α_i <= C, and x_i is a support vector if α_i > 0. There are two types of support vectors: x_i is a free support vector if α_i < C, and a bounded support vector if α_i = C.
Exercise 5: What are the characteristics of free support vectors and bounded support vectors? (Hint: consider the KKT conditions.)

It is possible to weight each training instance x_i with λ_i > 0 to indicate its importance.
Primal problem:
Minimize (1/2) w^T w + C Σ_{i=1}^N λ_i ε_i
Subject to y_i (w^T x_i + b) >= 1 - ε_i and ε_i >= 0

Cost-sensitive SVM: different costs for misclassifying positive and negative instances. Set λ_i proportional to the misclassification cost of x_i.
Exercise 6: Derive the dual problem for this SVM.

Nonlinear feature map Φ : R^n → R^m from attribute space to feature space, chosen so that the training instances become linearly separable in feature space. Example: Φ(x) = (x, x^2)^T.
Cover's theorem: instances are more likely to be linearly separable in a high-dimensional space.

Dual problem:
Maximize Σ_{i=1}^N α_i - (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j Φ(x_i)^T Φ(x_j)
Subject to 0 <= α_i <= C and Σ_{i=1}^N α_i y_i = 0

Decision function: f(x) = Σ_{i=1}^N α_i y_i Φ(x_i)^T Φ(x) + b

Observations: it can be computationally expensive or difficult to compute Φ(x), but Φ(x)^T Φ(x') often has a simple expression.
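
The point that Φ(x)^T Φ(x') often has a simple closed form can be checked directly for a degree-2 polynomial kernel on 2-D inputs; the explicit feature map below is the standard one matching (x^T z + 1)^2 and is included only as an illustration.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D x, matching K(x, z) = (x^T z + 1)^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

def poly_kernel(x, z):
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))     # inner product computed in the 6-D feature space
print(poly_kernel(x, z))   # same value, computed in the 2-D attribute space
```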

Dual problem:
Maximize Σ_{i=1}^N α_i - (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j K(x_i, x_j)
Subject to 0 <= α_i <= C and Σ_{i=1}^N α_i y_i = 0

Decision function: f(x) = Σ_{i=1}^N α_i y_i K(x_i, x) + b

The kernel is the function K(x, x') = Φ(x)^T Φ(x').
Exercise 7: Express w^T w and b in terms of the kernel. What is the margin of the kernelized SVM?

K(·, ·) can be any expression that satisfies Mercer's condition, which ensures K(x, x') = Φ(x)^T Φ(x') for some feature map Φ(·).
Mercer's condition: the matrix K formed by K_ij = K(x_i, x_j) is positive semidefinite, i.e., the eigenvalues of K are all nonnegative, or v^T K v >= 0 for all vectors v.
There is no need to compute, or even know, the explicit form of Φ(·).
Examples:
Linear kernel: K(x, x') = x^T x'.
Polynomial kernel: K(x, x') = (x^T x' + k)^d.
Radial basis function (RBF) kernel: K(x, x') = exp(-||x - x'||^2 / (2σ^2)).
Suggestion: try the linear kernel first, then the RBF kernel.
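
Mercer's condition can be spot-checked numerically: build the kernel matrix on a sample of points and confirm that its eigenvalues are nonnegative (up to round-off). A small sketch for the RBF kernel, with randomly generated points:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma2=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma2))

X = np.random.default_rng(1).normal(size=(20, 3))
K = rbf_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())   # >= 0 up to numerical round-off, as Mercer's condition requires
```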

Using SVM

Dual problem:
Maximize Σ_{i=1}^N α_i - (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j K(x_i, x_j)
Subject to 0 <= α_i <= C and Σ_{i=1}^N α_i y_i = 0

RBF kernel: K(x, x') = exp(-||x - x'||^2 / (2σ^2))

Hyperparameters include C of the SVM formulation and the kernel parameters, e.g., σ^2 of the RBF kernel. It is important to select good hyperparameters; they have a large impact on SVM performance.

The traditional way of hyperparameter tuning: use a training set and a validation set, and do a grid search over the parameters. For example, when using the RBF kernel, for each C ∈ {2^-10, 2^-8, ..., 2^0, ..., 2^8, 2^10} and σ^2 ∈ {2^-10, 2^-8, ..., 2^0, ..., 2^8, 2^10}: train on the training set and test on the validation set. Choose the C and σ^2 that give the best performance on the validation set.
Extensions of this idea: k-fold cross-validation and leave-one-out cross-validation.
Some SVM implementations have integrated (and more optimized) parameter tuning; others require the user to perform their own tuning before training.
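
With scikit-learn (an assumption about tooling) the grid search with cross-validation can be written compactly; the grid below mirrors the powers of two suggested above, coarsened to keep the example short, and note that scikit-learn parameterizes the RBF kernel by gamma = 1/(2σ^2).

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "C":     [2.0**k for k in range(-5, 11, 5)],
    "gamma": [2.0**k for k in range(-10, 1, 5)],   # gamma = 1 / (2 * sigma^2)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```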

Estimating probabilities. The probability of a test instance x belonging to a class can be estimated by:
P(y = 1 | x) = 1 / (1 + exp(A f(x) + B))
where A and B are parameters determined from the training data.
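
A sketch of the sigmoid above; the values of A and B here are hypothetical (in practice they are fitted on held-out training data, and libraries such as LIBSVM or scikit-learn's probability=True option do this automatically).

```python
import numpy as np

def platt_probability(f, A=-1.5, B=0.0):
    """P(y = +1 | x) = 1 / (1 + exp(A * f(x) + B)); A and B are hypothetical here."""
    return 1.0 / (1.0 + np.exp(A * f + B))

for f in (-2.0, 0.0, 2.0):
    print(f, platt_probability(f))
# For A < 0, larger decision values f(x) map to probabilities closer to 1.
```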

Handling multiple classes. For k > 2 classes, decompose into multiple two-class SVMs.
Pairwise SVMs: train C(k,2) = k(k-1)/2 SVMs, one for each pair of classes. For a test instance, each SVM casts a vote for a class; classify the test instance as the class that gets the most votes.
One-against-all SVMs: train k SVMs, where the i-th SVM treats class i as positive and all other classes as negative. Classify the test instance as class i if the i-th SVM gives the greatest f(·) value.
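
Both decompositions are available off the shelf in scikit-learn (an assumption about tooling, not part of the slides): SVC uses pairwise (one-vs-one) voting internally for multiclass problems, and OneVsRestClassifier implements one-against-all.

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)   # k = 3 classes

ovo = SVC(kernel="linear").fit(X, y)                        # k(k-1)/2 pairwise SVMs, majority vote
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # k one-against-all SVMs, largest f(x) wins

print(ovo.predict(X[:5]))
print(ovr.predict(X[:5]))
```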

Applications. Text processing: each text document is an instance; each word is an attribute; attribute values are word counts. Image processing: each image is an instance; each pixel is an attribute; attribute values are pixel values.
Cover's theorem: instances are more likely to be linearly separable in a high-dimensional space. These applications already have many attributes, and each instance is a sparse vector, so a linear SVM often performs well.

SVM decision function: f(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b. The 2-norm of the weight vector is ||w||_2 = sqrt(w_1^2 + w_2^2 + w_3^2 + w_4^2).
Test instance with missing attribute values: x = (x_1, ?, x_3, ?). Substitute constants attribute-wise: x = (x_1, x'_2, x_3, x'_4). The decision function becomes:
f(x) = w_1 x_1 + w_2 x'_2 + w_3 x_3 + w_4 x'_4 + b = w_1 x_1 + w_3 x_3 + b'
The new weight vector is w^(x) = (w_1, w_3)^T, and its 2-norm decreases to ||w^(x)||_2 = sqrt(w_1^2 + w_3^2).
We can use f(x), ||w||_2 and ||w^(x)||_2 to estimate the probability of misclassification, and hence decide whether to acquire the missing attribute values.
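
A small numerical sketch of the bookkeeping above, with made-up weights and substitution constants; it computes f(x) after filling in the missing attributes and the reduced norm ||w^(x)||_2 over the observed attributes.

```python
import numpy as np

w = np.array([0.8, -0.3, 1.1, 0.2])        # hypothetical weights
b = -0.5
x = np.array([1.2, np.nan, 0.7, np.nan])   # attributes 2 and 4 are missing
fill = np.array([0.0, 0.5, 0.0, 0.4])      # substitution constants (e.g., attribute means)

observed = ~np.isnan(x)
x_filled = np.where(observed, x, fill)

f = w @ x_filled + b                        # decision value with substituted constants
norm_full    = np.linalg.norm(w)            # ||w||_2
norm_reduced = np.linalg.norm(w[observed])  # ||w^(x)||_2, observed attributes only

print(f, norm_full, norm_reduced)
```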

Comparison and Combination with Other Classifiers

Advantages of SVM: SVM can handle linear relationships among attributes, whereas a typical decision tree splits on only one attribute at a time. Typically better on datasets with many attributes. Handles continuous values well.
Advantages of decision trees: a decision tree is better at handling nested if-then-else types of rules, which SVM is not good at. Typically better on datasets with fewer attributes. Handles discrete values well.

Advantages of SVM: faster classification compared to nearest neighbours. Smooth separating hyperplane. Less susceptible to noise.
Advantages of nearest neighbours: training is instantaneous, as there is nothing to do. Local variations are taken into account.

Naïve Bayes often has the dubious honour of being placed among the last in evaluations. It is fast and easy to implement, hence often treated as a baseline. One key handicap of naïve Bayes is its independence assumption.

A linear SVM can be seen as a neural network with no hidden layers, but the training algorithm is different. SVM design is driven by sound theory; neural network design is driven by applications. The backpropagation algorithm for training a multi-layer neural network does not find the maximal margin separating hyperplane. Neural networks tend to overfit more than SVMs. Neural networks have many local minima; the SVM training problem has a single global minimum. Multi-layer neural networks require specifying the number of hidden layers and the number of nodes in each layer, but the SVM does not. Finally, what do the learned weights in a trained multi-layer neural network mean?

Combination of classifiers. It is possible to combine classifiers:
SVM with decision tree: instead of splitting on only one attribute, split using a linear classifier trained by SVM.
SVM with nearest neighbours: use nearest neighbours to classify test instances that fall inside the margin.
SVM with naïve Bayes: train a naïve Bayes classifier, which produces conditional probabilities P(x_i | c). Modify every (training and testing) instance x by multiplying each attribute x_i by P(x_i | c), then train and test the SVM on the modified instances.

Summary

SVM is a maximum margin linear classifier, often good for classifying test instances in a high-dimensional space. Lagrangian theory gives the primal and dual problems. When training instances are not linearly separable: use a soft margin, or use a kernel.
Usage tips: transform non-numeric attributes, normalize attributes, tune hyperparameters when training, and consider combining with other classifiers.

Introduction to SVM, with applications: [Burges, 1998], [Osuna et al., 1997].
Practical guide on using SVM effectively: [Hsu et al., 2003].
LIBSVM manual, including weighted instances: [Chang and Lin, 2001].
Tutorials on Lagrangian theory: [Burges, 2003], [Klein, 2004].
SVM with posterior probabilities: [Platt, 2000].
Attribute selection using SVM: [Guyon et al., 2002].
Handling multiple classes: [Hsu and Lin, 2002], [Tibshirani and Hastie, 2007].
SVM combined with decision trees: [Bennett and Blue, 1998], [Tibshirani and Hastie, 2007]; with nearest neighbours: [Li et al., 2007]; with naïve Bayes: [Chiu and Huang, 2007].

References I
Bennett, K. P. and Blue, J. A. (1998). A support vector machine approach to decision trees. In IEEE World Congress on Computational Intelligence, pages 2396-2401.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167.
Burges, C. J. C. (2003). Some notes on applied mathematics for machine learning. In Advanced Lectures on Machine Learning, pages 21-40.
Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
Chiu, C.-Y. and Huang, Y.-T. (2007). Integration of support vector machine with naïve Bayesian classifier for spam classification. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 618-622.
Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422.
Hsu, C.-W., Chang, C.-C., and Lin, C.-J. (2003). A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.
Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415-425.

References II
Klein, D. (2004). Lagrange multipliers without permanent scarring. Available at http://www.cs.berkeley.edu/~klein/papers/lagrange-multipliers.pdf.
Li, R., Wang, H.-N., He, H., Cui, Y.-M., and Du, Z.-L. (2007). Support vector machine combined with k-nearest neighbors for solar flare forecasting. Chinese Journal of Astronomy and Astrophysics, 7(3):441-447.
Osuna, E. E., Freund, R., and Girosi, F. (1997). Support vector machines: Training and applications. Technical Report AIM-1602, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Platt, J. (2000). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61-74.
Tibshirani, R. and Hastie, T. (2007). Margin trees for high-dimensional classification. Journal of Machine Learning Research (JMLR), 8:637-652.