Support Vector Machines.


Support Vector Machines www.cs.wisc.edu/~dpage 1

Goals for the lecture: you should understand the following concepts: the margin, slack variables, the linear support vector machine, nonlinear SVMs, the kernel trick, the primal and dual formulations of SVM learning, support vectors, the kernel matrix, valid kernels, the polynomial kernel, the Gaussian kernel, string kernels, and support vector regression. 2

Algorithm You Should Know: SVM as constrained optimization (fit the training data and maximize the margin), primal and dual formulations, the kernel trick, and slack variables to allow an imperfect fit. For now, we assume the optimizer is a black box; the next lecture will look inside that black box: sequential minimal optimization (SMO). 3

Burr Settles, UW CS PhD 4

Four key SVM ideas: (1) Maximize the margin: don't choose just any separating hyperplane. (2) Penalize misclassified examples: use soft constraints and slack variables. (3) Use optimization methods to find the model: linear programming, quadratic programming. (4) Use kernels to represent nonlinear functions and handle complex instances (sequences, trees, graphs, etc.). 5

Some key vector concepts: the dot product between two vectors w and x is defined as $w \cdot x = w^T x = \sum_i w_i x_i$; for example, $(1, 3, -5) \cdot (4, -2, -1) = (1)(4) + (3)(-2) + (-5)(-1) = 3$. The 2-norm (Euclidean length) of a vector x is defined as $\|x\|_2 = \sqrt{\sum_i x_i^2}$. 6

Linear separator learning revisited: suppose we encode our classes as {-1, +1} and consider a linear classifier $h(x) = 1$ if $\sum_{i=1}^{n} w_i x_i + b > 0$ and $-1$ otherwise. An instance $\langle x, y \rangle$ will be classified correctly if $y(w^T x + b) > 0$. 7
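
To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) of the decision rule and the correctness condition $y(w^T x + b) > 0$, with labels in {-1, +1}; the particular w, b, and instance below are made-up values.

```python
import numpy as np

def h(x, w, b):
    # linear classifier: predict +1 if w . x + b > 0, else -1
    return 1 if np.dot(w, x) + b > 0 else -1

w, b = np.array([1.0, -1.0]), 0.5
x, y = np.array([3.0, 1.0]), 1

print(h(x, w, b))                      # predicted label: 1
print(y * (np.dot(w, x) + b) > 0)      # True: <x, y> is classified correctly
```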

Large margin classification Given a training set that is linearly separable, there are infinitely many hyperplanes that could separate the positive/negative instances. Which one should we choose? In SVM learning, we find the hyperplane that maximizes the margin. Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010 8

Large margin classification: suppose we learn a hyperplane h given a training set D. Let $x^+$ denote the closest instance to the hyperplane among positive instances, and similarly $x^-$ among negative instances. The margin is given by $\text{margin}_D(h) = \frac{1}{2}\hat{w}^T (x^+ - x^-)$, where $\hat{w} = w / \|w\|_2$ is the length-1 vector in the same direction as w. 9

Example (plot in the X1-X2 plane, origin marked O): X3 irrelevant, X2 twice as important as X1.


Separator is perpendicular to weight vector (plot): X3 irrelevant, X2 twice as important as X1; separator $2X_2 + X_1 - 2 = 0$.

Changing b moves (shifts) the separator (plot): X3 irrelevant, X2 twice as important as X1; separator $2X_2 + X_1 - 6 = 0$.

Size of w, b sensitive to data scaling (plot with instances labeled + and - in the X1-X2 plane): assume labeled data as above.

Margin is width between our earlier lines (plot): X3 irrelevant, X2 twice as important as X1; separator $2X_2 + X_1 - 4 = 0$, with margin lines $2X_2 + X_1 - 2 = 0$ and $2X_2 + X_1 - 6 = 0$.

Specifically, the margin here is $\|e_+ - e_-\|_2$ (plot): let $e_+$ and $e_-$ be the points where w intersects the hyperplanes $2X_2 + X_1 - 2 = 0$ and $2X_2 + X_1 - 6 = 0$.

But can double the margin artificially (plot): keep w fixed and double b, giving $2X_2 + X_1 - 8 = 0$ with lines $2X_2 + X_1 - 4 = 0$ and $2X_2 + X_1 - 12 = 0$; or double w and keep b fixed, giving $4X_2 + 2X_1 - 4 = 0$ with lines $4X_2 + 2X_1 - 2 = 0$ and $4X_2 + 2X_1 - 6 = 0$.

Should normalize the margin with the norm of w (plot): but the normalized margin remains unchanged, $\|e_+ - e_-\|_2 / \|w\|_2$. Can keep $\|w\|_2$ fixed and try to maximize $\|e_+ - e_-\|_2$, or fix $\|e_+ - e_-\|_2$ (e.g., to 2) and minimize $\|w\|_2$.

The hard-margin SVM: given a training set $D = \{\langle x^{(1)}, y^{(1)}\rangle, \ldots, \langle x^{(m)}, y^{(m)}\rangle\}$, we can frame the goal of maximizing the margin as a constrained optimization problem: minimize over w, b the objective $\frac{1}{2}\|w\|_2^2$ (adjust these parameters to minimize this), subject to the constraints $y^{(i)}(w^T x^{(i)} + b) \ge 1$ for $i = 1, \ldots, m$ (correctly classify $x^{(i)}$ with room to spare), and use standard algorithms to find an optimal solution to this problem. 19
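
As a rough illustration of handing this problem to an off-the-shelf optimizer (a sketch assuming scikit-learn is available; the lecture treats the optimizer as a black box), a linear-kernel SVC with a very large C approximates the hard-margin SVM, since any slack becomes prohibitively expensive. The toy data below are made up and linearly separable.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])          # labels encoded as {-1, +1}

clf = SVC(kernel="linear", C=1e10)    # very large C approximates the hard margin
clf.fit(X, y)
print(clf.coef_, clf.intercept_)      # learned w and b
print(clf.support_vectors_)           # instances lying on the margin boundary
```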

The soft-margin SVM [Cortes & Vapnik, Machine Learning 1995]: if the training instances are not linearly separable, the previous formulation will fail. We can adjust our approach by using slack variables (denoted by ξ) to tolerate errors: minimize over $w, b, \xi^{(1)}, \ldots, \xi^{(m)}$ the objective $\frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi^{(i)}$, subject to the constraints $y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi^{(i)}$ and $\xi^{(i)} \ge 0$ for $i = 1, \ldots, m$. C determines the relative importance of maximizing the margin vs. minimizing slack. 20
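
One way to see what the slack variables do is to evaluate the soft-margin objective directly for a candidate (w, b): the smallest feasible slack for instance i is $\xi^{(i)} = \max(0,\, 1 - y^{(i)}(w^T x^{(i)} + b))$. The sketch below (my own, with made-up data) just computes that objective.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)        # slack needed by each instance
    return 0.5 * np.dot(w, w) + C * xi.sum()   # (1/2)||w||^2 + C * sum of slack

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.5, 2.5]])
y = np.array([-1, -1, 1, 1])
print(soft_margin_objective(np.array([1.0, -1.0]), 0.5, X, y, C=1.0))   # 2.5
```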

The effect of C in a soft-margin SVM Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010 21

Nonlinear classifiers (slides 22-28): a sequence of plots of data labeled +1 / -1, shown first in the original x1-x2 (or x-y) coordinates and then with the product feature (x1 x2, xy) used as an axis.

Nonlinear classifiers: what if a linear separator is not an appropriate decision boundary for a given task? For any (consistent) data set, there exists a mapping ϕ to a higher-dimensional space such that the data is linearly separable: $\phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_k(x))$. Example: mapping to quadratic space. Suppose $x = \langle x_1, x_2 \rangle$ is represented by 2 features; then $\phi(x) = (x_1^2,\ \sqrt{2}x_1 x_2,\ x_2^2,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ 1)$. Now try to find a linear separator in this space. 29

Nonlinear classifiers: for the linear case, our discriminant function was given by $h(x) = 1$ if $w \cdot x + b > 0$ and $-1$ otherwise. For the nonlinear case, it can be expressed as $h(x) = 1$ if $w \cdot \phi(x) + b > 0$ and $-1$ otherwise, where w is a higher-dimensional vector. 30

SVMs with polynomial kernels Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010 31

The kernel trick: explicitly computing this nonlinear mapping does not scale well. A dot product between two higher-dimensional mappings can sometimes be implemented by a kernel function. Example: quadratic kernel, $k(x, z) = (x \cdot z + 1)^2 = (x_1 z_1 + x_2 z_2 + 1)^2 = x_1^2 z_1^2 + 2x_1 x_2 z_1 z_2 + x_2^2 z_2^2 + 2x_1 z_1 + 2x_2 z_2 + 1 = (x_1^2,\ \sqrt{2}x_1 x_2,\ x_2^2,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ 1) \cdot (z_1^2,\ \sqrt{2}z_1 z_2,\ z_2^2,\ \sqrt{2}z_1,\ \sqrt{2}z_2,\ 1) = \phi(x) \cdot \phi(z)$. 32
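
A quick numerical check of this identity (my own sketch, with two arbitrary 2-dimensional vectors): the quadratic kernel $(x \cdot z + 1)^2$ matches the dot product of the explicit quadratic feature maps.

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map from the slide
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, s * x1 * x2, x2**2, s * x1, s * x2, 1.0])

def quad_kernel(x, z):
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(quad_kernel(x, z), np.dot(phi(x), phi(z)))   # both are 4.0
```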

The kernel trick: thus we can use a kernel to compute the dot product without explicitly mapping the instances to a higher-dimensional space, $k(x, z) = (x \cdot z + 1)^2 = \phi(x) \cdot \phi(z)$. But why is the kernel trick helpful? 33

Using the kernel trick: given a training set $D = \{\langle x^{(1)}, y^{(1)}\rangle, \ldots, \langle x^{(m)}, y^{(m)}\rangle\}$, suppose the weight vector can be represented as a linear combination of the training instances, $w = \sum_{i=1}^{m} \alpha_i x^{(i)}$. Then we can represent a linear SVM as $\sum_{i=1}^{m} \alpha_i (x^{(i)} \cdot x) + b$ and a nonlinear SVM as $\sum_{i=1}^{m} \alpha_i (\phi(x^{(i)}) \cdot \phi(x)) + b = \sum_{i=1}^{m} \alpha_i k(x^{(i)}, x) + b$. 34
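
A small check of this dual representation (a sketch assuming scikit-learn, whose SVC stores the products $\alpha_i y^{(i)}$ in dual_coef_ and the corresponding instances in support_vectors_): for a linear kernel, the decision value $w^T x + b$ equals the kernel expansion over the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([2.0, 1.0])
primal = clf.decision_function([x_new])[0]                                  # w . x + b
dual = np.dot(clf.dual_coef_[0], clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(primal, dual)                                                         # the two values agree
```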

Can we represent a weight vector as a linear combination of training instances? Consider perceptron learning, where each weight can be represented as $w_j = \sum_{i=1}^{m} \alpha_i x_j^{(i)}$. Proof: each weight update has the form $w_j^{(t)} = w_j^{(t-1)} + \eta\,\delta_t^{(i)}\, y^{(i)} x_j^{(i)}$, where $\delta_t^{(i)} = 1$ if $x^{(i)}$ is misclassified in epoch t and 0 otherwise. Therefore $w_j = \sum_{i=1}^{m} \sum_t \eta\,\delta_t^{(i)} y^{(i)} x_j^{(i)} = \sum_{i=1}^{m} \Big(\sum_t \eta\,\delta_t^{(i)} y^{(i)}\Big) x_j^{(i)} = \sum_{i=1}^{m} \alpha_i x_j^{(i)}$. 35
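
The same argument suggests the dual ("kernelized") perceptron: instead of updating w, accumulate alpha_i = sum_t eta * delta_t^(i) * y^(i) and predict with sum_i alpha_i * k(x^(i), x). Below is a minimal sketch (my own, not from the lecture), with a +1 added to the kernel to absorb the bias term; data are made up.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, eta=1.0, epochs=100):
    m = len(y)
    alpha = np.zeros(m)                                     # alpha_i absorbs the label y^(i)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    for _ in range(epochs):
        for i in range(m):
            if y[i] * np.dot(alpha, K[:, i]) <= 0:          # x^(i) is misclassified
                alpha[i] += eta * y[i]
    return alpha

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])
alpha = kernel_perceptron(X, y, kernel=lambda a, b: np.dot(a, b) + 1.0)
print(alpha)   # nonzero entries correspond to instances that triggered updates
```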

The primal and dual formulations of the hard-margin SVM. Primal: minimize over w, b the objective $\frac{1}{2}\|w\|_2^2$, subject to $y^{(i)}(w^T x^{(i)} + b) \ge 1$ for $i = 1, \ldots, m$. Dual: maximize over $\alpha_1, \ldots, \alpha_m$ the objective $\sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{j=1}^{m}\sum_{k=1}^{m}\alpha_j \alpha_k y^{(j)} y^{(k)} (x^{(j)} \cdot x^{(k)})$, subject to $\alpha_i \ge 0$ for $i = 1, \ldots, m$ and $\sum_{i=1}^{m}\alpha_i y^{(i)} = 0$. 36

The dual formulation with a kernel (hard-margin version). Primal: minimize over w, b the objective $\frac{1}{2}\|w\|_2^2$, subject to $y^{(i)}(w^T \phi(x^{(i)}) + b) \ge 1$ for $i = 1, \ldots, m$. Dual: maximize over $\alpha_1, \ldots, \alpha_m$ the objective $\sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{j=1}^{m}\sum_{k=1}^{m}\alpha_j \alpha_k y^{(j)} y^{(k)} k(x^{(j)}, x^{(k)})$, subject to $\alpha_i \ge 0$ for $i = 1, \ldots, m$ and $\sum_{i=1}^{m}\alpha_i y^{(i)} = 0$. 37

Support vectors: the final solution is a sparse linear combination of the training instances. Those instances having $\alpha_i > 0$ are called support vectors; they lie on the margin boundary. The solution wouldn't change if all the instances with $\alpha_i = 0$ were deleted. 38

The kernel matrix: the kernel matrix (a.k.a. Gram matrix) represents pairwise similarities for instances in the training set,
$$K = \begin{bmatrix} k(x^{(1)},x^{(1)}) & k(x^{(1)},x^{(2)}) & \cdots & k(x^{(1)},x^{(m)}) \\ k(x^{(2)},x^{(1)}) & & & \vdots \\ \vdots & & \ddots & \\ k(x^{(m)},x^{(1)}) & & \cdots & k(x^{(m)},x^{(m)}) \end{bmatrix}$$
It represents the information about the training set that is provided as input to the optimization process. 39
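
As an illustration of the kernel matrix being the optimizer's input (a sketch assuming scikit-learn, which accepts a precomputed m x m kernel matrix via kernel="precomputed"; the data are made up), one can build the Gram matrix for a quadratic kernel and hand it to SVC directly.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

K = (X @ X.T + 1.0) ** 2                   # Gram matrix for k(x, z) = (x . z + 1)^2

clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

X_test = np.array([[2.0, 1.0]])
K_test = (X_test @ X.T + 1.0) ** 2         # kernel values between test and training instances
print(clf.predict(K_test))
```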

Some common kernels: polynomial of degree d, $k(x, z) = (x \cdot z)^d$; polynomial of degree up to d, $k(x, z) = (x \cdot z + 1)^d$; radial basis function (RBF, a.k.a. Gaussian), $k(x, z) = \exp(-\gamma \|x - z\|^2)$. 40
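
Written out directly (a simple sketch, with arbitrary example vectors), these three kernels are:

```python
import numpy as np

def poly_kernel(x, z, d):                 # polynomial of degree d
    return np.dot(x, z) ** d

def poly_upto_kernel(x, z, d):            # polynomial of degree up to d
    return (np.dot(x, z) + 1.0) ** d

def rbf_kernel(x, z, gamma):              # radial basis function (Gaussian)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, z, 2), poly_upto_kernel(x, z, 2), rbf_kernel(x, z, 0.5))
```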

The RBF kernel: the feature mapping ϕ for the RBF kernel is infinite-dimensional! Recall that $k(x, z) = \phi(x) \cdot \phi(z)$. For $\gamma = \frac{1}{2}$: $k(x, z) = \exp(-\frac{1}{2}\|x - z\|^2) = \exp(-\frac{1}{2}\|x\|^2)\exp(-\frac{1}{2}\|z\|^2)\exp(x \cdot z) = \exp(-\frac{1}{2}\|x\|^2)\exp(-\frac{1}{2}\|z\|^2)\sum_{n=0}^{\infty}\frac{(x \cdot z)^n}{n!}$, from the Taylor series expansion of $\exp(x \cdot z)$. 41

The RBF kernel illustrated γ = 10 γ = 100 γ = 1000 Figures from openclassroom.stanford.edu (Andrew Ng) 42

What makes a valid kernel? k(x, z) is a valid kernel if there is some ϕ such that $k(x, z) = \phi(x) \cdot \phi(z)$. This holds for a symmetric function k(x, z) if and only if the kernel matrix K is positive semidefinite for any training set (Mercer's theorem). Definition of positive semidefinite (p.s.d.): $\forall v: v^T K v \ge 0$. 43
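
A quick numerical sanity check of the p.s.d. condition (my own sketch, using random data and the quadratic kernel): all eigenvalues of the kernel matrix should be nonnegative, up to numerical tolerance.

```python
import numpy as np

X = np.random.RandomState(0).randn(10, 3)    # 10 random training instances
K = (X @ X.T + 1.0) ** 2                     # kernel matrix for the quadratic kernel
eigvals = np.linalg.eigvalsh(K)              # K is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-9)                # True: K is positive semidefinite
```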

Support vector regression: the SVM idea can also be applied in regression tasks. An ε-insensitive error function specifies that a training instance is well explained if the model's prediction is within ε of $y^{(i)}$; the tube is bounded by $(w^T x^{(i)} + b) - y^{(i)} = \varepsilon$ and $y^{(i)} - (w^T x^{(i)} + b) = \varepsilon$. 44

Support vector regression: minimize over $w, b, \xi^{(1)}, \ldots, \xi^{(m)}, \hat{\xi}^{(1)}, \ldots, \hat{\xi}^{(m)}$ the objective $\frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}(\xi^{(i)} + \hat{\xi}^{(i)})$, subject to the constraints $(w^T x^{(i)} + b) - y^{(i)} \le \varepsilon + \xi^{(i)}$ and $y^{(i)} - (w^T x^{(i)} + b) \le \varepsilon + \hat{\xi}^{(i)}$, with $\xi^{(i)}, \hat{\xi}^{(i)} \ge 0$, for $i = 1, \ldots, m$. Slack variables allow predictions for some training instances to be off by more than ε. 45
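
A hedged sketch of ε-insensitive regression using scikit-learn's SVR (one possible solver for this formulation; the data below are synthetic): instances that end up on or outside the ε-tube become the support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(40)        # noisy 1-D regression target

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(len(reg.support_))        # number of support vectors (points on or outside the tube)
print(reg.predict([[0.0]]))     # prediction at x = 0 (the underlying function is sin(x))
```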

Learning theory justification for maximizing the margin: the error on the true distribution is bounded by the training-set error plus a term depending on the VC-dimension of the hypothesis class, $\text{error}_{true}(h) \le \text{error}_{D}(h) + \sqrt{\frac{VC\left(\log\frac{2m}{VC} + 1\right) + \log\frac{4}{\delta}}{m}}$. Vapnik showed there is a connection between the margin and the VC dimension: $VC \le \frac{4R^2}{\text{margin}_D(h)^2}$, where R is a constant dependent on the training data. Thus, to minimize the VC dimension (and improve the error bound), maximize the margin. 46

The power of kernel functions: kernels can be designed and used to represent complex data types such as strings, trees, graphs, etc. Let's consider a specific example. 47

The protein classification task Given: amino-acid sequence of a protein Do: predict the family to which it belongs GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVCVLAHHFGKEFTPPVQAAYAKVVAGVANALAHKYH 48

The k-spectrum feature map: we can represent sequences by counts of all of their k-mers. Example: x = AKQDYYYYEI, k = 3, ϕ(x) = (0, 0, ..., 1, ..., 1, ..., 2), where the entries shown correspond to AAA, AAC, AKQ, DYY, and YYY. The dimension of ϕ(x) is $|A|^k$, where $|A|$ is the size of the alphabet; using 6-mers for protein sequences, $20^6$ = 64 million. Almost all of the elements in ϕ(x) are 0, since a sequence of length l has at most l - k + 1 k-mers. 49

The k-spectrum kernel: consider the k-spectrum kernel applied to x and z with k = 3, x = AKQDYYYYEI and z = AKQIAKQYEI. Then ϕ(x) = (0, ..., 1, ..., 1, ..., 2) and ϕ(z) = (0, ..., 2, ..., 1, ..., 0), where the entries shown correspond to AAA, AKQ, YEI, and YYY, so ϕ(x) · ϕ(z) = 2 + 1 = 3. 50
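
A direct (non-trie) implementation of the k-spectrum kernel for this example (my own sketch; the trie-based version on the later slides is the efficient approach of Leslie et al.):

```python
from collections import Counter

def spectrum_counts(seq, k):
    # counts of all k-mers in the sequence (the nonzero entries of phi(seq))
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, z, k):
    cx, cz = spectrum_counts(x, k), spectrum_counts(z, k)
    return sum(cx[kmer] * cz[kmer] for kmer in cx)   # phi(x) . phi(z)

print(spectrum_kernel("AKQDYYYYEI", "AKQIAKQYEI", 3))   # 3, as on the slide
```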

(k, m)-mismatch feature map: closely related protein sequences may have few exact matches, but many near matches. The (k, m)-mismatch feature map uses the k-spectrum representation, but allows up to m mismatches: for x = AKQ with k = 3 and m = 1, ϕ(x) = (0, ..., 1, ..., 1, ..., 1, ..., 0), where the entries shown correspond to AAA, AAQ, AKQ, DKQ, and YYY. 51

Using a trie to represent 3-mers. Example: representing all 3-mers of the sequence QAAKKQAKKY (trie diagram: the root branches on A, K, Q; each leaf stores the count of one of the seven distinct 3-mers, with counts 1, 2, 1, 1, 1, 1, 1). 52

Computing the kernels efficiently. k-spectrum kernel with tries [Leslie et al., NIPS 2002]: for each sequence, build a trie representing its k-mers; compute the kernel ϕ(x) · ϕ(z) by traversing the trie for x using the k-mers from z, updating the kernel function when reaching a leaf. (k, m)-mismatch kernel: for each sequence, build a trie representing its k-mers and also k-mers with at most m mismatches; compute the kernel ϕ(x) · ϕ(z) by traversing the trie for x using the k-mers from z, updating the kernel function when reaching a leaf. Scales linearly with sequence length: $O\!\left(k^{m+1}|A|^{m}(|x| + |z|)\right)$. 53

Kernel algebra: given a valid kernel, we can make new valid kernels using a variety of operators. Each kernel composition below corresponds to the mapping composition next to it: $k(x,v) = k_a(x,v) + k_b(x,v)$ corresponds to $\phi(x) = (\phi_a(x), \phi_b(x))$; $k(x,v) = \gamma k_a(x,v)$ with $\gamma > 0$ corresponds to $\phi(x) = \sqrt{\gamma}\,\phi_a(x)$; $k(x,v) = k_a(x,v)\,k_b(x,v)$ corresponds to $\phi_l(x) = \phi_{ai}(x)\,\phi_{bj}(x)$; $k(x,v) = x^T A v$ with A p.s.d. corresponds to $\phi(x) = L^T x$, where $A = LL^T$; $k(x,v) = f(x)\,f(v)\,k_a(x,v)$ corresponds to $\phi(x) = f(x)\,\phi_a(x)$. 54

Comments on SVMs: we can find solutions that are globally optimal (maximize the margin) because the learning task is framed as a convex optimization problem: no local minima, in contrast to multi-layer neural nets. There are two formulations of the optimization, primal and dual; the dual represents the classifier decision in terms of support vectors, and the dual enables the use of kernel functions. We can use a wide range of optimization methods to learn an SVM: standard quadratic programming solvers, SMO [Platt, 1999], linear programming solvers for some formulations, etc. 55

Comments on SVMs: kernels provide a powerful way to allow nonlinear decision boundaries, to represent/compare complex objects such as strings and trees, and to incorporate domain knowledge into the learning task. Using the kernel trick, we can implicitly use high-dimensional mappings without explicitly computing them. One SVM can represent only a binary classification task; multi-class problems are handled using multiple SVMs and some encoding (one class vs. rest, ECOC, etc.). Empirically, SVMs have shown state-of-the-art accuracy for many tasks. The kernel idea can be extended to other tasks (anomaly detection, regression, etc.). 56