Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012


Linear classifier: which classifier? (Figure: two classes of points in the x1-x2 plane with several candidate separating lines.)

Linear classifier: margin concept. Support Vector Machine (SVM): a classifier that finds the solution with maximum margin. (Figure in the x1-x2 plane.)

Linear classifier geometry (recall). Discriminant function: $y(x) = w^T x + w_0$. (Signed) distance from $x$ to the boundary: $\frac{w^T x + w_0}{\|w\|}$. The distance does not change when scaling $w$ and $w_0$ equally.
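As a quick numeric check of the distance formula and its scale invariance, here is a small sketch; the weights and the point are made up for illustration:

```python
import numpy as np

# Hypothetical boundary w^T x + w0 = 0 with w = [3, 4], w0 = -5.
w, w0 = np.array([3.0, 4.0]), -5.0
x = np.array([2.0, 1.0])

signed_distance = (w @ x + w0) / np.linalg.norm(w)   # (6 + 4 - 5) / 5 = 1.0
print(signed_distance)

# Scaling w and w0 by the same factor leaves the distance unchanged.
print((2 * w @ x + 2 * w0) / np.linalg.norm(2 * w))  # still 1.0
```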

Margin. When the classes are linearly separable, the margin is the distance to the closest point:
$$\min_i \frac{y^{(i)}(w^T x^{(i)} + w_0)}{\|w\|}$$

Maximum margin: optimization problem. We want to find the hyperplane with maximum margin:
$$\max_{w, w_0} \min_i \frac{y^{(i)}(w^T x^{(i)} + w_0)}{\|w\|} = \max_{w, w_0} \frac{1}{\|w\|} \min_i y^{(i)}(w^T x^{(i)} + w_0)$$
This is hard to solve directly. Instead we set $\min_i y^{(i)}(w^T x^{(i)} + w_0) = 1$, since we can scale $w$ and $w_0$ without changing the boundary (an equivalent problem).

Maximum margin: optimization problem.
$$\max_{w, w_0} \frac{1}{\|w\|} \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + w_0) \ge 1, \; i = 1, \dots, N$$
which is equivalent to
$$\min_{w, w_0} \|w\|^2 \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + w_0) \ge 1, \; i = 1, \dots, N$$
A regularization problem subject to the margin constraints.

SVM: primal problem.
$$\min_{w, w_0} \|w\|^2 \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + w_0) \ge 1, \; i = 1, \dots, N$$
(Figure in the x1-x2 plane.)

Representer theorem. Theorem: the solution to
$$w = \operatorname{argmin}_{w, w_0} \|w\|^2 = w^T w \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + w_0) \ge 1, \; i = 1, \dots, N$$
can be represented as $w = \sum_{i=1}^{N} \alpha_i x^{(i)}$, i.e., $w$ is a linear combination of the training examples.

Lagrange multipliers.
$$\min_{w, w_0} \|w\|^2 \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + w_0) \ge 1, \; i = 1, \dots, N$$
The above problem is equivalent to the following one:
$$\min_{w, w_0} \max_{\{\alpha_i \ge 0\}} \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)}(w^T x^{(i)} + w_0)\right)$$
where the $\alpha_i$ are Lagrange multipliers.

Maximum margin: equivalent problems.
$$\min_{w, w_0} \max_{\{\alpha_i \ge 0\}} \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)}(w^T x^{(i)} + w_0)\right)$$
For this problem we can interchange the order of min and max:
$$\max_{\{\alpha_i \ge 0\}} \min_{w, w_0} \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)}(w^T x^{(i)} + w_0)\right)$$
where $\alpha = [\alpha_1, \dots, \alpha_N]$.

Solving the optimization problem.
$$\max_{\{\alpha_i \ge 0\}} \min_{w, w_0} J(w, w_0, \alpha), \qquad J(w, w_0, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)}(w^T x^{(i)} + w_0)\right)$$
Setting the derivatives to zero:
$$\frac{\partial J(w, w_0, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y^{(i)} x^{(i)}$$
$$\frac{\partial J(w, w_0, \alpha)}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y^{(i)} = 0$$
$w_0$ drops out, but a global constraint on $\alpha$ is created.

Maximum margin: dual problem.
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} {x^{(i)}}^T x^{(j)}$$
subject to
$$\sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \qquad \alpha_i \ge 0, \; i = 1, \dots, N$$
This is a QP (quadratic program). We find $\alpha$ by solving the above problem and then $w = \sum_{i=1}^{N} \alpha_i y^{(i)} x^{(i)}$.
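A minimal sketch of solving this hard-margin dual as a QP, assuming the cvxopt package is available and the toy data are linearly separable (the data and tolerances are illustrative, not from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data; labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
N = len(y)

# cvxopt solves: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b.
# Here P_ij = y_i y_j x_i^T x_j, q = -1, G = -I, h = 0, A = y^T, b = 0.
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)
alphas = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

# Recover w from the dual solution and w0 from the support vectors.
w = ((alphas * y)[:, None] * X).sum(axis=0)
sv = alphas > 1e-6
w0 = np.mean(y[sv] - X[sv] @ w)
print("support vectors:", sv.sum(), "w:", w, "w0:", w0)
```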

Maximum margin: decision boundary. If for a particular $x^{(i)}$ we have $y^{(i)}(w^T x^{(i)} + w_0) > 1$, then $\alpha_i = 0$ and $x^{(i)}$ is not a support vector (inactive constraint). The direction of the hyperplane depends only on the support vectors: $w = \sum_{\alpha_i > 0} \alpha_i y^{(i)} x^{(i)}$. $w_0$ can be set by making the margin equidistant to the two classes.

Karush-Kuhn-Tucker (KKT) conditions:
$$\frac{\partial L(w, w_0, \alpha)}{\partial w} = 0, \qquad \frac{\partial L(w, w_0, \alpha)}{\partial w_0} = 0$$
$$y^{(i)}(w^T x^{(i)} + w_0) \ge 1, \quad i = 1, \dots, N$$
$$\alpha_i \ge 0, \quad i = 1, \dots, N$$
$$\alpha_i \left(1 - y^{(i)}(w^T x^{(i)} + w_0)\right) = 0, \quad i = 1, \dots, N$$

Support vectors. (Figure: points lying on the margin have $\alpha > 0$; margin width $1/\|w\|$.)

Support vectors. (Figure: points on the margin have $\alpha > 0$, points away from the margin have $\alpha = 0$.)

Classification of a test example $x$:
$$y = \operatorname{sign}(w_0 + w^T x) = \operatorname{sign}\Big(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} {x^{(i)}}^T x\Big)$$
The classifier is based on an expansion in terms of dot products of $x$ with the support vectors.
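A small sketch (assuming scikit-learn and a toy dataset) verifying that the decision value can be computed from the support-vector expansion alone:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

x_test = X[:5]
# dual_coef_ holds alpha_i * y_i for the support vectors; intercept_ is w0.
manual = clf.dual_coef_ @ (clf.support_vectors_ @ x_test.T) + clf.intercept_
print(np.allclose(manual.ravel(), clf.decision_function(x_test)))  # True
```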

Support vectors. The linear hyperplane is defined by the support vectors, and the support vectors are sufficient to predict the labels of new points. How many support vectors are there in the linearly separable case (d dimensions)? Typically d + 1.

More general than the linearly separable case: allow errors in classification. Overlapping classes that can be approximately separated by a linear boundary; noise in otherwise linearly separable data. (Figures in the x1-x2 plane.)

More general than the linearly separable case: objective function. Minimizing the number of misclassified points? NP-complete. Soft margin: maximize the margin while trying to minimize the distance between misclassified points and their correct plane.

SVM (soft margin): primal problem. SVM with slack variables:
$$\min_{w, w_0, \xi} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + w_0) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, N$$
$\xi_i$: slack variables. $\xi_i > 1$ if $x^{(i)}$ is misclassified; $0 < \xi_i < 1$ if $x^{(i)}$ is correctly classified but lies inside the margin. (Figure in the x1-x2 plane.)

Soft margin: linear penalty for mistakes (hinge loss). C: tradeoff parameter. The soft-margin approach is still a QP and has a unique minimum. (Figure: points with $\xi = 0$ lie on or outside the margin.)

Soft margin parameter. C is a regularization parameter: small C allows the margin constraints to be easily ignored (large margin); large C makes the constraints hard to ignore (narrow margin); $C = \infty$ enforces all constraints (hard margin). C can be determined using a technique like cross-validation.
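A minimal sketch of choosing C by cross-validation, assuming scikit-learn and a toy dataset (the candidate C values and data are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy two-class data.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

# Search over the soft-margin tradeoff parameter C with 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"], "cv accuracy:", grid.best_score_)
```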

SVM (soft margin): dual problem.
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} {x^{(i)}}^T x^{(j)}$$
subject to
$$\sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \qquad 0 \le \alpha_i \le C, \; i = 1, \dots, N$$
By solving the above quadratic problem we first find $\alpha$ and then $w = \sum_{i=1}^{N} \alpha_i y^{(i)} x^{(i)}$. For a test sample $x$ (as before): $y = \operatorname{sign}(w_0 + w^T x) = \operatorname{sign}(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} {x^{(i)}}^T x)$.

SVM (soft margin): another view.
$$\min_{w, w_0, \xi} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + w_0) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, N$$
is equivalent to the unconstrained optimization problem
$$\min_{w, w_0} \|w\|^2 + C \sum_{i=1}^{N} \max\left(0, 1 - y^{(i)}(w^T x^{(i)} + w_0)\right)$$
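This unconstrained hinge-loss form can be minimized directly with (sub)gradient descent. Below is a minimal NumPy sketch; the step size, iteration count, and data are illustrative assumptions, not part of the slides:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=1e-3, epochs=500):
    """Subgradient descent on ||w||^2 + C * sum(max(0, 1 - y*(w^T x + w0)))."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        viol = margins < 1                                  # points with non-zero hinge loss
        grad_w = 2 * w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0

# Toy usage: labels must be in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, w0 = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + w0) == y))
```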

SVM loss function vs. other loss functions: hinge loss vs. 0-1 loss. Hinge loss: $\max(0, 1 - y^{(i)}(w^T x^{(i)} + w_0))$. (Figure: 0-1 loss and hinge loss plotted against $w^T x + w_0$ for $y = 1$.)

SVM loss function vs. other loss functions: hinge loss vs. log conditional likelihood. Hinge loss: $\max(0, 1 - y^{(i)}(w^T x^{(i)} + w_0))$. (Figure: hinge loss plotted against $w^T x + w_0$ for $y = 1$.)

Not linearly separable data. Noisy data or overlapping classes (already discussed: soft margin), i.e., nearly linearly separable. Non-linear decision surface: transform to a new feature space. (Figures in the x1-x2 plane.)

Nonlinear SVM: $\Phi: x \mapsto \phi(x)$.

SVM in a transformed space. $\phi: \mathbb{R}^d \to \mathbb{R}^m$, $x \mapsto \phi(x)$. Find a hyperplane in the feature space: $f(x) = w^T \phi(x) + w_0 = 0$. (Figure: input space $(x_1, x_2)$ mapped to feature space $(\phi_1(x), \phi_2(x))$.)

SVM in a transformed space: primal problem.
$$\min_{w, w_0} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t. } y^{(i)}(w^T \phi(x^{(i)}) + w_0) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, N$$
Classifying new data: $y = \operatorname{sign}(w_0 + w^T \phi(x)) = \operatorname{sign}(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \phi(x^{(i)})^T \phi(x))$. Transform to a space where the data can be classified linearly, with $w \in \mathbb{R}^m$. If $m \gg d$ (a very high-dimensional feature space), then there are many more parameters to learn.

SVM classifier in a transformed space: dual problem.
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \phi(x^{(i)})^T \phi(x^{(j)})$$
subject to
$$\sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \qquad 0 \le \alpha_i \le C, \; i = 1, \dots, N$$
If we have the inner products $\phi(x^{(i)})^T \phi(x^{(j)})$, only $\alpha = [\alpha_1, \dots, \alpha_N]$ needs to be learned; it is not necessary to learn $m$ parameters.

Kernel trick. Kernel: an inner product in a feature space, $k(x, x') = \phi(x)^T \phi(x')$, where $\phi(x) = [\phi_1(x), \dots, \phi_m(x)]^T$. The kernel trick is an extension of many well-known algorithms. Idea: the input vectors enter only in the form of scalar products, so we can replace that inner product with other choices of kernel.

Kernel SVM: optimization problem.
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} k(x^{(i)}, x^{(j)})$$
subject to
$$\sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \qquad 0 \le \alpha_i \le C, \; i = 1, \dots, N$$
Classifying new data: $y = \operatorname{sign}(w_0 + w^T \phi(x)) = \operatorname{sign}(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} k(x^{(i)}, x))$.

Constructing kernels. Construct kernel functions directly, ensuring that each is a valid kernel, i.e., corresponds to an inner product in some feature space. Example: $k(x, x') = (x^T x')^2$ with $\phi(x) = [x_1^2, \sqrt{2}\, x_1 x_2, x_2^2]^T$. We need a way to test whether a kernel is valid without having to construct $\phi(x)$.
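A quick numeric check (NumPy, illustrative vectors) that the degree-2 kernel above equals the inner product of the explicit feature maps:

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, x') = (x^T x')^2 in two dimensions."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
print((x @ xp) ** 2)        # kernel evaluated directly
print(phi(x) @ phi(xp))     # inner product in the feature space; same value
```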

Valid kernel: necessary and sufficient conditions [Shawe-Taylor & Cristianini 2004]. Gram matrix $K_{N \times N}$: $K_{ij} = k(x^{(i)}, x^{(j)})$, obtained by restricting the kernel function to a set of points $\{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$. $K$ must be positive semi-definite for all possible choices of the set $\{x^{(i)}\}$.
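A minimal sketch (NumPy, illustrative data) of checking the Gram matrix of a kernel on one particular set of points; this is only a necessary check, since validity requires positive semi-definiteness for every choice of points:

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix K_ij = k(x_i, x_j) for the rows of X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

X = np.random.default_rng(0).normal(size=(20, 3))
K = gram_matrix(lambda a, b: (a @ b + 1.0) ** 2, X)   # degree-2 polynomial kernel
eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigvals.min())               # >= 0 (up to numerical error)
```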

Some common kernels.
Linear: $k(x, x') = x^T x'$
Polynomial: $k(x, x') = (x^T x' + c)^m$, $c \ge 0$; contains all polynomial terms up to degree $m$
Gaussian: $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$; infinite-dimensional feature space
Sigmoid: $k(x, x') = \tanh(a\, x^T x' + b)$
Stationary kernels are functions of $x - x'$; RBF kernels have the form $k(x, x') = g(\|x - x'\|)$
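Minimal NumPy implementations of the kernels listed above; the parameter values are examples, not prescribed by the slides:

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, c=1.0, m=3):
    return (x @ xp + c) ** m

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, xp, a=1.0, b=0.0):
    return np.tanh(a * (x @ xp) + b)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), polynomial_kernel(x, xp),
      gaussian_kernel(x, xp), sigmoid_kernel(x, xp))
```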

SVM Gaussian kernel: examples. (A sequence of figures showing decision boundaries of an SVM with a Gaussian kernel; source: Zisserman's slides.)

Kernel functions for objects. Kernel functions can be defined over structured objects: graphs, sets, strings, etc. Kernel for sets, for example: $k(A, B) = 2^{|A \cap B|}$.
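A minimal sketch of the set kernel above (the example sets are illustrative):

```python
# Set kernel k(A, B) = 2^{|A ∩ B|}: counts the subsets shared by A and B.
def set_kernel(A, B):
    return 2 ** len(set(A) & set(B))

print(set_kernel({"a", "b", "c"}, {"b", "c", "d"}))  # |intersection| = 2 -> 4
```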

SVM: summary.
Hard margin: maximizing the margin. Primal and dual problems; the dual problem expresses the classifier decision in terms of support vectors; a quadratic optimization problem with a single global minimum.
Soft margin: handling noisy data and overlapping classes. Slack variables in the primal problem; the dual problem is again a quadratic problem.
Kernel SVMs: learn a linear decision boundary in a high-dimensional feature space using the SVM, so that suitable boundaries can be found in the original input space; the kernel trick makes this possible.