Support Vector Machine

Kernel: a kernel is a function that returns the inner product between the images of its two arguments,
$$k(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle, \qquad k(x_1, x_2) = k(x_2, x_1).$$
Modularity: it is possible to construct new kernels from old ones. If $k_1, k_2$ are kernels, then
- $k_1 + k_2$ is a kernel,
- $c k_1$ is a kernel for $c > 0$,
- $a k_1 + b k_2$ is a kernel for $a, b > 0$.
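As a quick illustration of this modularity (a sketch added here, not part of the original notes; the linear and Gaussian kernels and the random test points are chosen only for the example), the following Python snippet combines two known kernels into a new one and checks that the resulting kernel matrix is symmetric:

```python
# Illustrative sketch only: combining two known kernels into a new one.
import numpy as np

def k_linear(x, y):
    """Linear kernel <x, y>."""
    return float(np.dot(x, y))

def k_rbf(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2*sigma)), as in the examples below."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma)))

def k_new(x, y, a=1.0, b=2.0):
    """a*k_linear + b*k_rbf with a, b > 0 is again a kernel."""
    return a * k_linear(x, y) + b * k_rbf(x, y)

# The combined kernel is still symmetric: k(x, y) = k(y, x).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
K = np.array([[k_new(xi, xj) for xj in X] for xi in X])
assert np.allclose(K, K.T)
```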

Examples: $k(x, y) = \langle x, y \rangle^n$ and $k(x, y) = e^{-\|x - y\|^2/(2\sigma)}$. Let $x = (x_1, x_2)^T$, $y = (y_1, y_2)^T$. Then
$$\langle x, y \rangle^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 x_2 y_2 = \langle (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2), (y_1^2, y_2^2, \sqrt{2}\, y_1 y_2) \rangle = \langle \phi(x), \phi(y) \rangle. \quad (1)$$
Given a set of vectors $x_1, \ldots, x_N$, the kernel matrix is defined as
$$K = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_N) \\ \vdots & & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{pmatrix}.$$
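A small numerical check of identity (1) (an illustration added here, not part of the original notes; the two test vectors are arbitrary):

```python
# Check that <x, y>^2 = <phi(x), phi(y)> with phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

x = np.array([1.5, -0.3])
y = np.array([0.7, 2.0])

lhs = np.dot(x, y) ** 2          # kernel evaluated directly in the input space
rhs = np.dot(phi(x), phi(y))     # inner product in the feature space
assert np.isclose(lhs, rhs)
```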

Properties:
- The kernel matrix is positive semi-definite, i.e. for any vector $a$ it satisfies $a^T K a \geq 0$.
- Any symmetric (i.e. $a_{ij} = a_{ji}$) positive definite matrix can be regarded as a kernel matrix (an inner-product matrix in some space).
- Every positive (semi-)definite, symmetric function $k$ is a kernel, i.e. there exists a mapping $\phi$ such that $k(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$.
- The kernel can be expanded in the series
$$k(x_1, x_2) = \sum_{i=1}^{\infty} \lambda_i \, \phi_i(x_1) \, \phi_i(x_2),$$
where the $\lambda_i \geq 0$ are nonnegative eigenvalues.
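The finite-dimensional analogue of this expansion is the eigen-decomposition of the kernel matrix itself. The sketch below (added for illustration; the random points and the RBF kernel are not taken from the notes) checks that the eigenvalues are nonnegative and that the matrix is recovered from its spectral components:

```python
# Spectral view of a kernel matrix: K = sum_i lam_i v_i v_i^T with lam_i >= 0.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2))

# RBF kernel matrix on the sample points.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

lam, V = np.linalg.eigh(K)        # eigen-decomposition of the symmetric matrix K
assert np.all(lam >= -1e-10)      # positive semi-definite, up to round-off

# Rebuild K from its eigenvalues and eigenvectors (finite analogue of the series).
K_rebuilt = sum(l * np.outer(v, v) for l, v in zip(lam, V.T))
assert np.allclose(K, K_rebuilt)
```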

Cover's Theorem: Given a set of training data that is not linearly separable, one can with high probability transform it into a training set that is linearly separable by projecting it into a higher-dimensional space via some non-linear transformation.

[Figure: left, the data in the original space $(x_1, x_2)$; right, the same data in the feature space $(\phi_1, \phi_2, \phi_3)$.]

The data is not linearly separable in the original space $[x_1, x_2]$. Select $\phi_1 = x_1$, $\phi_2 = x_2$, $\phi_3 = x_1^2 + x_2^2$. The data becomes linearly separable in the new, higher-dimensional feature space.
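A toy version of this construction (added as an illustration; the circular two-class data below is made up): after adding $\phi_3 = x_1^2 + x_2^2$, the classes become separable by a plane.

```python
# Points inside vs. outside a circle: not linearly separable in (x1, x2),
# but separable by the plane phi3 = r^2 after lifting with phi3 = x1^2 + x2^2.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(200, 2))
t = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 4.0, 1, -1)   # label by a circle of radius 2

# Lifted representation (phi1, phi2, phi3) = (x1, x2, x1^2 + x2^2).
Phi = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])

# In the lifted space, the plane phi3 = 4 separates the two classes exactly.
pred = np.where(Phi[:, 2] < 4.0, 1, -1)
assert np.all(pred == t)
```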

Maximum Margin Classifiers: Consider the two-class classification problem using linear models of the form
$$y(x) = w^T \phi(x) + b \quad (2)$$
where $\phi(x)$ denotes a fixed feature-space transformation. The training data set comprises $N$ input vectors $x_1, \ldots, x_N$ with corresponding target values $t_1, \ldots, t_N$, where $t_n \in \{-1, 1\}$, and new data points $x$ are classified according to the sign of $y(x)$.

Non-overlapping class distributions: we start with this case. The support vector machine approaches the problem by maximizing the margin, which is defined to be the smallest distance between the decision boundary and any of the samples.

[Figure: geometry of a linear discriminant $g(x)$: the hyperplane $g = 0$ with normal vector $w$, the half-spaces $g > 0$ and $g < 0$, and the distance of a point $x$ from the hyperplane.]

Since a data point's distance to the hyperplane is given by $|y(x)|/\|w\|$, in the separable case we can set $|y(x)| = 1$ for the nearest point, so the margin is $1/\|w\|$. In the case that the training data set is linearly separable in the feature space, for the linear classifier
$$y(x) = w^T \phi(x) + b \quad (3)$$

we have $t_n y(x_n) > 0$ for all data points. We can set $t_n y(x_n) = 1$ for the point that is closest to the surface. All data points should then satisfy the constraints
$$t_n \left( w^T \phi(x_n) + b \right) \geq 1, \qquad n = 1, \ldots, N. \quad (4)$$
We have to solve the optimization problem
$$\arg\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad (5)$$
subject to (4), which is an example of a quadratic programming (QP) problem. Introducing Lagrange multipliers $a_n \geq 0$, one for each of the constraints in (4), gives the Lagrangian function
$$L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n \left( w^T \phi(x_n) + b \right) - 1 \right\} \quad (6)$$
where $a = [a_1, \ldots, a_N]^T$.
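Before continuing with the dual, here is a quick numerical sketch of the primal problem (5) subject to (4) (added for illustration; it uses scikit-learn, which the notes do not reference, a made-up separable data set, and a very large C so that the soft-margin solver approximates the hard-margin solution):

```python
# Approximate hard-margin linear SVM: large C makes the soft-margin solver
# enforce t_n (w^T x_n + b) >= 1 on separable data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
t = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, t)
w, b = clf.coef_.ravel(), clf.intercept_[0]

print(t * (X @ w + b))                    # margins: every entry >= 1 (up to tolerance)
print("margin width:", 1.0 / np.linalg.norm(w))
print("support vectors:", clf.support_)   # indices of points with t_n y(x_n) = 1
```

With separable data and a sufficiently large C, this soft-margin solution coincides with the hard-margin one up to numerical tolerance.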

Setting the derivatives of $L(w, b, a)$ with respect to $w$ and $b$ equal to zero, we obtain the following two conditions:
$$w = \sum_{n=1}^{N} a_n t_n \phi(x_n) \quad (7)$$
$$0 = \sum_{n=1}^{N} a_n t_n \quad (8)$$
Substituting these into (3), we have
$$y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b \quad (9)$$
where $k(x, x') = \langle \phi(x), \phi(x') \rangle$ is the kernel function. Eliminating $w$ and $b$ from $L(w, b, a)$ using these conditions then gives the dual representation of the maximum margin problem, in which we maximize
$$L(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m) \quad (10)$$

subject to
$$a_n \geq 0, \qquad n = 1, \ldots, N, \quad (11)$$
$$0 = \sum_{n=1}^{N} a_n t_n \quad (12)$$
which is also a quadratic programming problem. Applying the KKT conditions,
$$a_n \geq 0, \qquad t_n y(x_n) - 1 \geq 0, \qquad a_n \{ t_n y(x_n) - 1 \} = 0.$$
Thus for every data point either $a_n = 0$ or $t_n y(x_n) = 1$. Any data point with $a_n = 0$ will not appear in (9). The remaining data points are called support vectors, satisfying $t_n y(x_n) = 1$; they correspond to points that lie on the maximum margin hyperplane in feature space. (A generic numerical recipe for solving this dual is sketched below.)
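A minimal sketch (not from the notes, and assuming the cvxopt package) of solving the dual (10)-(12) numerically: the maximization is rewritten as a standard minimization QP, and the bias is then recovered by averaging over the support vectors, as in (14) below.

```python
# Solve  max_a  sum_n a_n - 0.5 * sum_{n,m} a_n a_m t_n t_m K[n,m]
# s.t.   a_n >= 0,  sum_n a_n t_n = 0,   written as a standard QP (minimization form).
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

def solve_dual(K, t):
    """K: (N, N) kernel matrix, t: (N,) labels in {-1, +1}. Returns the multipliers a."""
    N = len(t)
    P = matrix((np.outer(t, t) * K).astype(float))   # quadratic term of the negated dual
    q = matrix(-np.ones(N))                           # linear term: minimize -sum_n a_n
    G = matrix(-np.eye(N))                            # -a_n <= 0, i.e. a_n >= 0
    h = matrix(np.zeros(N))
    A = matrix(np.asarray(t, dtype=float).reshape(1, -1))  # sum_n a_n t_n = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.asarray(sol["x"]).ravel()

def bias(a, K, t, tol=1e-6):
    """Average of t_n - sum_m a_m t_m K[n,m] over the support vectors (a_n > tol)."""
    S = a > tol
    return np.mean(t[S] - (K[np.ix_(S, S)] @ (a[S] * t[S])))
```

Applied to the XOR data worked out below (with the kernel matrix given there), `solve_dual` should return $a_n = 1/8$ for all four points and `bias` should return $b = 0$.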

The prediction (9) then reduces to a sum over the support vectors,
$$y(x) = \sum_{n \in S} a_n t_n k(x, x_n) + b \quad (13)$$
where $S$ denotes the set of indices of the support vectors. Having solved the QP problem and found $a$, we make use of $t_n^2 = 1$ to solve for $b$ as
$$b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) \right) \quad (14)$$
where $N_S$ is the total number of support vectors.

Example: XOR problem.

Input vector x_i    Desired response t_i
(-1, -1)            -1
(-1, +1)            +1
(+1, -1)            +1
(+1, +1)            -1

Let $k(x, x_i) = (1 + x^T x_i)^2$, with $x = [x_1, x_2]^T$ and $x_i = [x_{i,1}, x_{i,2}]^T$. Then
$$k(x, x_i) = 1 + x_1^2 x_{i,1}^2 + 2 x_1 x_{i,1} x_2 x_{i,2} + x_2^2 x_{i,2}^2 + 2 x_1 x_{i,1} + 2 x_2 x_{i,2} = \phi(x)^T \phi(x_i) \quad (15)$$
with $\phi(x) = [1, x_1^2, \sqrt{2}\, x_1 x_2, x_2^2, \sqrt{2}\, x_1, \sqrt{2}\, x_2]^T$ and $\phi(x_i) = [1, x_{i,1}^2, \sqrt{2}\, x_{i,1} x_{i,2}, x_{i,2}^2, \sqrt{2}\, x_{i,1}, \sqrt{2}\, x_{i,2}]^T$ for all $i$. The kernel matrix is
$$K = \begin{pmatrix} 9 & 1 & 1 & 1 \\ 1 & 9 & 1 & 1 \\ 1 & 1 & 9 & 1 \\ 1 & 1 & 1 & 9 \end{pmatrix}.$$
The objective function in dual form is
$$L(a) = \sum_{n=1}^{4} a_n - \frac{1}{2} \sum_{n=1}^{4} \sum_{m=1}^{4} a_n a_m t_n t_m k(x_n, x_m)
      = a_1 + a_2 + a_3 + a_4 - \frac{1}{2} \left( 9 a_1^2 - 2 a_1 a_2 - 2 a_1 a_3 + 2 a_1 a_4 + 9 a_2^2 + 2 a_2 a_3 - 2 a_2 a_4 + 9 a_3^2 - 2 a_3 a_4 + 9 a_4^2 \right) \quad (16)$$
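A quick check of the kernel matrix and the feature map (15) for the XOR data (an illustration added here, not part of the original notes):

```python
# Verify K for the XOR data under k(x, x') = (1 + x^T x')^2, and that it matches phi.
import numpy as np

X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])

def k(x, y):
    return (1.0 + np.dot(x, y)) ** 2

def phi(x):
    s = np.sqrt(2.0)
    return np.array([1.0, x[0] ** 2, s * x[0] * x[1], x[1] ** 2, s * x[0], s * x[1]])

K = np.array([[k(xi, xj) for xj in X] for xi in X])
K_phi = np.array([[np.dot(phi(xi), phi(xj)) for xj in X] for xi in X])

assert np.allclose(K, K_phi)
print(K)   # 9 on the diagonal, 1 everywhere else
```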

Setting the derivatives of $L(a)$ to zero, we obtain
$$\begin{aligned}
9 a_1 - a_2 - a_3 + a_4 &= 1 \\
- a_1 + 9 a_2 + a_3 - a_4 &= 1 \\
- a_1 + a_2 + 9 a_3 - a_4 &= 1 \\
a_1 - a_2 - a_3 + 9 a_4 &= 1
\end{aligned}$$
which gives $a_1 = a_2 = a_3 = a_4 = 1/8$ (this solution also satisfies the constraints $a_n \geq 0$ and $\sum_n a_n t_n = 0$). The result indicates that all four data samples are support vectors. The weight vector is
$$w = \sum_{n=1}^{4} a_n t_n \phi(x_n) = [0, 0, -1/\sqrt{2}, 0, 0, 0]^T \quad (17)$$
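A numerical confirmation of this solution (illustrative only): solve the same linear system and assemble $w = \sum_n a_n t_n \phi(x_n)$.

```python
# Solve the 4x4 system from the stationarity conditions and rebuild the weight vector.
import numpy as np

A = np.array([[ 9.0, -1.0, -1.0,  1.0],
              [-1.0,  9.0,  1.0, -1.0],
              [-1.0,  1.0,  9.0, -1.0],
              [ 1.0, -1.0, -1.0,  9.0]])
a = np.linalg.solve(A, np.ones(4))
print(a)                                  # [0.125, 0.125, 0.125, 0.125]

X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
t = np.array([-1.0, 1.0, 1.0, -1.0])

def phi(x):
    s = np.sqrt(2.0)
    return np.array([1.0, x[0] ** 2, s * x[0] * x[1], x[1] ** 2, s * x[0], s * x[1]])

w = sum(a_n * t_n * phi(x_n) for a_n, t_n, x_n in zip(a, t, X))
print(w)                                  # [0, 0, -1/sqrt(2), 0, 0, 0]
```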

The bias term $b$ is
$$b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) \right)
    = \frac{1}{4} \sum_{n=1}^{4} \left( t_n - \frac{1}{8} \sum_{m=1}^{4} t_m k(x_n, x_m) \right) = 0 \quad (18)$$
since $\frac{1}{8} \sum_{m=1}^{4} t_m k(x_n, x_m) = t_n$ for every $n$. So the decision hyperplane is given by the polynomial $x_1 x_2 = 0$.

[Figure: the XOR data points and the decision surface $x_1 x_2 = 0$ in the input space $[x_1, x_2]$.]
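Finally, an illustrative check (not part of the original notes) that the classifier (13) with $a_n = 1/8$ and $b = 0$ reproduces the XOR labels:

```python
# Evaluate y(x) = sum_n a_n t_n (1 + x_n^T x)^2 + b on the four XOR points.
import numpy as np

X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
t = np.array([-1.0, 1.0, 1.0, -1.0])
a, b = np.full(4, 1.0 / 8.0), 0.0

def y(x):
    return sum(a_n * t_n * (1.0 + np.dot(x_n, x)) ** 2 for a_n, t_n, x_n in zip(a, t, X)) + b

preds = np.sign([y(x) for x in X])
print(preds)                    # [-1.  1.  1. -1.], i.e. sign(y(x)) = -sign(x1 * x2)
assert np.all(preds == t)
```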