Introduction to SVM and RVM

Introduction to SVM and RVM. Machine Learning Seminar, HUS HVL UIB. Yushu Li, UiB.

Overview
Support vector machine (SVM)
- First introduced by Vapnik et al., 1992
- Extensive literature and wide applications
Relevance vector machine (RVM)
- Introduced by Tipping, M.E., 2001
- Fewer publications, many potential research topics
Both are kernel-based supervised learning methods
- The kernel is a key concept in machine learning
Both use only a few key data points for classification/regression
- SVM: few support vectors
- RVM: few relevance vectors

Support vector machine (SVM)
Supervised learning: classification and regression.
Advantages of the support vector machine:
- Always finds a global optimization solution
- Uses a few support vectors instead of the whole dataset
- Kernel-based mapping tricks handle non-linear boundary cases
- Purely data driven, no need for a priori assumptions about model structure

Outline for the SVM part
Maximal Margin Classifier (MMC)
- Global solution
- Support vectors
Support Vector Classifier (SVC)
- Slack variables
Support Vector Machine (SVM)
- Enlarged feature space
- Kernel trick

Maximal Margin Classifier (MMC)

In MMC the training data are assumed to be perfectly linearly separable.
- The input contains p covariates (p independent variables): X = (X_1, X_2, ..., X_p)^T; the output Y belongs to one of two classes, -1 or +1.
- The training dataset consists of N observations {(y_1, x_1), ..., (y_N, x_N)}: the ith observation is x_i ∈ R^p, x_i = (x_i1, ..., x_ip)^T, with output y_i ∈ {-1, +1}, i = 1, 2, ..., N.
- Let f(X) = β_0 + β_1 X_1 + ... + β_p X_p; we construct the linear separating hyperplane f(X) = 0 as the classifier.
- For a new input x_0 ∈ R^p, if f(x_0) > 0 we predict class +1, otherwise class -1. A minimal numerical sketch of this decision rule follows below.
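Not part of the original slides: a minimal Python sketch of this decision rule, with made-up coefficients β_0 and β purely for illustration.

import numpy as np

# hypothetical coefficients for p = 3 covariates (illustration only)
beta0 = 1.0
beta = np.array([0.5, -2.0, 0.25])

def classify(x0):
    # predict +1 if f(x0) = beta0 + beta^T x0 > 0, otherwise -1
    return 1 if beta0 + beta @ x0 > 0 else -1

print(classify(np.array([1.0, 0.2, -3.0])))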

Maximal margin hyperplane (MMH) (example with p = 2)
[Figure: green points (y = -1) and red points (y = +1) in the (X_1, X_2) plane, with candidate separating hyperplanes.]
- Margin M: the distance from the hyperplane to the nearest training point.
- Intuition: the separating hyperplane should be as far as possible from the data of both classes; the larger M, the better the hyperplane.
- The MMH is the hyperplane with the largest M (panel D).

Optimization problem
The two dashed lines are the margin lines. To find the coefficients β_0 and β = (β_1, ..., β_p)^T of the MMH, we need to solve a maximization problem, which can be rewritten as the following minimization problem:
  min_{β, β_0} (1/2) ||β||^2   subject to   y_i (x_i^T β + β_0) - 1 ≥ 0,   i = 1, ..., N
This problem has a global solution (one reason why we want to construct a linear hyperplane). A solver-based sketch follows below.
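Not from the slides: the primal problem above can be written almost verbatim with a generic convex solver. The sketch below assumes the cvxpy package is available and uses made-up, linearly separable toy data.

import cvxpy as cp
import numpy as np

# toy linearly separable data (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

beta = cp.Variable(2)
beta0 = cp.Variable()
# minimize (1/2)||beta||^2  subject to  y_i (x_i^T beta + beta0) >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)),
                     [cp.multiply(y, X @ beta + beta0) >= 1])
problem.solve()
print("beta:", beta.value, "beta0:", beta0.value)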

Transform to the dual problem
The Lagrangian here is
  L(α, β, β_0) = (1/2)||β||^2 - Σ_{i=1}^{N} α_i [ y_i (x_i^T β + β_0) - 1 ],
where α = (α_1, ..., α_N)^T are the Lagrange multipliers.
The previous minimization problem is equivalent to maximizing the following L_D(α) with respect to α = (α_1, ..., α_N)^T (the dual problem):
  L_D(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,k=1}^{N} α_i α_k y_i y_k x_i^T x_k
The optimal solution for α = (α_1, ..., α_N)^T leads to the optimal solution for β_0 and β = (β_1, ..., β_p).

Inner product <x_i, x_j>
Question: why do we want to solve the dual problem instead of the original optimization problem?
Answer:
a. It helps us identify the support vectors.
b. The inner product <x_i, x_j> = x_i^T x_j appears, which becomes an important element later for non-linearly separable classification problems.

Solving the dual problem
The Lagrangian: L(α, β, β_0) = (1/2)||β||^2 - Σ_{i=1}^{N} α_i [ y_i (x_i^T β + β_0) - 1 ]
Step 1: fix α and minimize L(α, β, β_0) with respect to β and β_0. Setting ∂L/∂β = 0 and ∂L/∂β_0 = 0 gives β = Σ_{i=1}^{N} α_i y_i x_i and Σ_{i=1}^{N} α_i y_i = 0.   (I)
Step 2: substitute the two equations in (I) back into L to get
  L_D(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,k=1}^{N} α_i α_k y_i y_k x_i^T x_k
Step 3: maximize L_D(α) with respect to α.
Step 3 can be solved (with a global solution) by the sequential minimal optimization (SMO) algorithm.
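Not in the slides: a small numpy helper that evaluates the dual objective L_D(α) for given data and multipliers, plus a check of the dual constraints α_i ≥ 0 and Σ_i α_i y_i = 0.

import numpy as np

def dual_objective(alpha, X, y):
    # L_D(alpha) = sum_i alpha_i - (1/2) sum_{i,k} alpha_i alpha_k y_i y_k x_i^T x_k
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ (X @ X.T) @ ay

def dual_feasible(alpha, y, tol=1e-8):
    # dual constraints: alpha_i >= 0 and sum_i alpha_i y_i = 0
    return bool(np.all(alpha >= -tol) and abs(alpha @ y) < tol)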

Support vectors (I)
Once the dual problem is solved, for each α_i, i = 1, ..., N:
- If α_i > 0, the corresponding x_i lies exactly on a margin line (dashed line); these points are the support vectors.
- If x_i lies strictly outside the margin lines, on the correct side of the boundary, then α_i = 0.

Support vectors (II)
The coefficient β = Σ_{i=1}^{N} α_i y_i x_i therefore reduces to β = Σ_{l=1}^{L} α_l y_l x_l, where the x_l are the support vectors and L is much smaller than N!
For a new input x_0:
  f(x_0) = β^T x_0 + β_0 = Σ_{l=1}^{L} α_l y_l x_l^T x_0 + β_0
If f(x_0) > 0, predict class +1, otherwise -1.
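As an aside (assuming scikit-learn is available; not part of the slides), a fitted sklearn.svm.SVC exposes exactly these quantities: support_vectors_ holds the x_l, dual_coef_ holds the products α_l y_l, and intercept_ holds β_0, so the decision function can be rebuilt by hand.

import numpy as np
from sklearn.svm import SVC

# toy linearly separable data (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates the hard margin
x0 = np.array([0.5, -0.3])
# f(x0) = sum_l alpha_l y_l x_l^T x0 + beta_0
f_x0 = clf.dual_coef_ @ clf.support_vectors_ @ x0 + clf.intercept_
print(np.sign(f_x0), clf.decision_function([x0]))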

Advantages and disadvantages of MMC
Advantages:
- maximal margin between the two classes
- global solution
- uses only a few support vectors for prediction
Disadvantages:
- hard margin: all training data must lie outside the margin lines
- cannot separate noisy data (left graph)
- not robust to outliers (right graph)
Soft margin: Support vector classifier (SVC)

Support Vector Classifier (SVC)

Soft margin in SVC
A soft margin allows certain points to:
- lie on the wrong side of the margin lines (green ξ*_4 and red ξ*_1, ξ*_2);
- lie on the wrong side of the separating hyperplane (green ξ*_3 and red ξ*_5).
At most D points are allowed on the wrong side of the hyperplane.

Optimization problem for SVC
Constructing the SVC hyperplane means solving the following optimization problem:
  max_{β_0, β, ε_1, ..., ε_N, M} M   subject to   ||β|| = 1,   y_i (x_i^T β + β_0) ≥ M(1 - ε_i),   ε_i ≥ 0,   Σ_{i=1}^{N} ε_i ≤ D
ε_1, ..., ε_N are slack variables:
- ε_i = 0: the ith observation is on the correct side of the margin.
- ε_i > 0: the ith observation is on the wrong side of the margin.
- ε_i > 1: the ith observation lies on the wrong side of the hyperplane.
Since each observation on the wrong side of the hyperplane has ε_i > 1, at most D observations may violate the hyperplane; D is a hyperparameter chosen by cross-validation (CV). See the software note below.
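A software aside, not from the slides: scikit-learn's SVC does not expose the slack budget D directly; it solves the equivalent penalized form, in which the parameter C multiplies the total slack, so a small C behaves roughly like a large budget D and vice versa.

import numpy as np
from sklearn.svm import SVC

# overlapping classes (hypothetical data), so some slack is unavoidable
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):   # small C ~ large slack budget -> more support vectors
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors")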

Support vectors in SVC
α_i > 0 only if the ith observation lies on or violates the margin line; those observations are the support vectors. Otherwise α_i = 0.
For a new input x_0, predict class +1 if f(x_0) > 0, else -1, where
  f(x_0) = Σ_{i=1}^{N} α_i y_i x_i^T x_0 + β_0 = Σ_{l=1}^{L} α_l y_l x_l^T x_0 + β_0
and the x_l are the support vectors, with L much smaller than N.

Support vectors in SVC
This sparsity makes SVC especially suitable for large data sets.

When a linear boundary cannot work
Disadvantage of SVC: sometimes a linear hyperplane simply will not work in the original p-dimensional input space (left graph, p = 2).
We can map the original data to a higher m-dimensional feature space where the data are separable by a linear hyperplane (right graph, m = 3).
This mapping trick leads to the SVM.

Support Vector Machine

Basic idea of SVM (I)
Transform (map) the data X = (X_1, X_2, ..., X_p)^T from the original p-dimensional input space into an m-dimensional enlarged feature space: X → h(X), where h(X) = (h_1(X), h_2(X), ..., h_m(X))^T with m > p. Then find the unique (globally optimal) separating hyperplane in the m-dimensional space.
Example: p = 2, m = 3 with h_1(X) = X_1^2, h_2(X) = √2 X_1 X_2, h_3(X) = X_2^2.

Inner products
After the transformation, the ith input in the enlarged feature space becomes an m-dimensional vector h(x_i) = (h_1(x_i), h_2(x_i), ..., h_m(x_i))^T. The dual problem is then:
  L_D(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,k=1}^{N} α_i α_k y_i y_k h(x_i)^T h(x_k)
For a new input x_0:
  f(x_0) = β^T h(x_0) + β_0 = Σ_{l=1}^{L} α_l y_l h(x_l)^T h(x_0) + β_0   (L much smaller than N)
If f(x_0) > 0, the predicted output is +1, else -1.

Transformation
With h(x_i) = (h_1(x_i), h_2(x_i), ..., h_m(x_i))^T, instead of computing x_i^T x_j we now need to compute h(x_i)^T h(x_j). Since m can be very high, the computations of h(x_i)^T h(x_j) can be intractable.
In SVM we do not actually choose the transformation basis functions h(x) = (h_1(x), h_2(x), ..., h_m(x))^T. Instead we choose a kernel function K such that K(x_i, x_j) = h(x_i)^T h(x_j).

The kernel trick
A kernel function is a function that corresponds to an inner product in some enlarged feature space.
Example with p = 2: x = (x_1, x_2)^T. Suppose the transformation basis is h(x) = (1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2)^T, so m = 6.
For x_i = (x_i1, x_i2)^T and x_j = (x_j1, x_j2)^T, the inner product h(x_i)^T h(x_j) is:
  h(x_i)^T h(x_j) = 1 + x_i1^2 x_j1^2 + 2 x_i1 x_j1 x_i2 x_j2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 + 2 x_i2 x_j2 = (1 + x_i^T x_j)^2
If we define the kernel function K(x_i, x_j) = (1 + x_i^T x_j)^2, there is no need to compute h(x_i)^T h(x_j) explicitly. A numerical check follows below.
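Not in the slides: a quick numpy check that the explicit feature map above and the kernel K(x_i, x_j) = (1 + x_i^T x_j)^2 give the same inner product.

import numpy as np

def h(x):
    # explicit feature map from the slide: p = 2 -> m = 6
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def K(xi, xj):
    return (1.0 + xi @ xj) ** 2

xi, xj = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(h(xi) @ h(xj), K(xi, xj))   # the two values agree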

Examples of kernel functions
- Linear: K(x_i, x_j) = x_i^T x_j
- Polynomial of degree d: K(x_i, x_j) = (1 + x_i^T x_j)^d
- Gaussian (radial basis function): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
- Sigmoid: K(x_i, x_j) = tanh(β_0 x_i^T x_j + β_1)
All kernel functions K(·,·) must satisfy Mercer's condition. A scikit-learn mapping of these kernels is sketched below.
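As a software aside (assuming scikit-learn; not from the slides), these choices map onto SVC's kernel argument, where gamma plays the role of 1/(2σ^2) in the Gaussian kernel and coef0 the role of the additive constant.

import numpy as np
from sklearn.svm import SVC

models = {
    "linear":  SVC(kernel="linear"),                          # x_i^T x_j
    "poly":    SVC(kernel="poly", degree=3, coef0=1.0),       # (gamma * x_i^T x_j + coef0)^degree
    "rbf":     SVC(kernel="rbf", gamma=0.5),                  # exp(-gamma * ||x_i - x_j||^2)
    "sigmoid": SVC(kernel="sigmoid", gamma=0.5, coef0=0.0),   # tanh(gamma * x_i^T x_j + coef0)
}

# XOR-like toy data with a non-linear boundary (hypothetical)
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
for name, clf in models.items():
    print(name, clf.fit(X, y).score(X, y))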

Kernel tricks
Using a kernel function, the dual problem changes from
  L_D(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,k=1}^{N} α_i α_k y_i y_k h(x_i)^T h(x_k)
to
  L_D(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,k=1}^{N} α_i α_k y_i y_k K(x_i, x_k)
For a new input x_0:
  f(x_0) = β^T h(x_0) + β_0 = Σ_{l=1}^{L} α_l y_l K(x_l, x_0) + β_0
If f(x_0) > 0, the predicted output is +1, else -1.

Multi-class classification
- One versus the rest: train one classifier per class, with all the other classes serving as the negative training samples.
- Hierarchical trees / one versus one. (A scikit-learn sketch of both strategies follows below.)
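Not from the slides, assuming scikit-learn: OneVsRestClassifier implements the one-versus-the-rest strategy, while SVC itself trains all one-versus-one pairs internally for multi-class problems.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                          # 3 classes
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)     # one classifier per class vs the rest
ovo = SVC(kernel="rbf").fit(X, y)                          # fits all one-vs-one pairs internally
print(ovr.predict(X[:5]), ovo.predict(X[:5]))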

SVM regression example
- Purely data driven: no need to assume a regression model.
- Still uses only a few support vectors instead of the whole data set. (A small SVR sketch follows below.)
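Not in the slides: a minimal support vector regression sketch with scikit-learn's SVR, on toy data loosely following the sinc example used later for the RVM comparison.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = np.linspace(-10, 10, 100)            # this grid avoids x = 0, so sin(x)/x is well defined
y = np.sin(x) / x + rng.normal(0, 0.1, x.size)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(x[:, None], y)
print("support vectors used:", len(svr.support_), "out of", len(x))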

Summary of SVM
Advantages: global optimization solution, few support vectors needed for prediction, fast computation, purely data driven, helps avoid overfitting...
Applications:

Disadvantages of SVM
- Predictions are not probabilistic.
- The number of support vectors grows steeply with the size of the training set.
- Cross-validation is needed to choose the hyperparameters D and ε (ε is the hyperparameter in SVR); see the sketch below.
- The kernel function K(·,·) must satisfy Mercer's condition.
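Not from the slides, assuming scikit-learn: cross-validated hyperparameter selection for SVR (its C and ε) with GridSearchCV on the same sinc-style toy data.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = np.linspace(-10, 10, 100)
y = np.sin(x) / x + rng.normal(0, 0.1, x.size)

param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 0.5]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5).fit(x[:, None], y)
print(search.best_params_)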

Relevance vector machine
http://www.miketipping.com/sparsebayes.htm
- Training dataset T = {(y_1, x_1), ..., (y_N, x_N)}.
- Assume p(y | x, w) follows a Gaussian distribution N(f(x), σ^2), with
  f(x) = Σ_{i=1}^{N} w_i K(x, x_i) + β_0   (II)
  where w = (w_1, ..., w_N)^T is the weight vector.
- Based on (II), we can write down the likelihood of the dataset, p(T | w).
- Given a specific prior distribution p(w), RVM computes the posterior probability p(w | T) by Bayes' rule.

RVM
- For many weights the posterior p(w_i | T) becomes infinitely peaked at zero, and the corresponding ith kernel function can be 'pruned away'.
- The remaining L nonzero weights (L much less than N) correspond to training data points called relevance vectors.
- For a new input x_0, use the predictive distribution p(y_0 | x_0, T) for further inference/prediction of y_0:
  p(y_0 | x_0, T) = ∫ p(y_0 | x_0, w) p(w | T) dw
A numerical sketch of the pruning mechanism follows below.
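Not the SparseBayes code linked above and not from the slides: a minimal numpy sketch of RVM regression via iterative evidence re-estimation in the spirit of Tipping (2001). The RBF kernel, its width, the pruning threshold, and the iteration count are illustrative assumptions. Each basis function gets its own prior precision α_i; when α_i grows without bound, the posterior of w_i collapses to zero and that column is pruned, which is the mechanism described above.

import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def rvm_regression(X, t, n_iter=200, prune=1e6):
    # design matrix: a bias column plus one kernel basis function per training point
    Phi = np.hstack([np.ones((len(X), 1)), rbf_kernel(X, X)])
    N = len(t)
    keep = np.arange(Phi.shape[1])
    alpha = np.ones(Phi.shape[1])     # precisions of the Gaussian weight priors
    beta = 1.0 / np.var(t)            # noise precision 1 / sigma^2
    mu = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)   # posterior covariance
        mu = beta * Sigma @ Phi.T @ t                                # posterior mean
        gamma_i = 1.0 - alpha * np.diag(Sigma)                       # how well each weight is determined
        alpha = gamma_i / (mu ** 2 + 1e-12)                          # re-estimate prior precisions
        beta = (N - gamma_i.sum()) / np.sum((t - Phi @ mu) ** 2)     # re-estimate noise precision
        mask = alpha < prune                                         # prune weights peaked at zero
        if not mask.any():
            break
        alpha, Phi, keep, mu = alpha[mask], Phi[:, mask], keep[mask], mu[mask]
    return mu, keep

# usage on sinc-style toy data (illustrative)
rng = np.random.default_rng(2)
X = np.linspace(-10, 10, 100)[:, None]
t = np.sin(X[:, 0]) / X[:, 0] + rng.normal(0, 0.1, 100)
w, kept = rvm_regression(X, t)
print("relevance vectors kept:", len(kept), "out of", X.shape[0] + 1)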

RVM vs SVM
- RVM gives a probabilistic prediction p(y_0 | x_0, T); SVM gives only a point prediction y_0.
- The number of relevance vectors can be much smaller than the number of support vectors.
- RVM does not require tuning a hyperparameter during the training phase, as SVM does.
- The kernel function K(·,·) does not need to satisfy Mercer's condition.

RVM regression vs SVM regression
An example with DGP y = sinc(x) + N(0, sd), where sd = 0.1:
- RVR uses only 7 relevance vectors, while SVR uses 29 support vectors, and RVR has lower error.
- SVR has to use CV to choose the hyperparameters C and ε, while RVR estimates them automatically through its learning procedure.
- However, the training phase of RVM typically involves a highly nonlinear optimization, so only a local optimum can be found.