Stat542 (F11) Statistical Learning: Support Vector Machines

Linear SVM (separable case)

First consider the scenario where the two classes of points are separable. It is desirable for the width (called the margin) between the two dashed lines to be as large as possible, i.e., to make the buffer zone between the two classes as large as possible. Since the margin equals 2/‖β‖, maximizing the margin is equivalent to minimizing ‖β‖², so we can formulate this max-margin problem as follows:

    min_{β,β_0}   (1/2)‖β‖²                                              (1)
    subject to    y_i(β·x_i + β_0) − 1 ≥ 0,   i = 1, …, n,

where β·x_i = β^t x_i denotes the (Euclidean) inner product between two vectors. The constraints ensure that the points lie on the correct side of the dashed lines, i.e.,

    β·x_i + β_0 ≥ +1   for y_i = +1,
    β·x_i + β_0 ≤ −1   for y_i = −1.

If x_i lies on one of the dashed lines, i.e., y_i(β·x_i + β_0) − 1 = 0, then we say that the i-th constraint (or the i-th point) is active. If y_i(β·x_i + β_0) − 1 > 0, we say it is inactive. Clearly there must be at least two active points, one from each class (otherwise we could improve the margin). This type of constrained optimization (1), known as a convex programming problem, has been well studied in the literature.
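
As a concrete illustration (not part of the original notes), the quadratic program (1) can be handed directly to a generic convex solver. The sketch below uses the cvxpy library and a made-up toy dataset; both are illustrative choices.

    import numpy as np
    import cvxpy as cp

    # hypothetical separable toy data: three points per class in R^2
    X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # class +1
                  [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]]) # class -1
    y = np.array([1, 1, 1, -1, -1, -1])

    n, p = X.shape
    beta = cp.Variable(p)
    beta0 = cp.Variable()

    # primal problem (1): minimize (1/2)||beta||^2 s.t. y_i(beta.x_i + beta_0) >= 1
    objective = cp.Minimize(0.5 * cp.sum_squares(beta))
    constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
    cp.Problem(objective, constraints).solve()

    print("beta hat:", beta.value, " beta0 hat:", beta0.value)
    print("margin width 2/||beta||:", 2 / np.linalg.norm(beta.value))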

In practice we actually solve its twin sister, the so-called dual problem,

    max_{λ_1,…,λ_n}   Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x_i · x_j)    (2)
    subject to        Σ_i λ_i y_i = 0,   λ_i ≥ 0,

where the variables are λ_1, …, λ_n and λ_i is associated with the i-th observation. Here are the connections between these two optimization problems, where (β̂_0, β̂) denote the solution of the primal problem (1) and (λ_1, …, λ_n) the solution of the dual problem (2).

Complementarity condition: λ_i {y_i(x_i·β̂ + β̂_0) − 1} = 0. That is, if a point is inactive (i.e., not on a dashed line), the corresponding λ_i must equal 0. The points for which λ_i > 0 are called support vectors.

We can obtain an estimate of the slope β after solving the dual problem from the equality

    β̂ = Σ_i λ_i y_i x_i = Σ_{i ∈ N_s} λ_i y_i x_i,

where N_s denotes the set of support vectors.

Pick any support vector; by the complementarity condition it must satisfy y_i(β̂·x_i + β̂_0) − 1 = 0, so we can solve for β̂_0. Of course, it is numerically safer to obtain an estimate of β_0 from each support vector and then take the mean of all such values.

For any new observation x*, the prediction is

    sign( Σ_{i ∈ N_s} λ_i y_i (x_i · x*) + β̂_0 ).

Note that the classifier depends only on the values (y_i, x_i) of the support vectors, so it is a sparse solution. The classifier is also robust in the sense that if we move an inactive point around within its correct region (i.e., without crossing the dashed line), the estimated classifier stays the same: as long as inactive points stay inactive (so their λ_i = 0), they do not affect the classifier.
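
A minimal sketch, again using cvxpy and the same hypothetical toy data, of solving the dual (2) and then recovering β̂, β̂_0, and a prediction exactly as described above.

    import numpy as np
    import cvxpy as cp

    # same hypothetical toy data as before
    X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
                  [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
    y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

    n = X.shape[0]
    A = y[:, None] * X                  # row i is y_i x_i, so ||A' lam||^2 = sum_ij lam_i lam_j y_i y_j x_i.x_j
    lam = cp.Variable(n)

    # dual problem (2)
    objective = cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(A.T @ lam))
    constraints = [lam >= 0, y @ lam == 0]
    cp.Problem(objective, constraints).solve()

    lam_hat = lam.value
    support = lam_hat > 1e-6                       # support vectors: lambda_i > 0 (up to solver tolerance)
    beta_hat = (lam_hat * y) @ X                   # beta-hat = sum_i lambda_i y_i x_i
    # beta0 from each support vector (y_i(beta.x_i + beta0) = 1), then averaged
    beta0_hat = np.mean(y[support] - X[support] @ beta_hat)

    x_new = np.array([1.5, 1.0])
    decision = (lam_hat[support] * y[support]) @ (X[support] @ x_new) + beta0_hat
    print("prediction at x_new:", np.sign(decision))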

Linear SVM (non-separable case)

What if the two classes of points are not separable? Then we introduce a slack variable ξ_i for each sample and formulate the max-margin problem as follows:

    min_{β,β_0,ξ}   (1/2)‖β‖² + γ Σ_i ξ_i                                  (3)
    subject to      y_i(x_i·β + β_0) − 1 + ξ_i ≥ 0,   ξ_i ≥ 0.

Note that ξ_i > 0 only for samples that are on the wrong side of their dashed line, and ξ_i is automatically set to 0 (by the optimization) for samples on the correct side. The optimization (3) reflects a trade-off between the margin 2/‖β‖ (your gain) and the sum of the positive slack variables (the price you need to pay); γ is a tuning parameter which can be treated as given for now and is often selected by cross-validation in practice.

Similarly to the separable case, we solve the dual problem

    max_{λ_1,…,λ_n}   Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x_i · x_j)
    subject to        Σ_i λ_i y_i = 0,   0 ≤ λ_i ≤ γ.

The connections between the primal and the dual problems are similar to those in the separable case.
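
In practice one rarely codes this by hand; for instance, scikit-learn's SVC with a linear kernel solves the soft-margin problem (3), with its parameter C playing the role of γ here. A minimal sketch on made-up overlapping data:

    import numpy as np
    from sklearn.svm import SVC

    # hypothetical overlapping two-class data
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    # soft-margin linear SVM; sklearn's C plays the role of gamma in (3)
    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    beta_hat = clf.coef_.ravel()           # beta-hat = sum_i lambda_i y_i x_i
    beta0_hat = clf.intercept_[0]
    print("number of support vectors:", len(clf.support_))
    print("margin width 2/||beta||:", 2 / np.linalg.norm(beta_hat))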

In particular, the inactive points (i.e., those on the correct side of the dashed line) have λ_i = 0. So after solving for the λ_i's (only a handful of them will be non-zero), we obtain a sparse classifier that depends only on a subset of the data points (i.e., the support vectors). Although the new optimization (3) is proposed for the non-separable case, we can use it in the separable case too: depending on the magnitude of γ, we might be willing to pay some price to let points cross the dashed lines if doing so leads to a big gain in the margin.
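
A quick, purely illustrative experiment on separable toy data showing this trade-off: as γ (C in scikit-learn) shrinks, more points are allowed inside the margin and the margin widens.

    import numpy as np
    from sklearn.svm import SVC

    # hypothetical separable data: even with separable classes, a small gamma (C)
    # trades margin violations for a wider margin
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-2.0, 0.5, (30, 2)), rng.normal(2.0, 0.5, (30, 2))])
    y = np.array([-1] * 30 + [1] * 30)

    for C in [0.01, 1.0, 100.0]:           # C plays the role of gamma in (3)
        clf = SVC(kernel="linear", C=C).fit(X, y)
        width = 2 / np.linalg.norm(clf.coef_.ravel())
        print(f"C={C}: {len(clf.support_)} support vectors, margin width {width:.2f}")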

Non-linear SVM

In linear SVMs the decision boundaries are linear (i.e., hyperplanes in the feature space). To obtain more flexible classifiers with nonlinear decision boundaries, we can take the following approach: first embed the data points into a large feature space,

    Φ : X → F,   Φ(x) = (φ_1(x), φ_2(x), …),                               (4)

then run the linear SVM in the feature space F. Since a linear function of Φ(x) may be a non-linear function of x, we thereby obtain nonlinear classifiers.

Kernel trick. When running the linear SVM in the feature space, the only quantity we need to evaluate is the inner product

    K_Φ(x_i, x) = ⟨Φ(x), Φ(x_i)⟩.                                          (5)

Note that I did not write the inner product as Φ(x)^t Φ(x_i), since the feature space F might be infinite dimensional (not a finite-dimensional Euclidean space). For example, the mapped feature Φ(x) could be a function, and then ⟨Φ(x), Φ(x_i)⟩ is simply the inner product between two functions from the feature space F (evidently, we have assumed that F is a Hilbert space).

Recall that the linear SVM solves

    max_{λ_1,…,λ_n}   Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x_i · x_j)
    subject to        Σ_i λ_i y_i = 0,   0 ≤ λ_i ≤ γ,

for λ_1, …, λ_n, and then the prediction at a new point x* is given by

    sign(f(x*)) = sign( Σ_{i ∈ N_s} λ_i y_i (x_i · x*) + β̂_0 ).

Now the non-linear SVM solves

    max_{λ_1,…,λ_n}   Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j K(x_i, x_j)
    subject to        Σ_i λ_i y_i = 0,   0 ≤ λ_i ≤ γ,

for λ_1, …, λ_n.
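
To see the kernel trick in action, the snippet below (an illustration, not from the notes) checks that the explicit degree-2 polynomial feature map Φ(x) = (1, √2 x_1, √2 x_2, x_1², √2 x_1 x_2, x_2²) for x ∈ R² reproduces the polynomial kernel K(x, z) = (1 + x·z)², so the inner product (5) in F can be computed without ever forming Φ.

    import numpy as np

    def phi(x):
        # explicit degree-2 polynomial feature map for x in R^2 (illustration only)
        x1, x2 = x
        return np.array([1.0,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1**2, np.sqrt(2) * x1 * x2, x2**2])

    def poly_kernel(x, z, d=2):
        # polynomial kernel K(x, z) = (1 + x.z)^d
        return (1.0 + x @ z) ** d

    x = np.array([1.0, 2.0])
    z = np.array([0.5, -1.0])

    print(phi(x) @ phi(z))        # inner product computed in the feature space F
    print(poly_kernel(x, z))      # same number, computed without ever forming Phi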

The prediction at a new point x* is then given by

    sign(f(x*)) = sign( Σ_{i ∈ N_s} λ_i y_i K(x_i, x*) + β̂_0 ).

The bivariate function K in (5) is often referred to as the reproducing kernel (r.k.) function, or simply the kernel function.¹ We can view K(x, z) as a similarity measure between x and z that generalizes the ordinary Euclidean inner product between x and z. Popular kernels (see p. 434 of the textbook) include the

    d-th degree polynomial:   K(x, z) = (1 + x·z)^d,
    Radial basis (the feature space is infinite-dimensional):   K(x, z) = exp(−‖x − z‖²/c).

¹ It can be shown that a kernel function K(x, z) must be symmetric and positive semi-definite.

SVM as a penalization method

Let f(x) = x·β + β_0 and y_i ∈ {−1, 1}. Then

    min_{β,β_0}   Σ_{i=1}^n [1 − y_i f(x_i)]_+ + ν‖β‖²                      (6)

has the same solution as the linear SVM (3) when the tuning parameter ν is properly chosen (its value will depend on γ in (3)). So the SVM is a special case of the following Loss + Penalty framework:

    min_{β,β_0}   Σ_{i=1}^n L(y_i, f(x_i)) + ν‖β‖².

The loss used in the SVM is called the hinge loss, L(y, f(x)) = [1 − yf(x)]_+.
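
The Loss + Penalty view suggests a direct way to fit the linear SVM: minimize (6) by subgradient descent. The following is a minimal sketch; the toy data and the choices of ν, step size, and iteration count are all illustrative assumptions.

    import numpy as np

    # made-up overlapping two-class data
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
    y = np.array([-1.0] * 50 + [1.0] * 50)

    nu, step, n_iter = 0.1, 0.01, 5000
    beta = np.zeros(X.shape[1])
    beta0 = 0.0

    for t in range(n_iter):
        margins = y * (X @ beta + beta0)
        active = margins < 1                    # points inside the margin or misclassified
        # subgradient of sum_i [1 - y_i f(x_i)]_+ + nu ||beta||^2
        grad_beta = -(y[active, None] * X[active]).sum(axis=0) + 2 * nu * beta
        grad_beta0 = -y[active].sum()
        lr = step / np.sqrt(t + 1)              # diminishing step size
        beta -= lr * grad_beta
        beta0 -= lr * grad_beta0

    train_err = np.mean(np.sign(X @ beta + beta0) != y)
    print("training error of the hinge-loss fit:", train_err)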

Other popular loss functions for classification problems are the (negative) log-likelihood loss,

    L(y, f(x)) = log(1 + e^{−yf(x)}),

the squared error loss,

    L(y, f(x)) = (y − f(x))² = (1 − yf(x))²,

and the 0/1 loss,

    L(y, f(x)) = 1{yf(x) ≤ 0}.

A nonlinear SVM (associated with a kernel function K) solves

    min_f   Σ_{i=1}^n [1 − y_i f(x_i)]_+ + ν‖f‖²_{H_K},                     (7)

where f (usually nonlinear) belongs to a Hilbert space H_K,² which is determined by the kernel function K, and ‖f‖²_{H_K} denotes the corresponding norm.

² Known as the reproducing kernel Hilbert space (RKHS).
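
For a quick side-by-side look, the snippet below evaluates these losses as functions of the margin m = y f(x) on an arbitrary grid of margin values.

    import numpy as np

    m = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])   # margin values y*f(x), chosen arbitrarily
    print("hinge:      ", np.maximum(1 - m, 0))
    print("neg loglik: ", np.log(1 + np.exp(-m)))
    print("squared:    ", (1 - m) ** 2)
    print("0/1:        ", (m <= 0).astype(float))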

Although the function space H_K is very large (possibly infinite dimensional), a beautiful result, known as the representer theorem, shows that the minimizer of (7) is always finite dimensional (with maximal dimension n) and takes the following form:

    arg min_{f ∈ H_K} [ (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + ν‖f‖²_{H_K} ]
        = α_1 K(x, x_1) + ⋯ + α_n K(x, x_n).

Note that this is true for any loss function, so we may replace the hinge loss in (7) by a general L(y, f(x)). For example, if we use the squared error loss, the optimization (7) becomes

    min_α   ‖y − K_{n×n} α‖² + ν α^t K_{n×n} α,

where K_{n×n} is the n × n matrix with (i, j)-th entry equal to K(x_i, x_j), and we have used the fact that

    ‖α_1 K(·, x_1) + ⋯ + α_n K(·, x_n)‖²_{H_K} = α^t K_{n×n} α.

In other words, what a nonlinear SVM does is first create a set of n new predictors K(·, x_i), each of which measures similarity to the i-th sample, and then solve for a linear function of these n predictors in the loss + penalty framework, where the penalty is a ridge-type L_2 penalty.

Wait, we have learned that a ridge penalty won't lead to a sparse solution, so why is the SVM solution sparse? Indeed, unlike the L_1 penalty, the L_2 penalty on α won't lead to a sparse solution. The penalty in the SVM isn't L_1, but the hinge loss function acts like an L_1 penalty, and it is the hinge loss that leads to the sparse solution in α.
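
A short sketch of the squared-error case (i.e., kernel ridge regression) under the radial basis kernel from earlier; the toy data, ν, and kernel width c are all illustrative assumptions. Setting the gradient of ‖y − Kα‖² + ν αᵗKα to zero gives (K + νI)α = y when K is invertible, which the code solves directly.

    import numpy as np

    def rbf_kernel(A, B, c=1.0):
        # K(x, z) = exp(-||x - z||^2 / c), evaluated for all pairs of rows of A and B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / c)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(40, 2))
    y = np.sign(X[:, 0] * X[:, 1])             # a nonlinear (XOR-like) target

    nu = 0.5
    K = rbf_kernel(X, X)                        # K_{n x n}, entries K(x_i, x_j)
    # minimizing ||y - K a||^2 + nu a' K a gives (K + nu I) a = y (when K is invertible)
    alpha = np.linalg.solve(K + nu * np.eye(len(y)), y)

    f_hat = K @ alpha                           # fitted values f(x_i) = sum_j alpha_j K(x_i, x_j)
    print("training error:", np.mean(np.sign(f_hat) != y))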