Lecture 10: A brief introduction to Support Vector Machine

Lecture 10: A brief introduction to Support Vector Machine
Advanced Applied Multivariate Analysis, STAT 2221, Fall 2013
Sungkyu Jung, Department of Statistics, University of Pittsburgh
Xingye Qiao, Department of Mathematical Sciences, Binghamton University, State University of New York
E-mail: sungkyu@pitt.edu
http://www.stat.pitt.edu/sungkyu/aama/

What's the Problem?

Training data: $\{(y_i, x_i),\ i = 1, \dots, n\}$, with $y_i \in \mathcal{Y}$ and $x_i \in S \subset \mathbb{R}^d$.
Goal: a mapping $\phi : S \to \mathcal{Y}$. Inputs: $x \in S$. Outputs: $y \in \mathcal{Y}$.

Regression setting: $\mathcal{Y} = \mathbb{R}$. Predict $y_i$ as $\hat{y}_i = \phi(x_i)$, so that $\sum_i (y_i - \hat{y}_i)^2$ is as small as possible.

(Binary) classification setting: $\mathcal{Y} = \{-1, +1\}$. Predict $y_i$ as $\hat{y}_i = \phi(x_i)$, so that $\sum_i \mathbf{1}\{y_i \neq \hat{y}_i\}$ is as small as possible.

0-1 Loss or Other Loss

Goal: a mapping $\phi : S \to \mathcal{Y}$ that predicts $y_i$ as $\hat{y}_i = \phi(x_i)$, so that $y_i \neq \hat{y}_i$ for as few observations as possible.

Introduce a function of $x$, $f(x)$, and let $\phi(x) = \mathrm{sign}(f(x))$:
$f(x) > 0 \Rightarrow +1$ (positive class; case), $f(x) < 0 \Rightarrow -1$ (negative class; control).

$$\min_f \sum_i \mathbf{1}\{y_i \neq \phi(x_i)\}, \quad \text{or equivalently} \quad \min_f \sum_i \mathbf{1}\{y_i f(x_i) \le 0\}.$$

The 0-1 loss function is $\mathbf{1}\{u \le 0\}$, where $u = y_i f(x_i)$. It may not work: it is not convex! What function looks similar to the 0-1 loss but is convex? One answer: the hinge loss, $[1 - u]_+$.
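As a quick illustration (not part of the original slides), here is a minimal R sketch that evaluates the two losses over a grid of margins $u = y f(x)$; the function names zero_one and hinge are our own, chosen for readability.

# Minimal sketch comparing the 0-1 loss and the hinge loss as functions of
# the margin u = y * f(x).  Function names are illustrative.
zero_one <- function(u) as.numeric(u <= 0)   # 1 if misclassified (u <= 0), else 0
hinge    <- function(u) pmax(1 - u, 0)       # [1 - u]_+

u <- seq(-2, 3, by = 0.01)
plot(u, hinge(u), type = "l", ylab = "loss", main = "Hinge vs. 0-1 loss")
lines(u, zero_one(u), lty = 2)
legend("topright", legend = c("hinge loss", "0-1 loss"), lty = c(1, 2))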

[Figure: plot of the hinge loss, the 0-1 loss, and the DWD loss (C = 1) as functions of u.]

Linear Discrimination

If $f(x) = x'\omega + \beta$ is linear, then $f(x) = 0$ is a hyperplane.
Want: a hyperplane in the middle of the two classes.

Separable Case

[Figures: four scatterplots of a separable two-class example (Class +1 and Class -1) in the (x1, x2) plane.]

Linear Discrimination

If $f(x) = x'\omega + \beta$ is linear, then $f(x) = 0$ is a hyperplane.
Want: a hyperplane in the middle of the two classes.
Want: the margin induced by the hyperplane between these two classes to be large.

Illustration

[Figure: plot of an SVM solution showing the positive class, the negative class, the lines u = 0 and u = 1, and the support vectors.]

Formulation

The margin will be $\frac{2 f(x_{SV})}{\|\omega\|}$. We want to maximize it.
By the definition of support vectors, we also need to make sure that
$$\forall i, \quad y_i f(x_i) \ge y_{SV} f(x_{SV}) = f(x_{SV}).$$
But the norm of $\omega$ can be arbitrary. Rescale the norm of $\omega$ so that $f(x_{SV}) = 1$. So we will try to
$$\max_{\omega,\beta} \frac{2}{\|\omega\|}, \quad \text{s.t. } y_i f(x_i) \ge 1.$$
Equivalently,
$$\min_{\omega,\beta} \frac{\|\omega\|^2}{2}, \quad \text{s.t. } y_i f(x_i) \ge 1.$$
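One step worth spelling out (an addition of ours, in the slide's notation): the distance from a point $x$ to the hyperplane $\{z : z'\omega + \beta = 0\}$ is $|f(x)|/\|\omega\|$, so the gap between the two classes, measured through a support vector $x_{SV}$, is
$$\text{margin} = \frac{2\,f(x_{SV})}{\|\omega\|} \;\longrightarrow\; \frac{2}{\|\omega\|} \quad \text{after rescaling so that } f(x_{SV}) = 1,$$
which is why maximizing the margin is the same as minimizing $\|\omega\|^2/2$ subject to $y_i f(x_i) \ge 1$.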

Non-separable Case

Recall the separable-case problem:
$$\min_{\omega,\beta} \frac{\|\omega\|^2}{2}, \quad \text{s.t. } y_i f(x_i) \ge 1.$$
If satisfying $y_i f(x_i) \ge 1$ for all $i$ is absolutely impossible, introduce a slack variable $\xi_i \ge 0$ for each data vector $x_i$: the constraint $y_i f(x_i) \ge 1$ is relaxed to $y_i f(x_i) \ge 1 - \xi_i$. Here $\xi_i$ represents the violation of correct classification, and we want $\sum_i \xi_i$ to be small. Add it to the objective function, rescaled by a constant $C$:
$$\min_{\omega,\beta,\xi} \left( \frac{\|\omega\|^2}{2} + C \sum_i \xi_i \right), \quad \text{s.t. } y_i f(x_i) \ge 1 - \xi_i,\ \xi_i \ge 0.$$
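A small R sketch (ours, not from the slides) that, for a toy data set and a candidate $(\omega, \beta)$, computes the slacks $\xi_i = \max(0, 1 - y_i f(x_i))$ and the soft-margin objective; all names and values are illustrative.

# Toy data and a candidate linear rule; everything here is illustrative.
set.seed(1)
n <- 20; d <- 2
X <- matrix(rnorm(n * d), n, d)
y <- ifelse(X[, 1] + X[, 2] > 0, +1, -1)

omega <- c(1, 1); beta <- 0; C <- 1           # candidate (omega, beta) and cost C
f  <- as.vector(X %*% omega) + beta           # f(x_i) = x_i' omega + beta
xi <- pmax(0, 1 - y * f)                      # slack: violation of y_i f(x_i) >= 1
objective <- sum(omega^2) / 2 + C * sum(xi)   # soft-margin objective value
objective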

Regularization Framework

The soft-margin problem is equivalent to
$$\min_{\omega,\beta} \left( \frac{\|\omega\|^2}{2} + C \sum_i [1 - y_i f(x_i)]_+ \right),$$
since at the optimum $\xi_i = [1 - y_i f(x_i)]_+$. Or,
$$\min_{\omega,\beta} \frac{1}{n} \sum_i [1 - y_i f(x_i)]_+ + \lambda \frac{\|\omega\|^2}{2}.$$
Or,
$$\min_{\omega,\beta} \frac{1}{n} \sum_i [1 - y_i f(x_i)]_+, \quad \text{s.t. } \|\omega\|^2 \le L.$$

Remarks about SVM

Solved by Quadratic Programming (QP) applied to the dual problem. The final solution is $\omega = \sum_i \alpha_i y_i x_i$. For support vectors, $\alpha_i > 0$; otherwise $\alpha_i = 0$. Important observation: the solution is determined / influenced by the support vectors only.

Solution

How do we solve the SVM? Take the first formulation for the non-separable case as an example:
$$\text{Primal:} \quad \min_{\omega,\beta,\xi} \left( \frac{\|\omega\|^2}{2} + C \sum_i \xi_i \right), \quad \text{s.t. } y_i f(x_i) \ge 1 - \xi_i,\ \xi_i \ge 0.$$
We will derive the dual problem, which is a standard Quadratic Programming (QP) problem, and then use a standard QP package to solve it.

Duality of SVM

Lagrangian:
$$L_S = \frac{\omega'\omega}{2} + \sum_i \left\{ C\xi_i - \alpha_i [y_i(x_i'\omega + \beta) + \xi_i - 1] - \mu_i \xi_i \right\}.$$
Here $\alpha_i, \mu_i \ge 0$ are Lagrange multipliers. The KKT conditions say that $\frac{\partial L_S}{\partial \omega}$, $\frac{\partial L_S}{\partial \beta}$ and $\frac{\partial L_S}{\partial \xi_i}$ all need to equal 0:
$$\omega - \sum_i \alpha_i y_i x_i = 0, \qquad \sum_i \alpha_i y_i = 0, \qquad C - \alpha_i - \mu_i = 0.$$
Moreover (complementary slackness), $\alpha_i [y_i(x_i'\omega + \beta) + \xi_i - 1] = 0$ and $\mu_i \xi_i = 0$.
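To make the next slide's dual explicit, here is the substitution step (our addition, in the slide's notation): plugging $\omega = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$ and $C - \alpha_i - \mu_i = 0$ back into $L_S$ eliminates $\omega$, $\beta$ and $\xi$, leaving a function of $\alpha$ only:
$$L_S = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i'x_j,$$
which is then maximized over $\alpha$ subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$.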

Duality of SVM

Combining the KKT conditions with $\alpha_i \ge 0$, $\mu_i \ge 0$ leads to the dual problem:
$$\text{Dual:} \quad \max_\alpha \; -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i'x_j + \sum_i \alpha_i, \quad \text{s.t. } \sum_i \alpha_i y_i = 0,\ 0 \le \alpha_i \le C.$$
This is a standard QP problem. Solvable. Very fast. There are $n$ unknown variables instead of $d + 1 + n$ as in the primal problem.
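As an illustration (not from the slides), a minimal R sketch that solves this dual on simulated data with the quadprog package; the variable names and the tiny ridge added to keep the matrix positive definite are our own choices, and the sketch assumes quadprog is installed.

# Solve the SVM dual with quadprog::solve.QP on toy data (illustrative sketch).
# quadprog solves: min_b 1/2 b'Db - d'b  s.t.  A'b >= b0 (first meq rows are equalities),
# so we minimize the negative of the dual objective.
library(quadprog)

set.seed(1)
n <- 40; d <- 2
X <- rbind(matrix(rnorm(n/2 * d, mean = +1), ncol = d),
           matrix(rnorm(n/2 * d, mean = -1), ncol = d))
y <- c(rep(+1, n/2), rep(-1, n/2))
Cpar <- 1

K    <- X %*% t(X)                       # Gram matrix of inner products x_i' x_j
Dmat <- (y %o% y) * K + diag(1e-8, n)    # Q_ij = y_i y_j x_i' x_j, plus a small ridge
dvec <- rep(1, n)                        # linear term: sum_i alpha_i
Amat <- cbind(y, diag(n), -diag(n))      # constraints: y'alpha = 0, alpha >= 0, -alpha >= -C
bvec <- c(0, rep(0, n), rep(-Cpar, n))
sol   <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
alpha <- sol$solution
sum(alpha > 1e-6)                        # number of support vectors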

From Dual Solution to Primal Solution

$\omega$ can be found from the first KKT condition: $\omega = \sum_i \alpha_i y_i x_i$.
Also note that $\alpha_i [y_i(x_i'\omega + \beta) + \xi_i - 1] = 0$, $\mu_i \xi_i = 0$, and $C - \alpha_i - \mu_i = 0$. Thus
$$\alpha_i > 0 \;\Rightarrow\; y_i(x_i'\omega + \beta) + \xi_i - 1 = 0, \qquad \alpha_i < C \;\Rightarrow\; \mu_i \neq 0 \;\Rightarrow\; \xi_i = 0.$$
To find $\beta$, take a data vector $(x_i, y_i)$ with $0 < \alpha_i < C$ and calculate
$$\beta = \frac{1}{y_i} - \frac{\xi_i}{y_i} - x_i'\omega = \frac{1}{y_i} - x_i'\omega \quad (\text{since } \xi_i = 0).$$
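Continuing the quadprog sketch above (alpha, X, y and Cpar as computed there), a hedged illustration of this recovery step:

# Recover the primal solution from the dual one (continues the sketch above).
w <- colSums(alpha * y * X)                             # omega = sum_i alpha_i y_i x_i
margin_sv <- which(alpha > 1e-6 & alpha < Cpar - 1e-6)  # points with 0 < alpha_i < C
i <- margin_sv[1]
b <- 1 / y[i] - sum(X[i, ] * w)                         # beta = 1/y_i - x_i' omega (xi_i = 0)
f_hat <- as.vector(X %*% w) + b                         # fitted f(x_i); its sign is the class
mean(sign(f_hat) == y)                                  # training accuracy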

Nonlinear SVM

Why do we learn the dual solution? To implement the SVM, and to extend the SVM to nonlinear classification via the kernel trick.
In the computation of the linear SVM, only the inner products $x_i'x_j = \langle x_i, x_j \rangle$ (or dot products), for $i, j = 1, \dots, n$, are needed.
For a nonlinear SVM, each input vector $x_i$ is mapped to a higher-dimensional feature space, $x_i \mapsto \phi(x_i) \in \mathbb{R}^Q$, and only the inner products $\phi(x_i)'\phi(x_j)$ are needed in the computation.
The kernel trick replaces $\phi(x_i)'\phi(x_j)$ with a kernel $k(x_i, x_j)$; the mapping $\phi$ is then implicitly defined by $k$.

Examples of Popular Kernels

The linear kernel: $k(x, y) = x'y$. This leads to the original, linear SVM.
The polynomial kernel: $k(x, y) = (c + x'y)^d$. We can write down the expansion explicitly, so the mapping $\phi$ can be explicitly expressed.
The Gaussian (radial basis function) kernel: $k(x, y) = \exp\left(-\frac{1}{\sigma^2}\|x - y\|^2\right)$. The feature space where $\phi(x)$ lies is infinite-dimensional (the mapping $\phi$ is only implicitly defined).
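A short R sketch (ours) writing these three kernels as plain functions, just to make the formulas concrete; the parameter values c = 1, d = 2, sigma = 1 are arbitrary.

# The three kernels above, written out as R functions (illustrative only).
k_linear <- function(x, y) sum(x * y)                                 # x'y
k_poly   <- function(x, y, c = 1, d = 2) (c + sum(x * y))^d           # (c + x'y)^d
k_rbf    <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / sigma^2)  # Gaussian

x <- c(1, 2); y <- c(0, 1)
c(linear = k_linear(x, y), poly = k_poly(x, y), gaussian = k_rbf(x, y))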

Software

library(kernlab)
# train the linear SVM
svp <- ksvm(xtrain, ytrain, kernel = "vanilladot")
# nonlinear SVM with Gaussian kernel
svp <- ksvm(xtrain, ytrain, kernel = "rbfdot")
# nonlinear SVM with polynomial kernel
svp <- ksvm(xtrain, ytrain, kernel = "polydot")
# predict labels on the test set
ypred <- predict(svp, xtest)
table(ytest, ypred)
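For completeness, a minimal end-to-end sketch (ours, assuming the kernlab package is installed) that builds the xtrain, ytrain, xtest, ytest objects used above from simulated two-class Gaussian data:

# Simulate two Gaussian classes and run one of the kernlab calls above (illustrative sketch).
library(kernlab)
set.seed(1)
n <- 100
x <- rbind(matrix(rnorm(n, mean = +1), ncol = 2),
           matrix(rnorm(n, mean = -1), ncol = 2))
y <- factor(c(rep(+1, n/2), rep(-1, n/2)))

train  <- sample(nrow(x), 0.7 * nrow(x))
xtrain <- x[train, ];  ytrain <- y[train]
xtest  <- x[-train, ]; ytest  <- y[-train]

svp   <- ksvm(xtrain, ytrain, kernel = "rbfdot", C = 1)  # Gaussian-kernel SVM
ypred <- predict(svp, xtest)
table(ytest, ypred)                                      # confusion matrix on the test set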