Support Vector Machines. Maximizing the Margin

Support Vector Machines

Support vector machines (SVMs) learn a hypothesis:

$$h(x) = b + \sum_{i=1}^{m} y_i \alpha_i k(x, x_i)$$

- $(x_1, y_1), \ldots, (x_m, y_m)$ are the training examples, $y_i \in \{-1, +1\}$.
- $b$ is the bias weight.
- $\alpha_1, \ldots, \alpha_m$ are the Lagrange multipliers, $\alpha_i \ge 0$, a nonnegative weight for each training example.
- $k$ is a kernel function; e.g., we might choose the dot product, $k(x, x') = x \cdot x' = \sum_j x_j x'_j$.
- If $\alpha_i > 0$, then $x_i$ is a support vector.

Maximizing the Margin

The margin is the region $-1 \le h(x) \le 1$. The goal of learning is to maximize the width of the margin, with positive examples at $h(x) \ge +1$ and negative examples at $h(x) \le -1$. An SVM is defined by its support vectors, the positive and negative examples on the margin boundary. The dot-product (linear) kernel function $k(x, x') = x \cdot x' = \sum_j x_j x'_j$ leads to a classifier similar to a perceptron with a margin. Other kernel functions lead to nonlinear decision boundaries.
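As a concrete illustration of the hypothesis above, here is a minimal sketch in Python/NumPy (not part of the original notes; the support vectors, labels, multipliers, and bias are made-up values) that evaluates $h(x)$ with the dot-product kernel:

```python
import numpy as np

# Hypothetical support vectors, labels, multipliers, and bias -- illustration only.
X_sv = np.array([[1.0, 2.0], [3.0, 1.0], [2.0, 3.0]])   # support vectors x_i
y_sv = np.array([+1.0, -1.0, +1.0])                     # labels y_i in {-1, +1}
alpha = np.array([0.5, 0.7, 0.2])                       # Lagrange multipliers alpha_i > 0
b = -0.3                                                # bias weight

def linear_kernel(x, z):
    # dot-product kernel k(x, x') = x . x'
    return x @ z

def h(x):
    # h(x) = b + sum_i y_i alpha_i k(x, x_i)
    return b + sum(a * y * linear_kernel(x, xi)
                   for a, y, xi in zip(alpha, y_sv, X_sv))

x_new = np.array([2.0, 2.0])
print(h(x_new), "-> predicted class:", int(np.sign(h(x_new))))
```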

[Figures: Example SVM for Separable Examples and Example SVM for Nonseparable Examples, each showing the lines $w \cdot x + b = -1$, $w \cdot x + b = 0$, and $w \cdot x + b = +1$.]

[Figures: Example Gaussian Kernel SVM, and Example Gaussian Kernel, Zoomed In.]

Hyperplane Classification

Consider the class of hyperplanes, i.e., a dot product plus a bias:

$$(w \cdot x) + b = 0$$

For linearly separable examples, there is a unique optimal hyperplane defined by maximizing the margin. This can be expressed as:

$$\max_{w, b} \; \min \{ \| x - x_i \| : (w \cdot x) + b = 0, \; i = 1, \ldots, m \}$$

Choose $w$ and $b$ to maximize the minimum distance from an example to the hyperplane.

Hyperplane Example

[Figure from Schölkopf and Smola, Learning with Kernels, MIT Press, 2002.]
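The inner minimum above is just the distance from the closest example to the hyperplane, which has the closed form $|(w \cdot x_i) + b| / \|w\|$. A small NumPy sketch (the hyperplane and the examples are made up):

```python
import numpy as np

# Hypothetical hyperplane (w, b) and examples -- illustration only.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 1.0], [3.0, 0.0], [-1.0, 2.0]])

# Distance from each example to the hyperplane (w . x) + b = 0.
distances = np.abs(X @ w + b) / np.linalg.norm(w)
print(distances)
print("margin of this hyperplane:", distances.min())   # the inner min of the max-min objective
```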

The optimal hyperplane can be found by solving:

minimize $\|w\|^2 / 2$
subject to $y_i ((w \cdot x_i) + b) \ge 1$ for $i \in \{1, \ldots, m\}$

That is, we require the smallest weights such that positive examples have $(w \cdot x_i) + b \ge +1$ and negative examples have $(w \cdot x_i) + b \le -1$. The smallest weights correspond to the maximum margin. Note that:

$$\|w\| = \sqrt{w \cdot w}$$

so the width of the margin is equal to:

$$\frac{2}{\sqrt{w \cdot w}} = \frac{2}{\|w\|}$$

Math Tricks

Convert to a Lagrangian:

$$\frac{\|w\|^2}{2} - \sum_{i=1}^{m} \alpha_i \left( y_i ((w \cdot x_i) + b) - 1 \right)$$

The derivatives are zero when:

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i \quad \text{and} \quad 0 = \sum_{i=1}^{m} \alpha_i y_i$$

Substituting for $w$ in the Lagrangian leads to the objective function:

$$\sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

which we want to maximize subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$.
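To make the conversion concrete, here is a small sketch (Python with NumPy/SciPy on a made-up, linearly separable toy set; a general-purpose solver stands in for the specialized QP solvers the notes do not cover) that maximizes the dual objective subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$, then recovers $w$ and $b$:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
m = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):
    # negative of: sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j)
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=[(0, None)] * m,                                # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                 # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                   # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)      # from y_i ((w . x_i) + b) = 1 on the support vectors
print("w =", w, " b =", b, " support vectors:", np.where(sv)[0])
```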

Substituting for $w$ in the hypothesis $h(x) = (w \cdot x) + b$ leads to:

$$h(x) = b + \sum_{i=1}^{m} y_i \alpha_i (x \cdot x_i)$$

In this case (dot-product kernel, linearly separable examples), $\alpha_i > 0$ implies $y_i ((w \cdot x_i) + b) = 1$, i.e., support vectors will be on the margin boundary.

Kernel Tricks

The kernel trick is to substitute other functions $k(x, x')$ in place of $(x \cdot x')$. The most popular are:

- Polynomial: $(x \cdot x')^d$, where $d \in \{1, 2, \ldots\}$, or $(a (x \cdot x') + c)^d$, where $d \in \{1, 2, \ldots\}$ and $a, c > 0$.
- Gaussian: $\exp\{-\|x - x'\|^2 / (2 \sigma^2)\}$, where $\sigma > 0$.

The trick of a kernel function is efficiently computing the dot product of a large number of basis functions:

$$k(x, x') = (\Phi(x) \cdot \Phi(x')) \quad \text{where} \quad \Phi(x) = (\Phi_1(x), \Phi_2(x), \ldots)$$
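Written as code, these kernels are just functions of two input vectors. A minimal NumPy sketch (the default parameter values are arbitrary choices, not taken from the notes):

```python
import numpy as np

def polynomial_kernel(x, z, a=1.0, c=1.0, d=2):
    # (a (x . z) + c)^d  with d in {1, 2, ...} and a, c > 0
    return (a * (x @ z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2)) with sigma > 0
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), gaussian_kernel(x, z, sigma=1.5))
```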

Example Kernel Function

Let $k(x, x') = ((x \cdot x') + 1)^2$. Then in two dimensions,

$$((u \cdot v) + 1)^2 = (((u_1, u_2) \cdot (v_1, v_2)) + 1)^2 = (u_1 v_1 + u_2 v_2 + 1)^2$$
$$= u_1^2 v_1^2 + u_2^2 v_2^2 + 1 + 2 u_1 u_2 v_1 v_2 + 2 u_1 v_1 + 2 u_2 v_2$$
$$= (u_1^2, u_2^2, 1, \sqrt{2}\, u_1 u_2, \sqrt{2}\, u_1, \sqrt{2}\, u_2) \cdot (v_1^2, v_2^2, 1, \sqrt{2}\, v_1 v_2, \sqrt{2}\, v_1, \sqrt{2}\, v_2)$$

The kernel function obtains the same result as the dot product of the basis functions without explicitly computing all the basis functions. The Gaussian kernel corresponds to an infinite number of basis functions!

Nonlinear Classification

Using a kernel function, the hypothesis becomes:

$$h(x) = b + \sum_{i=1}^{m} y_i \alpha_i k(x, x_i)$$

The Lagrangian conversion results in the problem: find the $\alpha_i$ weights that solve:

maximize $\sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$

The support vectors are $\{x_i \mid \alpha_i > 0\}$. They satisfy $y_i h(x_i) = 1$.
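Returning to the example kernel above, a quick numeric check (a sketch; the test vectors $u$ and $v$ are arbitrary) confirms that the kernel value equals the dot product of the explicit six-dimensional basis expansion:

```python
import numpy as np

def phi(x):
    # explicit basis expansion whose dot product reproduces k(x, x') = ((x . x') + 1)^2
    x1, x2 = x
    return np.array([x1**2, x2**2, 1.0,
                     np.sqrt(2) * x1 * x2, np.sqrt(2) * x1, np.sqrt(2) * x2])

u = np.array([0.5, -1.2])
v = np.array([2.0, 0.3])
print((u @ v + 1.0) ** 2)   # kernel computed directly
print(phi(u) @ phi(v))      # same value via the basis functions
```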

[Figure from Schölkopf and Smola, Learning with Kernels, MIT Press, 2002.]

Soft Margin Classification

It might not be possible or desirable to satisfy:

$$y_i ((w \cdot x_i) + b) \ge 1$$

To allow violations, an error term can be added:

$$y_i ((w \cdot x_i) + b) \ge 1 - \xi_i$$

where $\xi_i \ge 0$, and the problem is to:

$$\text{minimize} \quad \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i$$

where $C$ is chosen by the user. Note that $\sum_{i=1}^{m} \xi_i \ge$ the number of training mistakes. In the dual conversion, replace $\alpha_i \ge 0$ with $0 \le \alpha_i \le C$.
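In practice, the soft-margin problem is usually handed to an off-the-shelf solver. A minimal sketch using scikit-learn (not part of the original notes; the data and the value of $C$ are made up):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, slightly overlapping 2-D data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2)),
               rng.normal(loc=[5.0, 5.0], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C trades off a wide margin against the slack penalty C * sum_i xi_i.
clf = SVC(C=1.0, kernel="linear").fit(X, y)
print("training mistakes:", int(np.sum(clf.predict(X) != y)))
print("number of support vectors:", clf.support_vectors_.shape[0])
```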

[Figure from Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2:121-167, 1998.]

ν-Parameterization

Another type of soft-margin classifier satisfies:

$$y_i ((w \cdot x_i) + b) \ge \rho - \xi_i$$

where $\rho > 0$ is a free variable, and solves:

$$\text{minimize} \quad \frac{\|w\|^2}{2} - \rho + \frac{1}{\nu m} \sum_{i=1}^{m} \xi_i$$

where $\nu$ is chosen by the user. In the dual conversion, $\alpha_i \ge 0$ is replaced with:

$$0 \le \alpha_i \le \frac{1}{\nu m} \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i = 1$$

$\nu \in (0, 1)$ means at least $\nu m$ support vectors.
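The ν form is also available in standard libraries. A short sketch with scikit-learn's NuSVC (again with made-up data; NuSVC implements a ν parameterization in which ν lower-bounds the fraction of support vectors) checks that property empirically:

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[2.0, 2.0], scale=1.2, size=(60, 2)),
               rng.normal(loc=[5.0, 5.0], scale=1.2, size=(60, 2))])
y = np.array([-1] * 60 + [1] * 60)

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=0.5).fit(X, y)
    n_sv = clf.support_vectors_.shape[0]
    # nu acts as a lower bound on the fraction of support vectors
    print(f"nu = {nu:.1f}: {n_sv} support vectors of {len(y)} (at least {nu * len(y):.0f} expected)")
```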

[Figures: decision boundaries of soft-margin SVMs with a Gaussian kernel for several increasing values of C, followed by ν-SVMs with a Gaussian kernel for several values of ν.]

Support Vector Regression

SV classification uses $y \in \{-1, +1\}$, while regression tries to predict $y \in \mathbb{R}$. SV regression uses the $\epsilon$-insensitive loss:

$$|y - z|_\epsilon = \begin{cases} 0 & \text{if } |y - z| \le \epsilon \\ |y - z| - \epsilon & \text{otherwise} \end{cases}$$

or equivalently:

$$|y - z|_\epsilon = \max(0, |y - z| - \epsilon)$$

To obtain a hypothesis $h(x) = b + w \cdot x$:

$$\text{minimize} \quad \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} |y_i - h(x_i)|_\epsilon$$

One can think of SV regression as fitting a tube of radius $\epsilon$ to the data.

[Figure from Schölkopf and Smola, Learning with Kernels, MIT Press, 2002.]
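The ε-insensitive loss is easy to write down directly. A small sketch (the ε value, targets, and predictions are arbitrary):

```python
import numpy as np

def eps_insensitive_loss(y, z, eps=0.1):
    # |y - z|_eps = max(0, |y - z| - eps): zero everywhere inside the tube of radius eps
    return np.maximum(0.0, np.abs(y - z) - eps)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])
print(eps_insensitive_loss(y_true, y_pred, eps=0.1))   # [0.  0.4 0.9]
```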

SV Regression Problem

Applying the usual math tricks results in:

$$h(x) = b + \sum_{i=1}^{m} \alpha_i k(x, x_i)$$

where the $\alpha_i$ weights are found by solving:

$$\text{maximize} \quad -\epsilon \sum_{i=1}^{m} |\alpha_i| + \sum_{i=1}^{m} \alpha_i y_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j)$$

$$\text{subject to} \quad -C \le \alpha_i \le C \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i = 0$$
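As with classification, the regression problem is normally solved with a library. A brief sketch using scikit-learn's SVR on synthetic data (the sine curve, noise level, and parameter values are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy samples of a sine curve (synthetic data).
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon is the tube radius; C penalizes points that fall outside the tube.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(X, y)
print("support vectors used:", reg.support_vectors_.shape[0], "of", len(y))
print("prediction at x = pi:", reg.predict([[np.pi]])[0])
```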