Support vector networks «svn» session of the «machine learning» course Bruno Bouzy bruno.bouzy@parisdescartes.fr www.mi.parisdescartes.fr/~bouzy

Reference These slides present the following paper: C. Cortes, V. Vapnik, «Support-Vector Networks», Machine Learning (1995). They are annotated with my personal view to teach the key ideas of SVN. The outline mostly follows the outline of the paper.

Outline: Abstract; Preamble; «Introduction»; Optimal hyperplane; Soft margin hyperplane; Dot-product in feature space; General features of SVN; Experimental analysis; Conclusion

Abstract A new learning method for two-group classification. Input vectors are non-linearly mapped to a high-dimensional space (the feature space). In the feature space, a linear decision surface is constructed. High generalisation ability. Experiments: good performance on OCR.

Preamble How to separate 2 separable classes? Figure 0a

Preamble Separating 2 separable classes: easy! Simply choose a point somewhere in between two opposite examples No error! Figure 0b

Preamble What may happen when a new example comes? One error... (boo... generalisation is poor) Figure 0c

Preamble How to optimally separate 2 separable classes? Figure 0d

Preamble Optimally separating 2 separable classes: not difficult! Choose the point in the middle: the optimal margin. Figure 0e

Preamble What may happen when a new example comes? A better chance of no error this time (whew... generalisation is better). Figure 0f (optimal margin)

Preamble Support vectors: the optimal point depends only on some examples, the support vectors, and not on the others. Figure 0g (optimal margin, support vector O, support vector X)

Preamble How to separate 2 non-separable classes? huhu... Figure 0h

Preamble Minimizing the number of training errors... Figure 0i (candidate separation points annotated with their error counts)

Preamble ...and minimizing the number of test errors: the soft margin. Figure 0j

Preamble Transforming the problem (dim = 1) into a higher-dimensional one? hoho? Figure 0k (plot of the mapping y = x²)

Preamble ...then separate... huhu? Smart! Figure 0l (linear separation in the (x, y = x²) space)

Preamble summary Separable case: optimal margin for good generalisation; support vectors. Non-separable case: soft margin, minimizing the errors. Key ideas: transform the problem into a higher-dimensional separable problem; linear separation.

«Introduction» To obtain a decision surface corresponding to a polynomial of degree 2, create a feature space with N = n(n+3)/2 coordinates: z_1 = x_1, z_2 = x_2, ..., z_n = x_n; z_{n+1} = x_1², z_{n+2} = x_2², ..., z_{2n} = x_n²; z_{2n+1} = x_1 x_2, z_{2n+2} = x_1 x_3, ..., z_N = x_{n-1} x_n
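
As a concrete illustration (my own, not from the paper), the degree-2 feature map above can be written in a few lines of Python; the function name degree2_features and the toy input are assumptions for the example.

```python
import numpy as np

def degree2_features(x):
    """Map an n-dimensional input x to the N = n(n+3)/2 coordinates z of the slide."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    linear = x                                    # z_1..z_n        = x_1..x_n
    squares = x ** 2                              # z_{n+1}..z_{2n} = x_i^2
    cross = [x[i] * x[j]                          # z_{2n+1}..z_N   = x_i * x_j, i < j
             for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([linear, squares, cross])

z = degree2_features([1.0, 2.0, 3.0])
print(len(z))   # n = 3  ->  N = 3 * (3 + 3) / 2 = 9
```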

«Introduction» Conceptual problem: how to find a hyperplane that generalizes well? Technical problem: how to computationally treat the high-dimensional feature space? (To construct a polynomial of degree 4 or 5 in a 200-dimensional input space, the feature-space dimension can reach about 10^9.)

«Introduction» The support vectors determine the optimal margin (greatest separation). Figure 2: optimal hyperplane and optimal margin

«Introduction» The bound on the error probability depends on the number of support vectors (5): E[Pr(error)] ≤ E[number of support vectors] / (number of training vectors)
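
For example, if on average 50 of 1000 training vectors end up as support vectors, bound (5) gives an expected error probability of at most 50/1000 = 5%.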

«Introduction» Equation of the optimal separating hyperplane: w_0·z + b_0 = 0. Weight vector of the hyperplane (6): w_0 = Σ_i α_i z_i. Linear decision function (7): I(z) = sign(Σ_i α_i z_i·z + b_0)

«Introduction» Figure 3: classification of an unknown pattern by an SVN. The unknown input vector x is non-linearly transformed into the feature space, compared there with the support vectors (in feature space), and the comparison results are combined into the classification.

«Introduction» Figure 4: second kind of classification by an SVN. The unknown input vector x is first compared with the support vectors, the comparison results are then non-linearly transformed and combined into the classification.

Optimal hyperplanes Set of labelled patterns (8): (y_1, x_1), ..., (y_l, x_l), with y_i = ±1. Linearly separable (9): there exist a vector w and a scalar b such that w·x_i + b ≥ 1 if y_i = 1, and w·x_i + b ≤ -1 if y_i = -1. Equivalently (10): y_i (w·x_i + b) ≥ 1 for i = 1, ..., l
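
As a quick numerical sketch (the toy data and the candidate (w, b) are made up), constraint (10) can be checked directly:

```python
import numpy as np

# Hypothetical toy data: two points per class in 2-D, labels y_i = +/-1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w, b = np.array([1.0, 1.0]), 0.0        # a candidate separating hyperplane

# Constraint (10): y_i * (w . x_i + b) >= 1 for every training pattern.
margins = y * (X @ w + b)
print(margins)                           # [4. 6. 4. 4.]
print(np.all(margins >= 1))              # True -> (w, b) separates the data
```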

Optimal hyperplanes Optimal hyperplane (11): w_0·x + b_0 = 0. It separates the training data with maximal margin: it determines the direction w/|w| along which the distance ρ(w, b) (12) between the projections of the training vectors of the two classes is maximal (13): ρ(w, b) = min_{x: y=1} x·w/|w| - max_{x: y=-1} x·w/|w|; ρ(w_0, b_0) = 2/|w_0|
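
A small numerical check (toy data chosen so that the closest points satisfy (10) with equality) that definition (13) and the identity ρ(w_0, b_0) = 2/|w_0| agree:

```python
import numpy as np

# Made-up data whose canonical optimal hyperplane is w0 = (1, 0), b0 = 0.
X_pos = np.array([[1.0, 0.0], [2.0, 1.0]])     # class y = +1
X_neg = np.array([[-1.0, 0.0], [-2.0, 3.0]])   # class y = -1
w0 = np.array([1.0, 0.0])

norm = np.linalg.norm(w0)
# Definition (13): min projection of class +1 minus max projection of class -1.
rho_def = (X_pos @ w0 / norm).min() - (X_neg @ w0 / norm).max()
# Identity for the optimal hyperplane: rho(w0, b0) = 2 / |w0|.
rho_id = 2.0 / norm

print(rho_def, rho_id)   # both equal 2.0
```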

Optimal hyperplanes The optimal hyperplane minimizes w·w under the constraints (10). Constructing an optimal hyperplane is therefore a quadratic programming problem. Vectors for which (10) holds with equality are termed support vectors.

Optimal hyperplanes The optimal hyperplane can be written as a linear combination of training vectors (14): w_0 = Σ_{i=1..l} y_i α_i^0 x_i, with α_i^0 ≥ 0 for all training vectors and α_i^0 > 0 for support vectors only; Λ_0^T = (α_1^0, α_2^0, ..., α_l^0)

Optimal hyperplanes The quadratic programming problem: maximize W(Λ) = Λ^T 1 - ½ Λ^T D Λ (15), subject to Λ ≥ 0 (16) and Λ^T Y = 0 (17), where D_ij = y_i y_j (x_i·x_j) for i, j = 1, ..., l (18)
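
A hedged sketch of solving the dual (15)-(18) for the separable case with a generic SciPy optimiser on made-up data; the paper does not prescribe a particular solver, and a dedicated QP package would normally be preferred:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data.
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, 3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)

D = (y[:, None] * y[None, :]) * (X @ X.T)     # D_ij = y_i y_j (x_i . x_j)   (18)

def neg_W(lam):                               # minimise -W(Lambda)          (15)
    return -(lam.sum() - 0.5 * lam @ D @ lam)

constraints = [{"type": "eq", "fun": lambda lam: lam @ y}]   # Lambda^T Y = 0 (17)
bounds = [(0.0, None)] * l                                   # Lambda >= 0    (16)

res = minimize(neg_W, np.zeros(l), bounds=bounds, constraints=constraints)
alpha = res.x
w0 = (alpha * y) @ X                          # w_0 = sum_i y_i alpha_i x_i  (14)
sv = alpha > 1e-6                             # support vectors: alpha_i > 0
b0 = np.mean(y[sv] - X[sv] @ w0)              # b_0 from the active constraints
print(w0, b0, alpha.round(3))
```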

Optimal hyperplanes When the data (8) can be separated by a hyperplane: W(Λ_0) = 2/ρ_0². When the data (8) cannot be separated by a hyperplane: for any large constant C one can find a Λ such that W(Λ) > C (the maximum is unbounded).

Optimal hyperplanes Solving scheme: divide the training data into portions. Start out by solving the problem on the first portion. If the first portion cannot be separated, then stop: failure. If the first portion is separated, make a new training set from the support vectors of the first portion and the training vectors of the second portion that do not satisfy (10), and continue by solving this new sub-problem, etc.
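
A sketch of this chunking loop, assuming a helper solve_subproblem (hypothetical, not an API from the paper) that solves (15)-(18) on a subset and returns (w, b, support_indices), or (None, None, None) when the subset cannot be separated:

```python
import numpy as np

def chunked_training(X, y, solve_subproblem, n_portions=5):
    portions = np.array_split(np.arange(len(y)), n_portions)
    active = list(portions[0])                       # start with the first portion
    for nxt in portions[1:]:
        w, b, sv = solve_subproblem(X[active], y[active])
        if w is None:                                # sub-problem not separable: failure
            return None
        sv_global = [active[i] for i in sv]          # keep its support vectors...
        violators = [j for j in nxt                  # ...plus new points violating (10)
                     if y[j] * (X[j] @ w + b) < 1]
        active = sv_global + violators
    return solve_subproblem(X[active], y[active])    # final solve on the reduced set
```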

Soft margin hyperplane Case where the training data cannot be separated without error: one may want to separate with the minimal number of errors. Introduce non-negative slack variables (23): ξ_i ≥ 0, i = 1, ..., l. Minimize (21): Φ(ξ) = Σ_i ξ_i, subject to (22): y_i (w·x_i + b) ≥ 1 - ξ_i, i = 1, ..., l

Soft margin hyperplane Minimizing (21), one finds a minimal subset of training errors: (y_i1, x_i1), ..., (y_ik, x_ik). The soft margin hyperplane minimizes (25): ½|w|² + C·F(Σ_i ξ_i), subject to (22) and (23), where F is a monotonic convex function with F(0) = 0 (here F(u) = u²) and C is a constant

Soft margin hyperplane The programming problem: maximize W(Λ, δ) = Λ^T 1 - ½ [Λ^T D Λ + δ²/C] (26), with Λ^T Y = 0 (27), δ ≥ 0 (28), 0 ≤ Λ ≤ δ1 (29); note that δ = max(α_1, α_2, ..., α_l)

Soft margin hyperplane The solution exists and is unique for any data set. This is no longer a quadratic programming problem, but it is still a convex programming problem.
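
For comparison, most modern libraries implement the closely related box-constrained soft-margin dual (0 ≤ α_i ≤ C) rather than the δ formulation above; a hedged scikit-learn sketch on made-up overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(20, 2)),   # class -1
               rng.normal(loc=+1.0, size=(20, 2))])  # class +1, overlapping
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0)    # C trades margin width against training errors
clf.fit(X, y)
print(clf.support_.size)             # number of support vectors
print(clf.score(X, y))               # training accuracy (< 1.0 if the classes overlap)
```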

Dot-product in feature space Transform the n-dimensional input vector x into an N-dimensional feature vector through a function Φ: R^n → R^N, Φ(x_i) = (Φ_1(x_i), Φ_2(x_i), ..., Φ_N(x_i)). Then construct the N-dimensional linear separator w and bias b.

Dot-product in feature space To classify an unknown vector x: transform it into the vector Φ(x) and take the sign of (31): f(x) = w·Φ(x) + b. Since w is a linear combination of support vectors (32): w = Σ_{i=1..l} y_i α_i Φ(x_i), this gives (33): f(x) = Σ_{i=1..l} y_i α_i Φ(x)·Φ(x_i) + b

Dot-product in feature space Dot-product in a Hilbert space (34): Φ(u)·Φ(v) = K(u, v). Any symmetric function K can be expanded (35): K(u, v) = Σ_{i=1..∞} λ_i Φ_i(u) Φ_i(v), where the λ_i are eigenvalues and the Φ_i eigenfunctions of ∫ K(u, v) Φ_i(u) du = λ_i Φ_i(v). To ensure that (34) defines a dot-product, all λ_i ≥ 0.

Dot-product in feature space Mercer theorem: all λ_i ≥ 0 iff ∫∫ K(u, v) g(u) g(v) du dv > 0 for every g such that ∫ g(u)² du < ∞

Dot-product in feature space Functions satisfying the Mercer theorem: K(u, v) = exp(-|u - v|/σ) (36), K(u, v) = (u·v + 1)^d (37). Decision surface: f(x) = Σ_{i=1..l} y_i α_i K(x, x_i) + b. To find the α_i and the x_i, use the same solution scheme with D_ij = y_i y_j K(x_i, x_j), i, j = 1, ..., l
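
A minimal sketch of kernels (36) and (37) and of the kernel decision surface; the α_i, the support vectors and b are assumed to come from solving the dual with D_ij = y_i y_j K(x_i, x_j), e.g. with the SciPy sketch shown earlier:

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):           # (36): K(u, v) = exp(-||u - v|| / sigma)
    return np.exp(-np.linalg.norm(u - v) / sigma)

def poly_kernel(u, v, d=2):                # (37): K(u, v) = (u . v + 1)^d
    return (np.dot(u, v) + 1.0) ** d

def decision(x, support_vectors, labels, alphas, b, kernel=poly_kernel):
    """Decision surface f(x) = sum_i y_i alpha_i K(x, x_i) + b, returned as a sign."""
    s = sum(y_i * a_i * kernel(x, x_i)
            for x_i, y_i, a_i in zip(support_vectors, labels, alphas))
    return np.sign(s + b)
```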

General features of SVN Constructing the decision rules with an SVN is efficient: it follows the soft-margin scheme and yields one unique solution. An SVN is a universal machine: by changing K, one obtains various machines. An SVN controls its generalization ability.

Experimental results blablabla

Conclusion blablabla

Example of XOR blablabla
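
As an illustration of the XOR example (my own sketch, not the slide's content): XOR is not linearly separable in input space, but the degree-2 polynomial kernel (37) separates it.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                  # XOR labels

linear = SVC(kernel="linear", C=1.0).fit(X, y)
poly2 = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=1e6).fit(X, y)  # (u.v + 1)^2

print(linear.score(X, y))   # < 1.0: no linear separator exists for XOR
print(poly2.score(X, y))    # 1.0: separable in the degree-2 feature space
```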