Non-linear Support Vector Machines

Non-linear Support Vector Machines
Andrea Passerini
passerini@disi.unitn.it
Machine Learning

Non-linear Support Vector Machines

Non-linearly separable problems
Hard-margin SVM can address linearly separable problems.
Soft-margin SVM can address linearly separable problems with outliers.
Non-linearly separable problems need a higher expressive power (i.e. more complex feature combinations).
We do not want to lose the advantages of linear separators (i.e. large margin, theoretical guarantees).

Solution
Map input examples into a higher dimensional feature space.
Perform linear classification in this higher dimensional space.

Non-linear Support Vector Machines

Feature map
Φ : X → H
Φ is a function mapping each example to a higher dimensional space H.
Examples x are replaced with their feature mapping Φ(x).
The feature mapping should increase the expressive power of the representation (e.g. introducing features which are combinations of input features).
Examples should be (approximately) linearly separable in the mapped space.

Feature map

Polynomial mapping: maps features to all possible conjunctions (i.e. products) of features:
1. of a certain degree d (homogeneous mapping)
2. up to a certain degree d (inhomogeneous mapping)

Homogeneous (d = 2):   Φ(x_1, x_2) = (x_1^2, x_1 x_2, x_2^2)
Inhomogeneous (d = 2): Φ(x_1, x_2) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)
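
A minimal NumPy sketch of the two degree-2 maps above (the function names are my own, not from the slides):

```python
import numpy as np

def phi_homogeneous_d2(x):
    """Homogeneous degree-2 map: all products of exactly two input features."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2])

def phi_inhomogeneous_d2(x):
    """Inhomogeneous degree-2 map: all products of up to two input features."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1 * x2, x2**2])

x = np.array([2.0, 3.0])
print(phi_homogeneous_d2(x))    # [4. 6. 9.]
print(phi_inhomogeneous_d2(x))  # [2. 3. 4. 6. 9.]
```

For higher degrees and more input features, scikit-learn's PolynomialFeatures computes a closely related inhomogeneous expansion, if that library is available.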

Feature map Φ

Non-linear Support Vector Machines

Linear separation in feature space
The SVM algorithm is applied by just replacing x with Φ(x):
f(x) = w^T Φ(x) + w_0
A linear separation (i.e. hyperplane) in feature space corresponds to a non-linear separation in input space, e.g.:
f(x_1, x_2) = sgn(w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_0)
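
As an illustration of this idea, the sketch below (my own example, not from the slides) trains a plain linear SVM on explicitly mapped points whose classes are separated by a circle; the hyperplane found in feature space is a quadratic decision boundary in input space:

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is available

def phi(X):
    # inhomogeneous degree-2 feature map from the previous slide
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)  # circular boundary: not linearly separable in input space

clf = SVC(kernel="linear", C=1.0).fit(phi(X), y)  # linear SVM, but in feature space
w, w0 = clf.coef_.ravel(), clf.intercept_[0]

# f(x) = sgn(w^T Phi(x) + w_0): linear in H, quadratic in the original inputs
x_new = np.array([[0.2, 0.3], [2.0, 2.0]])
print(np.sign(phi(x_new) @ w + w0))  # expected: [-1.  1.] (inside vs. outside the circle)
```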

Support Vector Regression

Rationale
Retain the combination of regularization and data fitting.
Regularization means smoothness (i.e. smaller weights, lower complexity) of the learned function.
Use a sparsifying loss in order to obtain few support vectors.

Support Vector Regression

ε-insensitive loss
ℓ(f(x), y) = |y − f(x)|_ε = 0 if |y − f(x)| ≤ ε, and |y − f(x)| − ε otherwise.

Tolerates small (up to ε) deviations from the true value (i.e. no penalty).
Defines an ε-tube of insensitiveness around the true values.
This also allows trading off function complexity against data fitting (by tuning the ε value).
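
In code the loss is just a thresholded absolute error; a small NumPy sketch (my own, not from the slides):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """|y - f(x)|_eps: zero inside the eps-tube, linear outside it."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

# deviations of 0.05, 0.3 and 0.3 from the true value 1.0
print(eps_insensitive_loss(np.array([1.0, 1.0, 1.0]),
                           np.array([1.05, 1.3, 0.7])))  # [0.  0.2 0.2]
```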

Support Vector Regression

Optimization problem
min_{w ∈ X, w_0 ∈ ℝ, ξ, ξ^* ∈ ℝ^m}   1/2 ||w||^2 + C Σ_{i=1}^m (ξ_i + ξ_i^*)

subject to:
w^T Φ(x_i) + w_0 − y_i ≤ ε + ξ_i
y_i − (w^T Φ(x_i) + w_0) ≤ ε + ξ_i^*
ξ_i, ξ_i^* ≥ 0

Note
Two constraints for each example, for the upper and lower sides of the tube.
Slack variables ξ_i, ξ_i^* penalize predictions out of the ε-insensitive tube.
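
For concreteness, here is a sketch of this primal problem written with the cvxpy modelling library (an extra dependency, not mentioned in the slides), taking Φ as the identity map on synthetic data; variable names mirror the formulation above:

```python
import numpy as np
import cvxpy as cp  # assumption: cvxpy is available

rng = np.random.default_rng(0)
m, d = 60, 3
Phi = rng.normal(size=(m, d))                        # rows are Phi(x_i); here Phi is the identity map
y = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)
C, eps = 1.0, 0.1

w = cp.Variable(d)
w0 = cp.Variable()
xi = cp.Variable(m, nonneg=True)                     # upper-side slacks
xi_star = cp.Variable(m, nonneg=True)                # lower-side slacks

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_star))
constraints = [Phi @ w + w0 - y <= eps + xi,         # upper side of the tube
               y - (Phi @ w + w0) <= eps + xi_star]  # lower side of the tube
cp.Problem(objective, constraints).solve()
print(w.value, w0.value)
```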

Support Vector Regression

Lagrangian
We include the constraints in the minimization function using Lagrange multipliers (α_i, α_i^*, β_i, β_i^* ≥ 0):

L = 1/2 ||w||^2 + C Σ_{i=1}^m (ξ_i + ξ_i^*) − Σ_{i=1}^m (β_i ξ_i + β_i^* ξ_i^*)
    − Σ_{i=1}^m α_i (ε + ξ_i + y_i − w^T Φ(x_i) − w_0)
    − Σ_{i=1}^m α_i^* (ε + ξ_i^* − y_i + w^T Φ(x_i) + w_0)

Support Vector Regression

Dual formulation
Vanishing the derivatives with respect to the primal variables we obtain:

∂L/∂w = w − Σ_{i=1}^m (α_i^* − α_i) Φ(x_i) = 0   ⟹   w = Σ_{i=1}^m (α_i^* − α_i) Φ(x_i)
∂L/∂w_0 = Σ_{i=1}^m (α_i − α_i^*) = 0
∂L/∂ξ_i = C − α_i − β_i = 0   ⟹   α_i ∈ [0, C]
∂L/∂ξ_i^* = C − α_i^* − β_i^* = 0   ⟹   α_i^* ∈ [0, C]

Support Vector Regression

Dual formulation
Substituting in the Lagrangian we get:

L = 1/2 Σ_{i,j=1}^m (α_i^* − α_i)(α_j^* − α_j) Φ(x_i)^T Φ(x_j)      [= 1/2 ||w||^2]
    + Σ_{i=1}^m ξ_i (C − β_i − α_i) + Σ_{i=1}^m ξ_i^* (C − β_i^* − α_i^*)      [= 0]
    − ε Σ_{i=1}^m (α_i + α_i^*)
    + Σ_{i=1}^m y_i (α_i^* − α_i)
    + w_0 Σ_{i=1}^m (α_i − α_i^*)      [= 0]
    − Σ_{i,j=1}^m (α_i^* − α_i)(α_j^* − α_j) Φ(x_i)^T Φ(x_j)

Simplifying, the first and last terms combine into −1/2 Σ_{i,j=1}^m (α_i^* − α_i)(α_j^* − α_j) Φ(x_i)^T Φ(x_j), yielding the dual problem below.

Support Vector Regression

Dual formulation
max_{α, α^* ∈ ℝ^m}   −1/2 Σ_{i,j=1}^m (α_i^* − α_i)(α_j^* − α_j) Φ(x_i)^T Φ(x_j) − ε Σ_{i=1}^m (α_i^* + α_i) + Σ_{i=1}^m y_i (α_i^* − α_i)

subject to:
Σ_{i=1}^m (α_i − α_i^*) = 0
α_i, α_i^* ∈ [0, C]   ∀ i ∈ [1, m]

Regression function
f(x) = w^T Φ(x) + w_0 = Σ_{i=1}^m (α_i^* − α_i) Φ(x_i)^T Φ(x) + w_0
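
In practice the dual is solved by specialized solvers; as a sketch (my own example, not from the slides), scikit-learn's SVR exposes exactly the quantities appearing in the regression function above, with the kernel k(x_i, x) playing the role of Φ(x_i)^T Φ(x):

```python
import numpy as np
from sklearn.svm import SVR  # assumption: scikit-learn is available

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)

# Only support vectors (points on or outside the eps-tube) enter the expansion
print(len(svr.support_), "support vectors out of", len(X))
# svr.dual_coef_ stores one signed dual coefficient per support vector,
# i.e. the (alpha_i^* - alpha_i) terms up to libsvm's sign convention
print(svr.dual_coef_.shape, svr.intercept_)  # intercept_ corresponds to w_0
print(svr.predict(X[:3]))
```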

Support Vector Regression

Karush-Kuhn-Tucker conditions (KKT)
At the saddle point it holds that for all i:

α_i (ε + ξ_i + y_i − w^T Φ(x_i) − w_0) = 0
α_i^* (ε + ξ_i^* − y_i + w^T Φ(x_i) + w_0) = 0
β_i ξ_i = 0
β_i^* ξ_i^* = 0

Combined with C − α_i − β_i = 0, α_i ≥ 0, β_i ≥ 0 and C − α_i^* − β_i^* = 0, α_i^* ≥ 0, β_i^* ≥ 0 we get:

α_i ∈ [0, C],   α_i^* ∈ [0, C]
α_i = C if ξ_i > 0,   α_i^* = C if ξ_i^* > 0

Support Vector Regression

Support vectors
All patterns within the ε-tube, for which |f(x_i) − y_i| < ε, have α_i, α_i^* = 0 and thus do not contribute to the estimated function f.
Patterns for which either 0 < α_i < C or 0 < α_i^* < C lie on the border of the ε-tube, that is |f(x_i) − y_i| = ε. They are the unbound support vectors.
The remaining training patterns are margin errors (either ξ_i > 0 or ξ_i^* > 0) and lie outside of the ε-insensitive region. They are bound support vectors, with corresponding α_i = C or α_i^* = C.
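
Continuing the scikit-learn sketch above, bound and unbound support vectors can be told apart from the magnitude of the fitted dual coefficients (again assuming libsvm's convention of storing the signed differences of the α's):

```python
import numpy as np
from sklearn.svm import SVR  # assumption: scikit-learn is available

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

C = 1.0
svr = SVR(kernel="rbf", C=C, epsilon=0.1).fit(X, y)

coef = np.abs(svr.dual_coef_.ravel())  # one coefficient per support vector
bound = np.isclose(coef, C)            # |coef| = C: margin errors, outside the eps-tube
print("bound SVs:", int(bound.sum()), "| unbound SVs (on the tube border):", int((~bound).sum()))
```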

Support Vectors

Support Vector Regression: example for decreasing ε

References

Non-linear SVM:
C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2(2), 121-167, 1998.

Support vector regression:
J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004 (Section 7.3.3).

Appendix
Smallest enclosing hypersphere
Support vector ranking

Smallest Enclosing Hypersphere

Rationale
Characterize a set of examples by defining boundaries enclosing them.
Find the smallest hypersphere in feature space enclosing the data points.
Account for outliers by paying a cost for leaving examples out of the sphere.

Usage
One-class classification: model a class when no negative examples exist.
Anomaly/novelty detection: detect test data falling outside of the sphere and return them as novel/anomalous (e.g. intrusion detection systems, monitoring of Alzheimer's patients).

Smallest Enclosing Hypersphere

Optimization problem
min_{R ∈ ℝ, o ∈ H, ξ ∈ ℝ^m}   R^2 + C Σ_{i=1}^m ξ_i

subject to:
||Φ(x_i) − o||^2 ≤ R^2 + ξ_i,   i = 1, ..., m
ξ_i ≥ 0,   i = 1, ..., m

Note
o is the center of the sphere.
R is the radius, which is minimized.
Slack variables ξ_i gather the costs for outliers.

Smallest Enclosing Hypersphere

Lagrangian (α_i, β_i ≥ 0)
L = R^2 + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i (R^2 + ξ_i − ||Φ(x_i) − o||^2) − Σ_{i=1}^m β_i ξ_i

Vanishing the derivatives with respect to the primal variables:
∂L/∂R = 2R (1 − Σ_{i=1}^m α_i) = 0   ⟹   Σ_{i=1}^m α_i = 1
∂L/∂o = 2 Σ_{i=1}^m α_i (Φ(x_i) − o)(−1) = 0   ⟹   o = Σ_{i=1}^m α_i Φ(x_i)   (using Σ_i α_i = 1)
∂L/∂ξ_i = C − α_i − β_i = 0   ⟹   α_i ∈ [0, C]

Smallest Enclosing Hypersphere

Dual formulation
Substituting in the Lagrangian we get:

L = R^2 (1 − Σ_{i=1}^m α_i)      [= 0]
    + Σ_{i=1}^m ξ_i (C − α_i − β_i)      [= 0]
    + Σ_{i=1}^m α_i (Φ(x_i) − Σ_{j=1}^m α_j Φ(x_j))^T (Φ(x_i) − Σ_{h=1}^m α_h Φ(x_h))      [the two inner sums are o]

Expanding the last term:

Σ_{i=1}^m α_i (Φ(x_i) − Σ_j α_j Φ(x_j))^T (Φ(x_i) − Σ_h α_h Φ(x_h))
= Σ_{i=1}^m α_i Φ(x_i)^T Φ(x_i) − 2 Σ_{i=1}^m α_i Φ(x_i)^T Σ_{j=1}^m α_j Φ(x_j) + Σ_{i=1}^m α_i Σ_{j=1}^m α_j Φ(x_j)^T Σ_{h=1}^m α_h Φ(x_h)
= Σ_{i=1}^m α_i Φ(x_i)^T Φ(x_i) − Σ_{i,j=1}^m α_i α_j Φ(x_i)^T Φ(x_j)      (using Σ_i α_i = 1)

Smallest Enclosing Hypersphere

Dual formulation
max_{α ∈ ℝ^m}   Σ_{i=1}^m α_i Φ(x_i)^T Φ(x_i) − Σ_{i,j=1}^m α_i α_j Φ(x_i)^T Φ(x_j)

subject to:
Σ_{i=1}^m α_i = 1
0 ≤ α_i ≤ C,   i = 1, ..., m

Distance function
The squared distance of a point from the center o of the sphere is:
R^2(x) = ||Φ(x) − o||^2 = (Φ(x) − Σ_{i=1}^m α_i Φ(x_i))^T (Φ(x) − Σ_{j=1}^m α_j Φ(x_j))
       = Φ(x)^T Φ(x) − 2 Σ_{i=1}^m α_i Φ(x_i)^T Φ(x) + Σ_{i,j=1}^m α_i α_j Φ(x_i)^T Φ(x_j)
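
As a sketch of how this dual could be solved directly (my own example, using the cvxpy library and a plain linear kernel so that Φ is the identity map):

```python
import numpy as np
import cvxpy as cp  # assumption: cvxpy is available

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
m, C = len(X), 0.5
K = X @ X.T                                   # linear kernel: K_ij = Phi(x_i)^T Phi(x_j) = x_i^T x_j

alpha = cp.Variable(m)
# maximize sum_i alpha_i K_ii - alpha^T K alpha; for the linear kernel alpha^T K alpha = ||X^T alpha||^2
objective = cp.Maximize(np.diag(K) @ alpha - cp.sum_squares(X.T @ alpha))
constraints = [cp.sum(alpha) == 1, alpha >= 0, alpha <= C]
cp.Problem(objective, constraints).solve()

o = X.T @ alpha.value                         # center o = sum_i alpha_i Phi(x_i)
print("center:", o)
```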

Smallest Enclosing Hypersphere

Karush-Kuhn-Tucker conditions (KKT)
At the saddle point it holds that for all i:
β_i ξ_i = 0
α_i (R^2 + ξ_i − ||Φ(x_i) − o||^2) = 0

Support vectors
Unbound support vectors (0 < α_i < C): their images lie on the surface of the enclosing sphere.
Bound support vectors (α_i = C): their images lie outside of the enclosing sphere; they correspond to outliers.
All other points (α_i = 0): their images lie inside the enclosing sphere.

Smallest Enclosing Hypersphere

Smallest Enclosing Hypersphere

Decision function
The radius R* of the enclosing sphere can be computed using the distance function on any unbound support vector x:
(R*)^2 = R^2(x) = Φ(x)^T Φ(x) − 2 Σ_{i=1}^m α_i Φ(x_i)^T Φ(x) + Σ_{i,j=1}^m α_i α_j Φ(x_i)^T Φ(x_j)

A decision function for novelty detection could be:
f(x) = sgn(R^2(x) − (R*)^2)
i.e. positive if the example lies outside of the sphere and negative otherwise.
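
A small NumPy sketch of these two formulas (my own example; the kernel, the α vector and the radius are placeholders purely to illustrate the computation, not the result of an actual dual optimization):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel standing in for Phi(a)^T Phi(b)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def squared_distance(Xnew, Xtrain, alpha, kernel=rbf_kernel):
    """R^2(x) = k(x,x) - 2 sum_i alpha_i k(x_i,x) + sum_ij alpha_i alpha_j k(x_i,x_j)."""
    return (np.diag(kernel(Xnew, Xnew))
            - 2 * kernel(Xnew, Xtrain) @ alpha
            + alpha @ kernel(Xtrain, Xtrain) @ alpha)

rng = np.random.default_rng(0)
Xtrain = rng.normal(size=(30, 2))
alpha = np.full(len(Xtrain), 1.0 / len(Xtrain))          # placeholder alpha (sums to 1)

R2_star = squared_distance(Xtrain, Xtrain, alpha).max()  # placeholder radius; would come from an unbound SV
x_new = np.array([[4.0, 4.0], [0.0, 0.0]])
print(np.sign(squared_distance(x_new, Xtrain, alpha) - R2_star))  # +1 = novel/anomalous, -1 = inside
```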

Support Vector Ranking

Rationale
Order examples by relevance (e.g. email urgency, movie rating).
Learn a scoring function predicting the quality of an example.
Constrain the function to score x_i higher than x_j if x_i is more relevant (pairwise comparisons for training).
Easily formalized as a support vector classification task.

Support Vector Ranking

Optimization problem
min_{w ∈ X, w_0 ∈ ℝ, ξ_{ij} ∈ ℝ}   1/2 ||w||^2 + C Σ_{i,j} ξ_{ij}

subject to:
w^T Φ(x_i) − w^T Φ(x_j) ≥ 1 − ξ_{ij}
ξ_{ij} ≥ 0
∀ i, j : x_i ≺ x_j

Note
There is one constraint for each pair of examples having ordering information (x_i ≺ x_j means that x_i comes before x_j in the ranking).
Examples should be correctly ordered with a large margin.

Support Vector Ranking

Support vector classification on pairs
min_{w ∈ X, w_0 ∈ ℝ, ξ_{ij} ∈ ℝ}   1/2 ||w||^2 + C Σ_{i,j} ξ_{ij}

subject to:
y_{ij} w^T (Φ(x_i) − Φ(x_j)) ≥ 1 − ξ_{ij}      [Φ(x_i) − Φ(x_j) plays the role of Φ(x_{ij})]
ξ_{ij} ≥ 0
∀ i, j : x_i ≺ x_j

where the labels are always positive: y_{ij} = 1.

Support Vector Ranking

Decision function
f(x) = w^T Φ(x)
It is the standard support vector classification function (unbiased, i.e. without w_0).
It represents the score of an example, used to rank it.
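
A sketch of the whole ranking recipe (my own example, not from the slides): pairwise difference vectors are built for every ordered pair and fed to an unbiased linear SVM; since off-the-shelf solvers need two classes, each pair is also added mirrored with label −1, which is equivalent to the y_{ij} = 1 formulation above.

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumption: scikit-learn is available

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
relevance = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.05 * rng.normal(size=40)  # hidden target ranking

# One training example per ordered pair x_i ≺ x_j (x_i ranked above x_j): Phi(x_ij) = Phi(x_i) - Phi(x_j),
# label +1, plus the mirrored pair with label -1 so the solver sees both classes.
pairs, labels = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if relevance[i] > relevance[j]:
            pairs.append(X[i] - X[j]); labels.append(+1)
            pairs.append(X[j] - X[i]); labels.append(-1)

rank_svm = LinearSVC(C=1.0, fit_intercept=False).fit(np.array(pairs), np.array(labels))  # unbiased

w = rank_svm.coef_.ravel()
scores = X @ w                  # f(x) = w^T Phi(x): higher score = ranked higher
print(np.argsort(-scores)[:5])  # indices of the top-5 examples according to the learned ranking
```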