
Non-linear Support Vector Machines

Non-linearly separable problems
- Hard-margin SVM can address linearly separable problems.
- Soft-margin SVM can address linearly separable problems with outliers.
- Non-linearly separable problems need a higher expressive power (i.e. more complex feature combinations).
- We do not want to lose the advantages of linear separators (i.e. large margin, theoretical guarantees).

Solution
- Map input examples into a higher-dimensional feature space.
- Perform linear classification in this higher-dimensional space.

Feature map
- A feature map is a function $\Phi : \mathcal{X} \rightarrow \mathcal{H}$ mapping each example to a higher-dimensional space $\mathcal{H}$.
- Examples $x$ are replaced with their feature mapping $\Phi(x)$.
- The feature mapping should increase the expressive power of the representation (e.g. by introducing features which are combinations of input features).
- Examples should be (approximately) linearly separable in the mapped space.

Polynomial mapping
Maps features to all possible conjunctions (i.e. products) of features:
1. of a certain degree $d$ (homogeneous mapping)
2. up to a certain degree $d$ (inhomogeneous mapping)

For example, with $d = 2$:
$$\text{homogeneous:}\quad \Phi\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_2^2 \end{pmatrix} \qquad\qquad \text{inhomogeneous:}\quad \Phi\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \end{pmatrix}$$
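As a concrete illustration of the two degree-2 polynomial maps above, a minimal numpy sketch (the function names are illustrative, not part of the original notes):

```python
import numpy as np

def phi_homogeneous_d2(x):
    """Homogeneous degree-2 map: (x1, x2) -> (x1^2, x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, x1 * x2, x2 ** 2])

def phi_inhomogeneous_d2(x):
    """Inhomogeneous degree-2 map: (x1, x2) -> (x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

x = np.array([2.0, 3.0])
print(phi_homogeneous_d2(x))    # [4. 6. 9.]
print(phi_inhomogeneous_d2(x))  # [2. 3. 4. 6. 9.]
```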

Linear separation in feature space

[Figure: examples in input space are mapped by the feature map Φ into a feature space where the separating function f(x) is linear.]

- The SVM algorithm is applied by simply replacing x with Φ(x):
$$f(\boldsymbol{x}) = \boldsymbol{w}^T \Phi(\boldsymbol{x}) + w_0$$
- A linear separation (i.e. a hyperplane) in feature space corresponds to a non-linear separation in input space, e.g.:
$$f\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \mathrm{sgn}\big(w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_0\big)$$

Support Vector Regression

Rationale
- Retain the combination of regularization and data fitting.
- Regularization means smoothness (i.e. smaller weights, lower complexity) of the learned function.
- Use a sparsifying loss in order to obtain few support vectors.
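To make the idea concrete, a minimal sketch (assuming scikit-learn is available; the toy dataset and the explicit degree-2 map are illustrative) of linear classification in the mapped feature space, which corresponds to a non-linear (quadratic) boundary in input space:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy data: class +1 inside the unit circle, class -1 outside
# (not linearly separable in the original 2-D input space).
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1, -1)

def phi(X):
    """Homogeneous degree-2 polynomial map applied row-wise."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 ** 2, x1 * x2, x2 ** 2])

# A linear SVM trained on the mapped features yields a quadratic separation in input space.
clf = LinearSVC(C=1.0, max_iter=10000).fit(phi(X), y)
print(clf.score(phi(X), y))
```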

ε-insensitive loss
$$\ell(f(\boldsymbol{x}), y) = |y - f(\boldsymbol{x})|_\epsilon = \begin{cases} 0 & \text{if } |y - f(\boldsymbol{x})| \le \epsilon \\ |y - f(\boldsymbol{x})| - \epsilon & \text{otherwise} \end{cases}$$
- Tolerates small (ε) deviations from the true value (i.e. no penalty).
- Defines an ε-tube of insensitiveness around the true values.
- This also allows trading off function complexity with data fitting (by playing with the value of ε).

Optimization problem
$$\min_{\boldsymbol{w} \in \mathcal{X},\, w_0 \in \mathbb{R},\, \boldsymbol{\xi}, \boldsymbol{\xi}^* \in \mathbb{R}^m} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C\sum_{i=1}^m (\xi_i + \xi_i^*)$$
subject to
$$\boldsymbol{w}^T\Phi(\boldsymbol{x}_i) + w_0 - y_i \le \epsilon + \xi_i$$
$$y_i - (\boldsymbol{w}^T\Phi(\boldsymbol{x}_i) + w_0) \le \epsilon + \xi_i^*$$
$$\xi_i, \xi_i^* \ge 0$$

Note
- There are two constraints for each example, one for the upper and one for the lower side of the tube.
- Slack variables ξ_i, ξ_i* penalize predictions outside of the ε-insensitive tube.

Lagrangian
We include the constraints in the minimization function using Lagrange multipliers (α_i, α_i*, β_i, β_i* ≥ 0):
$$L = \frac{1}{2}\|\boldsymbol{w}\|^2 + C\sum_{i=1}^m(\xi_i + \xi_i^*) - \sum_{i=1}^m(\beta_i\xi_i + \beta_i^*\xi_i^*) - \sum_{i=1}^m \alpha_i\big(\epsilon + \xi_i + y_i - \boldsymbol{w}^T\Phi(\boldsymbol{x}_i) - w_0\big) - \sum_{i=1}^m \alpha_i^*\big(\epsilon + \xi_i^* - y_i + \boldsymbol{w}^T\Phi(\boldsymbol{x}_i) + w_0\big)$$
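A small numpy sketch of the ε-insensitive loss defined above (the function name is illustrative):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """ell(f(x), y) = max(0, |y - f(x)| - eps): zero inside the eps-tube."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])
print(eps_insensitive_loss(y_true, y_pred, eps=0.1))  # [0.  0.4 0.9]
```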

Vanishing the derivatives with respect to the primal variables we obtain:
$$\frac{\partial L}{\partial \boldsymbol{w}} = \boldsymbol{w} - \sum_{i=1}^m(\alpha_i^* - \alpha_i)\Phi(\boldsymbol{x}_i) = 0 \;\;\Rightarrow\;\; \boldsymbol{w} = \sum_{i=1}^m(\alpha_i^* - \alpha_i)\Phi(\boldsymbol{x}_i)$$
$$\frac{\partial L}{\partial w_0} = \sum_{i=1}^m(\alpha_i - \alpha_i^*) = 0$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\;\Rightarrow\;\; \alpha_i \in [0, C]$$
$$\frac{\partial L}{\partial \xi_i^*} = C - \alpha_i^* - \beta_i^* = 0 \;\;\Rightarrow\;\; \alpha_i^* \in [0, C]$$

Substituting in the Lagrangian we get:
$$L = \underbrace{\frac{1}{2}\sum_{i,j=1}^m(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}_j)}_{\frac{1}{2}\|\boldsymbol{w}\|^2} + \underbrace{\sum_{i=1}^m\xi_i(C - \beta_i - \alpha_i)}_{=0} + \underbrace{\sum_{i=1}^m\xi_i^*(C - \beta_i^* - \alpha_i^*)}_{=0}$$
$$\quad - \epsilon\sum_{i=1}^m(\alpha_i + \alpha_i^*) + \sum_{i=1}^m y_i(\alpha_i^* - \alpha_i) + \underbrace{w_0\sum_{i=1}^m(\alpha_i - \alpha_i^*)}_{=0} - \sum_{i,j=1}^m(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}_j)$$

which leads to the dual problem:
$$\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}^* \in \mathbb{R}^m} \; -\frac{1}{2}\sum_{i,j=1}^m(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}_j) - \epsilon\sum_{i=1}^m(\alpha_i^* + \alpha_i) + \sum_{i=1}^m y_i(\alpha_i^* - \alpha_i)$$
subject to
$$\sum_{i=1}^m(\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i, \alpha_i^* \in [0, C] \quad \forall i \in [1, m]$$
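To make the dual objective concrete, a small numpy sketch (names are illustrative) that evaluates it for candidate multipliers, given a Gram matrix K with K[i, j] = Φ(x_i)^T Φ(x_j):

```python
import numpy as np

def svr_dual_objective(alpha, alpha_star, K, y, eps):
    """-1/2 (a*-a)^T K (a*-a) - eps * sum(a* + a) + y^T (a*-a)."""
    d = alpha_star - alpha
    return -0.5 * d @ K @ d - eps * np.sum(alpha_star + alpha) + y @ d

def svr_dual_feasible(alpha, alpha_star, C, tol=1e-8):
    """Check the dual constraints: sum_i (a_i - a*_i) = 0 and 0 <= a_i, a*_i <= C."""
    in_box = np.all((alpha >= -tol) & (alpha <= C + tol) &
                    (alpha_star >= -tol) & (alpha_star <= C + tol))
    return abs(np.sum(alpha - alpha_star)) < tol and in_box
```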

Regression function
$$f(\boldsymbol{x}) = \boldsymbol{w}^T\Phi(\boldsymbol{x}) + w_0 = \sum_{i=1}^m(\alpha_i^* - \alpha_i)\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}) + w_0$$

Karush-Kuhn-Tucker conditions (KKT)
At the saddle point it holds that for all i:
$$\alpha_i\big(\epsilon + \xi_i + y_i - \boldsymbol{w}^T\Phi(\boldsymbol{x}_i) - w_0\big) = 0$$
$$\alpha_i^*\big(\epsilon + \xi_i^* - y_i + \boldsymbol{w}^T\Phi(\boldsymbol{x}_i) + w_0\big) = 0$$
$$(C - \alpha_i)\,\xi_i = 0$$
$$(C - \alpha_i^*)\,\xi_i^* = 0$$

Support vectors
- All patterns within the ε-tube, for which |f(x_i) − y_i| < ε, have α_i = α_i* = 0 and thus do not contribute to the estimated function f.
- Patterns for which either 0 < α_i < C or 0 < α_i* < C are on the border of the ε-tube, that is |f(x_i) − y_i| = ε. They are the unbound support vectors.
- The remaining training patterns are margin errors (either ξ_i > 0 or ξ_i* > 0) and lie outside of the ε-insensitive region. They are bound support vectors, with corresponding α_i = C or α_i* = C.
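A small sketch (illustrative names; it assumes a kernel function and the dual solutions alpha, alpha_star are already available) of the resulting regression function and of the classification of training points into non-SVs, unbound SVs and bound SVs:

```python
import numpy as np

def svr_predict(x, X_train, alpha, alpha_star, w0, kernel):
    """f(x) = sum_i (alpha*_i - alpha_i) k(x_i, x) + w0."""
    coeffs = alpha_star - alpha
    return sum(c * kernel(xi, x) for c, xi in zip(coeffs, X_train)) + w0

def sv_types(alpha, alpha_star, C, tol=1e-8):
    """Classify each training point from its dual variables
    (at the optimum at most one of alpha_i, alpha*_i is non-zero)."""
    a = np.maximum(alpha, alpha_star)
    labels = np.full(len(a), "non-SV (inside the tube)", dtype=object)
    labels[(a > tol) & (a < C - tol)] = "unbound SV (on the tube border)"
    labels[a >= C - tol] = "bound SV (outside the tube)"
    return labels
```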

[Figure: support vector regression, example of the fitted function for decreasing values of ε.]

Smallest enclosing hypersphere (support vector data description)

Rationale
- Characterize a set of examples by defining a boundary enclosing them.
- Find the smallest hypersphere in feature space enclosing the data points.
- Account for outliers by paying a cost for leaving examples out of the sphere.

Usage
- One-class classification: model a class when no negative examples exist.
- Anomaly/novelty detection: detect test data falling outside of the sphere and return them as novel/anomalous (e.g. intrusion detection systems, monitoring of Alzheimer's patients).
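For practical use, a minimal sketch with scikit-learn (assumed available; data and parameters are illustrative). Note that sklearn's OneClassSVM implements the closely related ν-SVM formulation rather than the enclosing-hypersphere problem written below, but with an RBF kernel the two approaches are known to give equivalent solutions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # "normal" data only
X_test = np.array([[0.1, -0.2], [4.0, 4.0]])               # one inlier, one outlier

# nu upper-bounds the fraction of training outliers (a role similar to C below).
detector = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)
print(detector.predict(X_test))   # +1 = inside the learned region, -1 = novel/anomalous
```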

Optimization problem
$$\min_{R \in \mathbb{R},\, \boldsymbol{o} \in \mathcal{H},\, \boldsymbol{\xi} \in \mathbb{R}^m} \; R^2 + C\sum_{i=1}^m \xi_i$$
subject to
$$\|\Phi(\boldsymbol{x}_i) - \boldsymbol{o}\|^2 \le R^2 + \xi_i, \quad i = 1, \dots, m$$
$$\xi_i \ge 0, \quad i = 1, \dots, m$$

Note
- o is the center of the sphere.
- R is the radius, which is minimized.
- Slack variables ξ_i gather the costs for outliers.

Lagrangian (α_i, β_i ≥ 0)
$$L = R^2 + C\sum_{i=1}^m\xi_i - \sum_{i=1}^m\alpha_i\big(R^2 + \xi_i - \|\Phi(\boldsymbol{x}_i) - \boldsymbol{o}\|^2\big) - \sum_{i=1}^m\beta_i\xi_i$$

Vanishing the derivatives with respect to the primal variables:
$$\frac{\partial L}{\partial R} = 2R\Big(1 - \sum_{i=1}^m\alpha_i\Big) = 0 \;\;\Rightarrow\;\; \sum_{i=1}^m\alpha_i = 1$$
$$\frac{\partial L}{\partial \boldsymbol{o}} = 2\sum_{i=1}^m\alpha_i\big(\Phi(\boldsymbol{x}_i) - \boldsymbol{o}\big)(-1) = 0 \;\;\Rightarrow\;\; \boldsymbol{o} = \sum_{i=1}^m\alpha_i\Phi(\boldsymbol{x}_i) \quad \Big(\text{using } \sum_{i=1}^m\alpha_i = 1\Big)$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\;\Rightarrow\;\; \alpha_i \in [0, C]$$

Substituting back into the Lagrangian:
$$L = \underbrace{R^2\Big(1 - \sum_{i=1}^m\alpha_i\Big)}_{=0} + \underbrace{\sum_{i=1}^m\xi_i(C - \alpha_i - \beta_i)}_{=0} + \sum_{i=1}^m\alpha_i\Big(\Phi(\boldsymbol{x}_i) - \underbrace{\sum_{j=1}^m\alpha_j\Phi(\boldsymbol{x}_j)}_{\boldsymbol{o}}\Big)^T\Big(\Phi(\boldsymbol{x}_i) - \underbrace{\sum_{h=1}^m\alpha_h\Phi(\boldsymbol{x}_h)}_{\boldsymbol{o}}\Big)$$

Expanding the remaining term:
$$\sum_{i=1}^m \alpha_i\Big(\Phi(\boldsymbol{x}_i) - \sum_{j=1}^m\alpha_j\Phi(\boldsymbol{x}_j)\Big)^T\Big(\Phi(\boldsymbol{x}_i) - \sum_{h=1}^m\alpha_h\Phi(\boldsymbol{x}_h)\Big) =$$
$$= \sum_{i=1}^m\alpha_i\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}_i) - 2\sum_{i=1}^m\alpha_i\Phi(\boldsymbol{x}_i)^T\sum_{j=1}^m\alpha_j\Phi(\boldsymbol{x}_j) + \underbrace{\sum_{i=1}^m\alpha_i}_{=1}\,\sum_{j=1}^m\alpha_j\Phi(\boldsymbol{x}_j)^T\sum_{h=1}^m\alpha_h\Phi(\boldsymbol{x}_h)$$
$$= \sum_{i=1}^m\alpha_i\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}_i) - \sum_{i,j=1}^m\alpha_i\alpha_j\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}_j)$$

The resulting dual problem is:
$$\max_{\boldsymbol{\alpha}\in\mathbb{R}^m} \; \sum_{i=1}^m\alpha_i\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}_i) - \sum_{i,j=1}^m\alpha_i\alpha_j\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}_j)$$
subject to
$$\sum_{i=1}^m\alpha_i = 1, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, m.$$

Distance function
The distance of a point from the center o of the sphere is:
$$R^2(\boldsymbol{x}) = \|\Phi(\boldsymbol{x}) - \boldsymbol{o}\|^2 = \Big(\Phi(\boldsymbol{x}) - \sum_{i=1}^m\alpha_i\Phi(\boldsymbol{x}_i)\Big)^T\Big(\Phi(\boldsymbol{x}) - \sum_{j=1}^m\alpha_j\Phi(\boldsymbol{x}_j)\Big)$$
$$= \Phi(\boldsymbol{x})^T\Phi(\boldsymbol{x}) - 2\sum_{i=1}^m\alpha_i\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}) + \sum_{i,j=1}^m\alpha_i\alpha_j\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}_j)$$

Karush-Kuhn-Tucker conditions (KKT)
At the saddle point it holds that for all i:
$$\beta_i\xi_i = 0$$
$$\alpha_i\big(R^2 + \xi_i - \|\Phi(\boldsymbol{x}_i) - \boldsymbol{o}\|^2\big) = 0$$

Support vectors
- Unbound support vectors (0 < α_i < C): their images lie on the surface of the enclosing sphere.
- Bound support vectors (α_i = C): their images lie outside of the enclosing sphere; they correspond to outliers.
- All other points (α_i = 0): their images lie inside the enclosing sphere.
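A minimal end-to-end sketch of this dual problem and of the distance function, assuming numpy and cvxopt are available; the kernel choice, toy data, and helper names (svdd_dual, distance2) are illustrative, and the code assumes at least one unbound support vector exists:

```python
import numpy as np
from cvxopt import matrix, solvers

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def gram(X, kernel):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def svdd_dual(K, C):
    """Solve max_a sum_i a_i K_ii - a^T K a, s.t. sum_i a_i = 1, 0 <= a_i <= C,
    rewritten in cvxopt's standard QP form: min 1/2 a^T P a + q^T a, G a <= h, A a = b."""
    m = K.shape[0]
    P, q = matrix(2.0 * K), matrix(-np.diag(K).reshape(-1, 1))
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))
    h = matrix(np.concatenate([np.zeros(m), C * np.ones(m)]).reshape(-1, 1))
    A, b = matrix(np.ones((1, m))), matrix(np.ones((1, 1)))
    solvers.options["show_progress"] = False
    return np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

def distance2(x, X, alpha, K, kernel):
    """R^2(x) = k(x,x) - 2 sum_i a_i k(x_i,x) + sum_ij a_i a_j k(x_i,x_j)."""
    k_x = np.array([kernel(xi, x) for xi in X])
    return kernel(x, x) - 2.0 * alpha @ k_x + alpha @ K @ alpha

# Toy usage: a cluster around the origin plus one far-away outlier.
X = np.vstack([np.random.default_rng(0).normal(size=(30, 2)), [[6.0, 6.0]]])
C = 0.1
K = gram(X, rbf)
alpha = svdd_dual(K, C)
unbound = np.where((alpha > 1e-5) & (alpha < C - 1e-5))[0]    # on the sphere surface
R2 = distance2(X[unbound[0]], X, alpha, K, rbf)               # squared radius
outliers = [i for i in range(len(X)) if distance2(X[i], X, alpha, K, rbf) > R2 + 1e-6]
print("points outside the sphere (bound SVs):", outliers)
```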

Decision function
The squared radius (R*)² of the enclosing sphere can be computed by applying the distance function to any unbound support vector x*:
$$(R^*)^2 = R^2(\boldsymbol{x}^*) = \Phi(\boldsymbol{x}^*)^T\Phi(\boldsymbol{x}^*) - 2\sum_{i=1}^m\alpha_i\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}^*) + \sum_{i,j=1}^m\alpha_i\alpha_j\Phi(\boldsymbol{x}_i)^T\Phi(\boldsymbol{x}_j)$$

A decision function for novelty detection could be:
$$f(\boldsymbol{x}) = \mathrm{sgn}\big(R^2(\boldsymbol{x}) - (R^*)^2\big)$$
i.e. positive if the example lies outside of the sphere and negative otherwise.

Support Vector Ranking

Rationale
- Order examples according to relevance (e.g. e-mail urgency, movie rating).
- Learn a scoring function predicting the quality of an example.
- Constrain the function to score x_i higher than x_j if the former is more relevant (pairwise comparisons are used for training).
- Easily formalized as a support vector classification task, as shown in the formulation below.

Support Vector Ranking

Optimization problem
$$\min_{\boldsymbol{w}\in\mathcal{X},\, w_0\in\mathbb{R},\, \xi_{i,j}\in\mathbb{R}} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C\sum_{i,j}\xi_{i,j}$$
subject to
$$\boldsymbol{w}^T\Phi(\boldsymbol{x}_i) - \boldsymbol{w}^T\Phi(\boldsymbol{x}_j) \ge 1 - \xi_{i,j}$$
$$\xi_{i,j} \ge 0 \qquad \forall i, j : \boldsymbol{x}_i \prec \boldsymbol{x}_j$$

Note
- There is one constraint for each pair of examples having ordering information (x_i ≺ x_j means that the former comes first in the ranking).
- Examples should be correctly ordered with a large margin.

Support vector classification on pairs
The problem is equivalent to support vector classification on the pairwise differences Φ(x_{ij}) = Φ(x_i) − Φ(x_j):
$$\min_{\boldsymbol{w}\in\mathcal{X},\, w_0\in\mathbb{R},\, \xi_{i,j}\in\mathbb{R}} \; \frac{1}{2}\|\boldsymbol{w}\|^2 + C\sum_{i,j}\xi_{i,j}$$
subject to
$$y_{i,j}\,\boldsymbol{w}^T\underbrace{\big(\Phi(\boldsymbol{x}_i) - \Phi(\boldsymbol{x}_j)\big)}_{\Phi(\boldsymbol{x}_{ij})} \ge 1 - \xi_{i,j}$$
$$\xi_{i,j} \ge 0 \qquad \forall i, j : \boldsymbol{x}_i \prec \boldsymbol{x}_j$$
where the labels are always positive, y_{i,j} = 1.

Decision function
$$f(\boldsymbol{x}) = \boldsymbol{w}^T\Phi(\boldsymbol{x})$$
- This is a standard (unbiased) support vector classification function.
- It represents the score of an example, used to rank it.
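A minimal sketch of the pairwise reduction (assuming scikit-learn; the toy data and helper names are illustrative). As a practical adaptation of the all-positive-label formulation above, both orientations of each difference vector are added so the classifier sees two classes; for a linear, unbiased model this is equivalent:

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, relevance):
    """Build difference vectors for every ordered pair (i more relevant than j)."""
    diffs, labels = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if relevance[i] > relevance[j]:
                diffs.append(X[i] - X[j]); labels.append(+1)
                diffs.append(X[j] - X[i]); labels.append(-1)
    return np.array(diffs), np.array(labels)

# Toy data: relevance grows with the first feature (grades 0, 1, 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
relevance = (X[:, 0] > 0).astype(int) + (X[:, 0] > 1).astype(int)

D, y = pairwise_transform(X, relevance)
# fit_intercept=False matches the unbiased decision function f(x) = w^T x.
ranker = LinearSVC(C=1.0, fit_intercept=False, max_iter=10000).fit(D, y)

scores = X @ ranker.coef_.ravel()   # f(x) = w^T x, the ranking score of each example
ranking = np.argsort(-scores)       # indices from most to least relevant
```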