Support Vector Machines


Paul Hofmarcher, Stefan Theussl (WU Wien), Vienna, June 2010

Outline: separating hyperplanes in the linearly separable case; soft-margin hyperplanes in the non-linearly separable case.

Support vector machines (SVMs) are a learning method for (binary) classification (Vapnik, 1979, 1995). The idea is to find a hyperplane which separates the data perfectly into two classes. Since data are often not linearly separable, they are mapped into a higher-dimensional space: the kernel trick and the kernel-induced feature space.

Hyperplanes

Given $n$ training data $\{x_i, y_i\}$ for $i = 1,\dots,n$ with $x_i \in \mathbb{R}^k$ and $y_i \in \{-1, 1\}$, define a hyperplane by
$$L = \{x : f(x) = x^T\beta + \beta_0 = 0\}. \qquad (1)$$
A classification rule induced by $f(x)$ is given by $G(x) = \mathrm{sign}[x^T\beta + \beta_0]$.
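A minimal sketch of this classification rule in code (the array shapes and the convention that sign(0) maps to +1 are illustrative assumptions, not part of the slides):

```python
import numpy as np

def f(X, beta, beta0):
    """Signed scores f(x) = x^T beta + beta_0 for each row x of X."""
    return X @ beta + beta0

def G(X, beta, beta0):
    """Classification rule G(x) = sign[x^T beta + beta_0], with labels in {-1, +1}."""
    return np.where(f(X, beta, beta0) >= 0, 1, -1)
```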

Hyperplanes II

$f(x)$ in Equation (1) gives the signed distance from a point $x$ to the hyperplane:
- For two points $x_1$ and $x_2$ in $L$, $\beta^T(x_1 - x_2) = 0$, and hence $\beta^* = \beta/\|\beta\|$ is the unit vector normal to the surface of $L$.
- For any point $x_0$ in $L$, we have $\beta^T x_0 = -\beta_0$.
- The signed distance of any point $x$ to $L$ is
$$\beta^{*T}(x - x_0) = \frac{1}{\|\beta\|}\,(\beta^T x + \beta_0) = \frac{1}{\|f'(x)\|}\, f(x). \qquad (2)$$
Hence $f(x)$ is proportional to the signed distance from $x$ to the hyperplane $f(x) = 0$.

Linearly separable I

The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class. This is equivalent to
$$\max_{\beta_0, \beta}\; C \quad \text{s.t.}\quad y_i(x_i^T\beta + \beta_0) \ge C\,\|\beta\| \qquad (3)$$
$$\min_{\beta_0, \beta}\; \frac{1}{2}\|\beta\|^2 \quad \text{s.t.}\quad y_i(x_i^T\beta + \beta_0) \ge 1 \;\;\forall i \qquad (4)$$
In light of (2), this defines a margin around the linear decision boundary of thickness $1/\|\beta\|$, and it is a convex optimization problem!
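Problem (4) is a quadratic program and can be handed to any generic convex solver. A minimal sketch using cvxpy on toy separable data (the solver choice and the data are illustrative assumptions; the slides do not prescribe an implementation):

```python
import cvxpy as cp
import numpy as np

# Toy separable data: rows of X are the x_i, labels y_i in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

n, k = X.shape
beta, beta0 = cp.Variable(k), cp.Variable()

# Problem (4): minimize 1/2 ||beta||^2  s.t.  y_i (x_i^T beta + beta_0) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)),
                     [cp.multiply(y, X @ beta + beta0) >= 1])
problem.solve()

print(beta.value, beta0.value)           # optimal separating hyperplane
print(1.0 / np.linalg.norm(beta.value))  # margin thickness 1/||beta||
```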

Linearly separable II

The corresponding Lagrangian (primal) function is
$$L_P = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^N \alpha_i\,[y_i(x_i^T\beta + \beta_0) - 1], \qquad (5)$$
to be minimized with respect to $\beta_0$ and $\beta$. Setting the derivatives to zero gives
$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \qquad 0 = \sum_{i=1}^N \alpha_i y_i, \qquad (6)$$
and the so-called Wolfe dual
$$L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N \alpha_i \alpha_k y_i y_k\, x_i^T x_k \quad \text{s.t.}\quad \alpha_i \ge 0. \qquad (7)$$
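The dual (7), together with the second condition in (6), is again a QP in the $\alpha_i$ alone. A sketch of solving it and recovering $\beta$ and $\beta_0$ (same hedges as above: cvxpy and the toy data are illustrative choices):

```python
import cvxpy as cp
import numpy as np

# Same toy separable data as before.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
N = len(y)

alpha = cp.Variable(N)
Yx = y[:, None] * X                        # rows are y_i * x_i

# Wolfe dual (7): note sum_{i,k} alpha_i alpha_k y_i y_k x_i^T x_k = ||sum_i alpha_i y_i x_i||^2.
problem = cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Yx.T @ alpha)),
                     [alpha >= 0, y @ alpha == 0])
problem.solve()

beta = Yx.T @ alpha.value                  # beta = sum_i alpha_i y_i x_i, from (6)
sv = alpha.value > 1e-6                    # support points: alpha_i > 0
beta0 = np.mean(y[sv] - X[sv] @ beta)      # from y_i (x_i^T beta + beta_0) = 1 on the boundary
```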

Linearly separable III

The solution is obtained by maximizing $L_D$. It must satisfy the Karush-Kuhn-Tucker conditions, i.e. (6), the constraint in (7), and
$$\alpha_i\,[y_i(x_i^T\beta + \beta_0) - 1] = 0 \;\;\forall i. \qquad (8)$$
- If $\alpha_i > 0$ then $y_i(x_i^T\beta + \beta_0) = 1$, in other words $x_i$ is on the boundary of the margin.
- If $y_i(x_i^T\beta + \beta_0) > 1$, then $x_i$ is not on the boundary and $\alpha_i = 0$.
From (6) we see that the solution $\beta$ is a linear combination of the support points $x_i$: those points defined to be on the boundary of the margin.

Linearly separable IV

The optimal separating hyperplane produces a function $\hat{f}(x) = x^T\hat{\beta} + \hat{\beta}_0$ for classifying new observations:
$$\hat{G}(x) = \mathrm{sign}\,\hat{f}(x). \qquad (9)$$
The intuition is that a large margin on the training data will lead to good separation on the test data. The description in terms of support points suggests that the optimal hyperplane focuses on the points that count. LDA, on the other hand, depends on all of the data, even points that are far away from the decision boundary.

Classifier

We now discuss hyperplanes for cases where the classes may not be separable by a linear boundary. The training data consist of $(x_1, y_1),\dots,(x_N, y_N)$ with $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$. With a unit vector $\beta$, the hyperplane is defined as
$$\{x : f(x) = x^T\beta + \beta_0 = 0\} \qquad (10)$$
and the classification rule induced by $f(x)$ is
$$G(x) = \mathrm{sign}[x^T\beta + \beta_0]. \qquad (11)$$

Classifier II

As before, by dropping the norm constraint on $\beta$ and defining $C := 1/\|\beta\|$, we get
$$\min \|\beta\| \quad \text{s.t.}\quad y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i \;\;\forall i, \quad \xi_i \ge 0, \quad \sum_i \xi_i \le \text{const}. \qquad (12)$$
The value $\xi_i$ (slack variable) is the proportional amount by which the prediction is on the wrong side of its margin. Misclassifications occur when $\xi_i > 1$, so bounding $\sum_i \xi_i$ at a value $K$ bounds the total number of training misclassifications at $K$.
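To spell out that last claim (a one-line argument added here for clarity): a misclassified point has $\xi_i > 1$, hence
$$\#\{i : x_i \text{ misclassified}\} \;=\; \sum_i \mathbf{1}\{\xi_i > 1\} \;\le\; \sum_i \xi_i \;\le\; K.$$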

Non-separable case

Equation (12) is a quadratic problem with linear constraints, hence a convex optimization problem. Equation (12) can be rewritten as
$$\min_{\beta_0, \beta}\; \frac{1}{2}\|\beta\|^2 + \gamma \sum_i \xi_i \quad \text{s.t.}\quad y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\;\forall i, \qquad (13)$$
where $\gamma$ replaces the constant in (12). As before, we derive the Lagrangian Wolfe-dual objective function
$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j, \qquad (14)$$
which gives a lower bound on the objective function (13) for any feasible point. We maximize $L_D$ subject to $0 \le \alpha_i \le \gamma$ and $\sum_i \alpha_i y_i = 0$.
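Problem (13) is again directly solvable with a generic convex solver. A sketch of the soft-margin primal in cvxpy (library, toy data, and the value of $\gamma$ are illustrative assumptions):

```python
import cvxpy as cp
import numpy as np

# Overlapping toy data, labels in {-1, +1}.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array([-1.0] * 30 + [1.0] * 30)

gamma = 1.0                                # cost parameter gamma from Equation (13)
N, p = X.shape
beta, beta0, xi = cp.Variable(p), cp.Variable(), cp.Variable(N)

# Problem (13): minimize 1/2 ||beta||^2 + gamma * sum_i xi_i subject to the margin constraints.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta) + gamma * cp.sum(xi)),
                     [cp.multiply(y, X @ beta + beta0) >= 1 - xi, xi >= 0])
problem.solve()

print(int(np.sum(xi.value > 1 + 1e-6)))    # number of training misclassifications (xi_i > 1)
```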

Non-separable case

Using the Karush-Kuhn-Tucker conditions we get a solution $\hat{\beta}$ of the form
$$\hat{\beta} = \sum_i \hat{\alpha}_i y_i x_i, \qquad (15)$$
with nonzero coefficients $\hat{\alpha}_i$ only for those observations $i$ for which the constraint $y_i(x_i^T\beta + \beta_0) - (1 - \xi_i) = 0$ holds. These observations are the support vectors, since $\hat{\beta}$ is represented in terms of them alone. Given the solution $\hat{\beta}, \hat{\beta}_0$, the decision function can be written as
$$\hat{G}(x) = \mathrm{sign}[\hat{f}(x)] = \mathrm{sign}[x^T\hat{\beta} + \hat{\beta}_0]. \qquad (16)$$
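Standard SVM implementations expose exactly these quantities. A sketch using scikit-learn's SVC (an illustrative choice of library, not referenced on the slides; its parameter C plays the role of $\gamma$ in (13)):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)              # the x_i with nonzero alpha_i, cf. Equation (15)
beta = clf.coef_.ravel()                 # hat{beta} = sum_i alpha_i y_i x_i
beta0 = clf.intercept_[0]                # hat{beta}_0
print(np.sign(X @ beta + beta0))         # decision rule (16), matches clf.predict(X)
```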

Support Vector Machines

The methods described so far find linear boundaries in the input feature space. SVMs become more flexible by enlarging the feature space using basis expansions such as splines; linear boundaries in the enlarged space achieve better training-class separation.
- Select basis functions $h_m(x)$, $m = 1,\dots,M$.
- Fit the SV classifier using the input features $h(x_i) = (h_1(x_i),\dots,h_M(x_i))$, $i = 1,\dots,N$, and produce the (nonlinear) function $\hat{f}(x) = h(x)^T\hat{\beta} + \hat{\beta}_0$.
- The classifier is $\hat{G}(x) = \mathrm{sign}(\hat{f}(x))$.
SVMs allow the enlarged space to get very large, even infinite-dimensional. Problem: it might seem that the computations would become prohibitive.

Nonlinearity by preprocessing

To make the SV algorithm nonlinear, one could simply preprocess the training patterns $x_i$ by a map $h : \mathcal{X} \to \mathcal{F}$ into some feature space $\mathcal{F}$. Example: consider the map $h : \mathbb{R}^2 \to \mathbb{R}^3$ with $h(x_{i1}, x_{i2}) = (x_{i1}^2, \sqrt{2}\,x_{i1} x_{i2}, x_{i2}^2)$. Training a linear SV machine on the preprocessed features would yield a quadratic function. This approach can easily become computationally infeasible for polynomial features of higher order or higher input dimension.
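A quick numerical check of this map, which also anticipates the kernel trick below: the feature-space inner product equals the squared ordinary inner product (the test points are arbitrary):

```python
import numpy as np

def h(x):
    """Explicit quadratic feature map R^2 -> R^3 from the slide."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])

lhs = h(x) @ h(xp)            # inner product computed in the feature space
rhs = (x @ xp) ** 2           # homogeneous polynomial kernel of degree 2
print(np.isclose(lhs, rhs))   # True: the kernel gives <h(x), h(x')> without forming h explicitly
```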

The Lagrangian dual function of our new optimization problem takes the form
$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j\, \langle h(x_i), h(x_j)\rangle. \qquad (17)$$
Analogous to the previous cases (separable, non-separable), the solution function $f(x)$ can be written as
$$f(x) = h(x)^T\beta + \beta_0 = \sum_i \alpha_i y_i\, \langle h(x), h(x_i)\rangle + \beta_0. \qquad (18)$$
As before, given the solutions $\alpha_i$, $\beta_0$ can be determined by solving $y_i f(x_i) = 1$ for any $x_i$ for which $0 < \alpha_i < \gamma$.

Nonlinearity by preprocessing II

Both (17) and (18) involve $h(x)$ only through inner products. In fact, we do not need to specify the transformation $h(x)$ at all; we only require knowledge of the kernel function
$$k(x, x') = \langle h(x), h(x')\rangle, \qquad (19)$$
which computes inner products in the transformed space.
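A sketch of how the resulting decision function (18) is evaluated once the kernel replaces the explicit map (the function names, the degree-2 polynomial kernel, and passing the support vectors with their coefficients explicitly are assumptions made for illustration):

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x^T z)^2 = <h(x), h(z)>."""
    return (x @ z) ** 2

def decision_function(x, X_sv, y_sv, alpha_sv, beta0, kernel=poly2_kernel):
    """f(x) = sum_i alpha_i y_i k(x, x_i) + beta_0, summed over the support vectors."""
    return sum(a * y * kernel(x, xi)
               for a, y, xi in zip(alpha_sv, y_sv, X_sv)) + beta0
```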

Hilbert Space

A Hilbert space is essentially an infinite-dimensional Euclidean space:
- It is a vector space, i.e., closed under addition and scalar multiplication.
- It is endowed with an inner product $\langle\cdot,\cdot\rangle$, a bilinear form.
- From this inner product we get a norm via $\|x\| = \sqrt{\langle x, x\rangle}$, which allows us to define notions of convergence.
- Adding all limit points of Cauchy sequences to our space yields a Hilbert space.
We will use kernel functions $k$ to construct Hilbert spaces.

Kernels

Let $\mathcal{X}$ be a (non-empty) set. A mapping
$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad (x, x') \mapsto k(x, x'), \qquad (20)$$
is called a kernel if $k$ is symmetric, i.e., $k(x, x') = k(x', x)$. A kernel $k$ is positive definite if its Gram matrix $K_{i,j} := k(x_i, x_j)$ is positive definite for every choice of points $x_1,\dots,x_n \in \mathcal{X}$. The Cauchy-Schwarz inequality holds for p.d. kernels. Define the reproducing kernel map
$$\Phi : x \mapsto k(\cdot, x), \qquad (21)$$
i.e., to each point $x$ in the original space we associate a function $k(\cdot, x)$.
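A small numerical illustration of the Gram-matrix condition (the sample and the kernel are arbitrary; checking eigenvalues is only a sanity check, not a proof of positive definiteness):

```python
import numpy as np

def gram_matrix(X, kernel):
    """Gram matrix K_ij = k(x_i, x_j) for the rows of X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.random.default_rng(1).normal(size=(5, 2))
K = gram_matrix(X, lambda x, z: (x @ z) ** 2)    # degree-2 polynomial kernel

print(np.allclose(K, K.T))                       # symmetry
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # all eigenvalues nonnegative
```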

Kernels II

Example: Gaussian kernel. Each point $x$ maps to a Gaussian distribution centered at that point:
$$k(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}. \qquad (22)$$
Construct a vector space containing all linear combinations of the functions $k(\cdot, x)$; this will be our RKHS:
$$f(\cdot) = \sum_i \alpha_i\, k(\cdot, x_i). \qquad (23)$$
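A sketch of the Gaussian kernel and of a function built as in (23) (the weights, centers, and $\sigma$ are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)), Equation (22)."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def rkhs_function(alphas, centers, sigma=1.0):
    """f(.) = sum_i alpha_i k(., x_i), Equation (23), returned as a callable."""
    return lambda x: sum(a * gaussian_kernel(x, c, sigma)
                         for a, c in zip(alphas, centers))

f = rkhs_function(alphas=[1.0, -0.5, 2.0],
                  centers=[np.array([0.0]), np.array([1.0]), np.array([-2.0])])
print(f(np.array([0.5])))
```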

RKHS

We now have to define an inner product on the RKHS. Let $g(\cdot) = \sum_j \beta_j\, k(\cdot, x_j')$ and define
$$\langle f, g\rangle = \sum_i \sum_j \alpha_i \beta_j\, k(x_i, x_j'). \qquad (24)$$
One needs to verify that this is in fact an inner product (symmetry, linearity, $\langle f, f\rangle = 0 \Rightarrow f = 0$).
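A numerical sketch of (24), which also illustrates the reproducing property $\langle f, k(\cdot, x)\rangle = f(x)$ implied by this definition (the kernel, points, and weights are arbitrary; the property itself is not spelled out on the slide):

```python
import numpy as np

def rkhs_inner(alphas, xs, betas, zs, kernel):
    """<f, g> = sum_i sum_j alpha_i beta_j k(x_i, x_j'), Equation (24)."""
    return sum(a * b * kernel(xi, zj)
               for a, xi in zip(alphas, xs)
               for b, zj in zip(betas, zs))

k = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)   # Gaussian kernel, sigma = 1

alphas = [1.0, -0.5]
xs = [np.array([0.0]), np.array([2.0])]
x0 = np.array([0.7])

lhs = rkhs_inner(alphas, xs, [1.0], [x0], k)           # <f, k(., x0)>
rhs = sum(a * k(xi, x0) for a, xi in zip(alphas, xs))  # f(x0)
print(np.isclose(lhs, rhs))                            # True
```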