Learning with kernels and SVM

Learning with kernels and SVM. Šámalova chata, May 23, 2006. Petra Kudová

Outline: Introduction, Binary classification, Learning with Kernels, Support Vector Machines, Demo, Conclusion

Learning from data: find a general rule that explains the data, given only a sample of limited size; the data may contain measurement errors or noise. Supervised learning: the data are a sample of input-output pairs; find the input-output mapping (prediction, classification, function approximation, etc.). Unsupervised learning: the data are a sample of objects; find some structure in them (clustering, etc.).

Learning methods: a wide range of methods is available. Statistical approaches. Neural networks: originally biologically motivated; multi-layer perceptrons, RBF networks, Kohonen maps. Kernel methods: modern and popular; SVM.

Trends in machine learning Articles on machine learning found by Google Source: http://yaroslavvb.blogspot.com/

Trends in machine learning Articles on neural networks found by Google Source: http://yaroslavvb.blogspot.com/

Trends in machine learning Articles on support vector machines found by Google Source: http://yaroslavvb.blogspot.com/

Binary classification. Training set $\{(x_i, y_i)\}_{i=1}^{m}$, $x_i \in X$, $y_i \in \{-1, +1\}$. Find a classifier that generalizes beyond the training sample.

Simple Classifier. Suppose $X \subseteq \mathbb{R}^n$ and the classes are linearly separable.
$c_+ = \frac{1}{m_+}\sum_{\{i \mid y_i = +1\}} x_i$, $c_- = \frac{1}{m_-}\sum_{\{i \mid y_i = -1\}} x_i$, $c = \frac{1}{2}(c_+ + c_-)$
$y = \operatorname{sgn}\langle x - c,\; w\rangle$ with $w = c_+ - c_-$
$\;= \operatorname{sgn}\langle x - (c_+ + c_-)/2,\; c_+ - c_-\rangle$
$\;= \operatorname{sgn}(\langle x, c_+\rangle - \langle x, c_-\rangle + b)$, where $b = \frac{1}{2}(\|c_-\|^2 - \|c_+\|^2)$
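
As a concrete illustration of this mean-based rule, here is a minimal NumPy sketch; the toy data and variable names are illustrative, not from the talk.

```python
import numpy as np

# Minimal sketch of the mean-based classifier above (toy data, illustrative names).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

c_plus = X[y == 1].mean(axis=0)                   # class mean c_+
c_minus = X[y == -1].mean(axis=0)                 # class mean c_-
b = 0.5 * (c_minus @ c_minus - c_plus @ c_plus)   # b = (||c_-||^2 - ||c_+||^2) / 2

def predict(x):
    # y = sgn(<x, c_+> - <x, c_-> + b)
    return np.sign(x @ c_plus - x @ c_minus + b)

print(predict(np.array([2.0, 1.5])), predict(np.array([-2.0, -1.0])))
```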

Mapping to the feature space. Life is not so easy: not all problems are linearly separable. What to do if $X$ is not a dot-product space? Choose a mapping to some (high-dimensional) dot-product space, the feature space: $\Phi : X \to H$.

Mercer's condition and kernels. If a symmetric function $K(x, y)$ satisfies $\sum_{i,j=1}^{M} a_i a_j K(x_i, x_j) \ge 0$ for all $M \in \mathbb{N}$, all $x_i$, and all $a_i \in \mathbb{R}$, then there exists a mapping $\Phi$ that maps $x$ into a dot-product feature space such that $K(x, y) = \langle \Phi(x), \Phi(y)\rangle$, and vice versa. The function $K$ is called a kernel.
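
In practice Mercer's condition shows up as positive semidefiniteness of every Gram matrix built from $K$; a quick numerical check of that property (the kernel choice, width, and data are illustrative assumptions) could look like this.

```python
import numpy as np

def rbf(x, z, d=1.0):
    # Candidate kernel; Mercer's condition requires sum_{i,j} a_i a_j K(x_i, x_j) >= 0,
    # i.e. every Gram matrix built from K must be positive semidefinite.
    return np.exp(-np.sum((x - z) ** 2) / d ** 2)

rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 3))
gram = np.array([[rbf(p, q) for q in pts] for p in pts])

eigvals = np.linalg.eigvalsh(gram)            # symmetric matrix -> real eigenvalues
print("smallest eigenvalue:", eigvals.min())  # should be >= 0 up to rounding error
```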

Examples of kernels. Linear kernels: $K(x, y) = \langle x, y\rangle$. Polynomial kernels: $K(x, y) = (\langle x, y\rangle + 1)^d$. For $d = 2$ and 2-dimensional inputs:
$K(x, y) = 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 y_1 x_2 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2 = \langle\Phi(x), \Phi(y)\rangle$,
$\Phi(x) = (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, \sqrt{2}\,x_1 x_2, x_1^2, x_2^2)^T$
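
The identity $K(x, y) = \langle\Phi(x), \Phi(y)\rangle$ for the degree-2 case can be checked numerically; a short sketch with made-up example vectors:

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return (x @ y + 1) ** d

def phi(x):
    # Explicit feature map for d = 2 and 2-dimensional inputs; the sqrt(2)
    # factors make <phi(x), phi(y)> reproduce the kernel value exactly.
    return np.array([1.0,
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     np.sqrt(2) * x[0] * x[1],
                     x[0] ** 2, x[1] ** 2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])
print(poly_kernel(x, y), phi(x) @ phi(y))   # both print the same value
```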

Examples of kernels. RBF kernels: $K(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{d^2}\right)$. Other kernels: kernels on various objects such as graphs, strings, texts, etc.; they enable us to use dot-product algorithms and serve as a measure of similarity.

Simple Classifier - kernel version. Recall the simple classifier: suppose $X \subseteq \mathbb{R}^n$ and the classes are linearly separable.
$c_+ = \frac{1}{m_+}\sum_{\{i \mid y_i = +1\}} x_i$, $c_- = \frac{1}{m_-}\sum_{\{i \mid y_i = -1\}} x_i$, $c = \frac{1}{2}(c_+ + c_-)$
$y = \operatorname{sgn}\langle x - c,\; w\rangle = \operatorname{sgn}\langle x - (c_+ + c_-)/2,\; c_+ - c_-\rangle = \operatorname{sgn}(\langle x, c_+\rangle - \langle x, c_-\rangle + b)$, $b = \frac{1}{2}(\|c_-\|^2 - \|c_+\|^2)$

Simple Classifier - kernel version. Suppose $X$ is any set and $\Phi : X \to H$ corresponds to a kernel $K$; apply the same classifier to the mapped points $\Phi(x_i)$:
$c_+ = \frac{1}{m_+}\sum_{\{i \mid y_i = +1\}} \Phi(x_i)$, $c_- = \frac{1}{m_-}\sum_{\{i \mid y_i = -1\}} \Phi(x_i)$, $c = \frac{1}{2}(c_+ + c_-)$
$y = \operatorname{sgn}\langle \Phi(x) - c,\; w\rangle = \operatorname{sgn}(\langle \Phi(x), c_+\rangle - \langle \Phi(x), c_-\rangle + b)$, $b = \frac{1}{2}(\|c_-\|^2 - \|c_+\|^2)$

Simple Classifier - kernel version. Suppose $X$ is any set and $\Phi : X \to H$ corresponds to a kernel $K$. Expressing the dot products through $K$:
$y = \operatorname{sgn}\!\left(\frac{1}{m_+}\sum_{\{i \mid y_i = +1\}} K(x, x_i) - \frac{1}{m_-}\sum_{\{i \mid y_i = -1\}} K(x, x_i) + b\right)$
$b = \frac{1}{2}\!\left(\frac{1}{m_-^2}\sum_{\{i,j \mid y_i = y_j = -1\}} K(x_i, x_j) - \frac{1}{m_+^2}\sum_{\{i,j \mid y_i = y_j = +1\}} K(x_i, x_j)\right)$
Statistical approach: the Bayes classifier is a special case when $\int_X K(x, y)\,dx = 1$ for all $y \in X$ and $b = 0$.

Simple Classifier - kernel version. With $\int_X K(x, y)\,dx = 1$ for all $y \in X$ and $b = 0$:
$y = \operatorname{sgn}\!\left(\underbrace{\frac{1}{m_+}\sum_{\{i \mid y_i = +1\}} K(x, x_i)}_{=\,p_+(x)} - \underbrace{\frac{1}{m_-}\sum_{\{i \mid y_i = -1\}} K(x, x_i)}_{=\,p_-(x)} + 0\right)$
The two terms $p_+(x)$ and $p_-(x)$ are Parzen-window density estimates; statistical approach, the Bayes classifier as a special case.
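
A sketch of this kernelized mean classifier, with the offset $b$ from the previous slide; the RBF kernel width and the toy data are assumptions for illustration.

```python
import numpy as np

def k(x, z, d=1.0):
    # RBF kernel (width d is an illustrative choice)
    return np.exp(-np.sum((x - z) ** 2) / d ** 2)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
Xp, Xm = X[y == 1], X[y == -1]

# b = ( mean of K over pairs in class -1  -  mean of K over pairs in class +1 ) / 2
b = 0.5 * (np.mean([[k(a, c) for c in Xm] for a in Xm])
           - np.mean([[k(a, c) for c in Xp] for a in Xp]))

def predict(x):
    # sgn( mean_i K(x, x_i^+) - mean_i K(x, x_i^-) + b ); the two means are the
    # Parzen-window estimates p_+(x) and p_-(x) from the slide.
    return np.sign(np.mean([k(x, c) for c in Xp]) - np.mean([k(x, c) for c in Xm]) + b)

print(predict(np.array([1.5, 2.0])), predict(np.array([-2.0, -1.0])))
```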

Separating hyperplane. Classifier of the form $y(x) = \operatorname{sgn}(\langle w, x\rangle + b)$ with $\langle w, x_i\rangle + b > 0$ for $y_i = +1$ and $< 0$ for $y_i = -1$. Each hyperplane $D(x) = \langle w, x\rangle + b = c$, $-1 < c < 1$, is separating. The optimal separating hyperplane is the one with the maximal margin.

Separating hyperplane. Classifier of the form $y(x) = \operatorname{sgn}(\langle w, x\rangle + b)$ with $\langle w, x_i\rangle + b \ge 1$ for $y_i = +1$ and $\le -1$ for $y_i = -1$. Each hyperplane $D(x) = \langle w, x\rangle + b = c$, $-1 < c < 1$, is separating. The optimal separating hyperplane is the one with the maximal margin.

Classifier with maximal margin

Classifier with maximal margin: $y(x) = \operatorname{sgn}(\langle w, x\rangle + b)$, where $w$ and $b$ are the solution of
$\min_w Q(w) = \frac{1}{2}\|w\|^2$ subject to the constraints $y_i(\langle w, x_i\rangle + b) \ge 1$ for $i = 1, \dots, m$.
This is a quadratic programming problem; under linear separability a solution exists, and there are no local minima.

Classifier with maximal margin. Constrained optimization problem:
$\min_w \frac{1}{2}\|w\|^2$ subject to $y_i(\langle w, x_i\rangle + b) \ge 1$.
It can be handled by introducing Lagrange multipliers $\alpha_i \ge 0$:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \big(y_i(\langle w, x_i\rangle + b) - 1\big)$,
minimized with respect to $w$ and $b$ and maximized with respect to $\alpha_i$.

Classifier with maximal margin. $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \big(y_i(\langle w, x_i\rangle + b) - 1\big)$; minimize with respect to $w, b$, maximize with respect to $\alpha$. From the Karush-Kuhn-Tucker (KKT) conditions we get:
$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i$
$\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$
$y_i(\langle w, x_i\rangle + b) > 1 \;\Rightarrow\; \alpha_i = 0$: $x_i$ is irrelevant
$y_i(\langle w, x_i\rangle + b) = 1 \;\Rightarrow\; \alpha_i \ge 0$: $x_i$ is a support vector

Dual problem. By substitution into $L$ we get:
$\max_{\alpha} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle$
subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$.
The resulting classifier is the (hard margin) support vector machine (SVM):
$f(x) = \operatorname{sgn}\!\left(\sum_{i=1}^{m} y_i \alpha_i \langle x, x_i\rangle + b\right)$, $b = y_i - \langle w, x_i\rangle$ for any support vector $x_i$.
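
To make the dual concrete, the sketch below hands $-W(\alpha)$ to a generic constrained optimizer (SciPy's SLSQP) on a tiny separable toy set and recovers $w$ and $b$; a dedicated QP or SMO solver would normally be used instead, and the data are my own.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (illustrative).
X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)
G = (y[:, None] * y[None, :]) * (X @ X.T)   # G_ij = y_i y_j <x_i, x_j>

def neg_W(alpha):                           # maximize W  <=>  minimize -W
    return -alpha.sum() + 0.5 * alpha @ G @ alpha

res = minimize(neg_W, np.zeros(m), method="SLSQP",
               bounds=[(0, None)] * m,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

w = (alpha * y) @ X                         # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                  # index of one support vector
b = y[sv] - w @ X[sv]                       # b = y_i - <w, x_i>
print("alpha:", np.round(alpha, 3), " w:", w, " b:", round(b, 3))
```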

Support Vector Machine. Work in the feature space and use kernels. Classifier:
$f(x) = \operatorname{sgn}\!\left(\sum_{i=1}^{m} y_i \alpha_i K(x, x_i) + b\right)$
$\max_{\alpha} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$.
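
In practice the dual is rarely solved by hand; an off-the-shelf implementation such as scikit-learn's SVC does it internally (the kernel and $C$ below are illustrative choices, not values from the talk).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)        # kernel SVM; alphas and b handled internally
print("support vectors per class:", clf.n_support_)
print(clf.predict([[2.0, 2.0], [-2.0, -1.0]]))  # expected: class +1, then class -1
```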

Soft margin SVM. A separating hyperplane may not exist (high level of noise, overlap of classes, etc.).

Soft margin SVM. A separating hyperplane may not exist (high level of noise, overlap of classes, etc.). Introduce slack variables $\xi_i$ and minimize
$Q(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} \xi_i^2$
subject to $y_i(\langle w, \Phi(x_i)\rangle + b) \ge 1 - \xi_i$.
$C > 0$ is the trade-off between maximization of the margin and minimization of the training error (it depends on the noise level).

Soft margin SVM. The solution has the form
$f(x) = \operatorname{sgn}\!\left(\sum_{i=1}^{m} y_i \alpha_i K(x, x_i) + b\right)$
$\max_{\alpha} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left(K(x_i, x_j) + \frac{\delta_{ij}}{C}\right)$
subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$.
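
This dual differs from the hard-margin one only by the $\delta_{ij}/C$ term on the kernel diagonal, so the same generic solver from before can be reused by shifting the Gram matrix; a sketch under that assumption, with illustrative overlapping toy data and an arbitrary $C$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (15, 2)), rng.normal(1, 1, (15, 2))])  # overlapping classes
y = np.array([-1.0] * 15 + [1.0] * 15)
m, C = len(y), 10.0

K = X @ X.T                          # linear kernel, for simplicity
K_soft = K + np.eye(m) / C           # add delta_ij / C on the diagonal
G = (y[:, None] * y[None, :]) * K_soft

def neg_W(alpha):
    return -alpha.sum() + 0.5 * alpha @ G @ alpha

res = minimize(neg_W, np.zeros(m), method="SLSQP",
               bounds=[(0, None)] * m,
               constraints={"type": "eq", "fun": lambda a: a @ y})
print("number of nonzero alphas:", int(np.sum(res.x > 1e-6)))
```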

SVM - summary. Input points are mapped to the feature space; the dot product is computed by means of a kernel function. Classification is done via the separating hyperplane with maximal margin; such a hyperplane is determined by the support vectors, while the other training samples are irrelevant. If the data are not separable in the feature space (noise, etc.), use the soft margin and control the trade-off between maximal margin and minimal training error via $C$.

SVM vs. Neural Networks. Pros: maximization of the generalization ability; no local minima. Cons: extension to multiclass problems; long training time (the number of variables equals the number of data points), although this is not necessarily a problem in practice since many techniques to reduce the training time exist; selection of parameters (kernel function, $C$).

References. Shigeo Abe: Support Vector Machines for Pattern Classification, Springer, 2005. Bernhard Schölkopf and Alex Smola: Learning with Kernels, MIT Press, Cambridge, MA, 2002 (source of the sheep-vector illustrations). Ralf Herbrich: Learning Kernel Classifiers, MIT Press, Cambridge, MA, 2002. http://www.kernel-machines.org/

Software. LIBSVM (by Chih-Chung Chang and Chih-Jen Lin): http://www.csie.ntu.edu.tw/~cjlin/libsvm/ Matlab SVM toolbox (by S. Gunn): http://www.isis.ecs.soton.ac.uk/resources/svminfo/

Questions?