Support Vector and Kernel Methods


SIGIR 2003 Tutorial: Support Vector and Kernel Methods
Thorsten Joachims, Cornell University, Computer Science Department
tj@cs.cornell.edu, http://www.joachims.org

Linear Classifiers
Rules of the form (weight vector $w$, threshold $b$):
$$h(x) = \mathrm{sign}\left(\sum_{i=1}^{N} w_i x_i + b\right) = \begin{cases} 1 & \text{if } \sum_{i=1}^{N} w_i x_i + b > 0 \\ -1 & \text{else} \end{cases}$$
Geometric interpretation: the decision boundary is a hyperplane with normal vector $w$ and offset determined by $b$.
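
As an illustration of this decision rule, a minimal NumPy sketch (the weight vector and threshold below are made-up values, not taken from the tutorial):

```python
import numpy as np

def linear_classify(x, w, b):
    """Linear decision rule h(x) = sign(w . x + b), with a non-positive score mapped to -1."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Hypothetical weight vector and threshold for a 2-attribute problem
w = np.array([0.8, -0.5])
b = 0.1

print(linear_classify(np.array([1.0, 0.5]), w, b))   # prints  1
print(linear_classify(np.array([-1.0, 2.0]), w, b))  # prints -1
```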

Optimal Hyperplane (SVM Type 1)
Assumption: the training examples are linearly separable.

Maximizing the Margin
The hyperplane with maximum margin δ corresponds (roughly, see later) to the hypothesis space with minimal VC-dimension according to SRM.
Support vectors: the examples with minimal distance to the hyperplane.

Example: Optimal Hyperplane vs. Perceptron
[Plot: percent training/testing error over 1-10 training iterations for the perceptron (eta = 0.1), compared against the test error of a hard-margin SVM.]
Train on 1000 pos / 1000 neg examples for acq (Reuters-21578).

Non-Separable Training Samples
For some training samples there is no separating hyperplane! Complete separation is suboptimal for many training samples!
=> Minimize the trade-off between margin and training error.

Soft-Margin Separation
Idea: maximize the margin and minimize the training error simultaneously.

Hard margin (separable):
$$\min_{w,b}\; P(w,b) = \frac{1}{2}\, w \cdot w \quad \text{s.t.}\quad y_i\,[w \cdot x_i + b] \ge 1$$

Soft margin (training error):
$$\min_{w,b,\xi}\; P(w,b,\xi) = \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.}\quad y_i\,[w \cdot x_i + b] \ge 1 - \xi_i \ \text{and}\ \xi_i \ge 0$$

Controlling Soft-Margin Separation
Soft margin:
$$\min_{w,b,\xi}\; P(w,b,\xi) = \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.}\quad y_i\,[w \cdot x_i + b] \ge 1 - \xi_i \ \text{and}\ \xi_i \ge 0$$
$\sum_{i=1}^{n} \xi_i$ is an upper bound on the number of training errors. $C$ is a parameter that controls the trade-off between margin and training error.
[Illustration: a large C yields few margin violations $\xi_i$; a small C yields a larger margin $\delta$ at the cost of more violations.]
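
To illustrate how C trades margin against training error, a small scikit-learn sketch (the blob dataset and the C values are made up for illustration; SVC solves this soft-margin problem in its dual form):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two slightly overlapping classes, labels in {-1, +1}
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y - 1

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)        # geometric margin = 1 / ||w||
    train_err = np.mean(clf.predict(X) != y)
    print(f"C={C:>6}: margin={margin:.3f}, training error={train_err:.3f}, "
          f"#SV={len(clf.support_)}")
```

As C shrinks, the geometric margin 1/||w|| grows while more training points are allowed to violate it.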

Example Reuters "acq": Varying C
[Plot: percent training/testing error as a function of C (roughly 0.1 to 10), compared against the hard-margin SVM.]
Observation: typically no local optima, but not necessarily...

Properties of the Soft-Margin Dual OP
Dual OP:
$$\max_{\alpha}\; D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, (x_i \cdot x_j) \quad \text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0 \ \text{and}\ 0 \le \alpha_i \le C$$
- typically a single solution (i.e. $w, b$ are unique)
- one factor $\alpha_i$ for each training example
- the influence of a single training example is limited by $C$:
  - $0 < \alpha_i < C \iff$ SV with $\xi_i = 0$
  - $\alpha_i = C \iff$ SV with $\xi_i > 0$
  - $\alpha_i = 0$ otherwise
- based exclusively on inner products between training examples

Primal <=> Dual
Theorem: The primal OP and the dual OP have the same solution. Given the solution $\alpha_1, \dots, \alpha_n$ of the dual OP,
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad b = -\frac{1}{2}\,(w \cdot x_{pos} + w \cdot x_{neg})$$
is the solution of the primal OP (where $x_{pos}$ and $x_{neg}$ are support vectors from the positive and negative class).
Theorem: For any set of feasible points, $P(w,b) \ge D(\alpha)$.
=> Two alternative ways to represent the learning result:
- weight vector and threshold $w, b$
- vector of influences $\alpha_1, \dots, \alpha_n$
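
The relation $w = \sum_i \alpha_i y_i x_i$ and the bound $\alpha_i \le C$ can be checked numerically. A sketch using scikit-learn's SVC, whose dual_coef_ attribute stores $\alpha_i y_i$ for the support vectors (the toy data is made up):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=1)
y = 2 * y - 1                                  # labels in {-1, +1}

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_[0] holds alpha_i * y_i for the support vectors
alpha_y = clf.dual_coef_[0]
w_from_dual = alpha_y @ clf.support_vectors_   # w = sum_i alpha_i y_i x_i

print(np.allclose(w_from_dual, clf.coef_[0]))  # True: matches the primal weight vector
print(np.all(np.abs(alpha_y) <= C + 1e-8))     # True: each alpha_i is bounded by C
```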

Non-Linear Problems
Problem: some tasks have a non-linear structure; no hyperplane is sufficiently accurate.
How can SVMs learn non-linear classification rules?

Example
Input space: $x = (x_1, x_2)$ (2 attributes)
Feature space: $\Phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}x_1,\; \sqrt{2}x_2,\; \sqrt{2}x_1 x_2,\; 1)$ (6 attributes)

Extending the Hypothesis Space
Idea: map the input space into a feature space via $\Phi$ and find a hyperplane in the feature space.
Example: $\Phi: (a, b, c) \mapsto (a, b, c, aa, ab, ac, bb, bc, cc)$
=> The separating hyperplane in feature space is a degree-two polynomial in input space.

Kernels
Problem: very many parameters! Polynomials of degree p over N attributes in input space lead to $O(N^p)$ attributes in feature space!
Solution [Boser et al., 1992]: the dual OP needs only inner products => kernel functions
$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$$
Example: for $\Phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}x_1,\; \sqrt{2}x_2,\; \sqrt{2}x_1 x_2,\; 1)$, calculating
$$K(x_i, x_j) = [x_i \cdot x_j + 1]^2 = \Phi(x_i) \cdot \Phi(x_j)$$
gives the inner product in feature space. We do not need to represent the feature space explicitly!
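
A quick numeric check of this identity (the two sample points are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2 attributes."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, 1.0])

xi = np.array([1.5, -0.5])
xj = np.array([0.3,  2.0])

kernel_value = (np.dot(xi, xj) + 1) ** 2         # K(xi, xj) = [xi . xj + 1]^2
explicit_value = np.dot(phi(xi), phi(xj))        # Phi(xi) . Phi(xj)

print(np.isclose(kernel_value, explicit_value))  # True
```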

SVM with Kernels
Training:
$$\max_{\alpha}\; D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \quad \text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0 \ \text{and}\ 0 \le \alpha_i \le C$$
Classification: for a new example $x$
$$h(x) = \mathrm{sign}\left(\sum_{x_i \in SV} \alpha_i y_i K(x_i, x) + b\right)$$
New hypothesis spaces through new kernels:
- Linear: $K(x_i, x_j) = x_i \cdot x_j$
- Polynomial: $K(x_i, x_j) = [x_i \cdot x_j + 1]^d$
- Radial basis functions: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$
- Sigmoid: $K(x_i, x_j) = \tanh(\gamma\, (x_i \cdot x_j) + c)$
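
The classification rule can be reproduced from a fitted kernel SVM. A sketch with scikit-learn's SVC and its RBF kernel, which uses the parametrization $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ (the toy data is made up; SVC stores $\alpha_i y_i$ in dual_coef_ and $b$ in intercept_):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
y = 2 * y - 1                                   # labels in {-1, +1}

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# h(x) = sign( sum_{i in SV} alpha_i y_i K(x_i, x) + b )
X_new = X[:5]
K = rbf_kernel(clf.support_vectors_, X_new, gamma=gamma)    # K(x_i, x) for all SVs
scores = clf.dual_coef_[0] @ K + clf.intercept_[0]
h = np.sign(scores)

print(np.allclose(scores, clf.decision_function(X_new)))    # True
print(h, clf.predict(X_new))                                # identical labels
```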

Example: SVM with Polynomial Kernel of Degree 2
Kernel: $K(x_i, x_j) = [x_i \cdot x_j + 1]^2$
[Plot by Bell SVM applet.]

Example: SVM with RBF Kernel
Kernel: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$
[Plot by Bell SVM applet.]

Two Reasons for Using a Kernel
(1) Turn a linear learner into a non-linear learner (e.g. RBF, polynomial, sigmoid).
(2) Make non-vectorial data accessible to the learner (e.g. string kernels for sequences).

Summary: What is an SVM?
Given:
- training examples $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i \in \mathbb{R}^N$, $y_i \in \{-1, 1\}$
- a hypothesis space according to the kernel $K(x_i, x_j)$
- a parameter $C$ for trading off training error and margin size
Training: finds the hyperplane in the feature space generated by the kernel. The hyperplane has maximum margin in feature space with minimal training error (upper bound $\sum_i \xi_i$) given $C$. The result of training are $\alpha_1, \dots, \alpha_n$; they determine $w, b$.
Classification: for a new example $x$
$$h(x) = \mathrm{sign}\left(\sum_{x_i \in SV} \alpha_i y_i K(x_i, x) + b\right)$$

Part 2: How to Use an SVM Effectively and Efficiently?
- normalization of the input vectors
- selecting C
- handling unbalanced datasets
- selecting a kernel
- multi-class classification
- selecting a training algorithm

How to Assign Feature Values?
Things to take into consideration:
- The importance of a feature is monotonic in its absolute value: the larger the absolute value, the more influence the feature gets. Typical problem: number of doors [0-5] vs. price [0-100000]. Want relevant features large / irrelevant features low (e.g. IDF).
- Normalization to make features equally important:
  - by mean and variance: $x_{norm} = \frac{x - \mathrm{mean}(X)}{\sqrt{\mathrm{var}(X)}}$
  - by other distributions
- Normalization to bring feature vectors onto the same scale (directional data, e.g. text classification): normalize the length of the vector according to some norm, $x_{norm} = \frac{x}{\|x\|}$. This changes whether a problem is (linearly) separable or not.
- Scale all vectors to a length that allows numerically stable training.
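
A minimal NumPy sketch of the two normalizations mentioned above (the feature matrix is made up; mean/variance scaling operates per feature, length normalization per example):

```python
import numpy as np

# Hypothetical feature matrix: rows are examples, columns are features
# (e.g. "number of doors" on a 0-5 scale, "price" on a 0-100000 scale)
X = np.array([[2.0, 15000.0],
              [4.0, 80000.0],
              [5.0, 30000.0]])

# Per-feature normalization by mean and variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Per-example length normalization (directional data, e.g. text vectors)
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_std.mean(axis=0))               # ~0 per feature
print(np.linalg.norm(X_unit, axis=1))   # all 1.0
```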

Selecting a Kernel
Things to take into consideration:
- A kernel can be thought of as a similarity measure: examples in the same class should have a high kernel value, examples in different classes a low kernel value. Ideal kernel: the equivalence relation $K(x_i, x_j) = \mathrm{sign}(y_i y_j)$.
- Normalization also applies to the kernel:
  - relative weight for implicit features
  - normalize per example for directional data: $K'(x_i, x_j) = \frac{K(x_i, x_j)}{\sqrt{K(x_i, x_i)\, K(x_j, x_j)}}$
- Potential problems with large numbers, for example the polynomial kernel $K(x_i, x_j) = [x_i \cdot x_j + 1]^d$ for large $d$.
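
A short sketch of this per-example kernel normalization applied to a precomputed kernel matrix (the data and the polynomial degree are made up):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

# Unnormalized polynomial kernel K(xi, xj) = (xi . xj + 1)^d
K = polynomial_kernel(X, X, degree=5, gamma=1.0, coef0=1.0)

# Normalize: K'(xi, xj) = K(xi, xj) / sqrt(K(xi, xi) * K(xj, xj))
d = np.sqrt(np.diag(K))
K_norm = K / np.outer(d, d)

print(K.max())                              # can be huge for large degree
print(np.allclose(np.diag(K_norm), 1.0))    # True: unit "self-similarity"
```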

Selecting the Regularization Parameter C
Common method:
- A reasonable starting point and/or default value is $C_{def} = \frac{1}{K(x_i, x_i)}$.
- Search for C on a log scale, for example $C \in [10^{-4}\, C_{def},\; 10^{4}\, C_{def}]$.
- Selection via cross-validation or via an approximation of the leave-one-out error [Jaakkola & Haussler, 1999] [Vapnik & Chapelle, 2000] [Joachims, 2000].
Note: the optimal value of C scales with the feature values.
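
A sketch of this log-scale search with cross-validation in scikit-learn; the default value computed here applies the $1/K(x_i, x_i)$ heuristic averaged over the training set, which is an assumption about how the rule is used in practice:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Heuristic default: C_def = 1 / K(x_i, x_i), here averaged over the linear kernel diagonal
C_def = 1.0 / np.mean(np.einsum("ij,ij->i", X, X))

# Log-scale grid around C_def, selected by cross-validation
grid = {"C": C_def * np.logspace(-4, 4, 9)}
search = GridSearchCV(SVC(kernel="linear"), grid, cv=5).fit(X, y)

print(C_def, search.best_params_["C"], search.best_score_)
```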

Selecting Kernel Parameters
Problem:
- Results are often very sensitive to the kernel parameters (e.g. the variance γ in the RBF kernel).
- C needs to be optimized simultaneously, since the optimal C typically depends on the kernel parameters.
Common method: search for the combination of parameters via exhaustive search; selection of the kernel parameters typically via cross-validation (see the sketch below).
Advanced approach: avoid the exhaustive search for improved search efficiency [Chapelle et al., 2002].
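
A sketch of the joint exhaustive search over C and the RBF parameter, using scikit-learn's γ parametrization of the RBF kernel (the grid values are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Exhaustive search over combinations of C and gamma, selected by cross-validation
grid = {
    "C": np.logspace(-2, 3, 6),
    "gamma": np.logspace(-4, 1, 6),
}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```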

Handling Multi-Class / Multi-Label Problems
The standard classification SVM addresses binary problems, $y \in \{-1, 1\}$.
Multi-class classification, $y \in \{1, \dots, k\}$:
- One-against-rest: decomposition into $k$ binary problems. Learn one binary SVM $h^{(i)}$ per class ($y^{(i)} = 1$ if $y = i$, $-1$ else); assign a new example to $y = \mathrm{argmax}_i\, h^{(i)}(x)$.
- Pairwise: decomposition into $k(k-1)/2$ binary problems. Learn one binary SVM $h^{(i,j)}$ per class pair ($y^{(i,j)} = 1$ if $y = i$, $-1$ if $y = j$); assign a new example by majority vote; reducing the number of classifications [Platt et al., 2000].
- Multi-class SVM [Weston & Watkins, 1998].
- Multi-class SVM via ranking [Crammer & Singer, 2001].
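
A minimal sketch of the one-against-rest decomposition built directly from binary SVMs (the three-class toy data is made up; scikit-learn also offers this decomposition ready-made via OneVsRestClassifier):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=3, random_state=0)   # classes 0, 1, 2
classes = np.unique(y)

# One binary SVM per class: +1 for "this class", -1 for "the rest"
binary_svms = [SVC(kernel="rbf", C=1.0).fit(X, np.where(y == c, 1, -1))
               for c in classes]

# Assign a new example to the class whose SVM gives the largest decision value
scores = np.column_stack([h.decision_function(X) for h in binary_svms])
y_pred = classes[np.argmax(scores, axis=1)]

print(np.mean(y_pred == y))   # training accuracy of the one-vs-rest ensemble
```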