Foundation of Intelligent Systems, Part I. SVMs & Kernel Methods


Foundation of Intelligent Systems, Part I: SVMs & Kernel Methods
mcuturi@i.kyoto-u.ac.jp
FIS - 2013

Support Vector Machines: the linearly separable case

A criterion to select a linear classifier: the margin?


Largest Margin Linear Classifier?

Support Vectors with Large Margin

In equations
The training set is a finite set of n data/class pairs
T = {(x_1, y_1), ..., (x_n, y_n)},
where x_i ∈ R^d and y_i ∈ {−1, 1}. We assume (for the moment) that the data are linearly separable, i.e., that there exists (w, b) ∈ R^d × R such that:
w^T x_i + b > 0 if y_i = 1,
w^T x_i + b < 0 if y_i = −1.
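As a quick sanity check of this separability assumption, the following sketch (assuming Python with NumPy; the toy data set is invented for illustration) finds some separating (w, b) with the classical perceptron rule. It is not the max-margin hyperplane; selecting that particular hyperplane is exactly what the SVM adds.

```python
import numpy as np

# Toy linearly separable data: two classes in R^2 (invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])

# Perceptron: finds SOME separating (w, b) when one exists
# (not the max-margin one -- that is what the SVM adds).
w, b = np.zeros(2), 0.0
for _ in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified or on the boundary
            w += yi * xi
            b += yi
            errors += 1
    if errors == 0:                  # all points on the correct side
        break

margins = y * (X @ w + b)
print(margins > 0)                   # all True once a separator is found
```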

How to find the largest separating hyperplane?
For the linear classifier f(x) = w^T x + b, consider the interstice defined by the hyperplanes
f(x) = w^T x + b = +1
f(x) = w^T x + b = −1
[Figure: the separating hyperplane w·x + b = 0, the margin hyperplanes w·x + b = +1 and w·x + b = −1, the regions w·x + b > +1 and w·x + b < −1, and two points x_1, x_2.]

The margin is 2/‖w‖
Indeed, the points x_1 and x_2, taken on the hyperplanes w^T x + b = 0 and w^T x + b = 1 with x_2 − x_1 parallel to w, satisfy:
w^T x_1 + b = 0,
w^T x_2 + b = 1.
By subtracting we get w^T (x_2 − x_1) = 1, so ‖x_2 − x_1‖ = 1/‖w‖, and therefore
γ = 2 ‖x_2 − x_1‖ = 2/‖w‖,
where γ is the margin.

All training points should be on the appropriate side
For positive examples (y_i = 1) this means: w^T x_i + b ≥ 1.
For negative examples (y_i = −1) this means: w^T x_i + b ≤ −1.
In both cases: for i = 1, ..., n,  y_i (w^T x_i + b) ≥ 1.

Finding the optimal hyperplane
Finding the optimal hyperplane is equivalent to finding (w, b) which minimize
‖w‖^2
under the constraints:
for i = 1, ..., n,  y_i (w^T x_i + b) − 1 ≥ 0.
This is a classical quadratic program on R^{d+1}: linear constraints, quadratic objective.
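This primal quadratic program can be handed to any constrained solver. A minimal sketch, assuming Python with SciPy (the slides use MATLAB-style tools; SciPy's SLSQP is a stand-in, and the toy data are invented):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data in R^2 (invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# Variables: v = (w_1, ..., w_d, b). Minimize ||w||^2
# subject to y_i (w^T x_i + b) - 1 >= 0 for all i.
objective = lambda v: v[:d] @ v[:d]
constraints = [{"type": "ineq",
                "fun": lambda v, xi=xi, yi=yi: yi * (v[:d] @ xi + v[d]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, np.zeros(d + 1), constraints=constraints)
w, b = res.x[:d], res.x[d]
print(w, b)             # the max-margin hyperplane
print(y * (X @ w + b))  # every constraint satisfied: all >= 1 (up to tolerance)
```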

Lagrangian
In order to minimize
(1/2) ‖w‖^2
under the constraints
for i = 1, ..., n,  y_i (w^T x_i + b) − 1 ≥ 0,
introduce one dual variable α_i for each constraint, i.e., one for each training point. The Lagrangian is, for α ≥ 0 (that is, for each α_i ≥ 0):
L(w, b, α) = (1/2) ‖w‖^2 − Σ_{i=1}^n α_i ( y_i (w^T x_i + b) − 1 ).

The Lagrange dual function
g(α) = inf_{w ∈ R^d, b ∈ R} { (1/2) ‖w‖^2 − Σ_{i=1}^n α_i ( y_i (w^T x_i + b) − 1 ) }
The saddle point conditions give us that, at the minimum in w and b,
w = Σ_{i=1}^n α_i y_i x_i   (differentiating w.r.t. w),   (*)
0 = Σ_{i=1}^n α_i y_i   (differentiating w.r.t. b).   (**)
Substituting (*) in g, and using (**) as a constraint, we get the dual function g(α). To solve the dual problem, maximize g w.r.t. α.
Strong duality holds: the primal and dual problems have the same optimum. The KKT conditions give us α_i ( y_i (w^T x_i + b) − 1 ) = 0, hence either α_i = 0 or y_i (w^T x_i + b) = 1: α_i ≠ 0 only for points on the support hyperplanes {x_i : y_i (w^T x_i + b) = 1}.

Dual optimum
The dual problem is thus
maximize g(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j
such that α ≥ 0, Σ_{i=1}^n α_i y_i = 0.
This is a quadratic program in R^n, with box constraints. α* can be computed using optimization software (e.g., a built-in MATLAB function).
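The slide mentions MATLAB; as a hedged stand-in, here is a sketch of the same dual solved with SciPy on invented toy data, including the recovery of (w, b) from α described on the next slide:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (invented); hard-margin dual, so no upper bound on alpha.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
G = (X @ X.T) * np.outer(y, y)            # G_ij = y_i y_j x_i^T x_j

# maximize g(alpha) = sum(alpha) - 0.5 alpha^T G alpha
# i.e. minimize its negation, s.t. alpha >= 0 and sum_i alpha_i y_i = 0.
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
               np.zeros(n),
               bounds=[(0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

w = (alpha * y) @ X                        # w* = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                          # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)             # from y_i (w^T x_i + b) = 1 on SVs
print(alpha)                               # nonzero only on support vectors
print(y * (X @ w + b))                     # >= 1, with equality on SVs
```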

Recovering the optimal hyperplane
With α*, we recover (w*, b*) corresponding to the optimal hyperplane.
w* is given by
w*^T = Σ_{i=1}^n y_i α_i* x_i^T.
b* is given by the conditions on the support vectors (α_i > 0, y_i (w*^T x_i + b*) = 1):
b* = −(1/2) ( min_{y_i = 1, α_i > 0} (w*^T x_i) + max_{y_i = −1, α_i > 0} (w*^T x_i) ).
The decision function is therefore:
f*(x) = w*^T x + b* = ( Σ_{i=1}^n y_i α_i x_i^T ) x + b*.
Here the dual solution gives us directly the primal solution.

Interpretation: support vectors
[Figure: training points with α = 0 lie strictly beyond the margin; points with α > 0 lie on the margin hyperplanes.]

Another interpretation: Convex Hulls
Go back to 2 sets of points that are linearly separable.
Linearly separable = the convex hulls do not intersect.
Find the two closest points, one in each convex hull.
The SVM = bisection of that segment.
Support vectors = extreme points of the faces on which the two closest points lie.

The non-linearly separable case (when convex hulls intersect)

What happens when the data is not linearly separable?


Soft-margin SVM?
Find a trade-off between large margin and few errors. Mathematically:
min_f { 1/margin(f) + C · errors(f) }
where C is a parameter.

Soft-margin SVM formulation?
The margin of a labeled point (x, y) is
margin(x, y) = y (w^T x + b).
The error is:
0 if margin(x, y) > 1,
1 − margin(x, y) otherwise.
The soft-margin SVM solves:
min_{w,b} { ‖w‖^2 + C Σ_{i=1}^n max{0, 1 − y_i (w^T x_i + b)} }
c(u, y) = max{0, 1 − y u} is known as the hinge loss. c(w^T x_i + b, y_i) associates a mistake cost to the decision (w, b) for example x_i.
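The objective above can also be attacked directly, without the QP machinery of the next slide. A minimal sketch, assuming Python with NumPy and invented Gaussian toy data: subgradient descent on ‖w‖² + C Σ max{0, 1 − y_i(w^T x_i + b)} (the hinge-loss form, not the constrained form the slides solve):

```python
import numpy as np

def hinge_loss(u, y):
    """c(u, y) = max(0, 1 - y*u): hinge loss of decision value u for label y."""
    return np.maximum(0.0, 1.0 - y * u)

# Invented toy data: two Gaussian blobs, mostly separable.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

# Subgradient descent on ||w||^2 + C * sum_i hinge(w^T x_i + b, y_i).
C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0
for _ in range(500):
    active = hinge_loss(X @ w + b, y) > 0          # points with nonzero loss
    grad_w = 2 * w - C * (y[active] @ X[active])   # subgradient in w
    grad_b = -C * y[active].sum()                  # subgradient in b
    w -= lr * grad_w
    b -= lr * grad_b

train_err = np.mean(np.sign(X @ w + b) != y)
print(train_err)   # small on this well-separated toy set
```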

Dual formulation of soft-margin SVM
The soft-margin SVM program
min_{w,b} { ‖w‖^2 + C Σ_{i=1}^n max{0, 1 − y_i (w^T x_i + b)} }
can be rewritten as
minimize ‖w‖^2 + C Σ_{i=1}^n ξ_i
such that y_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0.
In that case the dual function is
g(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j,
which is finite under the constraints:
0 ≤ α_i ≤ C, for i = 1, ..., n,
Σ_{i=1}^n α_i y_i = 0.

Interpretation: bounded and unbounded support vectors
[Figure: points with α = 0, bounded support vectors with α = C, and unbounded support vectors with 0 < α < C.]

What about the convex hull analogy?
Remember the separable case. Here we consider the case where the two sets are not linearly separable, i.e., their convex hulls intersect.

What about the convex hull analogy?
Definition 1. Given a set of n points A, and 0 ≤ C ≤ 1, the set of finite combinations
Σ_{i=1}^n λ_i x_i,  0 ≤ λ_i ≤ C,  Σ_{i=1}^n λ_i = 1,
is the (C-)reduced convex hull of A.
[Figure: using C = 1/2, the reduced convex hulls of A and B.]
Soft-SVM with C = closest two points of the C-reduced convex hulls.

Kernels

Kernel trick for SVMs
Use a mapping φ from X to a feature space, which corresponds to the kernel k:
∀ x, x' ∈ X,  k(x, x') = ⟨φ(x), φ(x')⟩
Example: if φ(x) = φ([x_1, x_2]) = [x_1^2, x_2^2], then
k(x, x') = ⟨φ(x), φ(x')⟩ = (x_1)^2 (x'_1)^2 + (x_2)^2 (x'_2)^2.

Training an SVM in the feature space
Replace each x^T x' in the SVM algorithm by ⟨φ(x), φ(x')⟩ = k(x, x').
Reminder: the dual problem is to maximize
g(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j k(x_i, x_j),
under the constraints:
0 ≤ α_i ≤ C, for i = 1, ..., n,
Σ_{i=1}^n α_i y_i = 0.
The decision function becomes:
f(x) = ⟨w, φ(x)⟩ + b = Σ_{i=1}^n y_i α_i k(x_i, x) + b.   (1)

The Kernel Trick?
The explicit computation of φ(x) is not necessary: the kernel k(x, x') is enough.
The SVM optimization for α works implicitly in the feature space.
The SVM is a kernel algorithm: we only need to input K and y:
maximize g(α) = α^T 1 − (1/2) α^T (K ∘ y y^T) α
such that 0 ≤ α_i ≤ C, for i = 1, ..., n, and Σ_{i=1}^n α_i y_i = 0,
where ∘ denotes the entrywise product. K positive definite ⇒ the problem has a unique optimum.
The decision function is f(·) = Σ_{i=1}^n α_i y_i k(x_i, ·) + b.
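To make "only K and y are needed" concrete, here is a sketch (assuming Python with SciPy; the XOR toy data, the Gaussian RBF kernel, and the choice C = 10 are all illustrative assumptions) that solves the kernelized dual and evaluates decision function (1) without ever forming φ(x):

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# XOR data: not linearly separable in R^2, separable in the RBF feature space.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, C = len(y), 10.0

K = rbf(X, X)
G = K * np.outer(y, y)                    # entrywise product K o (y y^T)
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(), np.zeros(n),
               bounds=[(0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# b from a support vector with 0 < alpha_j < C: y_j f(x_j) = 1.
j = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
b = y[j] - (alpha * y) @ K[:, j]

f = lambda Xnew: (alpha * y) @ rbf(X, Xnew) + b   # decision function (1)
print(np.sign(f(X)))    # recovers the XOR labels
```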

Kernel example: polynomial kernel
For x = (x_1, x_2) ∈ R^2, let φ(x) = (x_1^2, √2 x_1 x_2, x_2^2) ∈ R^3:
K(x, x') = x_1^2 x'_1^2 + 2 x_1 x_2 x'_1 x'_2 + x_2^2 x'_2^2
= (x_1 x'_1 + x_2 x'_2)^2
= (x^T x')^2.
[Figure: the map φ from R^2 to R^3.]
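This identity is easy to check numerically. A small sketch (Python with NumPy assumed; random test points are illustrative) comparing the explicit feature map against the kernel shortcut:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x in R^2, as on the slide."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, xp):
    """Polynomial kernel (x^T x')^2 -- no trip through R^3 needed."""
    return float(x @ xp) ** 2

rng = np.random.default_rng(42)
x, xp = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(xp))     # inner product computed in the feature space R^3
print(poly_kernel(x, xp))   # same value, computed directly in R^2
```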

Kernels are Trojan Horses onto Linear Models
With kernels, complex structures can enter the realm of linear models.

What is a kernel
In the context of these lectures, a kernel k is a function
k : X × X → R
(x, y) ↦ k(x, y)
which compares two objects of a space X, e.g.:
strings, texts and sequences,
images, audio and video feeds,
graphs, interaction networks and 3D structures,
whatever, actually... time-series of graphs of images? graphs of texts?...

Fundamental properties of a kernel
Symmetric: k(x, y) = k(y, x).
Positive (semi)definite: for any finite family of points x_1, ..., x_n of X, the matrix
K = [ k(x_i, x_j) ]_{1 ≤ i, j ≤ n}
is positive semidefinite (has a nonnegative spectrum). K is often called the Gram matrix of {x_1, ..., x_n} using k.
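Both properties can be spot-checked on a sample. A sketch (Python with NumPy assumed; the RBF kernel and the random sample are illustrative choices) that builds a Gram matrix and inspects its spectrum:

```python
import numpy as np

# Gram matrix of the RBF kernel k(x, y) = exp(-||x - y||^2)
# on a random sample; its spectrum should be nonnegative.
rng = np.random.default_rng(0)
pts = rng.normal(size=(30, 2))
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2)

print(np.allclose(K, K.T))      # symmetric
eigs = np.linalg.eigvalsh(K)    # real spectrum of the symmetric Gram matrix
print(eigs.min() > -1e-10)      # nonnegative, up to floating-point noise
```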

What can we do with a kernel?

The setting
A pretty simple setting:
a set of objects x_1, ..., x_n of X;
sometimes additional information on these objects: labels y_i ∈ {−1, 1} or {1, ..., #(classes)}, scalar values y_i ∈ R, or an associated object y_i ∈ Y;
a kernel k : X × X → R.

A few intuitions on the possibilities of kernel methods
Important concepts and perspectives:
the functional perspective: represent points as functions;
nonlinearity: linear combinations of kernel evaluations;
summary of a sample through its kernel matrix.

Represent any point in X as a function
For every x, the map x ↦ k(x, ·) associates to x a function k(x, ·) from X to R.
Suppose we have a kernel k on bird images; suppose for instance that k evaluated on two particular bird images equals .32.

Represent any point in X as a function
We examine one image in particular. With kernels, we get a representation of that bird as a real-valued function, defined on the space of birds, represented here as R^2 for simplicity.
[Figure: schematic plot of k(bird, ·).]

Represent any point in X as a function
If the bird example was confusing, consider instead
k([x, y], [x', y']) = ([x, y]^T [x', y'] + .3)^2,
which takes a point in R^2 to a function defined over R^2.
[Figure: surface plot of ((2x + 1.5y) + .3)^2, i.e., k with its first argument fixed.]
We assume implicitly that the functional representation will be more useful than the original representation.

Decision functions as linear combinations of kernel evaluations
Linear decision functions are a major tool in statistics, that is, functions
f(x) = β^T x + β_0.
Implicitly, a point x is processed depending on its characteristics x_i:
f(x) = Σ_{i=1}^d β_i x_i + β_0,
where the free parameters are scalars β_0, β_1, ..., β_d.
Kernel methods yield candidate decision functions
f(x) = Σ_{j=1}^n α_j k(x_j, x) + α_0,
where the free parameters are scalars α_0, α_1, ..., α_n.

Decision functions as linear combinations of kernel evaluations
[Figure: a linear decision surface vs. a linear expansion of kernel surfaces, here k_G(x_i, ·).]
Kernel methods are considered non-linear tools, yet they are not completely nonlinear: there is only one layer of nonlinearity. Kernel methods use the data as a functional base to define decision functions.

Decision functions as linear combinations of kernel evaluations
Given a database {x_i, i = 1, ..., N} and a kernel definition,
f(x) = Σ_{i=1}^N α_i k(x_i, x)
is a predictive function of interest for a new point x, with weights α estimated with a kernel machine (e.g., a support vector machine). Intuitively, kernel methods provide decisions based on how similar a point x is to each instance of the training set.

The Gram matrix perspective
Imagine a little task: you have read 100 novels so far. You would like to know whether you will enjoy reading a new novel. A few options:
read the book...
have friends read it for you, read reviews,
try to guess, based on the novels you have read, whether you will like it.

The Gram matrix perspective
Two distinct approaches. The first: define what features can characterize a book, and map each book in the library onto a vector
x = [x_1, x_2, ..., x_d]^T.
Typically the x_i's can describe: # pages, language, year first published, country, coordinates of the main action, keyword counts, author's prizes, popularity, booksellers' ranking.
Challenge: find a decision function using 100 ratings and features.

The Gram matrix perspective
The second approach: define what makes two novels similar, i.e., define a kernel k which quantifies novel similarities. Map the library onto a Gram matrix
K = [ k(b_i, b_j) ]_{1 ≤ i, j ≤ 100}.
Challenge: find a decision function that takes this 100 × 100 matrix as an input.

The Gram matrix perspective
Given a new novel:
with the features approach, the prediction can be rephrased as: what are the features of this new book? which features have I found in the past to be good indicators of my taste?
with the kernel approach, the prediction is rephrased as: which novels is this book similar or dissimilar to? which pool of books did I find the most influential in defining my tastes accurately?
Kernel methods only use kernel similarities; they do not consider features. Features can help define similarities, but are never considered elsewhere.

The Gram matrix perspective
In kernel methods there is a clear separation between the kernel and the optimization:
[Figure: a dataset {x_1, ..., x_5} is mapped by the kernel k to a 5 × 5 kernel matrix K; convex optimization (thanks to the positive semidefiniteness of K, more later) outputs the α's.]

Kernel Trick
Given a dataset {x_1, ..., x_N} and a new instance x_new, many data analysis methods only depend on x_i^T x_j and x_i^T x_new:
ridge regression,
Principal Component Analysis,
Linear Discriminant Analysis,
Canonical Correlation Analysis,
etc.
Replace these evaluations by k(x_i, x_j) and k(x_i, x_new). This will even work if the x_i's are not in a dot-product space! (strings, graphs, images, etc.)
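Ridge regression is the simplest of the listed methods to kernelize. A minimal sketch (Python with NumPy assumed; the RBF kernel, the sine target, and the regularization value are illustrative assumptions): the data enter only through kernel evaluations k(x_i, x_j) and k(x_i, x_new), never through raw dot products.

```python
import numpy as np

def rbf(A, B, gamma=2.0):
    """Gaussian RBF kernel matrix between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Invented regression task: learn a nonlinear target from 60 samples.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0])

# Kernel ridge regression: alpha = (K + lam I)^{-1} y,
# f(x_new) = sum_i alpha_i k(x_i, x_new).
lam = 1e-3
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

predict = lambda Xnew: alpha @ rbf(X, Xnew)
Xtest = np.array([[0.5], [1.5]])
print(predict(Xtest))    # close to sin(0.5) and sin(1.5)
```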