Kernel methods CSE 250B


Deviations from linear separability

Noise: find a separator that minimizes a convex loss function related to the number of mistakes, e.g. SVM, logistic regression.

Systematic deviation: [figure: positive and negative points separated by a curved boundary] What to do with this?

Adding new features

[figure: the same curved boundary] The actual boundary is something like x_1 = x_2^2 + 5. This is quadratic in x = (x_1, x_2), but it is linear in Φ(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2).

Basis expansion: embed the data in a higher-dimensional feature space. Then we can use a linear classifier!

Basis expansion for quadratic boundaries

How to deal with a quadratic boundary? Idea: augment the regular features x = (x_1, x_2, ..., x_d) with the squares x_1^2, x_2^2, ..., x_d^2 and the cross terms x_1 x_2, x_1 x_3, ..., x_{d-1} x_d, giving enhanced data vectors of the form Φ(x) = (x_1, ..., x_d, x_1^2, ..., x_d^2, x_1 x_2, ..., x_{d-1} x_d).

Quick question

Suppose x = (x_1, x_2, x_3). What is the dimension of Φ(x)? More generally, if x = (x_1, ..., x_d), what is the dimension of Φ(x)?

Perceptron revisited

Learning in the higher-dimensional feature space:

w = 0 and b = 0
while some (x, y) has y(w · Φ(x) + b) ≤ 0:
    w = w + y Φ(x)
    b = b + y
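As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this perceptron with the explicit quadratic feature map Φ from the basis-expansion slide; the toy data follows the boundary x_1 = x_2^2 + 5 used earlier, and all names are ours.

import numpy as np

def phi(x):
    """Quadratic basis expansion: original features, squares, and cross terms."""
    d = len(x)
    cross = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([x, x**2, cross])

def perceptron(X, y, epochs=200):
    """Primal perceptron run in the expanded feature space."""
    F = np.array([phi(x) for x in X])
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for f, yi in zip(F, y):
            if yi * (w @ f + b) <= 0:        # misclassified (or on the boundary)
                w, b = w + yi * f, b + yi
                mistakes += 1
        if mistakes == 0:                    # converged: data separated in Φ-space
            break
    return w, b

# Toy data with the quadratic boundary x_1 = x_2^2 + 5.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 15, 300), rng.uniform(-3, 3, 300)])
y = np.where(X[:, 0] > X[:, 1] ** 2 + 5, 1, -1)
w, b = perceptron(X, y)
preds = np.sign([w @ phi(x) + b for x in X])
print("training accuracy:", np.mean(preds == y))   # typically close to 1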

Perceptron with basis expansion: examples [figures: example decision boundaries learned by the perceptron with basis expansion]

Perceptron with basis expansion

Learning in the higher-dimensional feature space:

w = 0 and b = 0
while some (x, y) has y(w · Φ(x) + b) ≤ 0:
    w = w + y Φ(x)
    b = b + y

Problem: the number of features has now increased dramatically. For MNIST with a quadratic boundary, it goes from 784 to 308,504 (784 original features + 784 squares + 784 · 783 / 2 = 306,936 cross terms).

The kernel trick: implement this without ever writing down a vector in the higher-dimensional space!

The kernel trick

w = 0 and b = 0
while some i has y^(i) (w · Φ(x^(i)) + b) ≤ 0:
    w = w + y^(i) Φ(x^(i))
    b = b + y^(i)

1. Represent w in dual form α = (α_1, ..., α_n):
   w = Σ_{j=1}^n α_j y^(j) Φ(x^(j))

2. Compute w · Φ(x) using the dual representation:
   w · Φ(x) = Σ_{j=1}^n α_j y^(j) (Φ(x^(j)) · Φ(x))

3. Compute Φ(x) · Φ(z) without ever writing out Φ(x) or Φ(z).

Computing dot products

First, in two dimensions. Suppose x = (x_1, x_2) and Φ(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2). Actually, tweak it a little:
Φ(x) = (1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2)
What is Φ(x) · Φ(z)?

Now in general. Suppose x = (x_1, x_2, ..., x_d) and
Φ(x) = (1, √2 x_1, ..., √2 x_d, x_1^2, ..., x_d^2, √2 x_1 x_2, ..., √2 x_{d-1} x_d).
Then
Φ(x) · Φ(z) = 1 + 2 Σ_i x_i z_i + Σ_i x_i^2 z_i^2 + 2 Σ_{i<j} x_i x_j z_i z_j
            = (1 + x_1 z_1 + ... + x_d z_d)^2
            = (1 + x · z)^2.
For MNIST: we are computing dot products in 308,504-dimensional space, but it takes time proportional to 784, the original dimension!
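A quick numerical check of the identity Φ(x) · Φ(z) = (1 + x · z)^2, using the √2-scaled embedding defined above (a sketch; the helper name is ours).

import numpy as np

def phi(x):
    """(1, √2 x_1, ..., √2 x_d, x_1^2, ..., x_d^2, √2 x_i x_j for i < j)."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x**2, cross])

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
print(phi(x) @ phi(z))       # these two numbers agree (up to floating-point error)
print((1 + x @ z) ** 2)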

Kernel Perceptron

Learning from data (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ X × {−1, +1}.

Primal form:
w = 0 and b = 0
while there is some i with y^(i) (w · Φ(x^(i)) + b) ≤ 0:
    w = w + y^(i) Φ(x^(i))
    b = b + y^(i)

Dual form: w = Σ_j α_j y^(j) Φ(x^(j)), where α ∈ R^n.
α = 0 and b = 0
while some i has y^(i) (Σ_j α_j y^(j) Φ(x^(j)) · Φ(x^(i)) + b) ≤ 0:
    α_i = α_i + 1
    b = b + y^(i)

To classify a new point x: sign(Σ_j α_j y^(j) Φ(x^(j)) · Φ(x) + b).
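A NumPy sketch of the dual form (names and toy data are ours), using the quadratic kernel k(x, z) = (1 + x · z)^2 from the previous slide in place of the explicit dot products Φ(x^(j)) · Φ(x^(i)).

import numpy as np

def quadratic_kernel(x, z):
    return (1.0 + x @ z) ** 2     # equals Φ(x) · Φ(z) for the quadratic embedding

def kernel_perceptron(X, y, k, epochs=200):
    """Dual-form perceptron: w is kept implicitly as Σ_j α_j y^(j) Φ(x^(j))."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])   # Gram matrix
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
                alpha[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

def predict(x_new, X, y, alpha, b, k):
    return np.sign(np.sum(alpha * y * np.array([k(xj, x_new) for xj in X])) + b)

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 15, 100), rng.uniform(-3, 3, 100)])
y = np.where(X[:, 0] > X[:, 1] ** 2 + 5, 1, -1)
alpha, b = kernel_perceptron(X, y, quadratic_kernel)
print(predict(X[0], X, y, alpha, b, quadratic_kernel), y[0])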

Does this work with SVMs?

(primal)
min_{w ∈ R^d, b ∈ R, ξ ∈ R^n}  (1/2) ‖w‖^2 + C Σ_{i=1}^n ξ_i
s.t.  y^(i) (w · x^(i) + b) ≥ 1 − ξ_i   for all i = 1, 2, ..., n
      ξ ≥ 0

(dual)
max_{α ∈ R^n}  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y^(i) y^(j) (x^(i) · x^(j))
s.t.  Σ_{i=1}^n α_i y^(i) = 0
      0 ≤ α_i ≤ C

Solution: w = Σ_i α_i y^(i) x^(i).

Kernel SVM

1. Basis expansion. Map x ↦ Φ(x).

2. Learning. Solve the dual problem:
   max_{α ∈ R^n}  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y^(i) y^(j) (Φ(x^(i)) · Φ(x^(j)))
   s.t.  Σ_{i=1}^n α_i y^(i) = 0,  0 ≤ α_i ≤ C
   This yields w = Σ_i α_i y^(i) Φ(x^(i)). The offset b also follows.

3. Classification. Given a new point x, classify as sign(Σ_i α_i y^(i) (Φ(x^(i)) · Φ(x)) + b).
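In practice this dual is rarely solved by hand. As a sketch (data set and parameter choices are ours), scikit-learn's SVC with a polynomial kernel uses k(x, z) = (gamma · x · z + coef0)^degree, so degree=2, gamma=1, coef0=1 reproduces the quadratic kernel (1 + x · z)^2 from these slides.

import numpy as np
from sklearn.svm import SVC

# Toy data with the quadratic boundary x_1 = x_2^2 + 5.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 15, 300), rng.uniform(-3, 3, 300)])
y = np.where(X[:, 0] > X[:, 1] ** 2 + 5, 1, -1)

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))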

Kernel Perceptron vs. Kernel SVM: examples [figures: decision boundaries found by the kernel Perceptron and by the kernel SVM on the same data]

Polynomial decision boundaries

When the decision surface is a polynomial of order p: [figure: data with a curved boundary]

Let Φ(x) consist of all terms of order up to p, such as x_1 x_2^2 x_3^{p−3}. (How many such terms are there, roughly?)

A degree-p polynomial in x is linear in Φ(x). The same trick works: Φ(x) · Φ(z) = (1 + x · z)^p.

Kernel function: k(x, z) = (1 + x · z)^p.

String kernels

Sequence data: text documents, speech signals, protein sequences. Each data point is a sequence of arbitrary length. This yields input spaces like X = {A, C, G, T}* and X = {English words}*.

What kind of embedding Φ(x) is suitable for variable-length sequences x? We will use an infinite-dimensional embedding!

String kernels, cont'd

For each string s, define the feature Φ_s(x) = # of times s appears as a substring of x, and let Φ(x) be a vector with one coordinate for each string: Φ(x) = (Φ_s(x) : all strings s).

Example: the embedding of aardvark includes the features Φ_ar(aardvark) = 2, Φ_th(aardvark) = 0, ...

A linear classifier based on such features is very expressive.

To compute k(x, z) = Φ(x) · Φ(z): for each substring s of x, count how often s appears in z. Using dynamic programming, this takes time O(|x| |z|).
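A minimal, unoptimized sketch of this substring-count kernel (names are ours). It follows the recipe literally, enumerating the distinct substrings of x and counting each in z, rather than the O(|x| |z|) dynamic program, so it is only meant for short strings.

from collections import Counter

def substrings(s):
    """All contiguous, non-empty substrings of s (with multiplicity)."""
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

def string_kernel(x, z):
    """k(x, z) = Σ over strings s of (# occurrences of s in x) · (# occurrences in z)."""
    cx, cz = Counter(substrings(x)), Counter(substrings(z))
    return sum(count * cz[s] for s, count in cx.items())

print(string_kernel("aardvark", "ark"))   # shared substrings: a, r, k, ar, rk, ark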

The kernel function

We never explicitly construct the embedding Φ(x). What we actually use is the kernel function k(x, z) = Φ(x) · Φ(z). Think of k(x, z) as a measure of similarity between x and z, and rewrite the learning algorithm and the final classifier in terms of k.

Kernel Perceptron:
α = 0 and b = 0
while some i has y^(i) (Σ_j α_j y^(j) k(x^(j), x^(i)) + b) ≤ 0:
    α_i = α_i + 1
    b = b + y^(i)

To classify a new point x: sign(Σ_j α_j y^(j) k(x^(j), x) + b).

Kernel SVM, revisited

1. Kernel function. Define a similarity function k(x, z).

2. Learning. Solve the dual problem:
   max_{α ∈ R^n}  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y^(i) y^(j) k(x^(i), x^(j))
   s.t.  Σ_{i=1}^n α_i y^(i) = 0,  0 ≤ α_i ≤ C
   This yields α. The offset b also follows.

3. Classification. Given a new point x, classify as sign(Σ_i α_i y^(i) k(x^(i), x) + b).
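Any kernel of this kind can be plugged into an off-the-shelf SVM solver by precomputing the Gram matrix. A sketch with scikit-learn (the kernel, data, and labels are placeholders, purely to illustrate the API):

import numpy as np
from sklearn.svm import SVC

def k(x, z):
    return (1.0 + x @ z) ** 2           # any valid kernel function could go here

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 5)), rng.choice([-1, 1], 100)
X_test = rng.normal(size=(10, 5))

K_train = np.array([[k(a, c) for c in X_train] for a in X_train])   # n x n Gram matrix
K_test = np.array([[k(a, c) for c in X_train] for a in X_test])     # test-vs-train kernels

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y_train)
print(clf.predict(K_test))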

Choosing the kernel function

The final classifier is a similarity-weighted vote,
F(x) = α_1 y^(1) k(x^(1), x) + ... + α_n y^(n) k(x^(n), x)
(plus an offset term, b).

Can we choose k to be any similarity function? Not quite: we need k(x, z) = Φ(x) · Φ(z) for some embedding Φ.

Mercer's condition: this is the same as requiring that for any finite set of points x^(1), ..., x^(m), the m × m similarity matrix K given by K_ij = k(x^(i), x^(j)) is positive semidefinite.
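A quick numerical illustration of Mercer's condition (a sketch; data and tolerance are arbitrary): a Gram matrix of the quadratic kernel has no negative eigenvalues, whereas a matrix built from an arbitrary similarity score generally does.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

K = (1.0 + X @ X.T) ** 2                     # Gram matrix of k(x, z) = (1 + x·z)^2
print(np.linalg.eigvalsh(K).min() >= -1e-6)  # True: positive semidefinite

D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(np.linalg.eigvalsh(-D).min())          # negative: -distance is not a valid kernel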

The RBF kernel

A popular similarity function: the Gaussian kernel, or RBF kernel,
k(x, z) = exp(−‖x − z‖^2 / s^2),
where s is an adjustable scale parameter.
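For reference, a sketch of how this parameterization maps onto scikit-learn, whose RBF kernel is written exp(−gamma ‖x − z‖^2): set gamma = 1/s^2 (the data here is arbitrary).

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

s = 2.0
rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

manual = np.exp(-np.sum((x - z) ** 2) / s**2)
library = rbf_kernel(x[None, :], z[None, :], gamma=1.0 / s**2)[0, 0]
print(manual, library)    # the two values agree; SVC(kernel="rbf", gamma=1/s**2) uses the same kernel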

RBF kernel: examples [figures: decision boundaries produced by the RBF kernel on example data sets]

The scale parameter

Recall the prediction function F(x) = α_1 y^(1) k(x^(1), x) + ... + α_n y^(n) k(x^(n), x). For the RBF kernel, k(x, z) = exp(−‖x − z‖^2 / s^2):

1. How does this function behave as s → ∞?
2. How does this function behave as s → 0?
3. As we get more data, should we increase or decrease s?

Kernels: postscript

1. Customized kernels: for different domains (NLP, biology, speech, ...) and over different structures (sequences, sets, graphs, ...).

2. Learning the kernel function: given a set of plausible kernels, find a linear combination of them that works well.

3. Speeding up learning and prediction: the n × n kernel matrix k(x^(i), x^(j)) is a bottleneck for large n. One idea: go back to the primal space! Replace the embedding Φ by a low-dimensional mapping Φ̃ such that Φ̃(x) · Φ̃(z) ≈ Φ(x) · Φ(z). This can be done, for instance, by writing Φ in the Fourier basis and then randomly sampling features (see the sketch below).
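A minimal sketch of that random-features idea for the RBF kernel k(x, z) = exp(−‖x − z‖^2 / s^2); this is the random Fourier features construction of Rahimi and Recht, and the dimensions, seed, and names here are ours. Draw the rows of W from N(0, (2/s^2) I) and phases b uniform on [0, 2π), and use Φ̃(x) = √(2/D) cos(Wᵀx + b); then Φ̃(x) · Φ̃(z) ≈ k(x, z).

import numpy as np

def rff_map(X, D, s, seed=0):
    """Random Fourier features approximating k(x, z) = exp(-||x - z||^2 / s^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2) / s, size=(d, D))   # samples from the kernel's spectral density
    b = rng.uniform(0, 2 * np.pi, size=D)               # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Sanity check: feature dot products should approximate the exact RBF kernel.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 10))
s = 2.0
Z = rff_map(X, D=20000, s=s)
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / s**2)
print(np.abs(Z @ Z.T - K_exact).max())   # small: Monte Carlo error of order 1/sqrt(D)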