
Deviations from linear separability
Kernel methods
CSE 250B

Noise
Find a separator that minimizes a convex loss function related to the number of mistakes, e.g. SVM, logistic regression.

Systematic deviation
What to do with this?

Adding new features
The actual boundary is something like x_1 = x_2^2 − 5. This is quadratic in x = (x_1, x_2), but it is linear in Φ(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2).
Basis expansion: embed the data in a higher-dimensional feature space. Then we can use a linear classifier!

Basis expansion for quadratic boundaries
How to deal with a quadratic boundary?
Idea: augment the regular features x = (x_1, x_2, ..., x_d) with
  x_1^2, x_2^2, ..., x_d^2
  x_1 x_2, x_1 x_3, ..., x_{d−1} x_d
giving enhanced data vectors of the form
  Φ(x) = (x_1, ..., x_d, x_1^2, ..., x_d^2, x_1 x_2, ..., x_{d−1} x_d).
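
To make the expansion concrete, here is a minimal Python/NumPy sketch of such a quadratic feature map (the function name quadratic_phi is my own choice, not from the slides):

import numpy as np

def quadratic_phi(x):
    """Map x = (x_1, ..., x_d) to the quadratic feature vector
    (x_1, ..., x_d, x_1^2, ..., x_d^2, x_1 x_2, ..., x_{d-1} x_d).
    Illustrative sketch only."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    squares = x ** 2
    cross = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([x, squares, np.array(cross)])

# Example: d = 2 gives (x_1, x_2, x_1^2, x_2^2, x_1 x_2), as on the slide.
print(quadratic_phi([3.0, 2.0]))   # [3. 2. 9. 4. 6.]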

Quick question
Suppose x = (x_1, ..., x_d). What is the dimension of Φ(x)?

Perceptron revisited
Learning in the higher-dimensional feature space:
  while some y (w · Φ(x) − b) ≤ 0:
    w = w + y Φ(x)
    b = b − y

Perceptron with basis expansion: examples
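
A runnable sketch of this perceptron-with-basis-expansion loop, in Python with NumPy; the helper names and the toy circular dataset are my own illustration, not from the slides:

import numpy as np

def phi_quad_2d(x):
    """Quadratic features for x = (x_1, x_2): (x_1, x_2, x_1^2, x_2^2, x_1 x_2)."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

def perceptron_basis_expansion(X, y, phi, max_epochs=100):
    """Perceptron run in the expanded feature space; predicts sign(w . phi(x) - b)."""
    Phi = np.array([phi(x) for x in X])
    w, b = np.zeros(Phi.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for phi_x, y_i in zip(Phi, y):
            if y_i * (phi_x @ w - b) <= 0:   # misclassified (or on the boundary)
                w += y_i * phi_x             # w = w + y Phi(x)
                b -= y_i                     # b = b - y
                mistakes += 1
        if mistakes == 0:                    # no mistakes: separated in Phi-space
            break
    return w, b

# Made-up data whose true boundary is the circle x_1^2 + x_2^2 = 1 (quadratic in x).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1)
w, b = perceptron_basis_expansion(X, y, phi_quad_2d)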

Perceptron with basis expansion
Learning in the higher-dimensional feature space:
  while some y (w · Φ(x) − b) ≤ 0:
    w = w + y Φ(x)
    b = b − y
Problem: the number of features has now increased dramatically. For MNIST, with a quadratic boundary: from 784 to 308,504.

The kernel trick
Implement this without ever writing down a vector in the higher-dimensional space!
  while some y^(i) (w · Φ(x^(i)) − b) ≤ 0:
    w = w + y^(i) Φ(x^(i))
    b = b − y^(i)
1. Represent w in dual form α = (α_1, ..., α_n):
     w = Σ_{j=1}^n α_j y^(j) Φ(x^(j))
2. Compute w · Φ(x) using the dual representation:
     w · Φ(x) = Σ_{j=1}^n α_j y^(j) (Φ(x^(j)) · Φ(x))
3. Compute Φ(x) · Φ(z) without ever writing out Φ(x) or Φ(z).

Computing dot products
First, in 2-d. Suppose x = (x_1, x_2) and Φ(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2).
Actually, tweak a little: Φ(x) = (1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2).
What is Φ(x) · Φ(z)?

Computing dot products
Suppose x = (x_1, x_2, ..., x_d) and
  Φ(x) = (1, √2 x_1, ..., √2 x_d, x_1^2, ..., x_d^2, √2 x_1 x_2, ..., √2 x_{d−1} x_d).
Then
  Φ(x) · Φ(z) = (1, √2 x_1, ..., √2 x_d, x_1^2, ..., x_d^2, √2 x_1 x_2, ..., √2 x_{d−1} x_d)
                · (1, √2 z_1, ..., √2 z_d, z_1^2, ..., z_d^2, √2 z_1 z_2, ..., √2 z_{d−1} z_d)
              = 1 + 2 Σ_i x_i z_i + Σ_i x_i^2 z_i^2 + 2 Σ_{i<j} x_i x_j z_i z_j
              = (1 + x_1 z_1 + ... + x_d z_d)^2
              = (1 + x · z)^2
For MNIST: we are computing dot products in 308,504-dimensional space, but it takes time proportional to 784, the original dimension!
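
A quick numerical check of this identity (my own illustration, assuming NumPy): the explicit embedding Φ and the kernel (1 + x · z)^2 give the same dot product.

import numpy as np

def phi_quadratic(x):
    """Explicit embedding (1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j for i < j). Sketch only."""
    d = len(x)
    parts = [np.array([1.0]), np.sqrt(2) * x, x ** 2,
             np.array([np.sqrt(2) * x[i] * x[j]
                       for i in range(d) for j in range(i + 1, d)])]
    return np.concatenate(parts)

def k_quadratic(x, z):
    """Quadratic kernel (1 + x . z)^2, computed in the original d-dimensional space."""
    return (1.0 + x @ z) ** 2

rng = np.random.default_rng(1)
x, z = rng.normal(size=5), rng.normal(size=5)
print(phi_quadratic(x) @ phi_quadratic(z))   # same value...
print(k_quadratic(x, z))                     # ...computed in d dimensions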

Kernel Perceptron
Learning from data (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ X × {−1, 1}.

Primal form:
  while there is some i with y^(i) (w · Φ(x^(i)) − b) ≤ 0:
    w = w + y^(i) Φ(x^(i))
    b = b − y^(i)

Dual form: w = Σ_j α_j y^(j) Φ(x^(j)), where α ∈ R^n.
  α = 0 and b = 0
  while some i has y^(i) (Σ_j α_j y^(j) Φ(x^(j)) · Φ(x^(i)) − b) ≤ 0:
    α_i = α_i + 1
    b = b − y^(i)
To classify a new point x: sign(Σ_j α_j y^(j) Φ(x^(j)) · Φ(x) − b).

Does this work with SVMs?
(primal)
  min_{w ∈ R^d, b ∈ R, ξ ∈ R^n}  ‖w‖^2 + C Σ_{i=1}^n ξ_i
  s.t.: y^(i) (w · x^(i) − b) ≥ 1 − ξ_i for all i = 1, 2, ..., n
        ξ ≥ 0
(dual)
  max_{α ∈ R^n}  Σ_{i=1}^n α_i − Σ_{i,j=1}^n α_i α_j y^(i) y^(j) (x^(i) · x^(j))
  s.t.: Σ_{i=1}^n α_i y^(i) = 0
        0 ≤ α_i ≤ C
Solution: w = Σ_i α_i y^(i) x^(i).

Kernel SVM
1. Basis expansion. Mapping x → Φ(x).
2. Learning. Solve the dual problem:
     max_{α ∈ R^n}  Σ_{i=1}^n α_i − Σ_{i,j=1}^n α_i α_j y^(i) y^(j) (Φ(x^(i)) · Φ(x^(j)))
     s.t.: Σ_{i=1}^n α_i y^(i) = 0
           0 ≤ α_i ≤ C
   This yields w = Σ_i α_i y^(i) Φ(x^(i)). The offset b also follows.
3. Classification. Given a new point x, classify as
     sign(Σ_i α_i y^(i) (Φ(x^(i)) · Φ(x)) − b).

Kernel Perceptron vs. Kernel SVM: examples
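
A compact Python/NumPy sketch of the dual-form (kernel) perceptron above; the function and variable names are my own, and the quadratic kernel from the earlier slides is used as an example:

import numpy as np

def kernel_perceptron(X, y, k, max_epochs=100):
    """Dual-form perceptron: w is kept implicitly as sum_j alpha_j y^(j) Phi(x^(j))."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.dot(alpha * y, K[:, i]) - b) <= 0:  # mistake on point i
                alpha[i] += 1                                  # alpha_i = alpha_i + 1
                b -= y[i]                                      # b = b - y^(i)
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

def classify(x, X, y, alpha, b, k):
    """sign( sum_j alpha_j y^(j) k(x^(j), x) - b )"""
    return np.sign(sum(a * yj * k(xj, x) for a, yj, xj in zip(alpha, y, X)) - b)

# Example kernel: the quadratic kernel from the earlier slides.
quadratic_kernel = lambda x, z: (1.0 + x @ z) ** 2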

Polynomial decision boundaries
When the decision surface is a polynomial of order p:
Let Φ(x) consist of all terms of order ≤ p, such as x_1 x_2^2 x_3^{p−3}. (How many such terms are there, roughly?)
Degree-p polynomial in x  ⇔  linear in Φ(x).
The same trick works: Φ(x) · Φ(z) = (1 + x · z)^p.
Kernel function: k(x, z) = (1 + x · z)^p.

String kernels
Sequence data:
  text documents
  speech signals
  protein sequences
Each data point is a sequence of arbitrary length. This yields input spaces like
  X = {A, C, G, T}*
  X = {English words}*
What kind of embedding Φ(x) is suitable for variable-length sequences x? We will use an infinite-dimensional embedding!

String kernels, cont'd
For each substring s, define the feature
  Φ_s(x) = number of times substring s appears in x
and let Φ(x) be a vector with one coordinate for each string:
  Φ(x) = (Φ_s(x) : all strings s).
Example: the embedding of "aardvark" includes the features Φ_ar(aardvark) = 2, Φ_th(aardvark) = 0, ...
A linear classifier based on such features is very expressive.
To compute k(x, z) = Φ(x) · Φ(z):
  for each substring s of x: count how often s appears in z.
Using dynamic programming, this takes time O(|x| · |z|).
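
A naive sketch of this substring-count kernel in Python; it enumerates substrings directly rather than using the O(|x| · |z|) dynamic program mentioned above, and the helper names are my own:

from collections import Counter

def substring_counts(x):
    """Counter of all non-empty substrings of x, i.e. the nonzero coordinates of Phi(x)."""
    return Counter(x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1))

def string_kernel(x, z):
    """k(x, z) = Phi(x) . Phi(z) = sum over substrings s of count_x(s) * count_z(s).
    Naive illustration; the slide's dynamic program is faster."""
    cx, cz = substring_counts(x), substring_counts(z)
    return sum(n * cz[s] for s, n in cx.items())

print(substring_counts("aardvark")["ar"])   # 2, as in the Phi_ar(aardvark) = 2 example
print(string_kernel("aardvark", "ark"))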

The kernel function
We never explicitly construct the embedding Φ(x). What we actually use is the kernel function
  k(x, z) = Φ(x) · Φ(z).
Think of k(x, z) as a measure of similarity between x and z. Rewrite the learning algorithm and the final classifier in terms of k.
Kernel Perceptron:
  α = 0 and b = 0
  while some i has y^(i) (Σ_j α_j y^(j) k(x^(j), x^(i)) − b) ≤ 0:
    α_i = α_i + 1
    b = b − y^(i)
To classify a new point x: sign(Σ_j α_j y^(j) k(x^(j), x) − b).

Kernel SVM, revisited
1. Kernel function. Define a similarity function k(x, z).
2. Learning. Solve the dual problem:
     max_{α ∈ R^n}  Σ_{i=1}^n α_i − Σ_{i,j=1}^n α_i α_j y^(i) y^(j) k(x^(i), x^(j))
     s.t.: Σ_{i=1}^n α_i y^(i) = 0
           0 ≤ α_i ≤ C
   This yields α. The offset b also follows.
3. Classification. Given a new point x, classify as
     sign(Σ_i α_i y^(i) k(x^(i), x) − b).

Choosing the kernel function
The final classifier is a similarity-weighted vote,
  F(x) = α_1 y^(1) k(x^(1), x) + ... + α_n y^(n) k(x^(n), x)
(plus an offset term, b). Can we choose k to be any similarity function? Not quite: we need k(x, z) = Φ(x) · Φ(z) for some embedding Φ.
Mercer's condition: this is the same as requiring that for any finite set of points x^(1), ..., x^(m), the m × m similarity matrix K given by K_ij = k(x^(i), x^(j)) is positive semidefinite.

The RBF kernel
A popular similarity function: the Gaussian kernel or RBF kernel
  k(x, z) = e^{−‖x − z‖^2 / s^2},
where s is an adjustable scale parameter.
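
As a small illustration of Mercer's condition (my own example, assuming NumPy): the RBF Gram matrix of a random point set has no negative eigenvalues, up to numerical round-off.

import numpy as np

def rbf_kernel(x, z, s=1.0):
    """Gaussian / RBF kernel k(x, z) = exp(-||x - z||^2 / s^2) with scale s."""
    return np.exp(-np.sum((x - z) ** 2) / s ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                       # 30 random points in R^4
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)                    # K is symmetric
print(eigvals.min())                               # >= 0 up to round-off (Mercer's condition)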

RBF kernel: examples

The scale parameter
Recall the prediction function: F(x) = α_1 y^(1) k(x^(1), x) + ... + α_n y^(n) k(x^(n), x).
For the RBF kernel, k(x, z) = e^{−‖x − z‖^2 / s^2}:
1. How does this function behave as s → ∞?
2. How does this function behave as s → 0?
3. As we get more data, should we increase or decrease s?

Kernels: postscript
1. Customized kernels
   For different domains (NLP, biology, speech, ...).
   Over different structures (sequences, sets, graphs, ...).
2. Learning the kernel function
   Given a set of plausible kernels, find a linear combination of them that works well.
3. Speeding up learning and prediction
   The n × n kernel matrix k(x^(i), x^(j)) is a bottleneck for large n.
   One idea: go back to the primal space! Replace the embedding Φ by a low-dimensional mapping Φ̃ such that Φ̃(x) · Φ̃(z) ≈ Φ(x) · Φ(z). This can be done, for instance, by writing Φ in the Fourier basis and then randomly sampling features; a sketch follows below.
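
A minimal sketch of that last idea for the RBF kernel, assuming the standard random Fourier features construction (the slide does not spell out the details, and the parameter names here are my own):

import numpy as np

def random_fourier_features(X, D=500, s=1.0, seed=0):
    """Low-dimensional map Phi_tilde with Phi_tilde(x) . Phi_tilde(z) ~= exp(-||x-z||^2 / s^2).
    Illustrative sketch: frequencies are sampled from the Fourier transform of the RBF kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0) / s, size=(d, D))   # random frequencies omega_1..omega_D
    c = rng.uniform(0.0, 2.0 * np.pi, size=D)             # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + c)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, D=20000, s=1.0)
approx = Z @ Z.T                                           # Phi_tilde(x) . Phi_tilde(z)
exact = np.exp(-np.sum((X[:, None] - X[None]) ** 2, -1))   # exp(-||x-z||^2 / s^2) with s = 1
print(np.max(np.abs(approx - exact)))                      # small for large D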