Lecture 11. Kernel Methods


Lecture 11. Kernel Methods. COMP90051 Statistical Machine Learning, Semester 2, 2017. Lecturer: Andrey Kan. Copyright: University of Melbourne.

This lecture: the kernel trick (efficient computation of a dot product in the transformed feature space); modular learning (separating the learning module from the feature space transformation); constructing kernels (an overview of popular kernels and their properties); kernel as a similarity measure (extending machine learning beyond conventional data structures).

The Kernel Trick. An approach that we introduce in the context of SVMs; however, the approach is compatible with a large number of other methods.

Handling non-linear data with SVM. Method 1: soft-margin SVM. Method 2: feature space transformation: map the data into a new feature space and run hard-margin or soft-margin SVM in the new space; the decision boundary is then non-linear in the original space.

Example of feature transformation. Consider a binary classification problem where each example has features [x_1, x_2] and the data is not linearly separable. Now add a feature x_3 = x_1^2 + x_2^2, so each point becomes [x_1, x_2, x_1^2 + x_2^2]: the data is now linearly separable.
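To make this concrete, here is a minimal sketch (illustrative data and model choices, not part of the lecture) using scikit-learn's Perceptron: a linear classifier struggles on circular data with the original two features, but separates it once x_3 = x_1^2 + x_2^2 is appended.

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
r2 = X[:, 0]**2 + X[:, 1]**2
mask = np.abs(r2 - 0.5) > 0.1                      # keep a margin around the circle r^2 = 0.5
X, y = X[mask], (r2[mask] > 0.5).astype(int)       # class = outside the circle

lin = Perceptron(max_iter=1000, tol=None).fit(X, y)
print("original features [x1, x2]      :", lin.score(X, y))    # well below 1.0

X3 = np.c_[X, X[:, 0]**2 + X[:, 1]**2]             # add x3 = x1^2 + x2^2
aug = Perceptron(max_iter=1000, tol=None).fit(X3, y)
print("augmented [x1, x2, x1^2 + x2^2] :", aug.score(X3, y))    # 1.0 (or very close): separable
```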

Naïve workflow: choose/design a linear model; choose/design a high-dimensional transformation φ(x), hoping that after adding many varied features some of them will make the data linearly separable; compute φ(x) for each training example and for each new instance; train the classifier / make predictions. Problem: it is impractical or impossible to compute φ(x) when φ(x) is high- or infinite-dimensional.

Hard margin SVM. Training: find λ that solve
argmax_λ Σ_{i=1}^n λ_i − ½ Σ_{i=1}^n Σ_{j=1}^n λ_i λ_j y_i y_j (x_i · x_j)   [dot product]
s.t. λ_i ≥ 0 and Σ_{i=1}^n λ_i y_i = 0.
Making predictions: classify a new instance x based on the sign of
s = b + Σ_{i=1}^n λ_i y_i (x_i · x)   [dot product]
Here b can be found by noting that for any training example j with λ_j > 0 (a support vector) we must have y_j (b + Σ_{i=1}^n λ_i y_i (x_i · x_j)) = 1.
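The dual above maps directly onto a generic constrained optimiser. Below is a minimal sketch (not the lecture's reference implementation) that solves the hard-margin dual with SciPy's general-purpose SLSQP solver rather than a dedicated QP solver; the toy data and function name are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm_dual(X, y):
    """Solve argmax_lambda sum_i lambda_i - 1/2 sum_ij lambda_i lambda_j y_i y_j (x_i . x_j)."""
    n = len(y)
    G = (y[:, None] * X) @ (y[:, None] * X).T        # G_ij = y_i y_j (x_i . x_j)
    res = minimize(lambda lam: -(lam.sum() - 0.5 * lam @ G @ lam),
                   x0=np.zeros(n),
                   bounds=[(0, None)] * n,                                  # lambda_i >= 0
                   constraints={'type': 'eq', 'fun': lambda lam: lam @ y})  # sum_i lambda_i y_i = 0
    lam = res.x
    j = int(np.argmax(lam))                          # a support vector (lambda_j > 0)
    b = y[j] - (lam * y) @ (X @ X[j])                # from y_j (b + sum_i lambda_i y_i x_i.x_j) = 1
    return lam, b

# toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam, b = hard_margin_svm_dual(X, y)
print(np.sign(b + (lam * y) @ (X @ np.array([1.5, 1.5]))))   # prediction for a new point: +1
```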

Hard margin SVM (in the transformed feature space). Training: find λ that solve
argmax_λ Σ_{i=1}^n λ_i − ½ Σ_{i=1}^n Σ_{j=1}^n λ_i λ_j y_i y_j (φ(x_i) · φ(x_j))
s.t. λ_i ≥ 0 and Σ_{i=1}^n λ_i y_i = 0.
Making predictions: classify a new instance x based on the sign of
s = b + Σ_{i=1}^n λ_i y_i (φ(x_i) · φ(x))
Here b can be found by noting that for any training example j with λ_j > 0 (a support vector) we must have y_j (b + Σ_{i=1}^n λ_i y_i (φ(x_i) · φ(x_j))) = 1.

Observation: dot product representation. Both parameter estimation and computing predictions depend on the data only in the form of a dot product. In the original space, u · v = Σ_{i=1}^m u_i v_i. In the transformed space, φ(u) · φ(v) = Σ_{i=1}^ℓ φ(u)_i φ(v)_i. A kernel is a function that can be expressed as a dot product in some feature space: K(u, v) = φ(u) · φ(v).

Example of a kernel. For some feature maps there exists a shortcut computation of the dot product via kernels. For example, consider two vectors in the original space, u = [u_1] and v = [v_1], and a transformation φ(x) = [x_1^2, √(2c) x_1, c]. So φ(u) = [u_1^2, √(2c) u_1, c] and φ(v) = [v_1^2, √(2c) v_1, c]. Then φ(u) · φ(v) = u_1^2 v_1^2 + 2c u_1 v_1 + c^2. This can alternatively be computed as φ(u) · φ(v) = (u_1 v_1 + c)^2. Here K(u, v) = (u_1 v_1 + c)^2 is a kernel.
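As a quick numerical sanity check (illustrative values only), the one-dimensional example can be verified directly: the explicit feature map and the shortcut give the same number.

```python
import numpy as np

c = 2.0
phi = lambda x: np.array([x[0]**2, np.sqrt(2 * c) * x[0], c])   # feature map from the slide

u, v = np.array([3.0]), np.array([-1.5])
explicit = phi(u) @ phi(v)              # dot product in the transformed space
shortcut = (u[0] * v[0] + c) ** 2       # kernel K(u, v) = (u1*v1 + c)^2
print(explicit, shortcut)               # identical values
```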

The kernel trick. Consider two training points x_i and x_j and their dot product in the transformed space. Define this quantity as k_ij ≡ φ(x_i) · φ(x_j). It can be computed as: 1. compute φ(x_i); 2. compute φ(x_j); 3. compute k_ij = φ(x_i) · φ(x_j). However, for some transformations φ there exists a function that gives exactly the same answer, K(x_i, x_j) = k_ij. In other words, sometimes there is a different way (a "shortcut") to compute the same quantity k_ij. This different way does not involve steps 1–3; in particular, we do not need to compute φ(x_i) and φ(x_j). Usually kernels can be computed in O(m), whereas computing φ(x) requires O(ℓ), where ℓ ≫ m or even ℓ = ∞.

Hard margin SVM (repeated). Training: find λ that solve
argmax_λ Σ_{i=1}^n λ_i − ½ Σ_{i=1}^n Σ_{j=1}^n λ_i λ_j y_i y_j (φ(x_i) · φ(x_j))
s.t. λ_i ≥ 0 and Σ_{i=1}^n λ_i y_i = 0.
Making predictions: classify a new instance x based on the sign of
s = b + Σ_{i=1}^n λ_i y_i (φ(x_i) · φ(x))
Here b can be found by noting that for any training example j with λ_j > 0 (a support vector) we must have y_j (b + Σ_{i=1}^n λ_i y_i (φ(x_i) · φ(x_j))) = 1.

Hard margin SVM with a kernel. Training: find λ that solve
argmax_λ Σ_{i=1}^n λ_i − ½ Σ_{i=1}^n Σ_{j=1}^n λ_i λ_j y_i y_j K(x_i, x_j)
s.t. λ_i ≥ 0 and Σ_{i=1}^n λ_i y_i = 0.
Making predictions: classify a new instance x based on the sign of
s = b + Σ_{i=1}^n λ_i y_i K(x_i, x)
(The feature mapping is implied by the kernel.)
Here b can be found by noting that for any training example j with λ_j > 0 (a support vector) we must have y_j (b + Σ_{i=1}^n λ_i y_i K(x_i, x_j)) = 1.
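One way to see this kernelised dual in action (an assumption for illustration, not part of the lecture slides) is scikit-learn's SVC: a very large C approximates the hard margin, the kernel replaces every dot product, dual_coef_ stores λ_i y_i for the support vectors and intercept_ stores b.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1, -1)      # non-linear concept

clf = SVC(kernel="rbf", gamma=1.0, C=1e6).fit(X, y)     # C -> infinity ~ hard margin
print(clf.n_support_, clf.intercept_)                   # support vector counts and b
print(clf.predict([[0.1, 0.2], [2.0, 2.0]]))            # sign of b + sum_i lambda_i y_i K(x_i, x)
```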

ANN approach to non-linearity. [Diagram: a network with inputs x_1, ..., x_m, hidden units u_1, ..., u_p and outputs z_1, ..., z_q.] In this ANN, the elements of u can be thought of as the transformed input, u = φ(x). This transformation is constructed explicitly by varying the ANN topology; moreover, the weights are learned from the data.

SVM approach to non-linearity. Choosing a kernel implies some transformation φ(x). Unlike in the ANN case, we don't have control over the relative weights of the components of φ(x). However, the advantage of using kernels is that we don't need to actually compute the components of φ(x). This is beneficial when the transformed space is high-dimensional, and it even makes it possible to transform the data into an infinite-dimensional space. Kernels also offer an additional advantage, discussed in the last part of this lecture.

Checkpoint. Which of the following statements is always true? a) Any method that uses a feature space transformation φ(x) uses kernels. b) Support vectors are points from the training set. c) The feature mapping φ(x) makes the data linearly separable.

Modular Learning. Separating the learning module from the feature space transformation.

Representer theorem. Theorem: a large class of linear methods can be formulated (represented) such that both training and making predictions require the data only in the form of a dot product. Hard margin SVM is one example of such a method; the theorem predicts that there are many more, for example: ridge regression, logistic regression, the perceptron, principal component analysis, and so on.

Kernelised perceptron (1/3). When an example is classified correctly, the weights are unchanged. When it is misclassified, the update is w^(k+1) = w^(k) + η(±x), where η > 0 is called the learning rate. If y = 1 but s < 0: w_i ← w_i + η x_i and w_0 ← w_0 + η. If y = −1 but s ≥ 0: w_i ← w_i − η x_i and w_0 ← w_0 − η. Suppose the weights are initially set to 0. First update: w = η y_{i1} x_{i1}. Second update: w = η y_{i1} x_{i1} + η y_{i2} x_{i2}. Third update: w = η y_{i1} x_{i1} + η y_{i2} x_{i2} + η y_{i3} x_{i3}, etc.

Kernelised perceptron (2/3). The weights always take the form w = Σ_{i=1}^n α_i y_i x_i, where the α_i are some coefficients: perceptron weights are always a linear combination of the data! Recall that the prediction for a new point x is based on the sign of w_0 + w · x. Substituting w, we get w_0 + Σ_{i=1}^n α_i y_i (x_i · x). The dot product x_i · x can be replaced with a kernel.

Kernelised perceptron (3/3). Choose an initial guess w^(0), k = 0. Set α = 0. For t from 1 to T (epochs): for each training example (x_i, y_i), predict based on w_0 + Σ_{j=1}^n α_j y_j (x_i · x_j); if misclassified, update α_i ← α_i + 1.
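Below is a minimal sketch of the kernelised perceptron above, under stated assumptions: labels in {−1, +1}, an RBF kernel, η = 1, and the bias handled with the w_0 ← w_0 ± η update from the first slide of this sequence. Names and defaults are illustrative.

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def kernel_perceptron(X, y, kernel=rbf_kernel, epochs=10):
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
    for _ in range(epochs):
        for i in range(n):
            s = b + np.sum(alpha * y * K[:, i])       # w0 + sum_j alpha_j y_j K(x_j, x_i)
            misclassified = (y[i] == 1 and s < 0) or (y[i] == -1 and s >= 0)
            if misclassified:
                alpha[i] += 1                         # alpha_i <- alpha_i + 1
                b += y[i]                             # bias update with eta = 1
    return alpha, b

def predict(x, X, y, alpha, b, kernel=rbf_kernel):
    return np.sign(b + sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)))
```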

Modular learning. All information about the feature mapping is concentrated within the kernel. In order to use a different feature mapping, simply change the kernel function. Algorithm design decouples into choosing a learning method (e.g., SVM vs. logistic regression) and choosing the feature space mapping, i.e., the kernel.

Constructing Kernels. An overview of popular kernels and kernel properties.

A large variety of kernels. In this section, we review polynomial and Gaussian kernels.

Polynomial kernel. The function K(u, v) = (u · v + c)^d is called the polynomial kernel. Here u and v are vectors with m components, d ≥ 0 is an integer and c ≥ 0 is a constant. Without loss of generality, assume c = 0 (if it is not, add √c as a dummy feature to both u and v).
(u · v)^d = (u_1 v_1 + ... + u_m v_m)(u_1 v_1 + ... + u_m v_m) ... (u_1 v_1 + ... + u_m v_m)   [d factors]
= Σ_{i=1}^ℓ (u_1 v_1)^{a_i1} ... (u_m v_m)^{a_im}, where the sum runs over all ℓ = m^d terms of the expanded product and the a_ij are integers with 0 ≤ a_ij ≤ d
= Σ_{i=1}^ℓ (u_1^{a_i1} ... u_m^{a_im})(v_1^{a_i1} ... v_m^{a_im})
= Σ_{i=1}^ℓ φ(u)_i φ(v)_i
This gives a feature map φ: R^m → R^ℓ, where φ_i(x) = x_1^{a_i1} ... x_m^{a_im}.
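The expansion can be checked numerically by enumerating every ordered term of the product, which is exactly the (huge) explicit feature map: the kernel value (u · v)^d matches the dot product of m^d-dimensional feature vectors. This is an illustrative sketch, not an efficient implementation.

```python
import itertools
import numpy as np

def poly_feature_map(x, d):
    # one coordinate per ordered index tuple (i1, ..., id): x_{i1} * ... * x_{id}
    return np.array([np.prod(x[list(t)]) for t in itertools.product(range(len(x)), repeat=d)])

u, v, d = np.array([1.0, 2.0, -1.0]), np.array([0.5, -3.0, 2.0]), 3
print((u @ v) ** d)                                     # kernel: O(m) work
print(poly_feature_map(u, d) @ poly_feature_map(v, d))  # explicit map: m**d = 27 features
```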

Identifying new kernels. Method 1: using identities, such as the ones below. Let K_1(u, v) and K_2(u, v) be kernels, c > 0 a constant, and f(x) a real-valued function. Then each of the following is also a kernel: K(u, v) = K_1(u, v) + K_2(u, v); K(u, v) = c K_1(u, v); K(u, v) = f(u) K_1(u, v) f(v). See Bishop's book for more identities. (Exercise: prove these!) Method 2: using Mercer's theorem.
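The following is not a proof, just a numerical illustration of these identities under assumed choices (K_1 linear, K_2 RBF, f(x) = sin(x_1)): Gram matrices built from K_1 + K_2, c K_1, and f(u) K_1(u, v) f(v) stay positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

K1 = X @ X.T                                                      # linear kernel
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-0.5 * sq)                                            # RBF kernel
f = np.sin(X[:, 0])

for name, K in [("K1 + K2", K1 + K2),
                ("3 * K1", 3.0 * K1),
                ("f K1 f", f[:, None] * K1 * f[None, :])]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to round-off
```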

Radial basis function kernel. The function K(u, v) = exp(−γ‖u − v‖^2) is called the radial basis function kernel (aka Gaussian kernel). Here γ > 0 is the spread parameter.
exp(−γ‖u − v‖^2) = exp(−γ(u − v) · (u − v))
= exp(−γ(u · u − 2 u · v + v · v))
= exp(−γ u · u) exp(2γ u · v) exp(−γ v · v)
= f(u) exp(2γ u · v) f(v)
= f(u) (Σ_{d=0}^∞ r_d (u · v)^d) f(v)   [power series expansion]
Here each (u · v)^d is a polynomial kernel. Using the kernel identities, we conclude that the middle term is a kernel, and hence the whole expression is a kernel.
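A numerical illustration of the power-series argument (illustrative values of γ, u and v): truncating the series for exp(2γ u · v) and multiplying by f(u) f(v) converges to the RBF kernel value.

```python
import numpy as np
from math import factorial

gamma = 0.7
u, v = np.array([0.3, -1.2, 0.5]), np.array([1.0, 0.4, -0.2])

rbf = np.exp(-gamma * np.sum((u - v) ** 2))
f = lambda x: np.exp(-gamma * (x @ x))
for D in (2, 5, 10, 20):
    series = sum((2 * gamma * (u @ v)) ** d / factorial(d) for d in range(D + 1))
    print(D, f(u) * series * f(v), "vs", rbf)   # converges to the RBF value as D grows
```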

Mercer's Theorem. Question: given φ(u), is there a good kernel to use? Inverse question: given some function K(u, v), is it a valid kernel? In other words, is there a mapping φ(u) implied by the kernel? Mercer's theorem: consider a finite sequence of objects x_1, ..., x_n and construct the n × n matrix of pairwise values K(x_i, x_j). K(x_i, x_j) is a kernel if this matrix is positive semidefinite, and this holds for all possible sequences x_1, ..., x_n.
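Mercer's condition can be probed numerically: build the Gram matrix on a sample and inspect its eigenvalues. This is only a sketch; passing on one sample proves nothing, but a single negative eigenvalue is enough to rule a candidate out, as happens for K(u, v) = ‖u − v‖ below.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
dist = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))

K_rbf = np.exp(-0.5 * dist ** 2)     # candidate 1: Gaussian kernel
K_dist = dist                        # candidate 2: K(u, v) = ||u - v||

print("RBF kernel  min eigenvalue:", np.linalg.eigvalsh(K_rbf).min())   # >= 0 (valid kernel)
print("||u - v||   min eigenvalue:", np.linalg.eigvalsh(K_dist).min())  # < 0  (not a kernel)
```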

Kernel as a Similarity Measure. Extending machine learning beyond conventional data structures.

Yet another use of kernels. Remember how the (re-parameterised) SVM makes predictions: the prediction depends on the sign of b + Σ_{i=1}^n λ_i y_i (x_i · x), so the point x is dot-producted with each support vector from the training set. This term can be rewritten using a kernel: b + Σ_{i=1}^n λ_i y_i K(x_i, x). This can be seen as comparing x to each of the support vectors. For example, consider the Gaussian kernel K(x_i, x) = exp(−γ‖x_i − x‖^2). Here ‖x_i − x‖ is the distance between the points, and K(x_i, x) is monotonically decreasing in that distance, so K(x_i, x) can be interpreted as a similarity measure.

Kernel as a similarity measure. More generally, any kernel K(u, v) can be viewed as a similarity measure: it maps two objects to a real number. In other words, choosing or designing a kernel can be viewed as defining how to compare objects. This is a very powerful idea, because it lets us extend kernel methods to objects that are not vectors. This is the first time in this course that we encounter the notion of a different data type; so far, we have been concerned with vectors of fixed dimensionality, e.g., x = [x_1, ..., x_m].

Data comes in a variety of shapes. But what if we wanted to do machine learning on: graphs (Facebook, Twitter, ...); sequences of variable length ("science is organized knowledge, wisdom is organized life"*, "CATTC", "AAAGAGA"); songs, movies, etc.? (* Both quotations are from Immanuel Kant.)

Handling arbitrary data structures. Kernels offer a way to deal with this variety of data types. For example, we could define a function that somehow measures the similarity of variable-length strings, K("science is organized knowledge", "wisdom is organized life"). However, not every function of two objects is a valid kernel; remember that we need the function K(u, v) to imply a dot product in some feature space.
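One standard way to obtain a valid kernel on strings is to make the feature space explicit. The toy p-spectrum kernel sketched below (an illustration, not something defined in the lecture) counts shared length-p substrings, i.e., it is the dot product of substring-count feature vectors, so it implies a dot product in a feature space indexed by all length-p strings.

```python
from collections import Counter

def spectrum_kernel(s, t, p=2):
    # K(s, t) = dot product of length-p substring counts of s and t
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)   # shared p-mers, counted with multiplicity

print(spectrum_kernel("CATTC", "AAAGAGA"))
print(spectrum_kernel("science is organized knowledge", "wisdom is organized life"))
```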

This lecture: the kernel trick (efficient computation of a dot product in the transformed feature space); modular learning (separating the learning module from the feature space transformation); constructing kernels (an overview of popular kernels and their properties); kernel as a similarity measure (extending machine learning beyond conventional data structures).