Advanced Introduction to Machine Learning


Advanced Introduction to Machine Learning, 10-715, Fall 2014. The Kernel Trick, Reproducing Kernel Hilbert Space, and the Representer Theorem. Eric Xing. Lecture 6, September 24, 2014.

Recap: the SVM problem. We solve the following constrained optimization problem:

$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$

$$\text{s.t.} \quad 0 \le \alpha_i \le C, \; i = 1, \dots, m, \qquad \sum_{i=1}^m \alpha_i y_i = 0.$$

This is a quadratic programming problem; a global maximum of $J(\alpha)$ can always be found. The solution: $w = \sum_{i=1}^m \alpha_i y_i x_i$. How to predict: $\hat{y} = \mathrm{sign}(w^T x + b)$.
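
As a concrete illustration (not part of the original slides), here is a minimal sketch that solves this dual QP numerically on a tiny toy dataset with scipy.optimize; the names (alpha, C, etc.) mirror the formulation above.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny 2-D toy dataset: two separable classes, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, C = len(y), 10.0

# Gram matrix of the linear kernel with labels folded in: G[i, j] = y_i y_j x_i^T x_j.
G = (y[:, None] * X) @ (y[:, None] * X).T

# Negated dual objective (we minimize -J(alpha)).
def neg_J(alpha):
    return -alpha.sum() + 0.5 * alpha @ G @ alpha

res = minimize(
    neg_J,
    x0=np.zeros(m),
    bounds=[(0.0, C)] * m,                                 # 0 <= alpha_i <= C
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x

# Recover w and b from the solution, then predict with sign(w^T x + b).
w = (alpha * y) @ X
sv = alpha > 1e-6                     # support vectors
b = np.mean(y[sv] - X[sv] @ w)
print("alpha:", alpha.round(3), "w:", w.round(3))
print("predictions:", np.sign(X @ w + b))
```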

Kernel. Point rule or average rule: can we predict vec(y)?

$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$

Outline: The Kernel trick; Maximum entropy discrimination; Structured SVM, a.k.a. Maximum Margin Markov Networks.

(1) Non-linear Decision Boundary. So far, we have only considered large-margin classifiers with a linear decision boundary. How do we generalize this to a nonlinear boundary? Key idea: transform the data to a higher-dimensional space to "make life easier". Input space: the space where the points $x_i$ are located. Feature space: the space of $\phi(x_i)$ after transformation. Why transform? A linear operation in the feature space is equivalent to a non-linear operation in the input space, so classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature $x_1 x_2$ makes the problem linearly separable (homework).
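
The XOR remark can be checked in a few lines; a hypothetical sketch (numpy only, not part of the slides):

```python
import numpy as np

# Four XOR points with labels y = +1 if x1*x2 > 0, else -1.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

# No linear rule separates them in the input space, but after adding
# the feature x1*x2 the data become linearly separable:
phi = np.column_stack([X, X[:, 0] * X[:, 1]])   # phi(x) = (x1, x2, x1*x2)

# The weight vector w = (0, 0, 1) classifies every point correctly.
w = np.array([0.0, 0.0, 1.0])
print(np.sign(phi @ w) == y)   # [ True  True  True  True]
```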

Non-linear Decision Boundary.

Transforming the Data. [Figure: data points $x$ in the input space are mapped by $\phi(\cdot)$ to points $\phi(x)$ in the feature space.] Note: in practice the feature space is of higher dimension than the input space. Computation in the feature space can be costly because it is high-dimensional; the feature space is typically infinite-dimensional! The kernel trick comes to the rescue.

The Kernel Trick. Recall the SVM optimization problem:

$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, x_i^T x_j, \qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \; i = 1, \dots, m, \quad \sum_{i=1}^m \alpha_i y_i = 0.$$

The data points only appear as inner products. As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. Many common geometric operations (angles, distances) can be expressed by inner products. Define the kernel function K by

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

An Example of a feature mapping and kernel. Consider an input $x = [x_1, x_2]$. Suppose $\phi(\cdot)$ is given as follows:

$$\phi([x_1, x_2]) = \left(1, \; \sqrt{2}\,x_1, \; \sqrt{2}\,x_2, \; x_1^2, \; x_2^2, \; \sqrt{2}\,x_1 x_2\right).$$

An inner product in the feature space is

$$\langle \phi(x), \phi(x') \rangle = (1 + x_1 x_1' + x_2 x_2')^2.$$

So, if we define the kernel function as $K(x, x') = (1 + x^T x')^2$, there is no need to carry out $\phi(\cdot)$ explicitly.
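
A quick numerical sanity check of this identity (a sketch, not from the slides): the explicit 6-dimensional feature map and the kernel $(1 + x^T x')^2$ give the same inner product.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x = [x1, x2]."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x  = np.array([0.7, -1.2])
xp = np.array([2.0,  0.5])

lhs = phi(x) @ phi(xp)          # inner product in the 6-D feature space
rhs = (1.0 + x @ xp) ** 2       # kernel evaluated in the 2-D input space
print(np.isclose(lhs, rhs))     # True
```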

More examples of kernel functions. Linear kernel (we've seen it): $K(x, x') = x^T x'$. Polynomial kernel (we just saw an example): $K(x, x') = (1 + x^T x')^p$, where p = 2, 3, ...; to get the feature vectors we concatenate all p-th order polynomial terms of the components of x (weighted appropriately). Radial basis kernel: $K(x, x') = \exp\!\left(-\tfrac{1}{2}\|x - x'\|^2\right)$; in this case the feature space consists of functions and results in a non-parametric classifier.
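
For reference, the three kernels above written as plain functions (a sketch; the rbf form matches the $\exp(-\tfrac{1}{2}\|x - x'\|^2)$ expression on the slide):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, p=2):
    return (1.0 + x @ xp) ** p

def rbf_kernel(x, xp):
    return np.exp(-0.5 * np.sum((x - xp) ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), polynomial_kernel(x, xp), rbf_kernel(x, xp))
```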

The essence of kernels. Feature mapping, but without paying the cost. E.g., for the polynomial kernel: how many dimensions do we get in the new space? How many operations does it take to compute K(x, x')? Kernel design: any principles? K(x, z) can be thought of as a similarity function between x and z. This intuition is well reflected in the Gaussian kernel (similarly, one can easily come up with other K(·,·) in the same spirit). Does this necessarily lead to a legal kernel? (In the above particular case, K(·,·) is a legal one; do you know how many dimensions $\phi(x)$ has?)

Kernel matrix. Suppose for now that K is indeed a valid kernel corresponding to some feature mapping $\phi$. Then for $x_1, \dots, x_m$ we can compute an $m \times m$ matrix with entries $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. This is called a kernel matrix! Now, if a kernel function is indeed a valid kernel, i.e., its elements are dot products in the transformed feature space, the kernel matrix must satisfy: symmetry, $K = K^T$ (proof?), and positive semidefiniteness (proof?).
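
Both properties are easy to check empirically; a small sketch (not from the slides) that builds a Gram matrix for the RBF kernel and verifies symmetry and numerical positive semidefiniteness via its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # 20 random points in R^3

# Gram matrix K_ij = exp(-0.5 * ||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

print("symmetric:", np.allclose(K, K.T))
print("PSD (min eigenvalue >= -1e-10):", np.linalg.eigvalsh(K).min() >= -1e-10)
```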

Mercer kernel.

SVM examples.

Examples of non-linear SVMs: Gaussian kernel.

Remember the Kernel Trick!!! Primal formulation: the feature map is infinite-dimensional and cannot be directly computed, but the dot product is easy to compute. Dual formulation: the data enter only through these dot products, which the kernel $K(x, x') = \phi(x)^T \phi(x')$ supplies.

Overview of Hilbert Space Embedding. Create an infinite-dimensional statistic for a distribution. Two requirements: (1) the map from distributions to statistics is one-to-one; (2) although the statistic is infinite-dimensional, it is cleverly constructed such that the kernel trick can be applied. Then perform Belief Propagation as if these statistics were the conditional probability tables. We will now make this construction more formal by introducing the concept of Hilbert spaces.

Vector Space. A set of objects closed under linear combinations (e.g., addition and scalar multiplication), obeying the distributive and associative laws. Normally, you think of these objects as finite-dimensional vectors; however, in general the objects can be functions. Non-rigorous intuition: a function is like an infinite-dimensional vector.

Hilbert Space. A Hilbert space is a complete vector space equipped with an inner product. The inner product has the following properties: symmetry, linearity, non-negativity, and zero ($\langle f, f \rangle = 0$ if and only if $f = 0$). Basically, a nice infinite-dimensional vector space where lots of things behave like the finite case: e.g., using the inner product we can define a norm or orthogonality; e.g., a norm can be defined, which allows one to define notions of convergence.

Hilbert Space Inner Product. Example of an inner product (just an example; an inner product is not required to be an integral): $\langle f, g \rangle = \int f(x)\, g(x)\, dx$. The inner product of two functions is a number (a scalar), just like the traditional finite-dimensional vector space inner product $\langle u, v \rangle = \sum_i u_i v_i$.

Recall the SVM kernel. Intuition: the kernel maps data points to feature functions, which correspond to vectors in a vector space.

The Feature Function. Consider holding one element of the kernel fixed. We get a function of one variable, which we call the feature function. The collection of feature functions is called the feature map. For a Gaussian kernel the feature functions are unnormalized Gaussians: $k(\cdot, x) = \exp\!\left(-\tfrac{1}{2}\|\cdot - x\|^2\right)$.

Reproducing Kernel Hilbert Space. Given a kernel $k(\cdot, \cdot)$, we now construct a Hilbert space such that k defines an inner product in that space. We begin with the kernel map $x \mapsto k(\cdot, x)$. We then construct a vector space containing all linear combinations of the functions $k(\cdot, x)$:

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i \, k(\cdot, x_i).$$

We now define an inner product. Let $g(\cdot) = \sum_{j=1}^{m'} \beta_j \, k(\cdot, x'_j)$; then

$$\langle f, g \rangle = \sum_{i=1}^{m} \alpha_i \, g(x_i) = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j \, k(x_i, x'_j).$$

Please verify that this is in fact an inner product, satisfying symmetry, linearity, and the zero-norm law $\langle f, f \rangle = 0 \Rightarrow f = 0$ (here we need the reproducing property and the Cauchy-Schwarz inequality).

Reproducing Kernel Hilbert Space. The $k(\cdot, \cdot)$ is a reproducing kernel map:

$$\langle k(\cdot, x), f \rangle = \sum_{i=1}^{m} \alpha_i \, k(x, x_i) = f(x).$$

This shows that the kernel is a representer of evaluation (or, an evaluation function). This is analogous to the Dirac delta function. If we plug the kernel in for f:

$$\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x').$$

With such a definition of the inner product, we have constructed a subspace of the Hilbert space: a reproducing kernel Hilbert space (RKHS).
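
A small numerical sketch of this construction (not from the slides): represent $f = \sum_i \alpha_i k(\cdot, x_i)$ and $g = \sum_j \beta_j k(\cdot, z_j)$ by their coefficients, compute the inner product from its definition, and confirm it agrees with the reproducing-property form $\langle f, g \rangle = \sum_i \alpha_i g(x_i)$.

```python
import numpy as np

def k(x, xp):
    """Gaussian kernel on scalars (just an example kernel)."""
    return np.exp(-0.5 * (x - xp) ** 2)

# f(.) = sum_i alpha_i k(., x_i),   g(.) = sum_j beta_j k(., z_j)
x_pts, alpha = np.array([-1.0, 0.3, 2.0]), np.array([0.5, -1.2, 0.7])
z_pts, beta  = np.array([0.0, 1.5]),       np.array([1.0, -0.4])

g = lambda x: np.sum(beta * k(x, z_pts))

# Inner product via the definition: <f, g> = sum_ij alpha_i beta_j k(x_i, z_j)
ip_def = alpha @ k(x_pts[:, None], z_pts[None, :]) @ beta

# Same inner product via the reproducing property: <f, g> = sum_i alpha_i g(x_i)
ip_rep = np.sum(alpha * np.array([g(x) for x in x_pts]))

print(np.isclose(ip_def, ip_rep))   # True
```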

Back to the Feature Map. The collection of evaluation functions is the feature map!!! Intuition: a more complicated feature map/kernel corresponds to a "richer" RKHS. Basically, a really nice infinite-dimensional vector space where even more things behave like the finite case.

Inner Product of Feature Maps. Define the inner product as $\langle \phi(x), \phi(x') \rangle = \langle k(\cdot, x), k(\cdot, x') \rangle$, a scalar. Note that $\langle \phi(x), \phi(x') \rangle = k(x, x')$: this is exactly the kernel trick.

Mercer's theorem and RKHS. Recall the condition of Mercer's theorem for K: $\int k(x, x')\, \phi_j(x')\, dx' = \lambda_j \phi_j(x)$. We can also construct our Reproducing Kernel Hilbert Space with a Mercer kernel, as linear combinations of its eigenfunctions, using the expansion $k(x, x') = \sum_{j=1}^{\infty} \lambda_j \phi_j(x)\, \phi_j(x')$, which can be shown to entail the reproducing property (homework?).

Summary: RKHS. Consider the set of functions that can be formed with linear combinations of these feature functions: $f(\cdot) = \sum_i \alpha_i \, k(\cdot, x_i)$. We define the Reproducing Kernel Hilbert Space to be the completion of this set (like with the holes filled in). Intuitively, the feature functions are like an over-complete basis for the RKHS.

Summary: Reproducing Property. It can now be derived that the inner product of a function f with $k(\cdot, x)$ evaluates the function at the point x:

$$\langle f, k(\cdot, x) \rangle = \sum_i \alpha_i \langle k(\cdot, x_i), k(\cdot, x) \rangle = \sum_i \alpha_i \, k(x_i, x) = f(x),$$

using linearity of the inner product and the definition of the kernel. Remember that this is a scalar.

Summary: Evaluation Function. A Reproducing Kernel Hilbert Space is a Hilbert space where, for any $x \in X$, the evaluation functional indexed by x takes the form $f \mapsto \langle f, k(\cdot, x) \rangle$. The evaluation function $k(\cdot, x)$ must itself be a function in the RKHS. The same evaluation function is used for different functions (but the same point); different points are associated with different evaluation functions. Equivalent (more technical) definition: an RKHS is a Hilbert space where the evaluation functionals are bounded (the previous definition then follows from the Riesz Representation Theorem).

RKHS or Not? Is the vector space of 3-dimensional real-valued vectors an RKHS? Yes!!! Homework!

RKHS or Not? Is the space of square-integrable functions, i.e., functions such that $\int f(x)^2\, dx < \infty$, an RKHS? No!!!! Homework! But can't the evaluation functional be an inner product with the delta function? The problem is that the delta function is not in the space!

The Kernel. I can evaluate my evaluation function with another evaluation function! Doing this for all pairs in my dataset gives me the kernel matrix K, with $K_{ij} = \langle k(\cdot, x_i), k(\cdot, x_j) \rangle = k(x_i, x_j)$. There may be infinitely many evaluation functions, but I only have a finite number of training points, so the kernel matrix is finite!!!!

Correspondence between Kernels and RKHS. A kernel is positive semi-definite if the kernel matrix is positive semi-definite for any choice of a finite set of observations. Theorem (Moore-Aronszajn): every positive semi-definite kernel corresponds to a unique RKHS, and every RKHS is associated with a unique positive semi-definite kernel. Note that the kernel does not uniquely define the feature map (but we don't really care, since we never directly evaluate the feature map anyway).

RKHS norm and SVM. Recall that in SVM:

$$f(\cdot) = \langle w, \phi(\cdot) \rangle = \sum_{i=1}^{m} \alpha_i y_i \, k(\cdot, x_i).$$

Therefore $f(\cdot) \in \mathcal{H}$. Moreover:

$$\|f(\cdot)\|_{\mathcal{H}}^2 = \Big\langle \sum_{i=1}^{m} \alpha_i y_i \, k(\cdot, x_i), \; \sum_{j=1}^{m} \alpha_j y_j \, k(\cdot, x_j) \Big\rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j).$$

Primal and dual SVM objective. In our primal problem, we minimize $w^T w$ subject to constraints. This is equivalent to:

$$\|w\|^2 = w^T w = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, \phi(x_i)^T \phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) = \|f\|_{\mathcal{H}}^2,$$

which is equivalent to minimizing the Hilbert norm of f subject to constraints.
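
This identity can be checked numerically; a sketch assuming scikit-learn is available (SVC stores $\alpha_i y_i$ for the support vectors in dual_coef_):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, size=(30, 2)),
               rng.normal(+1.5, 1.0, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

coef = clf.dual_coef_.ravel()       # alpha_i * y_i for the support vectors
SV = clf.support_vectors_

# ||w||^2 from the primal weight vector ...
w = clf.coef_.ravel()
w_norm2 = w @ w

# ... equals sum_ij (alpha_i y_i)(alpha_j y_j) k(x_i, x_j) = ||f||_H^2.
f_norm2 = coef @ (SV @ SV.T) @ coef

print(np.isclose(w_norm2, f_norm2))   # True
```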

The Representer Theorem. In the general case, for a primal problem P of the form

$$\min_{f \in \mathcal{H}} \; \Big\{ C\big(f; \{x_i, y_i\}\big) + \Omega\big(\|f\|_{\mathcal{H}}\big) \Big\},$$

where $\{x_i, y_i\}_{i=1}^m$ are the training data, and if the following conditions are satisfied: the loss function C is point-wise, i.e., $C(f; \{x_i, y_i\}) = C(\{x_i, y_i, f(x_i)\})$, and $\Omega(\cdot)$ is monotonically increasing, then the representer theorem (Kimeldorf and Wahba, 1971) states that every minimizer of P admits a representation of the form

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i \, K(\cdot, x_i),$$

i.e., a linear combination of a finite set of functions given by the data.
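
As an illustration (a sketch, not from the slides): for the squared loss with an RKHS-norm penalty, i.e. kernel ridge regression, the minimizer is exactly such a combination, with coefficients available in closed form, $\alpha = (K + \lambda I)^{-1} y$.

```python
import numpy as np

def rbf(a, b):
    """Gaussian kernel matrix between two sets of 1-D points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(1)
x_train = rng.uniform(-3, 3, size=25)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=25)

lam = 0.1
K = rbf(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

# The fitted function is f(x) = sum_i alpha_i K(x, x_i), as the theorem predicts.
x_test = np.linspace(-3, 3, 5)
f_test = rbf(x_test, x_train) @ alpha
print(np.round(f_test, 2))
```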

Proof of the Representer Theorem.

Another view of SVM. Q: Why is the SVM dual-sparse, i.e., why does it have only a few support vectors (most of the $\alpha_i$'s are zero)? The SVM loss $w^T w$ does not seem to imply that, and the representer theorem does not either!

Another view of SVM: $L_1$ regularization. The basis-pursuit denoising cost function (Chen & Donoho):

$$J(\alpha) = \frac{1}{2} \Big\| f(\cdot) - \sum_{i=1}^{N} \alpha_i \, \phi_i(\cdot) \Big\|_{L_2}^2 + \lambda \|\alpha\|_{L_1}.$$

Instead we consider the following modified cost:

$$J(\alpha) = \frac{1}{2} \Big\| f(\cdot) - \sum_{i=1}^{N} \alpha_i \, K(\cdot, x_i) \Big\|_{\mathcal{H}}^2 + \lambda \|\alpha\|_{L_1}.$$

RKHS norm interpretation of SVM.

$$J(\alpha) = \frac{1}{2} \Big\| f(\cdot) - \sum_{i=1}^{N} \alpha_i \, K(\cdot, x_i) \Big\|_{\mathcal{H}}^2 + \lambda \|\alpha\|_{L_1}.$$

The RKHS norm of the first term can now be computed exactly!
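
The step the slide alludes to can be written out explicitly (a sketch of the expansion, using the reproducing property $\langle K(\cdot, x_i), f \rangle = f(x_i)$; dropping the constant $\|f\|_{\mathcal{H}}^2$ and identifying $f(x_i)$ with the label $y_i$ gives the objective on the next slide):

$$\Big\| f(\cdot) - \sum_i \alpha_i K(\cdot, x_i) \Big\|_{\mathcal{H}}^2 = \|f\|_{\mathcal{H}}^2 - 2 \sum_i \alpha_i f(x_i) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j).$$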

RKHS norm interpretation of SVM. Now we have the following optimization problem:

$$\min_{\alpha} \; \Big\{ -\sum_i \alpha_i y_i + \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, K(x_i, x_j) + \lambda \sum_i |\alpha_i| \Big\}.$$

This is exactly the dual problem of SVM!

Take-home messages. A kernel is a (nonlinear) feature map into a Hilbert space. Mercer kernels are legal kernels. An RKHS is a Hilbert space equipped with an inner product defined by a Mercer kernel. The reproducing property makes the kernel work like an evaluation function. The representer theorem ensures that, for a general class of loss functions, the optimal solution lies in the span of the kernel functions at the data points. SVM can be recast as an L1-regularized minimization problem in the RKHS.