Advanced Introduction to Machine Learning


Advanced Introduction to Machine Learning, 10-715, Fall 2014. The Kernel Trick, Reproducing Kernel Hilbert Space, and the Representer Theorem. Eric Xing. Lecture 6, September 24, 2014.

Recap: the SVM problem. We solve the following constrained optimization problem:

$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$

$$\text{s.t.} \quad 0 \le \alpha_i \le C, \; i = 1, \dots, m, \qquad \sum_{i=1}^m \alpha_i y_i = 0.$$

This is a quadratic programming problem; a global maximum of $J(\alpha)$ can always be found. The solution: $w = \sum_{i=1}^m \alpha_i y_i x_i$. How to predict: $\hat{y} = \mathrm{sign}(w^T x + b)$.
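
As a concrete illustration (not part of the original slides), here is a minimal sketch that solves this dual QP numerically on a tiny toy dataset with scipy.optimize; the names (alpha, C, etc.) mirror the formulation above.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny 2-D toy dataset: two separable classes, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, C = len(y), 10.0

# Gram matrix of the linear kernel with labels folded in: G[i, j] = y_i y_j x_i^T x_j.
G = (y[:, None] * X) @ (y[:, None] * X).T

# Negated dual objective (we minimize -J(alpha)).
def neg_J(alpha):
    return -alpha.sum() + 0.5 * alpha @ G @ alpha

res = minimize(
    neg_J,
    x0=np.zeros(m),
    bounds=[(0.0, C)] * m,                                 # 0 <= alpha_i <= C
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x

# Recover w and b from the solution, then predict with sign(w^T x + b).
w = (alpha * y) @ X
sv = alpha > 1e-6                     # support vectors
b = np.mean(y[sv] - X[sv] @ w)
print("alpha:", alpha.round(3), "w:", w.round(3))
print("predictions:", np.sign(X @ w + b))
```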

Kernel. Point rule or average rule: can we predict vec(y)?

$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$

Outline: The Kernel trick; Maximum entropy discrimination; Structured SVM, a.k.a. Maximum Margin Markov Networks.

(1) Non-linear Decision Boundary. So far, we have only considered large-margin classifiers with a linear decision boundary. How do we generalize this to a nonlinear boundary? Key idea: transform the data to a higher-dimensional space to "make life easier". Input space: the space where the points $x_i$ are located. Feature space: the space of $\phi(x_i)$ after transformation. Why transform? A linear operation in the feature space is equivalent to a non-linear operation in the input space, so classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature $x_1 x_2$ makes the problem linearly separable (homework).
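
The XOR remark can be checked in a few lines; a hypothetical sketch (numpy only, not part of the slides):

```python
import numpy as np

# Four XOR points with labels y = +1 if x1*x2 > 0, else -1.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

# No linear rule separates them in the input space, but after adding
# the feature x1*x2 the data become linearly separable:
phi = np.column_stack([X, X[:, 0] * X[:, 1]])   # phi(x) = (x1, x2, x1*x2)

# The weight vector w = (0, 0, 1) classifies every point correctly.
w = np.array([0.0, 0.0, 1.0])
print(np.sign(phi @ w) == y)   # [ True  True  True  True]
```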

Non-linear Decision Boundary.

Transforming the Data. [Figure: data points $x$ in the input space are mapped by $\phi(\cdot)$ to points $\phi(x)$ in the feature space.] Note: in practice the feature space is of higher dimension than the input space. Computation in the feature space can be costly because it is high-dimensional; the feature space is typically infinite-dimensional! The kernel trick comes to the rescue.

The Kernel Trick. Recall the SVM optimization problem:

$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \, x_i^T x_j, \qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \; i = 1, \dots, m, \quad \sum_{i=1}^m \alpha_i y_i = 0.$$

The data points only appear as inner products. As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. Many common geometric operations (angles, distances) can be expressed by inner products. Define the kernel function K by

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

An Example of a feature mapping and kernel. Consider an input $x = [x_1, x_2]$. Suppose $\phi(\cdot)$ is given as follows:

$$\phi([x_1, x_2]) = \left(1, \; \sqrt{2}\,x_1, \; \sqrt{2}\,x_2, \; x_1^2, \; x_2^2, \; \sqrt{2}\,x_1 x_2\right).$$

An inner product in the feature space is

$$\langle \phi(x), \phi(x') \rangle = (1 + x_1 x_1' + x_2 x_2')^2.$$

So, if we define the kernel function as $K(x, x') = (1 + x^T x')^2$, there is no need to carry out $\phi(\cdot)$ explicitly.
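
A quick numerical sanity check of this identity (a sketch, not from the slides): the explicit 6-dimensional feature map and the kernel $(1 + x^T x')^2$ give the same inner product.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x = [x1, x2]."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x  = np.array([0.7, -1.2])
xp = np.array([2.0,  0.5])

lhs = phi(x) @ phi(xp)          # inner product in the 6-D feature space
rhs = (1.0 + x @ xp) ** 2       # kernel evaluated in the 2-D input space
print(np.isclose(lhs, rhs))     # True
```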

More examples of kernel functions. Linear kernel (we've seen it): $K(x, x') = x^T x'$. Polynomial kernel (we just saw an example): $K(x, x') = (1 + x^T x')^p$, where p = 2, 3, ...; to get the feature vectors we concatenate all p-th order polynomial terms of the components of x (weighted appropriately). Radial basis kernel: $K(x, x') = \exp\!\left(-\tfrac{1}{2}\|x - x'\|^2\right)$; in this case the feature space consists of functions and results in a non-parametric classifier.
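
For reference, the three kernels above written as plain functions (a sketch; the rbf form matches the $\exp(-\tfrac{1}{2}\|x - x'\|^2)$ expression on the slide):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, p=2):
    return (1.0 + x @ xp) ** p

def rbf_kernel(x, xp):
    return np.exp(-0.5 * np.sum((x - xp) ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), polynomial_kernel(x, xp), rbf_kernel(x, xp))
```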

The essence of kernels. Feature mapping, but without paying the cost. E.g., for the polynomial kernel: how many dimensions do we get in the new space? How many operations does it take to compute K(x, x')? Kernel design: any principles? K(x, z) can be thought of as a similarity function between x and z. This intuition is well reflected in the Gaussian kernel (similarly, one can easily come up with other K(·,·) in the same spirit). Does this necessarily lead to a legal kernel? (In the above particular case, K(·,·) is a legal one; do you know how many dimensions $\phi(x)$ has?)

Kernel matrix. Suppose for now that K is indeed a valid kernel corresponding to some feature mapping $\phi$. Then for $x_1, \dots, x_m$ we can compute an $m \times m$ matrix with entries $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. This is called a kernel matrix! Now, if a kernel function is indeed a valid kernel, i.e., its elements are dot products in the transformed feature space, the kernel matrix must satisfy: symmetry, $K = K^T$ (proof?), and positive semidefiniteness (proof?).
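
Both properties are easy to check empirically; a small sketch (not from the slides) that builds a Gram matrix for the RBF kernel and verifies symmetry and numerical positive semidefiniteness via its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # 20 random points in R^3

# Gram matrix K_ij = exp(-0.5 * ||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

print("symmetric:", np.allclose(K, K.T))
print("PSD (min eigenvalue >= -1e-10):", np.linalg.eigvalsh(K).min() >= -1e-10)
```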

Mercer kernel.

SVM examples.

Examples of non-linear SVMs: Gaussian kernel.

Remember the Kernel Trick!!! Primal formulation: the feature map is infinite-dimensional and cannot be directly computed, but the dot product is easy to compute. Dual formulation: the data enter only through these dot products, which the kernel $K(x, x') = \phi(x)^T \phi(x')$ supplies.

Overview of Hilbert Space Embedding. Create an infinite-dimensional statistic for a distribution. Two requirements: (1) the map from distributions to statistics is one-to-one; (2) although the statistic is infinite-dimensional, it is cleverly constructed such that the kernel trick can be applied. Then perform Belief Propagation as if these statistics were the conditional probability tables. We will now make this construction more formal by introducing the concept of Hilbert spaces.

Vector Space. A set of objects closed under linear combinations (e.g., addition and scalar multiplication), obeying the distributive and associative laws. Normally, you think of these objects as finite-dimensional vectors; however, in general the objects can be functions. Non-rigorous intuition: a function is like an infinite-dimensional vector.

Hilbert Space. A Hilbert space is a complete vector space equipped with an inner product. The inner product has the following properties: symmetry, linearity, non-negativity, and zero ($\langle f, f \rangle = 0$ if and only if $f = 0$). Basically, a nice infinite-dimensional vector space where lots of things behave like the finite case: e.g., using the inner product we can define a norm or orthogonality; e.g., a norm can be defined, which allows one to define notions of convergence.

Hilbert Space Inner Product. Example of an inner product (just an example; an inner product is not required to be an integral): $\langle f, g \rangle = \int f(x)\, g(x)\, dx$. The inner product of two functions is a number (a scalar), just like the traditional finite-dimensional vector space inner product $\langle u, v \rangle = \sum_i u_i v_i$.

Recall the SVM kernel. Intuition: the kernel maps data points to feature functions, which correspond to vectors in a vector space.

The Feature Function. Consider holding one element of the kernel fixed. We get a function of one variable, which we call the feature function. The collection of feature functions is called the feature map. For a Gaussian kernel the feature functions are unnormalized Gaussians: $k(\cdot, x) = \exp\!\left(-\tfrac{1}{2}\|\cdot - x\|^2\right)$.

Reproducing Kernel Hilbert Space. Given a kernel $k(\cdot, \cdot)$, we now construct a Hilbert space such that k defines an inner product in that space. We begin with the kernel map $x \mapsto k(\cdot, x)$. We then construct a vector space containing all linear combinations of the functions $k(\cdot, x)$:

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i \, k(\cdot, x_i).$$

We now define an inner product. Let $g(\cdot) = \sum_{j=1}^{m'} \beta_j \, k(\cdot, x'_j)$; then

$$\langle f, g \rangle = \sum_{i=1}^{m} \alpha_i \, g(x_i) = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j \, k(x_i, x'_j).$$

Please verify that this is in fact an inner product, satisfying symmetry, linearity, and the zero-norm law $\langle f, f \rangle = 0 \Rightarrow f = 0$ (here we need the reproducing property and the Cauchy-Schwarz inequality).

Reproducing Kernel Hilbert Space. The $k(\cdot, \cdot)$ is a reproducing kernel map:

$$\langle k(\cdot, x), f \rangle = \sum_{i=1}^{m} \alpha_i \, k(x, x_i) = f(x).$$

This shows that the kernel is a representer of evaluation (or, an evaluation function). This is analogous to the Dirac delta function. If we plug the kernel in for f:

$$\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x').$$

With such a definition of the inner product, we have constructed a subspace of the Hilbert space: a reproducing kernel Hilbert space (RKHS).
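
A small numerical sketch of this construction (not from the slides): represent $f = \sum_i \alpha_i k(\cdot, x_i)$ and $g = \sum_j \beta_j k(\cdot, z_j)$ by their coefficients, compute the inner product from its definition, and confirm it agrees with the reproducing-property form $\langle f, g \rangle = \sum_i \alpha_i g(x_i)$.

```python
import numpy as np

def k(x, xp):
    """Gaussian kernel on scalars (just an example kernel)."""
    return np.exp(-0.5 * (x - xp) ** 2)

# f(.) = sum_i alpha_i k(., x_i),   g(.) = sum_j beta_j k(., z_j)
x_pts, alpha = np.array([-1.0, 0.3, 2.0]), np.array([0.5, -1.2, 0.7])
z_pts, beta  = np.array([0.0, 1.5]),       np.array([1.0, -0.4])

g = lambda x: np.sum(beta * k(x, z_pts))

# Inner product via the definition: <f, g> = sum_ij alpha_i beta_j k(x_i, z_j)
ip_def = alpha @ k(x_pts[:, None], z_pts[None, :]) @ beta

# Same inner product via the reproducing property: <f, g> = sum_i alpha_i g(x_i)
ip_rep = np.sum(alpha * np.array([g(x) for x in x_pts]))

print(np.isclose(ip_def, ip_rep))   # True
```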

Back to the Feature Map. The collection of evaluation functions is the feature map!!! Intuition: a more complicated feature map/kernel corresponds to a "richer" RKHS. Basically, a really nice infinite-dimensional vector space where even more things behave like the finite case.

Inner Product of Feature Maps. Define the inner product as $\langle \phi(x), \phi(x') \rangle = \langle k(\cdot, x), k(\cdot, x') \rangle$, a scalar. Note that $\langle \phi(x), \phi(x') \rangle = k(x, x')$: this is exactly the kernel trick.

Mercer's theorem and RKHS. Recall the condition of Mercer's theorem for K: $\int k(x, x')\, \phi_j(x')\, dx' = \lambda_j \phi_j(x)$. We can also construct our Reproducing Kernel Hilbert Space with a Mercer kernel, as linear combinations of its eigenfunctions, using the expansion $k(x, x') = \sum_{j=1}^{\infty} \lambda_j \phi_j(x)\, \phi_j(x')$, which can be shown to entail the reproducing property (homework?).

Summary: RKHS. Consider the set of functions that can be formed with linear combinations of these feature functions: $f(\cdot) = \sum_i \alpha_i \, k(\cdot, x_i)$. We define the Reproducing Kernel Hilbert Space to be the completion of this set (like with the holes filled in). Intuitively, the feature functions are like an over-complete basis for the RKHS.

Summary: Reproducing Property. It can now be derived that the inner product of a function f with $k(\cdot, x)$ evaluates the function at the point x:

$$\langle f, k(\cdot, x) \rangle = \sum_i \alpha_i \langle k(\cdot, x_i), k(\cdot, x) \rangle = \sum_i \alpha_i \, k(x_i, x) = f(x),$$

using linearity of the inner product and the definition of the kernel. Remember that this is a scalar.

Summary: Evaluation Function. A Reproducing Kernel Hilbert Space is a Hilbert space where, for any $x \in X$, the evaluation functional indexed by x takes the form $f \mapsto \langle f, k(\cdot, x) \rangle$. The evaluation function $k(\cdot, x)$ must itself be a function in the RKHS. The same evaluation function is used for different functions (but the same point); different points are associated with different evaluation functions. Equivalent (more technical) definition: an RKHS is a Hilbert space where the evaluation functionals are bounded (the previous definition then follows from the Riesz Representation Theorem).

RKHS or Not? Is the vector space of 3-dimensional real-valued vectors an RKHS? Yes!!! Homework!

RKHS or Not? Is the space of square-integrable functions, i.e., functions such that $\int f(x)^2\, dx < \infty$, an RKHS? No!!!! Homework! But can't the evaluation functional be an inner product with the delta function? The problem is that the delta function is not in the space!

The Kernel. I can evaluate my evaluation function with another evaluation function! Doing this for all pairs in my dataset gives me the kernel matrix K, with $K_{ij} = \langle k(\cdot, x_i), k(\cdot, x_j) \rangle = k(x_i, x_j)$. There may be infinitely many evaluation functions, but I only have a finite number of training points, so the kernel matrix is finite!!!!

Correspondence between Kernels and RKHS. A kernel is positive semi-definite if the kernel matrix is positive semi-definite for any choice of a finite set of observations. Theorem (Moore-Aronszajn): every positive semi-definite kernel corresponds to a unique RKHS, and every RKHS is associated with a unique positive semi-definite kernel. Note that the kernel does not uniquely define the feature map (but we don't really care, since we never directly evaluate the feature map anyway).

RKHS norm and SVM. Recall that in SVM:

$$f(\cdot) = \langle w, \phi(\cdot) \rangle = \sum_{i=1}^{m} \alpha_i y_i \, k(\cdot, x_i).$$

Therefore $f(\cdot) \in \mathcal{H}$. Moreover:

$$\|f(\cdot)\|_{\mathcal{H}}^2 = \Big\langle \sum_{i=1}^{m} \alpha_i y_i \, k(\cdot, x_i), \; \sum_{j=1}^{m} \alpha_j y_j \, k(\cdot, x_j) \Big\rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j).$$

Primal and dual SVM objective. In our primal problem, we minimize $w^T w$ subject to constraints. This is equivalent to:

$$\|w\|^2 = w^T w = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, \phi(x_i)^T \phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) = \|f\|_{\mathcal{H}}^2,$$

which is equivalent to minimizing the Hilbert norm of f subject to constraints.
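
This identity can be checked numerically; a sketch assuming scikit-learn is available (SVC stores $\alpha_i y_i$ for the support vectors in dual_coef_):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, size=(30, 2)),
               rng.normal(+1.5, 1.0, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

coef = clf.dual_coef_.ravel()       # alpha_i * y_i for the support vectors
SV = clf.support_vectors_

# ||w||^2 from the primal weight vector ...
w = clf.coef_.ravel()
w_norm2 = w @ w

# ... equals sum_ij (alpha_i y_i)(alpha_j y_j) k(x_i, x_j) = ||f||_H^2.
f_norm2 = coef @ (SV @ SV.T) @ coef

print(np.isclose(w_norm2, f_norm2))   # True
```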

The Representer Theorem. In the general case, for a primal problem P of the form

$$\min_{f \in \mathcal{H}} \; \Big\{ C\big(f; \{x_i, y_i\}\big) + \Omega\big(\|f\|_{\mathcal{H}}\big) \Big\},$$

where $\{x_i, y_i\}_{i=1}^m$ are the training data, and if the following conditions are satisfied: the loss function C is point-wise, i.e., $C(f; \{x_i, y_i\}) = C(\{x_i, y_i, f(x_i)\})$, and $\Omega(\cdot)$ is monotonically increasing, then the representer theorem (Kimeldorf and Wahba, 1971) states that every minimizer of P admits a representation of the form

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i \, K(\cdot, x_i),$$

i.e., a linear combination of a finite set of functions given by the data.
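
As an illustration (a sketch, not from the slides): for the squared loss with an RKHS-norm penalty, i.e. kernel ridge regression, the minimizer is exactly such a combination, with coefficients available in closed form, $\alpha = (K + \lambda I)^{-1} y$.

```python
import numpy as np

def rbf(a, b):
    """Gaussian kernel matrix between two sets of 1-D points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(1)
x_train = rng.uniform(-3, 3, size=25)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=25)

lam = 0.1
K = rbf(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

# The fitted function is f(x) = sum_i alpha_i K(x, x_i), as the theorem predicts.
x_test = np.linspace(-3, 3, 5)
f_test = rbf(x_test, x_train) @ alpha
print(np.round(f_test, 2))
```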

Proof of the Representer Theorem.

Another view of SVM. Q: Why is the SVM dual-sparse, i.e., why does it have only a few support vectors (most of the $\alpha_i$'s are zero)? The SVM loss $w^T w$ does not seem to imply that, and the representer theorem does not either!

Another view of SVM: $L_1$ regularization. The basis-pursuit denoising cost function (Chen & Donoho):

$$J(\alpha) = \frac{1}{2} \Big\| f(\cdot) - \sum_{i=1}^{N} \alpha_i \, \phi_i(\cdot) \Big\|_{L_2}^2 + \lambda \|\alpha\|_{L_1}.$$

Instead we consider the following modified cost:

$$J(\alpha) = \frac{1}{2} \Big\| f(\cdot) - \sum_{i=1}^{N} \alpha_i \, K(\cdot, x_i) \Big\|_{\mathcal{H}}^2 + \lambda \|\alpha\|_{L_1}.$$

RKHS norm interpretation of SVM.

$$J(\alpha) = \frac{1}{2} \Big\| f(\cdot) - \sum_{i=1}^{N} \alpha_i \, K(\cdot, x_i) \Big\|_{\mathcal{H}}^2 + \lambda \|\alpha\|_{L_1}.$$

The RKHS norm of the first term can now be computed exactly!
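
The step the slide alludes to can be written out explicitly (a sketch of the expansion, using the reproducing property $\langle K(\cdot, x_i), f \rangle = f(x_i)$; dropping the constant $\|f\|_{\mathcal{H}}^2$ and identifying $f(x_i)$ with the label $y_i$ gives the objective on the next slide):

$$\Big\| f(\cdot) - \sum_i \alpha_i K(\cdot, x_i) \Big\|_{\mathcal{H}}^2 = \|f\|_{\mathcal{H}}^2 - 2 \sum_i \alpha_i f(x_i) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j).$$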

RKHS norm interpretation of SVM. Now we have the following optimization problem:

$$\min_{\alpha} \; \Big\{ -\sum_i \alpha_i y_i + \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, K(x_i, x_j) + \lambda \sum_i |\alpha_i| \Big\}.$$

This is exactly the dual problem of SVM!

Take-home messages. A kernel is a (nonlinear) feature map into a Hilbert space. Mercer kernels are legal kernels. An RKHS is a Hilbert space equipped with an inner product defined by a Mercer kernel. The reproducing property makes the kernel work like an evaluation function. The representer theorem ensures that, for a general class of loss functions, the optimal solution lies in the span of the kernel functions at the data points. SVM can be recast as an L1-regularized minimization problem in the RKHS.