Advanced Introduction to Machine Learning 10-715, Fall 2014
The Kernel Trick, Reproducing Kernel Hilbert Space, and the Representer Theorem
Eric Xing
Lecture 6, September 24, 2014
Reading:
Eric Xing @ CMU, 2014
Recap: the SVM problem
We solve the following constrained optimization problem:
max_α J(α) = Σ_{i=1}^m α_i − ½ Σ_{i,j=1}^m α_i α_j y_i y_j x_i^T x_j
s.t. 0 ≤ α_i ≤ C, i = 1, …, m;  Σ_{i=1}^m α_i y_i = 0
This is a quadratic programming problem; a global maximum of J(α) can always be found.
The solution: w = Σ_{i=1}^m α_i y_i x_i
How to predict: sign(w^T x + b)
Kernel
Point rule or average rule? Can we predict vec(y)?
max_α J(α) = Σ_{i=1}^m α_i − ½ Σ_{i,j=1}^m α_i α_j y_i y_j x_i^T x_j
Outline
The kernel trick
Maximum entropy discrimination
Structured SVM, aka Maximum Margin Markov Networks
(1) Non-linear Decision Boundary
So far, we have only considered large-margin classifiers with a linear decision boundary. How do we generalize this to become nonlinear?
Key idea: transform x_i to a higher-dimensional space to "make life easier."
Input space: the space where the points x_i are located.
Feature space: the space of φ(x_i) after transformation.
Why transform? A linear operation in the feature space is equivalent to a nonlinear operation in the input space, so classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x_1 x_2 makes the problem linearly separable (homework).
Non-linear Decision Boundary
Transforming the Data
[Figure: points in the input space mapped by φ(·) into the feature space]
Note: the feature space is of higher dimension than the input space in practice. Computation in the feature space can be costly because it is high-dimensional; the feature space is typically infinite-dimensional! The kernel trick comes to the rescue.
The Kernel Trick
Recall the SVM optimization problem:
max_α J(α) = Σ_{i=1}^m α_i − ½ Σ_{i,j=1}^m α_i α_j y_i y_j x_i^T x_j
s.t. 0 ≤ α_i ≤ C, i = 1, …, m;  Σ_{i=1}^m α_i y_i = 0
The data points only appear as inner products. As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. Many common geometric operations (angles, distances) can be expressed by inner products. Define the kernel function K by
K(x_i, x_j) = φ(x_i)^T φ(x_j)
An Example of Feature Mapping and Kernels
Consider an input x = [x_1, x_2]. Suppose φ(·) is given as follows:
φ([x_1, x_2]) = [1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2]^T
An inner product in the feature space is
⟨φ(x), φ(x')⟩ = (1 + x_1 x'_1 + x_2 x'_2)²
So, if we define the kernel function as K(x, x') = (1 + x^T x')², there is no need to carry out φ(·) explicitly.
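This identity between the explicit map and the kernel can be checked numerically. A minimal Python sketch, assuming the standard degree-2 polynomial map on R² shown above (the test points are arbitrary illustrations):

```python
import math

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2
    x1, x2 = x
    s = math.sqrt(2.0)
    return [1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2]

def K(x, z):
    # The kernel computes the same inner product without forming phi
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

x, z = [1.0, 2.0], [3.0, -1.0]
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # <phi(x), phi(z)>
print(explicit, K(x, z))  # both equal 4.0 here
```

Computing φ explicitly costs 6 multiplications per point plus a 6-dimensional dot product; the kernel needs only a 2-dimensional dot product and one squaring, and the gap widens rapidly with dimension and polynomial degree.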
More Examples of Kernel Functions
Linear kernel (we've seen it): K(x, x') = x^T x'
Polynomial kernel (we just saw an example): K(x, x') = (1 + x^T x')^p, where p = 2, 3, … To get the feature vectors we concatenate all pth-order polynomial terms of the components of x (weighted appropriately).
Radial basis kernel: K(x, x') = exp(−½ ‖x − x'‖²)
In this case the feature space consists of functions and results in a nonparametric classifier.
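The three kernels above are one-liners; a small sketch (the parameter defaults p=2 and sigma=1 are arbitrary illustrative choices):

```python
import math

def linear_kernel(x, z):
    # K(x, x') = x^T x'
    return sum(a * b for a, b in zip(x, z))

def poly_kernel(x, z, p=2):
    # K(x, x') = (1 + x^T x')^p
    return (1.0 + linear_kernel(x, z)) ** p

def rbf_kernel(x, z, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))
```

Note the RBF kernel is 1 when x = x' and decays with distance, which matches the similarity-function reading of kernels discussed next.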
The Essence of the Kernel
Feature mapping, but without paying a cost. E.g., for the polynomial kernel: how many dimensions have we got in the new space? How many operations does it take to compute K(x, x')?
Kernel design: any principle? K(x, z) can be thought of as a similarity function between x and z. This intuition is well reflected in the Gaussian function (similarly, one can easily come up with other K(·,·) in the same spirit). Does this necessarily lead to a legal kernel? (In the above particular case, K(·,·) is a legal one; do you know how many dimensions φ(x) has?)
Kernel Matrix
Suppose for now that K is indeed a valid kernel corresponding to some feature mapping φ. Then for x_1, …, x_m, we can compute an m×m matrix with entries
K_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j)
This is called a kernel matrix!
Now, if a kernel function is indeed a valid kernel, i.e., its elements are dot products in the transformed feature space, it must satisfy:
Symmetry: K = K^T (proof?)
Positive semidefiniteness (proof?)
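Both properties can be checked empirically for any given dataset. A sketch using a Gaussian kernel on random points (NumPy assumed available; the data are arbitrary):

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    # Gaussian kernel between two vectors
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # 8 points in R^3
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])  # 8x8 kernel matrix

is_symmetric = bool(np.allclose(K, K.T))
min_eig = float(np.linalg.eigvalsh(K).min())  # PSD iff all eigenvalues >= 0 (up to round-off)
print(is_symmetric, min_eig)
```

A numerical check on one sample is of course not a proof, but a single negative eigenvalue would already disqualify a candidate kernel.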
Mercer Kernel
SVM Examples
Examples for Non-Linear SVMs: Gaussian Kernel
Remember the Kernel Trick!!!
Primal formulation: φ(x) is infinite-dimensional and cannot be directly computed, but the dot product φ(x)^T φ(x') is easy to compute.
Dual formulation: the data enter only through these dot products, i.e., through kernel evaluations.
Overview of Hilbert Space Embedding
Create an infinite-dimensional statistic for a distribution. Two requirements:
The map from distributions to statistics is one-to-one.
Although the statistic is infinite, it is cleverly constructed so that the kernel trick can be applied.
Perform belief propagation as if these statistics were the conditional probability tables. We will now make this construction more formal by introducing the concept of Hilbert spaces.
Vector Space
A set of objects closed under linear combinations (e.g., addition and scalar multiplication), obeying the distributive and associative laws. Normally, you think of these objects as finite-dimensional vectors; in general, however, the objects can be functions.
Non-rigorous intuition: a function is like an infinite-dimensional vector.
Hilbert Space
A Hilbert space is a complete vector space equipped with an inner product. The inner product has the following properties: symmetry, linearity, nonnegativity, and zero (⟨f, f⟩ = 0 iff f = 0).
Basically, a nice infinite-dimensional vector space where lots of things behave like the finite case: e.g., using the inner product we can define a norm or orthogonality, and a norm allows one to define notions of convergence.
Hilbert Space Inner Product
Example of an inner product (just an example; an inner product is not required to be an integral):
⟨f, g⟩ = ∫ f(x) g(x) dx
The inner product of two functions is a number, a scalar, just as in the traditional finite vector space inner product.
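The integral inner product can be approximated numerically to see symmetry and linearity in action. A sketch (the interval [0, 1] and the functions sin and cos are arbitrary illustrative choices):

```python
import math

def inner(f, g, a=0.0, b=1.0, n=10000):
    # <f, g> = integral over [a, b] of f(x) g(x) dx, via a midpoint Riemann sum
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) * g(a + (i + 0.5) * h) for i in range(n))

f, g = math.sin, math.cos
sym_gap = abs(inner(f, g) - inner(g, f))   # symmetry: <f, g> = <g, f>
lhs = inner(lambda x: 2 * f(x) + g(x), g)  # linearity:
rhs = 2 * inner(f, g) + inner(g, g)        # <2f + g, g> = 2<f, g> + <g, g>
```

Symmetry and linearity hold exactly even for the discretized sum; only the value of the integral itself carries discretization error.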
Recall the SVM Kernel
Intuition: the kernel maps data points to feature functions, which correspond to vectors in a vector space.
The Feature Function
Consider holding one element of the kernel fixed: we get a function of one variable, which we call the feature function. The collection of feature functions is called the feature map. For a Gaussian kernel the feature functions are unnormalized Gaussians centered at the data points:
φ_x(·) = k(·, x) = exp(−‖· − x‖² / (2σ²))
Reproducing Kernel Hilbert Space
Given a kernel k(·, ·), we now construct a Hilbert space such that k defines an inner product in that space.
We begin with a kernel map: Φ: x ↦ k(·, x)
We now construct a vector space containing all linear combinations of the functions k(·, x):
f(·) = Σ_{i=1}^m α_i k(·, x_i)
We now define an inner product. Let g(·) = Σ_{j=1}^{m'} β_j k(·, x'_j); then
⟨f, g⟩ = Σ_{i=1}^m Σ_{j=1}^{m'} α_i β_j k(x_i, x'_j)
Please verify that this in fact is an inner product, satisfying symmetry, linearity, and the zero-norm law: ⟨f, f⟩ = 0 ⇒ f = 0 (here we need the reproducing property and the Cauchy–Schwarz inequality).
Reproducing Kernel Hilbert Space
The k(·, x) is a reproducing kernel map:
⟨k(·, x), f⟩ = Σ_{i=1}^m α_i k(x, x_i) = f(x)
This shows that the kernel is a representer of evaluation (or, an evaluation function). This is analogous to the Dirac delta function. If we plug the kernel in for f:
⟨k(·, x), k(·, x')⟩ = k(x, x')
With such a definition of inner product, we have constructed a subspace of the Hilbert space: a reproducing kernel Hilbert space (RKHS).
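The reproducing property can be verified numerically straight from the definition of the inner product on span{k(·, x_i)}. A sketch with a 1-D Gaussian kernel (the coefficients and centers are arbitrary):

```python
import math

def k(x, z, sigma=1.0):
    # Gaussian kernel on R
    return math.exp(-(x - z) ** 2 / (2.0 * sigma ** 2))

alphas, xs = [0.5, -1.2, 2.0], [0.0, 1.0, -0.5]  # f = sum_i alpha_i k(., x_i)

def f(x):
    return sum(a * k(x, xi) for a, xi in zip(alphas, xs))

def rkhs_inner(a1, p1, a2, p2):
    # <sum_i a1_i k(., p1_i), sum_j a2_j k(., p2_j)> = sum_ij a1_i a2_j k(p1_i, p2_j)
    return sum(u * v * k(p, q) for u, p in zip(a1, p1) for v, q in zip(a2, p2))

x0 = 0.3
lhs = rkhs_inner([1.0], [x0], alphas, xs)  # <k(., x0), f>
rhs = f(x0)                                # the reproducing property says these coincide
```

Taking the inner product with k(·, x0) really does "read off" the value of f at x0, which is why the kernel acts like an evaluation function.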
Back to the Feature Map
The collection of evaluation functions is the feature map!
Intuition: a more complicated feature map/kernel corresponds to a "richer" RKHS. Basically, a really nice infinite-dimensional vector space where even more things behave like the finite case.
Inner Product of Feature Maps
Define the inner product as: ⟨k(·, x), k(·, x')⟩ = k(x, x'), a scalar. Note that the inner product of two feature maps is just the kernel evaluated at the two points.
Mercer's Theorem and RKHS
Recall the condition of Mercer's theorem for K. We can also construct our reproducing kernel Hilbert space from a Mercer kernel, as a linear combination of its eigenfunctions:
∫ k(x, x') φ_j(x') dx' = λ_j φ_j(x),  so that  k(x, x') = Σ_{j=1}^∞ λ_j φ_j(x) φ_j(x')
which can be shown to entail the reproducing property (homework?).
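An empirical analogue of the Mercer expansion: eigendecomposing a finite Gram matrix reproduces the kernel on the sample as a sum of rank-one eigen-terms, with nonnegative eigenvalues. A sketch (random data and an RBF kernel, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
# RBF Gram matrix on the sample: K_ij = exp(-||x_i - x_j||^2 / 2)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)

lam, V = np.linalg.eigh(K)       # K = V diag(lam) V^T
K_rebuilt = (V * lam) @ V.T      # sum_j lam_j v_j v_j^T, the discrete "Mercer expansion"
recon_err = float(np.abs(K - K_rebuilt).max())
min_eig = float(lam.min())       # nonnegative for a valid (Mercer) kernel
```

This is the finite-sample shadow of the theorem: the eigenvectors of K play the role of the eigenfunctions φ_j restricted to the data.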
Summary: RKHS
Consider the set of functions that can be formed as linear combinations of these feature functions:
f(·) = Σ_i α_i k(·, x_i)
We define the reproducing kernel Hilbert space to be the completion of this set (like the set with the "holes" filled in). Intuitively, the feature functions are like an over-complete basis for the RKHS.
Summary: Reproducing Property
It can now be derived that the inner product of a function f with k(·, x) evaluates the function at the point x:
⟨f, k(·, x)⟩ = Σ_i α_i ⟨k(·, x_i), k(·, x)⟩  (linearity of the inner product)
= Σ_i α_i k(x_i, x) = f(x)  (definition of the kernel)
Remember that this is a scalar.
Summary: Evaluation Function
A reproducing kernel Hilbert space is a Hilbert space where, for any x ∈ X, the evaluation functional indexed by x takes the form ⟨f, k(·, x)⟩. The evaluation function k(·, x) must itself be a function in the RKHS. The same evaluation function serves for different functions (at the same point); different points are associated with different evaluation functions.
Equivalent (more technical) definition: an RKHS is a Hilbert space where the evaluation functionals are bounded. (The previous definition then follows from the Riesz representation theorem.)
RKHS or Not?
Is the vector space of 3-dimensional real-valued vectors an RKHS? Yes! (Homework!)
RKHS or Not?
Is the space of square-integrable functions (∫ f(x)² dx < ∞) an RKHS? No! (Homework!) But can't the evaluation functional be an inner product with the delta function? The problem is that the delta function is not in the space!
The Kernel
I can evaluate my evaluation function with another evaluation function! Doing this for all pairs in my dataset gives me the kernel matrix K, with K_ij = ⟨k(·, x_i), k(·, x_j)⟩ = k(x_i, x_j). There may be infinitely many evaluation functions, but I only have a finite number of training points, so the kernel matrix is finite!
Correspondence between Kernels and RKHSs
A kernel is positive semidefinite if the kernel matrix is positive semidefinite for any choice of a finite set of observations.
Theorem (Moore–Aronszajn): every positive semidefinite kernel corresponds to a unique RKHS, and every RKHS is associated with a unique positive semidefinite kernel.
Note that the kernel does not uniquely define the feature map (but we don't really care, since we never directly evaluate the feature map anyway).
RKHS Norm and SVM
Recall that in SVM:
f(·) = ⟨w, φ(·)⟩ = Σ_{i=1}^m α_i y_i k(·, x_i)
Therefore f(·) ∈ H. Moreover:
‖f(·)‖²_H = ⟨Σ_{i=1}^m α_i y_i k(·, x_i), Σ_{j=1}^m α_j y_j k(·, x_j)⟩ = Σ_{i,j=1}^m α_i α_j y_i y_j k(x_i, x_j)
Primal and Dual SVM Objective
In our primal problem, we minimize w^T w subject to constraints. This is equivalent to:
‖w‖² = w^T w = Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j φ(x_i)^T φ(x_j) = Σ_{i,j=1}^m α_i α_j y_i y_j k(x_i, x_j) = ‖f‖²_H
which is equivalent to minimizing the Hilbert norm of f subject to constraints.
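The equality ‖w‖² = ‖f‖²_H can be checked with any explicit finite-dimensional feature map. A sketch using the degree-2 polynomial map (the random α, y, and data are illustrative, not an actual SVM solution):

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (1 + x^T z)^2 on R^2
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

def K(x, z):
    return (1.0 + x @ z) ** 2

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))
y = rng.choice([-1.0, 1.0], size=5)
alpha = rng.uniform(0.0, 1.0, size=5)

w = sum(a * yi * phi(xi) for a, yi, xi in zip(alpha, y, X))  # w = sum_i alpha_i y_i phi(x_i)
primal = float(w @ w)                                        # ||w||^2
dual = sum(alpha[i] * alpha[j] * y[i] * y[j] * K(X[i], X[j])
           for i in range(5) for j in range(5))              # ||f||^2_H
```

The two quantities agree for any α, not just the SVM optimum, since the identity is purely algebraic.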
The Representer Theorem
In the general case, for a primal problem P of the form:
min_{f ∈ H} { C(f, {x_i, y_i}) + Ω(‖f‖_H) }
where {x_i, y_i}_{i=1}^m are the training data, suppose the following conditions are satisfied:
The loss function C is point-wise, i.e., C(f, {x_i, y_i}) = C({x_i, y_i, f(x_i)})
Ω(·) is monotonically increasing
The representer theorem (Kimeldorf and Wahba, 1971): every minimizer of P admits a representation of the form
f(·) = Σ_{i=1}^m α_i K(·, x_i)
i.e., a linear combination of a finite set of functions given by the data.
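Kernel ridge regression is a clean instance of the theorem: the squared loss is point-wise and the regularizer is λ‖f‖²_H, so the minimizer has the finite form above, with α computable in closed form as α = (K + λI)⁻¹ y. A sketch (the data, kernel width, and λ are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(20, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=20)

def gram(A, B, sigma=0.5):
    # RBF Gram matrix between two point sets
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

lam = 0.1
K = gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # closed-form coefficients

def f(Z):
    # The minimizer f(.) = sum_i alpha_i K(., x_i), as the theorem guarantees
    return gram(Z, X) @ alpha

train_mse = float(np.mean((f(X) - y) ** 2))
```

Although H is infinite-dimensional, the search collapses to m coefficients: this is exactly what makes kernel methods computationally feasible.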
Proof of the Representer Theorem
Another View of SVM
Q: why is SVM dual-sparse, i.e., why does it have only a few support vectors (most of the α_i are zero)? The SVM loss w^T w does not seem to imply that, and the representer theorem does not either!
Another View of SVM: L1 Regularization
The basis-pursuit denoising cost function (Chen & Donoho):
J(α) = ½ ‖f(·) − Σ_{i=1}^N α_i φ_i(·)‖²_{L2} + λ ‖α‖_{L1}
Instead we consider the following modified cost:
J(α) = ½ ‖f(·) − Σ_{i=1}^N α_i K(·, x_i)‖²_H + λ ‖α‖_{L1}
RKHS Norm Interpretation of SVM
J(α) = ½ ‖f(·) − Σ_{i=1}^N α_i K(·, x_i)‖²_H + λ ‖α‖_{L1}
The RKHS norm of the first term can now be computed exactly!
RKHS Norm Interpretation of SVM
Now we have the following optimization problem:
min_α { −Σ_i α_i y_i + ½ Σ_{i,j} α_i α_j K(x_i, x_j) + λ Σ_i |α_i| }
This is exactly the dual problem of SVM!
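The step that turns the RKHS-norm data-fit term into this objective expands ‖f − Σ_i α_i K(·, x_i)‖²_H and uses the reproducing property to convert the cross term into Σ_i α_i f(x_i). That expansion can be verified numerically; a sketch (the target f, its centers, and the "data" points are arbitrary illustrations):

```python
import math

def k(x, z):
    return math.exp(-(x - z) ** 2 / 2.0)

def inner(a1, p1, a2, p2):
    # <sum_i a1_i k(., p1_i), sum_j a2_j k(., p2_j)>_H
    return sum(u * v * k(p, q) for u, p in zip(a1, p1) for v, q in zip(a2, p2))

cs, ss = [1.0, -0.5], [0.0, 1.5]  # a fixed target f = sum_i c_i k(., s_i) in the RKHS
def f(x):
    return sum(c * k(x, s) for c, s in zip(cs, ss))

xs = [-1.0, 0.2, 0.8]             # "data" points
alphas = [0.3, -0.7, 1.1]

# ||f - sum_i alpha_i k(., x_i)||^2_H, computed directly ...
diff_coefs = cs + [-a for a in alphas]
diff_pts = ss + xs
lhs = inner(diff_coefs, diff_pts, diff_coefs, diff_pts)

# ... and via the expansion ||f||^2 - 2 sum_i alpha_i f(x_i) + sum_ij alpha_i alpha_j k(x_i, x_j),
# where the middle term uses the reproducing property <f, k(., x_i)> = f(x_i)
rhs = (inner(cs, ss, cs, ss)
       - 2.0 * sum(a * f(x) for a, x in zip(alphas, xs))
       + inner(alphas, xs, alphas, xs))
```

Dropping the constant ‖f‖² term and reading f(x_i) as the label y_i yields exactly the α_i y_i and ½ α_i α_j K(x_i, x_j) terms of the objective above.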
Take-Home Messages
A kernel is a (nonlinear) feature map into a Hilbert space.
Mercer kernels are "legal" kernels.
An RKHS is a Hilbert space equipped with an inner product defined by a Mercer kernel.
The reproducing property makes the kernel work like an evaluation function.
The representer theorem ensures that the optimal solution for a general class of loss functions lies in the span of the kernel functions given by the data.
SVM can be recast as an L1-regularized minimization problem in the RKHS.