Advanced Introduction to Machine Learning


1 Advanced Introduction to Machine Learning, 10715, Fall 2014. The Kernel Trick, Reproducing Kernel Hilbert Space, and the Representer Theorem. Eric Xing. Lecture 6, September 24, 2014.

2 Recap: the SVM problem. We solve the following constrained optimization problem:

$$\max_\alpha\; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \qquad \text{s.t. } 0 \le \alpha_i \le C,\; i = 1,\dots,m,\quad \sum_{i=1}^m \alpha_i y_i = 0.$$

This is a quadratic programming problem, so a global maximum of $J(\alpha)$ can always be found. The solution: $w = \sum_{i=1}^m \alpha_i y_i x_i$. How to predict: $y = \operatorname{sign}(w^\top x + b)$.

3 Kernel. Point rule or average rule: can we predict $\mathrm{vec}(y)$? The dual objective again: $\max_\alpha J(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$.

4 Outline. The kernel trick. Maximum entropy discrimination. Structured SVM, a.k.a. Maximum Margin Markov Networks.

5 (1) Non-linear Decision Boundary. So far, we have only considered large-margin classifiers with a linear decision boundary. How can we generalize it to become nonlinear? Key idea: transform $x_i$ to a higher-dimensional space to "make life easier". Input space: the space where the points $x_i$ are located. Feature space: the space of $\phi(x_i)$ after transformation. Why transform? A linear operation in the feature space is equivalent to a non-linear operation in the input space, and classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature $x_1 x_2$ makes the problem linearly separable (homework; see the sketch below).
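To make the XOR remark concrete, here is a minimal numerical sketch (not from the lecture; the separating weights are one choice among many): in the original two coordinates no linear separator exists, but after appending the feature $x_1 x_2$ a single hyperplane classifies all four points correctly.

```python
import numpy as np

# XOR data: label +1 when exactly one coordinate is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Append the new feature x1*x2 suggested on the slide
X_feat = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])

# In the 3-D feature space the hyperplane x1 + x2 - 2*x1*x2 = 0.5
# separates the classes; w and b below are one valid choice
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(X_feat @ w + b))  # [-1.  1.  1. -1.], matching y
```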

6 Non-linear Decision Boundary.

7 Transforming the Data. [Figure: points $x$ in the input space mapped by $\phi(\cdot)$ to points $\phi(x)$ in the feature space.] Note: the feature space is of higher dimension than the input space in practice. Computation in the feature space can be costly because it is high-dimensional; in fact, the feature space is typically infinite-dimensional! The kernel trick comes to the rescue.

8 The Kernel Trick. Recall the SVM optimization problem:

$$\max_\alpha\; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \qquad \text{s.t. } 0 \le \alpha_i \le C,\; i = 1,\dots,m,\quad \sum_{i=1}^m \alpha_i y_i = 0.$$

The data points only appear as inner products. As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. Many common geometric operations (angles, distances) can be expressed by inner products. Define the kernel function $K$ by $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$.

9 An Example of feature mapping and kernels. Consider an input $x = [x_1, x_2]$. Suppose $\phi(\cdot)$ is given as follows:

$$\phi(x) = \left(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)^\top$$

An inner product in the feature space is

$$\langle \phi(x), \phi(x') \rangle = (1 + x^\top x')^2.$$

So, if we define the kernel function as $K(x, x') = (1 + x^\top x')^2$, there is no need to carry out $\phi(\cdot)$ explicitly.
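A quick numerical check of the identity above (a sketch; the two inputs are arbitrary): the inner product of the explicit 6-dimensional features agrees with evaluating $(1 + x^\top x')^2$ directly in the 2-dimensional input space.

```python
import numpy as np

def phi(x):
    # Explicit feature map whose inner product equals (1 + x.x')^2
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp))   # inner product in the 6-D feature space: 4.0
print((1 + x @ xp) ** 2)  # kernel evaluated in the 2-D input space: 4.0
```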

10 More examples of kernel functions. Linear kernel (we've seen it): $K(x, x') = x^\top x'$. Polynomial kernel (we just saw an example): $K(x, x') = (1 + x^\top x')^p$, where $p = 2, 3, \dots$ To get the feature vectors we concatenate all $p$th-order polynomial terms of the components of $x$ (weighted appropriately). Radial basis kernel: $K(x, x') = \exp\left(-\tfrac{1}{2}\|x - x'\|^2\right)$. In this case the feature space consists of functions and results in a nonparametric classifier.
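These three kernels transcribe directly into code. A sketch (the bandwidth parameter sigma in the radial basis kernel is an added generalization; the slide fixes it to 1):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def poly_kernel(x, xp, p=2):
    return (1 + x @ xp) ** p

def rbf_kernel(x, xp, sigma=1.0):
    # Unnormalized Gaussian in the distance between x and x'
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))
```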

11 The essence of the kernel. Feature mapping, but without paying a cost. E.g., for the polynomial kernel: how many dimensions have we got in the new space? How many operations does it take to compute $K(x, z)$? Kernel design: any principle? $K(x, z)$ can be thought of as a similarity function between $x$ and $z$. This intuition is well reflected in the Gaussian function (similarly, one can easily come up with other $K(\cdot,\cdot)$ in the same spirit). Does this necessarily lead to a legal kernel? (In the above particular case, $K(\cdot,\cdot)$ is a legal one; do you know how many dimensions $\phi(x)$ has?)

12 Kernel matrix. Suppose for now that $K$ is indeed a valid kernel corresponding to some feature mapping $\phi$. Then for $x_1, \dots, x_m$ we can compute an $m \times m$ matrix $\mathbf{K}$, where $K_{ij} = K(x_i, x_j)$. This is called a kernel matrix! Now, if a kernel function is indeed a valid kernel, so that its elements are dot products in the transformed feature space, it must satisfy: Symmetry, $\mathbf{K} = \mathbf{K}^\top$ (proof?). Positive semidefiniteness (proof?).
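Both properties are easy to observe numerically. A minimal sketch (the random data and the choice of the RBF kernel are arbitrary) that builds the kernel matrix and checks symmetry and positive semidefiniteness via its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 arbitrary points in R^3

# m x m kernel matrix K_ij = exp(-||x_i - x_j||^2 / 2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

print(np.allclose(K, K.T))                    # symmetry: True
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # all eigenvalues >= 0: True
```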

13 Mercer kernel.

14 SVM examples.

15 Examples for Non-Linear SVMs: Gaussian Kernel.
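As a runnable counterpart to these slide figures, here is a sketch using scikit-learn (an assumption; the lecture does not use any particular library) that fits a Gaussian-kernel SVM to a toy nonlinear dataset:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma=2.0)  # Gaussian (RBF) kernel SVM
clf.fit(X, y)
print(clf.score(X, y))    # training accuracy on the nonlinear data
print(len(clf.support_))  # number of support vectors found
```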

16 Remember the Kernel Trick!!! Primal formulation: $\phi(x)$ is infinite-dimensional and cannot be directly computed. But the dot product $\phi(x)^\top \phi(x')$ is easy to compute. Dual formulation: the data enter only through $K(x, x') = \phi(x)^\top \phi(x')$.

17 Overview of Hilbert Space Embedding. Create an infinite-dimensional statistic for a distribution. Two requirements: the map from distributions to statistics is one-to-one; and although the statistic is infinite, it is cleverly constructed such that the kernel trick can be applied. Perform Belief Propagation as if these statistics were the conditional probability tables. We will now make this construction more formal by introducing the concept of Hilbert spaces.

18 Vector Space. A set of objects closed under linear combinations (e.g., addition and scalar multiplication), obeying the distributive and associative laws. Normally, you think of these objects as finite-dimensional vectors. However, in general the objects can be functions. Non-rigorous intuition: a function is like an infinite-dimensional vector.

19 Hilbert Space. A Hilbert space is a complete vector space equipped with an inner product. The inner product has the following properties: symmetry, linearity, nonnegativity, and zero ($\langle f, f\rangle = 0$ only for $f = 0$). Basically a nice infinite-dimensional vector space where lots of things behave like the finite case: e.g., using the inner product we can define a norm or orthogonality; e.g., once a norm is defined, it allows one to define notions of convergence.

20 Hilbert Space Inner Product. Example of an inner product (just an example; an inner product is not required to be an integral): $\langle f, g\rangle = \int f(x)\, g(x)\, dx$. The inner product of two functions is a number, just as the traditional finite vector space inner product $\langle u, v\rangle = \sum_i u_i v_i$ is a scalar.
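The integral inner product can be approximated on a grid, which makes the "functions behave like vectors" intuition tangible. A sketch (the interval and the two functions are arbitrary choices):

```python
import numpy as np

# <f, g> = integral of f(x) g(x) dx, approximated by a Riemann sum on [0, 1]
xs = np.linspace(0.0, 1.0, 10_001)
f = np.sin(2 * np.pi * xs)
g = xs**2
dx = xs[1] - xs[0]
print(np.sum(f * g) * dx)  # a single scalar, just like u.v in R^n
```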

21 Recall the SVM kernel. Intuition: it maps data points to feature functions, which correspond to vectors in a vector space.

22 The Feature Function. Consider holding one element of the kernel fixed: we get a function of one variable, $k(\cdot, x)$, which we call the feature function. The collection of feature functions is called the feature map. For a Gaussian kernel the feature functions are unnormalized Gaussians centered at the data points.

23 Reproducing Kernel Hilbert Space. Given a kernel $k(\cdot, \cdot)$, we now construct a Hilbert space such that $k$ defines an inner product in that space. We begin with a kernel map: $x \mapsto k(\cdot, x)$. We then construct a vector space containing all linear combinations of the functions $k(\cdot, x)$:

$$f(\cdot) = \sum_{i=1}^m \alpha_i\, k(\cdot, x_i).$$

We now define an inner product. Let $g(\cdot) = \sum_{j=1}^{m'} \beta_j\, k(\cdot, x'_j)$; we define

$$\langle f, g\rangle = \sum_{i=1}^m \sum_{j=1}^{m'} \alpha_i \beta_j\, k(x_i, x'_j).$$

Please verify that this is in fact an inner product, satisfying symmetry, linearity, and the zero-norm law $\langle f, f\rangle = 0 \Rightarrow f = 0$ (here we need the reproducing property and the Cauchy-Schwarz inequality).

24 Reproducing Kernel Hilbert Space. The $k(\cdot, x)$ is a reproducing kernel map:

$$\langle k(\cdot, x), f\rangle = \sum_{i=1}^m \alpha_i\, k(x_i, x) = f(x).$$

This shows that the kernel is a representer of evaluation (or, an evaluation function). This is analogous to the Dirac delta function. If we plug the kernel in for $f$: $\langle k(\cdot, x), k(\cdot, x')\rangle = k(x, x')$. With such a definition of the inner product, we have constructed a subspace of the Hilbert space: a reproducing kernel Hilbert space (RKHS).
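The construction on the last two slides is short enough to execute directly. A sketch with a 1-D Gaussian kernel (all points and coefficients are arbitrary stand-ins) that computes $\langle f, g\rangle$ from the double sum and then checks the reproducing property $\langle k(\cdot, x), f\rangle = f(x)$:

```python
import numpy as np

def k(x, xp, sigma=1.0):
    # 1-D Gaussian kernel
    return np.exp(-((x - xp) ** 2) / (2 * sigma**2))

# f = sum_i alpha_i k(., x_i),  g = sum_j beta_j k(., x'_j)
x_pts, alphas = np.array([0.0, 1.0, 2.0]), np.array([0.5, -1.0, 2.0])
xp_pts, betas = np.array([0.5, 1.5]), np.array([1.0, 1.0])

# <f, g> = sum_i sum_j alpha_i beta_j k(x_i, x'_j)
print(alphas @ k(x_pts[:, None], xp_pts[None, :]) @ betas)

# Reproducing property: <k(., x), f> = sum_i alpha_i k(x_i, x) = f(x)
x = 0.7
print(alphas @ k(x_pts, x))  # this is exactly f evaluated at x
```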

25 Back to the Feature Map. The collection of evaluation functions is the feature map! Intuition: a more complicated feature map/kernel corresponds to a "richer" RKHS; basically, a really nice infinite-dimensional vector space where even more things behave like the finite case.

26 Inner Product of Feature Maps. Define the inner product as $\langle k(\cdot, x), k(\cdot, x')\rangle = k(x, x')$, a scalar. Note that this is exactly the inner product $\langle \phi(x), \phi(x')\rangle = K(x, x')$ used in the kernel trick.

27 Mercer's theorem and RKHS. Recall the following condition of Mercer's theorem for $K$:

$$\int k(x, x')\, \phi_i(x')\, dx' = \lambda_i\, \phi_i(x).$$

We can also construct our reproducing kernel Hilbert space with a Mercer kernel, as a linear combination of its eigenfunctions:

$$k(x, x') = \sum_{j=1}^{\infty} \lambda_j\, \phi_j(x)\, \phi_j(x'),$$

which can be shown to entail the reproducing property (homework?).

28 Summary: RKHS. Consider the set of functions that can be formed with linear combinations of these feature functions: $f(\cdot) = \sum_i \alpha_i\, k(\cdot, x_i)$. We define the reproducing kernel Hilbert space to be the completion of this set (like with the holes filled in). Intuitively, the feature functions are like an over-complete basis for the RKHS.

29 Summary: Reproducing Property. It can now be derived that the inner product of a function $f$ with $k(\cdot, x)$ evaluates the function at the point $x$:

$$\langle f, k(\cdot, x)\rangle = \sum_i \alpha_i\, \langle k(\cdot, x_i), k(\cdot, x)\rangle = \sum_i \alpha_i\, k(x_i, x) = f(x),$$

using the linearity of the inner product and the definition of the kernel. Remember that the result is a scalar.

30 Summary: Evaluation Function. A reproducing kernel Hilbert space is a Hilbert space where for any $x \in \mathcal{X}$, the evaluation functional indexed by $x$ takes the following form: $f(x) = \langle f, k(\cdot, x)\rangle$. The evaluation function $k(\cdot, x)$ must itself be a function in the RKHS. The same evaluation function serves for different functions $f$ (at the same point), while different points are associated with different evaluation functions. Equivalent (more technical) definition: an RKHS is a Hilbert space where the evaluation functionals are bounded. (The previous definition then follows from the Riesz Representation Theorem.)

31 RKHS or Not? Is the vector space of 3-dimensional real-valued vectors an RKHS? Yes! (Homework!)

32 RKHS or Not? Is the space of square-integrable functions ($\int f(x)^2\, dx < \infty$, i.e., $L^2$) an RKHS? No! (Homework!) But can't the evaluation functional be an inner product with the delta function? The problem is that the delta function is not in the space!

33 The Kernel. I can evaluate my evaluation function with another evaluation function! Doing this for all pairs in my dataset gives me the kernel matrix $\mathbf{K}$: $K_{ij} = \langle k(\cdot, x_i), k(\cdot, x_j)\rangle = k(x_i, x_j)$. There may be infinitely many evaluation functions, but I only have a finite number of training points, so the kernel matrix is finite!

34 Correspondence between Kernels and RKHSs. A kernel is positive semi-definite if the kernel matrix is positive semidefinite for any choice of a finite set of observations. Theorem (Moore-Aronszajn): every positive semi-definite kernel corresponds to a unique RKHS, and every RKHS is associated with a unique positive semi-definite kernel. Note that the kernel does not uniquely define the feature map (but we don't really care, since we never directly evaluate the feature map anyway).

35 RKHS norm and SVM. Recall that in the SVM:

$$f(\cdot) = \langle w, \phi(\cdot)\rangle = \sum_{i=1}^m \alpha_i y_i\, k(\cdot, x_i).$$

Therefore $f(\cdot) \in \mathcal{H}$. Moreover:

$$\|f(\cdot)\|_{\mathcal{H}}^2 = \left\langle \sum_{i=1}^m \alpha_i y_i\, k(\cdot, x_i),\; \sum_{j=1}^m \alpha_j y_j\, k(\cdot, x_j) \right\rangle = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j\, k(x_i, x_j).$$
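The last identity says the RKHS norm of the SVM function is a quadratic form in the dual variables. A sketch (the $\alpha_i$ here are random stand-ins, not the solution of a trained SVM):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
y = rng.choice([-1.0, 1.0], size=10)
alpha = rng.uniform(0.0, 1.0, size=10)  # stand-in dual variables

sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)  # Gaussian kernel matrix

c = alpha * y          # coefficients of f = sum_i c_i k(., x_i)
print(c @ K @ c)       # ||f||_H^2 = sum_ij c_i c_j k(x_i, x_j)
```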

36 Primal and dual SVM objective. In our primal problem, we minimize $w^\top w$ subject to constraints. This is equivalent to:

$$\|w\|^2 = w^\top w = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j\, \phi(x_i)^\top \phi(x_j) = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) = \|f\|_{\mathcal{H}}^2,$$

which is equivalent to minimizing the Hilbert norm of $f$ subject to constraints.

37 The Representer Theorem. In the general case, for a primal problem $P$ of the form:

$$\min_{f \in \mathcal{H}}\; \left\{ C\big(f, \{x_i, y_i\}\big) + \Omega\big(\|f\|_{\mathcal{H}}\big) \right\},$$

where $\{x_i, y_i\}_{i=1}^m$ are the training data, if the following conditions are satisfied: the loss function $C$ is point-wise, i.e., $C(f, \{x_i, y_i\}) = C(\{x_i, y_i, f(x_i)\})$; and $\Omega(\cdot)$ is monotonically increasing; then the representer theorem (Kimeldorf and Wahba, 1971) states that every minimizer of $P$ admits a representation of the form

$$f(\cdot) = \sum_{i=1}^m \alpha_i\, K(\cdot, x_i),$$

i.e., a linear combination of a (finite set of) functions given by the data.
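Kernel ridge regression is perhaps the cleanest instance of the theorem: with a squared loss and $\Omega(t) = \lambda t^2$, the minimizer not only takes the form $f(\cdot) = \sum_i \alpha_i K(\cdot, x_i)$ but has the closed form $\alpha = (\mathbf{K} + \lambda I)^{-1} y$. A sketch (this example is not from the lecture; the kernel, bandwidth, and data are arbitrary):

```python
import numpy as np

def rbf(A, B, sigma=0.5):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # closed-form minimizer

# The representer form: f(x) = sum_i alpha_i K(x, x_i)
X_new = np.array([[0.0], [1.0]])
print(rbf(X_new, X) @ alpha)  # predictions near sin(0) and sin(1)
```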

38 Proof of the Representer Theorem.

39 Another view of the SVM. Q: why is the SVM dual-sparse, i.e., why does it have only a few support vectors (most of the $\alpha_i$ are zero)? The SVM loss $w^\top w$ does not seem to imply that, and the representer theorem does not either!

40 Another view of the SVM: $L_1$ regularization. The basis-pursuit denoising cost function (Chen & Donoho):

$$J(\alpha) = \frac{1}{2}\left\| f(\cdot) - \sum_{i=1}^N \alpha_i\, \phi_i(\cdot) \right\|_{L_2}^2 + \lambda \|\alpha\|_{L_1}.$$

Instead, we consider the following modified cost:

$$J(\alpha) = \frac{1}{2}\left\| f(\cdot) - \sum_{i=1}^N \alpha_i\, K(\cdot, x_i) \right\|_{\mathcal{H}}^2 + \lambda \|\alpha\|_{L_1}.$$

41 RKHS norm interpretation of the SVM.

$$J(\alpha) = \frac{1}{2}\left\| f(\cdot) - \sum_{i=1}^N \alpha_i\, K(\cdot, x_i) \right\|_{\mathcal{H}}^2 + \lambda \|\alpha\|_{L_1}.$$

The RKHS norm in the first term can now be computed exactly!

42 RKHS norm interpretation of the SVM. Now we have the following optimization problem:

$$\min_\alpha\; \left\{ -\sum_i \alpha_i y_i + \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, K(x_i, x_j) + \lambda \sum_i |\alpha_i| \right\}.$$

This is exactly the dual problem of the SVM!

43 Take-home messages. A kernel is a (nonlinear) feature map into a Hilbert space. Mercer kernels are legal kernels. An RKHS is a Hilbert space equipped with an inner product operator defined by a Mercer kernel. The reproducing property makes the kernel work like an evaluation function. The representer theorem ensures that the optimal solution for a general class of loss functions lies in the span of kernel functions centered at the data. The SVM can be recast as an $L_1$-regularized minimization problem in the RKHS.
