6.867 Machine learning, lecture 11 (Jaakkola)


Lecture topics:
- Model selection criteria: Minimum description length (MDL)
- Feature (subset) selection

Model selection criteria: Minimum description length (MDL)

The minimum description length criterion (MDL) turns the model selection problem into a communication problem. The basic idea is that the best model should lead to the best way of compressing the available data. This view equates learning with the ability to compress data. This is indeed a sensible view since learning has to do with the ability to make predictions, and effective compression is precisely based on such ability. So, the more we have learned, the better we can compress the observed data.

In a classification context the communication problem is defined as follows. We have a sender and a receiver. The sender has access to the training data, both examples $x_1, \ldots, x_n$ and labels $y_1, \ldots, y_n$, and needs to communicate the labels to the receiver. The receiver already has the examples but not the labels. Any classifier that perfectly classifies the training examples would permit the receiver to reproduce the labels for the training examples. But the sender (say we) would have to communicate the classifier to the receiver before they can run it on the training examples. It is this part that pertains to the complexity of the set of classifiers we are fitting to the data: the more choices we have, the more bits it takes to communicate the specific choice that perfectly classified the training examples. We can also consider classifiers that are not perfectly accurate on the training examples by communicating the errors that they make. As a result, the communication cost, the number of bits we have to send to the receiver, depends on two things: the cost of errors on the training examples (description length of the data given the model), and the cost of describing the selected classifier (description length of the model), both measured in bits.

Simple two-part MDL. Let's start by defining the basic concepts in a simpler context where we just have a sequence of labels. You could view the examples $x_i$ in this case as just specifying the indexes of the corresponding labels. Consider the following sequence of mostly 1's:

$$\overbrace{1,\ 1,\ 1,\ \ldots,\ -1,\ \ldots,\ 1}^{n\ \text{labels}} \qquad (1)$$

If we assume that each label in the sequence is an independent sample from the same distribution, then each $y_t \in \{-1, 1\}$ is a sample from

$$P(y_t \mid \theta) = \theta^{\,\delta(y_t,1)}\, (1-\theta)^{\,\delta(y_t,-1)} \qquad (2)$$

for some $\theta \in [0, 1]$. Here $\delta(y_t, 1)$ is an indicator such that $\delta(y_t, 1) = 1$ if $y_t = 1$ and zero otherwise. In our model $P(y_t = 1 \mid \theta) = \theta$. We don't necessarily know that this is the correct model for the data. But whatever model (here a set of distributions) we choose, we can evaluate the resulting communication cost.

Now, the question is: given this model for the data, how many bits do we need to send the observed sequence of labels to the receiver? This can be answered generally. Any distribution over the data can be used as the basis for a coding scheme. Moreover, the optimal coding scheme, assuming the data really did follow our distribution, would require $-\log_2 P(y_1, \ldots, y_n)$ bits (base two logarithm) or $-\log P(y_1, \ldots, y_n)$ nats (natural logarithm). We will use the natural logarithm for consistency with other material. You can always convert the answer to bits by dividing the nats by $\log(2)$.

Since our model assumes that the labels are independent of each other, we get

$$-\log P(y_1, \ldots, y_n \mid \theta) = -\sum_{t=1}^{n} \log P(y_t \mid \theta) \qquad (3)$$

It is now tempting to minimize this encoding cost with respect to $\theta$ (equivalently, to maximize the log-likelihood):

$$\min_\theta \Big\{ -\sum_{t=1}^{n} \log P(y_t \mid \theta) \Big\} = -\sum_{t=1}^{n} \log P(y_t \mid \hat{\theta}) \qquad (4)$$
$$= -l(y_1, \ldots, y_n; \hat{\theta}) \qquad (5)$$

where $l(D; \theta)$ is again the log-likelihood of data $D$ and $\hat{\theta}$ is the maximum likelihood setting of the parameters. This corresponds to finding the distribution in our model (the one corresponding to $\hat{\theta}$) that minimizes the encoding length of the data. The problem is that the receiver needs to be able to decode what we send. This is only possible if they have access to the same distribution (are able to recreate the code we used). So the receiver would have to know $\hat{\theta}$ (in addition to the model we are considering, but we will omit this part, assuming that the receiver already knows the type of distributions we have). The solution is to encode the choice of $\hat{\theta}$ as well. This way the receiver can recreate our steps to decode the bits we have sent. So how do we encode $\hat{\theta}$? It is a continuous parameter, and it might seem that we need an infinite number of bits to encode a real number.
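Before turning to the encoding of $\hat{\theta}$ itself, the data-encoding cost in Eqs. (3)-(5) is easy to make concrete. The following is a minimal Python sketch (the function name and the example sequence are mine, not from the lecture) that evaluates $-l(y_1, \ldots, y_n; \hat{\theta})$ in nats for a $\pm 1$ sequence:

```python
import numpy as np

def ml_encoding_cost_nats(labels):
    """Optimal encoding cost -l(y_1,...,y_n; theta_hat) in nats under
    the Bernoulli model of Eq. (2), with theta set to its maximum
    likelihood value (the fraction of +1 labels); 0 log 0 = 0."""
    y = np.asarray(labels)
    n = len(y)
    n1 = int(np.sum(y == 1))   # number of +1 labels
    theta_hat = n1 / n         # ML estimate of P(y = +1)
    cost = 0.0
    if n1 > 0:
        cost -= n1 * np.log(theta_hat)
    if n1 < n:
        cost -= (n - n1) * np.log(1.0 - theta_hat)
    return cost

# A mostly-ones sequence compresses well relative to the n*log(2)
# nats that a fair-coin code (theta = 1/2) would need.
y = np.array([1] * 18 + [-1] * 2)
print(ml_encoding_cost_nats(y))   # about 6.5 nats
print(len(y) * np.log(2))         # about 13.9 nats
```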

But not all values of $\theta$ can arise as ML estimates of the parameters. In our case, the ML estimates have the following form

$$\hat{\theta} = \frac{\hat{n}(1)}{n} \qquad (6)$$

where $\hat{n}(1)$ is the number of 1's in the observed sequence. There are thus only $n+1$ possible values of $\hat{\theta}$, corresponding to $\hat{n}(1) \in \{0, \ldots, n\}$. Since the receiver already knows that we are encoding a sequence of length $n$, we can just send $\hat{\theta}$ by defining a distribution over its $n+1$ possible discrete values, say $Q(\theta = k/n) = \frac{1}{n+1}$ for simplicity (i.e., a uniform distribution over the possible discrete values). As before, the cost in nats is $-\log Q(\hat{\theta}) = \log(n+1)$. The total cost we have to send to the receiver is therefore

$$DL = \overbrace{-l(y_1, \ldots, y_n; \hat{\theta})}^{\text{DL of data given model}} + \overbrace{\log(n+1)}^{\text{DL of model}} \qquad (7)$$

This is known as the description length of the sequence. The minimum description length criterion simply finds the model that leads to the shortest description length. Asymptotically, the simple two-part MDL principle is essentially the same as BIC.
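Continuing the hypothetical sketch above, the two-part cost of Eq. (7) just adds the $\log(n+1)$ nats needed to tell the receiver which discretized $\hat{\theta}$ was used. Comparing it against a parameter-free model shows MDL model selection in miniature (this reuses `ml_encoding_cost_nats` from the previous sketch):

```python
import numpy as np

def two_part_dl_nats(labels):
    """Two-part description length of Eq. (7): the ML encoding cost
    of the data plus log(n+1) nats for communicating theta_hat."""
    n = len(labels)
    return ml_encoding_cost_nats(labels) + np.log(n + 1)

y = np.array([1] * 18 + [-1] * 2)
dl_bernoulli = two_part_dl_nats(y)   # ~6.5 + log(21), about 9.5 nats
dl_fair_coin = len(y) * np.log(2)    # about 13.9 nats; nothing to send
# MDL picks the Bernoulli model: it compresses this skewed sequence
# better even after paying for the description of theta_hat.
print(dl_bernoulli, dl_fair_coin)
```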

Universal coding, normalized maximum likelihood. The problem with the simple two-part coding is that the description length we associate with the second part, the model, may seem a bit arbitrary. We can correct this by finding a distribution that is in some sense universally closest to just encoding the data with the best fitting distribution in our model:

$$-\log P_{\mathrm{NML}}(y_1, \ldots, y_n) \approx \min_\theta \big[ -l(y_1, \ldots, y_n; \theta) \big] = -\max_\theta\, l(y_1, \ldots, y_n; \theta) \qquad (8)$$

In other words, we don't wish to define a prior over the parameters in our model so as to encode the parameter values. It won't be possible to achieve the above approximate equality with equality, since the right hand side, if exponentiated, wouldn't define a valid distribution (because of the ML fitting). The distribution that minimizes the maximum deviation in Eq. (8) is given by

$$P_{\mathrm{NML}}(y_1, \ldots, y_n) = \frac{\exp\big( \max_\theta\, l(y_1, \ldots, y_n; \theta) \big)}{\sum_{y'_1, \ldots, y'_n} \exp\big( \max_\theta\, l(y'_1, \ldots, y'_n; \theta) \big)} \qquad (9)$$

and is known as the normalized maximum likelihood distribution. Using this distribution to encode the sequence leads to a minimax optimal coding length

$$-\log P_{\mathrm{NML}}(y_1, \ldots, y_n) = -\max_\theta\, l(y_1, \ldots, y_n; \theta) + \overbrace{\log \sum_{y'_1, \ldots, y'_n} \exp\big( \max_\theta\, l(y'_1, \ldots, y'_n; \theta) \big)}^{\text{parametric complexity of model}} \qquad (10)$$

Note that the result is again of the same form: the negative log-likelihood with an ML parameter setting plus a complexity penalty. The new parametric complexity penalty depends on the length of the sequence $n$ as well as on the model we use. While the criterion is theoretically nice, the parametric complexity penalty is hard to evaluate in practice.
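For the simple Bernoulli model, however, the penalty can be computed exactly: every sequence with the same number of 1's, $k$, has the same maximized likelihood $(k/n)^k ((n-k)/n)^{n-k}$, so the sum over all $2^n$ sequences in Eq. (10) collapses to a sum over $k$ with binomial multiplicities. A minimal sketch (the function name and the log-sum-exp arrangement are mine):

```python
from math import lgamma, log, exp

def bernoulli_nml_complexity(n):
    """Parametric complexity, in nats, of the Bernoulli model for
    sequences of length n: the log normalizer in Eq. (10), with the
    sum over sequences grouped by the number of 1's, k."""
    log_terms = []
    for k in range(n + 1):
        log_binom = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        log_ml = (k * log(k / n) if k > 0 else 0.0) \
               + ((n - k) * log((n - k) / n) if k < n else 0.0)
        log_terms.append(log_binom + log_ml)
    m = max(log_terms)  # log-sum-exp for numerical stability
    return m + log(sum(exp(t - m) for t in log_terms))

# The penalty grows much more slowly than the log(n+1) nats of the
# two-part code in Eq. (7), roughly like (1/2) log n for this model.
for n in (10, 100, 1000):
    print(n, round(bernoulli_nml_complexity(n), 3), round(log(n + 1), 3))
```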

Feature subset selection

Feature selection is a problem where we wish to identify the components of the feature vectors most useful for classification. For example, some components may be very noisy and therefore unreliable. Depending on how such features¹ would be treated in the classification method, it may be best not to include them. Here we assume that we have no prior knowledge of which features might be useful for solving the classification task and instead have to provide an automated solution.

¹ We will refer to the components of the feature/input vectors as features.

This is a type of model selection problem, where the models are identified with subsets of the features we choose to include (or the number of features we wish to include). Note that kernels do not solve or avoid this problem: taking an inner product between feature vectors that may contain many noisy or irrelevant features results in an unnecessarily varying kernel value. Instead of selecting a subset of the features we could also try to weight them according to how useful they are. This is a related feature weighting problem and we will get to it later on. It is also the method typically used with kernels (cf. kernel optimization).

Let's address the feature selection problem a bit more concretely. Suppose our input features are $d$-dimensional binary vectors $x = [x_1, \ldots, x_d]$ where $x_i \in \{-1, 1\}$, and the labels are binary, $y \in \{-1, 1\}$, as before. We will begin by using a particular type of probability model known as the Naive Bayes model to solve the classification problem. This will help tie our MDL discussion to the feature selection problem. In the Naive Bayes model, we assume that the features are conditionally independent given the label so that

$$P(x, y) = \Big[ \prod_{i=1}^{d} P(x_i \mid y) \Big] P(y) \qquad (11)$$

Put another way, in this model the features are correlated only through the labels. We will contrast this model with a null model that assumes that none of the features are relevant for the label:

$$P_0(x, y) = \Big[ \prod_{i=1}^{d} P(x_i) \Big] P(y) \qquad (12)$$

In other words, in this model we assume that the features and the label are all independent of each other.

The Naive Bayes model (as well as the null model) is a generative model in the sense that it can generate all the data, not just the labels. These models are typically trained using the log-likelihood of all the data as the estimation criterion. Let us follow this here as well; we will discuss later on how this affects the results. So, we aim to maximize

$$\sum_{t=1}^{n} \log P(x_t, y_t) = \sum_{t=1}^{n} \Big[ \sum_{i=1}^{d} \log P(x_{ti} \mid y_t) + \log P(y_t) \Big] \qquad (13)$$
$$= \sum_{i=1}^{d} \sum_{t=1}^{n} \log P(x_{ti} \mid y_t) + \sum_{t=1}^{n} \log P(y_t) \qquad (14)$$
$$= \sum_{i=1}^{d} \Big( \sum_{x_i, y} \hat{n}_{iy}(x_i, y) \log P(x_i \mid y) \Big) + \sum_{y} \hat{n}_y(y) \log P(y) \qquad (15)$$

where we have used the counts $\hat{n}_{iy}(x_i, y)$ and $\hat{n}_y(y)$ to summarize the available data for the purpose of estimating the model parameters. $\hat{n}_{iy}(x_i, y)$ indicates the number of training instances that have value $x_i$ for the $i$-th feature and $y$ for the label. These are called sufficient statistics as they are the only things from the data that the model cares about (the only numbers computed from the data that are required to estimate the model parameters). The ML parameter estimates, the values that maximize the above expression, are simply fractions of these counts:

$$\hat{P}(y) = \frac{\hat{n}_y(y)}{n} \qquad (16)$$
$$\hat{P}(x_i \mid y) = \frac{\hat{n}_{iy}(x_i, y)}{\hat{n}_y(y)} \qquad (17)$$

As a result, we also have $\hat{P}(x_i, y) = \hat{P}(x_i \mid y) \hat{P}(y) = \hat{n}_{iy}(x_i, y)/n$.
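As an illustration, the sufficient statistics and the ML estimates of Eqs. (16)-(17) can be computed in a few lines. This is a sketch under my own naming and data layout (rows are examples, columns are $\pm 1$ features), not code from the course:

```python
import numpy as np

def naive_bayes_ml_estimates(X, y):
    """ML estimates of Eqs. (16)-(17) from count sufficient statistics.
    X is an (n, d) array of +/-1 features, y an (n,) array of +/-1
    labels; assumes both label values occur in the data."""
    n, d = X.shape
    p_y, p_xi_given_y = {}, {}
    for label in (-1, 1):
        mask = (y == label)
        n_y = int(mask.sum())        # count n_hat_y(label)
        p_y[label] = n_y / n         # Eq. (16)
        # n_hat_iy(x_i = +1, label) / n_hat_y(label), for every feature i
        p_xi_given_y[label] = (X[mask] == 1).sum(axis=0) / n_y  # Eq. (17)
    return p_y, p_xi_given_y

# Toy data: feature 0 tracks the label 80% of the time, feature 1 is noise.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=200)
X = np.stack([np.where(rng.random(200) < 0.8, y, -y),
              rng.choice([-1, 1], size=200)], axis=1)
p_y, p_x_given_y = naive_bayes_ml_estimates(X, y)
```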

If we plug these estimates back into the expression for the log-likelihood, we get

$$\hat{l}(S_n) = \sum_{t=1}^{n} \log \hat{P}(x_t, y_t) \qquad (18)$$
$$= \sum_{i=1}^{d} \Big[ \sum_{x_i, y} \hat{n}_{iy}(x_i, y) \log \hat{P}(x_i \mid y) \Big] + \sum_{y} \hat{n}_y(y) \log \hat{P}(y) \qquad (19)$$
$$= n \sum_{i=1}^{d} \overbrace{\Big[ \sum_{x_i, y} \hat{P}(x_i, y) \log \hat{P}(x_i \mid y) \Big]}^{-\hat{H}(X_i \mid Y)} + n \overbrace{\sum_{y} \hat{P}(y) \log \hat{P}(y)}^{-\hat{H}(Y)} \qquad (20)$$
$$= -n \sum_{i=1}^{d} \hat{H}(X_i \mid Y) - n \hat{H}(Y) \qquad (21)$$

where $\hat{H}(Y)$ is the Shannon entropy (uncertainty) of $y$ relative to the distribution $\hat{P}(y)$ and $\hat{H}(X_i \mid Y)$ is the conditional entropy of $X_i$ given $Y$. The entropy $\hat{H}(Y)$ measures how many bits (here nats) we would need on average to encode a value of $y$ chosen at random from $\hat{P}(y)$. It is zero when $y$ is deterministic and takes its largest value (one bit, i.e., $\log 2$ nats) when $\hat{P}(y) = 1/2$ in the case of binary labels. Similarly, the conditional entropy $\hat{H}(X_i \mid Y)$ measures how uncertain we would be about $X_i$ if we already knew the value of $y$ and the values of these variables were sampled from $\hat{P}(x_i, y)$. If $x_i$ is perfectly predictable from $y$ according to $\hat{P}(x_i, y)$, then $\hat{H}(X_i \mid Y) = 0$.

We can perform a similar calculation for the null model and obtain

$$\hat{l}_0(S_n) = \sum_{t=1}^{n} \log \hat{P}_0(x_t, y_t) = -n \sum_{i=1}^{d} \hat{H}(X_i) - n \hat{H}(Y) \qquad (22)$$

Note that the conditioning on $y$ is gone since the features were assumed to be independent of the label. By taking the difference of these two, we can gauge how much we gain in terms of the resulting log-likelihood if we assume the features to depend on the label:

$$\hat{l}(S_n) - \hat{l}_0(S_n) = n \sum_{i=1}^{d} \Big( \hat{H}(X_i) - \hat{H}(X_i \mid Y) \Big) \qquad (23)$$
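All of the quantities in Eqs. (18)-(23) are simple functions of the empirical joint distribution $\hat{P}(x_i, y)$. A sketch (the helper names are mine; it reuses the toy `X`, `y` arrays from the previous sketch):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability vector; 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def empirical_entropies(xi, y):
    """H_hat(X_i) and H_hat(X_i | Y) for one feature column, computed
    from the empirical joint P_hat(x_i, y) as in Eqs. (20)-(22)."""
    n = len(y)
    vx, vy = np.unique(xi), np.unique(y)
    joint = np.array([[np.sum((xi == a) & (y == b)) for b in vy]
                      for a in vx]) / n          # P_hat(x_i, y)
    h_x = entropy(joint.sum(axis=1))             # H_hat(X_i)
    p_y = joint.sum(axis=0)                      # P_hat(y)
    h_x_given_y = sum(p_y[j] * entropy(joint[:, j] / p_y[j])
                      for j in range(len(vy)) if p_y[j] > 0)
    return h_x, h_x_given_y

# Log-likelihood gain of Naive Bayes over the null model, Eq. (23).
n, d = X.shape
gain = 0.0
for i in range(d):
    h_x, h_x_given_y = empirical_entropies(X[:, i], y)
    gain += n * (h_x - h_x_given_y)
print(gain)  # the informative feature contributes almost all of the gain
```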

The expression

$$\hat{I}(X_i; Y) = \hat{H}(X_i) - \hat{H}(X_i \mid Y) \qquad (24)$$

is known as the mutual information between $x_i$ and $y$, relative to the distribution $\hat{P}(x_i, y)$. Mutual information is non-negative and symmetric. In other words,

$$\hat{I}(X_i; Y) = \hat{H}(X_i) - \hat{H}(X_i \mid Y) = \hat{H}(Y) - \hat{H}(Y \mid X_i) \qquad (25)$$

To understand mutual information in more detail, please consult, e.g., Cover and Thomas, listed on the course website.

So, as far as feature selection is concerned, our calculation so far suggests that we should select the features to include in decreasing order of $\hat{I}(X_i; Y)$: the higher the mutual information, the more we gain in terms of log-likelihood. It may seem a bit odd, however, that we can select the features individually, i.e., without considering them in specific combinations. This is due to the simple Naive Bayes assumption and to our use of the log-likelihood of all the data to derive the selection criterion. The mutual information criterion is nevertheless very common and it is useful to understand how it comes about. We will now turn to improving this a bit with MDL.
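Ranking features by mutual information then takes one line per feature. Continuing the sketches above (on the toy data, the informative feature should come first):

```python
import numpy as np

def mutual_information(xi, y):
    """Empirical I_hat(X_i; Y) = H_hat(X_i) - H_hat(X_i | Y), Eq. (24),
    reusing empirical_entropies() from the previous sketch."""
    h_x, h_x_given_y = empirical_entropies(xi, y)
    return h_x - h_x_given_y

scores = np.array([mutual_information(X[:, i], y) for i in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]   # decreasing mutual information
print(ranking, scores[ranking])
```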
