6.867 Machine Learning
6.867 Machine Learning
Problem set 1
Due Friday, September 9, in recitation

Please address all questions and comments about this problem set to the course staff. You do not need to use MATLAB for this problem set, though you can certainly do so. We will provide helpful hints along the way, but if you are not familiar with MATLAB and wish to use MATLAB in this problem set, please consult the course page and the links there.

Part 1: Least-Squares Regression

Reference: Lectures 2 and 3, and the corresponding chapters.

The goal of this section is to solidify basic concepts in least squares regression. Suppose we have some simple dataset, {(x_i, y_i), i = 1, ..., n}, where x_i and y_i are real numbers. Our model of how y is related to x is given by

  y = f(x; w) + e                  (1)
  f(x; w) = w'φ(x)                 (2)

where φ : R → R^d is a specified function (see below) which maps x to a d-dimensional feature vector, φ(x) = (φ_1(x), ..., φ_d(x))'; w is a d-dimensional parameter vector w = (w_1, ..., w_d)'; and e is the prediction error, which we do not model explicitly. We will use w' to denote the transpose of any vector w, as is done in MATLAB. Note that our formulation above does not explicitly include the offset parameter, or w_0, as was done in the lectures. We can incorporate the offset by defining φ_1(x) = 1. In the following, we wish to determine the least squares optimal parameters, or ŵ. In other words, we minimize the following squared prediction error:

  J(w) = Σ_{i=1}^n (y_i − f(x_i; w))^2    (3)

By a similar argument as given in Lecture 2, the solution to this problem is

  ŵ = (X'X)^{-1} X'y               (4)
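Equation (4) is easy to sanity-check numerically. The problem set works in MATLAB; the sketch below is a Python/NumPy stand-in with made-up data (the x and y values and the quadratic feature map (1, x, x²)' are illustrative assumptions, not data from the problem set). It also checks that the residuals are orthogonal to every feature column, as the normal equations require.

```python
import numpy as np

# Illustrative toy data (not from the problem set).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([1.0, 0.0, 2.0, 1.0, 3.0])

# Feature map phi(x) = (1, x, x^2)', stacked as rows of the n x d matrix X.
X = np.column_stack([np.ones_like(x), x, x**2])

# Normal-equation solution w_hat = (X'X)^{-1} X'y, as in eq. (4).
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residuals are orthogonal to each feature column: X' (y - X w_hat) = 0.
resid = y - X @ w_hat
print(w_hat)
print(X.T @ resid)  # all entries numerically ~ 0
```

Using `np.linalg.solve` on the normal equations mirrors the closed form above; in practice `np.linalg.lstsq(X, y, rcond=None)` is the numerically preferred route when X'X is ill-conditioned.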
where X = (φ(x_1), ..., φ(x_n))' is an n × d matrix whose first row is φ_1(x_1), ..., φ_d(x_1) and whose last row is given by φ_1(x_n), ..., φ_d(x_n); the output vector y is defined as y = (y_1, y_2, ..., y_n)'.

Note: We assume that the matrix X'X is invertible so that the problem is well-posed, i.e., there exists a unique minimizer. This is true when the feature vectors φ(x_1), ..., φ(x_n) associated with the training examples span the d-dimensional feature space. When the feature vectors are long and the number of training points is small, this is not at all necessarily the case. For example, it cannot be the case whenever d > n.

Now, for this estimate ŵ the resulting prediction errors ê_i = y_i − f(x_i; ŵ) should be uncorrelated with the features:

  Σ_{i=1}^n ê_i φ_k(x_i) = 0,  k = 1, ..., d    (5)

These conditions are obtained by taking the derivative of J(w) with respect to each w_i, i = 1, ..., d, and setting them to zero. Note that the prediction error need not be zero mean unless one of the features is a constant, i.e., if say φ_1(x) = 1 for all x, so that

  Σ_{i=1}^n ê_i φ_1(x_i) = Σ_{i=1}^n ê_i = 0    (6)

The error is guaranteed to be uncorrelated only with features actually included in the prediction. You may wonder why we are talking about correlation in the first place. Let's explore this a bit further, here and in the problems that follow. For joint samples (u_i, v_i), i = 1, ..., n, the sample covariance is defined as

  Σ_{u,v} = (1/n) Σ_{i=1}^n (u_i − ū)(v_i − v̄)    (7)

where ū = (1/n) Σ_{i=1}^n u_i is the sample mean of u, and similarly for v̄. The samples are uncorrelated if the sample covariance is exactly zero, Σ_{u,v} = 0. Covariance measures how well one variable (or one set of samples) is linearly predictable from the other.

Problem 1

1. (5pts) Assuming that the first component of the feature vector is a constant, i.e., φ_1(x) = 1, show that the joint samples (ê_i, φ_k(x_i)), i = 1, ..., n, are indeed uncorrelated for all k = 1, ..., d according to our definition above.

2. (5pts) Show that all linear functions of the basis functions, i.e., functions of the form w'φ(x) for some w ∈ R^d, are also uncorrelated with the prediction errors ê_i associated with the least squares optimal parameters ŵ. In other words, show that (ê_i, w'φ(x_i)), i = 1, ..., n, are uncorrelated for any w.
3. (5pts) Yet another way of understanding this result is that if we try to fit a linear function (using the same basis functions) to the prediction errors, we can only get zero. Let ŵ and ê_i, i = 1, ..., n, be defined as above. If we now use ỹ_i = ê_i as the new target outputs and repeat the parameter estimation step using these new outputs and the same set of basis functions, show that the resulting new least squares parameters are indeed identically zero.

4. (5pts) Suppose we change our feature representation of examples by rescaling the basis functions, i.e., use φ̃(x) = (a_1 φ_1(x), ..., a_d φ_d(x))' as the feature vector, where a_i, i = 1, ..., d, are any non-zero real numbers. Show that the unscaled solution, the function ŵ'φ(x), is still optimal in the sense that ŵ'φ(x) = ŵ̃'φ̃(x), where ŵ̃ are the least squares optimal parameters for the scaled feature vectors. (Hint: use correlation.)

5. (Optional) Let's go through a small numerical example to get started with MATLAB. We will use the following data (expressed in MATLAB notation): x = [-2 -1 0 1 2]'; y = [ ]'; (both are column vectors; the y values did not survive transcription). Let φ(x) = (1, x, x^2)'. To find the least squares parameters, say wh, in MATLAB, we can construct the X matrix simply as X = [ones(size(x)), x, x.^2]; where the dot refers to an elementwise operation. The matrix inverse in MATLAB is inv(A) for any invertible A. Find the least squares optimal parameters wh in this case. Plot the sample points and the resulting function corresponding to the parameters. Verify that the prediction error is indeed uncorrelated with the basis functions. Repeat the procedure for φ(x) = sin(πx) (only one basis function). (Note: π in MATLAB is simply the constant pi.) Does the result look reasonable? What should the answer be?

Problem 2

The predictions we make in the regression formulation need not be one-dimensional. We can just as easily make predictions that are vector valued. Consider a simple example where the input x takes only binary values x ∈ {0, 1} and y is a two-dimensional measurement y ∈ R^2. Here, the model is

  y = f(x; W) + e       (8)
  f(x; W) = W φ(x)      (9)
where the feature vector is defined by φ(0) = (1, 0)' and φ(1) = (0, 1)'; W is a two-by-two matrix of model parameters; and both the prediction f(x; W) and the prediction errors e_i = y_i − f(x_i; W) are two-dimensional vectors. We now wish to determine the Ŵ that minimizes the squared error

  J(W) = Σ_{i=1}^n ||e_i||^2 = Σ_{i=1}^n e_i'e_i    (10)

1. (10pts) Show that the least-squares estimate of W is

  Ŵ' = (X'X)^{-1} X'Y    (11)

where X = (φ(x_1) ... φ(x_n))' and Y = (y_1 ... y_n)'. Hint: Show that the objective decomposes so that each column of W may be obtained independently; you are essentially solving two separate 1-dimensional regression problems.

Next, consider the data set:

  x    y
  0    (−1, −1)
  0    (−1, −2)
  0    (−2, −1)
  1    (1, 1)
  1    (1, 2)
  1    (2, 1)

2. (5pts) Compute Ŵ. Plot the data points y_i and the columns of Ŵ = (ŷ_0 ŷ_1) (note the transpose).

3. (5pts) Verify that Σ_{i=1}^n ê_i φ(x_i)' = 0 (a 2×2 matrix in this case). What is the interpretation of the columns of Ŵ?

Part 2: Probabilistic Modeling and Likelihood

Reference: Lecture 3, chapter 4 (up to eq. 4.20).

First a bit of background. Suppose we have a probability distribution or density p(x; θ), where x may be discrete or continuous depending on the problem we are interested in. θ specifies the parameters of this distribution, such as the mean and the variance of a one-dimensional Gaussian. Different settings of the parameters imply different distributions over x. The available data, when interpreted as samples x_1, ..., x_n from one such distribution, should favor one setting of the parameters over another. We need a formal criterion for gauging how well any potential distribution p(·|θ) explains or fits the data. Since
p(x|θ) is the probability of reproducing any observation x, it seems natural to try to maximize this probability. This gives rise to the Maximum Likelihood estimation criterion for the parameters θ:

  θ̂_ML = argmax_θ L(x_1, ..., x_n; θ) = argmax_θ Π_{i=1}^n p(x_i|θ)    (12)

where we have assumed that each data point x_i is drawn independently from the same distribution, so that the likelihood of the data is L(x_1, ..., x_n; θ) = Π_{i=1}^n p(x_i; θ). Likelihood is viewed primarily as a function of the parameters, a function that depends on the data. The above expression can be quite complicated (depending on the family of distributions we are considering), and can make maximization technically challenging. However, any monotonically increasing function of the likelihood will have the same maxima. One such function is the log-likelihood log L(x_1, ..., x_n; θ); taking the log turns the product into a sum, making derivatives significantly simpler. We will maximize the log-likelihood instead of the likelihood.

Problem 4

Let x ∈ {0, 1} denote the result of a coin flip (x = 0 for "tails", and x = 1 for "heads"). The coin is potentially biased so that "heads" occurs with probability θ_1. Suppose also that someone else observes the coin flip and reports to you "heads" or "tails" (denote this report by y). But this person is unreliable and only reports the result correctly with probability θ_2 (the correctness of the report is independent of the coin toss).

1. (5pts) Write down the joint probability distribution P(x, y|θ) for all x, y (a 2×2 matrix) as a function of the parameters θ = (θ_1, θ_2).

Suppose we have access to the following (joint) observations of x and y:

  x    y
  [table entries lost in transcription]

2. (10pts) What are the maximum-likelihood (ML) values of θ_1 and θ_2? Provide the details of your derivation as well as the answer. Hint: You can first confirm that P(x, y|θ) = P(y|x, θ_2) P(x|θ_1), where the key observation is that the parameters can be separated into the different components. After all, the distribution of the coin toss, governed by P(x|θ_1), is independent of the accuracy of the report, contained in P(y|x, θ_2). This separation helps you to isolate the estimation of each parameter in the log-likelihood criterion.
3. (10pts) Let θ̂_1(x_1, ..., x_n) be the ML estimator of θ_1 based on the observed data x_1, ..., x_n, where the data is viewed as independent samples from P(x|θ_1) for some fixed θ_1. We can try to assess how well the estimator recovers the parameter θ_1. One useful measure is the bias of the estimator. This is defined as the expectation E[θ̂_1(X_1, ..., X_n) − θ_1], taken with respect to the true distribution of X_1, ..., X_n, or P(X|θ_1). The bias measures whether the estimator systematically deviates from the true parameter θ_1 that was used to generate the data. An estimator is called unbiased if its bias is zero. Show that the ML estimator θ̂_1 is indeed unbiased in this sense.

4. (10pts) We have thus far used only two parameters θ_1 and θ_2 to specify the joint distribution over (x, y). This was possible because of the assumption that the accuracy of the report (whether y = x) is independent of the coin toss (what x is). It takes three parameters to specify an unconstrained joint distribution over (x, y). While there are four possible configurations of the variables, there are only three parameters that can be set independently (the fourth one is determined by normalization, Σ_{x,y} P(x, y) = 1). We can parameterize the joint distribution symmetrically in terms of four numbers P(x, y) = θ_{x,y} that sum to one: Σ_{x,y} θ_{x,y} = 1. When we estimate the maximum likelihood joint distribution, we estimate the ML setting of the parameters θ̂_{x,y}. What is the maximum likelihood estimate of P(x, y) in this case? Which model has the higher log-likelihood?

5. (Optional) Show that the ML parameters θ̂_{x,y} are unbiased estimates of θ_{x,y}.

6. (Optional) Suppose we are not sure which model is correct. Can you extend the leave-one-out cross-validation procedure described in the linear regression context to our setting here? Which model would the resulting cross-validation criterion choose in this case?

Problem 5

Consider a bivariate Gaussian distribution x = (x_1, x_2)' ~ N(µ, Σ) with probability density

  p(x; µ, Σ) = (1 / (2π |Σ|^{1/2})) exp{ −(1/2) (x − µ)' Σ^{-1} (x − µ) }    (13)

where µ = E{x} is the two-dimensional mean vector and Σ = E{(x − µ)(x − µ)'} is the two-by-two covariance matrix (|Σ| is the determinant of the covariance matrix).
The Gaussian is fully specified by the parameters (µ, Σ).

1. (10pts) Given a collection of independent samples x_i, i = 1, ..., n, we wish to estimate the model parameters (µ, Σ). The maximum-likelihood estimates are chosen so as to
maximize the log-likelihood

  J(µ, Σ) = log p(x_1, ..., x_n; µ, Σ)     (14)
          = Σ_{i=1}^n log p(x_i; µ, Σ)     (15)

Show that the ML estimates based on data x_1, ..., x_n are given by the sample mean and the sample covariance:

  µ̂ = (1/n) Σ_{i=1}^n x_i                        (16)
  Σ̂ = (1/n) Σ_{i=1}^n (x_i − µ̂)(x_i − µ̂)'        (17)

Hints: Start with the mean estimate. Express the Gaussian distribution in terms of the inverse covariance matrix A = Σ^{-1} and use the following matrix derivatives:

  d/dA (x − µ)'A(x − µ) = (x − µ)(x − µ)'
  d/dA log |A| = A^{-1}                           (18)

2. (10pts) The bivariate Gaussian distribution allows the two variables to be dependent on each other (the values of the variables co-vary). This dependence is fully described by the covariance matrix. In light of Problem 1 we suspect that this dependence is captured by linearly predicting one variable from the other. Suppose we have access to the x_1 part of the samples from a Gaussian model (µ, Σ) and wish to use them to estimate x_2. Derive the least squares optimal estimate x̂_2(x_1) that minimizes the expected squared error E{(x_2 − x̂_2(x_1))^2} (the expectation is over samples (x_1, x_2) ~ N(µ, Σ)). Hint: Use the fact that the best estimate is of the form x̂_2(x_1) = E{x_2|x_1}, as discussed in the lecture.
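To see eqs. (16) and (17) in action, here is a small simulation sketch, in Python/NumPy rather than the problem set's MATLAB. The particular mean, covariance, seed, and sample size are arbitrary illustrative choices: we draw samples from a known bivariate Gaussian and compute the ML estimates, which should approach the true (µ, Σ) as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" parameters (not from the problem set).
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Draw n independent samples x_i ~ N(mu, Sigma), stacked as an n x 2 array.
xs = rng.multivariate_normal(mu, Sigma, size=5000)
n = xs.shape[0]

# ML estimates from eqs. (16)-(17): sample mean and (1/n)-normalized covariance.
mu_hat = xs.mean(axis=0)
centered = xs - mu_hat
Sigma_hat = (centered.T @ centered) / n

print(mu_hat)
print(Sigma_hat)
```

With n = 5000 the estimates typically agree with the true parameters to within a few hundredths; Σ̂ here coincides exactly with NumPy's `np.cov(xs.T, bias=True)`, the 1/n-normalized sample covariance.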
More information1. The weight of six Golden Retrievers is 66, 61, 70, 67, 92 and 66 pounds. The weight of six Labrador Retrievers is 54, 60, 72, 78, 84 and 67.
Ecoomcs 3 Itroducto to Ecoometrcs Sprg 004 Professor Dobk Name Studet ID Frst Mdterm Exam You must aswer all the questos. The exam s closed book ad closed otes. You may use your calculators but please
More information12.2 Estimating Model parameters Assumptions: ox and y are related according to the simple linear regression model
1. Estmatg Model parameters Assumptos: ox ad y are related accordg to the smple lear regresso model (The lear regresso model s the model that says that x ad y are related a lear fasho, but the observed
More informationLecture 9: Tolerant Testing
Lecture 9: Tolerat Testg Dael Kae Scrbe: Sakeerth Rao Aprl 4, 07 Abstract I ths lecture we prove a quas lear lower boud o the umber of samples eeded to do tolerat testg for L dstace. Tolerat Testg We have
More informationLecture Notes Types of economic variables
Lecture Notes 3 1. Types of ecoomc varables () Cotuous varable takes o a cotuum the sample space, such as all pots o a le or all real umbers Example: GDP, Polluto cocetrato, etc. () Dscrete varables fte
More informationChapter 9 Jordan Block Matrices
Chapter 9 Jorda Block atrces I ths chapter we wll solve the followg problem. Gve a lear operator T fd a bass R of F such that the matrx R (T) s as smple as possble. f course smple s a matter of taste.
More informationX ε ) = 0, or equivalently, lim
Revew for the prevous lecture Cocepts: order statstcs Theorems: Dstrbutos of order statstcs Examples: How to get the dstrbuto of order statstcs Chapter 5 Propertes of a Radom Sample Secto 55 Covergece
More informationLecture Notes to Rice Chapter 5
ECON 430 Revsed Sept. 06 Lecture Notes to Rce Chapter 5 By H. Goldste. Chapter 5 gves a troducto to probablstc approxmato methods, but s suffcet for the eeds of a adequate study of ecoometrcs. The commo
More informationPoint Estimation: definition of estimators
Pot Estmato: defto of estmators Pot estmator: ay fucto W (X,..., X ) of a data sample. The exercse of pot estmato s to use partcular fuctos of the data order to estmate certa ukow populato parameters.
More informationRademacher Complexity. Examples
Algorthmc Foudatos of Learg Lecture 3 Rademacher Complexty. Examples Lecturer: Patrck Rebesch Verso: October 16th 018 3.1 Itroducto I the last lecture we troduced the oto of Rademacher complexty ad showed
More information4. Standard Regression Model and Spatial Dependence Tests
4. Stadard Regresso Model ad Spatal Depedece Tests Stadard regresso aalss fals the presece of spatal effects. I case of spatal depedeces ad/or spatal heterogeet a stadard regresso model wll be msspecfed.
More informationTHE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA
THE ROYAL STATISTICAL SOCIETY 3 EXAMINATIONS SOLUTIONS GRADUATE DIPLOMA PAPER I STATISTICAL THEORY & METHODS The Socety provdes these solutos to assst caddates preparg for the examatos future years ad
More informationLogistic regression (continued)
STAT562 page 138 Logstc regresso (cotued) Suppose we ow cosder more complex models to descrbe the relatoshp betwee a categorcal respose varable (Y) that takes o two (2) possble outcomes ad a set of p explaatory
More information22 Nonparametric Methods.
22 oparametrc Methods. I parametrc models oe assumes apror that the dstrbutos have a specfc form wth oe or more ukow parameters ad oe tres to fd the best or atleast reasoably effcet procedures that aswer
More informationSome Different Perspectives on Linear Least Squares
Soe Dfferet Perspectves o Lear Least Squares A stadard proble statstcs s to easure a respose or depedet varable, y, at fed values of oe or ore depedet varables. Soetes there ests a deterstc odel y f (,,
More informationFridayʼs lecture" Problem solutions" Joint densities" 1."E(X) xf (x) dx (x,y) dy X,Y Marginal distributions" The distribution of a ratio" Problems"
Frdayʼs lecture" Jot destes" Margal dstrbutos" The dstrbuto of a rato" Problems" Problem solutos" 1." E(X) = xf X (x)dx = x f X,Y (x,y)dy dx 2. E(X) " = kp X (k) = p X (1) + 2p X (2) +... = xf X,Y (x,y)dxdy
More informationCS 2750 Machine Learning. Lecture 8. Linear regression. CS 2750 Machine Learning. Linear regression. is a linear combination of input components x
CS 75 Mache Learg Lecture 8 Lear regresso Mlos Hauskrecht mlos@cs.ptt.edu 539 Seott Square CS 75 Mache Learg Lear regresso Fucto f : X Y s a lear combato of put compoets f + + + K d d K k - parameters
More information16 Homework lecture 16
Quees College, CUNY, Departmet of Computer Scece Numercal Methods CSCI 361 / 761 Fall 2018 Istructor: Dr. Sateesh Mae c Sateesh R. Mae 2018 16 Homework lecture 16 Please emal your soluto, as a fle attachmet,
More informationSupervised learning: Linear regression Logistic regression
CS 57 Itroducto to AI Lecture 4 Supervsed learg: Lear regresso Logstc regresso Mlos Hauskrecht mlos@cs.ptt.edu 539 Seott Square CS 57 Itro to AI Data: D { D D.. D D Supervsed learg d a set of eamples s
More informationLecture 2: Linear Least Squares Regression
Lecture : Lear Least Squares Regresso Dave Armstrog UW Mlwaukee February 8, 016 Is the Relatoshp Lear? lbrary(car) data(davs) d 150) Davs$weght[d]
More informationThe Occupancy and Coupon Collector problems
Chapter 4 The Occupacy ad Coupo Collector problems By Sarel Har-Peled, Jauary 9, 08 4 Prelmares [ Defto 4 Varace ad Stadard Devato For a radom varable X, let V E [ X [ µ X deote the varace of X, where
More informationMultiple Regression. More than 2 variables! Grade on Final. Multiple Regression 11/21/2012. Exam 2 Grades. Exam 2 Re-grades
STAT 101 Dr. Kar Lock Morga 11/20/12 Exam 2 Grades Multple Regresso SECTIONS 9.2, 10.1, 10.2 Multple explaatory varables (10.1) Parttog varablty R 2, ANOVA (9.2) Codtos resdual plot (10.2) Trasformatos
More information