Homework Assignment 3
Due in class, Thursday October 15
SDS 383C Statistical Modeling I

1 Ridge regression and Lasso

1. Get the Prostate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/datasets/prostate.data. More information about this dataset can be found in http://statweb.stanford.edu/~tbs/elemstatlearn/datasets/prostate.info.txt.

(a) (2.5 pts) In class we learned about Ridge regression with tuning parameter λ. Define

df(λ) = tr( X (X^T X + λI)^{-1} X^T ).

Plot the coefficients of the covariates as λ is increased from 0 to 1000. A similar plot can be found in Figure 3.8 of H-T-F, which essentially plots the ridge regression coefficients of the covariates as df(λ) is increased.

(b) (2.5 pts) Now plot the coefficients learned by Lasso as λ is increased from 0 to 100. For this you can use the LARS package.

(c) (5 pts) Finally, reproduce columns 4 and 5 for Ridge regression and Lasso in Table 3.3. Remember to reproduce the test set prediction errors as well. (A code sketch for computing both coefficient paths is given at the end of Section 2.1 below.)

2 Discriminative vs. Generative Classifiers

A very common debate in statistical learning has been over generative versus discriminative models for classification. In this question we will explore this issue, both theoretically and practically. We will consider the Naive Bayes and logistic regression classification algorithms. To answer this question, you might want to read: On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes, Andrew Y. Ng and Michael Jordan. In NIPS 14, 2002. http://www.robotics.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

2.1 Logistic regression and Naive Bayes

(a) (3 pts) The discriminative analog of Naive Bayes is logistic regression. This means that the parametric form of P(Y | X) used by logistic regression is implied by the assumptions of a Naive Bayes classifier, for some specific class-conditional densities. In the reading you will see how to prove this for a Gaussian Naive Bayes classifier with continuous input values. Can you prove the same for binary inputs? Assume X and Y are both binary. Assume that X_i | Y = j is Bernoulli(θ_ij), where j ∈ {0, 1}, and Y is Bernoulli(π).
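SOLUTION SKETCH (one way the argument can go): With X = (X_1, ..., X_d) and the Naive Bayes assumption, P(X | Y = j) = Π_i θ_ij^{X_i} (1 - θ_ij)^{1 - X_i}. By Bayes rule,

P(Y = 1 | X) = P(Y = 1) P(X | Y = 1) / [ P(Y = 1) P(X | Y = 1) + P(Y = 0) P(X | Y = 0) ] = 1 / (1 + exp(-z)),

where z = log[ π P(X | Y = 1) / ((1 - π) P(X | Y = 0)) ]. Expanding the logarithm under the factorized likelihood,

z = log(π / (1 - π)) + Σ_i [ X_i log(θ_i1 / θ_i0) + (1 - X_i) log((1 - θ_i1) / (1 - θ_i0)) ] = w_0 + Σ_i w_i X_i,

with w_0 = log(π / (1 - π)) + Σ_i log((1 - θ_i1) / (1 - θ_i0)) and w_i = log(θ_i1 / θ_i0) - log((1 - θ_i1) / (1 - θ_i0)). Since z is linear in X, we get P(Y = 1 | X) = 1 / (1 + exp(-(w_0 + Σ_i w_i X_i))), which is exactly the parametric form used by logistic regression.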
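For Question 1 in Section 1, the following is a minimal sketch of how the ridge coefficient paths, df(λ), and the lasso path might be computed. It is an illustration only, written in Python and assuming numpy, pandas, and scikit-learn are available (the R LARS package suggested in part (b) is replaced here by scikit-learn's lars_path), and assuming a local copy of prostate.data, which is tab-delimited with a trailing train indicator column:

import numpy as np
import pandas as pd
from sklearn.linear_model import lars_path

# Load the prostate data and keep the training rows, as marked in the file.
df = pd.read_csv("prostate.data", sep="\t", index_col=0)
train = df[df["train"] == "T"].drop(columns="train")
X = train.drop(columns="lpsa").values
y = train["lpsa"].values

# Standardize predictors and center the response, as in H-T-F.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

# Ridge coefficient path and effective degrees of freedom df(lambda).
lambdas = np.linspace(0, 1000, 200)
p = X.shape[1]
coefs, dfs = [], []
for lam in lambdas:
    A = X.T @ X + lam * np.eye(p)
    coefs.append(np.linalg.solve(A, X.T @ y))            # ridge coefficients
    dfs.append(np.trace(X @ np.linalg.solve(A, X.T)))    # df(lambda)
coefs = np.array(coefs)

# Lasso path via LARS: coefficients at each breakpoint of the path.
alphas, _, lasso_coefs = lars_path(X, y, method="lasso")

Plotting coefs against lambdas (or against dfs) gives the analog of Figure 3.8, and plotting lasso_coefs.T against alphas gives the lasso path for part (b); note that scikit-learn scales its penalty by the number of samples, so rescale the alphas if you want them on the λ scale used in class.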
2.2 (1+2+3+1 pts) Double counting the evidence

(a) Consider the two-class problem where the class label y ∈ {T, F} and each training example X has 2 binary attributes X_1, X_2 ∈ {T, F}. How many parameters will you need to know/evaluate if you are to classify an example using the Naive Bayes classifier? Let the class prior be P(Y = T) = 0.5, and also let P(X_1 = T | Y = T) = 0.8, P(X_1 = F | Y = F) = 0.7, P(X_2 = T | Y = T) = 0.5 and P(X_2 = F | Y = F) = 0.9. So attribute X_1 provides slightly stronger evidence about the class label than X_2.

i. Assume X_1 and X_2 are truly independent given Y. Write down the Naive Bayes decision rule.

SOLUTION: Given an example with attribute values (x_1, x_2), predict Y = T if

log[ P(Y = T) / P(Y = F) ] + log[ P(X_1 = x_1 | Y = T) / P(X_1 = x_1 | Y = F) ] + log[ P(X_2 = x_2 | Y = T) / P(X_2 = x_2 | Y = F) ] > 0,

else predict Y = F.

ii. Show that if Naive Bayes uses both attributes, X_1 and X_2, the error rate is 0.235. Is it better than using only a single attribute (X_1 or X_2)? Why? The error rate is defined as the probability that each class generates an observation where the decision rule is incorrect.

SOLUTION: The expected error rate is the probability that each class generates an observation where the decision rule is incorrect: if Y is the true label and Ỹ(X_1, X_2) is the result of classification (the predicted class label), then the expected error rate is

E_e = Σ_{i ∈ {T,F}} Σ_{j ∈ {T,F}} Σ_{k ∈ {T,F}} P(X_1 = i, X_2 = j, Y = k) I[k ≠ NB-Prediction(X_1 = i, X_2 = j)]   (1)

where I[z] is the indicator function: if condition z is true, then I[z] = 1, else I[z] = 0.

Given X_1 alone, Naive Bayes will make the following predictions: X_1 = T → Y = T, and X_1 = F → Y = F. We only need to calculate P(X_1 = F, Y = T) and P(X_1 = T, Y = F), since these are the 2 cases in which NB makes a wrong prediction. In the other 2 cases, when NB does not make a mistake, the indicator function I[z] in equation (1) is 0 and the whole term vanishes. We obtain:

E_e = P(X_1 = F, Y = T) + P(X_1 = T, Y = F)
    = P(Y = T) P(X_1 = F | Y = T) + P(Y = F) P(X_1 = T | Y = F)
    = 0.5 · 0.2 + 0.5 · 0.3 = 0.25

Similarly, for X_2 alone we obtain E_e = 0.5 · 0.5 + 0.5 · 0.1 = 0.3.

With both attributes we proceed as in the 1-attribute case. First, we obtain the predictions of Naive Bayes, which are:

X_1 = T and X_2 = T → Y = T
X_1 = T and X_2 = F → Y = T
X_1 = F and X_2 = T → Y = T
X_1 = F and X_2 = F → Y = F   (2)
Again, to keep the calculation short, we only consider the cases where the predicted class differs from the true class. We get:

E_e = P(X_1 = T, X_2 = T, Y = F) + P(X_1 = T, X_2 = F, Y = F)
    + P(X_1 = F, X_2 = T, Y = F) + P(X_1 = F, X_2 = F, Y = T)

Remember, the probabilities factorize: P(X_1, X_2, Y) = P(Y) P(X_1 | Y) P(X_2 | Y), so

E_e = 0.5 · 0.3 · 0.1 + 0.5 · 0.3 · 0.9 + 0.5 · 0.7 · 0.1 + 0.5 · 0.2 · 0.5 = 0.235   (3)

iii. Now suppose that we create a new attribute X_3, which is an exact copy of X_2. So, for every training example, attributes X_2 and X_3 have the same value, X_2 = X_3. What is the error rate of Naive Bayes now?

SOLUTION: As in the previous case, we first obtain the predictions of Naive Bayes. Notice that NB now also considers attribute X_3, which it assumes to be conditionally independent of X_1 and X_2 given Y. The predictions change, since double counting the evidence from X_2 makes the difference:

X_1 = T and X_2 = T → Y = T
X_1 = T and X_2 = F → Y = F   ***prediction changed!
X_1 = F and X_2 = T → Y = T
X_1 = F and X_2 = F → Y = F   (4)

(For example, for X_1 = T and X_2 = X_3 = F the posterior odds are (0.8 · 0.5 · 0.5)/(0.3 · 0.9 · 0.9) = 0.2/0.243 < 1, so NB now predicts F.)

However, nature (the true model) remains the same. We did not introduce any new information into the system by creating X_3. So P(X_1, X_2, X_3, Y) = 0 if X_2 ≠ X_3 (this never happens), and otherwise P(X_1, X_2, X_3, Y) = P(X_1, X_2, Y), since X_3 is an exact (deterministic) copy of X_2. So the calculation of the expected error is the same as in the previous question, but now over the cases where the new predictions are wrong:

E_e = P(X_1 = T, X_2 = T, Y = F) + P(X_1 = T, X_2 = F, Y = T)
    + P(X_1 = F, X_2 = T, Y = F) + P(X_1 = F, X_2 = F, Y = T)

The probabilities factorize as in the previous case, giving

E_e = 0.5 · 0.3 · 0.1 + 0.5 · 0.8 · 0.5 + 0.5 · 0.7 · 0.1 + 0.5 · 0.2 · 0.5 = 0.3   (5)

COMMON MISTAKE 1: Many people calculated P(X_1, X_2, X_3, Y) as P(X_1, X_2, X_3, Y) = P(Y) P(X_1 | Y) P(X_2 | Y) P(X_3 | Y), but this is not true, since X_2 and X_3 are not conditionally independent given Y. Also, X_3 is a deterministic copy of X_2, so P(X_1, X_2, X_3, Y) = 0 if X_2 ≠ X_3.

Nature, the true model that generates the data, did not change. By duplicating an attribute we did not introduce any new information into the system, so it makes no sense for performance to improve. Some people got a result where the error decreased, which does not make much sense, since then we could drive the error rate to 0 just by duplicating attribute X_2 many times.
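These error rates are easy to verify by enumerating the joint distribution. The following small Python sketch is added purely as a check (it is not part of the graded solution); the probabilities are the ones given in part (a):

from itertools import product

# Conditional probability tables from the problem statement: table[y][x].
p_y = {True: 0.5, False: 0.5}
p_x1 = {True: {True: 0.8, False: 0.2}, False: {True: 0.3, False: 0.7}}
p_x2 = {True: {True: 0.5, False: 0.5}, False: {True: 0.1, False: 0.9}}

def nb_error(weight2):
    """Expected NB error when the evidence from X2 is counted weight2 times."""
    err = 0.0
    for y, x1, x2 in product([True, False], repeat=3):
        joint = p_y[y] * p_x1[y][x1] * p_x2[y][x2]             # true model
        odds_t = p_y[True] * p_x1[True][x1] * p_x2[True][x2] ** weight2
        odds_f = p_y[False] * p_x1[False][x1] * p_x2[False][x2] ** weight2
        if (odds_t >= odds_f) != y:                            # NB prediction wrong
            err += joint
    return err

print(nb_error(0))  # 0.25:  X1 alone (X2 ignored)
print(nb_error(1))  # 0.235: X1 and X2, each counted once
print(nb_error(2))  # 0.3:   X2 duplicated as X3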
iv. Explain what is happening with Naive Bayes.

SOLUTION: NB assumes the attributes are conditionally independent given the class. This is no longer true once we introduce X_3, so NB over-counts the evidence from attribute X_2 and the error increases.

v. (extra credit, 10 pts) In spite of the above fact, we will see that in some examples Naive Bayes doesn't do too badly. Consider the above example, i.e. your features are X_1, X_2, which are truly independent given Y, and a third feature X_3 = X_2. Suppose you are now given an example with X_1 = T and X_2 = F. You are also given the probabilities P(Y = T | X_1 = T) = p, P(Y = T | X_2 = F) = q, and P(Y = T) = 0.5. Prove that the decision rule is

p ≥ (1 - q)² / (q² + (1 - q)²)

(Hint: use Bayes rule again). What is the true decision rule? Plot the two decision boundaries (vary q between 0 and 1) and show where Naive Bayes makes mistakes.

SOLUTION: Naive Bayes. The decision rule for Naive Bayes is: predict Y = T if

Pr{Y = T} Π_i Pr{X_i = x_i | Y = T} ≥ Pr{Y = F} Π_i Pr{X_i = x_i | Y = F}.

But we know that Pr{Y = T} = Pr{Y = F}, so this is

Π_i Pr{X_i = x_i | Y = T} ≥ Π_i Pr{X_i = x_i | Y = F}.

Using Bayes rule again we get

Pr{X_i = x_i | Y = T} = Pr{Y = T | X_i = x_i} Pr{X_i = x_i} / Pr{Y = T},

and similarly for Y = F. Now the Pr{X_i = x_i} parts cancel from both sides, and again Pr{Y = T} = Pr{Y = F}. Hence we now have

Π_i Pr{Y = T | X_i = x_i} ≥ Π_i Pr{Y = F | X_i = x_i}.

Setting x_1 = T and x_3 = x_2 = F, we find:

Pr{Y = T | X_1 = T} Pr{Y = T | X_2 = F} Pr{Y = T | X_3 = F} ≥ Pr{Y = F | X_1 = T} Pr{Y = F | X_2 = F} Pr{Y = F | X_3 = F}
Plugging in the values given in the question, we obtain

p q² ≥ (1 - p)(1 - q)².

Rearranging, p q² + p (1 - q)² ≥ (1 - q)², so the Naive Bayes decision rule is:

Predict Y = T iff p ≥ (1 - q)² / (q² + (1 - q)²).

Actual decision rule. The actual decision rule does not take X_3 into consideration, as we know that X_1 and X_2 are truly independent given Y and they are enough to predict Y. Therefore, we predict true if and only if Pr{Y = T | X_1 = T, X_2 = F} ≥ Pr{Y = F | X_1 = T, X_2 = F}. Following similar calculations as before, we have

p q ≥ (1 - p)(1 - q), i.e. p ≥ 1 - q.

Thus the real decision rule is:

Predict Y = T iff p ≥ 1 - q.

[Figure 1: True and Naive Bayes decision boundaries, with q on the horizontal axis and p on the vertical axis, both from 0 to 1.]

The curved line is for Naive Bayes, and the straight line shows the true decision boundary. The shaded part in Figure 1 shows the region where the Naive Bayes decision differs from the true decision.

COMMON MISTAKE 2: Some students started with Pr{Y = T | X_1} Pr{Y = T | X_2} Pr{Y = T | X_3} ≥ Pr{Y = F | X_1} Pr{Y = F | X_2} Pr{Y = F | X_3}. Note that conditional independence does not imply this. However, for these particular parameters it works out to have this form.
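A few lines suffice to reproduce Figure 1. The sketch below is not part of the original solution and assumes Python with numpy and matplotlib; the shaded band between the two curves is exactly the region where the Naive Bayes rule disagrees with the true rule:

import numpy as np
import matplotlib.pyplot as plt

q = np.linspace(0.001, 0.999, 500)
nb_boundary = (1 - q) ** 2 / (q ** 2 + (1 - q) ** 2)  # NB: predict T iff p >= this
true_boundary = 1 - q                                 # truth: predict T iff p >= 1 - q

plt.plot(q, nb_boundary, label="Naive Bayes boundary")
plt.plot(q, true_boundary, label="true boundary")
# Between the two curves exactly one rule predicts T, so NB errs there.
plt.fill_between(q, np.minimum(nb_boundary, true_boundary),
                 np.maximum(nb_boundary, true_boundary), alpha=0.3)
plt.xlabel("q")
plt.ylabel("p")
plt.legend()
plt.show()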
2.3 Learning Curves of Naive Bayes and Logistic Regression

Compare the two approaches on the Breast Cancer dataset, which you can download from the course webpage. You can find the description of this dataset on the course webpage. We have removed the records with missing values for you. Obtain learning curves similar to Figure 1 in the paper. Implement a Naive Bayes classifier and a logistic regression classifier with the assumption that each attribute value for a particular record is independently generated.

For the NB classifier, assume that P(x_i | y), where x_i is a feature in the breast cancer data (that is, i is the number of the column in the data file) and y is the label, has the following multinomial distribution form: for x_i ∈ {v_1, v_2, ..., v_n},

p(x_i = v_k | y = j) = θ_ijk   s.t. ∀i, j: Σ_{k=1}^{n} θ_ijk = 1, where 0 ≤ θ_ijk ≤ 1.

It may be easier to think of this as a normalized histogram, or as a multi-value extension of the Bernoulli. Use 2/3 of the examples for training and the remaining 1/3 for testing. Be sure to use 2/3 of each class, not just the first 2/3 of the data points. For each algorithm:

1. (3 pts) Implement the IRLS algorithm for logistic regression. (A code sketch is given after this section's solutions.)

2. (6 pts) Plot a learning curve: the accuracy vs. the size of the training data. Generate six points on the curve, using the fractions [.01 .02 .03 .125 .625 1] of your training set and testing on the full test set each time. Average your results over 5 random splits of the data into a training and test set (always keep 2/3 of the data for training and 1/3 for testing, but randomize which points go to the training set and which to the test set). This averaging will make your results less dependent on the order of records in the file. Plot both the Naive Bayes and logistic regression learning curves on the same plot. Use the plot(x,y) function in Matlab, since the training data fractions are not equally spaced. Specify your choice of prior/regularization parameters and keep those parameters constant for these tests. A typical choice of constants would be to add 1 to each bin before normalizing (for NB) and λ = 0 (for LR).

SOLUTION: Please look at Figure 2. Something to remember: many students treated the discrete values of each attribute categorically, i.e. with a different weight for each possible value of a feature. However, note that here the values of the features mean something, i.e. the strength of that feature, so it is better to use them numerically. I have not penalized any of these solutions.

3. (1 pt) What conclusions can you draw from your experiments? Specifically, what can you say about the speed of convergence of the classifiers? Are these consistent with the results in the NIPS paper that we have mentioned? If yes, state that. If no, explain why not.

SOLUTION: From Figure 2 it can be observed that NB gets to its asymptotic error much faster than LR. For some of you the two curves will cross, and for some of you they will come very close but not cross. Both of these are fine, because the data set is too small compared to the number of covariates to actually observe the effect reported in Ng et al.'s paper.
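For reference, here is a minimal sketch of the IRLS updates for item 1, written in Python with numpy (the assignment suggests Matlab, but the algebra is identical). Here X is the n × d design matrix with an intercept column already appended, y ∈ {0, 1}^n, and lam is an optional ridge penalty (lam = 0 matches the suggested default):

import numpy as np

def irls_logistic(X, y, lam=0.0, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))              # current probabilities
        s = p * (1 - p)                               # IRLS weights, diag of W
        grad = X.T @ (y - p) - lam * w                # penalized score
        H = X.T @ (X * s[:, None]) + lam * np.eye(d)  # negative Hessian
        step = np.linalg.solve(H, grad)               # Newton step
        w += step
        if np.linalg.norm(step) < tol:
            break
    return w

On the smallest training fractions the unpenalized Hessian can become singular (e.g. under perfect separation); a tiny lam > 0 is a pragmatic fix if that happens, as long as you report the choice and keep it constant across runs.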
Figure 2: SOLUTION: Your plot should look something similar to this. [The figure shows the LR and NB learning curves, plotted against the fraction of training data used, from 0 to 1.]