STATS216v Introduction to Statistical Learning, Stanford University, Summer 2016. Practice Final (Solutions). Duration: 3 hours.


Instructions: (This is a practice final and will not be graded.) Remember the university honor code. Write your name and SUNet ID (ThisIsYourSUNetID@stanford.edu) on each page. There are 25 questions in total. All questions are of equal value and are meant to elicit fairly short answers: each question can be answered using 1-5 sentences. You may not access the internet during the exam. You are allowed to use a calculator, though calculations in the exam, if any, do not have to be carried through to obtain full credit. You may refer to your course textbook and notes, and you may use your laptop provided that internet access is disabled. Please write neatly.

1. An economics firm is trying to classify whether the GDP of the United States will increase or decrease based on the stock market index, the unemployment rate, and the consumer price index. The firm uses K-nearest neighbors to run the classification. Can the firm determine the impact that the unemployment rate has on the response? Explain.

No. The role of individual predictors cannot be determined from K-nearest neighbors; the firm would have to use a different approach.

2. There has been recent debate in biology on whether generosity is hereditary. To investigate the question, a researcher runs a linear regression using the amount of money donated by a given person as the response, and a certain collection of predictors. Later he reruns the regression, but now includes the amount of money donated by the person's parents as a predictor. He finds that with the additional predictor the RSS of the model goes down, and therefore claims there is evidence to conclude that generosity is hereditary. Is his reasoning sound? Explain.

It is not. The training RSS can never increase when we include another regressor in our linear regression model, and almost always decreases. Furthermore, even if parent donations are predictive of child donations, this would not imply that generosity is hereditary: it does not take a researcher to establish that wealth, and therefore the opportunity to donate, is hereditary. Note: Either explanation is valid. It is not necessary to provide both.
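
As a sanity check of the RSS claim in Question 2, here is a minimal R sketch on simulated data (all variable names are hypothetical): adding even a pure-noise predictor never increases the training RSS.

    set.seed(1)
    n <- 100
    x1 <- rnorm(n)
    y <- 2 * x1 + rnorm(n)
    noise <- rnorm(n)                         # an irrelevant predictor
    rss1 <- sum(resid(lm(y ~ x1))^2)
    rss2 <- sum(resid(lm(y ~ x1 + noise))^2)
    rss2 <= rss1                              # TRUE: adding a regressor cannot raise RSS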

3. Suppose you run a simple linear regression of a response Y against a single predictor X. You find that the R^2 is 0.862. What do you expect would happen to the R^2 if we instead treated X as the response and Y as the predictor? Explain.

The R^2 value would still be 0.862. This is because in simple linear regression the R^2 is simply the square of the sample correlation coefficient between X and Y, which is symmetric in the two variables.

4. A drug company hires you to estimate the effect that one of their drugs has on a person's strength. You run a linear regression, but when you look at the plot of the data below, you realize that one of the basic assumptions of the linear regression model is being violated. Which assumption is it? Propose a solution.

[Figure: scatterplot of strength versus dosage of drug.]

The assumption being violated is that the error terms have constant variance. One way to address the heteroscedasticity is to transform the response with a concave function (such as the logarithm or square root), or, if possible, to use weighted least squares.
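
A quick R illustration of the symmetry in Question 3, on simulated data (the resulting numbers are illustrative, not the exam's 0.862):

    set.seed(1)
    x <- rnorm(50)
    y <- 1 + 2 * x + rnorm(50)
    summary(lm(y ~ x))$r.squared   # R^2 from regressing y on x
    summary(lm(x ~ y))$r.squared   # identical R^2 from regressing x on y
    cor(x, y)^2                    # both equal the squared sample correlation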

5. A geologist is having trouble classifying several different types of stone, so he brings you some data and asks for help. You decide to perform three different methods: logistic regression, QDA, and a linear SVM. Before you inform the geologist of your results, he tells you that he was able to gather even more data, and when you inspect the new observations you realize that they happen to be far from the decision boundaries for each of the methods you tried. Which of the three methods above would likely be most sensitive to the new observations?

QDA. Both logistic regression and linear SVMs have low sensitivity to observations far from the decision boundary.

6. Your colleague is studying a collection of 100 manuscripts, 40 of which are signed and authored by Alexander Hamilton, 30 of which are signed and authored by James Madison, and 20 of which are signed and authored by John Jay. The remaining manuscripts are of unknown authorship, but each was written by one of these three individuals. Your colleague has identified a collection of stylistic features that can be extracted from each document and that she feels should be indicative of authorship. She would like to use these features to identify the author of each of the unknown documents. Suggest two ways of carrying out this analysis, and describe one advantage that each has over the other.

One option is to use multiclass LDA; a benefit of this method over the one to follow is that it produces probability estimates. A second is to use one-vs.-all SVMs; a benefit of this approach is that it should work well even when the Gaussianity assumption of LDA is a poor approximation of reality.

7. Is each of the following statements TRUE or FALSE? Justify your answer.

(a) If instead of performing a linear regression of y_i on x_1, ..., x_20 you decide to run a principal components regression (PCR) using all 20 components, you will get the same predictions as if you had run the original linear regression.

(b) Unlike linear regression, going from a PCR with 5 components to a PCR with 6 components might decrease your RSS.

(a) True. PCR applies a linear transformation to the predictors, so with all 20 components the regression coefficients can be adjusted to reproduce exactly the fit you would have obtained with the original 20 regressors.

(b) False, for the same reason as in linear regression: adding a component, like adding a regressor, can only decrease (or leave unchanged) the training RSS, so the behavior is not "unlike" linear regression.

8. Explain how you could use the bootstrap to estimate the test MSE of an arbitrary regression procedure.

I would produce an out-of-bag (OOB) estimate. That is, I would repeatedly sample bootstrap datasets, train my procedure on each one, and, for each point in my original dataset not included in a given bootstrap sample, compute the squared prediction error of that bootstrap model on that out-of-bag point. Averaging over all of these squared prediction errors yields an OOB estimate of the test MSE.
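
A minimal R sketch of the out-of-bag idea in Question 8, using lm() as a stand-in for the "arbitrary regression procedure" on simulated data (the setup is hypothetical):

    set.seed(1)
    n <- 200
    dat <- data.frame(x = rnorm(n))
    dat$y <- dat$x^2 + rnorm(n)
    B <- 500
    sqerr <- c()
    for (b in 1:B) {
      idx  <- sample(n, replace = TRUE)            # bootstrap sample
      fit  <- lm(y ~ x, data = dat[idx, ])         # train on the bootstrap set
      oob  <- setdiff(1:n, idx)                    # points left out of this sample
      pred <- predict(fit, newdata = dat[oob, ])
      sqerr <- c(sqerr, (dat$y[oob] - pred)^2)     # squared errors on OOB points
    }
    mean(sqerr)                                    # OOB estimate of the test MSE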

9. Lasso and ridge regression involve minimizing similar objective functions, but the two methods can yield different results. What would happen to the lasso and ridge if you applied them to a linear regression with relevant but highly correlated variables?

The lasso will tend to pick one of the correlated variables and drop the others, since it uses an ℓ1 penalty. Ridge, however, will keep all the correlated variables in the model due to its ℓ2 penalty, shrinking them together toward zero as λ grows.

10. Assume that you have p predictors available in your dataset.

(a) What is a (non-computational) motivation for considering m = √p predictors over m = p predictors at each split in a random forest?

By not using all predictors at each split, we produce a more diverse set of trees with less correlated predictions; averaging less correlated predictions leads to greater variance reduction.

(b) What is an advantage of considering m = √p predictors over m = 1 predictor at each split in a random forest?

If there are many irrelevant predictors in the dataset and few relevant ones, using m = 1 can lead to larger, lower-quality decision trees that do not generalize well, since the tree is then forced to split on a randomly selected (and likely irrelevant) feature at each decision node.
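
The contrast in Question 9 can be seen directly with the glmnet package (assuming it is installed); a sketch on two nearly identical simulated predictors, with an illustrative penalty value:

    library(glmnet)
    set.seed(1)
    n  <- 100
    z  <- rnorm(n)
    x1 <- z + rnorm(n, sd = 0.01)            # x1 and x2 are highly correlated
    x2 <- z + rnorm(n, sd = 0.01)
    y  <- z + rnorm(n)
    X  <- cbind(x1, x2)
    coef(glmnet(X, y, alpha = 1), s = 0.1)   # lasso: typically keeps one, zeroes the other
    coef(glmnet(X, y, alpha = 0), s = 0.1)   # ridge: keeps both, shrunk together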

11. Suppose you run a linear regression with 22 predictors, but you expect that many of the predictors are highly correlated.

(a) Why is this a problem?

(b) Suggest a method that will appropriately fix this problem.

(a) Collinearity reduces the accuracy of the estimates of the regression coefficients.

(b) PCR is a suitable way to perform the appropriate dimension reduction. Note: Other solutions to (b) are possible.

12. TRUE or FALSE: The first principal component of a dataset with two variables x and y can be obtained by running a linear regression of y on x, since both methods find the (one-dimensional) line that is closest to the data. Explain.

False. The two methods use different measures of closeness. PCA minimizes the squared Euclidean distances from the points (x_i, y_i) to the line, i.e., the perpendicular distances, whereas linear regression minimizes the sum of (y_i − ŷ_i)^2 for i = 1, ..., n, i.e., the vertical distances from the regression line.
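
A sketch of the PCR fix in Question 11(b) using the pls package (assuming it is installed); the simulated design mimics 22 highly correlated predictors, and the number of components would in practice be chosen by cross-validation:

    library(pls)
    set.seed(1)
    n <- 100; p <- 22
    z <- rnorm(n)
    X <- sapply(1:p, function(j) z + rnorm(n, sd = 0.2))  # correlated predictors
    y <- z + rnorm(n)
    dat <- data.frame(y = y, X = I(X))
    fit <- pcr(y ~ X, data = dat, scale = TRUE, validation = "CV")
    validationplot(fit, val.type = "MSEP")                # pick components by CV error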

13. Suppose we fit a linear spline, but we have the constraint that at the knots the fitted curve must be both continuous and have a continuous first derivative. What simpler method does this become?

For a piecewise-linear fit, a continuous first derivative forces the slope to be the same on every segment, so the fitted curve is simply a straight line: this becomes ordinary simple linear regression.

14. For the data plotted below, find two functions of x (let's call them f(x) and g(x)) such that y is well approximated as a linear function of f(x) and g(x). That is, find f(x) and g(x) such that y can be reasonably modeled as y = β0 + β1 f(x) + β2 g(x) + ɛ for ɛ small Gaussian noise. Explain your answer.

[Figure: scatterplot of y versus x, with both variables ranging over [−1, 1].]

It appears that y is proportional to x − 1 when x > 0 and proportional to x + 1 when x < 0. Hence, we can propose the linear model y = β0 + β1 I(x > 0)(x − 1) + β2 I(x < 0)(x + 1) + ɛ.
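
The proposed basis in Question 14 can be fit with lm(); a sketch on simulated data generated to mimic the plotted pattern (the true curve here is an assumption):

    set.seed(1)
    x <- runif(200, -1, 1)
    y <- ifelse(x > 0, x - 1, x + 1) + rnorm(200, sd = 0.05)
    f <- (x > 0) * (x - 1)    # f(x) = I(x > 0)(x - 1)
    g <- (x < 0) * (x + 1)    # g(x) = I(x < 0)(x + 1)
    coef(lm(y ~ f + g))       # β1 and β2 should both come out close to 1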

15. You decide to slightly alter the tree-building algorithm to allow not only for two-way ("binary") splits, but also three-way splits. Your classmate says, "This is not useful! It increases the computational costs but yields the same decision tree as before. Each three-way split of the predictor space into A, B, and C can be obtained by two consecutive two-way splits of the predictor space: first into A ∪ B and C, and then subsequently splitting A ∪ B into A and B." Comment.

The computational costs indeed go up quite a bit, but it is not true that the two tree-building algorithms will lead to the same tree. This is because we construct trees in a greedy way. After making the first split into A ∪ B and C, there is no guarantee that the algorithm will next choose to split A ∪ B into A and B; it may split A ∪ B some other way, or it may choose to split C instead.

16. You build a classifier using a combination of automatic variable screening and 5-nearest neighbors on the selected variables. You want to report its classification performance, so you write a script to run 10-fold cross-validation, and you get an error estimate of 13%. When you write up your paper, you run your script again and find to your horror that the CV error is now 22%. Why has this happened, and what should you do?

There is variance in the random assignment of observations to folds (called Monte Carlo variance). The best thing to do would be to run your CV procedure a number of times, say 100, and report the average error together with a standard error for the average.
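
A sketch of the remedy in Question 16. Here cv_error() is a hypothetical wrapper that reruns your screening + 5-NN pipeline under one random 10-fold split, and mydata is a placeholder for your dataset:

    R <- 100
    errs <- replicate(R, cv_error(mydata, folds = 10))  # cv_error() is hypothetical
    mean(errs)            # report this average error ...
    sd(errs) / sqrt(R)    # ... together with its standard error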

17. Suppose you are given the following data.

[Figure: scatterplot of two classes, A and B, in the (X, Y) plane; the boundary between the classes is clearly nonlinear.]

We want to create a classification algorithm based on the data above. Suggest a classifier that would work well on this problem.

A radial SVM or K-nearest neighbors would work. Logistic regression and LDA do not work here, as they produce linear decision boundaries. Below is the decision boundary for a radial SVM.

[Figure: SVM classification plot showing the nonlinear decision boundary of the radial SVM separating classes A and B.]

18. Suppose you would like to apply a radial SVM to a classification task. You split your data into training and validation sets, select values of γ and C via cross-validation, and find an estimated test error of 0.21 using the validation set. Then you remember that rescaling your variables is usually a good idea when running an SVM, but after rescaling your variables you run the same model and realize that, contrary to your expectations, the estimated test error has gone up to 0.38. What happened?

After rescaling the variables, it is likely that the optimal values of γ and C have changed. You must use CV again to pick proper values of γ and C, and only then estimate the test error.
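
For Question 18, the retuning step might look like the following sketch with the e1071 package (assuming it is installed; train_scaled is a placeholder for your data frame after rescaling the predictors, and the grids are illustrative):

    library(e1071)
    tuned <- tune.svm(y ~ ., data = train_scaled, kernel = "radial",
                      gamma = 10^(-3:1), cost = 10^(-1:3))
    tuned$best.parameters    # the optimal gamma and cost typically change after rescaling
    tuned$best.performance   # re-estimate the error only after retuning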

19. A collector is intent on predicting the sale prices of paintings based on various qualitative characteristics (including the identity of the artist, the style of the painting, and the country of origin). The collector trains a boosted decision stump model on a dataset of past painting sales and characteristics and is surprised to find that no matter how many trees he adds to the model, the training MSE is never smaller than the training MSE of a least squares linear regression model fit with dummy variables encoding the qualitative predictor values. Why should he not be surprised?

Since every variable is qualitative, the boosted decision stump model is building a prediction rule that is linear in the dummy indicators corresponding to each value that each predictor can take on. Since the linear regression model has the minimum training-set MSE over this model class, its MSE can never be greater than that produced by the stump model.

20. Explain how an unsupervised method could be useful even when you are trying to make predictions in a supervised environment.

There are several possible answers. One is to use PCR, i.e., to perform principal components dimension reduction before applying linear regression.

21. TRUE or FALSE: The sequence of clusterings produced by running hierarchical clustering with centroid linkage is equivalent to the sequence of clusterings obtained by running K-means clustering for K = n, n − 1, ..., 2, 1. Justify your answer.

FALSE. The clusterings produced by hierarchical clustering are nested (each clustering is formed by merging two clusters from the previous clustering), but those produced by K-means need not be.

22. You are considering a binary classification problem in which the decision boundary separating your classes is a cubic polynomial in your p = 10,000 input predictors. However, it is computationally prohibitive for you to explicitly construct the roughly 167 billion cubic interaction terms x_ij x_ik x_il associated with each datapoint. Suggest a way to find a classification rule that separates your classes without explicitly forming the cubic interaction terms.

I would fit an SVM with a cubic polynomial kernel k(x, y) = (1 + ⟨x, y⟩)^3, which operates only on inner products of datapoints and never forms the interaction terms explicitly.
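
A sketch of the answer to Question 22 with e1071 (X and labels are placeholders); setting gamma = 1 and coef0 = 1 makes e1071's polynomial kernel, (gamma·⟨x, y⟩ + coef0)^degree, equal to the cubic kernel above:

    library(e1071)
    fit <- svm(x = X, y = as.factor(labels), kernel = "polynomial",
               degree = 3, gamma = 1, coef0 = 1)   # k(x, y) = (1 + <x, y>)^3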

23. Suppose in a regression setting you use a boosted decision tree with d = 1, also known as a boosted decision stump, so that the output of the model is additive in its features. Is this equivalent to linear regression? Explain.

No. A linear model is additive and linear in the input features, while boosted decision stumps are additive but nonlinear in the input features.

24. Principal components analysis is sometimes used as a form of dimension reduction in order to improve the results of linear regression. Suggest a way to use clustering to achieve a similar result.

We could first standardize each of the vectors X_1, ..., X_p by dividing by their norms. Then, if p is large, we can cluster the p predictors in R^n using K-means clustering and use the K cluster means as the features instead. Note that we have to specify K, but in PCR we had to specify the number of (largest) principal components anyway.
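
A sketch of the clustering idea in Question 24 (X, y, and K are placeholders): normalize the columns, cluster them as p points in R^n, and regress on the K cluster-mean features:

    Xn <- apply(X, 2, function(v) v / sqrt(sum(v^2)))  # divide each predictor by its norm
    km <- kmeans(t(Xn), centers = K)                   # cluster the p columns in R^n
    Z  <- t(km$centers)                                # n x K matrix of cluster-mean features
    fit <- lm(y ~ Z)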

25. After learning all the methods in STATS216v, a student writes an R program that takes a dataset and runs every single method covered in the course, each with 100 possible values for the tuning parameter of the method being considered. He uses a validation set to pick the best among all the methods and tuning parameters possible. However, he is surprised to learn that the method selected by the program performs worse on the actual test data than many other methods he tried. What went wrong?

Overfitting. By trying every single method with so many possible parameters, it is likely that the method picked by the program performed best on the validation set simply by overfitting to the validation set. The method might very well perform poorly on the actual test set.