STATS216v Introduction to Statistical Learning, Stanford University, Summer 2016. Practice Final (Solutions). Duration: 3 hours.


Instructions: (This is a practice final and will not be graded.) Remember the university honor code. Write your name and SUNet ID (ThisIsYourSUNetID@stanford.edu) on each page. There are 25 questions in total. All questions are of equal value and are meant to elicit fairly short answers: each question can be answered using 1-5 sentences. You may not access the internet during the exam. You are allowed to use a calculator, though calculations in the exam, if any, do not have to be carried through to obtain full credit. You may refer to your course textbook and notes, and you may use your laptop provided that internet access is disabled. Please write neatly.

1. An economics firm is trying to classify whether the GDP of the United States will increase or decrease based on the stock market index, the unemployment rate, and the consumer price index. The firm uses K-nearest neighbors to run the classification. Can the firm determine the impact that the unemployment rate has on the response? Explain.

No. The role of individual predictors cannot be determined from K-nearest neighbors; the firm would have to use a different approach.

2. There has been recent debate in biology on whether generosity is hereditary. To investigate the question, a researcher runs a linear regression using the amount of money donated by a given person as the response, and a certain collection of predictors. Later he reruns the regression, but now includes the amount of money donated by the person's parents as a predictor. He finds that with the additional predictor the RSS of the model goes down, and therefore claims there is evidence to conclude that generosity is hereditary. Is his reasoning sound? Explain.

It is not. The training RSS can never increase when we include another regressor in our linear regression model, and almost always decreases. Furthermore, even if parent donations are predictive of child donations, this would not imply that generosity is hereditary: it does not take a researcher to establish that wealth, and therefore the opportunity to donate, is hereditary. Note: Either explanation is valid. It is not necessary to provide both.
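
As a sanity check of the RSS claim in Question 2, here is a minimal R sketch on simulated data (all variable names are hypothetical): adding even a pure-noise predictor never increases the training RSS.

    set.seed(1)
    n <- 100
    x1 <- rnorm(n)
    y <- 2 * x1 + rnorm(n)
    noise <- rnorm(n)                         # an irrelevant predictor
    rss1 <- sum(resid(lm(y ~ x1))^2)
    rss2 <- sum(resid(lm(y ~ x1 + noise))^2)
    rss2 <= rss1                              # TRUE: adding a regressor cannot raise RSS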

3. Suppose you run a simple linear regression of a response Y against a single predictor X. You find that the R^2 is 0.862. What do you expect would happen to the R^2 if we instead treated X as the response and Y as the predictor? Explain.

The R^2 value would still be 0.862. This is because in simple linear regression the R^2 is simply the square of the sample correlation coefficient between X and Y, which is symmetric in the two variables.

4. A drug company hires you to estimate the effect that one of their drugs has on a person's strength. You run a linear regression, but when you look at the plot of the data below, you realize that one of the basic assumptions of the linear regression model is being violated. Which assumption is it? Propose a solution.

[Figure: scatterplot of strength versus dosage of drug.]

The assumption being violated is that the error terms have constant variance. One way to address the heteroscedasticity is to transform the response with a concave function (such as the logarithm or square root), or, if possible, to use weighted least squares.
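
A quick R illustration of the symmetry in Question 3, on simulated data (the resulting numbers are illustrative, not the exam's 0.862):

    set.seed(1)
    x <- rnorm(50)
    y <- 1 + 2 * x + rnorm(50)
    summary(lm(y ~ x))$r.squared   # R^2 from regressing y on x
    summary(lm(x ~ y))$r.squared   # identical R^2 from regressing x on y
    cor(x, y)^2                    # both equal the squared sample correlation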

5. A geologist is having trouble classifying several different types of stone, so he brings you some data and asks for help. You decide to perform three different methods: logistic regression, QDA, and a linear SVM. Before you inform the geologist of your results, he tells you that he was able to gather even more data, and when you inspect the new observations you realize that they happen to be far from the decision boundaries for each of the methods you tried. Which of the three methods above would likely be most sensitive to the new observations?

QDA. Both logistic regression and linear SVMs have low sensitivity to observations far from the decision boundary.

6. Your colleague is studying a collection of 100 manuscripts, 40 of which are signed and authored by Alexander Hamilton, 30 of which are signed and authored by James Madison, and 20 of which are signed and authored by John Jay. The remaining manuscripts are of unknown authorship, but each was written by one of these three individuals. Your colleague has identified a collection of stylistic features that can be extracted from each document and that she feels should be indicative of authorship. She would like to use these features to identify the author of each of the unknown documents. Suggest two ways of carrying out this analysis, and describe one advantage that each has over the other.

One option is to use multiclass LDA; a benefit of this method over the one to follow is that it produces probability estimates. A second is to use one-vs.-all SVMs; a benefit of this approach is that it should work well even when the Gaussianity assumption of LDA is a poor approximation of reality.

7. Is each of the following statements TRUE or FALSE? Justify your answer.

(a) If instead of performing a linear regression of y_i on x_1, ..., x_20 you decide to run a principal components regression (PCR) using all 20 components, you will get the same predictions as if you had run the original linear regression.

(b) Unlike linear regression, going from a PCR with 5 components to a PCR with 6 components might decrease your RSS.

(a) True. PCR applies a linear transformation to the predictors, so with all 20 components the regression coefficients can be adjusted to reproduce exactly the fit you would have obtained with the original 20 regressors.

(b) False, for the same reason as in linear regression: adding a component, like adding a regressor, can only decrease (or leave unchanged) the training RSS, so the behavior is not "unlike" linear regression.

8. Explain how you could use the bootstrap to estimate the test MSE of an arbitrary regression procedure.

I would produce an out-of-bag (OOB) estimate. That is, I would repeatedly sample bootstrap datasets, train my procedure on each one, and, for each point in my original dataset not included in a given bootstrap sample, compute the squared prediction error of that bootstrap model on that out-of-bag point. Averaging over all of these squared prediction errors yields an OOB estimate of the test MSE.
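
A minimal R sketch of the out-of-bag idea in Question 8, using lm() as a stand-in for the "arbitrary regression procedure" on simulated data (the setup is hypothetical):

    set.seed(1)
    n <- 200
    dat <- data.frame(x = rnorm(n))
    dat$y <- dat$x^2 + rnorm(n)
    B <- 500
    sqerr <- c()
    for (b in 1:B) {
      idx  <- sample(n, replace = TRUE)            # bootstrap sample
      fit  <- lm(y ~ x, data = dat[idx, ])         # train on the bootstrap set
      oob  <- setdiff(1:n, idx)                    # points left out of this sample
      pred <- predict(fit, newdata = dat[oob, ])
      sqerr <- c(sqerr, (dat$y[oob] - pred)^2)     # squared errors on OOB points
    }
    mean(sqerr)                                    # OOB estimate of the test MSE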

9. Lasso and ridge regression involve minimizing similar objective functions, but the two methods can yield different results. What would happen to the lasso and ridge if you applied them to a linear regression with relevant but highly correlated variables?

The lasso will tend to pick one of the correlated variables and drop the others, since it uses an ℓ1 penalty. Ridge, however, will keep all the correlated variables in the model due to its ℓ2 penalty, shrinking them together toward zero as λ grows.

10. Assume that you have p predictors available in your dataset.

(a) What is a (non-computational) motivation for considering m = √p predictors over m = p predictors at each split in a random forest?

By not using all predictors at each split, we produce a more diverse set of trees with less correlated predictions; averaging less correlated predictions leads to greater variance reduction.

(b) What is an advantage of considering m = √p predictors over m = 1 predictor at each split in a random forest?

If there are many irrelevant predictors in the dataset and few relevant ones, using m = 1 can lead to larger, lower-quality decision trees that do not generalize well, since the tree is then forced to split on a randomly selected (and likely irrelevant) feature at each decision node.
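
The contrast in Question 9 can be seen directly with the glmnet package (assuming it is installed); a sketch on two nearly identical simulated predictors, with an illustrative penalty value:

    library(glmnet)
    set.seed(1)
    n  <- 100
    z  <- rnorm(n)
    x1 <- z + rnorm(n, sd = 0.01)            # x1 and x2 are highly correlated
    x2 <- z + rnorm(n, sd = 0.01)
    y  <- z + rnorm(n)
    X  <- cbind(x1, x2)
    coef(glmnet(X, y, alpha = 1), s = 0.1)   # lasso: typically keeps one, zeroes the other
    coef(glmnet(X, y, alpha = 0), s = 0.1)   # ridge: keeps both, shrunk together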

11. Suppose you run a linear regression with 22 predictors, but you expect that many of the predictors are highly correlated.

(a) Why is this a problem?

(b) Suggest a method that will appropriately fix this problem.

(a) Collinearity reduces the accuracy of the estimates of the regression coefficients.

(b) PCR is a suitable way to perform the appropriate dimension reduction. Note: Other solutions to (b) are possible.

12. TRUE or FALSE: The first principal component of a dataset with two variables x and y can be obtained by running a linear regression of y on x, since both methods find the (one-dimensional) line that is closest to the data. Explain.

False. The two methods use different measures of closeness. PCA minimizes the squared Euclidean distances from the points (x_i, y_i) to the line, i.e., the perpendicular distances, whereas linear regression minimizes the sum of (y_i − ŷ_i)^2 for i = 1, ..., n, i.e., the vertical distances from the regression line.
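
A sketch of the PCR fix in Question 11(b) using the pls package (assuming it is installed); the simulated design mimics 22 highly correlated predictors, and the number of components would in practice be chosen by cross-validation:

    library(pls)
    set.seed(1)
    n <- 100; p <- 22
    z <- rnorm(n)
    X <- sapply(1:p, function(j) z + rnorm(n, sd = 0.2))  # correlated predictors
    y <- z + rnorm(n)
    dat <- data.frame(y = y, X = I(X))
    fit <- pcr(y ~ X, data = dat, scale = TRUE, validation = "CV")
    validationplot(fit, val.type = "MSEP")                # pick components by CV error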

13. Suppose we fit a linear spline, but we have the constraint that at the knots the fitted curve must be both continuous and have a continuous first derivative. What simpler method does this become?

For a piecewise-linear fit, a continuous first derivative forces the slope to be the same on every segment, so the fitted curve is simply a straight line: this becomes ordinary simple linear regression.

14. For the data plotted below, find two functions of x (let's call them f(x) and g(x)) such that y is well approximated as a linear function of f(x) and g(x). That is, find f(x) and g(x) such that y can be reasonably modeled as y = β0 + β1 f(x) + β2 g(x) + ɛ for ɛ small Gaussian noise. Explain your answer.

[Figure: scatterplot of y versus x, with both variables ranging over [−1, 1].]

It appears that y is proportional to x − 1 when x > 0 and proportional to x + 1 when x < 0. Hence, we can propose the linear model y = β0 + β1 I(x > 0)(x − 1) + β2 I(x < 0)(x + 1) + ɛ.
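
The proposed basis in Question 14 can be fit with lm(); a sketch on simulated data generated to mimic the plotted pattern (the true curve here is an assumption):

    set.seed(1)
    x <- runif(200, -1, 1)
    y <- ifelse(x > 0, x - 1, x + 1) + rnorm(200, sd = 0.05)
    f <- (x > 0) * (x - 1)    # f(x) = I(x > 0)(x - 1)
    g <- (x < 0) * (x + 1)    # g(x) = I(x < 0)(x + 1)
    coef(lm(y ~ f + g))       # β1 and β2 should both come out close to 1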

15. You decide to slightly alter the tree-building algorithm to allow not only for two-way ("binary") splits, but also three-way splits. Your classmate says, "This is not useful! It increases the computational costs but yields the same decision tree as before. Each three-way split of the predictor space into A, B, and C can be obtained by two consecutive two-way splits of the predictor space: first into A ∪ B and C, and then subsequently splitting A ∪ B into A and B." Comment.

The computational costs indeed go up quite a bit, but it is not true that the two tree-building algorithms will lead to the same tree. This is because we construct trees in a greedy way. After making the first split into A ∪ B and C, there is no guarantee that the algorithm will next choose to split A ∪ B into A and B; it may split A ∪ B some other way, or it may choose to split C instead.

16. You build a classifier using a combination of automatic variable screening and 5-nearest neighbors on the selected variables. You want to report its classification performance, so you write a script to run 10-fold cross-validation, and you get an error estimate of 13%. When you write up your paper, you run your script again and find to your horror that the CV error is now 22%. Why has this happened, and what should you do?

There is variance in the random assignment of observations to folds (called Monte Carlo variance). The best thing to do would be to run your CV procedure a number of times, say 100, and report the average error together with a standard error for the average.
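
A sketch of the remedy in Question 16. Here cv_error() is a hypothetical wrapper that reruns your screening + 5-NN pipeline under one random 10-fold split, and mydata is a placeholder for your dataset:

    R <- 100
    errs <- replicate(R, cv_error(mydata, folds = 10))  # cv_error() is hypothetical
    mean(errs)            # report this average error ...
    sd(errs) / sqrt(R)    # ... together with its standard error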

17. Suppose you are given the following data.

[Figure: scatterplot of two classes, A and B, in the (X, Y) plane; the boundary between the classes is clearly nonlinear.]

We want to create a classification algorithm based on the data above. Suggest a classifier that would work well on this problem.

A radial SVM or K-nearest neighbors would work. Logistic regression and LDA do not work here, as they produce linear decision boundaries. Below is the decision boundary for a radial SVM.

[Figure: SVM classification plot showing the nonlinear decision boundary of the radial SVM separating classes A and B.]

18. Suppose you would like to apply a radial SVM to a classification task. You split your data into training and validation sets, select values of γ and C via cross-validation, and find an estimated test error of 0.21 using the validation set. Then you remember that rescaling your variables is usually a good idea when running an SVM, but after rescaling your variables you run the same model and realize that, contrary to your expectations, the estimated test error has gone up to 0.38. What happened?

After rescaling the variables, it is likely that the optimal values of γ and C have changed. You must use CV again to pick proper values of γ and C, and only then estimate the test error.
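
For Question 18, the retuning step might look like the following sketch with the e1071 package (assuming it is installed; train_scaled is a placeholder for your data frame after rescaling the predictors, and the grids are illustrative):

    library(e1071)
    tuned <- tune.svm(y ~ ., data = train_scaled, kernel = "radial",
                      gamma = 10^(-3:1), cost = 10^(-1:3))
    tuned$best.parameters    # the optimal gamma and cost typically change after rescaling
    tuned$best.performance   # re-estimate the error only after retuning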

19. A collector is intent on predicting the sale prices of paintings based on various qualitative characteristics (including the identity of the artist, the style of the painting, and the country of origin). The collector trains a boosted decision stump model on a dataset of past painting sales and characteristics and is surprised to find that no matter how many trees he adds to the model, the training MSE is never smaller than the training MSE of a least squares linear regression model fit with dummy variables encoding the qualitative predictor values. Why should he not be surprised?

Since every variable is qualitative, the boosted decision stump model is building a prediction rule that is linear in the dummy indicators corresponding to each value that each predictor can take on. Since the linear regression model has the minimum training-set MSE over this model class, its MSE can never be greater than that produced by the stump model.

20. Explain how an unsupervised method could be useful even when you are trying to make predictions in a supervised environment.

There are several possible answers. One is to use PCR, i.e., to perform principal components dimension reduction before applying linear regression.

21. TRUE or FALSE: The sequence of clusterings produced by running hierarchical clustering with centroid linkage is equivalent to the sequence of clusterings obtained by running K-means clustering for K = n, n − 1, ..., 2, 1. Justify your answer.

FALSE. The clusterings produced by hierarchical clustering are nested (each clustering is formed by merging two clusters from the previous clustering), but those produced by K-means need not be.

22. You are considering a binary classification problem in which the decision boundary separating your classes is a cubic polynomial in your p = 10,000 input predictors. However, it is computationally prohibitive for you to explicitly construct the roughly 167 billion cubic interaction terms x_ij x_ik x_il associated with each datapoint. Suggest a way to find a classification rule that separates your classes without explicitly forming the cubic interaction terms.

I would fit an SVM with a cubic polynomial kernel k(x, y) = (1 + ⟨x, y⟩)^3, which operates only on inner products of datapoints and never forms the interaction terms explicitly.
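
A sketch of the answer to Question 22 with e1071 (X and labels are placeholders); setting gamma = 1 and coef0 = 1 makes e1071's polynomial kernel, (gamma·⟨x, y⟩ + coef0)^degree, equal to the cubic kernel above:

    library(e1071)
    fit <- svm(x = X, y = as.factor(labels), kernel = "polynomial",
               degree = 3, gamma = 1, coef0 = 1)   # k(x, y) = (1 + <x, y>)^3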

23. Suppose in a regression setting you use a boosted decision tree with d = 1, also known as a boosted decision stump, so that the output of the model is additive in its features. Is this equivalent to linear regression? Explain.

No. A linear model is additive and linear in the input features, while boosted decision stumps are additive but nonlinear in the input features.

24. Principal components analysis is sometimes used as a form of dimension reduction in order to improve the results of linear regression. Suggest a way to use clustering to achieve a similar result.

We could first standardize each of the vectors X_1, ..., X_p by dividing by their norms. Then, if p is large, we can cluster the p predictors in R^n using K-means clustering and use the K cluster means as the features instead. Note that we have to specify K, but in PCR we had to specify the number of (largest) principal components anyway.
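
A sketch of the clustering idea in Question 24 (X, y, and K are placeholders): normalize the columns, cluster them as p points in R^n, and regress on the K cluster-mean features:

    Xn <- apply(X, 2, function(v) v / sqrt(sum(v^2)))  # divide each predictor by its norm
    km <- kmeans(t(Xn), centers = K)                   # cluster the p columns in R^n
    Z  <- t(km$centers)                                # n x K matrix of cluster-mean features
    fit <- lm(y ~ Z)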

25. After learning all the methods in STATS216v, a student writes an R program that takes a dataset and runs every single method covered in the course, each with 100 possible values for the tuning parameter of the method being considered. He uses a validation set to pick the best among all the methods and tuning parameters possible. However, he is surprised to learn that the method selected by the program performs worse on the actual test data than many other methods he tried. What went wrong?

Overfitting. By trying every single method with so many possible parameters, it is likely that the method picked by the program performed best on the validation set simply by overfitting to the validation set. The method might very well perform poorly on the actual test set.