
STAT 391 - Spring Quarter 2017 - Midterm 1 - April 27, 2017

Name: ____________________        Student ID Number: ____________________

Problem    #1    #2    #3    #4    #5    #6    Total
Points     /6    /8    /14   /10   /8    /10   /56

Directions. Read the directions carefully and show all your work. Define any notation you introduce. You do not need to simplify or evaluate unless indicated. Partial credit will be assigned based upon the correctness, completeness, and clarity of your answers. Correct answers without proper justification will not receive full credit. The exam is closed book and closed notes. Calculators and other electronic devices are not allowed.

Problem 1. [6 points] Suppose that we use the K-nearest neighbors (KNN) method for a classification problem with different values of K.

(a) [3 points] Provide a sketch of the typical training error rate, test error rate, and Bayes error rate on a single plot. The x-axis should represent 1/K, and the y-axis should represent the error rates. There should be three curves; make sure to label each one.

Answer. [Figure 1: training error rate and test error rate plotted against 1/K on a logarithmic scale (from 0.01 to 1.00), with error rates on the y-axis (from 0.00 to 0.20); the black dashed line represents the Bayes error rate. The training error decreases toward 0 as 1/K grows, while the test error traces a U shape that stays above the Bayes error rate.]

(b) [2 points] As K decreases, does the level of flexibility increase or decrease? Justify your answer.

Answer. As K decreases, the method becomes more flexible. For very low values of K, the method may find patterns in the data that do not correspond to the Bayes decision boundary, and thus overfits. In the extreme case K = 1, the training error is 0, but the test error rate may be quite high.

(c) [1 point] Draw a vertical line on the previous plot and indicate the part of the graph where overfitting occurs.

Answer. Overfitting occurs where the training error is low but the test error is large. In the graph above, this corresponds to values of K below roughly 10, i.e., values of 1/K above roughly 0.1.
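The curves from part (a) can be reproduced empirically. The sketch below is not part of the exam: it assumes scikit-learn is available and uses synthetic two-class data with 10% label noise (so the Bayes error rate is nonzero), printing the training and test error rates as 1/K grows.

```python
# A minimal sketch, assuming scikit-learn: training vs. test error for KNN
# as a function of K, on synthetic two-class data with 10% label noise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for K in [100, 50, 20, 10, 5, 2, 1]:        # increasing 1/K, i.e. flexibility
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_tr, y_tr)
    train_err = 1 - knn.score(X_tr, y_tr)   # training error rate
    test_err = 1 - knn.score(X_te, y_te)    # test error rate
    print(f"K={K:3d}  1/K={1/K:.3f}  train={train_err:.3f}  test={test_err:.3f}")
```

At K = 1 the training error is exactly 0 (each point is its own nearest neighbor), while the test error remains well above the Bayes error rate, matching the overfitting regime identified in part (c).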

Problem 2. [8 points] I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = β_0 + β_1 X + β_2 X^2 + β_3 X^3 + ε.

(a) [2 points] Suppose that the true relationship between X and Y is linear, i.e. Y = β_0 + β_1 X + ε. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

Answer. The cubic regression model is more flexible than the linear regression model. Therefore, we would expect the cubic model to fit the training data better, and thus to have the lower training RSS.

(b) [2 points] Answer (a) using test rather than training RSS.

Answer. If the true relationship between X and Y is linear, a cubic regression model is excessively flexible, and we would expect it to fit test data poorly. Therefore, we would expect the cubic model to have the higher test RSS.

(c) [2 points] Suppose that the true relationship between X and Y is not linear, but we do not know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

Answer. Same answer as (a): the cubic model is more flexible, so it achieves the lower training RSS regardless of the true relationship.

(d) [2 points] Answer (c) using test rather than training RSS.

Answer. In this case, we do not know the right amount of flexibility needed to fit the true underlying model. So there is not enough information to tell which model would give the lower test RSS.
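A small simulation, hypothetical and not part of the exam, illustrating parts (a) and (b): when the true relationship is linear, the cubic fit achieves lower training RSS, but its test RSS is typically higher. Only numpy is assumed; polyfit/polyval perform the least-squares fits.

```python
# A minimal sketch, assuming numpy: linear vs. cubic least-squares fits
# when the true mean function is linear, as in parts (a) and (b).
import numpy as np

rng = np.random.default_rng(0)
x_tr = rng.uniform(-2, 2, 100)          # n = 100 training observations
x_te = rng.uniform(-2, 2, 100)          # fresh test observations

def f(x):
    return 1.0 + 2.0 * x                # true linear relationship

y_tr = f(x_tr) + rng.normal(0, 1, 100)
y_te = f(x_te) + rng.normal(0, 1, 100)

for deg, name in [(1, "linear"), (3, "cubic")]:
    beta = np.polyfit(x_tr, y_tr, deg)  # least-squares polynomial fit
    rss_tr = np.sum((y_tr - np.polyval(beta, x_tr)) ** 2)
    rss_te = np.sum((y_te - np.polyval(beta, x_te)) ** 2)
    print(f"{name}: training RSS = {rss_tr:.1f}, test RSS = {rss_te:.1f}")
```

The training RSS of the cubic fit is guaranteed to be at least as small as the linear fit's, since the linear model is nested in the cubic one; the test RSS comparison varies with the noise draw, which is exactly the intuition behind part (d).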

Problem 3. [14 points] Data for 51 U.S. states (50 states, plus the District of Columbia) were used to examine the relationship between the violent crime rate (violent crimes per 100,000 persons per year) and the predictor variables urbanization (percentage of the population living in urban areas) and poverty rate. A predictor variable indicating whether or not a state is classified as a Southern state (1 = Southern, 0 = not) was also included. Finally, we include two interaction terms, {Urban × South} and {Poverty × South}. Some output for the analysis of these data is shown below.

## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -321.90     148.20    -2.17    0.035
## Urban             4.689      1.654     2.83    0.007
## Poverty          39.34      13.52      2.91    0.006
## South          -649.30     266.96    -2.43    0.019
## Urban:South      12.05       2.871     4.20    0.000
## Poverty:South    -5.838     16.671    -0.35    0.728
##
## Residual standard error: 140.01 on 45 degrees of freedom
## F-statistic: 21.02 on 5 and 45 DF,  p-value: < 2e-16

(a) [2 points] Write the multiple linear regression model.

Answer. The multiple linear regression model reads as follows:

violent crime rate = β_0 + β_1 · Urban + β_2 · Poverty + β_3 · South + β_4 · (Urban × South) + β_5 · (Poverty × South) + ε.    (1)

(b) [2 points] Predict the violent crime rate for a Southern state with an urbanization of 55.4 and a poverty rate of 13.7.

Answer. Given the least-squares coefficient estimates, the predicted violent crime rate is

-321.90 + 4.689 × 55.4 + 39.34 × 13.7 - 649.30 × 1 + 12.05 × (55.4 × 1) - 5.838 × (13.7 × 1) ≈ 415.1.

(c) [2 points] Give an approximate 95% confidence interval for the coefficient related to the poverty rate.

Answer. An approximate 95% confidence interval for β_2 is β̂_2 ± 2 · SE(β̂_2). Therefore, the interval [39.34 - 2 × 13.52, 39.34 + 2 × 13.52] = [12.30, 66.38] contains β_2 with an approximate probability of 0.95.

(d) [2 points] Is there a relationship between the predictors and the response?

Answer. To answer this question, we perform the hypothesis test H_0: β_1 = β_2 = ... = β_5 = 0 against H_a: at least one β_j ≠ 0. The F-statistic is 21.02 with a p-value below 2e-16, which is very small. Therefore, we reject the null hypothesis and conclude that there is a relationship between the predictors and the response.

(e) [2 points] Which predictors appear to have a statistically significant relationship to the response?

Answer. To answer this question, we perform five individual hypothesis tests: H_0: β_j = 0 against H_a: β_j ≠ 0, for j = 1, ..., 5. We reject the null hypothesis for a given predictor if the corresponding p-value is small enough. If we choose a p-value cutoff of 5%, then all predictors appear to have a statistically significant relationship to the response, except for the interaction term {Poverty × South}.

(f) [2 points] What does the coefficient for the interaction term {Urban × South} suggest?

Answer. According to the least-squares fit, the predicted violent crime rate is

-321.90 + 4.689 · Urban + 39.34 · Poverty - 649.30 · South + 12.05 · (Urban × South) - 5.838 · (Poverty × South)
= -321.90 + (4.689 + 12.05 · South) · Urban + 39.34 · Poverty - 649.30 · South - 5.838 · (Poverty × South).

In other words, in Southern states, a 1-percentage-point increase in the population living in urban areas increases the predicted violent crime rate by 4.689 + 12.05 = 16.74 on average, versus 4.689 in non-Southern states. Hence the effect of urbanization is amplified in Southern states.

(g) [2 points] To what extent do you think the R²-statistic and the residual standard error would change if we removed the interaction term {Poverty × South} from the model?

Answer. According to (e), the interaction term {Poverty × South} is not significantly related to the response. Therefore, one would expect only a tiny decrease in the R²-statistic, and either a slight increase or a slight decrease in the residual standard error.
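As a quick numeric check of parts (b) and (c), a hypothetical sketch, not part of the exam, that plugs the coefficient estimates from the output above into the fitted model:

```python
# A minimal sketch checking parts (b) and (c) against the regression output.
b0, b_urban, b_pov, b_south = -321.90, 4.689, 39.34, -649.30
b_urb_sth, b_pov_sth = 12.05, -5.838

# (b) Predicted violent crime rate for a Southern state.
urban, poverty, south = 55.4, 13.7, 1
pred = (b0 + b_urban * urban + b_pov * poverty + b_south * south
        + b_urb_sth * urban * south + b_pov_sth * poverty * south)
print(f"predicted rate: {pred:.1f} per 100,000")   # about 415.1

# (c) Approximate 95% CI for the poverty coefficient: estimate +/- 2 SE.
est, se = 39.34, 13.52
print(f"95% CI for beta_2: [{est - 2*se:.2f}, {est + 2*se:.2f}]")  # [12.30, 66.38]
```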

Problem 4. [10 points] A scientific foundation wanted to evaluate the relation between Y = salary of a researcher (in thousands of dollars) and four predictors: X_1 = number of years of experience, X_2 = an index of publication quality, X_3 = sex (1 for male and 0 for female), and X_4 = an index of success in obtaining grant support. A sample of 35 randomly selected researchers was used to fit the multiple linear regression model. Parts of the computer output appear below.

## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.846931   2.001876   8.915   0.0001
## Years        1.103130   0.359573   3.068   0.0032
## Papers       0.321520   0.037109   8.664        ?
## Sex          1.593400   0.687724   2.317   0.0083
## Grants       1.288941   0.298479   4.318   0.0003
##
## Residual standard error: 1.75 on 30 degrees of freedom
## Multiple R-squared: 0.923
## F-statistic: ? on 4 and 30 DF,  p-value: ?

(a) [2 points] Explain how the t-statistic for the number of years of experience was computed.

Answer. The multiple linear regression model of interest is

Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_4 + ε.    (2)

The t-statistic for the number of years of experience is the coefficient estimate divided by its standard error: β̂_1 / SE(β̂_1) = 1.103130 / 0.359573 ≈ 3.068.

(b) [2 points] The 97.5% quantile of a t-distribution with 30 degrees of freedom is 2.042. Do you expect the p-value associated with the index of publication quality to be greater than or less than 0.05?

Answer. The p-value of interest is computed for the hypothesis test H_0: β_2 = 0 against H_a: β_2 ≠ 0. Under H_0, the t-statistic follows a t-distribution with 30 degrees of freedom. Since the computed t-value 8.664 is larger than 2.042, we reject the null hypothesis at the 5% significance level. Therefore, we expect the p-value to be less than 0.05.

(c) [2 points] How well does the regression model explain the variability in the salary of a researcher?

Answer. This is answered by reading the R²-statistic. Here, 92.3% of the variability in the salary of a researcher is explained by the multiple linear regression (2), which corresponds to a very good fit.

(d) [2 points] Recall that the formula for the F-statistic is

$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)},$$

where n is the number of observations and p is the number of predictors, and the R² statistic is given by R² = 1 − RSS/TSS. Using basic algebra, prove the following relationship between the F-statistic and the R² statistic:

$$F = \frac{R^2}{1 - R^2} \cdot \frac{n - p - 1}{p}.$$

Answer. It is sufficient to prove that (TSS − RSS)/RSS = R²/(1 − R²). Starting from the left-hand side,

$$\frac{TSS - RSS}{RSS} = \frac{TSS}{RSS} - 1 = \frac{1}{1 - R^2} - 1 = \frac{1 - (1 - R^2)}{1 - R^2} = \frac{R^2}{1 - R^2},$$

where we used TSS/RSS = 1/(1 − R²), which follows from R² = 1 − RSS/TSS. Multiplying both sides by (n − p − 1)/p gives the result.

(e) [2 points] From (d), do you expect the value of the F-statistic for our data to be very large or very small?

Answer. If R² is large, then 1 − R² is small, so R²/(1 − R²) is large, and by (d) the F-statistic is large as well. For our data, R² = 0.923, so F = (0.923/0.077) × (30/4) ≈ 89.9. Therefore we expect the value of the F-statistic to be very large.
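The computations in parts (a) and (e) can be verified directly from the printed summary; a minimal sketch (not part of the exam):

```python
# A minimal sketch verifying parts (a) and (e) from the printed summary.
n, p = 35, 4

# (a) t-statistic for Years: coefficient estimate divided by its SE.
t_years = 1.103130 / 0.359573
print(f"t-statistic for Years: {t_years:.3f}")   # about 3.068, as printed

# (e) F-statistic recovered from R^2 via the identity proved in (d).
r2 = 0.923
f_stat = (r2 / (1 - r2)) * ((n - p - 1) / p)
print(f"F-statistic: {f_stat:.1f}")              # about 89.9, very large
```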

Problem 5. [8 points] A data set consists of percentage returns for the S&P 500 stock index over 1,089 weeks spanning 21 years, from the beginning of 1990 to the end of 2010. For each date, we have recorded the year that the observation was recorded and the percentage returns for each of the five previous trading weeks, Lag1 through Lag5. We have also recorded Volume (the number of shares traded in the previous week, in billions) and Direction (whether the market was Up or Down in a given week). In this problem, a prediction is based on whether the predicted probability of a market increase is greater than or less than 0.5.

(a) [2 points] We perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. We obtain the following output using statistical software.

## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.2669     0.0859    3.11   0.0019
## Lag1         -0.0413     0.0264   -1.56   0.1181
## Lag2          0.0584     0.0269    2.18   0.0296
## Lag3         -0.0161     0.0267   -0.60   0.5469
## Lag4         -0.0278     0.0265   -1.05   0.2937
## Lag5         -0.0145     0.0264   -0.55   0.5833
## Volume       -0.0227     0.0369   -0.62   0.5377

Estimate the probability that the market goes up next week with Lag1 = 1.26%, Lag2 = -1.96%, Lag3 = 0.97%, Lag4 = 0.72%, Lag5 = 0.09%, and a volume of 1.46 billion shares traded in the previous week.

Answer. The logistic regression model reads

$$p(X) = \frac{e^{\beta_0 + \beta_1\,\mathrm{Lag1} + \beta_2\,\mathrm{Lag2} + \beta_3\,\mathrm{Lag3} + \beta_4\,\mathrm{Lag4} + \beta_5\,\mathrm{Lag5} + \beta_6\,\mathrm{Volume}}}{1 + e^{\beta_0 + \beta_1\,\mathrm{Lag1} + \beta_2\,\mathrm{Lag2} + \beta_3\,\mathrm{Lag3} + \beta_4\,\mathrm{Lag4} + \beta_5\,\mathrm{Lag5} + \beta_6\,\mathrm{Volume}}},    (3)$$

where p(X) is the probability that the market goes up given the five lag variables and the volume of traded shares. Using the coefficient estimates above, the linear predictor is

0.2669 − 0.0413 × 1.26 + 0.0584 × (−1.96) − 0.0161 × 0.97 − 0.0278 × 0.72 − 0.0145 × 0.09 − 0.0227 × 1.46 ≈ 0.0303,

so the predicted probability is p(X) = e^{0.0303} / (1 + e^{0.0303}) ≈ 0.508.

(b) [2 points] Do any of the predictors appear to be statistically significant? If so, which ones?

Answer. On the basis of the p-values, and for a cutoff of 5%, Lag2 appears to be the only statistically significant predictor.

(c) [2 points] Now we fit four classifiers using a training data period from 1990 to 2008, with Lag2 as the only predictor: a logistic regression model, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and K-nearest neighbors (KNN) with K = 1. We obtain the following confusion matrices on the held-out data (that is, the data from 2009 and 2010) for the four classifiers.

Table 1: Confusion matrix using logistic regression.
                       True Direction
                       Down    Up
Predicted   Down          9     5
Direction   Up           34    56

Table 2: Confusion matrix using LDA.
                       True Direction
                       Down    Up
Predicted   Down          9     5
Direction   Up           34    56

Compute the overall fraction of correct predictions on the held-out data for each of the four classifiers.

Answer. The overall fraction of correct predictions on the held-out data for the four classifiers is:

Table 3: Confusion matrix using QDA.
                       True Direction
                       Down    Up
Predicted   Down          0     0
Direction   Up           43    61

Table 4: Confusion matrix using KNN with K = 1.
                       True Direction
                       Down    Up
Predicted   Down         21    30
Direction   Up           22    31

(9 + 56)/(9 + 5 + 34 + 56) = 65/104 ≈ 0.625 for logistic regression,
(9 + 56)/104 = 65/104 ≈ 0.625 for LDA,
(0 + 61)/104 = 61/104 ≈ 0.587 for QDA,
(21 + 31)/104 = 52/104 = 1/2 for KNN with K = 1.

(d) [2 points] Which of these methods appears to provide the best results on this data?

Answer. On the basis of the test error rates, logistic regression and LDA are tied for the best method. However, one could challenge that statement by considering other values of K for the KNN method.
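A short check of the predicted probability in part (a), a hypothetical sketch rather than exam material:

```python
# A minimal sketch of part (a): plug the coefficient estimates into the
# logistic model (3) and compute the predicted probability of "Up".
import math

coef = {"Lag1": -0.0413, "Lag2": 0.0584, "Lag3": -0.0161,
        "Lag4": -0.0278, "Lag5": -0.0145, "Volume": -0.0227}
x = {"Lag1": 1.26, "Lag2": -1.96, "Lag3": 0.97,
     "Lag4": 0.72, "Lag5": 0.09, "Volume": 1.46}

eta = 0.2669 + sum(coef[k] * x[k] for k in coef)   # linear predictor
p_up = 1 / (1 + math.exp(-eta))                    # inverse logit
print(f"P(market up) = {p_up:.3f}")  # about 0.508, just above the 0.5 cutoff
```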

Problem 6. [10 points] Suppose that n observations x_1, x_2, ..., x_n are drawn from a Poisson distribution with unknown parameter λ. Recall that the probability mass function of a Poisson distribution with parameter λ is p(x; λ) = e^{−λ} λ^x / x!, where x is a nonnegative integer.

(a) [2 points] Compute L(λ), the likelihood function of x_1, ..., x_n. Then give ln L(λ), the log-likelihood of x_1, ..., x_n.

Answer. The likelihood function of x_1, ..., x_n is

$$L(\lambda) = \prod_{i=1}^n p(x_i; \lambda) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda}\, \lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!}.$$

Therefore, the log-likelihood of x_1, ..., x_n is

$$\ln L(\lambda) = -n\lambda + \left(\sum_{i=1}^n x_i\right) \ln \lambda - \sum_{i=1}^n \ln(x_i!).$$

(b) [2 points] Determine λ̂, the maximum likelihood estimator of λ. You should find λ̂ = (1/n) Σ_{i=1}^n x_i. You do not need to check the sign of the second derivative.

Answer. The derivative of ln L(λ) is

$$\frac{d}{d\lambda} \ln L(\lambda) = -n + \frac{\sum_{i=1}^n x_i}{\lambda}.$$

Setting the derivative to zero gives the estimate λ̂ = (1/n) Σ_{i=1}^n x_i, the sample mean.

(c) [1 point] Application: Researchers want to investigate whether reading may prevent Alzheimer's disease. To do so, they examined 2,000 individuals with Alzheimer's and 8,000 without Alzheimer's. Give π_1, the prior probability that a person has Alzheimer's.

Answer. The prior probability that a person has Alzheimer's is π_1 = 2,000/(2,000 + 8,000) = 0.2.

(d) [2 points] The researchers discovered that people without Alzheimer's read 0.85 books per month on average, while people with Alzheimer's read 0.33 books per month on average. Assuming that the number of books that an individual reads per year follows a Poisson distribution, predict the probability that a person has Alzheimer's if he or she reads 7 books per year. Hint: Use Bayes' theorem.

Answer. Let A be the event that a person has Alzheimer's, A^c the event that a person does not, and B the number of books that this person reads per year. Converting the monthly rates to yearly rates (0.33 × 12 with Alzheimer's, 0.85 × 12 without) and using Bayes' theorem, we obtain

$$P(A \mid B = 7) = \frac{P(A)\,P(B = 7 \mid A)}{P(A)\,P(B = 7 \mid A) + P(A^c)\,P(B = 7 \mid A^c)} = \frac{\pi_1\, p(7;\, 0.33 \times 12)}{\pi_1\, p(7;\, 0.33 \times 12) + (1 - \pi_1)\, p(7;\, 0.85 \times 12)}$$

$$= \frac{0.2\, e^{-0.33 \times 12} (0.33 \times 12)^7 / 7!}{0.2\, e^{-0.33 \times 12} (0.33 \times 12)^7 / 7! + 0.8\, e^{-0.85 \times 12} (0.85 \times 12)^7 / 7!} \approx 0.146.$$
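The exam does not require evaluating the expression in (d), but it is easy to do numerically; a minimal sketch:

```python
# A minimal sketch evaluating part (d) with Bayes' theorem and the
# Poisson pmf; yearly rates are 12x the monthly rates given in the problem.
import math

def pois_pmf(x, lam):
    """Poisson pmf p(x; lambda) = e^{-lambda} lambda^x / x!."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

pi1 = 2_000 / (2_000 + 8_000)   # prior P(Alzheimer's) = 0.2
lam_alz = 0.33 * 12             # yearly reading rate with Alzheimer's
lam_no = 0.85 * 12              # yearly reading rate without

num = pi1 * pois_pmf(7, lam_alz)
den = num + (1 - pi1) * pois_pmf(7, lam_no)
print(f"P(Alzheimer's | reads 7 books/year) = {num / den:.3f}")   # ~0.146
```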

(e) [2 points] We consider the general case of classification with only one predictor X. Suppose that we have K classes and that, if an observation belongs to the k-th class, then X comes from a Poisson distribution with parameter λ_k. Prove that in this case the Bayes classifier assigns an observation X = x to the class for which

$$\delta_k(x) = \ln \pi_k - \lambda_k + x \ln \lambda_k$$

is largest.

Answer. Recall that the Bayes classifier assigns an observation X = x to the class for which p_k(x) = P(Y = k | X = x) is largest. We have

$$p_k(x) = \frac{\pi_k\, p(x; \lambda_k)}{\sum_{l=1}^K \pi_l\, p(x; \lambda_l)} = \frac{\pi_k\, e^{-\lambda_k} \lambda_k^x / x!}{\sum_{l=1}^K \pi_l\, e^{-\lambda_l} \lambda_l^x / x!},$$

where π_k is the prior probability that an observation belongs to the k-th class. The denominator is the same across all classes, and the function ln is non-decreasing, so choosing the class that maximizes p_k(x) is the same as choosing the class that maximizes

$$\ln\!\left(\pi_k\, e^{-\lambda_k}\, \frac{\lambda_k^x}{x!}\right) = \ln \pi_k - \lambda_k + x \ln \lambda_k - \ln x!.$$

Since the term ln x! does not depend on the class, the result follows.

(f) [1 point] From (e), what is the shape of the Bayes decision boundaries?

Answer. The Bayes decision boundaries correspond to the values of X = x for which p_k(x) = p_l(x), or equivalently δ_k(x) = δ_l(x), for a pair of classes k and l. Since each δ_k is a linear function of x, the Bayes decision boundaries are linear: with a single predictor, each boundary is the point where two of these lines cross.
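To illustrate parts (e) and (f), a hypothetical sketch (not part of the exam) that classifies counts x with the discriminant δ_k(x) = ln π_k − λ_k + x ln λ_k, reusing the priors and yearly rates from part (d) as the two classes. Because each δ_k is linear in x, the predicted class switches exactly once as x grows.

```python
# A minimal sketch of the Poisson Bayes classifier from part (e), using the
# two classes from part (d): k = 0 (Alzheimer's), k = 1 (no Alzheimer's).
import math

priors = [0.2, 0.8]              # pi_k
rates = [0.33 * 12, 0.85 * 12]   # lambda_k: yearly reading rates

def delta(x, pi_k, lam_k):
    """Linear discriminant delta_k(x) = ln(pi_k) - lambda_k + x ln(lambda_k)."""
    return math.log(pi_k) - lam_k + x * math.log(lam_k)

for x in range(15):
    scores = [delta(x, priors[k], rates[k]) for k in range(2)]
    print(f"x = {x:2d} -> class {scores.index(max(scores))}")
# Class 0 wins for small x and class 1 for large x; the decision boundary
# is the single point where the two linear discriminants cross.
```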