STAT 391 - Spring Quarter 2017 - Midterm 1 - April 27, 2017

Name: ____________________  Student ID Number: ____________________

Problem   #1   #2   #3   #4   #5   #6   Total
Points    /6   /8   /14  /10  /8   /10  /56

Directions. Read directions carefully and show all your work. Define your own notations. You do not need to simplify or evaluate unless indicated. Partial credit will be assigned based upon the correctness, completeness and clarity of your answers. Correct answers without proper justification will not receive full credit. The exam is closed book, closed notes. Calculators and other electronic devices are not allowed.
Problem 1. [6 points] Suppose that we use the K-nearest neighbors (KNN) method for a classification problem using different values of K.

(a) [3 points] Provide a sketch of typical training error rate, test error rate, and Bayes error rate, on a single plot. The x-axis should represent 1/K, and the y-axis should represent the values for each curve. There should be three curves. Make sure to label each one.

Answer. [Figure 1: error rate (0.00 to 0.20) plotted against 1/K (log scale, 0.01 to 1.00). The training error curve decreases steadily as 1/K grows, the test error curve is U-shaped, and the black dashed horizontal line represents the Bayes error rate.]

(b) [2 points] As K decreases, does the level of flexibility increase or decrease? Justify your answer.

Answer. As K decreases, the method becomes more flexible. For very low values of K, the method may find patterns in the data that do not correspond to the Bayes decision boundary, and thus overfits. For the extreme case K = 1, the training error is 0, but the test error rate may be quite high.

(c) [1 point] Draw a vertical line on the previous plot and show the part of the graph where overfitting occurs.

Answer. Overfitting occurs when the training error is low but the test error is large. In the above graph, this corresponds to values of K below 10, i.e. the region to the right of a vertical line at 1/K = 0.1.
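The K = 1 behavior described in (b) can be checked directly: with K = 1, every training point is its own nearest neighbor, so the training error is exactly 0. A minimal sketch (the 1-D toy data set is made up for illustration):

```python
# Hypothetical 1-D training set: (feature, class label) pairs.
train = [(0.10, 0), (0.35, 1), (0.40, 0), (0.60, 0), (0.80, 1), (0.90, 1)]

def knn_predict(x, k):
    # Majority vote among the k training points nearest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

# With K = 1, each training point is its own nearest neighbor,
# so every training point is classified correctly.
train_err_k1 = sum(knn_predict(x, 1) != y for x, y in train) / len(train)

# A larger K (less flexible fit) can no longer memorize every point,
# so its training error is typically higher.
train_err_k5 = sum(knn_predict(x, 5) != y for x, y in train) / len(train)
print(train_err_k1)  # 0.0
```

This illustrates the left end of the training-error curve in the sketch: as 1/K increases toward 1, the training error falls to 0.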
Problem 2. [8 points] I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = β_0 + β_1 X + β_2 X^2 + β_3 X^3 + ε.

(a) [2 points] Suppose that the true relationship between X and Y is linear, i.e. Y = β_0 + β_1 X + ε. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

Answer. The cubic regression model is more flexible than the linear regression model. Therefore, we would expect the cubic model to fit the training data better, and thus to have the lower training RSS.

(b) [2 points] Answer (a) using test rather than training RSS.

Answer. If the true relationship between X and Y is linear, a cubic regression model is excessively flexible, and we would expect the method to fit the test data poorly. Therefore, we would expect the cubic model to have the higher test RSS.

(c) [2 points] Suppose that the true relationship between X and Y is not linear, but we do not know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

Answer. Same answer as (a): the cubic model is more flexible, so we would expect it to have the lower training RSS regardless of the true relationship.

(d) [2 points] Answer (c) using test rather than training RSS.

Answer. In this case, we do not know the right amount of flexibility needed to fit the true underlying model. So there is not enough information to tell which model would give the lower test RSS.
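The answers to (a) and (c) follow from nesting: the cubic model contains the linear model as a special case, so its training RSS can never be larger. A quick simulation sketch (the data-generating line y = 2 + 3x and the noise level are made up) fits both models by solving the normal equations:

```python
import random

def fit_rss(xs, ys, degree):
    """Least-squares polynomial fit of the given degree; returns the training RSS."""
    m = degree + 1
    # Normal equations (X^T X) beta = X^T y, with X the Vandermonde matrix.
    XtX = [[sum(x ** (a + b) for x in xs) for b in range(m)] for a in range(m)]
    Xty = [sum(y * x ** a for x, y in zip(xs, ys)) for a in range(m)]
    # Solve by Gauss-Jordan elimination with partial pivoting.
    M = [row + [Xty[i]] for i, row in enumerate(XtX)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(m):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    beta = [M[i][m] / M[i][i] for i in range(m)]
    preds = [sum(beta[j] * x ** j for j in range(m)) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds))

random.seed(0)
xs = [i / 10 for i in range(100)]                   # n = 100 observations
ys = [2 + 3 * x + random.gauss(0, 1) for x in xs]   # true relationship is linear
rss_linear, rss_cubic = fit_rss(xs, ys, 1), fit_rss(xs, ys, 3)
```

Even though the truth is linear, the cubic fit's training RSS is no larger, exactly as argued in (a); only on held-out data would the extra flexibility hurt.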
Problem 3. [14 points] Data for 51 U.S. states (50 states, plus the District of Columbia) was used to examine the relationship between violent crime rate (violent crimes per 100,000 persons per year) and the predictor variables of urbanization (percentage of the population living in urban areas) and poverty rate. A predictor variable indicating whether or not a state is classified as a Southern state (1 = Southern, 0 = not) was also included. Finally, we include two interaction terms {Urban-South} and {Poverty-South}. Some output for the analysis of this data is shown below.

## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     321.90     148.20     2.17    0.035
## Urban             4.689      1.654    2.83    0.007
## Poverty          39.34      13.52     2.91    0.006
## South          -649.30     266.96    -2.43    0.019
## Urban:South      12.05       2.871    4.20    0.000
## Poverty:South    -5.838     16.671   -0.35    0.728
##
## Residual standard error: 140.01 on 45 degrees of freedom
## F-statistic: 21.02 on 5 and 45 DF, p-value: <2e-16

(a) [2 points] Write the multiple linear regression model.

Answer. The multiple linear regression model reads as follows:

violent crime rate = β_0 + β_1 Urban + β_2 Poverty + β_3 South + β_4 (Urban × South) + β_5 (Poverty × South) + ε   (1)

(b) [2 points] Predict the violent crime rate for a Southern state with an urbanization of 55.4 and a poverty rate of 13.7.

Answer. Given the least-squares coefficient estimates, we can make the following prediction for the violent crime rate:

321.9 + 4.689 × 55.4 + 39.34 × 13.7 − 649.30 × 1 + 12.05 × (55.4 × 1) − 5.838 × (13.7 × 1)

(c) [2 points] Give an approximate 95% confidence interval for the coefficient related to the poverty rate.

Answer. An approximate 95% confidence interval for β_2 is β̂_2 ± 2 SE(β̂_2). Therefore, the interval [39.34 − 2 × 13.52, 39.34 + 2 × 13.52] contains β_2 with an approximate probability of 0.95.

(d) [2 points] Is there a relationship between the predictors and the response?

Answer.
To answer this question, we perform the hypothesis test H_0: β_j = 0 for all j = 1, ..., 5 against H_a: β_j ≠ 0 for at least one j. The p-value of the F-test, computed under H_0, is very small (< 2e-16). Therefore, we can reject the null hypothesis and conclude that there is a relationship between the predictors and the response.

(e) [2 points] Which predictors appear to have a statistically significant relationship to the response?

Answer. To answer this question, we perform five individual hypothesis tests: for j = 1, ..., 5, H_0: β_j = 0 against H_a: β_j ≠ 0. We can reject the null hypothesis for a given predictor if the corresponding p-value is small enough. If we choose a p-value cutoff of 5%, then all predictors appear to have a statistically significant relationship to the response, except for the interaction term {Poverty-South}.

(f) [2 points] What does the coefficient for the interaction term {Urban-South} suggest?
Answer. According to the least-squares fit, we obtain the following prediction for the violent crime rate:

321.9 + 4.689 Urban + 39.34 Poverty − 649.30 South + 12.05 (Urban × South) − 5.838 (Poverty × South)
= 321.9 + (4.689 + 12.05 South) Urban + 39.34 Poverty − 649.30 South − 5.838 (Poverty × South)

In other words, in Southern states, a 1% increase in the population living in urban areas will increase the violent crime rate by 4.689 + 12.05 = 16.739 on average, compared with 4.689 in non-Southern states. Hence the effect of urbanization is amplified in Southern states.

(g) [2 points] To what extent do you think the R^2-statistic and the residual standard error would change if we remove the interaction term {Poverty-South} from the model?

Answer. According to (e), the interaction term {Poverty-South} is not significantly related to the response. Therefore, one would expect a tiny decrease of the R^2-statistic, and either a slight increase or decrease of the residual standard error.
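The arithmetic left unevaluated in parts (b) and (c) can be checked with a short script (the numbers and signs come straight from the plugged-in expressions in those answers):

```python
# Part (b): predicted violent crime rate for a Southern state
# with Urban = 55.4 and Poverty = 13.7 (South = 1).
urban, poverty, south = 55.4, 13.7, 1
pred = (321.9 + 4.689 * urban + 39.34 * poverty - 649.30 * south
        + 12.05 * urban * south - 5.838 * poverty * south)
print(round(pred, 1))  # 1058.9 violent crimes per 100,000 persons

# Part (c): approximate 95% confidence interval for beta_2 (Poverty).
ci_lo, ci_hi = 39.34 - 2 * 13.52, 39.34 + 2 * 13.52  # [12.30, 66.38]
```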
Problem 4. [10 points] A scientific foundation wanted to evaluate the relation between Y = salary of researcher (in thousands of dollars), and four predictors, X_1 = number of years of experience, X_2 = an index of publication quality, X_3 = sex (1 for Male and 0 for Female), and X_4 = an index of success in obtaining grant support. A sample of 35 randomly selected researchers was used to fit the multiple linear regression model. Parts of the computer output appear below.

## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.846931   2.001876   8.915   0.0001
## Years        1.103130   0.359573   3.068   0.0032
## Papers       0.321520   0.037109   8.664        ?
## Sex          1.593400   0.687724   2.317   0.0083
## Grants       1.288941   0.298479   4.318   0.0003
##
## Residual standard error: 1.75 on 30 degrees of freedom
## Multiple R-squared: 0.923
## F-statistic: ? on 4 and 30 DF, p-value: ?

(a) [2 points] Explain how the t-statistic for the number of years of experience was computed.

Answer. The multiple linear regression model of interest is

Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_4 + ε.   (2)

The t-statistic for the number of years of experience is obtained as follows: β̂_1 / SE(β̂_1) = 1.103130 / 0.359573.

(b) [2 points] The 97.5% quantile of a t-distribution with 30 degrees of freedom is 2.042. Do you expect the p-value associated with the index of publication quality to be greater than or less than 0.05?

Answer. The p-value of interest is computed for the hypothesis test H_0: β_2 = 0 against H_a: β_2 ≠ 0. We know that under H_0, the t-statistic follows a t-distribution with 30 degrees of freedom. Since the computed t value 8.664 is larger than 2.042, we reject the null hypothesis at the level of significance 5%. Therefore, we expect the p-value to be less than 0.05.

(c) [2 points] How well does the regression model explain the variability in the salary of a researcher?

Answer. This is answered by reading the R^2-statistic.
Here, 92.3% of the variability in the salary of a researcher is explained by performing the multiple linear regression (2), which corresponds to a very good fit.

(d) [2 points] Recall that the formula for the F-statistic is

F = (TSS − RSS)/RSS × (n − p − 1)/p

where n is the number of observations and p is the number of predictors. And the R^2-statistic is given by R^2 = 1 − RSS/TSS. Using basic algebra, prove the following relationship between the F-statistic and the R^2-statistic:

F = R^2/(1 − R^2) × (n − p − 1)/p.
Answer. It is sufficient to prove that (TSS − RSS)/RSS = R^2/(1 − R^2). We start from the left-hand side. Since R^2 = 1 − RSS/TSS, we have TSS/RSS = 1/(1 − R^2), so

(TSS − RSS)/RSS = TSS/RSS − 1 = 1/(1 − R^2) − 1 = (1 − (1 − R^2))/(1 − R^2) = R^2/(1 − R^2).

(e) [2 points] From (d), do you expect the value of the F-statistic for our data to be very large or very small?

Answer. If R^2 is large, then 1 − R^2 is small, so R^2/(1 − R^2) is large, and the F-statistic as well, using (d). For our data, R^2 = 0.923. Therefore we expect the value of the F-statistic to be very large.
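Plugging our numbers into the identity from (d) makes (e) concrete: with n = 35, p = 4 and R^2 = 0.923 the implied F-statistic is very large (the helper name below is ours):

```python
def f_from_r2(r2, n, p):
    # F = R^2 / (1 - R^2) * (n - p - 1) / p, the identity proved in (d)
    return r2 / (1 - r2) * (n - p - 1) / p

f_stat = f_from_r2(0.923, n=35, p=4)
print(round(f_stat, 1))  # 89.9, far above typical F critical values
```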
Problem 5. [8 points] A data set consists of percentage returns for the S&P 500 stock index over 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010. For each date, we have recorded the year that the observation was recorded, and the percentage returns for each of the five previous trading weeks, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous week, in billions) and Direction (whether the market was Up or Down on a given week). In this problem, a prediction is based on whether the predicted probability of a market increase is greater than or less than 0.5.

(a) [2 points] We perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. We obtain the following output using statistical software.

## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.2669     0.0859    3.11   0.0019
## Lag1         -0.0413     0.0264   -1.56   0.1181
## Lag2          0.0584     0.0269    2.18   0.0296
## Lag3         -0.0161     0.0267   -0.60   0.5469
## Lag4         -0.0278     0.0265   -1.05   0.2937
## Lag5         -0.0145     0.0264   -0.55   0.5833
## Volume       -0.0227     0.0369   -0.62   0.5377

Estimate the probability that the market goes up next week with Lag1 = 1.26%, Lag2 = -1.96%, Lag3 = 0.97%, Lag4 = 0.72%, Lag5 = 0.09% and a volume of 1.46 billion shares traded on the previous week.

Answer. The logistic regression model reads as follows:

p(x) = e^{β_0 + β_1 Lag1 + β_2 Lag2 + β_3 Lag3 + β_4 Lag4 + β_5 Lag5 + β_6 Volume} / (1 + e^{β_0 + β_1 Lag1 + β_2 Lag2 + β_3 Lag3 + β_4 Lag4 + β_5 Lag5 + β_6 Volume})   (3)

where p(x) is the probability that the market goes up given the five lag variables and the volume of traded shares. Using the coefficient estimates given above, we obtain the following predicted probability:

p(x) = e^{0.2669 − 0.0413×1.26 + 0.0584×(−1.96) − 0.0161×0.97 − 0.0278×0.72 − 0.0145×0.09 − 0.0227×1.46} / (1 + e^{0.2669 − 0.0413×1.26 + 0.0584×(−1.96) − 0.0161×0.97 − 0.0278×0.72 − 0.0145×0.09 − 0.0227×1.46})

(b) [2 points] Do any of the predictors appear to be statistically significant?
If so, which ones?

Answer. On the basis of the p-values, and for a cutoff of 5%, Lag2 appears to be the only statistically significant predictor.

(c) [2 points] Now we fit four classifiers using a training data period from 1990 to 2008, with Lag2 as the only predictor: logistic regression model, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and K-nearest neighbors (KNN) with K = 1. We obtain the following confusion matrices for the held-out data (that is, the data from 2009 and 2010) for the four classifiers.

                      True Direction
                      Down    Up
Predicted  Down          9     5
Direction  Up           34    56

Table 1: Confusion matrix using logistic regression.

                      True Direction
                      Down    Up
Predicted  Down          9     5
Direction  Up           34    56

Table 2: Confusion matrix using LDA.

Compute the overall fraction of correct predictions for the held-out data for the four classifiers.

Answer. The overall fraction of correct predictions for the held-out data for the four classifiers is:
                      True Direction
                      Down    Up
Predicted  Down          0     0
Direction  Up           43    61

Table 3: Confusion matrix using QDA.

                      True Direction
                      Down    Up
Predicted  Down         21    30
Direction  Up           22    31

Table 4: Confusion matrix using KNN with K = 1.

(9 + 56)/(9 + 5 + 34 + 56) = 65/104 for logistic regression,
(9 + 56)/(9 + 5 + 34 + 56) = 65/104 for LDA,
(0 + 61)/(0 + 0 + 43 + 61) = 61/104 for QDA,
(21 + 31)/(21 + 30 + 22 + 31) = 52/104 = 1/2 for KNN with K = 1.

(d) [2 points] Which of these methods appears to provide the best results on this data?

Answer. On the basis of the test error rates, logistic regression and LDA seem to be tied for the best method. However, one could challenge that statement by considering other values of K for the KNN method.
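The predicted probability in (a) and the accuracies in (c) can be evaluated with a short script (coefficient signs follow the plugged-in expression in the answer to (a)):

```python
import math

# Part (a): plug the estimates into the logistic model (3).
coef = {"Lag1": -0.0413, "Lag2": 0.0584, "Lag3": -0.0161,
        "Lag4": -0.0278, "Lag5": -0.0145, "Volume": -0.0227}
x = {"Lag1": 1.26, "Lag2": -1.96, "Lag3": 0.97,
     "Lag4": 0.72, "Lag5": 0.09, "Volume": 1.46}
eta = 0.2669 + sum(coef[k] * x[k] for k in coef)   # linear predictor
p_up = math.exp(eta) / (1 + math.exp(eta))
print(round(p_up, 3))  # 0.508 > 0.5, so we would predict "Up"

# Part (c): overall fraction of correct predictions; each matrix has
# rows = predicted (Down, Up) and columns = true (Down, Up).
def accuracy(matrix):
    return (matrix[0][0] + matrix[1][1]) / sum(sum(row) for row in matrix)

acc_logistic = accuracy([[9, 5], [34, 56]])    # 65/104 (same for LDA)
acc_qda = accuracy([[0, 0], [43, 61]])         # 61/104
acc_knn = accuracy([[21, 30], [22, 31]])       # 52/104 = 1/2
```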
Problem 6. [10 points] Suppose that n observations x_1, x_2, ..., x_n are drawn from a Poisson distribution with unknown parameter λ. Recall that the probability mass function of a Poisson distribution with parameter λ is p(x; λ) = e^{−λ} λ^x / x!, where x is a nonnegative integer.

(a) [2 points] Compute L(λ), the likelihood function of x_1, ..., x_n. Then give ln L(λ), the log-likelihood of x_1, ..., x_n.

Answer. The likelihood function of x_1, ..., x_n is

L(λ) = ∏_{i=1}^n p(x_i; λ) = ∏_{i=1}^n λ^{x_i} e^{−λ} / x_i! = λ^{Σ_{i=1}^n x_i} e^{−nλ} / ∏_{i=1}^n x_i!

Therefore, the log-likelihood of x_1, ..., x_n is

ln L(λ) = −nλ + (Σ_{i=1}^n x_i) ln λ − Σ_{i=1}^n ln(x_i!)

(b) [2 points] Determine λ̂, the maximum likelihood estimator of λ. You should find λ̂ = (1/n) Σ_{i=1}^n X_i. You do not need to check the sign of the second derivative.

Answer. The derivative of ln L(λ) is

d ln L(λ)/dλ = −n + (Σ_{i=1}^n x_i)/λ.

Setting the derivative to zero gives the estimate λ̂ = (1/n) Σ_{i=1}^n x_i.

(c) [1 point] Application: Researchers want to investigate whether reading may prevent Alzheimer's disease. To do so, they examined 2,000 individuals with Alzheimer's and 8,000 without Alzheimer's. Give π_1, the prior probability that a person has Alzheimer's.

Answer. The prior probability that a person has Alzheimer's is π_1 = 2,000/(2,000 + 8,000) = 0.2.

(d) [2 points] Researchers discovered that people without Alzheimer's read 0.85 books per month on average, while people with Alzheimer's read 0.33 books per month on average. Assuming that the number of books that an individual reads per year follows a Poisson distribution, predict the probability that a person has Alzheimer's if he or she reads 7 books per year. Hint: Use Bayes' theorem.

Answer. Let A be the event that a person has Alzheimer's, A^c be the event that a person does not have Alzheimer's, and B be the number of books that this person reads per year.
Since B follows a Poisson distribution with yearly rate 0.33 × 12 given A and 0.85 × 12 given A^c, Bayes' theorem gives

P(A | B = 7) = P(A) P(B = 7 | A) / [P(A) P(B = 7 | A) + P(A^c) P(B = 7 | A^c)]
            = π_1 p(7; 0.33 × 12) / [π_1 p(7; 0.33 × 12) + (1 − π_1) p(7; 0.85 × 12)]
            = 0.2 e^{−0.33×12} (0.33×12)^7/7! / [0.2 e^{−0.33×12} (0.33×12)^7/7! + 0.8 e^{−0.85×12} (0.85×12)^7/7!]
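The posterior probability left unevaluated above can be computed directly; a minimal sketch:

```python
import math

def pois(x, lam):
    # Poisson pmf: p(x; lambda) = e^{-lambda} * lambda^x / x!
    return math.exp(-lam) * lam ** x / math.factorial(x)

pi1 = 0.2                                 # prior from part (c)
lam_alz, lam_no = 0.33 * 12, 0.85 * 12    # yearly rates (monthly rate x 12)
num = pi1 * pois(7, lam_alz)
posterior = num / (num + (1 - pi1) * pois(7, lam_no))
print(round(posterior, 3))  # 0.146
```

So a person who reads 7 books per year would be classified as not having Alzheimer's, since the posterior probability is well below 0.5.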
(e) [2 points] We consider the general case of classification with only one predictor X. Suppose that we have K classes, and that if an observation belongs to the k-th class then X comes from a Poisson distribution with parameter λ_k. Prove that in this case, the Bayes classifier assigns an observation X = x to the class for which

δ_k(x) = ln π_k − λ_k + x ln λ_k

is largest.

Answer. Recall that the Bayes classifier assigns an observation X = x to the class for which p_k(x) = P(Y = k | X = x) is largest. We have that

p_k(x) = π_k p(x; λ_k) / Σ_{l=1}^K π_l p(x; λ_l) = (π_k e^{−λ_k} λ_k^x / x!) / (Σ_{l=1}^K π_l e^{−λ_l} λ_l^x / x!)

where π_k is the prior probability for an observation X = x to belong to the k-th class. As the denominator is the same across all classes, and since the function ln is non-decreasing, choosing the class that maximizes p_k(x) is the same as choosing the class that maximizes

ln(π_k e^{−λ_k} λ_k^x / x!) = ln π_k − λ_k + x ln λ_k − ln x!

From the above equation, we obtain the result since the term ln x! does not depend on the class.

(f) [1 point] From (e), what is the shape of the Bayes decision boundaries?

Answer. The Bayes decision boundaries correspond to the values of X = x for which p_k(x) = p_l(x), or equivalently δ_k(x) = δ_l(x), for any pair of classes k and l. Since the functions δ_k are linear functions of x, so are the Bayes decision boundaries.
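The discriminant from (e) can be sketched as a tiny classifier. As an example we reuse the Alzheimer's numbers from (c)-(d) as the two classes (our choice of illustration, not part of the exam): class 0 = Alzheimer's, class 1 = no Alzheimer's.

```python
import math

def delta(x, pi_k, lam_k):
    # delta_k(x) = ln(pi_k) - lambda_k + x * ln(lambda_k), as proved in (e)
    return math.log(pi_k) - lam_k + x * math.log(lam_k)

priors = [0.2, 0.8]             # class 0: Alzheimer's, class 1: no Alzheimer's
rates = [0.33 * 12, 0.85 * 12]  # Poisson rates per year

def bayes_classify(x):
    # Assign x to the class with the largest discriminant score.
    scores = [delta(x, p, l) for p, l in zip(priors, rates)]
    return scores.index(max(scores))

# Consistent with (d): at x = 7 books the posterior for Alzheimer's is
# below 0.5, so the Bayes classifier picks class 1 (no Alzheimer's).
print(bayes_classify(7))  # 1
```

For a very low reading count the discriminants flip, illustrating that the boundary is the single point where δ_0(x) = δ_1(x), a linear equation in x as stated in (f).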