MS&E 226 In-Class Midterm Examination Solutions
Small Data
October 20, 2015

PROBLEM 1. Alice uses ordinary least squares to fit a linear regression model on a dataset containing outcome data Y and covariates X (assume all covariates are numeric). She shares her results with Bob. Bob wants to replicate the results, and also uses ordinary least squares to fit a linear regression model, but does so after standardizing each column of data (the outcome as well as all covariates). When they compare the sum of squared residuals, they notice that they are wildly different. This catches Alice and Bob by surprise, because they were taught that standardizing doesn't change anything for linear regression. Why was the sum of squared residuals so different in their respective fitted models?

(a) Because the intercept is not scaled.
(b) Because the outcome is measured on a different scale.
(c) Because they should have compared the square root of the sum of squared residuals, instead of just the sum of squared residuals.
(d) One of them must have made a coding mistake, because the sum of squared residuals should have been the same.

(b) When the outcomes are not measured in the same units, we cannot compare the sums of squared residuals directly.

PROBLEM 2. Suppose we have data with covariates X and outcome Y, and we build a linear regression model of Y against the covariates X. Let A be the resulting R² value. Now suppose we add new covariates to X. However, assume these covariates are just random noise (e.g., they might be i.i.d. N(0, 1) random variables), without any relationship to X or Y. We now build another linear regression model using all the original and new covariates, and compute the resulting R² value; let this be B. What can you say about how A and B are related to each other?

(a) A ≥ B.
(b) A = B.
(c) A ≤ B.
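As a quick numerical check of how R² behaves when pure-noise covariates are added, consider the following sketch (all data here are synthetic, and the helper `r_squared` is written just for this illustration):

```python
import numpy as np

# Synthetic data (all values made up for illustration).
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
Y = 1 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

def r_squared(X, Y):
    """R^2 of an OLS fit of Y on X (intercept included)."""
    Xd = np.column_stack([np.ones(len(Y)), X])
    beta, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
    resid = Y - Xd @ beta
    return 1 - (resid @ resid) / ((Y - Y.mean()) @ (Y - Y.mean()))

A = r_squared(X, Y)

# Append pure-noise covariates, unrelated to X or Y.
X_noisy = np.column_stack([X, rng.normal(size=(n, 3))])
B = r_squared(X_noisy, Y)

print(A <= B)  # the larger model can never fit worse
```

The larger model's least-squares fit can always match the smaller one (by setting the new coefficients to zero), so its sum of squared residuals can only go down and its R² can only go up.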
(c) R² always increases (weakly) when we add new covariates.

PROBLEM 3. You are given data with covariates X and outcome Y, and fit three different models: one by ordinary least squares (OLS), one by ridge regression with λ > 0, and one by lasso with λ > 0. How does the sum of squared residuals compare across these methods?

(a) The sum of squared residuals is smallest for OLS.
(b) The sum of squared residuals is smallest for ridge regression.
(c) The sum of squared residuals is smallest for lasso.

(a) OLS directly minimizes the sum of squared residuals, while ridge and lasso also take the penalty term into account, so OLS must have the lowest sum of squared residuals.

PROBLEM 4. Suppose we are given data with covariates X and outcome Y, and fit a linear regression model by ordinary least squares; let β̂ be the resulting coefficient vector. We send the data to a friend, so that he can also analyze the data. By mistake, before running his analysis, our friend duplicates a few (but not all) rows of the data. He then computes the ordinary least squares solution, and finds a vector of coefficients β̃. Are β̂ and β̃ equal?

(a) Yes, they are always equal.
(b) They are equal, but only if the data was centered.
(c) They are equal, but only if the data contained more rows of data than covariates.
(d) In general they are not equal.

(d) By duplicating rows, the coefficients will generally change. This can be seen as a form of weighted linear regression: the duplicated rows are treated as more important, since they appear multiple times in the sum of squared residuals that OLS minimizes.

PROBLEM 5. Suppose you are given n data points (Xᵢ, Yᵢ), i = 1, ..., n. You fit a simple linear regression model of Y on X. Suppose the resulting regression line is y = β̂₀ + β̂₁x.
You also fit a simple linear regression model of X against Y. Suppose the resulting regression line is x = β̃₀ + β̃₁y. Which of the following are true?

(a) The intercepts are equal: β̂₀ = β̃₀.
(b) The slopes are inverses of each other: β̂₁ = 1/β̃₁.
(c) Both (a) and (b).
(d) Neither (a) nor (b).

(d) The slope from the regression of Y on X is β̂₁ = r_xy · s_y/s_x, while the slope from the regression of X on Y is β̃₁ = r_xy · s_x/s_y, where r_xy is the sample correlation and s_x, s_y are the sample standard deviations. Their product is r²_xy, which is not 1 in general, so (b) is false. For the intercepts, β̂₀ = Ȳ − β̂₁X̄ while β̃₀ = X̄ − β̃₁Ȳ; these are not equal in general, so (a) is false.

PROBLEM 6. You have a dataset consisting of heights (measured in inches) and weights (measured in pounds) of n individuals. You fit a linear regression model of log(weight) on height by ordinary least squares, and find the following fitted model:

log(weight) = -2.5 + 0.02 * height

What is the meaning of the coefficient on height?

(a) A 1% increase in height will cause a 2% increase in weight.
(b) A 1% increase in height will cause a 0.02 pound increase in weight.
(c) A 1 inch increase in height will cause a 0.02 pound increase in weight.
(d) A 1 inch increase in height will cause a 2% increase in weight.

(d) The fitted model gives weight = e^(−2.5 + 0.02·h), so a 1 inch increase in height gives e^(−2.5 + 0.02(h+1)) = e^(−2.5 + 0.02h) · e^0.02 ≈ 1.02 · e^(−2.5 + 0.02h), i.e., roughly a 2% increase in weight.

PROBLEM 7. Consider the kidiq dataset we have seen in class. The first few rows look like:

  kid_score mom_hs    mom_iq
1        65      1 121.11753
2        98      1  89.36188
3        85      1 115.44316
4        83      1  99.44964
5       115      1  92.74571
6        98      0 107.90184
...
We fit two different regression models. First, we fit the following model using all the data:

kid_score ~ 1 + mom_hs + mom_iq + mom_hs:mom_iq

Let A be the coefficient on mom_iq in the resulting model. Next, we keep only those rows of the data where mom_hs is zero, and we fit the following model using only this data:

kid_score ~ 1 + mom_iq

Let B be the coefficient on mom_iq in the resulting model. How do A and B compare to each other?

(a) A > B.
(b) B > A.
(c) A = B.

(c) Due to the interaction term in the full model, the coefficient on mom_iq there measures the effect of mom_iq for moms for which mom_hs = 0. Hence, the coefficients are the same.

PROBLEM 8. Suppose you are given n data points (Xᵢ, Yᵢ), i = 1, ..., n. You fit the regression model Yᵢ ≈ β̂₀ + β̂₁Xᵢ by ordinary least squares. Let β̂ be the resulting vector of coefficients. Define the respective sample means as follows:

Ȳ = (1/n) Σᵢ₌₁ⁿ Yᵢ;  X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ.

Which of the following is true?

(a) Ȳ = β̂₀.
(b) X̄ = β̂₀.
(c) Ȳ = β̂₀ + β̂₁X̄.
(d) None of the above.

(c) The regression line always goes through the point of means (X̄, Ȳ). In particular, we have Ȳ = β̂₀ + β̂₁X̄.
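The point-of-means identity in Problem 8 is easy to verify numerically. A small sketch with numpy (the data here are synthetic):

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 3 + 2 * x + rng.normal(size=50)

# Simple OLS of y on x, intercept included.
Xd = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(Xd, y, rcond=None)

# The fitted line passes through the point of means (exactly,
# up to floating point): this follows from the normal equations.
print(np.isclose(y.mean(), b0 + b1 * x.mean()))  # → True
```

The identity holds for any data whenever an intercept is included, because the first normal equation forces the residuals to sum to zero.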
PROBLEM 9. You are given a dataset that you split into a training set A and a test set B. You train a linear regression model (call this Model 1) on the training set A, and then compute its mean squared prediction error E₁ on the test set B. After you do so, inspection of the results suggests that you might have been better off including an interaction term in the original regression; so you go back and train a new model (call this Model 2) on the training set A with this interaction term added, and test it again on your test set B. Let E₂ be the resulting mean squared prediction error. Tomorrow a colleague is going to give you a new test set C, coming from the same data-generating process as your original data. Which of the following are true in general?

(a) E₁ is unbiased as an estimate of the prediction error of Model 1 on test set C.
(b) E₂ is unbiased as an estimate of the prediction error of Model 2 on test set C.
(c) Both (a) and (b).
(d) Neither (a) nor (b).

(a) E₁ is an unbiased estimate, as test set B was used for the first time to test Model 1, yielding E₁. E₂ is not unbiased, because Model 2 was chosen based on information derived from test set B. Therefore, in general, E₂ underestimates the prediction error of Model 2 on test set C.

PROBLEM 10. Suppose you fit two linear regression models, Model 1 and Model 2, using the same data (and in particular the same outcome variable), but different subsets of the available covariates. Each model is fit using ordinary least squares. Model 1 has a lower Cp score than Model 2, and a lower R² than Model 2. How does the number of covariates compare across the models?

(a) Model 1 uses a smaller number of covariates than Model 2.
(b) Model 1 uses a larger number of covariates than Model 2.
(c) Model 1 uses the same number of covariates as Model 2.

(a) Because Model 1's R² is lower, its sum of squared residuals must be larger than that of Model 2.
For Model 1 to nevertheless have a lower Cp score, it must have fewer covariates (since the number of data points n and the estimated residual variance σ̂² from the full fitted model are common to both models).
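To make the Cp trade-off concrete, here is a small simulation. This is only a sketch: conventions for Cp vary across texts, and the version used here is Cp = SSE/n + 2pσ̂²/n with σ̂² estimated from the fullest model; all data are synthetic.

```python
import numpy as np

# Synthetic data; only the first covariate actually matters.
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 5))
Y = 1 + 2 * X[:, 0] + rng.normal(size=n)

def sse(cols):
    """Sum of squared residuals of an OLS fit of Y on the given columns."""
    Xd = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
    r = Y - Xd @ beta
    return r @ r

# sigma^2 estimated from the fullest model (5 covariates + intercept).
sigma2 = sse(range(5)) / (n - 6)

def cp(cols):
    # One common convention: Cp = SSE/n + 2 p sigma_hat^2 / n,
    # where p counts fitted coefficients (intercept included).
    p = len(list(cols)) + 1
    return sse(cols) / n + 2 * p * sigma2 / n

# Model 1: one covariate; Model 2: all five covariates.
sse1, sse2 = sse([0]), sse(range(5))
cp1, cp2 = cp([0]), cp(range(5))

print(sse1 >= sse2)  # fewer covariates => higher SSE (lower R^2)
```

The smaller model always has the larger SSE, so when it still wins on Cp, the penalty term 2pσ̂²/n must be doing the work, which is exactly the argument in the solution above.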
PROBLEM 11. I generate training data as follows: for i = 1, ..., 1000, the Xᵢ are i.i.d. N(0, 1) random variables; and for i = 1, ..., 1000,

Yᵢ = 1 + Xᵢ + Xᵢ² + Xᵢ³ + Xᵢ⁴ + εᵢ,

where the εᵢ are i.i.d. N(0, 2) random variables. You take the training data X and Y, and produce a predictive model that always predicts the sample mean of the Y, i.e., for any new X, f̂(X) = Ȳ. Which of the following is true of this predictive model?

(a) It has no bias.
(b) It has low variance.
(c) It has no variance.
(d) None of the above.

(b) The model has low variance: the sample mean Ȳ is the average of 1000 points, so it won't change much from one training set to another. However, it does have some variance: a different training set leads to different predictions.

PROBLEM 12. Last week, a friend of mine gave me a dataset with outcomes Y and design matrix X (with an intercept column). In addition, he gave me the coefficients β̂ he claimed to have computed by ordinary least squares. (The columns of X were linearly independent.) However, after a quick check, I concluded that my friend had made a mistake in his calculation of β̂. Which one of the following could be the reason?

(a) The R² of the fit was close to 1.
(b) The data was not centered, but the intercept was zero.
(c) The residuals did not add up to zero.
(d) One of the coefficients β̂ⱼ was zero.

(c) Since the design matrix includes an intercept column, the OLS solution must make the residuals sum to zero (the vector of residuals is orthogonal to every column of X, and in particular to the vector of ones).

PROBLEM 13. A dataset X, Y with n rows and p covariates is generated according to a linear model Y = Xβ + ε, where the entries of ε are i.i.d. N(0, σ²). Following what you learned in MS&E 226, you fit the OLS solution and obtain β̂. In addition, it is your lucky day: your favorite fortune-teller happens to be around, and she tells you the value of the true β. Now you are given a test set X̃, Ỹ with m rows.
By using your models wisely, what's the mean squared prediction error you expect to obtain?
(a) σ²(1 + p/n).
(b) σ²(1 + p/m).
(c) σ².
(d) Zero.

(c) The wise thing to do is to use the true β for prediction. Still, we don't get perfect predictions, due to the noise in the population model. Here the noise ε has variance σ², and that is exactly the mean squared prediction error we expect; this is also known as the irreducible error.
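The irreducible-error claim in Problem 13 can be checked by Monte Carlo simulation. A sketch with numpy, where β, σ, and the test-set size are made-up values for illustration:

```python
import numpy as np

# Synthetic setup: true beta revealed by the "fortune-teller", sigma^2 = 4.
rng = np.random.default_rng(3)
n_test, p, sigma = 100_000, 3, 2.0
beta = np.array([1.0, -2.0, 0.5])

# Generate a test set from the linear population model Y = X beta + eps.
X_test = rng.normal(size=(n_test, p))
Y_test = X_test @ beta + rng.normal(scale=sigma, size=n_test)

# Predicting with the true beta leaves only the noise term,
# so the mean squared prediction error concentrates near sigma^2.
mspe = np.mean((Y_test - X_test @ beta) ** 2)
print(mspe)  # close to sigma^2 = 4
```

With the true β, each squared prediction error is just εᵢ², whose mean is σ²; averaging over a large test set therefore recovers σ² up to sampling noise, matching answer (c).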