CAS MA575 Linear Models
Boston University, Fall 2013
Midterm Exam (Correction)
Instructor: Cedric Ginestet
Date: 22 Oct 2013. Maximal Score: 200pts.
Please Note: You will only be graded on work and answers found in your Blue Book(s).

1. Simple Linear Regression [60pts]

In this section, we consider a standard linear regression model with intercept on pairs of data points (y_i, x_i), such that the mean and variance functions are, for every i = 1, ..., n,

  E[Y_i | X = x_i] = \beta_0 + \beta_1 x_i,   and   Var[Y_i | X = x_i] = \sigma^2.

1. [10pts] In the following list, identify the quantities that are treated as not random: Y_i, x_i, \beta_1, \hat{\beta}_1, e_i, \hat{e}_i, \hat{y}_i, n.

Here, x_i, \beta_1 and n are the only quantities that are unambiguously non-random. Three points should be deducted for each of these that is omitted. Moreover, Y_i is random, so three points should also be deducted if it is included. However, no points should be deducted for including \hat{\beta}_1, e_i, \hat{e}_i or \hat{y}_i, which are written in lower case, and are therefore possibly non-random.

2. [10pts] What is the relationship between the OLS estimator, \hat{\beta}_1 := SXY/SXX, and the estimated correlation coefficient r_{xy} between the y_i's and the x_i's?

It suffices to show that

  \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\widehat{\mathrm{Cov}}[X, Y]}{\widehat{\mathrm{Var}}[X]} = r_{xy} \frac{\hat{\sigma}_y}{\hat{\sigma}_x},
using the definition of the correlation coefficient,

  r_{xy} := \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x \hat{\sigma}_y} = \frac{\widehat{\mathrm{Cov}}[X, Y]}{\sqrt{\widehat{\mathrm{Var}}[X] \, \widehat{\mathrm{Var}}[Y]}}.

3. [20pts] How is the F-test testing the null hypothesis, H_0: E[Y|X = x] = \beta_0, versus the alternative hypothesis, H_1: E[Y|X = x] = \beta_0 + \beta_1 x, related to the R^2 for this model?

The F-statistic and the R^2 have identical numerators, but different denominators:

  F = \frac{SSreg}{RSS(\hat{\beta}_0, \hat{\beta}_1)/(n - 2)} = \frac{SSreg}{\hat{\sigma}^2},   and   R^2 = \frac{SSreg}{RSS(\hat{\beta}_0)} = \frac{SSreg}{SYY}.

4. [20pts] Show that the F-statistic for testing the null hypothesis H_0: E[Y|X = x] = \beta_0 is equal to the square of the t-statistic for testing H_0: \beta_1 = 0, using the fact that

  SSreg = RSS_1(\hat{\beta}_0) - RSS_2(\hat{\beta}_0, \hat{\beta}_1) = \frac{SXY^2}{SXX}.

Taking the square of the t-statistic for \hat{\beta}_1, we have

  t_1^2 = \left( \frac{\hat{\beta}_1}{\mathrm{se}(\hat{\beta}_1)} \right)^2 = \frac{\hat{\beta}_1^2}{\hat{\sigma}^2 / SXX} = \frac{SXY^2 / SXX^2}{\hat{\sigma}^2 / SXX} = \frac{SXY^2}{\hat{\sigma}^2 \, SXX} = F.

2. Multiple Linear Regression [40pts]

Here, we consider a multiple regression model with intercept on pairs of data points (y_i, x_i), where x_i := [x_{i0}, ..., x_{ip}]^T, such that the mean and variance functions are, respectively,

  E[y|X] = X\beta,   and   Var[y|X] = \sigma^2 I_n.

Throughout this section, the design matrix X of order (n \times p'), with p' := p + 1, is assumed to be of full rank.

1. [20pts] Compute the variance of the random vector of OLS estimators, \hat{\beta} := (X^T X)^{-1} X^T y.

The variance of this vector of estimators can be derived using the now familiar formula for the covariance matrix of a random vector, Var[Ay] = A Var[y] A^T, and recalling that (X^T X)^{-1} is symmetric, so that we obtain

  Var[\hat{\beta} | X] = Var[(X^T X)^{-1} X^T y | X]
                       = (X^T X)^{-1} X^T Var[y|X] \left( (X^T X)^{-1} X^T \right)^T
                       = (X^T X)^{-1} X^T \sigma^2 I_n X (X^T X)^{-1}
                       = \sigma^2 \left[ (X^T X)^{-1} (X^T X) (X^T X)^{-1} \right]
                       = \sigma^2 (X^T X)^{-1}.

Department of Mathematics and Statistics, Boston University
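The simple-regression identities derived in Section 1 above, \hat{\beta}_1 = r_{xy} \hat{\sigma}_y / \hat{\sigma}_x and F = t_1^2, can be verified numerically. Below is a minimal sketch in Python with NumPy, on synthetic data rather than any data from the exam:

```python
# Numerical check of the simple-regression identities:
# beta1_hat = r_xy * (sigma_y / sigma_x), and the overall F-statistic
# equals the square of the slope's t-statistic. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

xbar, ybar = x.mean(), y.mean()
SXX = np.sum((x - xbar) ** 2)
SXY = np.sum((x - xbar) * (y - ybar))
SYY = np.sum((y - ybar) ** 2)

beta1 = SXY / SXX                          # OLS slope
r_xy = SXY / np.sqrt(SXX * SYY)            # sample correlation
slope_identity = np.isclose(beta1, r_xy * np.sqrt(SYY / SXX))

beta0 = ybar - beta1 * xbar
rss = np.sum((y - beta0 - beta1 * x) ** 2)
sigma2 = rss / (n - 2)                     # hat sigma^2 on n - 2 df
ssreg = SYY - rss                          # equals SXY^2 / SXX

F = ssreg / sigma2
t = beta1 / np.sqrt(sigma2 / SXX)          # t-statistic for the slope
f_equals_t2 = np.isclose(F, t ** 2)

print(slope_identity and f_equals_t2)      # True
```

Both checks hold exactly (up to floating point) for any data set with non-degenerate x, since they are algebraic identities rather than distributional facts.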
2. [20pts] Show that the vector of residuals \hat{e} and the vector of fitted values \hat{y} are orthogonal to each other, in the sense that \hat{e}^T \hat{y} = 0.

Using the hat matrix, we have \hat{e} := (I - H)y and \hat{y} := Hy, and therefore, since H is symmetric and idempotent,

  \hat{e}^T \hat{y} = [(I - H)y]^T Hy = y^T (I - H)^T Hy = y^T (I - H) Hy = y^T (H - HH) y = y^T (H - H) y = 0.

3. Maximum Likelihood [40pts]

In this section, the multiple linear regression model is identical to the one in the previous section. In addition, we also assume that

  y_i ~ N(x_i^T \beta, \sigma^2),   independently, for i = 1, ..., n.

This gives the following likelihood function,

  L(\beta, \sigma^2; y, X) := \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (y_i - x_i^T \beta)^2 \right\}.   (1)

1. [20pts] Show that the OLS and MLE estimators for the vector \beta are identical, such that

  \hat{\beta}_{\mathrm{MLE}} := \operatorname*{argmax}_{\beta \in \mathbb{R}^{p'}} L(\beta, \sigma^2; y, X) = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p'}} RSS(\beta) =: \hat{\beta}_{\mathrm{OLS}}.

First, we take the log of the likelihood function,

  \log L(\beta, \sigma^2; y, X) = \sum_{i=1}^n \log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (y_i - x_i^T \beta)^2 \right\} \right) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta)^2.

We can then omit the first term in the log-likelihood, since it does not depend on \beta. This gives an expression that is closely related to the residual sum of squares:

  \log L(\beta, \sigma^2; y, X) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta).

Clearly, the second term is proportional to -RSS(\beta), so that maximizing the log-likelihood over \beta is equivalent to minimizing RSS(\beta).
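Both results above, the orthogonality \hat{e}^T \hat{y} = 0 and the coincidence of the OLS and ML estimators of \beta, can be checked numerically. Here is a minimal sketch on a synthetic full-rank design (all names and data are illustrative):

```python
# (i) e_hat^T y_hat = 0 via the hat matrix H = X (X^T X)^{-1} X^T, and
# (ii) the OLS solution also maximizes the Normal log-likelihood over beta,
#      for any fixed sigma^2. Synthetic data only.
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# (i) orthogonality of residuals and fitted values
H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix (symmetric, idempotent)
y_hat = H @ y
e_hat = y - y_hat
orthogonal = abs(e_hat @ y_hat) < 1e-8

# (ii) OLS = MLE for beta: the log-likelihood at beta_hat_OLS is no smaller
# than at any perturbed candidate, because OLS minimizes the RSS globally.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2 = 1.0                               # argmax over beta is free of sigma^2

def loglik(beta):
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

candidates = [beta_ols + rng.normal(scale=0.5, size=p) for _ in range(100)]
mle_is_ols = all(loglik(beta_ols) >= loglik(b) for b in candidates)

print(orthogonal and mle_is_ols)           # True
```

The comparison against random candidates is only a sanity check, of course; the derivation above is what establishes the result for all \beta.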
2. [20pts] Given that you already know \hat{\beta}_{\mathrm{MLE}}, maximize the likelihood function in equation (1) with respect to \sigma^2.

We have seen that we can exploit the orthogonality of \beta and \sigma^2 in a normal model in order to maximize the likelihood by selecting these two sets of parameters independently of each other. Thus, once we have chosen \hat{\beta}_{\mathrm{MLE}}, it suffices to select

  \hat{\sigma}^2_{\mathrm{MLE}} := \operatorname*{argmax}_{\sigma^2 \in \mathbb{R}^+} \log L(\hat{\beta}_{\mathrm{MLE}}, \sigma^2; y, X),

which gives

  \frac{\partial}{\partial \sigma^2} \left( \frac{n}{2} \log(2\pi) + \frac{n}{2} \log(\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T \hat{\beta}_{\mathrm{MLE}})^2 \right) = 0.

Then, straightforwardly, we have

  \frac{n}{2} \frac{\partial}{\partial \sigma^2} \log(\sigma^2) = -\frac{\partial}{\partial \sigma^2} \frac{1}{2\sigma^2} RSS(\hat{\beta}_{\mathrm{MLE}})
  \frac{n}{2} \frac{1}{\sigma^2} = \frac{1}{2\sigma^4} RSS(\hat{\beta}_{\mathrm{MLE}})
  n \frac{1}{\sigma^2} = \frac{1}{\sigma^4} RSS(\hat{\beta}_{\mathrm{MLE}})
  \hat{\sigma}^2 = \frac{1}{n} RSS(\hat{\beta}_{\mathrm{MLE}}).

4. Data Analysis [60pts]

You have been asked to re-analyze a data set originally published by Ericksen, Kadane and Tukey, in 1989. These authors were interested in the 1980 Census of Population and Housing. The data set represents 66 geographical areas in the United States. In each of these areas, three variables have been collected during the census:

Crime: Rate of serious crimes per 1000 inhabitants in that area.
Poverty: Percentage of inhabitants living below the poverty line.
Language: Percentage of inhabitants having difficulty speaking or writing English.

The purpose of this particular study is to predict crime on the basis of the two other variables. A scatterplot matrix showing the marginal distributions of these variables and their pairwise relationships is provided in Figure 1. Moreover, the correlation matrix between these variables is also given below:

             crime   poverty  language
crime    1.0000000 0.3691061 0.5116460
poverty  0.3691061 1.0000000 0.1515658
language 0.5116460 0.1515658 1.0000000

1. [30pts] We fit the following regression model in R:

  Crime ~ Poverty,   (2)

and obtain this summary output:
Figure 1. Scatterplot matrix for the three variables in the Census of Population and Housing.

Call:
lm(formula = crime ~ poverty, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-50.449 -13.583  -3.182  16.691  62.857

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  35.4471     9.1527   3.873 0.000255 ***
poverty       2.0503     0.6453   3.177 0.002290 **

Residual standard error: 23.31 on 64 degrees of freedom
Multiple R-squared: 0.1362, Adjusted R-squared: 0.1227
F-statistic: 10.09 on 1 and 64 DF, p-value: 0.00229

Next, we fit the following multiple regression model in R:

  Crime ~ Poverty + Language,   (3)

and produce a new summary output for this model. Can you anticipate how (a) the estimate for \beta,
(b) the t-statistic and (c) the p-value for poverty will differ, and explain why they will differ?

Here is the summary output in R, after including Language in the model:

Call:
lm(formula = crime ~ poverty + language, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-38.188 -10.638  -1.675   8.426  72.874

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  31.6251     8.0542   3.927 0.000216 ***
poverty       1.6576     0.5713   2.901 0.005114 **
language      4.7310     1.0433   4.535 2.64e-05 ***

Residual standard error: 20.4 on 63 degrees of freedom
Multiple R-squared: 0.3488, Adjusted R-squared: 0.3281
F-statistic: 16.87 on 2 and 63 DF, p-value: 1.356e-06

Since poverty and language are only weakly correlated, the estimate of \beta for poverty will not be substantially affected by the introduction of this new variable. However, because language has the highest correlation with crime, it will account for a substantial amount of variability in the observed variable, crime, thereby slightly decreasing the variance in crime explained by poverty. Altogether,
(a) [10pts] the estimate of \beta for poverty will slightly decrease,
(b) [10pts] its t-value will also slightly decrease,
(c) [10pts] and its p-value will consequently increase.

2. [30pts] We now consider the ANOVA table for the model described in equation (3):

Analysis of Variance Table

Response: crime
          Df  Sum Sq Mean Sq F value    Pr(>F)
poverty    1  5486.6  5486.6  13.180  0.000569 ***
language   1  8559.6  8559.6  20.562 2.645e-05 ***
Residuals 63 26225.5   416.3

This model is compared to another one in which we have changed the ordering of the variables, such that we fit

  Crime ~ Language + Poverty,   (4)
and produce a new ANOVA table for this model. Can you anticipate how (a) the sum of squares, (b) the F-statistic, and (c) the p-value for language will change, and justify your answers? That is, which of these quantities is likely to increase or decrease?

Here is the ANOVA output in R, after changing the order of the variables, as shown in equation (4):

Analysis of Variance Table

Response: crime
          Df  Sum Sq Mean Sq F value    Pr(>F)
language   1 10542.4 10542.4 25.3254 4.306e-06 ***
poverty    1  3503.8  3503.8  8.4171  0.005114 **
Residuals 63 26225.5   416.3

Since language and poverty are weakly correlated, changing the ordering of the variables in the ANOVA table will modify the R output only moderately. In particular, because language now enters the model first, some of the variance previously credited to poverty will now be accounted for by language. Therefore,
(a) [10pts] the sum of squares for language will slightly increase,
(b) [10pts] its F-statistic will also slightly increase,
(c) [10pts] and its p-value will consequently decrease.
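The answers to both parts of this data-analysis question can be illustrated numerically. The sketch below uses synthetic stand-ins for the census variables (not the real data; all names and coefficients are illustrative). It checks (i) the exact omitted-variable identity that governs how the poverty coefficient changes when language enters the model, and (ii) that sequential (Type I) sums of squares depend on the order of entry, while the total explained sum of squares does not:

```python
# Synthetic stand-ins for the census variables; illustrative values only.
import numpy as np

rng = np.random.default_rng(3)
n = 66
poverty = rng.normal(size=n)
language = 0.3 * poverty + rng.normal(size=n)    # weakly correlated predictor
crime = 35 + 2.0 * poverty + 4.7 * language + rng.normal(scale=5, size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def rss(X, y):
    return np.sum((y - X @ ols(X, y)) ** 2)

one = np.ones((n, 1))
X_p = np.column_stack([one, poverty])
X_l = np.column_stack([one, language])
X_pl = np.column_stack([one, poverty, language])

# (i) Omitted-variable identity: the simple-regression slope on poverty equals
# the full-model slope plus beta_language * gamma, where gamma is the slope
# from regressing language on poverty. Weak correlation => small gamma
# => small change in the poverty coefficient.
b_simple = ols(X_p, crime)[1]
b_pov, b_lang = ols(X_pl, crime)[1:]
gamma = ols(X_p, language)[1]
identity_holds = np.isclose(b_simple, b_pov + b_lang * gamma)

# (ii) Sequential SS: each predictor is credited with the drop in RSS at the
# moment it enters, so the split changes with the ordering.
ss_lang_second = rss(X_p, crime) - rss(X_pl, crime)   # poverty, then language
ss_lang_first = rss(one, crime) - rss(X_l, crime)     # language first
ss_pov_first = rss(one, crime) - rss(X_p, crime)
ss_pov_second = rss(X_l, crime) - rss(X_pl, crime)

order_matters = not np.isclose(ss_lang_first, ss_lang_second)
same_total = np.isclose(ss_pov_first + ss_lang_second,
                        ss_lang_first + ss_pov_second)
print(identity_holds and order_matters and same_total)
```

The identity in (i) and the total-SSreg invariance in (ii) are exact algebraic facts for any data set; only the size of the coefficient change depends on how weak the correlation actually is.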