MA 575 Linear Models: Cedric E. Ginestet, Boston University
Midterm Review, Week 7

1 Random Vectors

Let $a_0$ and $y$ be $n \times 1$ vectors, and let $A$ be an $n \times n$ matrix. Here, $a_0$ and $A$ are non-random, whereas $y$ is a random vector. Then,
\[ E[a_0 + Ay] = a_0 + A\,E[y]. \]
The variance/covariance matrix of the random vector $y$ is given by an outer product,
\[ \operatorname{Var}[y] := E\big[(y - E[y])(y - E[y])^T\big]. \]
In contrast to the expectation of a random vector, the variance of a linear transformation of $y$ is not linear in $A$, since we have
\[ \operatorname{Var}[a_0 + Ay] = A \operatorname{Var}[y] A^T. \]

2 Simple Linear Regression

2.1 Distribution of the Estimators

Recall that $\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x}$, and $\hat\beta_1 = SXY/SXX$. We have shown that these estimators are unbiased, in the sense that $E[\hat\beta_0 \mid X] = \beta_0$, and $E[\hat\beta_1 \mid X] = \beta_1$. Moreover, we have also computed the variance of the slope estimator,
\[ \operatorname{Var}[\hat\beta_1 \mid X] = \frac{\sigma^2}{SXX}. \]
If, in addition, we assume that the errors are iid draws from a normal distribution, $N(0, \sigma^2)$, we obtain the following distribution for this estimator,
\[ \hat\beta_1 \mid X \sim N\!\left(\beta_1, \frac{\sigma^2}{SXX}\right). \tag{1} \]
Finally, we have also considered the following sample estimator of the error variance for simple regression,
\[ \hat\sigma^2 := \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \tag{2} \]
where $\hat{y}_i := \hat\beta_0 + \hat\beta_1 x_i$ are the fitted values.
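As a quick numerical check (not part of the original notes), the following Python sketch simulates repeated draws from a simple linear regression with a fixed design, and compares the Monte Carlo mean and variance of $\hat\beta_1$ against the targets in Equation (1); all parameter values and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(575)

# Fixed design and true parameters (illustrative values).
n, beta0, beta1, sigma = 50, 1.0, 2.0, 0.5
x = rng.uniform(0, 10, size=n)
SXX = np.sum((x - x.mean()) ** 2)

def fit_slr(y, x):
    """OLS estimates for simple linear regression."""
    SXY = np.sum((x - x.mean()) * (y - y.mean()))
    b1 = SXY / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Monte Carlo check of E[b1|X] = beta1 and Var[b1|X] = sigma^2 / SXX.
b1_draws = []
for _ in range(5000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1_draws.append(fit_slr(y, x)[1])
b1_draws = np.asarray(b1_draws)

print("mean of b1:", b1_draws.mean(), "(target:", beta1, ")")
print("var of b1: ", b1_draws.var(), "(target:", sigma**2 / SXX, ")")
```

Holding the design $x$ fixed across replications matches the conditioning on $X$ in the displayed formulas.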
2.2 t-tests for Regression Coefficients

Using these relationships, we can construct a t-test for the following null hypothesis,
\[ H_0: \beta_1 = 0, \qquad H_1: \beta_1 \neq 0. \]
Here, we wish to test whether this particular regression coefficient is equal to zero. Statistical inference can then be conducted by observing that, under our distributional assumption on the error terms, we have
\[ t_1 := \frac{\hat\beta_1}{\widehat{se}(\hat\beta_1)} \sim t(n-2). \]

2.3 The F-test

We consider the difference
\[ SSreg := RSS_1(\hat\beta_0) - RSS_2(\hat\beta_0, \hat\beta_1). \]
The $F$-test for regression is defined using the following formula for comparing model $M_1$ with model $M_2$,
\[ F := \frac{(RSS_1 - RSS_2)/(p_2 - p_1)}{RSS_2/(n - p_2)}. \]
For simple regression, this gives
\[ F := \frac{(RSS_1 - RSS_2)/1}{RSS_2/(n-2)}. \]
This formula can be re-written in this manner,
\[ F := \frac{(SYY - RSS_2)/1}{\hat\sigma^2} = \frac{SSreg}{\hat\sigma^2}. \tag{3} \]
That is, the $F$-statistic is simply defined as a re-scaled version of the difference $SSreg := SYY - RSS$. Therefore, we are here interested in conducting the following hypothesis test,
\[ H_0: E[Y \mid X = x] = \beta_0, \qquad H_1: E[Y \mid X = x] = \beta_0 + \beta_1 x. \]
That is, we wish to test whether $E[Y \mid X = x]$ is constant as $x$ varies. If the error terms are additionally assumed to be iid realizations from a normal distribution, then it can be shown that the $F$-statistic in equation (3) follows an $F$-distribution, denoted
\[ F \sim F(p_2 - p_1,\, n - p_2), \]
which follows from the fact that we are here considering a ratio of two independent random variables that both have a $\chi^2$-distribution, with respective degrees of freedom $p_2 - p_1$ and $n - p_2$.
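The sketch below (again illustrative, not from the original notes; it assumes numpy and scipy are available) computes both statistics on simulated data and confirms the familiar fact that, for simple regression, $t_1^2 = F$, so the two tests of $H_0: \beta_1 = 0$ agree exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=n)

# OLS fit for simple regression.
SXX = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / SXX
b0 = y.mean() - b1 * x.mean()

RSS2 = np.sum((y - b0 - b1 * x) ** 2)   # full model: beta0, beta1
RSS1 = np.sum((y - y.mean()) ** 2)      # null model: intercept only (= SYY)
sigma2_hat = RSS2 / (n - 2)

# t-test for H0: beta1 = 0.
t1 = b1 / np.sqrt(sigma2_hat / SXX)
p_t = 2 * stats.t.sf(abs(t1), df=n - 2)

# F-test comparing the two nested models, as in equation (3).
F = ((RSS1 - RSS2) / 1) / (RSS2 / (n - 2))
p_F = stats.f.sf(F, dfn=1, dfd=n - 2)

print(f"t = {t1:.3f}, t^2 = {t1**2:.3f}, F = {F:.3f}")  # t^2 equals F
print(f"p-values: t-test {p_t:.3g}, F-test {p_F:.3g}")  # identical
```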
2.4 Coefficient of Determination ($R^2$)

The coefficient of determination measures the proportion of variance explained. From the definition of SSreg, we have
\[ SSreg = RSS(\hat\beta_0) - RSS(\hat\beta_0, \hat\beta_1). \]
Dividing both sides by SYY, we obtain
\[ \frac{SSreg}{SYY} = \frac{RSS(\hat\beta_0)}{SYY} - \frac{RSS(\hat\beta_0, \hat\beta_1)}{SYY}, \]
where $SYY := RSS(\hat\beta_0)$. This simplifies to give the coefficient of determination, or $R^2$,
\[ R^2 := \frac{SSreg}{SYY} = 1 - \frac{RSS(\hat\beta_0, \hat\beta_1)}{SYY}. \]
The $F$-statistic and $R^2$ have identical numerators, but different denominators,
\[ F = \frac{SSreg}{RSS(\hat\beta_0, \hat\beta_1)/(n-2)}, \qquad R^2 = \frac{SSreg}{RSS(\hat\beta_0)}. \]
Since we have more parameters in $RSS_2$ than in $RSS_1$, it follows that we necessarily have $RSS_1 \geq RSS_2$; and therefore $R^2$ lies between 0 and 1.

2.5 MSE Decomposition

The MSE combines the previous two criteria, on the unbiasedness and the variance of $\hat\beta$, through the following decomposition:
\[ E[(\hat\beta - \beta)^2 \mid X] = E[(\hat\beta - E[\hat\beta \mid X] + E[\hat\beta \mid X] - \beta)^2 \mid X] \]
\[ = E\big[(\hat\beta - E[\hat\beta \mid X])^2 \mid X\big] + 2E\big[(\hat\beta - E[\hat\beta \mid X])(E[\hat\beta \mid X] - \beta) \mid X\big] + E\big[(E[\hat\beta \mid X] - \beta)^2 \mid X\big]. \]
Here, the cross-product can be seen to cancel out: since the second factor in this cross-product does not depend on $Y$, it follows that
\[ E[(\hat\beta - E[\hat\beta \mid X])(E[\hat\beta \mid X] - \beta) \mid X] = (E[\hat\beta \mid X] - \beta)\,E[(\hat\beta - E[\hat\beta \mid X]) \mid X] = (E[\hat\beta \mid X] - \beta)(E[\hat\beta \mid X] - E[\hat\beta \mid X]) = 0. \]
Thus, the MSE admits the following decomposition, into a variance and a bias term:
\[ \operatorname{MSE}(\hat\beta, \beta) = \operatorname{Var}[\hat\beta \mid X] + b^2(\hat\beta), \]
where the (squared) bias of $\hat\beta$ is defined as follows,
\[ b^2(\hat\beta) := (E[\hat\beta \mid X] - \beta)^2. \]
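Before moving on to multiple regression, here is a brief numerical check (my addition, with illustrative simulated data) that the two expressions for $R^2$ from Section 2.4 agree, and that the result lies in $[0, 1]$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, size=n)

SXX = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / SXX
b0 = y.mean() - b1 * x.mean()

SYY = np.sum((y - y.mean()) ** 2)      # RSS of the intercept-only model
RSS = np.sum((y - b0 - b1 * x) ** 2)   # RSS of the full model
SSreg = SYY - RSS

# The two expressions for R^2 agree, and R^2 lies in [0, 1].
print("R^2 via SSreg/SYY:   ", SSreg / SYY)
print("R^2 via 1 - RSS/SYY: ", 1 - RSS / SYY)
```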
3 Multiple Regression

3.1 The Model

Multiple linear regression (MLR) is defined in the following manner,
\[ y_i = \sum_{j=0}^{p} x_{ij}\beta_j + e_i, \qquad i = 1, \ldots, n, \]
which may then be reformulated, using linear algebra and letting $p' := p + 1$, as
\[ y = X\beta + e, \]
where $y$ and $e$ are $(n \times 1)$ vectors, $X$ is an $(n \times p')$ matrix, and $\beta$ is a $(p' \times 1)$ vector. In addition to the standard OLS assumptions for simple linear regression, we will also assume that $X$ has full rank, $\operatorname{rank}(X) = p'$. The OLS estimator can be defined as the vector of $\beta_j$'s that minimizes the RSS,
\[ \hat\beta := \operatorname*{argmin}_{\beta \in \mathbb{R}^{p'}} RSS(\beta), \]
which takes the form
\[ \hat\beta = (X^T X)^{-1} X^T y. \]

3.2 Hat Matrix

The predicted values $\hat{y}$ can then be written as
\[ \hat{y} = X\hat\beta = X(X^T X)^{-1} X^T y =: Hy. \]
Similarly, the residuals can also be expressed as a function of $H$,
\[ \hat{e} := y - \hat{y} = y - Hy = (I - H)y, \]
with $I$ denoting the $n \times n$ identity matrix, and where again the residuals can be seen to be a linear function of the observed values, $y$. In summary, we therefore have $\hat{y} = Hy$ and $\hat{e} = (I - H)y$. Recall that $H$ is idempotent and symmetric, such that $HH = H$ and $H = H^T$, respectively.

3.3 ANOVA Table

For a model including an intercept, the total sum of squares (TSS), or SYY, can be expanded in the following manner, using a given vector of predicted values $\hat{y}$ for some target model:
\[ SYY := \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + 2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \]
where the cross-term can be re-written in matrix notation, such that
\[ (\hat{y} - 1\bar{y})^T (y - \hat{y}) = \hat{y}^T (y - \hat{y}) - (1\bar{y})^T (y - \hat{y}) = \hat{y}^T \hat{e} - (1\bar{y})^T \hat{e} = 0, \]
where we have used the fact that $\sum_{i=1}^{n} \hat{e}_i = 0$, which can be verified as an exercise; note also that $\hat{y}^T \hat{e} = y^T H^T (I - H) y = 0$, by the symmetry and idempotence of $H$. Therefore, we obtain the classical variance partitioning for multiple regression:
\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SYY = SSreg + RSS, \]
with the corresponding degrees of freedom partitioned as $n - 1 = (p' - 1) + (n - p')$.
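The following sketch (my addition, with an illustrative simulated design) forms the hat matrix explicitly, checks its idempotence and symmetry, and verifies the variance partitioning $SYY = SSreg + RSS$. Forming $H$ explicitly is fine at this scale; in practice one would work through a QR factorization instead.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 3                      # p predictors, so p' = p + 1 columns with intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 1.0, size=n)

# Hat matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
e_hat = y - y_hat

print("H idempotent:", np.allclose(H @ H, H))
print("H symmetric: ", np.allclose(H, H.T))
print("residuals sum to zero:", np.isclose(e_hat.sum(), 0))

# Variance partitioning: SYY = SSreg + RSS.
SYY = np.sum((y - y.mean()) ** 2)
SSreg = np.sum((y_hat - y.mean()) ** 2)
RSS = np.sum(e_hat ** 2)
print("SYY - (SSreg + RSS):", SYY - (SSreg + RSS))
```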
This provides a particularly transparent way of allocating the different degrees of freedom to each variance component. We can then construct an analysis of variance table for this model, as described in Table 1.

Table 1. Analysis of Variance Table.

  Source        df        SS       MS^a               F^b                    p-value
  Regression    p' - 1    SSreg    SSreg/(p' - 1)     MSreg/\hat\sigma^2     P(F > MSreg/\hat\sigma^2)
  Residual      n - p'    RSS      RSS/(n - p')
  Total         n - 1     SYY      SYY/(n - 1)

  ^a Here, let MSreg := SSreg/(p' - 1), and \hat\sigma^2 := RSS/(n - p'), as previously.
  ^b The F-statistic satisfies F ~ F(p' - 1, n - p'), if, in addition, e_i ~iid N(0, \sigma^2).

The $F$-statistic described in Table 1 can then be used to test the following null hypothesis,
\[ H_0: E[Y \mid X = x] = \beta_0, \qquad H_1: E[Y \mid X = x] = x^T \beta. \]
The fact that we obtain an $F$-distribution depends on (i) the normality of the error terms, and (ii) the linearity of the modeling assumptions under both $H_0$ and $H_1$. Indeed, linearity is here required in order to derive a ratio of two independent $\chi^2$-distributed random variables.
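As a minimal sketch (my addition; the simulated design is illustrative and the p-value uses scipy's F survival function), the code below assembles the quantities in Table 1 and carries out the overall F-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 60, 3                      # p' = p + 1 with the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

SYY = np.sum((y - y.mean()) ** 2)
RSS = np.sum((y - y_hat) ** 2)
SSreg = SYY - RSS

df_reg, df_res = p, n - (p + 1)   # p' - 1 and n - p'
MSreg = SSreg / df_reg
sigma2_hat = RSS / df_res

F = MSreg / sigma2_hat
p_value = stats.f.sf(F, dfn=df_reg, dfd=df_res)

# Print the ANOVA table, row by row.
print(f"{'Source':<11}{'df':>5}{'SS':>12}{'MS':>12}")
print(f"{'Regression':<11}{df_reg:>5}{SSreg:>12.3f}{MSreg:>12.3f}   F = {F:.3f}, p = {p_value:.3g}")
print(f"{'Residual':<11}{df_res:>5}{RSS:>12.3f}{sigma2_hat:>12.3f}")
print(f"{'Total':<11}{n - 1:>5}{SYY:>12.3f}{SYY / (n - 1):>12.3f}")
```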
4 Maximum Likelihood

4.1 Probabilistic Model

For some set of independent observations $(y_i, x_i)$, with $i = 1, \ldots, n$, we assume the following probabilistic model,
\[ y_i \stackrel{ind}{\sim} N(x_i^T \beta, \sigma^2), \qquad i = 1, \ldots, n. \]
The likelihood function for this data set, parametrized by $(\beta, \sigma^2)$, is then defined as a product of densities,
\[ L(\beta, \sigma^2; y, X) := \prod_{i=1}^{n} p(y_i \mid x_i, \beta, \sigma^2). \]
In the case of multiple regression, the definition of the normal distribution gives the following product,
\[ L(\beta, \sigma^2; y, X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (y_i - x_i^T \beta)^2 \right\}. \]
Intuitively, the maximum likelihood estimator (MLE) is defined as the parameter value for which the data sample is most likely. For a linear model such as multiple regression, the set of parameters whose values need to be optimized is composed of the vector of coefficients $\beta$ and the variance $\sigma^2$, so that the MLE is a vector of the form $\hat\theta_{MLE} := (\hat\beta_0, \ldots, \hat\beta_p, \hat\sigma^2)$.

4.2 Estimator of the Variance

We have seen that we can exploit the orthogonality of $\beta$ and $\sigma^2$ in a normal model, in order to maximize the likelihood by selecting these two sets of parameters independently of each other. Thus, once we have chosen $\hat\beta_{MLE}$, it suffices to select
\[ \hat\sigma^2_{MLE} := \operatorname*{argmax}_{\sigma^2 \in \mathbb{R}^+} \log L(\hat\beta_{MLE}, \sigma^2; y, X), \]
which, equivalently, sets the derivative of the negative log-likelihood with respect to $\sigma^2$ to zero,
\[ \frac{\partial}{\partial \sigma^2} \left( \frac{n}{2}\log(2\pi) + \frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \hat\beta_{MLE})^2 \right) = 0. \]
This can be readily solved in order to obtain
\[ \hat\sigma^2_{MLE} = \frac{1}{n} RSS(\hat\beta_{MLE}), \]
which is a biased estimate of the true variance, $\sigma^2$. By contrast, the OLS estimator for this parameter is
\[ \hat\sigma^2_{OLS} := \frac{1}{n - p'} RSS(\hat\beta_{OLS}) = \frac{n}{n - p'} \hat\sigma^2_{MLE}, \]
which can be shown to be unbiased. In practice, we tend to favor the OLS estimator, as the MLE for $\sigma^2$ under-estimates the variance of the residuals, which can lead to spurious statistical inference on the $\beta_j$'s.
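To make the bias concrete, this closing sketch (my addition; a small $n$ is chosen deliberately to make the effect visible) repeatedly simulates from the model and compares the Monte Carlo means of the two variance estimators: the MLE concentrates around $\frac{n - p'}{n}\sigma^2$, while the OLS estimator concentrates around $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma2 = 30, 3, 1.0            # small n makes the bias visible
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
p_prime = p + 1

mle, ols = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(0, np.sqrt(sigma2), size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    mle.append(rss / n)              # biased: E = (n - p')/n * sigma^2
    ols.append(rss / (n - p_prime))  # unbiased

print("true sigma^2:        ", sigma2)
print("mean of MLE estimate:", np.mean(mle), "(theory:", (n - p_prime) / n * sigma2, ")")
print("mean of OLS estimate:", np.mean(ols))
```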