The Linear Regression Model


1 The Linear Regression Model

Carlo Favero

2 OLS

To illustrate how estimation can be performed to derive conditional expectations, consider the following general representation of the model of interest:

$$
y = X\beta + \varepsilon, \qquad
y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
X = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1k} \\ \vdots & & & \vdots \\ x_{N1} & x_{N2} & \dots & x_{Nk} \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{pmatrix}.
$$

3 OLS

The simplest way to derive estimates of the parameters of interest is the ordinary least squares (OLS) method. This method chooses values for the unknown parameters so as to minimize the magnitude of the non-observable components. In our simple bivariate case this amounts to choosing the line through the scatterplot of excess returns on each asset against the market excess returns that provides the best fit, where the best fit is obtained by minimizing the sum of squared vertical deviations of the data points from the fitted line. Define the following quantity:

$$
e(\beta) = y - X\beta,
$$

where e(β) is an (n × 1) vector. If we treat Xβ as a (conditional) prediction for y, then we can consider e(β) as a forecasting error. The sum of the squared errors is then

$$
S(\beta) = e(\beta)'e(\beta).
$$

4 OLS

The OLS method produces an estimator of β, denoted β̂, defined as follows:

$$
S(\hat{\beta}) = \min_{\beta}\; e(\beta)'e(\beta).
$$

Given β̂, we can define the associated vector of residuals ε̂ as

$$
\hat{\varepsilon} = y - X\hat{\beta}.
$$

The OLS estimator is derived by considering the necessary and sufficient conditions for β̂ to be a unique minimum of S:

1. X'ε̂ = 0;
2. rank(X) = k.

Condition 1 imposes orthogonality between the right-hand-side variables and the OLS residuals, and ensures that the residuals have an average of zero when a constant is included among the regressors. Condition 2 requires that the columns of the X matrix are linearly independent: no variable in X can be expressed as a linear combination of the other variables in X.

5 OLS

From condition 1 we derive an expression for the OLS estimates:

$$
X'\hat{\varepsilon} = X'\bigl(y - X\hat{\beta}\bigr) = X'y - X'X\hat{\beta} = 0,
$$

$$
\hat{\beta} = (X'X)^{-1}X'y.
$$
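
As a minimal numerical sketch of the formula above (assuming simulated data and the numpy library; all variable names are illustrative), the OLS estimate can be computed directly from the normal equations and the orthogonality condition X'ε̂ = 0 can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: N observations, k regressors (first column is a constant).
N, k = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.normal(scale=0.8, size=N)

# OLS estimate: beta_hat = (X'X)^{-1} X'y (solve() is preferred to an explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals are orthogonal to the regressors: X' e_hat = 0 (condition 1 in the text).
e_hat = y - X @ beta_hat
print(beta_hat)
print(X.T @ e_hat)   # numerically zero
```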

6 Properties of the OLS estimates

We have derived the OLS estimator without any assumption on the statistical structure of the data. However, the statistical structure of the data is needed to define the properties of the estimator. To illustrate them, we refer to the basic concepts of mean and variance of vector variables. Given a generic vector of variables x and its mean vector E(x):

$$
x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}, \qquad
E(x) = \begin{pmatrix} E(x_1) \\ \vdots \\ E(x_n) \end{pmatrix}.
$$

7 Properties of the OLS estimates

We define the mean matrix of outer products E(xx') as:

$$
E(xx') = \begin{pmatrix}
E(x_1^2) & E(x_1 x_2) & \dots & E(x_1 x_n) \\
E(x_2 x_1) & E(x_2^2) & \dots & E(x_2 x_n) \\
\vdots & \vdots & & \vdots \\
E(x_n x_1) & E(x_n x_2) & \dots & E(x_n^2)
\end{pmatrix}.
$$

The variance-covariance matrix of x is then defined as:

$$
\mathrm{var}(x) = E\bigl[(x - E(x))(x - E(x))'\bigr] = E(xx') - E(x)E(x)'.
$$

The variance-covariance matrix is symmetric and positive definite, by construction. Given an arbitrary vector A of dimension n, we have:

$$
\mathrm{var}(A'x) = A'\,\mathrm{var}(x)\,A.
$$

8 The first hypothesis

The first relevant hypothesis for the derivation of the statistical properties of OLS concerns the relationship between disturbances and regressors in the estimated equation. This hypothesis is constructed in two parts: first we assume that E(y_i | x_i) = x_i'β, ruling out contemporaneous correlation between residuals and regressors (note that assuming the validity of this hypothesis implies that there are no omitted variables correlated with the regressors); second we assume that the components of the available sample are independently drawn. The second assumption guarantees the equivalence between E(y_i | x_i) = x_i'β and E(y_i | x_1, ..., x_i, ..., x_n) = x_i'β. Using vector notation, we have:

$$
E(y \mid X) = X\beta, \qquad \text{which is equivalent to} \qquad E(\varepsilon \mid X) = 0. \tag{1}
$$

9 The first hypothesis

Note that hypothesis (1) is very demanding. It implies that E(ε_i | x_1, ..., x_i, ..., x_n) = 0 (i = 1, ..., n). The conditional mean is, in general, a non-linear function of (x_1, ..., x_i, ..., x_n), and (1) requires that such a function be constant at zero. Note that (1) requires each regressor to be orthogonal not only to the error term associated with the same observation (E(x_ik ε_i) = 0 for all k), but also to the error term associated with every other observation (E(x_jk ε_i) = 0 for all j ≠ i). This statement is proved by using the properties of conditional expectations.

10 The first hypothesis

Since E(ε | X) = 0 implies, from the law of iterated expectations, that E(ε) = 0, we have

$$
E(\varepsilon_i \mid x_{jk}) = E\bigl[E(\varepsilon_i \mid X) \mid x_{jk}\bigr] = 0. \tag{2}
$$

Then

$$
E(\varepsilon_i x_{jk}) = E\bigl[E(\varepsilon_i x_{jk} \mid x_{jk})\bigr] = E\bigl[x_{jk}\,E(\varepsilon_i \mid x_{jk})\bigr] = 0.
$$

11 The second and third hypothesis

The second hypothesis defines the constancy of the conditional variance of the shocks:

$$
E(\varepsilon\varepsilon' \mid X) = \sigma^2 I, \tag{3}
$$

where σ² is a constant independent of X. In the case of our data, this is a strong assumption unlikely to be met in practice. The third hypothesis is the one already introduced, which guarantees that the OLS estimator can be derived:

$$
\mathrm{rank}(X) = k. \tag{4}
$$

Under hypotheses (1)-(4) we can derive the properties of the OLS estimator.

12 Property 1: unbiasedness

The conditional expectation (with respect to X) of the OLS estimator is the vector of unknown parameters β:

$$
\hat{\beta} = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon,
$$

$$
E(\hat{\beta} \mid X) = \beta + (X'X)^{-1}X'E(\varepsilon \mid X) = \beta,
$$

by hypothesis (1).

13 Property 2: variance of OLS

The conditional variance of the OLS estimator is σ²(X'X)⁻¹:

$$
\begin{aligned}
\mathrm{var}(\hat{\beta} \mid X) &= E\bigl[(\hat{\beta} - \beta)(\hat{\beta} - \beta)' \mid X\bigr] \\
&= E\bigl[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \mid X\bigr] \\
&= (X'X)^{-1}X'E(\varepsilon\varepsilon' \mid X)X(X'X)^{-1} \\
&= (X'X)^{-1}X'\sigma^2 I\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}.
\end{aligned}
$$

14 Property 3: Gauss-Markov theorem

The OLS estimator is the most efficient in the class of linear unbiased estimators. Consider the class of linear estimators:

$$
\beta_L = Ly.
$$

This class is defined by the set of (k × n) matrices L, which are fixed when conditioning upon X: L does not depend on y. Therefore we have:

$$
E(\beta_L \mid X) = E(LX\beta + L\varepsilon \mid X) = LX\beta,
$$

and LXβ = β only if LX = I_k. Such a condition is obviously satisfied by the OLS estimator, which is obtained by setting L = (X'X)⁻¹X'. The variance of the generic estimator in the class of linear unbiased estimators is readily obtained as:

$$
\mathrm{var}(\beta_L \mid X) = E(L\varepsilon\varepsilon'L' \mid X) = \sigma^2 LL'.
$$

15 Property 3: Gauss-Markov theorem

To show that the OLS estimator is the most efficient within this class we have to show that the variance of the generic estimator in the class differs from the variance of the OLS estimator by a positive semidefinite matrix. To this aim define D = L − (X'X)⁻¹X'; LX = I requires DX = 0. Then

$$
\begin{aligned}
LL' &= \bigl[(X'X)^{-1}X' + D\bigr]\bigl[X(X'X)^{-1} + D'\bigr] \\
&= (X'X)^{-1}X'X(X'X)^{-1} + (X'X)^{-1}X'D' + DX(X'X)^{-1} + DD' \\
&= (X'X)^{-1} + DD',
\end{aligned}
$$

from which we have that

$$
\mathrm{var}(\beta_L \mid X) = \mathrm{var}(\hat{\beta} \mid X) + \sigma^2 DD',
$$

which proves the point: for any given matrix D (not necessarily square), the symmetric matrix DD' is positive semidefinite.

16 Residual Analysis

Consider the following representation:

$$
\hat{\varepsilon} = y - X\hat{\beta} = y - X(X'X)^{-1}X'y = My,
$$

where M = I_n − Q and Q = X(X'X)⁻¹X'. The (n × n) matrices M and Q have the following properties:

1. they are symmetric: M' = M, Q' = Q;
2. they are idempotent: QQ = Q, MM = M;
3. MX = 0, MQ = 0, QX = X.

17 Residual Analysis

Note that the OLS projection for y can be written as ŷ = Xβ̂ = Qy and that ε̂ = My, from which we have the known result of orthogonality between the OLS residuals and the regressors. We also have

$$
My = MX\beta + M\varepsilon = M\varepsilon,
$$

given that MX = 0. Therefore we have a very well-specified relation between the OLS residuals and the errors in the model, ε̂ = Mε, which cannot be used to derive the errors given the residuals, since the matrix M is not invertible. We can rewrite the sum of squared residuals as:

$$
S(\hat{\beta}) = \hat{\varepsilon}'\hat{\varepsilon} = \varepsilon'M'M\varepsilon = \varepsilon'M\varepsilon.
$$

S(β̂) is an obvious candidate for the construction of an estimate of σ².
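
The following sketch (simulated data, numpy; names illustrative) checks numerically the algebraic properties of M and Q listed above and the relation ε̂ = Mε:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
eps = rng.normal(size=n)
y = X @ np.array([1.0, 2.0, -1.0]) + eps

Q = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto the column space of X
M = np.eye(n) - Q                       # "residual maker"

# Symmetry, idempotency and annihilation of X
assert np.allclose(M, M.T) and np.allclose(Q, Q.T)
assert np.allclose(M @ M, M) and np.allclose(Q @ Q, Q)
assert np.allclose(M @ X, 0) and np.allclose(Q @ X, X)

# Fitted values and residuals as projections, and e_hat = M eps
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(X @ beta_hat, Q @ y)        # fitted values = Qy
assert np.allclose(y - X @ beta_hat, M @ y)    # residuals = My
assert np.allclose(M @ y, M @ eps)             # since MX = 0
```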

18 Residual Analysis

To derive an estimate of σ² from S(β̂), we introduce the concept of trace. The trace of a square matrix is the sum of all elements on its principal diagonal. The following properties are relevant:

1. given any two square matrices A and B, tr(A + B) = tr A + tr B;
2. given any two matrices A and B, tr(AB) = tr(BA);
3. the rank of an idempotent matrix is equal to its trace.

19 Residual Analysis

Using property 2 together with the fact that a scalar coincides with its trace, we have:

$$
\varepsilon'M\varepsilon = \mathrm{tr}(\varepsilon'M\varepsilon) = \mathrm{tr}(M\varepsilon\varepsilon').
$$

Now we analyse the expected value of S(β̂), conditional upon X:

$$
E\bigl[S(\hat{\beta}) \mid X\bigr] = E\bigl[\mathrm{tr}(M\varepsilon\varepsilon') \mid X\bigr]
= \mathrm{tr}\,E\bigl[M\varepsilon\varepsilon' \mid X\bigr]
= \mathrm{tr}\bigl[M\,E(\varepsilon\varepsilon' \mid X)\bigr] = \sigma^2\,\mathrm{tr}\,M.
$$

20 Residual Analysis

From properties 1 and 2 we have:

$$
\mathrm{tr}\,M = \mathrm{tr}\,I_n - \mathrm{tr}\bigl[X(X'X)^{-1}X'\bigr] = n - \mathrm{tr}\bigl[X'X(X'X)^{-1}\bigr] = n - k.
$$

Therefore, an unbiased estimate of σ² is given by s² = S(β̂)/(n − k).
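
A small Monte Carlo sketch (assumed simulated design, numpy; all values illustrative) of the unbiasedness of s² = S(β̂)/(n − k) as an estimator of σ²:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma2 = 40, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, -0.5, 0.2])

s2_draws = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2_draws.append(e @ e / (n - k))   # s^2 = S(beta_hat)/(n - k)

print(np.mean(s2_draws))   # close to sigma2 = 2.0, illustrating unbiasedness
```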

21 Residual Analysis: the R-squared

Using the result of orthogonality between the OLS projections and residuals, we can write:

$$
\mathrm{var}(y) = \mathrm{var}(\hat{y}) + \mathrm{var}(\hat{\varepsilon}),
$$

from which we can derive the following residual-based indicator of the goodness of fit:

$$
R^2 = \frac{\mathrm{var}(\hat{y})}{\mathrm{var}(y)} = 1 - \frac{\mathrm{var}(\hat{\varepsilon})}{\mathrm{var}(y)}.
$$

The information contained in R² is associated with the information contained in the standard error of the regression, which is the square root of the estimated variance of the OLS residuals.

22 Interpreting Regression Results

Interpreting regression results is not a simple exercise. We propose to split this procedure into three steps. First, introduce a measure of sampling variability and re-evaluate what you know, taking into account that the parameters are estimated and there is uncertainty surrounding your point estimates. Second, understand the relevance of the regression independently of inference on the parameters. There is an easy way to do this: suppose all parameters in the model are known and identical to the estimated values, and learn how to read them. Third, remember that each regression is run after a reduction process has been, explicitly or implicitly, implemented. The relevant question is: what happens if something went wrong in the reduction process? What are the consequences of omitting relevant information, or of including irrelevant information, in your specification?

23 Statistical Significance and Relevance

The relevance of a regression is different from the statistical significance of the estimated parameters. In fact, confusing the statistical significance of the estimated parameter describing the effect of a regressor on the dependent variable with the practical relevance of that effect is a rather common mistake in the use of the linear model. Statistical inference is a tool for estimating parameters in a probability model and assessing the amount of sampling variability. Statistics gives us an indication of what we can say about the values of the parameters in the model on the basis of our sample. The relevance of a regression is determined by the share of the unconditional variance of y that is explained by the variance of E(y | X). Measuring the size of this share is the fundamental role of R².

24 Statistical Significance of regression coefficients

Estimate the coefficients in a regression and specify a null hypothesis of interest (for example, four factors are needed to explain team performance). Derive a statistic (i.e. a quantity that is a function of the regression coefficients) whose distribution is known under the null hypothesis, and compute the observed value of the statistic. Compute p, the probability (under the null) of obtaining a value at least as extreme as the one you have observed for the statistic: p is called the p-value. Adopt a decision rule about p, call it p*, and reject the null if the observed p-value is smaller than p*. For example, if you take p* = 0.05 you reject the null every time your observed p-value is smaller than 0.05. In this case you make the call that observing an event that has very low probability under the null is an indication that the null should be rejected.

25 Statistical Significance of regression coefficients

Of course, by using this criterion you run the risk of rejecting a hypothesis when that hypothesis is true. This is called the probability of Type I error, or the size of your test. There is another risk that you run: the probability of Type II error, that is, the probability of not rejecting a null when it is false. Think about an alternative hypothesis on the coefficients: you can compute the probability with which your statistic will fall short of the cutoff point to which you associate the probability p*. That is the probability of Type II error. The power of the test is 1 − Pr(Type II error). Note that the p-value can be computed in two ways: (i) by deriving the relevant distribution under the null analytically; (ii) by simulating, via Monte Carlo or the bootstrap, the relevant distribution under the null. Using simulation makes it easy to calculate the power of your test against given alternatives.
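
The sketch below (numpy and scipy.stats; the design, sample size and the alternative value 0.3 are illustrative assumptions) estimates by simulation the size and the power of the usual t-test on a single coefficient:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps, alpha = 100, 2000, 0.05

def t_stat(beta1):
    """Simulate y = beta1*x + e and return the t-ratio for H0: beta1 = 0."""
    x = rng.normal(size=n)
    y = beta1 * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (n - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se

crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # two-sided critical value

size  = np.mean([abs(t_stat(0.0)) > crit for _ in range(reps)])   # Type I error rate
power = np.mean([abs(t_stat(0.3)) > crit for _ in range(reps)])   # 1 - Pr(Type II error)
print(size, power)   # size close to 0.05; power depends on the alternative
```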

26 Relevance of regression coefficients

Estimate the coefficients in a regression and keep them fixed at their point estimates. Run an experiment by changing the conditional mean of the dependent variable via a shock to the regressors. Assess how relevant the shock to the regressor(s) (say, one of the four factors) is in determining the dependent variable (say, team performance).

27 Inference in the Linear Regression Model

Inference in the Linear Regression Model is about designing the appropriate statistics to test hypotheses of interest on the coefficients of a linear model. We shall address this process in two steps: how to formalize the relevant hypothesis, and how to build the statistics.

28 How to formalize the relevant hypothesis

Given the general representation of the linear regression model:

$$
y = X\beta + \varepsilon,
$$

our general case of interest is that of r restrictions on the vector of parameters, with r < k. If we limit our interest to the class of linear restrictions on the coefficients, we can express them as

$$
H_0 : R\beta = r,
$$

where R is an (r × k) matrix of parameters with rank r and r is an (r × 1) vector of parameters.

29 How to formalize the relevant hypothesis

To illustrate how R and r are constructed, we consider the baseline case of the CAPM model; we want to impose the restriction β₀,ᵢ = 0 on the following specification:

$$
r_t^i - r_t^{rf} = \beta_{0,i} + \beta_{1,i}\bigl(r_t^m - r_t^{rf}\bigr) + u_{i,t}, \tag{5}
$$

$$
R\beta = r: \qquad
\begin{pmatrix} 1 & 0 \end{pmatrix}
\begin{pmatrix} \beta_{0,i} \\ \beta_{1,i} \end{pmatrix} = (0).
$$

30 How to build the statistics

To perform inference in the linear regression model, we need a further hypothesis to specify the distribution of ε conditional upon X:

$$
\varepsilon \mid X \sim N\bigl(0, \sigma^2 I\bigr), \tag{6}
$$

or, equivalently,

$$
y \mid X \sim N\bigl(X\beta, \sigma^2 I\bigr). \tag{7}
$$

Given (6) we can immediately derive the distribution of β̂ | X which, being a linear combination of a normal distribution, is also normal:

$$
\hat{\beta} \mid X \sim N\bigl(\beta, \sigma^2(X'X)^{-1}\bigr). \tag{8}
$$

31 How to build the statistics

If β̂ | X ~ N(β, σ²(X'X)⁻¹), then:

$$
\bigl(R\hat{\beta} - r\bigr) \mid X \sim N\bigl(R\beta - r,\; \sigma^2 R(X'X)^{-1}R'\bigr). \tag{9}
$$

The relevant test can be constructed by deriving the distribution of (9) under the null Rβ − r = 0. Unfortunately, using the normal distribution would require knowledge of σ², which in general is not known. Fortunately, a statistic can be built based on the OLS estimate of σ².

32 How to build the statistics

Fortunately, a statistic can be built based on the OLS estimate of σ². In fact, it can be shown that

$$
\frac{\bigl(R\hat{\beta} - r\bigr)'\bigl[R(X'X)^{-1}R'\bigr]^{-1}\bigl(R\hat{\beta} - r\bigr)}{r\,s^2} \sim F(r, T - k)
$$

under H₀, which can be used to test the relevant hypothesis. Notice that, since in the case r = 1 we have t_{T−k} = √F(1, T − k), if we are interested in testing a hypothesis on a single coefficient (say β₁) we can use the following statistic:

$$
\frac{\hat{\beta}_1 - \beta_1}{\bigl[\widehat{\mathrm{Var}}(\hat{\beta}_1)\bigr]^{1/2}} \sim t(T - k) \quad \text{under } H_0.
$$

Therefore, an immediate test of significance of a coefficient can be performed by taking the ratio of each estimated coefficient to the associated standard error.
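
A sketch of the F statistic above for a set of linear restrictions Rβ = r, with the p-value taken from the F(r, T − k) distribution (simulated data, numpy/scipy; the particular restrictions tested here, two zero restrictions, are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
T, k = 120, 4
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta = np.array([0.5, 0.0, 0.4, -0.2])
y = X @ beta + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (T - k)

# Restrictions R beta = r: here beta_1 = 0 and beta_2 = 0 (q = 2 restrictions).
R = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
r = np.zeros(2)
q = R.shape[0]

diff = R @ b - r
F = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / (q * s2)
p_value = stats.f.sf(F, q, T - k)
print(F, p_value)
```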

33 The Partitioned Regression Model

Given the linear model:

$$
y = X\beta + \varepsilon,
$$

partition X into two blocks of dimension (T × r) and (T × (k − r)), and partition β correspondingly into (β₁, β₂). The partitioned regression model can then be written as follows:

$$
y = X_1\beta_1 + X_2\beta_2 + \varepsilon.
$$

34 The Partitioned Regression Model

It is useful to derive the formula for the OLS estimator in the partitioned regression model. To obtain this result we partition the normal equations X'Xβ̂ = X'y as:

$$
\begin{pmatrix} X_1' \\ X_2' \end{pmatrix}
\begin{pmatrix} X_1 & X_2 \end{pmatrix}
\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}
= \begin{pmatrix} X_1' \\ X_2' \end{pmatrix} y,
$$

or, equivalently,

$$
\begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}
\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}
= \begin{pmatrix} X_1'y \\ X_2'y \end{pmatrix}. \tag{10}
$$

35 The Partitioned Regression Model

System (10) can be solved in two stages, by first deriving an expression for β̂₂:

$$
\hat{\beta}_2 = \bigl(X_2'X_2\bigr)^{-1}X_2'\bigl(y - X_1\hat{\beta}_1\bigr),
$$

and then substituting it into the first equation of (10) to obtain

$$
X_1'X_1\hat{\beta}_1 + X_1'X_2\bigl(X_2'X_2\bigr)^{-1}X_2'\bigl(y - X_1\hat{\beta}_1\bigr) = X_1'y,
$$

from which:

$$
\hat{\beta}_1 = \bigl(X_1'M_2X_1\bigr)^{-1}X_1'M_2\,y, \qquad
M_2 = I - X_2\bigl(X_2'X_2\bigr)^{-1}X_2'.
$$

36 The Partitioned Regression Model

Note that, as M₂ is symmetric and idempotent, we can also write:

$$
\hat{\beta}_1 = \bigl(X_1'M_2'M_2X_1\bigr)^{-1}X_1'M_2'M_2\,y,
$$

and β̂₁ can be interpreted as the vector of OLS coefficients of the regression of y on the matrix of residuals of the regression of X₁ on X₂. Thus, an OLS regression on two blocks of regressors is equivalent to two OLS regressions, each on a single block of regressors (Frisch-Waugh theorem).
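
The following sketch (simulated data, numpy; names illustrative) verifies the Frisch-Waugh result numerically: the coefficient on X₁ from the full regression coincides with the coefficient from regressing y on the residuals of the regression of X₁ on X₂:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 150
X2 = np.column_stack([np.ones(T), rng.normal(size=T)])   # block X2 (includes the constant)
X1 = 0.6 * X2[:, 1:2] + rng.normal(size=(T, 1))          # block X1, correlated with X2
y = 1.0 + 2.0 * X1[:, 0] - 0.5 * X2[:, 1] + rng.normal(size=T)

def ols(Z, y):
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

# Full regression of y on [X1, X2]
b_full = ols(np.column_stack([X1, X2]), y)

# Frisch-Waugh: regress X1 on X2, take residuals, then regress y on those residuals
M2 = np.eye(T) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
b_fw = ols(M2 @ X1, y)

print(b_full[0], b_fw[0])   # the coefficient on X1 coincides in the two computations
```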

37 The Partitioned Regression Model

Finally, consider the residuals of the partitioned model:

$$
\begin{aligned}
\hat{\varepsilon} &= y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2 \\
&= y - X_1\hat{\beta}_1 - X_2\bigl(X_2'X_2\bigr)^{-1}X_2'\bigl(y - X_1\hat{\beta}_1\bigr) \\
&= M_2 y - M_2 X_1\hat{\beta}_1 \\
&= M_2 y - M_2 X_1\bigl(X_1'M_2X_1\bigr)^{-1}X_1'M_2 y \\
&= \bigl[M_2 - M_2 X_1\bigl(X_1'M_2X_1\bigr)^{-1}X_1'M_2\bigr]y.
\end{aligned}
$$

However, we already know that ε̂ = My; therefore,

$$
M = M_2 - M_2 X_1\bigl(X_1'M_2X_1\bigr)^{-1}X_1'M_2. \tag{11}
$$

38 Testing restrictions on a subset of coefficients

In the general framework to test linear restrictions we set r = 0 and R = (I_r  0), and partition β correspondingly into (β₁, β₂). In this case the restriction Rβ − r = 0 is equivalent to β₁ = 0 in the partitioned regression model. Under H₀, X₁ has no additional explanatory power for y with respect to X₂; therefore:

$$
H_0 : \; y = X_2\beta_2 + \varepsilon, \qquad \varepsilon \mid X_1, X_2 \sim N\bigl(0, \sigma^2 I\bigr).
$$

Note that the statement

$$
y = X_2\gamma_2 + \varepsilon, \qquad \varepsilon \mid X_2 \sim N\bigl(0, \sigma^2 I\bigr),
$$

is always true under our maintained hypotheses. However, in general γ₂ ≠ β₂.

39 Testing restrictions on a subset of coefficients

To derive a statistic to test H₀, remember that the general matrix R(X'X)⁻¹R' now becomes the upper-left block of (X'X)⁻¹, which we can write as (X₁'M₂X₁)⁻¹. The statistic then takes the form

$$
\frac{\hat{\beta}_1'\bigl(X_1'M_2X_1\bigr)\hat{\beta}_1}{r\,s^2}
= \frac{y'M_2X_1\bigl(X_1'M_2X_1\bigr)^{-1}X_1'M_2\,y}{y'My}\,\frac{T-k}{r}
\sim F(r, T - k).
$$

Given (11), this statistic can be rewritten as:

$$
\frac{y'M_2y - y'My}{y'My}\,\frac{T-k}{r} \sim F(r, T - k), \tag{12}
$$

where the denominator is the sum of squared residuals of the unconstrained model, while the numerator is the difference between the sum of squared residuals of the constrained model and the sum of squared residuals of the unconstrained model.
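
A sketch of statistic (12): the F test computed from the residual sums of squares of the constrained and unconstrained regressions (simulated data with H₀ true by construction, numpy/scipy; names illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
T = 200
X2 = np.column_stack([np.ones(T), rng.normal(size=T)])   # maintained regressors
X1 = rng.normal(size=(T, 2))                              # regressors under test (r = 2)
y = X2 @ np.array([1.0, 0.5]) + rng.normal(size=T)        # DGP satisfies H0: beta_1 = 0

X = np.column_stack([X1, X2])
k, q = X.shape[1], X1.shape[1]

def rss(Z, y):
    b = np.linalg.solve(Z.T @ Z, Z.T @ y)
    e = y - Z @ b
    return e @ e

rss_u = rss(X, y)     # unconstrained: y on X1 and X2   -> y'My
rss_r = rss(X2, y)    # constrained (beta_1 = 0): y on X2 -> y'M2 y

F = ((rss_r - rss_u) / rss_u) * (T - k) / q
p_value = stats.f.sf(F, q, T - k)
print(F, p_value)
```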

40 Testing restrictions on a subset of coefficients

Consider the limit case in which r = 1 and β₁ is a scalar. The F-statistic takes the form

$$
\frac{\hat{\beta}_1^2\,\bigl(X_1'M_2X_1\bigr)}{s^2} \sim F(1, T - k) \quad \text{under } H_0,
$$

where (X₁'M₂X₁)⁻¹ is element (1, 1) of the matrix (X'X)⁻¹. Using the result on the relation between the F and the Student's t distribution:

$$
\frac{\hat{\beta}_1}{s\,\bigl(X_1'M_2X_1\bigr)^{-1/2}} \sim t(T - k) \quad \text{under } H_0.
$$

Therefore, an immediate test of significance of the coefficient can be performed by taking the ratio of each estimated coefficient to the associated standard error.

41 The Relevance of a Regression

The relevance of a regression is determined by the share of the unconditional variance of y that is explained by the variance of E(y | X). Measuring the size of this share is the fundamental role of R². A variable can be very significant in explaining the variance of E(y | X), while little of the unconditional variance of y is explained by the variance of E(y | X): statistical significance does not imply relevance.

42 The partial regression theorem

The Frisch-Waugh theorem described above is worth more consideration. The theorem tells us that any given regression coefficient in the model E(y | X) = Xβ can be computed in two different but exactly equivalent ways: 1) by regressing y on all the columns of X; 2) by first regressing the j-th column of X on all the other columns of X, computing the residuals of this regression, and then regressing y on these residuals. This result is relevant in that it clarifies that the relationships pinned down by the estimated parameters in a linear model do not describe the connection between the regressand and each regressor, but the connection between the regressand and the part of each regressor that is not explained by the other regressors.

43 What if analysis

The relevant question in this case becomes: how much will y change if I change X_i? The estimation of a single-equation linear model does not allow us to answer that question, for a number of reasons. First, estimated parameters in a linear model can only answer the question: how much will E(y | X) change if I change X? We have seen that the two questions are very different if the R² of the regression is low; in this case a change in E(y | X) may not produce any visible and relevant effect on y. Second, a regression model is a conditional expected value GIVEN X. In this sense there is no scope for changing the value of any element of X.

44 What if analysis

Any statement involving such a change requires some assumption on how the conditional expectation of y changes if X changes, and a correct analysis of this requires an assumption on the joint distribution of y and X. Simulation might require the use of the multivariate joint model even when valid estimation can be performed by concentrating only on the conditional model: strong exogeneity is a stronger requirement than weak exogeneity, which is all that is needed for the estimation of the parameters of interest.

45 What if analysis

Think of a linear model with known parameters:

$$
y = \beta_1 x_1 + \beta_2 x_2.
$$

What is, in this model, the effect on y of changing x₁ by one unit while keeping x₂ constant? Easy: β₁. Now think of the estimated linear model:

$$
y = \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{u}.
$$

Now y is different from E(y | X), and the question "what is, in this model, the effect on E(y | X) of changing x₁ by one unit while keeping x₂ constant?" does not in general make sense.

46 What if analysis

Changing x₁ while keeping x₂ unaltered implies that there is zero correlation between these variables. But the estimates β̂₁ and β̂₂ are obtained using data in which, in general, there is some correlation between x₁ and x₂. Data in which fluctuations in x₁ do not have any effect on x₂ would most likely have generated estimates different from those obtained in the estimation sample. The only valid question that can be answered using the coefficients of a linear regression is: "What is the effect on E(y | X) of changing the part of each regressor that is orthogonal to the other ones?" "What if" analysis requires simulation and, in most cases, a lower level of reduction than that used for regression analysis.

47 The semi-partial R-squared

When the columns of X are orthogonal to each other, the total R² can be exactly decomposed into the sum of the partial R² due to each regressor x_i (the partial R² of a regressor i is defined as the R² of the regression of y on x_i). This is in general not the case in applications with non-experimental data: the columns of X are correlated, and an (often large) part of the overall R² depends on the joint behaviour of the columns of X. However, it is always possible to compute the marginal contribution to the overall R² due to each regressor x_i, defined as the difference between the overall R² and the R² of the regression that includes all columns of X except x_i. This is called the semi-partial R².

48 The semi-partial R-squared

Interestingly, the semi-partial R² is a simple transformation of the t-ratio:

$$
spR^2_i = \frac{t^2_{\beta_i}\bigl(1 - R^2\bigr)}{T - k}.
$$

This result has two interesting implications. First, a quantity which we considered as just a measure of statistical reliability can lead to a measure of relevance when combined with the overall R² of the regression. Second, we can re-iterate the difference between statistical significance and relevance. Suppose you have a sample size of 10,000, you have 10 columns in X, and the t-ratio on a coefficient β_i is about 4, with an associated p-value of the order of 0.0001: very statistically significant! The derivation of the semi-partial R² tells us that the contribution of this variable to the overall R² is at most approximately 16/(10,000 − 10), that is, less than two thousandths.
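
The sketch below (simulated data, numpy; the regressor index chosen is arbitrary) computes the semi-partial R² of one regressor both directly, by dropping its column and comparing the two R², and through the t-ratio formula above; the two numbers coincide up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(7)
T, k = 300, 4
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 0.6, 0.3, -0.4]) + rng.normal(scale=2.0, size=T)

def fit(Z, y):
    b = np.linalg.solve(Z.T @ Z, Z.T @ y)
    e = y - Z @ b
    r2 = 1 - e @ e / np.sum((y - y.mean()) ** 2)
    return b, e, r2

b, e, R2 = fit(X, y)
s2 = e @ e / (T - k)
i = 2                                              # regressor whose contribution we measure
t_i = b[i] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[i, i])

# Direct computation: drop column i and look at the fall in R^2
_, _, R2_without_i = fit(np.delete(X, i, axis=1), y)
sp_r2_direct = R2 - R2_without_i

# Formula from the slide: spR2_i = t_i^2 (1 - R^2) / (T - k)
sp_r2_formula = t_i ** 2 * (1 - R2) / (T - k)
print(sp_r2_direct, sp_r2_formula)   # the two numbers coincide
```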

49 Model Mis-specification

Each specification can be interpreted as the result of a reduction process; what happens if the reduction process that has generated E(y | X) omits some relevant information? There are three general cases of mis-specification:

1. mis-specification related to the choice of variables included in the regression;
2. mis-specification related to ignoring the existence of constraints on the estimated parameters;
3. mis-specification related to wrong assumptions on the properties of the error terms.

50 Choice of variables

Under-parameterization: the estimated model omits variables included in the DGP.
Over-parameterization: the estimated model includes more variables than the DGP.

51 Under-parameterization

Given the DGP:

$$
y = X_1\beta_1 + X_2\beta_2 + \varepsilon, \tag{13}
$$

for which hypotheses (1)-(4) hold, the following model is estimated:

$$
y = X_1\beta_1 + \nu. \tag{14}
$$

The OLS estimates are given by the following expression:

$$
\hat{\beta}_1^{up} = \bigl(X_1'X_1\bigr)^{-1}X_1'y. \tag{15}
$$

52 Under-parameterization

The OLS estimates obtained by estimating the DGP are instead:

$$
\hat{\beta}_1 = \bigl(X_1'M_2X_1\bigr)^{-1}X_1'M_2\,y. \tag{16}
$$

The estimates in (16) are best linear unbiased estimators (BLUE) by construction, while the estimates in (15) are biased unless X₁ and X₂ are uncorrelated. To show this, consider:

$$
\hat{\beta}_1 = \bigl(X_1'X_1\bigr)^{-1}X_1'\bigl(y - X_2\hat{\beta}_2\bigr) \tag{17}
$$
$$
= \hat{\beta}_1^{up} - \hat{D}\hat{\beta}_2, \tag{18}
$$

where D̂ is the matrix of coefficients in the regression of X₂ on X₁ and β̂₂ is the OLS estimator obtained by fitting the DGP.

53 Illustration

Given the DGP:

$$
y = 0.5\,X_1 + 0.5\,X_2 + \varepsilon_1, \tag{19}
$$
$$
X_2 = 0.8\,X_1 + \varepsilon_2, \tag{20}
$$

the following model is estimated by OLS:

$$
y = X_1\beta_1 + \nu. \tag{21}
$$

(a) The OLS estimate of β₁ will be 0.5
(b) The OLS estimate of β₁ will be 0
(c) The OLS estimate of β₁ will be 0.9
(d) The OLS estimate of β₁ will have a mean of 0.9
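
A Monte Carlo sketch of the illustration, assuming the reconstructed DGP coefficients 0.5 and 0.5 (an assumption of this example; numpy, illustrative sample size): the under-parameterized OLS estimate of β₁ centres on 0.5 + 0.8 × 0.5 = 0.9, i.e. answer (d):

```python
import numpy as np

rng = np.random.default_rng(8)
T, reps = 200, 2000
estimates = []
for _ in range(reps):
    x1 = rng.normal(size=T)
    x2 = 0.8 * x1 + rng.normal(size=T)              # X2 = 0.8 X1 + eps2   (eq. 20)
    y = 0.5 * x1 + 0.5 * x2 + rng.normal(size=T)    # assumed DGP          (eq. 19)
    # Under-parameterized regression of y on x1 only
    estimates.append((x1 @ y) / (x1 @ x1))

print(np.mean(estimates))   # approximately 0.9 = 0.5 + 0.8*0.5
```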

54 Under-parameterization

Note that if E(y | X₁, X₂) = X₁β₁ + X₂β₂ and E(X₂ | X₁) = X₁D, then

$$
E(y \mid X_1) = X_1\beta_1 + X_1 D\beta_2 = X_1\alpha.
$$

Therefore the OLS estimator in the under-parameterized model is a biased estimator of β₁, but an unbiased estimator of α. Hence, if the objective of the model is forecasting and X₁ is more easily observed than X₂, the under-parameterized model can be safely used. On the other hand, if the objective of the model is to test specific predictions on the parameters, the use of the under-parameterized model delivers biased results.

55 Over-parameterization

Given the DGP:

$$
y = X_1\beta_1 + \varepsilon, \tag{22}
$$

for which hypotheses (1)-(4) hold, the following model is estimated:

$$
y = X_1\beta_1 + X_2\beta_2 + v. \tag{23}
$$

The OLS estimator of β₁ in the over-parameterized model is

$$
\hat{\beta}_1^{op} = \bigl(X_1'M_2X_1\bigr)^{-1}X_1'M_2\,y. \tag{24}
$$

56 Over-parameterization

By estimating the DGP, we obtain instead:

$$
\hat{\beta}_1 = \bigl(X_1'X_1\bigr)^{-1}X_1'y. \tag{25}
$$

By substituting y from the DGP, one finds that both estimators are unbiased, and the difference is now made by the variance. In fact we have:

$$
\mathrm{var}\bigl(\hat{\beta}_1^{op} \mid X_1, X_2\bigr) = \sigma^2\bigl(X_1'M_2X_1\bigr)^{-1}, \tag{26}
$$
$$
\mathrm{var}\bigl(\hat{\beta}_1 \mid X_1, X_2\bigr) = \sigma^2\bigl(X_1'X_1\bigr)^{-1}. \tag{27}
$$

57 Over-parameterization

Remember that if two matrices A and B are positive definite and A − B is positive semidefinite, then the matrix B⁻¹ − A⁻¹ is also positive semidefinite. We therefore have to show that X₁'X₁ − X₁'M₂X₁ is a positive semidefinite matrix. Such a result is almost immediate:

$$
X_1'X_1 - X_1'M_2X_1 = X_1'\bigl(I - M_2\bigr)X_1 = X_1'Q_2X_1 = X_1'Q_2'Q_2X_1,
$$

where Q₂ = X₂(X₂'X₂)⁻¹X₂'. We conclude that over-parameterization impacts the efficiency of the estimators and the power of tests of hypotheses.
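
A numerical sketch (one simulated design, numpy; names and values illustrative) of the efficiency loss: the difference between the variance formulas (26) and (27) is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(9)
T, sigma2 = 100, 1.0
X1 = rng.normal(size=(T, 2))
X2 = 0.5 * X1 @ np.array([[1.0], [1.0]]) + rng.normal(size=(T, 1))  # irrelevant but correlated with X1

M2 = np.eye(T) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)

var_overparam = sigma2 * np.linalg.inv(X1.T @ M2 @ X1)   # eq. (26)
var_true      = sigma2 * np.linalg.inv(X1.T @ X1)        # eq. (27)

diff = var_overparam - var_true
print(np.linalg.eigvalsh(diff))   # all eigenvalues >= 0: the difference is psd
```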

58 Estimation under linear constraints

The estimated model is the linear model analysed up to now:

$$
y = X\beta + \varepsilon,
$$

while the DGP is instead:

$$
y = X\beta + \varepsilon, \qquad \text{subject to } R\beta - r = 0,
$$

where the constraints are expressed using the so-called implicit form.

59 Estimation under linear constraints

A useful alternative way of expressing constraints, known as the explicit form, is due to Sargan (1988):

$$
\beta = S\theta + s,
$$

where S is a (k × (k − r)) matrix of rank k − r and s is a (k × 1) vector. To show how constraints are specified in the two alternative forms, let us consider the case of the constraint β₁ = β₂ on the following specification:

$$
\ln y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i. \tag{28}
$$

60 Estimation under linear constraints

Using Rβ − r = 0:

$$
\begin{pmatrix} 0 & 1 & -1 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} = (0),
$$

while using β = Sθ + s:

$$
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}.
$$

In practice the constraints in the explicit form are written by considering θ as the vector of free parameters.

61 Estimation under linear constraints

Note that there is no unique way of expressing constraints in the explicit form; in our case the same constraint can also be imposed as

$$
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_2 \end{pmatrix}.
$$

As the two forms must be equivalent, Rβ − r = 0 and RSθ + Rs − r = 0 must hold for any θ, which implies:

1. RS = 0;
2. Rs − r = 0.

62 The restricted least squares (RLS) estimator

To construct the RLS estimator, substitute the constraint into the original model to obtain:

$$
y - Xs = XS\theta + \varepsilon. \tag{29}
$$

Equation (29) is equivalent to:

$$
y^{*} = X^{*}\theta + \varepsilon, \tag{30}
$$

where y* = y − Xs and X* = XS. Note that the transformed model features the same residuals as the original model; therefore, if the standard hypotheses hold for the original model, they also hold for the transformed one.

63 Application

Given the DGP:

$$
y = X_1\beta_1 + X_2\beta_1 + \varepsilon, \tag{31}
$$

the RLS estimator will be obtained by regressing y on

(a) (X₁ + X₂)
(b) (X₁ − X₂)
(c) (X₁ / X₂)
(d) X₁ and X₂

64 The restricted least squares (RLS) estimator

We apply OLS to the transformed model to obtain:

$$
\hat{\theta} = \bigl(X^{*\prime}X^{*}\bigr)^{-1}X^{*\prime}y^{*}
= \bigl(S'X'XS\bigr)^{-1}S'X'\bigl(y - Xs\bigr). \tag{32}
$$

From (32) the RLS estimator is easily obtained by applying the transformation β̂_rls = Sθ̂ + s. Similarly, the variance of the RLS estimator is easily obtained as:

$$
\mathrm{var}\bigl(\hat{\theta} \mid X\bigr) = \sigma^2\bigl(X^{*\prime}X^{*}\bigr)^{-1} = \sigma^2\bigl(S'X'XS\bigr)^{-1},
$$
$$
\mathrm{var}\bigl(\hat{\beta}_{rls} \mid X\bigr) = \mathrm{var}\bigl(S\hat{\theta} + s \mid X\bigr)
= S\,\mathrm{var}\bigl(\hat{\theta} \mid X\bigr)S' = \sigma^2 S\bigl(S'X'XS\bigr)^{-1}S'.
$$
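
A sketch of RLS through the explicit form β = Sθ + s for the constraint β₁ = β₂ in the three-parameter model (28) (simulated data, numpy; names illustrative). For this constraint the transformed regression amounts to regressing y on the constant and (x₁ + x₂), which is answer (a) in the application above:

```python
import numpy as np

rng = np.random.default_rng(10)
T = 150
x1, x2 = rng.normal(size=T), rng.normal(size=T)
X = np.column_stack([np.ones(T), x1, x2])
y = X @ np.array([1.0, 0.7, 0.7]) + rng.normal(size=T)   # DGP satisfies beta_1 = beta_2

# Explicit form of the constraint beta_1 = beta_2: beta = S theta + s
S = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
s = np.zeros(3)

X_star = X @ S                     # transformed regressors: constant and (x1 + x2)
y_star = y - X @ s                 # transformed dependent variable (unchanged since s = 0)
theta_hat = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)   # eq. (32)
beta_rls = S @ theta_hat + s

print(beta_rls)   # second and third elements are identical by construction
```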

65 The restricted least squares (RLS) estimator

We can now discuss the properties of OLS and RLS in the case of a DGP with constraints.

Unbiasedness. Under the assumed DGP, both estimators are unbiased, since this property depends on the validity of hypotheses (1)-(4), which is not affected by the imposition of constraints on the parameters.

Efficiency. Obviously, if we interpret RLS as the OLS estimator of the transformed model (32), we immediately derive the result that RLS is the most efficient estimator, as the hypotheses for the validity of the Gauss-Markov theorem are satisfied when OLS is applied to (32). Note that by setting L = (X'X)⁻¹X' in the context of the transformed model, we do not generally obtain OLS but an estimator whose conditional variance with respect to X coincides with the conditional variance of the OLS estimator.

66 The restricted least squares (RLS) estimator

We support this intuition with a formal argument:

$$
\mathrm{var}\bigl(\hat{\beta} \mid X\bigr) - \mathrm{var}\bigl(\hat{\beta}_{rls} \mid X\bigr)
= \sigma^2\bigl(X'X\bigr)^{-1} - \sigma^2 S\bigl(S'X'XS\bigr)^{-1}S'.
$$

Define A as:

$$
A = \bigl(X'X\bigr)^{-1} - S\bigl(S'X'XS\bigr)^{-1}S'.
$$

Given that

$$
\begin{aligned}
A\,X'X\,A &= \bigl(X'X\bigr)^{-1} - 2S\bigl(S'X'XS\bigr)^{-1}S'
+ S\bigl(S'X'XS\bigr)^{-1}S'X'XS\bigl(S'X'XS\bigr)^{-1}S' \\
&= \bigl(X'X\bigr)^{-1} - S\bigl(S'X'XS\bigr)^{-1}S' = A,
\end{aligned}
$$

A is positive semidefinite: since A = A X'X A = (XA)'(XA), it is the product of a matrix and its transpose.

67 Heteroscedasticity, Autocorrelation, and the GLS estimator

Let us reconsider the single-equation model and generalize it to the case in which the hypotheses of diagonality and constancy of the conditional variance-covariance matrix of the residuals do not hold:

$$
y = X\beta + \varepsilon, \qquad \varepsilon \sim \text{n.d.}\bigl(0, \sigma^2\Omega\bigr), \tag{33}
$$

where Ω is a (T × T) symmetric and positive definite matrix. When the OLS method is applied to model (33), it delivers estimators which are consistent but not efficient; moreover, the traditional formula for the variance-covariance matrix of the OLS estimator, σ²(X'X)⁻¹, is wrong and leads to incorrect inference.
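
As a sketch of the estimator this setting leads to (the standard GLS formula β̂_GLS = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y, which is not derived on this slide), the code below compares OLS and GLS on an illustrative heteroscedastic design with a known Ω (numpy; all names and values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(11)
T = 200
x = rng.uniform(1.0, 3.0, size=T)
X = np.column_stack([np.ones(T), x])
Omega = np.diag(x ** 2)               # assumed known heteroscedasticity: Var(eps_t) proportional to x_t^2
eps = rng.normal(size=T) * x
y = X @ np.array([1.0, 0.5]) + eps

# OLS: consistent but inefficient; sigma^2 (X'X)^{-1} is the wrong variance formula here
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# GLS: beta_gls = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Omega_inv = np.linalg.inv(Omega)
b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

print(b_ols, b_gls)
```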


More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Instrumental Variables

Instrumental Variables Università di Pavia 2010 Instrumental Variables Eduardo Rossi Exogeneity Exogeneity Assumption: the explanatory variables which form the columns of X are exogenous. It implies that any randomness in the

More information

3. For a given dataset and linear model, what do you think is true about least squares estimates? Is Ŷ always unique? Yes. Is ˆβ always unique? No.

3. For a given dataset and linear model, what do you think is true about least squares estimates? Is Ŷ always unique? Yes. Is ˆβ always unique? No. 7. LEAST SQUARES ESTIMATION 1 EXERCISE: Least-Squares Estimation and Uniqueness of Estimates 1. For n real numbers a 1,...,a n, what value of a minimizes the sum of squared distances from a to each of

More information

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is

More information

Föreläsning /31

Föreläsning /31 1/31 Föreläsning 10 090420 Chapter 13 Econometric Modeling: Model Speci cation and Diagnostic testing 2/31 Types of speci cation errors Consider the following models: Y i = β 1 + β 2 X i + β 3 X 2 i +

More information

Answers to Problem Set #4

Answers to Problem Set #4 Answers to Problem Set #4 Problems. Suppose that, from a sample of 63 observations, the least squares estimates and the corresponding estimated variance covariance matrix are given by: bβ bβ 2 bβ 3 = 2

More information

The Finite Sample Properties of the Least Squares Estimator / Basic Hypothesis Testing

The Finite Sample Properties of the Least Squares Estimator / Basic Hypothesis Testing 1 The Finite Sample Properties of the Least Squares Estimator / Basic Hypothesis Testing Greene Ch 4, Kennedy Ch. R script mod1s3 To assess the quality and appropriateness of econometric estimators, we

More information

STAT 540: Data Analysis and Regression

STAT 540: Data Analysis and Regression STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State

More information

Quick Review on Linear Multiple Regression

Quick Review on Linear Multiple Regression Quick Review on Linear Multiple Regression Mei-Yuan Chen Department of Finance National Chung Hsing University March 6, 2007 Introduction for Conditional Mean Modeling Suppose random variables Y, X 1,

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna November 23, 2013 Outline Introduction

More information

Spatial Econometrics

Spatial Econometrics Spatial Econometrics Lecture 5: Single-source model of spatial regression. Combining GIS and regional analysis (5) Spatial Econometrics 1 / 47 Outline 1 Linear model vs SAR/SLM (Spatial Lag) Linear model

More information

Lecture 24: Weighted and Generalized Least Squares

Lecture 24: Weighted and Generalized Least Squares Lecture 24: Weighted and Generalized Least Squares 1 Weighted Least Squares When we use ordinary least squares to estimate linear regression, we minimize the mean squared error: MSE(b) = 1 n (Y i X i β)

More information

Regression. ECO 312 Fall 2013 Chris Sims. January 12, 2014

Regression. ECO 312 Fall 2013 Chris Sims. January 12, 2014 ECO 312 Fall 2013 Chris Sims Regression January 12, 2014 c 2014 by Christopher A. Sims. This document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License What

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Multiple Regression Analysis. Part III. Multiple Regression Analysis

Multiple Regression Analysis. Part III. Multiple Regression Analysis Part III Multiple Regression Analysis As of Sep 26, 2017 1 Multiple Regression Analysis Estimation Matrix form Goodness-of-Fit R-square Adjusted R-square Expected values of the OLS estimators Irrelevant

More information

14 Multiple Linear Regression

14 Multiple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in

More information

The outline for Unit 3

The outline for Unit 3 The outline for Unit 3 Unit 1. Introduction: The regression model. Unit 2. Estimation principles. Unit 3: Hypothesis testing principles. 3.1 Wald test. 3.2 Lagrange Multiplier. 3.3 Likelihood Ratio Test.

More information

CHAPTER 6: SPECIFICATION VARIABLES

CHAPTER 6: SPECIFICATION VARIABLES Recall, we had the following six assumptions required for the Gauss-Markov Theorem: 1. The regression model is linear, correctly specified, and has an additive error term. 2. The error term has a zero

More information

Chapter 11 Specification Error Analysis

Chapter 11 Specification Error Analysis Chapter Specification Error Analsis The specification of a linear regression model consists of a formulation of the regression relationships and of statements or assumptions concerning the explanator variables

More information

SEM with observed variables: parameterization and identification. Psychology 588: Covariance structure and factor models

SEM with observed variables: parameterization and identification. Psychology 588: Covariance structure and factor models SEM with observed variables: parameterization and identification Psychology 588: Covariance structure and factor models Limitations of SEM as a causal modeling 2 If an SEM model reflects the reality, the

More information