Classical Least Squares Theory
CHUNG-MING KUAN
Department of Finance & CRETA, National Taiwan University
October 18, 2014
Lecture Outline

1. The Method of Ordinary Least Squares (OLS): Simple Linear Regression; Multiple Linear Regression; Geometric Interpretations; Measures of Goodness of Fit; Example: Analysis of Suicide Rate
2. Statistical Properties of the OLS Estimator: Classical Conditions; Without the Normality Condition; With the Normality Condition
3. Hypothesis Testing: Tests for Linear Hypotheses; Power of the Tests; Alternative Interpretation of the F Test; Confidence Regions; Example: Analysis of Suicide Rate
4. Multicollinearity: Near Multicollinearity; Regression with Dummy Variables; Example: Analysis of Suicide Rate
5. Limitation of the Classical Conditions
6. The Method of Generalized Least Squares (GLS): The GLS Estimator; Stochastic Properties of the GLS Estimator; The Feasible GLS Estimator; Heteroskedasticity; Serial Correlation; Application: Linear Probability Model; Application: Seemingly Unrelated Regressions
Simple Linear Regression

Given the variable of interest y, we are interested in finding a function of another variable x that can characterize the systematic behavior of y.
- y: dependent variable or regressand
- x: explanatory variable or regressor

Specifying a linear function of x: $\alpha + \beta x$, with unknown parameters α and β. The non-systematic part is the error: $y - (\alpha + \beta x)$. Together we write
$$y = \underbrace{\alpha + \beta x}_{\text{linear function}} + \underbrace{e(\alpha, \beta)}_{\text{error}}.$$
For the specification $\alpha + \beta x$, the objective is to find the best fit of the data $(y_t, x_t)$, $t = 1, \ldots, T$.

1. Minimizing a least-squares (LS) criterion function with respect to α and β:
$$Q_T(\alpha, \beta) := \frac{1}{T}\sum_{t=1}^{T}(y_t - \alpha - \beta x_t)^2.$$
2. Minimizing a least-absolute-deviation (LAD) criterion with respect to α and β:
$$\frac{1}{T}\sum_{t=1}^{T}|y_t - \alpha - \beta x_t|.$$
3. Minimizing asymmetrically weighted absolute deviations:
$$\frac{1}{T}\Big[\theta \sum_{t:\, y_t > \alpha + \beta x_t}|y_t - \alpha - \beta x_t| + (1-\theta)\sum_{t:\, y_t < \alpha + \beta x_t}|y_t - \alpha - \beta x_t|\Big],$$
with $0 < \theta < 1$.
The OLS Estimators

The first order conditions (FOCs) of LS minimization are
$$\frac{\partial Q_T(\alpha,\beta)}{\partial \alpha} = -\frac{2}{T}\sum_{t=1}^{T}(y_t - \alpha - \beta x_t) = 0,$$
$$\frac{\partial Q_T(\alpha,\beta)}{\partial \beta} = -\frac{2}{T}\sum_{t=1}^{T}(y_t - \alpha - \beta x_t)x_t = 0.$$
The solutions are known as the ordinary least squares (OLS) estimators:
$$\hat\beta_T = \frac{\sum_{t=1}^{T}(y_t - \bar y)(x_t - \bar x)}{\sum_{t=1}^{T}(x_t - \bar x)^2}, \qquad \hat\alpha_T = \bar y - \hat\beta_T \bar x.$$
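As a quick numerical illustration of these closed-form estimators, the sketch below computes $\hat\alpha_T$ and $\hat\beta_T$ from simulated data; the sample and parameter values are hypothetical, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
x = rng.normal(size=T)                   # hypothetical regressor
y = 1.0 + 2.0 * x + rng.normal(size=T)   # simulated data with alpha = 1, beta = 2

# Closed-form OLS estimators from the FOCs above
beta_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)
```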
The estimated regression line is $\hat y = \hat\alpha_T + \hat\beta_T x$, which is the linear function evaluated at $\hat\alpha_T$ and $\hat\beta_T$; $\hat e = y - \hat y$ is the error evaluated at $\hat\alpha_T$ and $\hat\beta_T$, also known as the residual.
- The t-th fitted value of the regression line is $\hat y_t = \hat\alpha_T + \hat\beta_T x_t$. The t-th residual is $\hat e_t = y_t - \hat y_t = e_t(\hat\alpha_T, \hat\beta_T)$.
- No other linear function of the form $a + bx$ can provide a better fit of the data in terms of the sum of squared errors.
- $\hat\beta_T$ characterizes the predicted change of y given a one-unit change of x, whereas $\hat\alpha_T$ is the predicted y without (the information of) x.
- Note that the OLS method requires no assumption on the data, except that $x_t$ cannot be a constant.
Algebraic Properties

Substituting $\hat\alpha_T$ and $\hat\beta_T$ into the FOCs
$$\frac{1}{T}\sum_{t=1}^{T}(y_t - \alpha - \beta x_t) = 0, \qquad \frac{1}{T}\sum_{t=1}^{T}(y_t - \alpha - \beta x_t)x_t = 0,$$
we have the following algebraic results:
- $\sum_{t=1}^{T}\hat e_t = 0$.
- $\sum_{t=1}^{T}\hat e_t x_t = 0$.
- $\sum_{t=1}^{T} y_t = \sum_{t=1}^{T}\hat y_t$, so that $\bar y = \bar{\hat y}$.
- $\bar y = \hat\alpha_T + \hat\beta_T \bar x$; that is, the estimated regression line must pass through the point $(\bar x, \bar y)$.
Example: Analysis of Suicide Rate

Suppose we want to know how the suicide rate (s) in Taiwan can be explained by the unemployment rate (u), the GDP growth rate (g), or time (t). The suicide rate is measured per 100,000 persons.

Data: the suicide rate has s.d. 3.91; $\bar g = 5.64$ with s.d. 3.16; $\bar u = 3.09$.

Estimation results (coefficient estimates omitted):
- $s_t$ on $g_t$: $R^2 = 0.17$; on $g_{t-1}$: $R^2 = 0.21$;
- $s_t$ on $u_t$: $\bar R^2 = 0.70$; on $u_{t-1}$: $R^2 = 0.68$;
- $s_t$ on t: $R^2 = 0.44$.
Figure: Suicide rates. (a) Suicide rate and real GDP growth rate; (b) suicide rate and unemployment rate.
Multiple Linear Regression

With k regressors $x_1, \ldots, x_k$ ($x_1$ is usually the constant one):
$$y = \beta_1 x_1 + \cdots + \beta_k x_k + e(\beta_1, \ldots, \beta_k).$$
With data $(y_t, x_{t1}, \ldots, x_{tk})$, $t = 1, \ldots, T$, we can write
$$y = X\beta + e(\beta), \tag{1}$$
where $\beta = (\beta_1\ \beta_2\ \cdots\ \beta_k)'$ and
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{pmatrix}, \quad X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & & & \vdots \\ x_{T1} & x_{T2} & \cdots & x_{Tk} \end{pmatrix}, \quad e(\beta) = \begin{pmatrix} e_1(\beta) \\ e_2(\beta) \\ \vdots \\ e_T(\beta) \end{pmatrix}.$$
Least-squares criterion function:
$$Q_T(\beta) := \frac{1}{T}e(\beta)'e(\beta) = \frac{1}{T}(y - X\beta)'(y - X\beta). \tag{2}$$
The FOCs of minimizing $Q_T(\beta)$ are $-2X'(y - X\beta)/T = 0$, leading to the normal equations: $X'X\beta = X'y$.

Identification Requirement [ID-1]: X is of full column rank k.
- No column of X is a linear combination of the other columns. Intuition: X does not contain redundant information.
- When X is not of full column rank, we say there exists exact multicollinearity among the regressors.
Given [ID-1], $X'X$ is positive definite and hence invertible. The unique solution to the normal equations is known as the OLS estimator of β:
$$\hat\beta_T = (X'X)^{-1}X'y. \tag{3}$$
Under [ID-1], we have the second order condition: $\nabla^2_\beta Q_T(\beta) = 2(X'X)/T$ is p.d.

The result below requires only the identification requirement and does not depend on the statistical properties of y and X.

Theorem 3.1. Given specification (1), suppose [ID-1] holds. Then, the OLS estimator $\hat\beta_T = (X'X)^{-1}X'y$ uniquely minimizes the criterion function (2).
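A minimal numerical sketch of the matrix formula (3), using simulated data; in practice one solves the normal equations (or calls a least-squares routine) rather than inverting $X'X$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
T, k = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])  # first column is the constant
beta_true = np.array([1.0, 2.0, -0.5])                          # hypothetical parameters
y = X @ beta_true + rng.normal(size=T)

# OLS estimator: solve the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Numerically preferable equivalent: np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)
```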
The magnitude of $\hat\beta_T$ is affected by the measurement units of the dependent and explanatory variables. A larger coefficient does not imply that the associated regressor is more important. The so-called beta coefficients (see homework) do not depend on the measurement units, and hence their magnitudes are comparable.

Given $\hat\beta_T$, the vector of OLS fitted values is $\hat y = X\hat\beta_T$, and the vector of OLS residuals is $\hat e = y - \hat y = e(\hat\beta_T)$. Plugging $\hat\beta_T$ into the FOCs $X'(y - X\beta) = 0$, we have:
- $X'\hat e = 0$. When X contains a vector of ones, $\sum_{t=1}^{T}\hat e_t = 0$.
- $\hat y'\hat e = \hat\beta_T' X'\hat e = 0$.
These are all algebraic results under the OLS method.
Geometric Interpretations

Recall that $P = X(X'X)^{-1}X'$ is the orthogonal projection matrix that projects vectors onto span(X), and $I_T - P$ is the orthogonal projection matrix that projects vectors onto span(X)$^\perp$, the orthogonal complement of span(X). Thus, $PX = X$ and $(I_T - P)X = 0$.
- The vector of fitted values, $\hat y = X\hat\beta_T = X(X'X)^{-1}X'y = Py$, is the orthogonal projection of y onto span(X).
- The residual vector, $\hat e = y - \hat y = (I_T - P)y$, is the orthogonal projection of y onto span(X)$^\perp$.
- $\hat e$ is orthogonal to X, i.e., $X'\hat e = 0$, and it is also orthogonal to $\hat y$ because $\hat y$ is in span(X), i.e., $\hat y'\hat e = 0$.
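The sketch below builds P and $I_T - P$ for a small simulated design matrix and checks the stated properties numerically (a toy illustration, not part of the lecture).

```python
import numpy as np

rng = np.random.default_rng(2)
T, k = 50, 2
X = rng.normal(size=(T, k))
y = rng.normal(size=T)

P = X @ np.linalg.solve(X.T @ X, X.T)   # orthogonal projection onto span(X)
M = np.eye(T) - P                       # projection onto the orthogonal complement

y_hat = P @ y
e_hat = M @ y
print(np.allclose(X.T @ e_hat, 0))                  # X'e = 0
print(np.allclose(y_hat @ e_hat, 0))                # fitted values orthogonal to residuals
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
```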
Figure: The orthogonal projection of y onto span($x_1$, $x_2$): $Py = x_1\hat\beta_1 + x_2\hat\beta_2$, with residual $\hat e = (I - P)y$.
Theorem 3.3 (Frisch-Waugh-Lovell). Given $y = X_1\beta_1 + X_2\beta_2 + e$, the OLS estimators of $\beta_1$ and $\beta_2$ are
$$\hat\beta_{1,T} = [X_1'(I - P_2)X_1]^{-1}X_1'(I - P_2)y, \qquad \hat\beta_{2,T} = [X_2'(I - P_1)X_2]^{-1}X_2'(I - P_1)y,$$
where $P_1 = X_1(X_1'X_1)^{-1}X_1'$ and $P_2 = X_2(X_2'X_2)^{-1}X_2'$.

$\hat\beta_{1,T}$ can also be computed from regressing $(I - P_2)y$ on $(I - P_2)X_1$, where $(I - P_2)y$ and $(I - P_2)X_1$ are the residual vectors of regressing y on $X_2$ and $X_1$ on $X_2$, respectively. That is, $\hat\beta_{1,T}$ characterizes the marginal effect of $X_1$ on y, after the effect of $X_2$ is purged away. Similarly, regressing $(I - P_1)y$ on $(I - P_1)X_2$ yields $\hat\beta_{2,T}$, which characterizes the marginal effect of $X_2$ on y, after the effect of $X_1$ is purged away.
Proof: Writing $y = X_1\hat\beta_{1,T} + X_2\hat\beta_{2,T} + (I - P)y$, where $P = X(X'X)^{-1}X'$ with $X = [X_1\ X_2]$, we have
$$X_1'(I - P_2)y = X_1'(I - P_2)X_1\hat\beta_{1,T} + X_1'(I - P_2)X_2\hat\beta_{2,T} + X_1'(I - P_2)(I - P)y = X_1'(I - P_2)X_1\hat\beta_{1,T} + X_1'(I - P_2)(I - P)y.$$
We know span($X_2$) ⊆ span(X), so that span(X)$^\perp$ ⊆ span($X_2$)$^\perp$. Hence, $(I - P_2)(I - P) = I - P$, and
$$X_1'(I - P_2)y = X_1'(I - P_2)X_1\hat\beta_{1,T} + X_1'(I - P)y = X_1'(I - P_2)X_1\hat\beta_{1,T},$$
from which we obtain the expression for $\hat\beta_{1,T}$.
Some Implications of the FWL Theorem
- The OLS estimator of regressing y on $X_1$ alone is not the same as $\hat\beta_{1,T}$, unless $X_1$ and $X_2$ are orthogonal to each other.
- Regressing y on $X_1$ and $X_2$ enables us to assess the marginal effect of $X_1$ on y while controlling for the effect of $X_2$, and also the marginal effect of $X_2$ on y while controlling for the effect of $X_1$.
- When one regresses y on $X_1$ only, the estimated marginal effect of $X_1$ is affected by other relevant variables not included in the regression.
Observe that $(I - P_1)y = (I - P_1)X_2\hat\beta_{2,T} + (I - P_1)(I - P)y$.
- $(I - P_1)(I - P) = I - P$, so the residual vector of regressing $(I - P_1)y$ on $(I - P_1)X_2$ is identical to the residual vector of regressing y on $X = [X_1\ X_2]$:
$$(I - P_1)y = (I - P_1)X_2\hat\beta_{2,T} + (I - P)y.$$
- $P_1 = P_1 P$, so the orthogonal projection of y directly onto span($X_1$) (i.e., $P_1 y$) is equivalent to iterated projections of y onto span(X) and then onto span($X_1$) (i.e., $P_1 P y$). Hence,
$$(I - P_1)X_2\hat\beta_{2,T} = (I - P_1)Py = (P - P_1)y.$$
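As a check on the theorem, the toy sketch below (simulated data; all names are hypothetical) verifies that the coefficients on $X_1$ from the joint regression coincide with those from regressing the purged y on the purged $X_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = rng.normal(size=(T, 2))
X = np.hstack([X1, X2])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=T)

def ols(A, b):
    return np.linalg.lstsq(A, b, rcond=None)[0]

beta_full = ols(X, y)                          # joint regression on [X1 X2]

# Purge X2 from y and X1, then regress the residuals (FWL)
P2 = X2 @ np.linalg.solve(X2.T @ X2, X2.T)
M2 = np.eye(T) - P2
beta1_fwl = ols(M2 @ X1, M2 @ y)
print(np.allclose(beta_full[:2], beta1_fwl))   # True
```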
Figure: An illustration of the Frisch-Waugh-Lovell Theorem, showing $Py$, $P_1 y$, $(P - P_1)y$, $(I - P_1)y$, and $\hat e = (I - P)y$.
Measures of Goodness of Fit

Given $\hat y'\hat e = 0$, we have $y'y = \hat y'\hat y + \hat e'\hat e$, where $y'y$ is known as TSS (total sum of squares), $\hat y'\hat y$ is RSS (regression sum of squares), and $\hat e'\hat e$ is ESS (error sum of squares).

The non-centered coefficient of determination (or non-centered R²),
$$R^2 = \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\text{ESS}}{\text{TSS}}, \tag{4}$$
measures the proportion of the total variation of $y_t$ that can be explained by the model.
- It is invariant with respect to the measurement units of the dependent variable but not invariant with respect to constant addition.
- It is a relative measure such that $0 \le R^2 \le 1$.
- It is non-decreasing in the number of regressors. (Why?)
Centered R²

When the specification contains a constant term,
$$\underbrace{\sum_{t=1}^{T}(y_t - \bar y)^2}_{\text{centered TSS}} = \underbrace{\sum_{t=1}^{T}(\hat y_t - \bar y)^2}_{\text{centered RSS}} + \underbrace{\sum_{t=1}^{T}\hat e_t^2}_{\text{ESS}}.$$
The centered coefficient of determination (or centered R²),
$$R^2 = \frac{\text{centered RSS}}{\text{centered TSS}} = \frac{\sum_{t=1}^{T}(\hat y_t - \bar y)^2}{\sum_{t=1}^{T}(y_t - \bar y)^2} = 1 - \frac{\text{ESS}}{\text{centered TSS}},$$
measures the proportion of the total variation of $y_t$ that can be explained by the model, excluding the effect of the constant term.
- It is invariant with respect to constant addition.
- $0 \le R^2 \le 1$, and it is non-decreasing in the number of regressors.
- It may be negative when the model does not contain a constant term.
Centered R²: Alternative Interpretation

When the specification contains a constant term,
$$\sum_{t=1}^{T}(y_t - \bar y)(\hat y_t - \bar y) = \sum_{t=1}^{T}(\hat y_t - \bar y + \hat e_t)(\hat y_t - \bar y) = \sum_{t=1}^{T}(\hat y_t - \bar y)^2,$$
because $\sum_{t=1}^{T}\hat y_t\hat e_t = 0$ and $\sum_{t=1}^{T}\hat e_t = 0$. Centered R² can also be expressed as
$$R^2 = \frac{\sum_{t=1}^{T}(\hat y_t - \bar y)^2}{\sum_{t=1}^{T}(y_t - \bar y)^2} = \frac{\big[\sum_{t=1}^{T}(y_t - \bar y)(\hat y_t - \bar y)\big]^2}{\big[\sum_{t=1}^{T}(y_t - \bar y)^2\big]\big[\sum_{t=1}^{T}(\hat y_t - \bar y)^2\big]},$$
which is the squared sample correlation coefficient of $y_t$ and $\hat y_t$, also known as the squared multiple correlation coefficient.

Models for different dependent variables are not comparable in terms of R².
Adjusted R²

Adjusted R² is the centered R² adjusted for the degrees of freedom:
$$\bar R^2 = 1 - \frac{\hat e'\hat e/(T-k)}{(y'y - T\bar y^2)/(T-1)}.$$
$\bar R^2$ adds a penalty term to R²:
$$\bar R^2 = 1 - \frac{T-1}{T-k}(1 - R^2) = R^2 - \frac{k-1}{T-k}(1 - R^2),$$
where the penalty term reflects the trade-off between model complexity and model explanatory ability. $\bar R^2$ may be negative and need not be non-decreasing in k.
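A small helper implementing the two fit measures just defined; the function name and interface are illustrative.

```python
import numpy as np

def r2_stats(y, y_hat, k):
    """Centered R-squared and adjusted R-squared for a model with k
    regressors (including the constant)."""
    T = len(y)
    ess = np.sum((y - y_hat) ** 2)         # error sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)    # centered total sum of squares
    r2 = 1.0 - ess / tss
    r2_bar = 1.0 - (T - 1) / (T - k) * (1.0 - r2)
    return r2, r2_bar
```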
Example: Analysis of Suicide Rate

Q: How can the suicide rate (s) be explained by the unemployment rate (u), the GDP growth rate (g), and time (t) over the sample period?

Estimation results with $g_t$ and $u_t$ (coefficient estimates omitted):
- $s_t$ on $g_t$: $R^2 = 0.17$; on $u_t$: $R^2 = 0.70$; on $u_t$ and $g_t$: $R^2 = 0.69$.

Estimation results with $g_{t-1}$ and $u_{t-1}$:
- $s_t$ on $g_{t-1}$: $\bar R^2 = 0.21$; on $u_{t-1}$: $R^2 = 0.68$; on $u_{t-1}$ and $g_{t-1}$: $R^2 = 0.67$.
Estimation results with t but without g (coefficient estimates omitted):
- $s_t$ on $u_t$: $\bar R^2 = 0.70$; on $u_t$ and t: $R^2 = 0.69$; on $u_{t-1}$: $R^2 = 0.68$; on $u_{t-1}$ and t: $R^2 = 0.\ldots$

Estimation results with t and g:
- $s_t$ on $u_t$ and $g_t$: $R^2 = 0.69$; on $u_t$, $g_t$, and t: $\bar R^2 = 0.68$; on $u_{t-1}$ and $g_{t-1}$: $R^2 = 0.67$; on $u_{t-1}$, $g_{t-1}$, and t: $R^2 = 0.\ldots$
Estimation results with t and t² (coefficient estimates omitted):
- $s_t$ on t: $R^2 = 0.44$; on t and t²: $R^2 = 0.63$;
- on $u_t$, t, and t²: $\bar R^2 = 0.78$; on $g_t$, t, and t²: $R^2 = 0.62$; on $u_t$, $g_t$, t, and t²: $R^2 = 0.77$;
- on $u_{t-1}$, t, and t²: $R^2 = 0.75$; on $g_{t-1}$, t, and t²: $\bar R^2 = 0.64$; on $u_{t-1}$, $g_{t-1}$, t, and t²: $\bar R^2 = 0.\ldots$

As far as $\bar R^2$ is concerned, a specification with t, t², and u seems to provide a good fit of the data and a reasonable interpretation. Q: Is there any other way to determine whether a specification is good?
Classical Conditions

To derive the statistical properties of the OLS estimator, we assume:
- [A1] X is non-stochastic.
- [A2] y is a random vector such that (i) $\mathrm{IE}(y) = X\beta_o$ for some $\beta_o$; (ii) $\mathrm{var}(y) = \sigma_o^2 I_T$ for some $\sigma_o^2 > 0$.
- [A3] y is a random vector such that $y \sim N(X\beta_o, \sigma_o^2 I_T)$ for some $\beta_o$ and $\sigma_o^2 > 0$.

The specification (1) with [A1] and [A2] is known as the classical linear model, whereas (1) with [A1] and [A3] is the classical normal linear model. When $\mathrm{var}(y) = \sigma_o^2 I_T$, the elements of y are homoskedastic and (serially) uncorrelated.
Without Normality

The OLS estimator of the parameter $\sigma_o^2$ is a degrees-of-freedom-adjusted average of squared residuals:
$$\hat\sigma_T^2 = \frac{1}{T-k}\sum_{t=1}^{T}\hat e_t^2.$$

Theorem 3.4. Consider the linear specification (1).
(a) Given [A1] and [A2](i), $\hat\beta_T$ is unbiased for $\beta_o$.
(b) Given [A1] and [A2], $\hat\sigma_T^2$ is unbiased for $\sigma_o^2$.
(c) Given [A1] and [A2], $\mathrm{var}(\hat\beta_T) = \sigma_o^2(X'X)^{-1}$.
Proof: By [A1], $\mathrm{IE}(\hat\beta_T) = \mathrm{IE}[(X'X)^{-1}X'y] = (X'X)^{-1}X'\,\mathrm{IE}(y)$. [A2](i) gives $\mathrm{IE}(y) = X\beta_o$, so that $\mathrm{IE}(\hat\beta_T) = (X'X)^{-1}X'X\beta_o = \beta_o$, proving unbiasedness. Given $\hat e = (I_T - P)y = (I_T - P)(y - X\beta_o)$,
$$\mathrm{IE}(\hat e'\hat e) = \mathrm{IE}[\mathrm{trace}\big((y - X\beta_o)'(I_T - P)(y - X\beta_o)\big)] = \mathrm{IE}[\mathrm{trace}\big((y - X\beta_o)(y - X\beta_o)'(I_T - P)\big)] = \mathrm{trace}\big(\mathrm{IE}[(y - X\beta_o)(y - X\beta_o)'](I_T - P)\big) = \mathrm{trace}\big(\sigma_o^2 I_T(I_T - P)\big) = \sigma_o^2\,\mathrm{trace}(I_T - P),$$
where the fourth equality follows from [A2](ii), i.e., $\mathrm{var}(y) = \sigma_o^2 I_T$.
Proof (cont'd): As $\mathrm{trace}(I_T - P) = \mathrm{rank}(I_T - P) = T - k$, we have $\mathrm{IE}(\hat e'\hat e) = \sigma_o^2(T-k)$ and $\mathrm{IE}(\hat\sigma_T^2) = \mathrm{IE}(\hat e'\hat e)/(T-k) = \sigma_o^2$, proving (b). By [A1] and [A2](ii),
$$\mathrm{var}(\hat\beta_T) = \mathrm{var}\big((X'X)^{-1}X'y\big) = (X'X)^{-1}X'[\mathrm{var}(y)]X(X'X)^{-1} = \sigma_o^2(X'X)^{-1}X'I_T X(X'X)^{-1} = \sigma_o^2(X'X)^{-1}.$$
This establishes assertion (c).
Theorem 3.4 establishes unbiasedness of the OLS estimators $\hat\beta_T$ and $\hat\sigma_T^2$ but does not address the issue of efficiency. By Theorem 3.4(c), the elements of $\hat\beta_T$ can be more precisely estimated (i.e., with a smaller variance) when X has larger variation. To see this, consider the simple linear regression $y = \alpha + \beta x + e$; it can be verified that
$$\mathrm{var}(\hat\beta_T) = \frac{\sigma_o^2}{\sum_{t=1}^{T}(x_t - \bar x)^2}.$$
Thus, the larger the (squared) variation of $x_t$ (i.e., $\sum_{t=1}^{T}(x_t - \bar x)^2$), the smaller is the variance of $\hat\beta_T$.
The result below establishes efficiency of $\hat\beta_T$ among all unbiased estimators of $\beta_o$ that are linear in y.

Theorem 3.5 (Gauss-Markov). Given the linear specification (1), suppose that [A1] and [A2] hold. Then the OLS estimator $\hat\beta_T$ is the best linear unbiased estimator (BLUE) for $\beta_o$.

Proof: Consider an arbitrary linear estimator $\check\beta_T = Ay$, where A is a non-stochastic matrix, say, $A = (X'X)^{-1}X' + C$. Then, $\check\beta_T = \hat\beta_T + Cy$, such that
$$\mathrm{var}(\check\beta_T) = \mathrm{var}(\hat\beta_T) + \mathrm{var}(Cy) + 2\,\mathrm{cov}(\hat\beta_T, Cy).$$
By [A1] and [A2](i), $\mathrm{IE}(\check\beta_T) = \beta_o + CX\beta_o$, which is unbiased iff $CX = 0$.
Proof (cont'd): The condition $CX = 0$ implies $\mathrm{cov}(\hat\beta_T, Cy) = 0$. Thus,
$$\mathrm{var}(\check\beta_T) = \mathrm{var}(\hat\beta_T) + \mathrm{var}(Cy) = \mathrm{var}(\hat\beta_T) + \sigma_o^2 CC'.$$
This shows that $\mathrm{var}(\check\beta_T) - \mathrm{var}(\hat\beta_T)$ is the p.s.d. matrix $\sigma_o^2 CC'$, so that $\hat\beta_T$ is more efficient than any other linear unbiased estimator $\check\beta_T$.
Example: Suppose $\mathrm{IE}(y) = X_1 b_1$ and $\mathrm{var}(y) = \sigma_o^2 I_T$. Consider two specifications:
$$y = X_1\beta_1 + e,$$
with the OLS estimator $\hat b_{1,T}$, and
$$y = X\beta + e = X_1\beta_1 + X_2\beta_2 + e,$$
with the OLS estimator $\hat\beta_T = (\hat\beta_{1,T}'\ \hat\beta_{2,T}')'$. Clearly, $\hat b_{1,T}$ is the BLUE of $b_1$ with $\mathrm{var}(\hat b_{1,T}) = \sigma_o^2(X_1'X_1)^{-1}$. By the Frisch-Waugh-Lovell Theorem,
$$\mathrm{IE}(\hat\beta_{1,T}) = \mathrm{IE}\big([X_1'(I_T - P_2)X_1]^{-1}X_1'(I_T - P_2)y\big) = b_1,$$
$$\mathrm{IE}(\hat\beta_{2,T}) = \mathrm{IE}\big([X_2'(I_T - P_1)X_2]^{-1}X_2'(I_T - P_1)y\big) = 0.$$
That is, $\hat\beta_T$ is unbiased for $(b_1'\ 0')'$.
Example (cont'd):
$$\mathrm{var}(\hat\beta_{1,T}) = \mathrm{var}\big([X_1'(I_T - P_2)X_1]^{-1}X_1'(I_T - P_2)y\big) = \sigma_o^2[X_1'(I_T - P_2)X_1]^{-1}.$$
As $X_1'X_1 - X_1'(I_T - P_2)X_1 = X_1'P_2X_1$ is p.s.d., it follows that $[X_1'(I_T - P_2)X_1]^{-1} - (X_1'X_1)^{-1}$ is p.s.d. This shows that $\hat b_{1,T}$ is more efficient than $\hat\beta_{1,T}$, as it ought to be.
With Normality

Under [A3], $y \sim N(X\beta_o, \sigma_o^2 I_T)$, and the log-likelihood function of y is
$$\log L(\beta, \sigma^2) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).$$
The score vector is
$$s(\beta, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2}X'(y - X\beta) \\[4pt] -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) \end{pmatrix}.$$
Solutions to $s(\beta, \sigma^2) = 0$ are the (quasi-)maximum likelihood estimators (MLEs). Clearly, the MLE of β is the OLS estimator, and the MLE of σ² is
$$\tilde\sigma_T^2 = \frac{(y - X\hat\beta_T)'(y - X\hat\beta_T)}{T} = \frac{\hat e'\hat e}{T} \ne \hat\sigma_T^2.$$
With the normality condition on y, a lot more can be said about the OLS estimators.

Theorem 3.7. Given the linear specification (1), suppose that [A1] and [A3] hold.
(a) $\hat\beta_T \sim N\big(\beta_o, \sigma_o^2(X'X)^{-1}\big)$.
(b) $(T-k)\hat\sigma_T^2/\sigma_o^2 \sim \chi^2(T-k)$.
(c) $\hat\sigma_T^2$ has mean $\sigma_o^2$ and variance $2\sigma_o^4/(T-k)$.

Proof: For (a), we note that $\hat\beta_T$ is a linear transformation of $y \sim N(X\beta_o, \sigma_o^2 I_T)$ and hence also a normal random vector. As for (b), writing $\hat e = (I_T - P)(y - X\beta_o)$, we have
$$(T-k)\hat\sigma_T^2/\sigma_o^2 = \hat e'\hat e/\sigma_o^2 = y^{*\prime}(I_T - P)y^*,$$
where $y^* = (y - X\beta_o)/\sigma_o \sim N(0, I_T)$ by [A3].
Proof (cont'd): Let C orthogonally diagonalize $I_T - P$ such that $C'(I_T - P)C = \Lambda$. Since $\mathrm{rank}(I_T - P) = T - k$, Λ contains $T - k$ eigenvalues equal to one and k eigenvalues equal to zero. Then,
$$y^{*\prime}(I_T - P)y^* = y^{*\prime}C[C'(I_T - P)C]C'y^* = \eta'\begin{pmatrix} I_{T-k} & 0 \\ 0 & 0 \end{pmatrix}\eta,$$
where $\eta = C'y^*$. As $\eta \sim N(0, I_T)$, the $\eta_i$ are independent, standard normal random variables. It follows that
$$y^{*\prime}(I_T - P)y^* = \sum_{i=1}^{T-k}\eta_i^2 \sim \chi^2(T-k),$$
proving (b). (c) is a direct consequence of (b) and the facts that $\chi^2(T-k)$ has mean $T-k$ and variance $2(T-k)$.
Theorem 3.8. Given the linear specification (1), suppose that [A1] and [A3] hold. Then the OLS estimators $\hat\beta_T$ and $\hat\sigma_T^2$ are the best unbiased estimators (BUE) for $\beta_o$ and $\sigma_o^2$, respectively.

Proof: The Hessian matrix of the log-likelihood function is
$$H(\beta, \sigma^2) = \begin{pmatrix} -\frac{1}{\sigma^2}X'X & -\frac{1}{\sigma^4}X'(y - X\beta) \\[4pt] -\frac{1}{\sigma^4}(y - X\beta)'X & \frac{T}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)'(y - X\beta) \end{pmatrix}.$$
Under [A3], $\mathrm{IE}[s(\beta_o, \sigma_o^2)] = 0$ and
$$\mathrm{IE}[H(\beta_o, \sigma_o^2)] = \begin{pmatrix} -\frac{1}{\sigma_o^2}X'X & 0 \\ 0 & -\frac{T}{2\sigma_o^4} \end{pmatrix}.$$
Proof (cont'd): By the information matrix equality, $-\mathrm{IE}[H(\beta_o, \sigma_o^2)]$ is the information matrix. Then, its inverse,
$$-\mathrm{IE}[H(\beta_o, \sigma_o^2)]^{-1} = \begin{pmatrix} \sigma_o^2(X'X)^{-1} & 0 \\ 0 & \frac{2\sigma_o^4}{T} \end{pmatrix},$$
is the Cramér-Rao lower bound. $\mathrm{var}(\hat\beta_T)$ achieves this lower bound (the upper-left block), so that $\hat\beta_T$ is the best unbiased estimator for $\beta_o$. This conclusion is much stronger than the Gauss-Markov Theorem. Although $\mathrm{var}(\hat\sigma_T^2) = 2\sigma_o^4/(T-k)$ is greater than the lower bound (the lower-right element), it can be shown that $\hat\sigma_T^2$ is still the best unbiased estimator for $\sigma_o^2$; see Rao (1973, p. 319) for a proof.
Tests for Linear Hypotheses

Linear hypothesis: $R\beta_o = r$, where R is $q \times k$ with full row rank q (q < k), and r is a vector of hypothetical values. A natural way to construct a test statistic is to compare $R\hat\beta_T$ and r; we reject the null if their difference is too large.

Given [A1] and [A3], Theorem 3.7(a) states $\hat\beta_T \sim N\big(\beta_o, \sigma_o^2(X'X)^{-1}\big)$, so that
$$R\hat\beta_T \sim N\big(R\beta_o, \sigma_o^2[R(X'X)^{-1}R']\big).$$
The comparison between $R\hat\beta_T$ and r is based on this distribution.
When q = 1, $R\hat\beta_T$ and $R(X'X)^{-1}R'$ are scalars. Under the null hypothesis,
$$\frac{R\hat\beta_T - r}{\sigma_o[R(X'X)^{-1}R']^{1/2}} = \frac{R(\hat\beta_T - \beta_o)}{\sigma_o[R(X'X)^{-1}R']^{1/2}} \sim N(0, 1).$$
As $\sigma_o^2$ is unknown, the statistic above is not operational. Replacing $\sigma_o$ with $\hat\sigma_T$, we obtain the so-called t statistic:
$$\tau = \frac{R\hat\beta_T - r}{\hat\sigma_T[R(X'X)^{-1}R']^{1/2}}.$$

Theorem 3.9. Given the linear specification (1), suppose that [A1] and [A3] hold. When R is $1 \times k$, $\tau \sim t(T-k)$ under the null hypothesis.

Note: The normality condition [A3] is crucial for this t distribution result.
Proof: We write the statistic τ as
$$\tau = \frac{R\hat\beta_T - r}{\sigma_o[R(X'X)^{-1}R']^{1/2}} \Bigg/ \sqrt{\frac{(T-k)\hat\sigma_T^2/\sigma_o^2}{T-k}},$$
where the numerator is $N(0,1)$ and $(T-k)\hat\sigma_T^2/\sigma_o^2$ is $\chi^2(T-k)$ by Theorem 3.7(b). The assertion follows when the numerator and denominator are independent. This is indeed the case, because $\hat\beta_T$ and $\hat e$ are jointly normally distributed with
$$\mathrm{cov}(\hat e, \hat\beta_T) = \mathrm{IE}[(I_T - P)(y - X\beta_o)y'X(X'X)^{-1}] = (I_T - P)\,\mathrm{IE}[(y - X\beta_o)y']X(X'X)^{-1} = \sigma_o^2(I_T - P)X(X'X)^{-1} = 0.$$
Examples

To test $\beta_i = c$, let $R = [0\ \cdots\ 0\ 1\ 0\ \cdots\ 0]$, with 1 in the i-th position, and let $m^{ij}$ denote the (i, j)-th element of $(X'X)^{-1}$. Then the resulting t statistic is
$$\tau = \frac{\hat\beta_{i,T} - c}{\hat\sigma_T\sqrt{m^{ii}}} \sim t(T-k),$$
where $m^{ii} = R(X'X)^{-1}R'$. For testing $\beta_i = 0$, τ is also referred to as the t ratio.

It is straightforward to verify that to test $a\beta_i + b\beta_j = c$, with a, b, c given constants, the corresponding t statistic reads
$$\tau = \frac{a\hat\beta_{i,T} + b\hat\beta_{j,T} - c}{\hat\sigma_T\sqrt{a^2 m^{ii} + b^2 m^{jj} + 2ab\, m^{ij}}} \sim t(T-k).$$
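A sketch of the t statistic for a single restriction $R\beta = r$, following the formula above; the function name and interface are illustrative. For instance, with R = np.array([0, 1, 0]) and r = 0 it computes the t ratio of the second coefficient.

```python
import numpy as np
from scipy import stats

def t_test(X, y, R, r):
    """t statistic for one linear restriction R beta = r (R is a length-k vector)."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (T - k)                  # unbiased variance estimate
    tau = (R @ beta_hat - r) / np.sqrt(s2 * (R @ XtX_inv @ R))
    p_value = 2 * stats.t.sf(abs(tau), df=T - k)  # two-sided p-value
    return tau, p_value
```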
When R is a $q \times k$ matrix with full row rank q (q > 1), we have under the null hypothesis
$$[R(X'X)^{-1}R']^{-1/2}(R\hat\beta_T - r)/\sigma_o \sim N(0, I_q).$$
Hence,
$$(R\hat\beta_T - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_T - r)/\sigma_o^2 \sim \chi^2(q).$$
The so-called F test statistic is
$$\varphi = \frac{(R\hat\beta_T - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_T - r)}{q\,\hat\sigma_T^2}.$$
It is easy to verify that
$$\varphi = \frac{(R\hat\beta_T - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_T - r)/(\sigma_o^2 q)}{(T-k)\hat\sigma_T^2/[\sigma_o^2(T-k)]},$$
where both the numerator and denominator are χ² variables divided by their corresponding degrees of freedom.
Theorem 3.10. Given the linear specification (1), suppose that [A1] and [A3] hold. When R is $q \times k$ with full row rank, $\varphi \sim F(q, T-k)$ under the null hypothesis.

Notes:
1. When q = 1, $\varphi \sim F(1, T-k)$, and this distribution is the same as that of $\tau^2$.
2. The t distribution is symmetric about zero, so one may consider one- or two-sided t tests. The F distribution, on the other hand, is non-negative and asymmetric, so it is more typical to consider only one-sided F tests.
Example: $H_0: \beta_1 = b_1$ and $\beta_2 = b_2$. The F statistic,
$$\varphi = \frac{1}{2\hat\sigma_T^2}\begin{pmatrix} \hat\beta_{1,T} - b_1 \\ \hat\beta_{2,T} - b_2 \end{pmatrix}'\begin{pmatrix} m^{11} & m^{12} \\ m^{21} & m^{22} \end{pmatrix}^{-1}\begin{pmatrix} \hat\beta_{1,T} - b_1 \\ \hat\beta_{2,T} - b_2 \end{pmatrix},$$
is distributed as $F(2, T-k)$.

Example: $H_0: \beta_2 = 0, \beta_3 = 0, \ldots, \beta_k = 0$. The statistic
$$\varphi = \frac{1}{(k-1)\hat\sigma_T^2}\begin{pmatrix} \hat\beta_{2,T} \\ \hat\beta_{3,T} \\ \vdots \\ \hat\beta_{k,T} \end{pmatrix}'\begin{pmatrix} m^{22} & m^{23} & \cdots & m^{2k} \\ m^{32} & m^{33} & \cdots & m^{3k} \\ \vdots & & & \vdots \\ m^{k2} & m^{k3} & \cdots & m^{kk} \end{pmatrix}^{-1}\begin{pmatrix} \hat\beta_{2,T} \\ \hat\beta_{3,T} \\ \vdots \\ \hat\beta_{k,T} \end{pmatrix}$$
is distributed as $F(k-1, T-k)$ and known as the regression F test.
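A sketch of the general F statistic for q restrictions; as above, the names are illustrative, not the author's.

```python
import numpy as np
from scipy import stats

def f_test(X, y, R, r):
    """F statistic for q linear restrictions R beta = r (R is q x k)."""
    T, k = X.shape
    q = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (T - k)
    d = R @ beta_hat - r
    phi = d @ np.linalg.solve(R @ XtX_inv @ R.T, d) / (q * s2)
    p_value = stats.f.sf(phi, q, T - k)
    return phi, p_value
```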
Test Power

To examine the power of the F test, we evaluate the distribution of φ under the alternative hypothesis $R\beta_o = r + \delta$, where R is a $q \times k$ matrix with rank q < k and $\delta \ne 0$.

Theorem 3.11. Given the linear specification (1), suppose that [A1] and [A3] hold. When $R\beta_o = r + \delta$, $\varphi \sim F(q, T-k; \delta'D^{-1}\delta, 0)$, where $D = \sigma_o^2[R(X'X)^{-1}R']$ and $\delta'D^{-1}\delta$ is the non-centrality parameter of the numerator of φ.
Proof: When $R\beta_o = r + \delta$,
$$[R(X'X)^{-1}R']^{-1/2}(R\hat\beta_T - r)/\sigma_o = D^{-1/2}[R(\hat\beta_T - \beta_o) + \delta],$$
which is distributed as $N(0, I_q) + D^{-1/2}\delta$. Then,
$$(R\hat\beta_T - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_T - r)/\sigma_o^2 \sim \chi^2(q; \delta'D^{-1}\delta),$$
a non-central χ² distribution with non-centrality parameter $\delta'D^{-1}\delta$. It is also readily seen that $(T-k)\hat\sigma_T^2/\sigma_o^2$ is still distributed as $\chi^2(T-k)$. Similar to the argument before, these two terms are independent, so that φ has a non-central F distribution.
Test power is determined by the non-centrality parameter $\delta'D^{-1}\delta$, where δ signifies the deviation from the null. When $R\beta_o$ deviates farther from the hypothetical value r (i.e., δ is "large"), the non-centrality parameter $\delta'D^{-1}\delta$ increases, and so does the power.

Example: The null distribution is F(2, 20), and its critical value at the 5% level is 3.49. Then for $F(2, 20; \nu_1, 0)$ with non-centrality parameter $\nu_1 = 1, 3, 5$, the probabilities that φ exceeds 3.49 are approximately 12.1%, 28.2%, and 44.3%, respectively.

Example: The null distribution is F(5, 60), and its critical value at the 5% level is 2.37. Then for $F(5, 60; \nu_1, 0)$ with $\nu_1 = 1, 3, 5$, the probabilities that φ exceeds 2.37 are approximately 9.4%, 20.5%, and 33.2%, respectively.
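The tail probabilities in these examples can be reproduced with the non-central F distribution; a small sketch for the first example:

```python
from scipy import stats

dfn, dfd, alpha = 2, 20, 0.05
crit = stats.f.ppf(1 - alpha, dfn, dfd)           # ~3.49
for nc in (1, 3, 5):
    power = stats.ncf.sf(crit, dfn, dfd, nc)      # P(phi > crit) under the alternative
    print(f"nc={nc}: power ~ {power:.3f}")        # ~0.121, 0.282, 0.443
```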
Alternative Interpretation

Constrained OLS: finding the saddle point of the Lagrangian
$$\min_{\beta, \lambda}\ \frac{1}{T}(y - X\beta)'(y - X\beta) + (R\beta - r)'\lambda,$$
where λ is the $q \times 1$ vector of Lagrange multipliers, we have
$$\ddot\lambda_T = 2[R(X'X/T)^{-1}R']^{-1}(R\hat\beta_T - r), \qquad \ddot\beta_T = \hat\beta_T - (X'X/T)^{-1}R'\ddot\lambda_T/2.$$
The constrained OLS residuals are
$$\ddot e = y - X\ddot\beta_T = y - X\hat\beta_T + X(\hat\beta_T - \ddot\beta_T) = \hat e + X(\hat\beta_T - \ddot\beta_T),$$
with
$$\hat\beta_T - \ddot\beta_T = (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat\beta_T - r).$$
The sum of squared constrained OLS residuals is
$$\ddot e'\ddot e = \hat e'\hat e + (\hat\beta_T - \ddot\beta_T)'X'X(\hat\beta_T - \ddot\beta_T) = \hat e'\hat e + (R\hat\beta_T - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_T - r),$$
where the second term on the right-hand side is the numerator of the F statistic. Letting $\text{ESS}_c = \ddot e'\ddot e$ and $\text{ESS}_u = \hat e'\hat e$, we have
$$\varphi = \frac{\ddot e'\ddot e - \hat e'\hat e}{q\hat\sigma_T^2} = \frac{(\text{ESS}_c - \text{ESS}_u)/q}{\text{ESS}_u/(T-k)},$$
suggesting that the F test in effect compares the constrained and unconstrained models based on their lack of fit. The regression F test is thus
$$\varphi = \frac{(R_u^2 - R_c^2)/q}{(1 - R_u^2)/(T-k)},$$
which compares the model fitness of the full model and the model with only a constant term.
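A sketch of the sum-of-squares form of φ, for the special case in which the restrictions set the last two coefficients to zero (so the constrained fit simply drops those regressors); the data are simulated and the names hypothetical.

```python
import numpy as np
from scipy import stats

def ess(X, y):
    """Error sum of squares from an OLS fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

rng = np.random.default_rng(4)
T = 120
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
y = X[:, :2] @ np.array([1.0, 2.0]) + rng.normal(size=T)  # last two slopes truly zero

q, k = 2, X.shape[1]
ess_u = ess(X, y)            # unconstrained: all regressors
ess_c = ess(X[:, :2], y)     # constrained: beta_3 = beta_4 = 0
phi = ((ess_c - ess_u) / q) / (ess_u / (T - k))
print(phi, stats.f.sf(phi, q, T - k))
```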
Confidence Regions

A confidence interval for $\beta_{i,o}$ is the interval $(\underline g_\alpha, \overline g_\alpha)$ such that
$$\mathrm{IP}\{\underline g_\alpha \le \beta_{i,o} \le \overline g_\alpha\} = 1 - \alpha,$$
where $(1 - \alpha)$ is known as the confidence coefficient. Letting $c_{\alpha/2}$ be the critical value of $t(T-k)$ with tail probability α/2,
$$\mathrm{IP}\Big\{\Big|\frac{\hat\beta_{i,T} - \beta_{i,o}}{\hat\sigma_T\sqrt{m^{ii}}}\Big| \le c_{\alpha/2}\Big\} = \mathrm{IP}\big\{\hat\beta_{i,T} - c_{\alpha/2}\hat\sigma_T\sqrt{m^{ii}} \le \beta_{i,o} \le \hat\beta_{i,T} + c_{\alpha/2}\hat\sigma_T\sqrt{m^{ii}}\big\} = 1 - \alpha.$$
The confidence region for a vector of parameters can be constructed by resorting to the F statistic. For $(\beta_{1,o} = b_1, \beta_{2,o} = b_2)$, suppose $T - k = 30$ and α = 0.05. Then $F_{0.05}(2, 30) = 3.32$, and
$$\mathrm{IP}\left\{\frac{1}{2\hat\sigma_T^2}\begin{pmatrix} \hat\beta_{1,T} - b_1 \\ \hat\beta_{2,T} - b_2 \end{pmatrix}'\begin{pmatrix} m^{11} & m^{12} \\ m^{21} & m^{22} \end{pmatrix}^{-1}\begin{pmatrix} \hat\beta_{1,T} - b_1 \\ \hat\beta_{2,T} - b_2 \end{pmatrix} \le 3.32\right\} = 1 - \alpha,$$
which results in an ellipse with center $(\hat\beta_{1,T}, \hat\beta_{2,T})$.

Note: It is possible that $(\beta_1, \beta_2)$ is outside the confidence box formed by the individual confidence intervals but inside the joint confidence ellipse. That is, while a t ratio may indicate statistical significance of a coefficient, the F test may suggest the opposite based on the confidence region.
Example: Analysis of Suicide Rate

Part I: Estimation results with t. [Table: regressions of $s_t$ on a constant, $u_t$ or $u_{t-1}$, and t, with $\bar R^2$ and the regression F statistic; the coefficient estimates are omitted; the t-ratios in parentheses were 4.57, 8.63, 4.50, 5.04, 0.18, 5.09, 8.30, 5.01, 4.74, and 0.13.]

Note: The numbers in parentheses are t-ratios; ** and * stand for significance of a two-sided test at the 1% and 5% levels, respectively.
Part II: Estimation results with t and g. [Table: regressions of $s_t$ on a constant, $u_t$ or $u_{t-1}$, $g_t$ or $g_{t-1}$, and t, with $\bar R^2$ and the regression F statistic; t-ratios in parentheses.]

F tests for the joint significance of the coefficients of g and t: 0.03 (Model 2) and 0.13 (Model 4), which are insignificant even at the 10% level.
Part III: Estimation results with t and t². [Table: regressions of $s_t$ on a constant, $u_t$ or $u_{t-1}$, $g_t$ or $g_{t-1}$, t, and t², with $R^2$/F; t-ratios in parentheses.]

F tests for the joint significance of the coefficients of g and t: 5.45 (Model 3) and 4.29 (Model 5), which are significant at the 5% level.
Selected estimation results (with more precise estimates):
- $s_t$ on $u_t$, t, and t²: $\bar R^2 = 0.78$;
- $s_t$ on $u_{t-1}$, t, and t²: $R^2 = 0.75$.

For the second regression, the marginal effect of $u_{t-1}$ on $s_t$ is 1.80, approximately 410 persons. That is, the suicide rate would increase by 1.8 when there is a one-percent increase in last year's unemployment rate.

The time effect of the second regression is a function of t and hence changes with t. At 2013, the time effect is 0.54, approximately 120 persons. Since 1994 (about 14.6 years after 1980), there has been a natural increase of the suicide rate in Taiwan; lowering the unemployment rate would help cancel out the time effect to some extent.

The predicted and actual suicide rates in 2013 are, respectively, 17.86 and 15.3; the difference between them is 2.56, approximately 590 persons.
Near Multicollinearity

It is more common to have near multicollinearity: $Xa \approx 0$ for some $a \ne 0$. Writing $X = [x_i\ X_{(i)}]$, where $X_{(i)}$ is X with its i-th column removed, we have from the FWL Theorem that
$$\mathrm{var}(\hat\beta_{i,T}) = \sigma_o^2[x_i'(I - P_{(i)})x_i]^{-1} = \frac{\sigma_o^2}{\sum_{t=1}^{T}(x_{ti} - \bar x_i)^2\,(1 - R^2(i))},$$
where $P_{(i)} = X_{(i)}(X_{(i)}'X_{(i)})^{-1}X_{(i)}'$ and $R^2(i)$ is the centered R² from regressing $x_i$ on $X_{(i)}$.

Consequences of near multicollinearity: $R^2(i)$ is high, so that $\mathrm{var}(\hat\beta_{i,T})$ tends to be large and $\hat\beta_{i,T}$ is sensitive to data changes. Large $\mathrm{var}(\hat\beta_{i,T})$ leads to small (insignificant) t ratios. Yet, the regression F test may suggest that the model (as a whole) is useful.
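The factor $1/(1 - R^2(i))$ is commonly called the variance inflation factor (VIF), a term not used on the slides; a sketch computing it column by column (the function name is hypothetical):

```python
import numpy as np

def vif(X):
    """Variance inflation factor 1/(1 - R2(i)) for each column of X."""
    T, k = X.shape
    out = np.empty(k)
    for i in range(k):
        xi = X[:, i]
        X_rest = np.delete(X, i, axis=1)
        b = np.linalg.lstsq(X_rest, xi, rcond=None)[0]
        resid = xi - X_rest @ b
        tss = np.sum((xi - xi.mean()) ** 2)
        if tss == 0:            # e.g. the constant column: VIF not meaningful
            out[i] = np.nan
            continue
        r2 = 1 - resid @ resid / tss
        out[i] = 1 / (1 - r2)
    return out
```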
How do we circumvent the problems from near multicollinearity?
- Try to break the approximate linear relation: add more data if possible, or drop some regressors.
- Statistical approaches:
  - Ridge regression: for some $\lambda \ge 0$, $\hat b_{\text{ridge}} = (X'X + \lambda I_k)^{-1}X'y$ (see the sketch below).
  - Principal component regression.
- Note: multicollinearity vs. micronumerosity (Goldberger).
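A minimal sketch of the ridge estimator just defined; the shrinkage parameter λ is user-chosen, and λ = 0 reproduces OLS.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X'X + lam I)^(-1) X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
```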
Digression: Regression with Dummy Variables

Example: Let $y_t$ denote the wage of the t-th individual and $x_t$ the working experience (in years). Consider the following specification:
$$y_t = \alpha_0 + \alpha_1 D_t + \beta_0 x_t + e_t,$$
where $D_t$ is a dummy variable such that $D_t = 1$ if t is male and $D_t = 0$ otherwise. This specification puts together two regressions:
- Regression for females: $D_t = 0$, and the intercept is $\alpha_0$.
- Regression for males: $D_t = 1$, and the intercept is $\alpha_0 + \alpha_1$.
These two regressions coincide if $\alpha_1 = 0$. Testing no wage discrimination against females amounts to testing the hypothesis $\alpha_1 = 0$.
We may also consider the specification with a dummy variable and its interaction with a regressor:
$$y_t = \alpha_0 + \alpha_1 D_t + \beta_0 x_t + \beta_1 (x_t D_t) + e_t.$$
Then, the slopes of the regressions for females and males are, respectively, $\beta_0$ and $\beta_0 + \beta_1$. These two regressions coincide if $\alpha_1 = 0$ and $\beta_1 = 0$. In this case, testing no wage discrimination against females amounts to testing the joint hypothesis $\alpha_1 = 0$ and $\beta_1 = 0$.
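A sketch of how the design matrix for this dummy-plus-interaction specification is assembled; the data are simulated and the parameter values hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 200
x = rng.uniform(0, 30, size=T)         # hypothetical experience in years
D = rng.integers(0, 2, size=T)         # hypothetical male dummy

# Design matrix for y = a0 + a1*D + b0*x + b1*(x*D) + e
X = np.column_stack([np.ones(T), D, x, x * D])
y = 10 + 2 * D + 0.5 * x + 0.1 * x * D + rng.normal(size=T)
coef = np.linalg.lstsq(X, y, rcond=None)[0]
print(coef)                            # estimates of (a0, a1, b0, b1)
```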
Example: Consider two dummy variables: $D_{1,t} = 1$ if high school is t's highest degree and $D_{1,t} = 0$ otherwise; $D_{2,t} = 1$ if college or graduate school is t's highest degree and $D_{2,t} = 0$ otherwise. The specification below in effect puts together three regressions:
$$y_t = \alpha_0 + \alpha_1 D_{1,t} + \alpha_2 D_{2,t} + \beta x_t + e_t,$$
where the below-high-school regression has intercept $\alpha_0$, the high-school regression has intercept $\alpha_0 + \alpha_1$, and the college regression has intercept $\alpha_0 + \alpha_2$. Similar to the previous example, we may also consider a more general specification in which x interacts with $D_1$ and $D_2$.

Dummy variable trap: To avoid exact multicollinearity, the number of dummy variables in a model (with a constant term) should be one less than the number of groups.
Example: Analysis of Suicide Rate

Let $D_t = 1$ for $t = T^* + 1, \ldots, T$ and $D_t = 0$ otherwise, where $T^*$ is the year of structural change. Consider the specification:
$$s_t = \alpha_0 + \delta D_t + \beta_0 u_{t-1} + \gamma u_{t-1}D_t + e_t.$$
The before-change regression has intercept $\alpha_0$ and slope $\beta_0$, and the after-change regression has intercept $\alpha_0 + \delta$ and slope $\beta_0 + \gamma$. Testing a structural change at $T^*$ amounts to testing $\delta = 0$ and $\gamma = 0$ (Chow test).

Alternatively, we can estimate the specification:
$$s_t = \alpha_0(1 - D_t) + \alpha_1 D_t + \beta_0 u_{t-1}(1 - D_t) + \beta_1 u_{t-1}D_t + e_t,$$
and test a structural change at $T^*$ by checking whether $\alpha_0 = \alpha_1$ and $\beta_0 = \beta_1$.
Part I: Estimation results with a known change, without t. [Table: regressions of $s_t$ on a constant, $D_t$, $u_{t-1}$, and $u_{t-1}D_t$ for candidate break years, with $R^2$/Reg F and the Chow test; t-ratios in parentheses.]

The Chow test is the F test of the coefficients of $D_t$ and $u_{t-1}D_t$ being zero.
Part II: Estimation results with a known change, with t. [Table: regressions of $s_t$ on a constant, $D_t$, $u_{t-1}$, t, and $tD_t$, with $\bar R^2$/Reg F; t-ratios in parentheses.]

F test of the coefficients of $D_t$ and $tD_t$ being zero: … ('92); … ('93); … ('94); … ('95).
We do not know $T^*$, the year of change, and hence tried estimating the specification with $D_t$, $u_{t-1}$, t, and $tD_t$ for different $T^*$: 1992, 1993, 1994, and 1995 (coefficient estimates omitted).

Consider the regression with $T^* = 1994$. The before-change and after-change time slopes are −0.48 (a decrease over time) and 0.37 (an increase over time), respectively. The marginal effect of $u_{t-1}$ on $s_t$ is significant at the 5% level.

The regression line with $T^* = 1994$ predicts the suicide rate in 2013 as 17.8; the difference between the predicted and actual suicide rates is 2.5, approximately 570 persons.
Limitation of the Classical Conditions
- [A1] X is non-stochastic: economic variables cannot be regarded as non-stochastic; also, lagged dependent variables may be used as regressors.
- [A2](i) $\mathrm{IE}(y) = X\beta_o$: $\mathrm{IE}(y)$ may be a linear function with more regressors or a nonlinear function of regressors.
- [A2](ii) $\mathrm{var}(y) = \sigma_o^2 I_T$: the elements of y may be correlated (serial correlation, spatial correlation) and/or may have unequal variances.
- [A3] Normality: y may have a non-normal distribution.

The OLS estimator loses the properties derived before when some of the classical conditions fail to hold.
When var(y) ≠ σ²ₒI_T

Given the linear specification $y = X\beta + e$, suppose, in addition to [A1] and [A2](i), that $\mathrm{var}(y) = \Sigma_o \ne \sigma_o^2 I_T$, where $\Sigma_o$ is p.d. That is, the elements of y may be correlated and have unequal variances.
- The OLS estimator $\hat\beta_T$ remains unbiased with
$$\mathrm{var}(\hat\beta_T) = \mathrm{var}\big((X'X)^{-1}X'y\big) = (X'X)^{-1}X'\Sigma_o X(X'X)^{-1}.$$
- $\hat\beta_T$ is not the BLUE for $\beta_o$, and it is not the BUE for $\beta_o$ under normality.
- The estimator $\hat\sigma_T^2(X'X)^{-1}$ is a biased estimator for $\mathrm{var}(\hat\beta_T)$. Consequently, the t and F statistics do not have t and F distributions, even when y is normally distributed.
The GLS Estimator

Consider the transformed specification $Gy = GX\beta + Ge$, where G is nonsingular and non-stochastic.
- $\mathrm{IE}(Gy) = GX\beta_o$ and $\mathrm{var}(Gy) = G\Sigma_o G'$.
- GX has full column rank, so the OLS estimator can be computed:
$$b(G) = (X'G'GX)^{-1}X'G'Gy,$$
which is still linear and unbiased. It would be the BLUE provided that G is chosen such that $G\Sigma_o G' = \sigma_o^2 I_T$.
- Setting $G = \Sigma_o^{-1/2}$, where $\Sigma_o^{-1/2} = C\Lambda^{-1/2}C'$ and C orthogonally diagonalizes $\Sigma_o$ ($C'\Sigma_o C = \Lambda$), we have $\Sigma_o^{-1/2}\Sigma_o\Sigma_o^{-1/2} = I_T$.
With $y^* = \Sigma_o^{-1/2}y$ and $X^* = \Sigma_o^{-1/2}X$, we have the GLS estimator:
$$\hat\beta_{\text{GLS}} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'\Sigma_o^{-1}X)^{-1}(X'\Sigma_o^{-1}y). \tag{5}$$
- $\hat\beta_{\text{GLS}}$ is a minimizer of the weighted sum of squared errors
$$Q(\beta; \Sigma_o) = \frac{1}{T}(y^* - X^*\beta)'(y^* - X^*\beta) = \frac{1}{T}(y - X\beta)'\Sigma_o^{-1}(y - X\beta).$$
- The vector of GLS fitted values, $\hat y_{\text{GLS}} = X(X'\Sigma_o^{-1}X)^{-1}(X'\Sigma_o^{-1}y)$, is an oblique projection of y onto span(X), because $X(X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}$ is idempotent but asymmetric. The GLS residual vector is $\hat e_{\text{GLS}} = y - \hat y_{\text{GLS}}$.
- The sum of squared OLS residuals is less than the sum of squared GLS residuals. (Why?)
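A sketch of (5). The slides transform the data with the symmetric square root $\Sigma_o^{-1/2}$; the sketch below uses a Cholesky factor of $\Sigma_o$ instead, which is a different G with $G\Sigma_o G' = I_T$ and yields the same estimator.

```python
import numpy as np

def gls(X, y, Sigma):
    """GLS estimator (X' Sigma^-1 X)^-1 X' Sigma^-1 y via a Cholesky transform."""
    L = np.linalg.cholesky(Sigma)        # Sigma = L L'
    # Whitening: solve L z = y (and L Z = X) instead of forming Sigma^-1
    y_star = np.linalg.solve(L, y)
    X_star = np.linalg.solve(L, X)
    return np.linalg.lstsq(X_star, y_star, rcond=None)[0]
```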
Stochastic Properties of the GLS Estimator

Theorem 4.1 (Aitken). Given the linear specification (1), suppose that [A1] and [A2](i) hold and that $\mathrm{var}(y) = \Sigma_o$ is positive definite. Then $\hat\beta_{\text{GLS}}$ is the BLUE for $\beta_o$.

Given [A3'] $y \sim N(X\beta_o, \Sigma_o)$, we have $\hat\beta_{\text{GLS}} \sim N\big(\beta_o, (X'\Sigma_o^{-1}X)^{-1}\big)$. Under [A3'], the log-likelihood function is
$$\log L(\beta; \Sigma_o) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\log(\det(\Sigma_o)) - \frac{1}{2}(y - X\beta)'\Sigma_o^{-1}(y - X\beta),$$
with the FOC $X'\Sigma_o^{-1}(y - X\beta) = 0$. Thus, the GLS estimator is also the MLE under normality.
Under normality, the information matrix is
$$\mathrm{IE}\big[X'\Sigma_o^{-1}(y - X\beta)(y - X\beta)'\Sigma_o^{-1}X\big]\big|_{\beta = \beta_o} = X'\Sigma_o^{-1}X.$$
Thus, the GLS estimator is the BUE for $\beta_o$, because its covariance matrix reaches the Cramér-Rao lower bound. Under the null hypothesis $R\beta_o = r$, we have
$$(R\hat\beta_{\text{GLS}} - r)'[R(X'\Sigma_o^{-1}X)^{-1}R']^{-1}(R\hat\beta_{\text{GLS}} - r) \sim \chi^2(q).$$
A major difficulty: how should the GLS estimator be computed when $\Sigma_o$ is unknown?
The Feasible GLS Estimator

The feasible GLS (FGLS) estimator is
$$\hat\beta_{\text{FGLS}} = (X'\hat\Sigma_T^{-1}X)^{-1}X'\hat\Sigma_T^{-1}y,$$
where $\hat\Sigma_T$ is an estimator of $\Sigma_o$.

Further difficulties in FGLS estimation:
- The number of parameters in $\Sigma_o$ is $T(T+1)/2$. Estimating $\Sigma_o$ without some prior restrictions on it is practically infeasible.
- Even when an estimator $\hat\Sigma_T$ is available under certain assumptions, $\hat\beta_{\text{FGLS}}$ is a complex function of the data y and X. As such, the finite-sample properties of the FGLS estimator are typically difficult to derive.
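A two-step FGLS sketch under one simple working assumption, $\mathrm{var}(y_t) = \exp(z_t'\gamma)$ for observed variables $z_t$; this variance model and all names are illustrative, not from the slides. Z should include a constant column.

```python
import numpy as np

def fgls_diagonal(X, y, Z):
    """Two-step FGLS assuming var(y_t) = exp(z_t' gamma) (a working model).

    Step 1: OLS residuals. Step 2: regress log squared residuals on Z to
    estimate the variance function, then reweight and re-estimate.
    """
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b_ols
    gamma = np.linalg.lstsq(Z, np.log(resid**2 + 1e-12), rcond=None)[0]
    sigma2 = np.exp(Z @ gamma)               # estimated variances
    w = 1.0 / np.sqrt(sigma2)                # whitening weights
    return np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
```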
Some Remarks on Feasible GLS Estimation
- In the classical literature, feasible GLS estimation is based on some assumptions on $\Sigma_o$. These assumptions are needed to reduce the number of parameters and hence render the estimation of $\Sigma_o$ feasible.
- The subsections below deal with estimation under some special assumptions, which are imposed on either the diagonal or the off-diagonal terms of $\Sigma_o$, but not both.
- Although these assumptions on $\Sigma_o$ are weaker than the condition $\Sigma_o = \sigma_o^2 I_T$, they are very ad hoc and cannot be easily generalized.
Tests for Heteroskedasticity

A simple form of $\Sigma_o$ is
$$\Sigma_o = \begin{pmatrix} \sigma_1^2 I_{T_1} & 0 \\ 0 & \sigma_2^2 I_{T_2} \end{pmatrix},$$
with $T = T_1 + T_2$; this is known as groupwise heteroskedasticity. The null hypothesis of homoskedasticity is $\sigma_1^2 = \sigma_2^2 = \sigma_o^2$. Perform separate OLS regressions using the data in each group and obtain the variance estimates $\hat\sigma_{T_1}^2$ and $\hat\sigma_{T_2}^2$. Under [A1] and [A3'], the F test is
$$\varphi := \frac{\hat\sigma_{T_1}^2}{\hat\sigma_{T_2}^2} = \frac{(T_1 - k)\hat\sigma_{T_1}^2}{\sigma_o^2(T_1 - k)} \Bigg/ \frac{(T_2 - k)\hat\sigma_{T_2}^2}{\sigma_o^2(T_2 - k)} \sim F(T_1 - k, T_2 - k).$$
More generally, suppose $\sigma_t^2 = c_0 + c_1 x_{tj}^2$ for some constants $c_0, c_1 > 0$. The Goldfeld-Quandt test:
1. Rearrange the observations according to the values of $x_j$ in descending order.
2. Divide the rearranged data set into three groups with $T_1$, $T_m$, and $T_2$ observations, respectively.
3. Drop the $T_m$ observations in the middle group and perform separate OLS regressions using the data in the first and third groups.
4. The statistic is the ratio of the variance estimates:
$$\hat\sigma_{T_1}^2/\hat\sigma_{T_2}^2 \sim F(T_1 - k, T_2 - k).$$

Some questions:
- Can we estimate the model with all observations and then compute $\hat\sigma_{T_1}^2$ and $\hat\sigma_{T_2}^2$ based on the $T_1$ and $T_2$ residuals?
- If $\Sigma_o$ is not diagonal, does the F test above still work?
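A sketch of the four steps above; the fraction of middle observations to drop is a user choice (the slides do not fix $T_m$), and the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def goldfeld_quandt(X, y, j, frac_drop=0.2):
    """Goldfeld-Quandt statistic, sorting observations by column j, descending."""
    T, k = X.shape
    order = np.argsort(-X[:, j])                 # descending in x_j
    Xs, ys = X[order], y[order]
    T_m = int(frac_drop * T)                     # middle observations to drop
    T1 = (T - T_m) // 2
    T2 = T - T_m - T1

    def s2(A, b):
        # Unbiased residual variance from a separate OLS fit
        r = b - A @ np.linalg.lstsq(A, b, rcond=None)[0]
        return r @ r / (len(b) - A.shape[1])

    phi = s2(Xs[:T1], ys[:T1]) / s2(Xs[-T2:], ys[-T2:])
    return phi, stats.f.sf(phi, T1 - k, T2 - k)
```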
More informationLinear models. Linear models are computationally convenient and remain widely used in. applied econometric research
Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y
More informationPart IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015
Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)
More informationAdvanced Quantitative Methods: ordinary least squares
Advanced Quantitative Methods: Ordinary Least Squares University College Dublin 31 January 2012 1 2 3 4 5 Terminology y is the dependent variable referred to also (by Greene) as a regressand X are the
More informationThe Linear Regression Model
The Linear Regression Model Carlo Favero Favero () The Linear Regression Model 1 / 67 OLS To illustrate how estimation can be performed to derive conditional expectations, consider the following general
More informationRockefeller College University at Albany
Rockefeller College University at Albany PAD 705 Handout: Suggested Review Problems from Pindyck & Rubinfeld Original prepared by Professor Suzanne Cooper John F. Kennedy School of Government, Harvard
More informationEconometrics - 30C00200
Econometrics - 30C00200 Lecture 11: Heteroskedasticity Antti Saastamoinen VATT Institute for Economic Research Fall 2015 30C00200 Lecture 11: Heteroskedasticity 12.10.2015 Aalto University School of Business
More informationCorrelation and Regression
Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class
More informationRestricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model
Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives
More informationEconometrics I KS. Module 1: Bivariate Linear Regression. Alexander Ahammer. This version: March 12, 2018
Econometrics I KS Module 1: Bivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: March 12, 2018 Alexander Ahammer (JKU) Module 1: Bivariate
More informationLecture 3: Multiple Regression
Lecture 3: Multiple Regression R.G. Pierse 1 The General Linear Model Suppose that we have k explanatory variables Y i = β 1 + β X i + β 3 X 3i + + β k X ki + u i, i = 1,, n (1.1) or Y i = β j X ji + u
More informationIntroductory Econometrics
Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna November 23, 2013 Outline Introduction
More informationSo far our focus has been on estimation of the parameter vector β in the. y = Xβ + u
Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,
More informationHypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima
Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s
More informationIntroductory Econometrics
Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 11, 2012 Outline Heteroskedasticity
More informationEconometrics Multiple Regression Analysis: Heteroskedasticity
Econometrics Multiple Regression Analysis: João Valle e Azevedo Faculdade de Economia Universidade Nova de Lisboa Spring Semester João Valle e Azevedo (FEUNL) Econometrics Lisbon, April 2011 1 / 19 Properties
More informationMotivation for multiple regression
Motivation for multiple regression 1. Simple regression puts all factors other than X in u, and treats them as unobserved. Effectively the simple regression does not account for other factors. 2. The slope
More informationLECTURE 2 LINEAR REGRESSION MODEL AND OLS
SEPTEMBER 29, 2014 LECTURE 2 LINEAR REGRESSION MODEL AND OLS Definitions A common question in econometrics is to study the effect of one group of variables X i, usually called the regressors, on another
More informationSimple Linear Regression: The Model
Simple Linear Regression: The Model task: quantifying the effect of change X in X on Y, with some constant β 1 : Y = β 1 X, linear relationship between X and Y, however, relationship subject to a random
More informationLECTURE ON HAC COVARIANCE MATRIX ESTIMATION AND THE KVB APPROACH
LECURE ON HAC COVARIANCE MARIX ESIMAION AND HE KVB APPROACH CHUNG-MING KUAN Institute of Economics Academia Sinica October 20, 2006 ckuan@econ.sinica.edu.tw www.sinica.edu.tw/ ckuan Outline C.-M. Kuan,
More informationIn the bivariate regression model, the original parameterization is. Y i = β 1 + β 2 X2 + β 2 X2. + β 2 (X 2i X 2 ) + ε i (2)
RNy, econ460 autumn 04 Lecture note Orthogonalization and re-parameterization 5..3 and 7.. in HN Orthogonalization of variables, for example X i and X means that variables that are correlated are made
More informationThe general linear regression with k explanatory variables is just an extension of the simple regression as follows
3. Multiple Regression Analysis The general linear regression with k explanatory variables is just an extension of the simple regression as follows (1) y i = β 0 + β 1 x i1 + + β k x ik + u i. Because
More informationEconometrics of Panel Data
Econometrics of Panel Data Jakub Mućk Meeting # 2 Jakub Mućk Econometrics of Panel Data Meeting # 2 1 / 26 Outline 1 Fixed effects model The Least Squares Dummy Variable Estimator The Fixed Effect (Within
More informationThe Simple Regression Model. Part II. The Simple Regression Model
Part II The Simple Regression Model As of Sep 22, 2015 Definition 1 The Simple Regression Model Definition Estimation of the model, OLS OLS Statistics Algebraic properties Goodness-of-Fit, the R-square
More informationThe Standard Linear Model: Hypothesis Testing
Department of Mathematics Ma 3/103 KC Border Introduction to Probability and Statistics Winter 2017 Lecture 25: The Standard Linear Model: Hypothesis Testing Relevant textbook passages: Larsen Marx [4]:
More informationZellner s Seemingly Unrelated Regressions Model. James L. Powell Department of Economics University of California, Berkeley
Zellner s Seemingly Unrelated Regressions Model James L. Powell Department of Economics University of California, Berkeley Overview The seemingly unrelated regressions (SUR) model, proposed by Zellner,
More informationInterpreting Regression Results
Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split
More informationLecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16)
Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) 1 2 Model Consider a system of two regressions y 1 = β 1 y 2 + u 1 (1) y 2 = β 2 y 1 + u 2 (2) This is a simultaneous equation model
More informationThe outline for Unit 3
The outline for Unit 3 Unit 1. Introduction: The regression model. Unit 2. Estimation principles. Unit 3: Hypothesis testing principles. 3.1 Wald test. 3.2 Lagrange Multiplier. 3.3 Likelihood Ratio Test.
More informationEconometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague
Econometrics Week 8 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 25 Recommended Reading For the today Instrumental Variables Estimation and Two Stage
More informationFinancial Econometrics
Material : solution Class : Teacher(s) : zacharias psaradakis, marian vavra Example 1.1: Consider the linear regression model y Xβ + u, (1) where y is a (n 1) vector of observations on the dependent variable,
More information2. Linear regression with multiple regressors
2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions
More informationBusiness Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata'
Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Linear Regression Specication Let Y be a univariate quantitative response variable. We model Y as follows: Y = f(x) + ε where
More informationECON 4230 Intermediate Econometric Theory Exam
ECON 4230 Intermediate Econometric Theory Exam Multiple Choice (20 pts). Circle the best answer. 1. The Classical assumption of mean zero errors is satisfied if the regression model a) is linear in the
More informationCh 3: Multiple Linear Regression
Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery
More informationMultiple Linear Regression
Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there
More informationProblem Set #6: OLS. Economics 835: Econometrics. Fall 2012
Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.
More informationLECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity
LECTURE 10 Introduction to Econometrics Multicollinearity & Heteroskedasticity November 22, 2016 1 / 23 ON PREVIOUS LECTURES We discussed the specification of a regression equation Specification consists
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2
MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized
More informationInverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1
Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is
More informationApplied Statistics and Econometrics
Applied Statistics and Econometrics Lecture 5 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 44 Outline of Lecture 5 Now that we know the sampling distribution
More informationMaximum Likelihood Estimation
Maximum Likelihood Estimation Merlise Clyde STA721 Linear Models Duke University August 31, 2017 Outline Topics Likelihood Function Projections Maximum Likelihood Estimates Readings: Christensen Chapter
More informationReference: Davidson and MacKinnon Ch 2. In particular page
RNy, econ460 autumn 03 Lecture note Reference: Davidson and MacKinnon Ch. In particular page 57-8. Projection matrices The matrix M I X(X X) X () is often called the residual maker. That nickname is easy
More informationLecture 15. Hypothesis testing in the linear model
14. Lecture 15. Hypothesis testing in the linear model Lecture 15. Hypothesis testing in the linear model 1 (1 1) Preliminary lemma 15. Hypothesis testing in the linear model 15.1. Preliminary lemma Lemma
More informationEmpirical Market Microstructure Analysis (EMMA)
Empirical Market Microstructure Analysis (EMMA) Lecture 3: Statistical Building Blocks and Econometric Basics Prof. Dr. Michael Stein michael.stein@vwl.uni-freiburg.de Albert-Ludwigs-University of Freiburg
More informationStatement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.
MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss
More informationRepeated observations on the same cross-section of individual units. Important advantages relative to pure cross-section data
Panel data Repeated observations on the same cross-section of individual units. Important advantages relative to pure cross-section data - possible to control for some unobserved heterogeneity - possible
More informationChapter 15 Panel Data Models. Pooling Time-Series and Cross-Section Data
Chapter 5 Panel Data Models Pooling Time-Series and Cross-Section Data Sets of Regression Equations The topic can be introduced wh an example. A data set has 0 years of time series data (from 935 to 954)
More informationStatistics 910, #5 1. Regression Methods
Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known
More informationECON 3150/4150, Spring term Lecture 7
ECON 3150/4150, Spring term 2014. Lecture 7 The multivariate regression model (I) Ragnar Nymoen University of Oslo 4 February 2014 1 / 23 References to Lecture 7 and 8 SW Ch. 6 BN Kap 7.1-7.8 2 / 23 Omitted
More informationEC3062 ECONOMETRICS. THE MULTIPLE REGRESSION MODEL Consider T realisations of the regression equation. (1) y = β 0 + β 1 x β k x k + ε,
THE MULTIPLE REGRESSION MODEL Consider T realisations of the regression equation (1) y = β 0 + β 1 x 1 + + β k x k + ε, which can be written in the following form: (2) y 1 y 2.. y T = 1 x 11... x 1k 1
More informationIntroduction to Quantile Regression
Introduction to Quantile Regression CHUNG-MING KUAN Department of Finance National Taiwan University May 31, 2010 C.-M. Kuan (National Taiwan U.) Intro. to Quantile Regression May 31, 2010 1 / 36 Lecture
More informationLarge Sample Properties of Estimators in the Classical Linear Regression Model
Large Sample Properties of Estimators in the Classical Linear Regression Model 7 October 004 A. Statement of the classical linear regression model The classical linear regression model can be written in
More informationSummer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University.
Summer School in Statistics for Astronomers V June 1 - June 6, 2009 Regression Mosuk Chow Statistics Department Penn State University. Adapted from notes prepared by RL Karandikar Mean and variance Recall
More informationTopic 7: Heteroskedasticity
Topic 7: Heteroskedasticity Advanced Econometrics (I Dong Chen School of Economics, Peking University Introduction If the disturbance variance is not constant across observations, the regression is heteroskedastic
More information3. For a given dataset and linear model, what do you think is true about least squares estimates? Is Ŷ always unique? Yes. Is ˆβ always unique? No.
7. LEAST SQUARES ESTIMATION 1 EXERCISE: Least-Squares Estimation and Uniqueness of Estimates 1. For n real numbers a 1,...,a n, what value of a minimizes the sum of squared distances from a to each of
More informationMIT Spring 2015
Regression Analysis MIT 18.472 Dr. Kempthorne Spring 2015 1 Outline Regression Analysis 1 Regression Analysis 2 Multiple Linear Regression: Setup Data Set n cases i = 1, 2,..., n 1 Response (dependent)
More information