OLS, MLE and related topics. Primer.


1 OLS, MLE and related topics. Primer. Katarzyna Bech, Week 1.

2 Classical Linear Regression Model (CLRM)
The model: $y = X\beta + \varepsilon$, and the assumptions:
A1 The true model is $y = X\beta + \varepsilon$.
A2 $E(\varepsilon) = 0$.
A3 $\mathrm{Var}(\varepsilon) = E(\varepsilon\varepsilon') = \sigma^2 I_n$.
A4 $X$ is a non-stochastic $n \times k$ matrix with rank $k \le n$.

3 Least Squares Estimation
The OLS estimator $\hat\beta$ minimizes the sum of squares:
$$\hat\beta = \arg\min_{\beta} RSS(\beta), \qquad RSS(\beta) = \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta).$$
Using the rules of matrix calculus, the FOC is
$$\frac{\partial RSS(\beta)}{\partial \beta} = \frac{\partial (y - X\beta)'(y - X\beta)}{\partial \beta} = -2X'y + 2X'X\beta = 0,$$
and the SOC is
$$\frac{\partial^2 RSS(\beta)}{\partial \beta\,\partial \beta'} = 2X'X,$$
which is clearly positive definite.

4 Least Squares Estimation
From the FOC: $\hat\beta = (X'X)^{-1}X'y$.
Properties of the OLS estimator:
$$E[\hat\beta] = \beta + (X'X)^{-1}X'E[\varepsilon] = \beta,$$
hence it is unbiased, and
$$\mathrm{Var}[\hat\beta] = E\left[(\hat\beta - \beta)(\hat\beta - \beta)'\right] = \sigma^2(X'X)^{-1}.$$
Gauss-Markov Theorem: under A1-A4, $\hat\beta_{OLS}$ is BLUE.

5 Gauss-Markov Theorem: Proof
Consider again the model $y = X\beta + \varepsilon$. Let $\tilde\beta = Cy$, where $C$ is a $k \times n$ matrix, be another linear unbiased estimator. For $\tilde\beta$ to be unbiased we require
$$E(\tilde\beta) = E(Cy) = E(CX\beta + C\varepsilon) = CX\beta = \beta.$$
Thus, we need $CX = I_k$, so that $\tilde\beta = CX\beta + C\varepsilon = \beta + C\varepsilon$. We have
$$\mathrm{Var}(\tilde\beta) = E\left((\tilde\beta - \beta)(\tilde\beta - \beta)'\right) = E(C\varepsilon\varepsilon'C') = C\,E(\varepsilon\varepsilon')\,C' = \sigma^2 CC'.$$

6 Gauss-Markov Theorem: Proof
We want to show that $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta) \geq 0$. Using $CX = I_k$ (and hence $X'C' = I_k$),
$$\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta) = \sigma^2\left(CC' - (X'X)^{-1}\right) = \sigma^2\left(CC' - CX(X'X)^{-1}X'C'\right)$$
$$= \sigma^2\left(CI_nC' - CX(X'X)^{-1}X'C'\right) = \sigma^2 C\left(I_n - X(X'X)^{-1}X'\right)C' = \sigma^2 CM_XC' = \sigma^2 CM_XM_X'C' = \sigma^2 DD',$$
where $D = CM_X$. $DD'$ is positive semidefinite, so $\sigma^2 DD' \geq 0$, implying that $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta) \geq 0$.

7 Gauss-Markov Theorem: Proof
The latter result also applies to any linear combination of the elements of $\beta$, i.e. $\mathrm{Var}(c'\tilde\beta) - \mathrm{Var}(c'\hat\beta) \geq 0$. We thus have the following corollaries.
Corollary. Under A1-A4, for any vector of constants $c$, the minimum variance linear unbiased estimator of $c'\beta$ in the classical regression model is $c'\hat\beta$, where $\hat\beta$ is the least squares estimator.
Corollary. Each coefficient $\beta_j$ is estimated at least as efficiently by $\hat\beta_j$ as by any other linear unbiased estimator.

8 Least Squares Estimation
Clearly, we also need to estimate $\sigma^2$. The most obvious estimator to take is
$$\hat\sigma^2 = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n} = \frac{y'M_Xy}{n},$$
where $M_X = I - P_X = I - X(X'X)^{-1}X'$. BUT $\hat\sigma^2$ is biased, so we have to define an unbiased estimator
$$s^2 = \frac{n}{n-k}\,\hat\sigma^2 = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n - k}.$$
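To make slides 4 and 8 concrete, here is a minimal NumPy sketch (my own illustration, not from the slides; the simulated design and parameter values are invented) computing $\hat\beta$ from the normal equations and the unbiased $s^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # first column: intercept
beta_true = np.array([1.0, 2.0, -0.5])                          # illustrative values
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# OLS from the normal equations X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Unbiased error-variance estimator s^2 = e'e / (n - k)
e = y - X @ beta_hat
s2 = e @ e / (n - k)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))  # conventional standard errors

print(beta_hat, s2, se)
```

Solving the normal equations with np.linalg.solve avoids forming an explicit inverse, which is numerically preferable to applying the textbook formula literally.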

9 How is econometrics typically taught?
What are the consequences of violating assumptions A1-A4?

10 Perfect Multicollinearity
Let's see what happens when A4 is violated, i.e. $\mathrm{rank}(X) < k$ (equivalently, the columns of $X$ are linearly dependent). In this case $X'X$ is not invertible (it is singular), so the ordinary least squares estimator cannot be computed. The parameters of such a regression model are unidentified.
The use of too many dummy variables (the dummy variable trap) is a typical cause of exact multicollinearity. Consider for instance the case where we would like to estimate a wage equation including a dummy for males ($MALE_i$), a dummy for females ($FEMALE_i$) as well as a constant (note that $MALE_i + FEMALE_i = 1$ for all $i$).
Exact multicollinearity is easily solved by excluding one of the variables from the model.

11 "Near" Multicollinearity Imperfect multicollinearity arises when two or more regressors are highly correlated, in the sense that there is a linear function of regressors which is highly correlated with another regressor. When regressors are highly correlated, it becomes di cult to disentangle the separate e ects of the regressors on the dependent variable. Near multicollinearity is not a violation of the classical linear regression assumptions. So our OLS estimates are still the best linear unbiased estimators. Best, just simply it might not be all that good. () Week 1 11 / 88

12 "Near" Multicollinearity The higher the correlation between the regressors becomes, the less precise our estimates will be (note that Var( ˆβ) = σ 2 (X 0 X ) 1 ). Consider y i = α + x i1 β 1 + x i2 β 2 + ɛ i. Var( ˆβ j ) = σ 2 n. (x ij x j ) 2 (1 rx 2 1,x 2 ) i=1 Obviously, as r 2 x 1,x 2! 1, the variance increases. () Week 1 12 / 88

13 "Near" Multicollinearity Practical consequences: Although OLS estimators are still BLUE they have large variances and covariances. Individual t-tests may fail to reject that coe cients are 0, even though they are jointly sign cant. Parameter estimates may be very sensitive to one or a small number of observations. Coe cients may have the wrong sign or implausible magnitude. How to detect multicollinearity? High R 2 but few signi cant t-ratios. High pairwise correlation among explanatory variables. How to deal with multicollinearity? A priori information. Dropping a variable. Transformation of variables. () Week 1 13 / 88

14 Misspecifying the Set of Regressors (violation of A1)
We consider two issues:
- omission of a relevant variable (e.g., due to oversight or lack of measurement);
- inclusion of irrelevant variables.
Preview of the results: in the first case, the estimator is generally biased; in the second case, there is no bias but the estimator is inefficient.

15 Omitted Variable Bias
Consider the following two models. The true model:
$$y = X\beta + Z\delta + \varepsilon, \quad E(\varepsilon) = 0, \quad E(\varepsilon\varepsilon') = \sigma^2 I \tag{1}$$
and the misspecified model which we estimate:
$$y = X\beta + v. \tag{2}$$
The OLS estimator of $\beta$ in (2) is
$$\tilde\beta = (X'X)^{-1}X'y.$$

16 Omitted Variable Bias
In order to assess the statistical properties of the latter, we have to substitute the true model (1) for $y$. We obtain
$$\tilde\beta = (X'X)^{-1}X'(X\beta + Z\delta + \varepsilon) = \beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'\varepsilon.$$
Taking expectations (recall that $X$ and $Z$ are non-stochastic), we have
$$E(\tilde\beta) = \beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'E(\varepsilon) = \beta + (X'X)^{-1}X'Z\delta.$$
Thus, generally, $\tilde\beta$ is biased. Interpretation of the bias term: $(X'X)^{-1}X'Z$ collects the regression coefficient(s) of the omitted variable(s) on all included variables, and $\delta$ is the true coefficient corresponding to the omitted variable(s).

17 Omitted Variable Bias
We can interpret $E(\tilde\beta)$ as the sum of two terms:
- $\beta$ is the direct change in $E(y)$ associated with changes in $X$;
- $(X'X)^{-1}X'Z\delta$ is the indirect change in $E(y)$ associated with changes in $Z$ ($X$ partially acts as a proxy for $Z$ when $Z$ is omitted).
There would be no omitted variable bias if $X$ and $Z$ were orthogonal, i.e. $X'Z = 0$. Trivially, there would also be no bias if $\delta = 0$.

18 Example
Consider the standard human capital earnings function
$$\ln w_i = \alpha + \beta_1 school_i + \beta_2 exp_i + v_i \tag{3}$$
where we are interested in $\beta_1$, the return to schooling. We suspect that the relevant independent variable $ability_i$ is omitted. What is the likely direction of the bias?
Let $\hat\nu$ denote the OLS coefficient corresponding to schooling in the regression of $ability_i$ on a constant, $school_i$ and $exp_i$. $\hat\nu$ is likely to be positive (note that we are not claiming a causal effect). On the other hand, consider the true model (which cannot be used since $ability_i$ is not observed):
$$\ln w_i = \alpha + \beta_1 school_i + \beta_2 exp_i + \delta\,ability_i + v_i.$$
The true coefficient $\delta$ is likely to be positive.

19 Example
Hence, the direction of the bias of $\hat\beta_1$ when (3) is estimated is given by $\hat\nu \cdot \delta = (+)\cdot(+) = +$. The return to schooling tends to be overestimated.
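A short simulation sketch (my own, with invented numbers) reproduces this sign prediction: with ability positively related to both schooling and wages, the short regression's schooling coefficient comes out above the true value.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 2000
slopes = []
for _ in range(reps):
    ability = rng.normal(size=n)
    school = 0.6 * ability + rng.normal(size=n)     # schooling correlated with ability
    logw = 1.0 + 0.10 * school + 0.30 * ability + rng.normal(scale=0.5, size=n)
    X = np.column_stack([np.ones(n), school])       # ability omitted from the regression
    slopes.append(np.linalg.solve(X.T @ X, X.T @ logw)[1])

print("true beta1 = 0.10, mean OLS estimate =", np.mean(slopes))  # biased upward
```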

20 Variance
Taking into account that $X$ and $Z$ are non-stochastic and $E(\varepsilon) = 0$,
$$\mathrm{Var}(\tilde\beta) = E\left((\tilde\beta - E(\tilde\beta))(\tilde\beta - E(\tilde\beta))'\right) = \sigma^2(X'X)^{-1}.$$
Important: $s^2$ is a biased estimate of the error variance of the true regression. Indeed, $s^2 = \hat v'\hat v/(n - k_x)$, where $\hat v = y - X\tilde\beta = M_Xy$. By substituting the true model (1) for $y$, we get
$$\hat v = M_X(X\beta + Z\delta + \varepsilon) = M_X(Z\delta + \varepsilon).$$
Thus (show it)
$$E(s^2) = \sigma^2 + \frac{\delta'Z'M_XZ\delta}{n - k_x} \neq \sigma^2.$$
This implies that the t-statistics and F-statistics are invalid.

21 Summary of the consequences of omitting a relevant variable
- If the omitted variable(s) are correlated with the included variable(s), the parameter estimates are biased.
- The disturbance variance $\sigma^2$ is incorrectly estimated.
- As a result, the usual confidence intervals and hypothesis testing procedures are likely to give misleading conclusions.

22 Irrelevant Variables
Suppose that the correct model is
$$y = X\beta + \varepsilon \tag{4}$$
but we instead choose to estimate the bigger model
$$y = X\beta + Z\delta + v. \tag{5}$$
The OLS estimator of $\beta$ in (5) is
$$\tilde\beta = (X'M_ZX)^{-1}X'M_Zy.$$

23 Irrelevant Variables
In order to assess the statistical properties of the latter, we have to substitute the true model (4) for $y$. We obtain
$$\tilde\beta = (X'M_ZX)^{-1}X'M_Z(X\beta + \varepsilon) = \beta + (X'M_ZX)^{-1}X'M_Z\varepsilon.$$
Taking expectations, we have
$$E(\tilde\beta) = \beta + (X'M_ZX)^{-1}X'M_ZE(\varepsilon) = \beta.$$
Thus, $\tilde\beta$ is unbiased. However, it can be shown that
$$\mathrm{Var}(\tilde\beta) \geq \sigma^2(X'X)^{-1},$$
where $\sigma^2(X'X)^{-1}$ is the variance of the correct OLS estimator.

24 Summary of the consequences of including an irrelevant variable
OLS estimators of the parameters of the "incorrect" model are unbiased, but less precise (larger variance). Because the price for omitting a relevant variable is so much higher than the price for including irrelevant ones, many econometric theorists have suggested a "general to specific" modelling strategy. It entails, basically, initially including every regressor that may be suspected of being relevant; then a combination of $R^2$, F and t tests is used to eliminate the least significant regressors, hopefully leading to a "correct" model.

25 Ramsey's RESET test
The most commonly used test for checking the specification of the mean function is Ramsey's RESET test. Let $\hat y_i = x_i'\hat\beta$ be the fitted values of the OLS regression. The RESET procedure is to run the augmented regression
$$y_i = x_i'\beta + \gamma_1(\hat y_i)^2 + \gamma_2(\hat y_i)^3 + \ldots + \gamma_q(\hat y_i)^{q+1} + v_i$$
and perform an F test of the hypothesis
$$H_0: \gamma_1 = \gamma_2 = \ldots = \gamma_q = 0.$$
If we cannot reject $H_0$, the model is taken to be correctly specified.
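A compact sketch of the procedure (my own helper; the function name and interface are made up) via the standard restricted/unrestricted F statistic:

```python
import numpy as np
from scipy import stats

def reset_test(y, X, q=2):
    """Ramsey RESET sketch: add powers of the fitted values, F-test their coefficients."""
    n, k = X.shape
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    rss0 = np.sum((y - yhat) ** 2)                                   # restricted RSS
    Xa = np.column_stack([X] + [yhat ** (p + 2) for p in range(q)])  # add yhat^2, ..., yhat^{q+1}
    ea = y - Xa @ np.linalg.lstsq(Xa, y, rcond=None)[0]
    rss1 = np.sum(ea ** 2)                                           # unrestricted RSS
    F = ((rss0 - rss1) / q) / (rss1 / (n - k - q))
    return F, stats.f.sf(F, q, n - k - q)                            # reject H0 for small p-values
```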

26 Modelling Strategy
Approaches to model building: "general to specific" and "simple to general". The motivating factor when building a model should be ECONOMIC THEORY.
How do we judge a model to be "good"?
- Parsimony
- Identifiability
- Goodness of fit
- Theoretical consistency
- Predictive power

27 Selecting regressors
It is good practice to select the set of potentially relevant variables on the basis of economic arguments rather than statistical ones. There is always a small (but not ignorable) probability of drawing the wrong conclusion; for example, there is always a probability of rejecting the null hypothesis that a coefficient is zero while the null is actually true. Such type I errors are rather likely to happen if we use a sequence of many tests to select the regressors to include in the model. This process is referred to as data mining.
In presenting your estimation results, it is not a mistake to have insignificant variables included in your specification. Of course, you should be careful with including many variables in your model that are multicollinear, so that, in the end, almost none of the variables appear individually significant.

28 Spherical Disturbances
Assumption A3 of the Classical Linear Regression Model ensures that the variance-covariance matrix of the errors is
$$E(\varepsilon\varepsilon') = \sigma^2 I_n.$$
This implies two things:
- $E(\varepsilon_i^2) = \sigma^2$ for all $i$, i.e. homoskedasticity;
- $E(\varepsilon_i\varepsilon_j) = 0$ if $i \neq j$, i.e. no serial correlation.
Naturally, testing for heteroskedasticity and for serial correlation forms a very important part of the econometrics of the linear model. Before we do that, however, we need to look at the sources of what the textbooks call non-spherical errors, and at what the effects might be on our model.

29 Generalized Linear Regression Model
The Generalized Linear Regression Model is just the Classical Linear Regression Model, but with non-spherical disturbances (the covariance matrix is no longer proportional to the identity matrix). We consider the model $y = X\beta + \varepsilon$, where
$$E(\varepsilon) = 0 \quad \text{and} \quad E(\varepsilon\varepsilon') = \Sigma,$$
and $\Sigma$ is any symmetric positive definite matrix.

30 Non-spherical disturbances
We are interested in two different forms of non-spherical disturbances:
- Heteroscedasticity, with $E(\varepsilon_i^2) = \sigma_i^2$ and $E(\varepsilon_i\varepsilon_j) = 0$ if $i \neq j$, so that each observation has a different variance. In this case,
$$\Sigma = \sigma^2\Omega = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix} = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_n^2\}.$$
Note that $\Omega$ is a diagonal matrix of weights $\omega_i$; sometimes it is convenient to write $\sigma_i^2 = \sigma^2\omega_i$.
- Autocorrelation, with $E(\varepsilon_i^2) = \sigma^2$ and $E(\varepsilon_i\varepsilon_j) \neq 0$ if $i \neq j$. Autocorrelation is typically found in time series data.

31 Heteroscedasticity: graphical intuition
When data are homoscedastic we expect something like: [figure omitted: scatter with a constant spread of points around the regression line]

32 Heteroscedasticity: graphical intuition
With heteroscedasticity, we expect: [figure omitted: scatter whose spread around the regression line changes across the regressor]

33 Consequences for OLS estimation
Let's examine the statistical properties of the OLS estimator when $E(\varepsilon\varepsilon') = \sigma^2\Omega \neq \sigma^2 I_n$.
Unbiasedness: $\hat\beta$ is still unbiased, since $\hat\beta = \beta + (X'X)^{-1}X'\varepsilon$ and $E(\varepsilon) = 0$, so we conclude that $E(\hat\beta) = \beta$.
Efficiency: the variance of $\hat\beta$ changes to
$$\mathrm{Var}(\hat\beta) = E\left((\hat\beta - \beta)(\hat\beta - \beta)'\right) = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}.$$
The consequence is that the standard OLS estimated variances and standard errors (computed by statistical packages) are biased, since they are based on the wrong formula $\sigma^2(X'X)^{-1}$. Standard tests are not valid.

34 Consequences for OLS estimation
Since one of the Gauss-Markov assumptions is violated ($E(\varepsilon\varepsilon') \neq \sigma^2 I_n$), the OLS estimator of such models is no longer BLUE (although, remember, it is still unbiased). Possible remedies:
- Construct another estimator which is BLUE: we will call it Generalized Least Squares (GLS).
- Stick to OLS, but compute the correct estimated variance.

35 GLS
The idea is to transform the model $y = X\beta + \varepsilon$ so that the transformed model satisfies the Gauss-Markov assumptions. We assume for the time being that $\Omega$ is known (a rather unrealistic assumption).
Property. Since $\Omega$ is positive definite and symmetric, there exists a square, nonsingular matrix $P$ such that $P'P = \Omega^{-1}$.
Sketch of proof (spectral decomposition): since $\Omega$ is symmetric, there exist $C$ and $\Lambda$ such that $\Omega = C\Lambda C'$, $C'C = I_n$. Because $\Omega$ is positive definite (a positive definite matrix is always nonsingular):
$$\Omega^{-1} = C\Lambda^{-1}C' = P'P \quad \text{with} \quad P' = C\Lambda^{-1/2}C'.$$

36 GLS
We can transform the model $y = X\beta + \varepsilon$ by premultiplying it by the matrix $P$:
$$Py = PX\beta + P\varepsilon, \quad \text{i.e.} \quad y^* = X^*\beta + \varepsilon^*. \tag{6}$$
This transformed model satisfies the Gauss-Markov assumptions, since
$$E(\varepsilon^*) = E(P\varepsilon) = PE(\varepsilon) = 0$$
and
$$\mathrm{Var}(\varepsilon^*) = E(\varepsilon^*\varepsilon^{*\prime}) = E(P\varepsilon\varepsilon'P') = PE(\varepsilon\varepsilon')P' = \sigma^2 P\Omega P' = \sigma^2 P(P'P)^{-1}P' = \sigma^2 PP^{-1}(P')^{-1}P' = \sigma^2 I_n.$$

37 GLS
Hence, the OLS estimator of the transformed model (6) is BLUE:
$$\hat\beta_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'P'PX)^{-1}X'P'Py = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.$$
It is easy to verify that $E[\hat\beta_{GLS}] = \beta$ and
$$\mathrm{Var}[\hat\beta_{GLS}] = \sigma^2(X'\Omega^{-1}X)^{-1} = (X'\Sigma^{-1}X)^{-1}.$$
(I leave it to you as an exercise.)

38 GLS
Note: the GLS estimator we have just discussed is useful in any general case of non-spherical disturbances. The general formula is
$$\hat\beta_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y.$$
Now we will specialize it to the heteroscedastic case, i.e. when $\Sigma = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_n^2\}$.

39 Unfeasible GLS
We consider again the model $y_i = x_i'\beta + \varepsilon_i$, with $\mathrm{Var}(\varepsilon_i) = \sigma_i^2$. Thus, $\mathrm{Var}(\varepsilon) = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_n^2\}$. In this setting, the transformation $P$ is given by
$$P = \mathrm{diag}\left\{\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_n}\right\}.$$
It is easy to verify that $(P'P)^{-1} = \Omega = \Sigma$ when we set $\sigma^2 = 1$.

40 Unfeasible GLS
The transformed model is given by
$$\frac{y_i}{\sigma_i} = \frac{x_i'}{\sigma_i}\beta + \frac{\varepsilon_i}{\sigma_i}.$$
Thus,
$$\hat\beta_{GLS} = \left(\sum_{i=1}^n \frac{x_ix_i'}{\sigma_i^2}\right)^{-1}\sum_{i=1}^n \frac{x_iy_i}{\sigma_i^2},$$
which is also called "weighted least squares". This approach is only available if $\sigma_i^2$ is known.
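In code, the weighted least squares formula above is one line of linear algebra. A minimal sketch (the function name and interface are my own, not from the slides):

```python
import numpy as np

def wls(y, X, sigma2):
    """Weighted least squares with known error variances sigma2 (length-n array)."""
    w = 1.0 / sigma2
    XtWX = X.T @ (w[:, None] * X)   # sum of x_i x_i' / sigma_i^2
    XtWy = X.T @ (w * y)            # sum of x_i y_i / sigma_i^2
    return np.linalg.solve(XtWX, XtWy)
```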

41 Feasible GLS
Problem: $\sigma_i^2$ is not known in general. Therefore, the latter estimator is unfeasible. We need to use a feasible version of this estimator, which means we replace the unknown $\sigma_i^2$ with sensible estimates:
$$\hat\beta_{GLS} = \left(\sum_{i=1}^n \frac{x_ix_i'}{\hat\sigma_i^2}\right)^{-1}\sum_{i=1}^n \frac{x_iy_i}{\hat\sigma_i^2}.$$

42 Feasible GLS
The main issue is that we need to know the form of the heteroscedasticity in order to estimate $\sigma_i^2$. The simplest situation, for instance, would be if we had a conjecture that $\sigma_i^2 = z_i'a$, where the components of $z_i$ are observable. In such a case, a consistent estimator of $\sigma_i^2$ would be the fitted values of the artificial (so-called skedastic) regression
$$\hat\varepsilon_i^2 = z_i'a + v_i,$$
where $\hat\varepsilon_i$ are the residuals of the original regression when estimated by standard OLS. We can accommodate more complicated functional forms for the heteroscedasticity within this idea (e.g. $\sigma_i^2 = \exp(z_i'a)$).
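A sketch of feasible GLS under the exponential skedastic form mentioned above (an assumed specification; the function name is made up). Modelling $\log\hat\varepsilon_i^2$ keeps the fitted variances positive:

```python
import numpy as np

def fgls_exp(y, X, Z):
    """Feasible GLS sketch under the assumed form sigma_i^2 = exp(z_i'a)."""
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]     # OLS residuals of the original model
    a = np.linalg.lstsq(Z, np.log(e**2), rcond=None)[0]  # skedastic regression (residuals assumed nonzero)
    w = 1.0 / np.exp(Z @ a)                              # estimated 1/sigma_i^2, positive by construction
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
```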

43 White robust standard errors
Rather than estimating $\beta$ by GLS, we can stick to OLS and "correct" the standard errors. We have seen that
$$\mathrm{Var}[\hat\beta_{OLS}] = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} = (X'X)^{-1}X'\Sigma X(X'X)^{-1},$$
and in vector form
$$\mathrm{Var}[\hat\beta_{OLS}] = \left(\sum_{i=1}^n x_ix_i'\right)^{-1}\left(\sum_{i=1}^n \sigma_i^2x_ix_i'\right)\left(\sum_{i=1}^n x_ix_i'\right)^{-1}.$$
We do not know $\sigma_i^2$, but under general conditions White (1980) showed that the matrix
$$\frac{1}{n}\sum_{i=1}^n \hat\varepsilon_i^2x_ix_i',$$
where again $\hat\varepsilon_i$ are the residuals of the OLS estimation, is a consistent estimator of $\frac{1}{n}\sum_{i=1}^n \sigma_i^2x_ix_i'$.
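The resulting sandwich estimator is straightforward to compute. A minimal sketch (my own helper, not from the slides):

```python
import numpy as np

def white_se(y, X):
    """OLS with heteroscedasticity-robust (White) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    e = y - X @ beta
    meat = (X * e[:, None] ** 2).T @ X      # sum of e_i^2 x_i x_i'
    V = XtX_inv @ meat @ XtX_inv            # sandwich form
    return beta, np.sqrt(np.diag(V))
```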

44 Detection of heteroscedasticity
Informal methods: the graphical method.
Formal methods:
- Goldfeld-Quandt test
- Breusch-Pagan/Godfrey test
- White's general heteroscedasticity test

45 Goldfeld-Quandt test
Suppose that we think the heteroscedasticity is particularly related to one of the variables, i.e. one of the columns of $X$, say $x_j$. The Goldfeld-Quandt test is based on the following:
1. Order the observations according to the values of $x_j$, starting with the lowest.
2. Split the sample into three parts, of lengths $n_1$, $c$ and $n_2$.
3. Obtain the residuals $\hat\varepsilon_1$ and $\hat\varepsilon_2$ from regressions using the first $n_1$, and then the last $n_2$, observations separately.
4. Test $H_0: \sigma_i^2 = \sigma^2$ for all $i$ using the statistic
$$F = \frac{\hat\varepsilon_1'\hat\varepsilon_1/(n_1 - k)}{\hat\varepsilon_2'\hat\varepsilon_2/(n_2 - k)},$$
with critical values obtained from the $F(n_1 - k, n_2 - k)$ distribution.
Practical advice: use this test for relatively small sample sizes (up to $n = 100$); with $n = 30$ choose $c = 8$, with $n = 60$ choose $c = 16$.

46 Breusch-Pagan/Godfrey test
Suppose that the source of heteroscedasticity is that $\sigma_i^2 = g(\alpha_0 + \tilde z_i'\alpha)$, where $\tilde z_i$ is the $q \times 1$ vector of regressors for observation $i$, i.e. row $i$ of a matrix $Z$, which is $n \times q$. $Z$ can contain some or all of the $X$'s in our CLRM, if desired, although ideally the choice of $Z$ is made on the basis of some economic theory. For computational purposes $Z$ does not contain a constant term.
This test involves an auxiliary regression
$$v = \tilde Z\gamma + u, \quad \tilde Z = M_0Z, \quad \text{where } M_0 = I - \frac{ee'}{e'e} \text{ with } e = (1, 1, \ldots, 1)'.$$
Here $u$ is a vector of iid errors and the dependent variable is defined by
$$v_i = \frac{\hat\varepsilon_i^2}{(\hat\varepsilon'\hat\varepsilon)/n} - 1.$$

47 Breusch-Pagan/Godfrey test
The test corresponds to an F test of the joint restrictions $H_0: \gamma = 0$ in the auxiliary regression, and the test statistic is given by
$$BP = \frac{1}{2}\,v'\tilde Z(\tilde Z'\tilde Z)^{-1}\tilde Z'v,$$
which follows a $\chi^2(q)$ distribution in large samples.
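A direct implementation sketch of the formula above (function name invented; the demeaning of $Z$ plays the role of $M_0Z$):

```python
import numpy as np
from scipy import stats

def breusch_pagan(y, X, Z):
    """Breusch-Pagan/Godfrey sketch: BP = v'Z~(Z~'Z~)^{-1}Z~'v / 2, asy. chi2(q)."""
    n = len(y)
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    v = e**2 / (e @ e / n) - 1.0            # standardized squared residuals minus 1
    Zt = Z - Z.mean(axis=0)                 # demeaned regressors, i.e. M0 Z
    proj = Zt @ np.linalg.solve(Zt.T @ Zt, Zt.T @ v)
    BP = 0.5 * (v @ proj)
    return BP, stats.chi2.sf(BP, Z.shape[1])
```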

48 White test
For simplicity, assume that the data follow $y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \varepsilon_i$.
1. Run your regression and save the residuals (denoted $\hat\varepsilon_i$, as usual).
2. Run an auxiliary (artificial) regression with the squared residuals as the dependent variable, where the explanatory variables include all the explanatory variables from step 1 plus their squares and cross products:
$$\hat\varepsilon_i^2 = \alpha_0 + \alpha_1x_{1i} + \alpha_2x_{2i} + \alpha_3x_{1i}^2 + \alpha_4x_{2i}^2 + \alpha_5x_{1i}x_{2i} + v_i.$$
3. Obtain $R^2$ from the auxiliary regression in step 2.
4. Construct the test statistic $nR^2 \overset{asy.}{\sim} \chi^2(k - 1)$, where $k$ is the number of parameters estimated in the auxiliary regression, so $k = 6$ in this example.
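A sketch of these four steps for the two-regressor model above (my own helper; the function name is made up):

```python
import numpy as np
from scipy import stats

def white_test(y, x1, x2):
    """White's nR^2 test sketch for y = b0 + b1*x1 + b2*x2 + e."""
    n = len(y)
    X = np.column_stack([np.ones(n), x1, x2])
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]      # step 1: residuals
    A = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])
    e2 = e**2
    fit = A @ np.linalg.lstsq(A, e2, rcond=None)[0]       # step 2: auxiliary regression
    R2 = 1 - np.sum((e2 - fit) ** 2) / np.sum((e2 - e2.mean()) ** 2)  # step 3
    stat = n * R2                                         # step 4
    return stat, stats.chi2.sf(stat, A.shape[1] - 1)      # chi2 with k - 1 = 5 df here
```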

49 White test
Note that $H_0: \mathrm{Var}(\varepsilon_i) = \sigma^2$ vs. $H_1$: heteroscedasticity. Thus, if the calculated value of the test statistic exceeds the critical chi-square value at the chosen level of significance, we reject the null that the error has a constant variance. The White test is a large sample test, so it is expected to work well only when the sample is sufficiently large.

50 Summary of heteroscedasticity
A crucial CLRM assumption is that $\mathrm{Var}(\varepsilon_i) = \sigma^2$. If $\mathrm{Var}(\varepsilon_i) = \sigma_i^2$ instead, we have heteroscedasticity. OLS estimators are then no longer BLUE: still unbiased, but not efficient.
Two remedial approaches:
- when $\sigma_i^2$ is known: weighted least squares (WLS);
- when $\sigma_i^2$ is unknown: make an educated guess about the likely pattern of the heteroscedasticity to consistently estimate $\sigma_i^2$ and use it in feasible GLS, or use White's heteroscedasticity-consistent variances and standard errors.
A number of tests are available to detect heteroscedasticity.

51 Serial correlation: the general case
Now we want to apply the results we discussed to the model $y_t = x_t'\beta + \varepsilon_t$, where $x_t$ is the usual $k \times 1$ vector, $E(\varepsilon_t) = 0$ and $E(\varepsilon_t\varepsilon_s) \neq 0$ for some $t \neq s$.
Note that autocorrelation (or serial correlation; the terms are interchangeable) is very common in time series data, for instance due to unobserved factors (omitted variables) that are correlated over time. A common form of serial correlation is the autoregressive structure.

52 Autocorrelation patterns
There are several forms of autocorrelation, each leading to a different structure for the error covariance matrix $\mathrm{Var}(\varepsilon) = \sigma^2\Omega$. We will only consider AR(1) structures here, i.e.
$$\varepsilon_t = \rho\varepsilon_{t-1} + v_t, \quad v_t \sim iid(0, \sigma_v^2), \quad |\rho| < 1.$$
It is easy to show that $E(\varepsilon_t) = 0$. We can also show that
$$E(\varepsilon\varepsilon') = \frac{\sigma_v^2}{1 - \rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \vdots & & & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \cdots & \rho & 1 \end{pmatrix}.$$

53 Autocorrelation patterns
As we discussed, in the case of non-spherical disturbances we can either stick to OLS and correct the standard errors, or derive GLS, which requires a feasible version of the matrix $P$ to be implemented. A suitable matrix $P$ can be derived fairly easily in the case of an AR(1) error term as
$$P = \begin{pmatrix} \sqrt{1 - \rho^2} & 0 & \cdots & 0 \\ -\rho & 1 & & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & -\rho & 1 \end{pmatrix}.$$
Rather than on the derivation of $P$, we focus on tests for detecting serial correlation.

54 Adjusting the standard errors when errors are serially correlated
When all other classical regression assumptions are satisfied (in particular, uncorrelatedness between the errors and the regressors), we may decide to use OLS, which is unbiased and consistent but inefficient. The corrected standard errors can be derived from
$$\mathrm{Var}(\hat\beta) = (X'X)^{-1}X'\Sigma X(X'X)^{-1},$$
using the Newey-West heteroskedasticity and autocorrelation consistent (HAC) estimator of $\Sigma$.

55 Testing for serial correlation
Informal checks:
- plot the residuals and check whether there is any systematic pattern;
- obtain the correlogram (graphical representation of the autocorrelations) of the residuals and check whether the calculated autocorrelations are significantly different from zero. From the correlogram you can also "guess" the structure of the autocorrelation.
Tests:
- Durbin-Watson d test
- Breusch-Godfrey test

56 Durbin-Watson d test
The first test we consider for testing $H_0: \rho = 0$ is the Durbin-Watson $d$ statistic:
$$d = \frac{\sum_{t=2}^T(\hat\varepsilon_t - \hat\varepsilon_{t-1})^2}{\sum_{t=1}^T \hat\varepsilon_t^2},$$
where $\hat\varepsilon_t$ are the OLS residuals obtained by estimating the parameters. It is possible to show that for large $T$, $d \to 2 - 2\rho$. The value of $d$ is given by any statistical package. If the value of $d$ is close to 2, we can conclude that the model is free from serial correlation. We need to establish a precise rejection rule.
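The statistic itself is a one-liner on the OLS residuals; a minimal sketch (my own helper):

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson d from OLS residuals e; values near 2 suggest no AR(1) correlation."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```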

57 Durbin-Watson d test
Unfortunately, establishing a rejection rule for a test of $H_0$ based on the $d$ statistic is more complicated than usual, because the critical values also depend on the values of the regressors. The best we can do is to give upper and lower bounds for the critical values of $d$, $d_U$ and $d_L$, and establish the following rejection rule:
- If the computed value of $d$ is lower than $d_L$, we reject $H_0$.
- If the computed value of $d$ is greater than $d_U$, we fail to reject $H_0$.
- If the computed value of $d$ is between $d_L$ and $d_U$, we cannot conclude (the outcome is indeterminate).
$d_L$ and $d_U$ in the statistical tables are given in terms of the number of observations and the number of regressors.

58 Durbin-Watson d test
Limitations: the $d$ test is only valid when:
- the model contains an intercept;
- there are no lagged dependent variables among the regressors.

59 Breusch-Godfrey test
Suppose we have the general model
$$y_t = x_t'\beta + \varepsilon_t, \quad \text{where } \varepsilon_t = \rho\varepsilon_{t-1} + v_t, \tag{7}$$
and where one or more of the regressors can be a lagged $y$ (such as $y_{t-1}$). If we wish to test $H_0: \rho = 0$, we can:
1. Estimate the parameters of (7) by OLS and save the residuals $\hat\varepsilon_t$ for $t = 1, \ldots, T$.
2. Estimate the following residual regression:
$$\hat\varepsilon_t = x_t'\gamma + \delta\hat\varepsilon_{t-1} + \nu_t. \tag{8}$$
3. From the output of regression (8), compute $nR^2$ ($n$ is the actual number of observations, which is $T - 1$ in this case, as the first one is lost). Under $H_0$, $nR^2$ is asymptotically $\chi^2$ with one degree of freedom.
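A sketch of these steps for first-order serial correlation (my own helper; the function name and interface are invented):

```python
import numpy as np
from scipy import stats

def breusch_godfrey(y, X):
    """Breusch-Godfrey nR^2 sketch for AR(1) serial correlation."""
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]      # step 1: OLS residuals
    A = np.column_stack([X[1:], e[:-1]])                  # step 2: regressors plus lagged residual
    fit = A @ np.linalg.lstsq(A, e[1:], rcond=None)[0]
    R2 = 1 - np.sum((e[1:] - fit) ** 2) / np.sum((e[1:] - e[1:].mean()) ** 2)
    stat = len(e[1:]) * R2                                # step 3: n = T - 1 observations
    return stat, stats.chi2.sf(stat, 1)
```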

60 Maximum Likelihood Estimation
The most direct way of estimating unknown parameters is known as maximum likelihood estimation. Although the principle can be applied more generally, assume that the data $\{y_i\}_{i=1}^n$ are iid copies of the random variable $Y$, which is known to have density function $f(y, \theta)$. As the $y_i$ are iid, the joint sample density is
$$f(y_1, \ldots, y_n, \theta) = \prod_{i=1}^n f(y_i, \theta).$$
We then interpret this as a function of the unknown parameters given the observed data, i.e.
$$L(\theta, y) = f(y_1, \ldots, y_n, \theta),$$
called the likelihood. Then we simply ask which values of $\theta$ are most likely given this data, i.e. we maximize the likelihood with respect to $\theta$. Actually, since log is a monotone function, we define the MLE to be
$$\hat\theta = \arg\max_\theta \ell(\theta), \quad \text{where } \ell(\theta) = \log L(\theta, y).$$

61 Maximum Likelihood Estimation in the NLRM
Consider the model $y = X\beta + \varepsilon$ and let the standard assumptions (A1-A4) hold. Additionally assume that the errors are normally distributed, i.e.
A5: $\varepsilon \sim N(0, \sigma^2 I_n)$.
Since $y = X\beta + \varepsilon$, we immediately have $y \sim N(X\beta, \sigma^2 I_n)$, because if $w \sim N(\mu, \Sigma)$, then for any $n \times m$ matrix $A$ and $m \times 1$ vector $b$, $z = A'w + b \sim N(A'\mu + b, A'\Sigma A)$.

62 Maximum Likelihood Estimation in the NLRM
Moreover, we can write down the joint density (multivariate normal)
$$f(y) = \frac{\exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\}}{(2\pi\sigma^2)^{n/2}},$$
from which we obtain the log-likelihood
$$\ell(\beta, \sigma^2) = -\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) - \frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln 2\pi.$$

63 Maximum Likelihood Estimation in the NLRM
Thus the maximum likelihood estimators must satisfy the FOC (the score vector equals 0 at the MLE), i.e.
$$S(\beta, \sigma^2) = \begin{pmatrix} \partial\ell(\beta,\sigma^2)/\partial\beta \\ \partial\ell(\beta,\sigma^2)/\partial\sigma^2 \end{pmatrix} = 0,$$
with
$$\frac{\partial\ell(\beta, \sigma^2)}{\partial\beta} = \frac{X'y - X'X\beta}{\sigma^2} = 0 \tag{9}$$
and
$$\frac{\partial\ell(\beta, \sigma^2)}{\partial\sigma^2} = \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) - \frac{n}{2\sigma^2} = 0.$$
Solving (9), we find
$$\hat\beta_{MLE} = (X'X)^{-1}X'y, \qquad \hat\sigma^2_{MLE} = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n} = \frac{y'My}{n}.$$
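A numerical check sketch (my own, with invented data): maximizing the log-likelihood above by a generic optimizer recovers the closed-form solutions. The $\sigma^2 = \exp(\cdot)$ reparametrization just keeps the variance positive during the search.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def neg_loglik(theta):
    beta, log_s2 = theta[:k], theta[k]          # parametrize sigma^2 = exp(log_s2) > 0
    e = y - X @ beta
    return 0.5 * (e @ e) / np.exp(log_s2) + 0.5 * n * (log_s2 + np.log(2 * np.pi))

res = minimize(neg_loglik, np.zeros(k + 1))
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # closed form matches the numeric optimum
print(res.x[:k], beta_hat)
print(np.exp(res.x[k]), np.mean((y - X @ beta_hat) ** 2))  # sigma^2_MLE = e'e / n
```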

64 Maximum Likelihood Estimation in the NLRM
The SOC is that the matrix of second-order derivatives, evaluated at the MLE,
$$H(\beta, \sigma^2)\Big|_{\hat\beta_{MLE},\,\hat\sigma^2_{MLE}} = \begin{pmatrix} \frac{\partial^2\ell}{\partial\beta\,\partial\beta'} & \frac{\partial^2\ell}{\partial\beta\,\partial\sigma^2} \\ \frac{\partial^2\ell}{\partial\sigma^2\,\partial\beta'} & \frac{\partial^2\ell}{\partial(\sigma^2)^2} \end{pmatrix}\Bigg|_{\hat\beta_{MLE},\,\hat\sigma^2_{MLE}},$$
is negative definite. Note that
$$H(\beta, \sigma^2) = \begin{pmatrix} -\frac{X'X}{\sigma^2} & -\frac{X'y - X'X\beta}{\sigma^4} \\ -\frac{(X'y - X'X\beta)'}{\sigma^4} & -\frac{1}{\sigma^6}(y - X\beta)'(y - X\beta) + \frac{n}{2\sigma^4} \end{pmatrix}$$
and
$$H(\hat\beta_{MLE}, \hat\sigma^2_{MLE}) = \begin{pmatrix} -\frac{X'X}{\hat\sigma^2_{MLE}} & 0 \\ 0 & -\frac{n}{2\hat\sigma^4_{MLE}} \end{pmatrix}, \tag{10}$$
which is obviously negative definite.

65 Properties of MLE
Clearly $\hat\beta_{MLE} \sim N(\beta, \sigma^2(X'X)^{-1})$. Note that the OLS estimator equals $\hat\beta_{MLE}$ under normality. We proved that the OLS estimator is BLUE by the Gauss-Markov Theorem (smallest variance in the class of linear unbiased estimators). Under normality, we can strengthen our notion of efficiency and show that $\hat\beta_{MLE}$ is BUE (Best Unbiased Estimator) by the Cramer-Rao Lower Bound (the variance of any unbiased estimator is at least as large as the inverse of the information).

66 Cramer-Rao Lower Bound
The information is defined as $I_n(\beta, \sigma^2) = -E[H(\beta, \sigma^2)]$, and given (10), in our case we have
$$I_n(\beta, \sigma^2) = \begin{pmatrix} \frac{X'X}{\sigma^2} & 0 \\ 0 & \frac{n}{2\sigma^4} \end{pmatrix},$$
so that the inverse of the information is
$$I_n(\beta, \sigma^2)^{-1} = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{pmatrix}.$$

67 Maximum Likelihood Estimation in the NLRM
Since we know that $V[\hat\beta_{MLE}] = \sigma^2(X'X)^{-1}$, clearly $\hat\beta_{MLE}$ is BUE. Note that $\hat\sigma^2_{MLE}$ is biased, but we can define the unbiased estimator
$$s^2 = \frac{n}{n - k}\hat\sigma^2_{MLE} = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n - k}.$$
Unfortunately, the unbiased $s^2$ is not BUE, since
$$V[s^2] = \frac{2\sigma^4}{n - k} > \frac{2\sigma^4}{n}.$$
Note that as $n \to \infty$, $s^2$ becomes efficient as long as $k$ is fixed.


69 Oaxaca-Blinder (1973)
A microeconometric decomposition technique which allows us to study the differences in outcomes between two groups, decomposing them into differences in characteristics (explained variation) and differences in parameters (discrimination / unexplained variation). Most often used in the literature on wage inequalities (female vs. male, union vs. non-union workers, public vs. private sector workers, migrant vs. native workers).

70 Oaxaca-Blinder: preliminaries
Assume that $y$ is explained by the vector of regressors $x$ as in the linear regression model:
$$y_i = \begin{cases} x_i'\beta^{female} + \varepsilon_i^{female}, & \text{if } i \text{ is female} \\ x_i'\beta^{male} + \varepsilon_i^{male}, & \text{if } i \text{ is male} \end{cases}$$
where $\beta$ also contains the intercept. Assume that men are privileged. The difference in mean outcomes is
$$\bar y^{male} - \bar y^{female} = \bar x^{male\prime}\beta^{male} - \bar x^{female\prime}\beta^{female}.$$

71 Oaxaca-Blinder
Alternatively:
$$\bar y^{male} - \bar y^{female} = \Delta\bar x'\beta^{male} + \Delta\beta'\bar x^{female}$$
or
$$\bar y^{male} - \bar y^{female} = \Delta\bar x'\beta^{female} + \Delta\beta'\bar x^{male}.$$
Differences in outcomes come from different characteristics and different parameters (females have worse $x$ and worse $\beta$). Even more generally:
$$\bar y^{male} - \bar y^{female} = \Delta\bar x'\beta^{female} + \Delta\beta'\bar x^{female} + \Delta\bar x'\Delta\beta = E + C + CE.$$
Problem: which group do we pick as the reference?

72 General version of the decomposition
General equation:
$$\bar y^{male} - \bar y^{female} = \underbrace{\beta^{*\prime}(\bar x^{male} - \bar x^{female})}_{\text{diff. in characteristics}} + \underbrace{\bar x^{male\prime}(\beta^{male} - \beta^*)}_{\text{male advantage}} + \underbrace{\bar x^{female\prime}(\beta^* - \beta^{female})}_{\text{female disadvantage}},$$
where the last two terms capture differences in parameters and
$$\beta^* = \lambda\beta^{male} + (1 - \lambda)\beta^{female}.$$

73 General version of the decomposition
Choices of $\lambda$ (or $\beta^*$) and their interpretation:
- $\lambda = 1$: male parameters as the reference.
- $\lambda = 0$: female parameters as the reference.
- $\lambda = 0.5$: average of the two, Reimers (1983).
- $\lambda = \%male$: parameters weighted by the sample proportion, Cotton (1988).
- $\beta^* = \beta^{pooled}$: parameters for the whole sample without a gender dummy, Neumark (1988).
- $\beta^* = \beta^{pooled}$: parameters for the whole sample with a gender dummy, Fortin (2008).
- $\lambda = \%female$: parameters weighted by the opposite sex, Słoczyński (2013).
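A minimal sketch of the twofold decomposition (function name and interface invented; it assumes each design matrix includes an intercept column, so that group means of $y$ equal $\bar x'\hat\beta$):

```python
import numpy as np

def oaxaca_blinder(y_m, X_m, y_f, X_f, lam=1.0):
    """Twofold Oaxaca-Blinder sketch; lam = 1 uses male coefficients as the reference."""
    b_m = np.linalg.lstsq(X_m, y_m, rcond=None)[0]
    b_f = np.linalg.lstsq(X_f, y_f, rcond=None)[0]
    b_star = lam * b_m + (1 - lam) * b_f
    xbar_m, xbar_f = X_m.mean(axis=0), X_f.mean(axis=0)
    explained = (xbar_m - xbar_f) @ b_star                          # diff. in characteristics
    unexplained = xbar_m @ (b_m - b_star) + xbar_f @ (b_star - b_f) # diff. in parameters
    return explained, unexplained   # the two parts sum to the raw mean gap
```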

74 Some asymptotic results prior to stochastic regressors
The idea is to treat results that hold as $n \to \infty$ as approximations for finite $n$. In particular, we will be interested in whether estimators are consistent and what their asymptotic distribution is. Asymptotic results are useful in finite samples (finite $n$) because, for instance, we may not be able to show unbiasedness, or we may not be able to determine the sampling distribution needed to carry out statistical inference.

75 Asymptotic theory: some results
Definition (Convergence in Probability). Let $X_n = \{X_i, i = 1, \ldots, n\}$ be a sequence of random variables. $X_n$ converges in probability to $X$ if
$$\lim_{n\to\infty} \Pr(|X_n - X| > \varepsilon) = 0 \quad \text{for any } \varepsilon > 0.$$
You will see convergence in probability written as either $X_n \to_p X$ or $\mathrm{plim}_{n\to\infty} X_n = X$.

76 Convergence in Probability
The idea behind this type of convergence is that the probability of an "unusual" outcome becomes smaller and smaller as the sequence progresses.
Example: suppose you take a basketball and start shooting free throws. Let $X_n$ be your success percentage at the $n$th shot. Initially you are likely to miss a lot, but as time goes on your skill increases and you are more likely to make the shots. After years of practice the probability that you miss becomes increasingly small. Thus, as $n \to \infty$, the sequence $X_n$ converges in probability to $X = 100\%$. Note that the probability of success never actually becomes 100%, as there is always a small probability of missing.

77 Consistency
Definition (Consistency). An estimator $\hat\theta$ of $\theta$ is consistent if, as the sample size increases, $\hat\theta$ gets "closer" to $\theta$. Formally, $\hat\theta$ is consistent when $\mathrm{plim}_{n\to\infty}\hat\theta = \theta$.
Useful result (Slutsky Theorem). Let $g(\cdot)$ be a continuous function. Then $\mathrm{plim}_{n\to\infty}\, g(X_n) = g(\mathrm{plim}_{n\to\infty} X_n)$.
Useful result (Chebyshev's inequality). For a random variable $X_n$ with mean $\mu$ and variance $\mathrm{Var}(X_n)$,
$$\Pr(|X_n - \mu| > \varepsilon) \leq \frac{\mathrm{Var}(X_n)}{\varepsilon^2} \quad \text{for any } \varepsilon > 0.$$

78 Consistency
A sufficient condition for consistency (not a necessary one): if
$$\lim_{n\to\infty} E(\hat\theta) = \theta \quad \text{and} \quad \mathrm{Var}(\hat\theta) \to 0 \text{ as } n \to \infty,$$
then
$$\lim_{n\to\infty} \Pr(|\hat\theta - \theta| > \varepsilon) = 0, \quad \text{or, as we write,} \quad \mathrm{plim}_{n\to\infty}\hat\theta = \theta.$$
If $\hat\theta$ is unbiased, in order to determine whether it is also consistent, we only need to check that $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$. Note that biased estimators can be consistent as long as $\lim_{n\to\infty} E(\hat\theta) = \theta$ and $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$.

79 Unbiasedness versus Consistency
Unbiased, but not consistent: suppose that given an iid sample $\{X_1, \ldots, X_n\}$ we want to estimate the mean of $X$, i.e. $E(X)$. We could use the first observation as the estimator, so $\hat\theta = X_1$. $\hat\theta$ is unbiased, since $E(\hat\theta) = E(X_1) = E(X)$, because we have an iid sample. However, $\hat\theta$ does not converge to any value, therefore it cannot be consistent.
Biased, but consistent: an alternative estimator for the mean of an iid sample $\{X_1, \ldots, X_n\}$ might be
$$\tilde\theta = \frac{1}{n}\sum_{i=1}^n X_i + \frac{1}{n}.$$
$\tilde\theta$ is biased, since $E(\tilde\theta) = E(X) + \frac{1}{n}$, but
$$\lim_{n\to\infty} E(\tilde\theta) = E(X) \quad \text{and} \quad \mathrm{Var}(\tilde\theta) = \frac{\mathrm{Var}(X)}{n} \to 0 \text{ as } n \to \infty.$$
Therefore, $\tilde\theta$ is consistent.
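A quick simulation sketch (my own, using an arbitrary N(5, 1) population) contrasts the two estimators: the first stays unbiased but never concentrates, while the second's bias and spread both vanish as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)
for n in (10, 100, 10_000):
    draws = rng.normal(loc=5.0, size=(2000, n))     # 2000 samples of size n, true mean 5
    theta1 = draws[:, 0]                            # unbiased but not consistent
    theta2 = draws.mean(axis=1) + 1.0 / n           # biased but consistent
    print(n, theta1.mean(), theta1.std(), theta2.mean(), theta2.std())
```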

80 Law of Large Numbers
Laws of large numbers provide conditions ensuring that sample moments converge to their population counterparts.
Weak Law of Large Numbers. Let $X_n = \{X_i, i = 1, \ldots, n\}$ be an independent and identically distributed (iid) sequence of random variables with $E|X_i| < \infty$. Then
$$\frac{1}{n}\sum_{i=1}^n X_i \to_p E(X_i).$$
We only consider this version of the WLLN (for iid data). When data are not iid, as with time series data, stronger conditions are needed.

81 Law of Large Numbers
Example: consider a coin (heads and tails) being flipped. Logic says there are equal chances of getting heads or tails. If the coin is flipped 10 times, there is a good chance that the proportions of heads and tails are not equal. Crudely, the law of large numbers says that as the number of flips increases, the proportion of heads converges in probability to 0.5.

82 Sampling distribution
When we do not know the exact sampling distribution of an estimator (for instance, when we do not assume normality of the error term), we may ask whether asymptotics allow us to infer something about its distribution for large $n$, so that we are still able to make inference on the estimates.
Definition (Convergence in Distribution). Let $X_n = \{X_i, i = 1, \ldots, n\}$ be a sequence of random variables and let $X$ be a random variable with distribution $F_X(x)$. $X_n$ converges in distribution to $X$ if
$$\lim_{n\to\infty} \Pr(X_n \leq x) = F_X(x),$$
written $X_n \to_d X$. The approximating distribution $F_X(x)$ is called a limiting or asymptotic distribution.

83 Useful results
Convergence in probability implies convergence in distribution, i.e. $X_n \to_p X \Rightarrow X_n \to_d X$.
Convergence in distribution to a constant implies convergence in probability to that constant, i.e. $X_n \to_d c \Rightarrow X_n \to_p c$.
Suppose that $\{Y_n\}$ is another sequence of random variables and let $g(\cdot)$ be a continuous function. If $X_n \to_d X$ and $Y_n \to_p c$, then $g(X_n, Y_n) \to_d g(X, c)$. For example, $X_n + Y_n \to_d X + c$.

84 Central Limit Theorem
The most important example of convergence in distribution is the Central Limit Theorem, which is useful for establishing asymptotic distributions. If $X_1, \ldots, X_n$ is an iid sample from any probability distribution with finite mean $\mu$ and finite variance $\sigma^2$, we have
$$\sqrt{n}(\bar X - \mu) \to_d N(0, \sigma^2), \quad \text{or equivalently} \quad \frac{\sqrt{n}(\bar X - \mu)}{\sigma} \to_d N(0, 1).$$
The CLT guarantees that, even if the errors are not normally distributed, but simply iid with zero mean and variance $\sigma^2$, we can conclude
$$\sqrt{n}\left(\hat\beta - \beta\right) \to_d N\left(0,\; \sigma^2\left(\lim \frac{1}{n}X'X\right)^{-1}\right).$$
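A simulation sketch of the first statement (my own illustration): even for skewed Exp(1) data, where $\mu = \sigma = 1$, the standardized sample mean behaves like a standard normal.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 500, 5000
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (means - 1.0) / 1.0     # standardize with mu = sigma = 1
print(z.mean(), z.std())                 # approximately 0 and 1
print(np.mean(np.abs(z) < 1.96))         # approximately 0.95, as N(0,1) predicts
```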

85 Limiting distributions of test statistics
Thus, when the disturbances are iid with zero mean and finite variance, the tests we previously discussed are asymptotically valid, and under $H_0$
$$t \to_d N(0, 1), \qquad W = J\,F \to_d \chi^2_J,$$
where $J$ is the number of restrictions being tested.

86 Law of iterated expectations
We have a useful result that we will use extensively. Let $X$ and $Y$ be two random variables; then
$$E(Y) = E[E(Y|X)].$$
A very useful by-product of the law of iterated expectations is
$$E(XY) = E[X\,E(Y|X)],$$
sometimes written as $E(XY) = E_X[X\,E(Y|X)]$ to indicate that the outer expectation is taken with respect to $X$.

87 Stochastic regressors
We consider again $y = X\beta + \varepsilon$. So far, we have assumed that the regressors are fixed. This is a rather unrealistic assumption, so we now allow $X$ to have stochastic components. However, a crucial condition that has to hold to perform regression analysis is that the regressors and the errors must be uncorrelated.

88 Assumptions
Our new, revised assumptions become:
A1 The true model is $y = X\beta + \varepsilon$.
A2 For all $i$, $E(\varepsilon_i|X) = 0$ (conditional zero mean); more about this later.
A3 $\mathrm{Var}(\varepsilon|X) = E(\varepsilon\varepsilon'|X) = \sigma^2 I$ (spherical disturbances).
A4 $X$ has full column rank.
A5 (eventually, for testing purposes) $\varepsilon|X \sim N(0, \sigma^2 I)$.
Under A1-A4 the OLS estimator is Gauss-Markov efficient.


More information

1 Regression with Time Series Variables

1 Regression with Time Series Variables 1 Regression with Time Series Variables With time series regression, Y might not only depend on X, but also lags of Y and lags of X Autoregressive Distributed lag (or ADL(p; q)) model has these features:

More information

Advanced Econometrics I

Advanced Econometrics I Lecture Notes Autumn 2010 Dr. Getinet Haile, University of Mannheim 1. Introduction Introduction & CLRM, Autumn Term 2010 1 What is econometrics? Econometrics = economic statistics economic theory mathematics

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS Page 1 MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level

More information

Problem set 1 - Solutions

Problem set 1 - Solutions EMPIRICAL FINANCE AND FINANCIAL ECONOMETRICS - MODULE (8448) Problem set 1 - Solutions Exercise 1 -Solutions 1. The correct answer is (a). In fact, the process generating daily prices is usually assumed

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

Applied Quantitative Methods II

Applied Quantitative Methods II Applied Quantitative Methods II Lecture 4: OLS and Statistics revision Klára Kaĺıšková Klára Kaĺıšková AQM II - Lecture 4 VŠE, SS 2016/17 1 / 68 Outline 1 Econometric analysis Properties of an estimator

More information

Economics 241B Estimation with Instruments

Economics 241B Estimation with Instruments Economics 241B Estimation with Instruments Measurement Error Measurement error is de ned as the error resulting from the measurement of a variable. At some level, every variable is measured with error.

More information

A Course on Advanced Econometrics

A Course on Advanced Econometrics A Course on Advanced Econometrics Yongmiao Hong The Ernest S. Liu Professor of Economics & International Studies Cornell University Course Introduction: Modern economies are full of uncertainties and risk.

More information

Heteroskedasticity. Part VII. Heteroskedasticity

Heteroskedasticity. Part VII. Heteroskedasticity Part VII Heteroskedasticity As of Oct 15, 2015 1 Heteroskedasticity Consequences Heteroskedasticity-robust inference Testing for Heteroskedasticity Weighted Least Squares (WLS) Feasible generalized Least

More information

Lecture Notes on Measurement Error

Lecture Notes on Measurement Error Steve Pischke Spring 2000 Lecture Notes on Measurement Error These notes summarize a variety of simple results on measurement error which I nd useful. They also provide some references where more complete

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data July 2012 Bangkok, Thailand Cosimo Beverelli (World Trade Organization) 1 Content a) Classical regression model b)

More information

PBAF 528 Week 8. B. Regression Residuals These properties have implications for the residuals of the regression.

PBAF 528 Week 8. B. Regression Residuals These properties have implications for the residuals of the regression. PBAF 528 Week 8 What are some problems with our model? Regression models are used to represent relationships between a dependent variable and one or more predictors. In order to make inference from the

More information

Chapter 8 Heteroskedasticity

Chapter 8 Heteroskedasticity Chapter 8 Walter R. Paczkowski Rutgers University Page 1 Chapter Contents 8.1 The Nature of 8. Detecting 8.3 -Consistent Standard Errors 8.4 Generalized Least Squares: Known Form of Variance 8.5 Generalized

More information

ECON 4230 Intermediate Econometric Theory Exam

ECON 4230 Intermediate Econometric Theory Exam ECON 4230 Intermediate Econometric Theory Exam Multiple Choice (20 pts). Circle the best answer. 1. The Classical assumption of mean zero errors is satisfied if the regression model a) is linear in the

More information

Review of Econometrics

Review of Econometrics Review of Econometrics Zheng Tian June 5th, 2017 1 The Essence of the OLS Estimation Multiple regression model involves the models as follows Y i = β 0 + β 1 X 1i + β 2 X 2i + + β k X ki + u i, i = 1,...,

More information

1/34 3/ Omission of a relevant variable(s) Y i = α 1 + α 2 X 1i + α 3 X 2i + u 2i

1/34 3/ Omission of a relevant variable(s) Y i = α 1 + α 2 X 1i + α 3 X 2i + u 2i 1/34 Outline Basic Econometrics in Transportation Model Specification How does one go about finding the correct model? What are the consequences of specification errors? How does one detect specification

More information

Exercises Chapter 4 Statistical Hypothesis Testing

Exercises Chapter 4 Statistical Hypothesis Testing Exercises Chapter 4 Statistical Hypothesis Testing Advanced Econometrics - HEC Lausanne Christophe Hurlin University of Orléans December 5, 013 Christophe Hurlin (University of Orléans) Advanced Econometrics

More information

Chapter 1. GMM: Basic Concepts

Chapter 1. GMM: Basic Concepts Chapter 1. GMM: Basic Concepts Contents 1 Motivating Examples 1 1.1 Instrumental variable estimator....................... 1 1.2 Estimating parameters in monetary policy rules.............. 2 1.3 Estimating

More information

Heteroskedasticity. y i = β 0 + β 1 x 1i + β 2 x 2i β k x ki + e i. where E(e i. ) σ 2, non-constant variance.

Heteroskedasticity. y i = β 0 + β 1 x 1i + β 2 x 2i β k x ki + e i. where E(e i. ) σ 2, non-constant variance. Heteroskedasticity y i = β + β x i + β x i +... + β k x ki + e i where E(e i ) σ, non-constant variance. Common problem with samples over individuals. ê i e ˆi x k x k AREC-ECON 535 Lec F Suppose y i =

More information

Statistical Inference with Regression Analysis

Statistical Inference with Regression Analysis Introductory Applied Econometrics EEP/IAS 118 Spring 2015 Steven Buck Lecture #13 Statistical Inference with Regression Analysis Next we turn to calculating confidence intervals and hypothesis testing

More information

Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity

Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity R.G. Pierse 1 Omitted Variables Suppose that the true model is Y i β 1 + β X i + β 3 X 3i + u i, i 1,, n (1.1) where β 3 0 but that the

More information

1 Introduction to Generalized Least Squares

1 Introduction to Generalized Least Squares ECONOMICS 7344, Spring 2017 Bent E. Sørensen April 12, 2017 1 Introduction to Generalized Least Squares Consider the model Y = Xβ + ɛ, where the N K matrix of regressors X is fixed, independent of the

More information

GMM-based inference in the AR(1) panel data model for parameter values where local identi cation fails

GMM-based inference in the AR(1) panel data model for parameter values where local identi cation fails GMM-based inference in the AR() panel data model for parameter values where local identi cation fails Edith Madsen entre for Applied Microeconometrics (AM) Department of Economics, University of openhagen,

More information

Economics 620, Lecture 7: Still More, But Last, on the K-Varable Linear Model

Economics 620, Lecture 7: Still More, But Last, on the K-Varable Linear Model Economics 620, Lecture 7: Still More, But Last, on the K-Varable Linear Model Nicholas M. Kiefer Cornell University Professor N. M. Kiefer (Cornell University) Lecture 7: the K-Varable Linear Model IV

More information

Empirical Economic Research, Part II

Empirical Economic Research, Part II Based on the text book by Ramanathan: Introductory Econometrics Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 7, 2011 Outline Introduction

More information

Multiple Regression Analysis

Multiple Regression Analysis 1 OUTLINE Basic Concept: Multiple Regression MULTICOLLINEARITY AUTOCORRELATION HETEROSCEDASTICITY REASEARCH IN FINANCE 2 BASIC CONCEPTS: Multiple Regression Y i = β 1 + β 2 X 1i + β 3 X 2i + β 4 X 3i +

More information

Instrumental Variables and Two-Stage Least Squares

Instrumental Variables and Two-Stage Least Squares Instrumental Variables and Two-Stage Least Squares Generalised Least Squares Professor Menelaos Karanasos December 2011 Generalised Least Squares: Assume that the postulated model is y = Xb + e, (1) where

More information

Models, Testing, and Correction of Heteroskedasticity. James L. Powell Department of Economics University of California, Berkeley

Models, Testing, and Correction of Heteroskedasticity. James L. Powell Department of Economics University of California, Berkeley Models, Testing, and Correction of Heteroskedasticity James L. Powell Department of Economics University of California, Berkeley Aitken s GLS and Weighted LS The Generalized Classical Regression Model

More information

Notes on Asymptotic Theory: Convergence in Probability and Distribution Introduction to Econometric Theory Econ. 770

Notes on Asymptotic Theory: Convergence in Probability and Distribution Introduction to Econometric Theory Econ. 770 Notes on Asymptotic Theory: Convergence in Probability and Distribution Introduction to Econometric Theory Econ. 770 Jonathan B. Hill Dept. of Economics University of North Carolina - Chapel Hill November

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna November 23, 2013 Outline Introduction

More information