Least Squares Estimation: Finite-Sample Properties

Ping Yu, School of Economics and Finance, The University of Hong Kong

Outline

1 Terminology and Assumptions
2 Goodness of Fit
3 Bias and Variance
4 The Gauss-Markov Theorem
5 Multicollinearity
6 Hypothesis Testing: An Introduction
7 LSE as a MLE

Terminology and Assumptions

Terminology

$y = x'\beta + u$, $E[u|x] = 0$.

$u$ is called the error term, disturbance, or unobservable.

Table 1: Terminology for Linear Regression

| y                   | x                           |
|---------------------|-----------------------------|
| Dependent variable  | Independent variable        |
| Explained variable  | Explanatory variable        |
| Response variable   | Control (stimulus) variable |
| Predicted variable  | Predictor variable          |
| Regressand          | Regressor                   |
| LHS variable        | RHS variable                |
| Endogenous variable | Exogenous variable          |
|                     | Covariate                   |
|                     | Conditioning variable       |

Assumptions

We maintain the following assumptions in this chapter.

Assumption OLS.0 (random sampling): $(y_i, x_i)$, $i = 1, \ldots, n$, are independent and identically distributed (i.i.d.).
Assumption OLS.1 (full rank): $\mathrm{rank}(X) = k$.
Assumption OLS.2 (first moment): $E[y|x] = x'\beta$.
Assumption OLS.3 (second moment): $E[u^2] < \infty$.
Assumption OLS.3' (homoskedasticity): $E[u^2|x] = \sigma^2$.

Discussion

Assumption OLS.2 is equivalent to $y = x'\beta + u$ (linear in parameters) plus $E[u|x] = 0$ (zero conditional mean).

To study the finite-sample properties of the LSE, such as unbiasedness, we always assume Assumption OLS.2, i.e., the model is the linear regression model. (For large-sample properties such as consistency, we require only weaker assumptions.)

Assumption OLS.3' is stronger than Assumption OLS.3. The linear regression model under Assumption OLS.3' is called the homoskedastic linear regression model,

$y = x'\beta + u$, $E[u|x] = 0$, $E[u^2|x] = \sigma^2$.

If $E[u^2|x] = \sigma^2(x)$ depends on $x$, we say $u$ is heteroskedastic.
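To make the distinction concrete, here is a minimal simulation sketch (not part of the original slides). It draws one homoskedastic and one heteroskedastic design; the variance function $\sigma^2(x) = e^x$, the coefficient values, and the seed are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch: homoskedastic vs. heteroskedastic errors (illustrative DGP only).
rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)
beta0, beta1, sigma = 1.0, 2.0, 0.5

# Homoskedastic: Var(u|x) = sigma^2 for every x.
u_hom = sigma * rng.normal(size=n)
# Heteroskedastic: Var(u|x) = sigma^2(x) depends on x (here exp(x), an arbitrary choice).
u_het = np.sqrt(np.exp(x)) * rng.normal(size=n)

y_hom = beta0 + beta1 * x + u_hom
y_het = beta0 + beta1 * x + u_het

# Both designs satisfy E[u|x] = 0, so OLS.2 holds; only OLS.3' differs.
for lbl, u in [("homoskedastic", u_hom), ("heteroskedastic", u_het)]:
    lo, hi = u[x < 0], u[x >= 0]
    print(lbl, "Var(u | x<0) =", round(lo.var(), 2), " Var(u | x>=0) =", round(hi.var(), 2))
```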

Goodness of Fit

Residual and SER

Express

$y_i = \hat{y}_i + \hat{u}_i$,  (1)

where $\hat{y}_i = x_i'\hat{\beta}$ is the predicted value, and $\hat{u}_i = y_i - \hat{y}_i$ is the residual. ($\hat{u}_i$ is different from $u_i$: the latter is unobservable, while the former is a by-product of OLS estimation.)

Often, the error variance $\sigma^2 = E[u^2]$ is also a parameter of interest. It measures the variation in the "unexplained" part of the regression. Its method of moments (MoM) estimator is the sample average of the squared residuals,

$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \hat{u}_i^2 = \frac{1}{n}\hat{u}'\hat{u}$.

An alternative estimator uses the formula

$s^2 = \frac{1}{n-k}\sum_{i=1}^n \hat{u}_i^2 = \frac{1}{n-k}\hat{u}'\hat{u}$.

This estimator adjusts for the degrees of freedom (df) of $\hat{u}$.
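As a quick numerical check (not part of the slides), the sketch below fits OLS on simulated data and computes both estimators; the sample size, coefficients, and error scale are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch: OLS residuals and the two error-variance estimators (illustrative data).
rng = np.random.default_rng(1)
n, k = 200, 3                                  # k counts the intercept plus two slopes
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.5, -0.3])
y = X @ beta + rng.normal(scale=1.5, size=n)   # true sigma^2 = 2.25

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # LSE
y_hat = X @ beta_hat                           # predicted values
u_hat = y - y_hat                              # residuals (not the unobservable errors!)

sigma2_hat = u_hat @ u_hat / n                 # MoM estimator: divides by n
s2 = u_hat @ u_hat / (n - k)                   # df-adjusted estimator: divides by n - k
print(sigma2_hat, s2)                          # s2 is slightly larger than sigma2_hat
```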

Coefficient of Determination

If $X$ includes a column of ones, $1'\hat{u} = \sum_{i=1}^n \hat{u}_i = 0$, so $\bar{\hat{y}} = \bar{y}$.

Subtracting $\bar{y}$ from both sides of (1), we have

$\tilde{y}_i \equiv y_i - \bar{y} = (\hat{y}_i - \bar{y}) + \hat{u}_i \equiv \tilde{\hat{y}}_i + \hat{u}_i$.

Since $\tilde{\hat{y}}'\hat{u} = \hat{y}'\hat{u} - \bar{y}\,1'\hat{u} = \hat{\beta}'X'\hat{u} - \bar{y}\,1'\hat{u} = 0$,

$\mathrm{SST} \equiv \|\tilde{y}\|^2 = \tilde{y}'\tilde{y} = \|\tilde{\hat{y}}\|^2 + 2\tilde{\hat{y}}'\hat{u} + \|\hat{u}\|^2 = \|\tilde{\hat{y}}\|^2 + \|\hat{u}\|^2 \equiv \mathrm{SSE} + \mathrm{SSR}$,  (2)

where SST, SSE and SSR denote the total sum of squares, the explained sum of squares, and the residual sum of squares (or the sum of squared residuals), respectively.

Dividing both sides of (2) by SST (when can we conduct this operation, i.e., when is SST ≠ 0?), we have

$1 = \frac{\mathrm{SSE}}{\mathrm{SST}} + \frac{\mathrm{SSR}}{\mathrm{SST}}$.

The R-squared of the regression, sometimes called the coefficient of determination, is defined as

$R^2 = \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\hat{\sigma}^2}{\hat{\sigma}_y^2}$.
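The decomposition (2) and the two expressions for $R^2$ can be verified numerically; the sketch below (not from the slides, with an arbitrary simulated design) does exactly that.

```python
import numpy as np

# Minimal sketch: verify SST = SSE + SSR and the two expressions for R^2 (illustrative data).
rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # includes a column of ones
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
u_hat = y - y_hat

SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y_hat - y.mean()) ** 2)   # note: mean(y_hat) = mean(y) because of the intercept
SSR = np.sum(u_hat ** 2)

print(np.isclose(SST, SSE + SSR))       # True: the cross term vanishes since X'u_hat = 0
print(SSE / SST, 1 - SSR / SST)         # two equal expressions for R^2
```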

More on R²

$R^2$ is defined only if $x$ includes a constant. It is usually interpreted as the fraction of the sample variation in $y$ that is explained by (nonconstant) $x$.

When there is no constant term in $x_i$, we need to define the so-called uncentered $R^2$, denoted $R_u^2$,

$R_u^2 = \frac{\hat{y}'\hat{y}}{y'y}$.

$R^2$ can also be treated as an estimator of $\rho^2 = 1 - \sigma^2/\sigma_y^2$. It is often useful in algebraic manipulation of some statistics.

An alternative estimator of $\rho^2$, proposed by Henri Theil (1924-2000) and called adjusted R-squared or "R-bar-squared", is

$\bar{R}^2 = 1 - \frac{s^2}{\tilde{\sigma}_y^2} = 1 - (1 - R^2)\frac{n-1}{n-k}$,

where $\tilde{\sigma}_y^2 = \tilde{y}'\tilde{y}/(n-1)$.

$\bar{R}^2$ adjusts the degrees of freedom in the numerator and denominator of $R^2$.
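A small sketch (not from the slides, illustrative data only) confirms that the two formulas for $\bar{R}^2$ coincide and shows how the uncentered $R_u^2$ would be computed.

```python
import numpy as np

# Minimal sketch: adjusted R^2 and uncentered R^2 (illustrative data).
rng = np.random.default_rng(3)
n, k = 300, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
u_hat = y - y_hat

SST = np.sum((y - y.mean()) ** 2)
R2 = 1 - np.sum(u_hat ** 2) / SST

s2 = np.sum(u_hat ** 2) / (n - k)                    # df-adjusted error variance
sig2y_tilde = SST / (n - 1)                          # df-adjusted variance of y
R2_bar = 1 - s2 / sig2y_tilde                        # adjusted R^2 ("R-bar-squared")
print(np.isclose(R2_bar, 1 - (1 - R2) * (n - 1) / (n - k)))  # the two formulas agree

R2_uncentered = (y_hat @ y_hat) / (y @ y)            # for regressions without a constant
print(R2, R2_bar, R2_uncentered)
```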

Degree of Freedom

Why is it called "degree of freedom"? Roughly speaking, the degree of freedom is the dimension of the space where a vector can stay, or how "freely" a vector can move.

For example, $\hat{u}$, as an $n$-dimensional vector, can only stay in a subspace of dimension $n - k$. Why? This is because $X'\hat{u} = 0$, so $k$ constraints are imposed on $\hat{u}$; $\hat{u}$ cannot move completely freely and loses $k$ degrees of freedom.

Similarly, the degree of freedom of $\tilde{y}$ is $n - 1$. Figure 1 illustrates why the degree of freedom of $\tilde{y}$ is $n - 1$ when $n = 2$.

Table 2 summarizes the degrees of freedom for the three terms in (2).

Table 2: Degrees of Freedom for Three Variations

| Variation | Notation                          | df    |
|-----------|-----------------------------------|-------|
| SSE       | $\tilde{\hat{y}}'\tilde{\hat{y}}$ | k - 1 |
| SSR       | $\hat{u}'\hat{u}$                 | n - k |
| SST       | $\tilde{y}'\tilde{y}$             | n - 1 |
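The counting argument can be checked directly: the sketch below (not from the slides, arbitrary simulated design) verifies the $k$ constraints $X'\hat{u} = 0$ and that the residual-maker matrix has rank $n - k$.

```python
import numpy as np

# Minimal sketch: the k constraints X'u_hat = 0 and the rank of the residual maker M (illustrative data).
rng = np.random.default_rng(4)
n, k = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix onto the column space of X
M = np.eye(n) - P                       # annihilator ("residual maker"): u_hat = M y
u_hat = M @ y

print(np.allclose(X.T @ u_hat, 0))                        # k linear constraints on u_hat
print(np.linalg.matrix_rank(M), "== n - k =", n - k)      # u_hat lives in an (n-k)-dim subspace
y_tilde = y - y.mean()
print(np.isclose(y_tilde.sum(), 0), "-> df of y_tilde is n - 1")
```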

Figure 1: Although $\dim(\tilde{y}) = 2$, $\mathrm{df}(\tilde{y}) = 1$, where $\tilde{y} = (\tilde{y}_1, \tilde{y}_2)'$.

Bias and Variance

Unbiasedness of the LSE

Assumption OLS.2 implies that

$y = x'\beta + u$, $E[u|x] = 0$.

Then

$E[u|X] = (E[u_1|X], \ldots, E[u_n|X])' = (E[u_1|x_1], \ldots, E[u_n|x_n])' = 0$,

where the second equality is from the assumption of independent sampling (Assumption OLS.0).

Now,

$\hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u$,

so

$E[\hat{\beta} - \beta \mid X] = E[(X'X)^{-1}X'u \mid X] = (X'X)^{-1}X'E[u|X] = 0$,

i.e., $\hat{\beta}$ is unbiased.
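Unbiasedness is easy to see in a Monte Carlo exercise; the following sketch (not part of the slides, with arbitrary parameter values) redraws the sample many times and averages the estimates.

```python
import numpy as np

# Minimal sketch: Monte Carlo check of E[beta_hat] = beta (illustrative DGP, X redrawn each replication).
rng = np.random.default_rng(5)
n, reps = 100, 5_000
beta = np.array([1.0, 0.5, -0.3])

estimates = np.empty((reps, beta.size))
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    u = rng.normal(size=n)                       # E[u|X] = 0 by construction
    y = X @ beta + u
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(estimates.mean(axis=0))   # close to (1.0, 0.5, -0.3): the LSE is unbiased
```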

Variance of the LSE

$\mathrm{Var}(\hat{\beta}|X) = \mathrm{Var}((X'X)^{-1}X'u \mid X) = (X'X)^{-1}X'\mathrm{Var}(u|X)X(X'X)^{-1} \equiv (X'X)^{-1}X'DX(X'X)^{-1}$.

Note that

$\mathrm{Var}(u_i|X) = \mathrm{Var}(u_i|x_i) = E[u_i^2|x_i] - E[u_i|x_i]^2 = E[u_i^2|x_i] \equiv \sigma_i^2$,

and

$\mathrm{Cov}(u_i, u_j|X) = E[u_iu_j|X] - E[u_i|X]E[u_j|X] = E[u_iu_j|x_i, x_j] - E[u_i|x_i]E[u_j|x_j] = E[u_i|x_i]E[u_j|x_j] - E[u_i|x_i]E[u_j|x_j] = 0$,

so $D$ is a diagonal matrix: $D = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$.

continue...

It is useful to note that

$X'DX = \sum_{i=1}^n x_ix_i'\sigma_i^2$.

In the homoskedastic case, $\sigma_i^2 = \sigma^2$ and $D = \sigma^2 I_n$, so $X'DX = \sigma^2 X'X$, and

$\mathrm{Var}(\hat{\beta}|X) = \sigma^2(X'X)^{-1}$.

You are asked to show that

$\mathrm{Var}(\hat{\beta}_j|X) = \sum_{i=1}^n w_{ij}\sigma_i^2 / \mathrm{SSR}_j$, $j = 1, \ldots, k$,

where $w_{ij} > 0$, $\sum_{i=1}^n w_{ij} = 1$, and $\mathrm{SSR}_j$ is the SSR in the regression of $x_j$ on all other regressors.

So under homoskedasticity,

$\mathrm{Var}(\hat{\beta}_j|X) = \sigma^2/\mathrm{SSR}_j = \sigma^2/[\mathrm{SST}_j(1 - R_j^2)]$, $j = 1, \ldots, k$, (why?),

where $\mathrm{SST}_j$ is the SST of $x_j$, and $R_j^2$ is the R-squared from the regression of $x_j$ on the remaining regressors (which include an intercept).
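The sandwich form and the per-coefficient formula can be cross-checked numerically. The sketch below is not from the slides; the heteroskedastic variance function and all numerical values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: exact conditional variance (X'X)^{-1} X'DX (X'X)^{-1} and the
# per-coefficient formula sum_i w_ij sigma_i^2 / SSR_j (illustrative heteroskedastic design).
rng = np.random.default_rng(6)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
sigma2_i = np.exp(0.5 * X[:, 1])            # conditional variances sigma_i^2 (arbitrary choice)

XtX_inv = np.linalg.inv(X.T @ X)
D = np.diag(sigma2_i)
V = XtX_inv @ X.T @ D @ X @ XtX_inv         # Var(beta_hat | X) under heteroskedasticity

# Check the formula for a single slope, say j = 1 (the first non-constant regressor).
j = 1
others = np.delete(X, j, axis=1)
gamma = np.linalg.solve(others.T @ others, others.T @ X[:, j])
r_j = X[:, j] - others @ gamma              # residuals from regressing x_j on the rest
SSR_j = r_j @ r_j
w = r_j ** 2 / SSR_j                        # weights: positive and summing to one
print(np.isclose(V[j, j], (w @ sigma2_i) / SSR_j))

# Homoskedastic special case: sigma_i^2 = sigma^2 gives sigma^2 / (SST_j (1 - R_j^2)).
sigma2 = 1.3
SST_j = np.sum((X[:, j] - X[:, j].mean()) ** 2)
R2_j = 1 - SSR_j / SST_j
print(np.isclose(sigma2 * XtX_inv[j, j], sigma2 / (SST_j * (1 - R2_j))))
```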

Bias of σ̂²

Recall that $\hat{u} = Mu$, where we abbreviate $M_X$ as $M$, so by the properties of projection matrices and the trace operator, we have

$\hat{\sigma}^2 = \frac{1}{n}\hat{u}'\hat{u} = \frac{1}{n}u'MMu = \frac{1}{n}u'Mu = \frac{1}{n}\mathrm{tr}(u'Mu) = \frac{1}{n}\mathrm{tr}(Muu')$.

Then

$E[\hat{\sigma}^2|X] = \frac{1}{n}\mathrm{tr}(E[Muu'|X]) = \frac{1}{n}\mathrm{tr}(M\,E[uu'|X]) = \frac{1}{n}\mathrm{tr}(MD)$.

In the homoskedastic case, $D = \sigma^2 I_n$, so

$E[\hat{\sigma}^2|X] = \frac{1}{n}\mathrm{tr}(M)\sigma^2 = \sigma^2\frac{n-k}{n}$.

Thus $\hat{\sigma}^2$ underestimates $\sigma^2$.

Alternatively, $s^2 = \frac{1}{n-k}\hat{u}'\hat{u}$ is unbiased for $\sigma^2$. This is the justification for the common preference for $s^2$ over $\hat{\sigma}^2$ in empirical practice. However, this estimator is unbiased only in the special case of the homoskedastic linear regression model; it is not unbiased in the absence of homoskedasticity or in the projection model.
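The factor $(n-k)/n$ shows up clearly in a simulation with a small $n$; the sketch below (not from the slides, arbitrary parameter values) holds $X$ fixed and redraws the homoskedastic errors.

```python
import numpy as np

# Minimal sketch: Monte Carlo check that E[sigma2_hat] = sigma^2 (n-k)/n while s^2 is unbiased
# (homoskedastic illustrative DGP, X held fixed across replications).
rng = np.random.default_rng(7)
n, k, reps, sigma2 = 40, 5, 20_000, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)     # residual maker

sig2_hat = np.empty(reps)
for r in range(reps):
    u = rng.normal(scale=np.sqrt(sigma2), size=n)
    u_hat = M @ u
    sig2_hat[r] = u_hat @ u_hat / n

print(sig2_hat.mean(), "vs theory", sigma2 * (n - k) / n)   # downward biased
print((sig2_hat * n / (n - k)).mean(), "vs", sigma2)        # s^2: approximately unbiased
```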

The Gauss-Markov Theorem

The Gauss-Markov Theorem

The LSE has some optimality properties among a restricted class of estimators in a restricted class of models. The model is restricted to be the homoskedastic linear regression model, and the class of estimators is restricted to linear unbiased estimators.

Here, "linear" means the estimator is a linear function of $y$. In other words, the estimator, say $\tilde{\beta}$, can be written as

$\tilde{\beta} = A'y = A'(X\beta + u) = A'X\beta + A'u$,

where $A$ is any $n \times k$ matrix that is a function of $X$.

Unbiasedness implies that $E[\tilde{\beta}|X] = E[A'y|X] = A'X\beta = \beta$, or $A'X = I_k$. In this case, $\tilde{\beta} = \beta + A'u$, so under homoskedasticity,

$\mathrm{Var}(\tilde{\beta}|X) = A'\mathrm{Var}(u|X)A = A'A\sigma^2$.

The Gauss-Markov Theorem states that the best choice of $A'$ is $(X'X)^{-1}X'$, in the sense that this choice of $A$ achieves the smallest variance.

continue...

Theorem. In the homoskedastic linear regression model, the best (minimum-variance) linear unbiased estimator (BLUE) is the LSE.

Proof. The variance of the LSE is $(X'X)^{-1}\sigma^2$ and that of $\tilde{\beta}$ is $A'A\sigma^2$, so it is sufficient to show that $A'A - (X'X)^{-1} \ge 0$. Set $C = A - X(X'X)^{-1}$, and note that $X'C = 0$. Then we calculate that

$A'A - (X'X)^{-1} = (C + X(X'X)^{-1})'(C + X(X'X)^{-1}) - (X'X)^{-1}$
$= C'C + C'X(X'X)^{-1} + (X'X)^{-1}X'C + (X'X)^{-1}X'X(X'X)^{-1} - (X'X)^{-1}$
$= C'C \ge 0$. ∎
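The key step of the proof, that $A'A - (X'X)^{-1} = C'C$ is positive semi-definite for any linear unbiased estimator, can be illustrated numerically. The sketch below is not from the slides; the way the alternative $A$ is constructed (OLS weights plus a random perturbation orthogonal to $X$) is an illustrative choice.

```python
import numpy as np

# Minimal sketch: for an arbitrary linear unbiased estimator A'y (i.e. A'X = I_k),
# the matrix A'A - (X'X)^{-1} is positive semi-definite, as in the proof.
rng = np.random.default_rng(8)
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

A_ols = X @ np.linalg.inv(X.T @ X)          # the LSE corresponds to A = X (X'X)^{-1}
C = rng.normal(size=(n, k))
C -= X @ np.linalg.solve(X.T @ X, X.T @ C)  # project C off the columns of X, so X'C = 0
A = A_ols + C                               # another matrix with A'X = I_k (unbiasedness)
print(np.allclose(A.T @ X, np.eye(k)))

gap = A.T @ A - np.linalg.inv(X.T @ X)      # equals C'C, hence positive semi-definite
print(np.all(np.linalg.eigvalsh(gap) >= -1e-10))
```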

Limitation and Extension of the Gauss-Markov Theorem

The scope of the Gauss-Markov Theorem is quite limited, given that it requires the class of estimators to be linear unbiased and the model to be homoskedastic. This leaves open the possibility that a nonlinear or biased estimator could have a lower mean squared error (MSE) than the LSE in a heteroskedastic model.

MSE: for simplicity, suppose $\dim(\beta) = 1$; then

$\mathrm{MSE}(\tilde{\beta}) = E[(\tilde{\beta} - \beta)^2] = \mathrm{Var}(\tilde{\beta}) + \mathrm{Bias}(\tilde{\beta})^2$.

To exclude such possibilities, we need asymptotic (or large-sample) arguments. Chamberlain (1987) shows that in the model $y = x'\beta + u$, if the only available information is $E[xu] = 0$ or ($E[u|x] = 0$ and $E[u^2|x] = \sigma^2$), then among all estimators, the LSE achieves the lowest asymptotic MSE.

Multicollinearity

Multicollinearity

If $\mathrm{rank}(X'X) < k$, then $\hat{\beta}$ is not uniquely defined. This is called strict (or exact) multicollinearity. It happens when the columns of $X$ are linearly dependent, i.e., there is some $\alpha \neq 0$ such that $X\alpha = 0$.

Most commonly, this arises when sets of regressors are included which are identically related; for example, when $X$ includes a column of ones and dummies for both male and female. When this happens, the applied researcher quickly discovers the error, as the statistical software will be unable to construct $(X'X)^{-1}$. Since the error is discovered quickly, this is rarely a problem for applied econometric practice.

The more relevant issue is near multicollinearity, which is often called "multicollinearity" for brevity. This is the situation when the $X'X$ matrix is near singular, or when the columns of $X$ are close to being linearly dependent. This definition is not precise, because we have not said what it means for a matrix to be "near singular". This is one difficulty with the definition and interpretation of multicollinearity.

continue...

One implication of near singularity of matrices is that the numerical reliability of the calculations is reduced.

A more relevant implication of near multicollinearity is that individual coefficient estimates will be imprecise. We can see this most simply in a homoskedastic linear regression model with two regressors,

$y_i = x_{1i}\beta_1 + x_{2i}\beta_2 + u_i$,

and

$\frac{1}{n}X'X = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$.

In this case,

$\mathrm{Var}(\hat{\beta}|X) = \frac{\sigma^2}{n}\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}^{-1} = \frac{\sigma^2}{n(1-\rho^2)}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}$.

The correlation $\rho$ indexes collinearity, since as $\rho$ approaches 1 the matrix becomes singular: $\sigma^2/[n(1-\rho^2)] \to \infty$ as $\rho \to 1$. Thus the more "collinear" the regressors are, the worse the precision of the individual coefficient estimates.
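The closed form $\sigma^2/[n(1-\rho^2)]$ makes the blow-up easy to tabulate; the short sketch below (not from the slides, with arbitrary $\sigma^2$ and $n$) evaluates it for a few values of $\rho$.

```python
import numpy as np

# Minimal sketch: in the two-regressor model with unit variances and correlation rho,
# Var(beta_1_hat | X) = sigma^2 / (n (1 - rho^2)) blows up as rho -> 1.
sigma2, n = 1.0, 100
for rho in [0.0, 0.5, 0.9, 0.99, 0.999]:
    XtX_over_n = np.array([[1.0, rho],
                           [rho, 1.0]])
    V = sigma2 / n * np.linalg.inv(XtX_over_n)     # Var(beta_hat | X)
    print(f"rho = {rho:5.3f}   Var(beta_1_hat) = {V[0, 0]:10.4f}",
          "  (closed form:", round(sigma2 / (n * (1 - rho**2)), 4), ")")
```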

continue...

In the general model

$y_i = x_{1i}\beta_1 + x_{2i}'\beta_2 + u_i$,

recall that

$\mathrm{Var}(\hat{\beta}_1|X) = \frac{\sigma^2}{\mathrm{SST}_1(1 - R_1^2)}$.  (3)

Because the R-squared measures goodness of fit, a value of $R_1^2$ close to one indicates that $x_2$ explains much of the variation in $x_1$ in the sample. This means that $x_1$ and $x_2$ are highly correlated.

When $R_1^2$ approaches 1, the variance of $\hat{\beta}_1$ explodes. $1/(1-R_1^2)$ is often termed the variance inflation factor (VIF). Usually, a VIF larger than 10 should raise our attention.

Intuition: $\beta_1$ measures the effect on $y$ of a one-unit change in $x_1$, holding $x_2$ fixed. When $x_1$ and $x_2$ are highly correlated, you cannot change $x_1$ while holding $x_2$ fixed, so $\beta_1$ cannot be estimated precisely.

Multicollinearity is a small-sample problem. As larger and larger data sets are available nowadays, i.e., $n \gg k$, it is seldom a problem in current econometric practice.
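The VIF is computed from the auxiliary regression of $x_j$ on the other regressors; the sketch below (not from the slides) does this for an illustrative design whose correlation of 0.95 is an arbitrary choice, and the helper function `vif` is hypothetical.

```python
import numpy as np

# Minimal sketch: variance inflation factor 1 / (1 - R_j^2) from the auxiliary
# regression of x_j on the remaining regressors (illustrative correlated design).
rng = np.random.default_rng(9)
n = 500
x2 = rng.normal(size=n)
x1 = 0.95 * x2 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)   # x1 highly correlated with x2
X = np.column_stack([np.ones(n), x1, x2])

def vif(X, j):
    """VIF of column j of X from regressing it on all other columns (incl. the intercept)."""
    xj, others = X[:, j], np.delete(X, j, axis=1)
    gamma = np.linalg.solve(others.T @ others, others.T @ xj)
    resid = xj - others @ gamma
    R2_j = 1 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
    return 1 / (1 - R2_j)

print(vif(X, 1), vif(X, 2))   # both near 1/(1 - 0.95^2) ~ 10, a level that warrants attention
```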

Hypothesis Testing: An Introduction

Basic Concepts

- null hypothesis, alternative hypothesis
- point hypothesis, one-sided hypothesis, two-sided hypothesis (we consider only the point null hypothesis in this course)
- simple hypothesis, composite hypothesis
- acceptance region and rejection (or critical) region
- test statistic, critical value
- type I error and type II error
- size and power
- significance level, statistically (in)significant

Summary

A hypothesis test involves the following steps:

1 specify the null and the alternative;
2 construct the test statistic;
3 derive the distribution of the test statistic under the null;
4 determine the decision rule (acceptance and rejection regions) by specifying a level of significance;
5 study the power of the test.

Steps 2, 3 and 5 are key, since steps 1 and 4 are usually trivial. Of course, in some cases how to specify the null and the alternative is also subtle, and in some cases the critical value is not easy to determine if the asymptotic distribution is complicated.
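To make the five steps concrete, here is a minimal sketch (not from the slides) that walks through them for one particular test, a two-sided t-test of a single coefficient in the homoskedastic normal regression model; the test itself is developed later in the course, and all data and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Minimal sketch of the five steps for one concrete test (illustrative data).
rng = np.random.default_rng(10)
n, k = 100, 2
x = rng.normal(size=n)
y = 1.0 + 0.3 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Step 1: null H0: beta_1 = 0 against the two-sided alternative H1: beta_1 != 0.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
s2 = u_hat @ u_hat / (n - k)
se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

# Step 2: the test statistic.
t_stat = beta_hat[1] / se
# Step 3: under H0 (and normality) the statistic is t-distributed with n - k df.
# Step 4: decision rule at the 5% significance level.
crit = stats.t.ppf(0.975, df=n - k)
print("t =", round(t_stat, 3), " reject H0:", abs(t_stat) > crit)
# Step 5: power would be studied by repeating this exercise under alternatives beta_1 != 0.
```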

LSE as a MLE

LSE as a MLE

Another motivation for the LSE can be obtained from the normal regression model:

Assumption OLS.4 (normality): $u|x \sim N(0, \sigma^2)$, or $u|X \sim N(0, I_n\sigma^2)$.

That is, the error $u_i$ is independent of $x_i$ and has the distribution $N(0, \sigma^2)$, which obviously implies $E[u|x] = 0$ and $E[u^2|x] = \sigma^2$.

The average log-likelihood is

$\ell_n(\beta, \sigma^2) = \frac{1}{n}\sum_{i=1}^n \ln\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i - x_i'\beta)^2}{2\sigma^2}\right)\right) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\sigma^2 - \frac{1}{n}\sum_{i=1}^n\frac{(y_i - x_i'\beta)^2}{2\sigma^2}$,

so $\hat{\beta}_{MLE} = \hat{\beta}_{LSE}$.

It is not hard to show that

$\hat{\beta} - \beta \mid X = (X'X)^{-1}X'u \mid X \sim N(0, \sigma^2(X'X)^{-1})$.

But recall the trade-off between efficiency and robustness, which applies here as well. In any case, this is part of the classical theory of least squares estimation. We will skip this section and proceed to the asymptotic theory of the LSE, which is more robust and does not require the normality assumption.
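The equivalence $\hat{\beta}_{MLE} = \hat{\beta}_{LSE}$ can be verified by maximizing the average log-likelihood numerically and comparing with the closed-form LSE. The sketch below is not from the slides; the data, the log-variance parameterization, and the use of a generic optimizer are illustrative choices for checking the claim, not part of the theory.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch: maximizing the normal average log-likelihood over (beta, log sigma^2)
# reproduces the closed-form LSE (illustrative data; optimization is only for checking).
rng = np.random.default_rng(11)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

def neg_avg_loglik(theta):
    beta, log_s2 = theta[:-1], theta[-1]          # parameterize sigma^2 > 0 via its log
    s2 = np.exp(log_s2)
    resid = y - X @ beta
    return 0.5 * np.log(2 * np.pi) + 0.5 * log_s2 + np.mean(resid**2) / (2 * s2)

res = minimize(neg_avg_loglik, x0=np.zeros(X.shape[1] + 1), method="BFGS")
beta_mle = res.x[:-1]
beta_lse = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_mle, beta_lse, atol=1e-3))   # the two estimators coincide
```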