
Linear models

Linear models are computationally convenient and remain widely used in applied econometric research. Our main focus in these lectures will be on single-equation linear models of the form

$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_K x_{Ki} + u_i$

where we have: a scalar outcome variable $y_i$ for observation $i$; a set of $K$ explanatory variables $x_{1i}, x_{2i}, \dots, x_{Ki}$; and a sample of $n$ observations indexed $i = 1, 2, \dots, n$.

The linear specification is not quite as restrictive as it may seem at first sight. The dependent variable and each of the explanatory variables may be obtained as known, non-linear transformations of other underlying variables. For example, we may have $y_i = \ln Y_i$, or we may have $x_{3i} = x_{2i}^2$. The restrictive features of the linear model are that the parameters $\beta_1, \beta_2, \dots, \beta_K$ enter linearly, and that the error term $u_i$ enters additively.

Linear models will usually have an unrestricted intercept term. For example, with $K = 2$ we may have $x_{1i} = 1$ for all $i = 1, 2, \dots, n$, in which case the parameter $\beta_1$ corresponds to the intercept of a straight line. With $K = 3$ we may have $x_{1i} = 1$ for all $i = 1, 2, \dots, n$, in which case $\beta_1$ corresponds to the intercept of a plane. More generally, if $x_{1i} = 1$ for all $i = 1, 2, \dots, n$, the parameter $\beta_1$ corresponds to the intercept of a $K$-dimensional hyperplane.

The error term $u_i$ reflects the combined effect of all additional influences on the outcome variable $y_i$. This may include the effect of relevant variables not included in our set of $K$ explanatory variables, and also any non-linear effects of our included variables $(x_{1i}, x_{2i}, \dots, x_{Ki})$.

Notation

$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_K x_{Ki} + u_i$

Letting $x_i$ and $\beta$ denote the $K \times 1$ column vectors

$x_i = \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{Ki} \end{pmatrix} \quad \text{and} \quad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{pmatrix}$

we can write the model as $y_i = x_i'\beta + u_i$ for $i = 1, \dots, n$.

Sometimes you may see the $i$ subscripts omitted and the model written as

$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K + u \quad \text{or} \quad y = x'\beta + u$

for a typical observation (the population model).

We can also stack the observations for a sample of size $n$, using

$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix} = \begin{pmatrix} x_{11} & x_{21} & \cdots & x_{K1} \\ x_{12} & x_{22} & \cdots & x_{K2} \\ \vdots & \vdots & & \vdots \\ x_{1n} & x_{2n} & \cdots & x_{Kn} \end{pmatrix}, \quad u = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}$

where $y$ and $u$ are $n \times 1$ column vectors and $X$ is an $n \times K$ matrix, to obtain

$y = X\beta + u$

NB. It should be clear from the context whether $y$ and $u$ denote $n \times 1$ vectors, as here, or scalars for a typical observation, as above.

Ordinary least squares (OLS)

The OLS estimator is a widely used method for estimating the unknown parameter vector $\beta$ from the data on $y_i$ and $x_{1i}, \dots, x_{Ki}$ for a sample of $i = 1, \dots, n$ observations. The OLS estimator finds the parameter vector $\beta$ which minimizes the sum of the squared errors $u_i = y_i - x_i'\beta = u_i(\beta)$. Formally

$\hat\beta_{OLS} = \arg\min_\beta \sum_{i=1}^n u_i^2(\beta) = \arg\min_\beta \sum_{i=1}^n (y_i - x_i'\beta)^2 = \arg\min_\beta u(\beta)'u(\beta) = \arg\min_\beta (y - X\beta)'(y - X\beta)$

This squared error loss function penalizes large values of the error term, but tolerates small values of the error term. The OLS estimator finds a value of the parameter vector $\beta$ which avoids large residuals, while tolerating many small deviations from the fitted model. This estimator can be shown to have several desirable properties in linear models of this kind. One drawback is that the OLS parameter estimates can be sensitive to the inclusion or exclusion of outliers (influential observations) from the sample, and it is usually good practice to check the robustness of the results to changes in the sample.

The OLS estimator

To obtain the OLS estimator, we minimize $\phi(\beta) = \sum_{i=1}^n u_i^2(\beta)$ with respect to $\beta$, where $u_i(\beta) = y_i - \beta_1 x_{1i} - \dots - \beta_K x_{Ki}$. Consider

$\frac{\partial \phi(\beta)}{\partial \beta_k} = \sum_{i=1}^n 2 u_i(\beta) \frac{\partial u_i(\beta)}{\partial \beta_k} = -\sum_{i=1}^n 2 u_i(\beta) x_{ki} \quad \text{for } k = 1, \dots, K$

Setting $\frac{\partial \phi(\beta)}{\partial \beta_k} = 0$ gives

$\sum_{i=1}^n x_{ki} u_i(\beta) = 0 \quad \text{for } k = 1, \dots, K$

These $K$ first-order conditions can be written more compactly as $X'u(\beta) = 0$, since

$X'u(\beta) = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ \vdots & \vdots & & \vdots \\ x_{K1} & x_{K2} & \cdots & x_{Kn} \end{pmatrix} \begin{pmatrix} u_1(\beta) \\ u_2(\beta) \\ \vdots \\ u_n(\beta) \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^n x_{1i} u_i(\beta) \\ \vdots \\ \sum_{i=1}^n x_{Ki} u_i(\beta) \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$

where $X'$ is $K \times n$, $u(\beta)$ is $n \times 1$, and so $X'u(\beta)$ is $K \times 1$. The first-order conditions thus set $X'u(\beta) = 0$. Now using $u(\beta) = y - X\beta$, we obtain

$X'(y - X\beta) = 0 \quad \Rightarrow \quad X'y - X'X\beta = 0$

or a system of $K$ normal equations for the $K$ unknown elements of $\beta$

$X'y = X'X\beta$

Now provided $(X'X)^{-1}$ exists, the parameter vector $\beta$ which minimizes $\phi(\beta) = \sum_{i=1}^n u_i^2(\beta)$ satisfies $\beta = (X'X)^{-1}X'y$. This gives the OLS estimator as

$\hat\beta_{OLS} = (X'X)^{-1}X'y = \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \left( \sum_{i=1}^n x_i y_i \right)$

recalling that $y_i = x_i'\beta + u_i$ and $x_i$ is a $K \times 1$ vector.
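
As a concrete illustration (a minimal sketch, not part of the original notes, assuming NumPy and using illustrative simulated data and variable names), $\hat\beta_{OLS}$ can be computed by solving the normal equations $X'X\beta = X'y$ and checked against a library least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # first column is the intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)                          # y = X beta + u

# Solve the K normal equations X'X beta = X'y rather than inverting X'X explicitly
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ols)
print(np.allclose(beta_ols, beta_lstsq))  # True
```

Solving the normal equations directly (rather than forming $(X'X)^{-1}$ explicitly) is the numerically preferable way to compute the same estimator.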

The OLS estimator can also be written as

$\hat\beta_{OLS} = \left( \frac{1}{n}\sum_{i=1}^n x_i x_i' \right)^{-1} \left( \frac{1}{n}\sum_{i=1}^n x_i y_i \right)$

In the simple case of an intercept and a single explanatory variable, i.e. $y_i = \beta_1 + \beta_2 x_{2i} + u_i$, this expression implies that

$\hat\beta_2 = \frac{\frac{1}{n}\sum_{i=1}^n (x_{2i} - \bar x_2)(y_i - \bar y)}{\frac{1}{n}\sum_{i=1}^n (x_{2i} - \bar x_2)^2} \qquad \hat\beta_1 = \bar y - \hat\beta_2 \bar x_2$

where $\hat\beta_1$ and $\hat\beta_2$ are the OLS estimates of the intercept and slope parameters, and $\bar y$ and $\bar x_2$ are the sample means of $y_i$ and $x_{2i}$.

Since $\hat\beta_1 = \bar y - \hat\beta_2 \bar x_2$ implies $\bar y = \hat\beta_1 + \hat\beta_2 \bar x_2$, the fitted relationship using OLS passes through the sample means (this result generalizes to models with more than one explanatory variable). The estimated slope parameter

$\hat\beta_2 = \frac{\frac{1}{n}\sum_{i=1}^n (x_{2i} - \bar x_2)(y_i - \bar y)}{\frac{1}{n}\sum_{i=1}^n (x_{2i} - \bar x_2)^2}$

is the ratio of the sample covariance between $x_{2i}$ and $y_i$ to the sample variance of $x_{2i}$ (this result does not generalize to models with more than one explanatory variable).

Replacing $y_i$ by $y_i - \bar y$ and replacing $x_{2i}$ by $x_{2i} - \bar x_2$ would leave the OLS estimate of the slope parameter unchanged, and make the OLS estimate of the intercept zero. This result does generalize to models with more than one explanatory variable.

We can obtain the same OLS estimates of the slope parameters either from the model

$y_i = \beta_1 + \beta_2 x_{2i} + \dots + \beta_K x_{Ki} + u_i$

or from the model expressed in terms of deviations from sample means

$(y_i - \bar y) = \beta_2 (x_{2i} - \bar x_2) + \dots + \beta_K (x_{Ki} - \bar x_K) + (u_i - \bar u)$

Unless stated otherwise, all the linear models we consider will either contain an intercept term, or be expressed in terms of deviations of the original variables from their sample means.
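
To illustrate this equivalence numerically (again a sketch assuming NumPy, not part of the original notes, with simulated data), demeaning $y$ and the non-constant regressors and running OLS without an intercept reproduces the slope estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 0.5 + 1.5 * x2 - 2.0 * x3 + rng.normal(size=n)

# Model with an intercept
X = np.column_stack([np.ones(n), x2, x3])
b_full = np.linalg.solve(X.T @ X, X.T @ y)

# Demeaned model without an intercept
Xd = np.column_stack([x2 - x2.mean(), x3 - x3.mean()])
yd = y - y.mean()
b_demeaned = np.linalg.solve(Xd.T @ Xd, Xd.T @ yd)

print(np.allclose(b_full[1:], b_demeaned))  # True: the slope estimates are identical
```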

In the special case with $K = 1$ and $x_{1i} = 1$ for $i = 1, 2, \dots, n$, i.e. the model $y_i = \beta_1 + u_i$, the first-order condition $\sum_{i=1}^n u_i(\beta_1) = 0$ implies that

$\sum_{i=1}^n (y_i - \beta_1) = \left( \sum_{i=1}^n y_i \right) - n\beta_1 = 0 \quad \Rightarrow \quad \sum_{i=1}^n y_i = n\beta_1$

so that the OLS estimator $\hat\beta_1 = \frac{1}{n}\sum_{i=1}^n y_i = \bar y$ is the sample mean of $y_i$.

We should check that, by finding the solution to the first-order conditions, we have located a minimum and not a maximum of $\phi(\beta) = u(\beta)'u(\beta)$, where $u(\beta) = y - X\beta$. $\hat\beta_{OLS}$ solves the first-order conditions, so that $X'\hat u = 0$, where $\hat u = y - X\hat\beta_{OLS}$. For any candidate value of the parameter vector, say $\tilde\beta$, let $d = \tilde\beta - \hat\beta_{OLS}$, or equivalently write $\tilde\beta = \hat\beta_{OLS} + d$. Then

$\tilde u = y - X\tilde\beta = y - X(\hat\beta_{OLS} + d) = y - X\hat\beta_{OLS} - Xd = \hat u - Xd$

and so $\tilde u'\tilde u = (\hat u - Xd)'(\hat u - Xd)$.

Now since $X'\hat u = 0$, and hence $\hat u'X = 0$, we have

$\phi(\tilde\beta) = \tilde u'\tilde u = (\hat u - Xd)'(\hat u - Xd) = \hat u'\hat u - \hat u'Xd - d'X'\hat u + d'X'Xd = \hat u'\hat u + d'X'Xd = \hat u'\hat u + v'v$

where $v = Xd$ is an $n \times 1$ column vector. Since $v'v$ is a scalar sum of squares, we know that $v'v \geq 0$, with $v'v = 0$ if and only if $v = 0$. Thus $\phi(\tilde\beta) \geq \phi(\hat\beta_{OLS}) = \hat u'\hat u$, with $\phi(\tilde\beta) = \phi(\hat\beta_{OLS})$ iff $v = Xd = 0$.

The OLS estimator $\hat\beta_{OLS}$ does indeed minimize $\phi(\beta) = u(\beta)'u(\beta)$, as desired. Moreover, if $\text{rank}(X) = K$, the only $K \times 1$ vector $d$ that satisfies $Xd = 0$ is $d = 0$, so that $\phi(\tilde\beta) = \phi(\hat\beta_{OLS})$ iff $\tilde\beta = \hat\beta_{OLS}$. In this case, the solution to the first-order conditions gives us the unique minimum. Note that if $\text{rank}(X) < K$, then $(X'X)^{-1}$ does not exist, and there is not a unique solution to the normal equations for all $K$ elements of the parameter vector $\beta$.

This situation, known as perfect multicollinearity, is easily avoided in practice by not including any explanatory variable that is a perfect linear combination of a set of other explanatory variables included in the model. For example, if the model includes $x_{2i}$ and $x_{3i}$, we could not also include a variable $x_{4i} = x_{2i} - x_{3i}$, or $x_{5i} = x_{2i} + x_{3i}$.
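
A quick numerical illustration of rank deficiency (a sketch assuming NumPy, not from the original notes; the variables are simulated): adding a regressor that is an exact linear combination of included regressors makes $X$, and hence $X'X$, rank deficient.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = x2 - x3                                    # perfect linear combination of x2 and x3

X = np.column_stack([np.ones(n), x2, x3, x4])   # n x 4 matrix, but only rank 3
print(np.linalg.matrix_rank(X))                 # 3 < K = 4
print(np.linalg.matrix_rank(X.T @ X))           # 3: X'X is singular, so (X'X)^{-1} does not exist
```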

If you are familiar with matrix calculus, you can verify that minimizing $\phi(\beta) = u(\beta)'u(\beta) = (y - X\beta)'(y - X\beta)$ with respect to the parameter vector $\beta$ yields the first-order conditions

$\frac{\partial \phi(\beta)}{\partial \beta} = -2X'(y - X\beta) = 0$

and the second-order conditions

$\frac{\partial^2 \phi(\beta)}{\partial \beta \, \partial \beta'} = 2X'X > 0$

if the $K \times K$ matrix $X'X$ is positive definite, confirming that we obtain a unique minimum if $(X'X)^{-1}$ exists.

Regression

The general regression model with additive errors is written in matrix notation as

$y = E(y \mid X) + u$

This is simply an additive decomposition of the random vector $y$ into its conditional expectation given $X$, and a vector of additive deviations from this conditional expectation. To motivate an initial focus on regression models, we can note some properties of the conditional expectation function.

Suppose $y_i$ is a random variable, $x_i$ is a $K \times 1$ random vector, and we are interested in predicting the value taken by $y_i$ using only information on the values taken by the random variables in $x_i$; that is, we consider predictions of the form $h(x_i)$, where $h(x_i)$ is a function which maps from the $K$-dimensional vector $x_i$ to scalars. The particular function $h^*(x_i)$ that minimises the mean squared error

$MSE(h) = E[y_i - h(x_i)]^2$

is unique, and given by the conditional expectation function $h^*(x_i) = E(y_i \mid x_i)$.

NB. Notation: previously, for two random variables $Y$ and $X$, we wrote the conditional expectation of $Y$ given some value taken by $X$ as $E(Y \mid X = x)$, and noted that this can be viewed as a function of $x$. The conditional expectation function is often denoted by $E(Y \mid X)$. Evaluated at a particular value of $X$, this function returns the conditional expectation of $Y$ given that $X$ takes that particular value. For two random variables $y_i$ and $x_i$, this becomes $E(y_i \mid x_i)$. And in the case where $x_i$ is a random vector, this function evaluated at a particular value of $x_i$ returns the conditional expectation of $y_i$ given that each element of the random vector $x_i$ takes those particular values.

The conditional expectation function is also the unique function of $x_i$ with the orthogonality property that, for every other bounded function $g(x_i)$:

$E\{[y_i - h^*(x_i)]\, g(x_i)\} = 0$

Some properties of the conditional expectation function which follow from this orthogonality condition include:

i) If $x_i = 1$ (or any other constant), then $E(y_i \mid x_i) = E(y_i)$

ii) If $y_i$ and (all the elements of) $x_i$ are independent, then $E(y_i \mid x_i) = E(y_i)$

iii) If another random vector $z_i$ is independent of $(y_i, x_i)$, then $E(y_i \mid x_i, z_i) = E(y_i \mid x_i)$

iv) If $y_{1i}$ and $y_{2i}$ are random variables and $a_1$ and $a_2$ are constants, then $E[(a_1 y_{1i} + a_2 y_{2i}) \mid x_i] = a_1 E(y_{1i} \mid x_i) + a_2 E(y_{2i} \mid x_i)$

Linear regression

A linear regression model is obtained if the conditional expectation of $y$ given $X$ is a linear function of $X$, i.e. $E(y \mid X) = X\beta$. This gives

$y = X\beta + u \quad \text{or} \quad y_i = x_i'\beta + u_i \ \text{ for } i = 1, \dots, n$

with $E(u_i \mid x_i) = 0$ (scalar) and $E(u \mid X) = 0$ ($n \times 1$ vector).

One motivation for the OLS estimator is that it will be shown to have some desirable properties as an estimator of the parameter vector $\beta$ in a class of linear regression models. One context in which the conditional expectation of $y_i$ given $x_i$ would be a linear function of $x_i$ is the case in which the random vector $(y_i, x_i)$ has a normal distribution, i.e. when all the variables considered in the model are jointly normal. Suppose we have $K = 2$, $x_{1i} = 1$, and the random vector

$\begin{pmatrix} y_i \\ x_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_y \\ \mu_{x_2} \end{pmatrix}, \begin{pmatrix} \sigma_y^2 & \sigma_{yx_2} \\ \sigma_{yx_2} & \sigma_{x_2}^2 \end{pmatrix} \right)$

Then the marginal distribution $y_i \sim N(\mu_y, \sigma_y^2)$ is also normal, and the conditional distribution is

$y_i \mid x_{2i} \sim N\!\left(\beta_1 + \beta_2 x_{2i},\ \sigma_y^2 - \beta_2^2 \sigma_{x_2}^2\right)$

where $\beta_1 = \mu_y - \beta_2 \mu_{x_2}$ and $\beta_2 = \sigma_{yx_2} / \sigma_{x_2}^2$. The conditional distribution is also normal, with the conditional expectation $E(y_i \mid x_{2i}) = E(y_i \mid x_i) = \beta_1 + \beta_2 x_{2i}$, a linear function of $x_i$.

NB. Compare these expressions for $\beta_1$ and $\beta_2$ with the expressions we had for the OLS estimators $\hat\beta_1$ and $\hat\beta_2$ in the case with $K = 2$: sample means replace expected values, and sample (co-)variances replace population (co-)variances.
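
This correspondence can be checked by simulation (a sketch assuming NumPy, not from the original notes; the population parameter values below are arbitrary choices): draw a large sample from a bivariate normal distribution and compare the OLS estimates with $\beta_2 = \sigma_{yx_2}/\sigma_{x_2}^2$ and $\beta_1 = \mu_y - \beta_2 \mu_{x_2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([2.0, 1.0])                    # (mu_y, mu_x2)
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])               # [[var_y, cov_yx2], [cov_yx2, var_x2]]
draws = rng.multivariate_normal(mu, Sigma, size=200_000)
y, x2 = draws[:, 0], draws[:, 1]

beta2_pop = Sigma[0, 1] / Sigma[1, 1]        # sigma_yx2 / sigma_x2^2
beta1_pop = mu[0] - beta2_pop * mu[1]        # mu_y - beta2 * mu_x2

X = np.column_stack([np.ones(len(y)), x2])
beta1_ols, beta2_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta1_pop, beta2_pop)                  # population intercept and slope
print(beta1_ols, beta2_ols)                  # OLS estimates, close in a large sample
```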

More generally, in empirical work in economics we may have little reason to believe that the conditional expectation of $y$ given $X$ is the linear function $X\beta$. The OLS estimator is also commonly used in models where we may not believe the linear conditional expectation specification. We may still be interested in the linear projection of $y$ on $X$, denoted $E^*(y \mid X) = X\beta$. The predicted values $\hat y = X\hat\beta_{OLS}$ still have an optimality property, in the sense of minimising the mean squared error within the class of predicted values that can be constructed as linear functions of $X$.

Linear models for which we do not take seriously the linear specification of the conditional expectation function are not, strictly speaking, linear regression models, although this term is often used more generally in applied econometrics. The OLS estimates of the parameters in linear models may still provide a useful summary of patterns (partial correlations) in the data, although we should be extremely cautious about attaching any causal significance to the estimated parameters (since correlation does not imply causation). This is the case whether or not we consider the linear function $X\beta$ to be the conditional expectation function $E(y \mid X)$.

Example: Suppose the process which generates the outcomes $y_i$ is the linear process

$y_i = \gamma_1 + \gamma_2 x_{2i} + \gamma_3 x_{3i} + u_i$

with $u_i$ independent of $(x_{2i}, x_{3i})$, so that $E(u_i \mid x_{2i}, x_{3i}) = 0$ and $E(u_i \mid x_{2i}) = E(u_i \mid x_{3i}) = 0$. Then we have

$E(y_i \mid x_{2i}, x_{3i}) = \gamma_1 + \gamma_2 x_{2i} + \gamma_3 x_{3i}$

Suppose the process which generates the explanatory variable $x_{3i}$ is the linear process

$x_{3i} = \delta_1 + \delta_2 x_{2i} + v_i$

with $v_i$ independent of $x_{2i}$, so that $E(v_i \mid x_{2i}) = 0$.

We then also have $E(x_{3i} \mid x_{2i}) = \delta_1 + \delta_2 x_{2i}$. Now suppose we have data on $y_i$ and $x_{2i}$, but not on $x_{3i}$. We still have a linear conditional expectation function

$E(y_i \mid x_{2i}) = \gamma_1 + \gamma_2 x_{2i} + \gamma_3 E(x_{3i} \mid x_{2i}) = \gamma_1 + \gamma_2 x_{2i} + \gamma_3 (\delta_1 + \delta_2 x_{2i}) = (\gamma_1 + \gamma_3 \delta_1) + (\gamma_2 + \gamma_3 \delta_2) x_{2i} = \beta_1 + \beta_2 x_{2i}$

If our interest is in predicting the value taken by $y_i$ using only information on the value taken by $x_{2i}$, we will be interested in estimating the parameters of this linear conditional expectation function. If however we are interested in learning about the ceteris paribus effect of a change in $x_{2i}$ on the outcome $y_i$, holding $x_{3i}$ constant, this is given by the parameter $\gamma_2$ in the process which generates $y_i$, and we have $\beta_2 = \gamma_2 + \gamma_3 \delta_2 \neq \gamma_2$ (except in special cases). In this case, we will need to do something different to learn about $\gamma_2$; we will return to this problem later in the course.
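
A simulation makes the distinction concrete (a sketch assuming NumPy, not from the original notes; the values of the $\gamma$ and $\delta$ parameters are arbitrary): regressing $y_i$ on a constant and $x_{2i}$ alone recovers $\beta_2 = \gamma_2 + \gamma_3 \delta_2$, not the ceteris paribus effect $\gamma_2$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
gamma1, gamma2, gamma3 = 1.0, 0.5, 2.0
delta1, delta2 = 0.3, 0.8

x2 = rng.normal(size=n)
x3 = delta1 + delta2 * x2 + rng.normal(size=n)      # x3 depends on x2
y = gamma1 + gamma2 * x2 + gamma3 * x3 + rng.normal(size=n)

# Regress y on a constant and x2 only (x3 is unobserved)
X = np.column_stack([np.ones(n), x2])
b1, b2 = np.linalg.solve(X.T @ X, X.T @ y)
print(b2)                        # approximately gamma2 + gamma3 * delta2
print(gamma2 + gamma3 * delta2)  # 2.1, not the ceteris paribus effect 0.5
```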

[Example from economics: $y_i$ is a measure of wages, $x_{2i}$ is a measure of educational attainment, and $x_{3i}$ is an unmeasured characteristic such as ability.] For now, notice that linearity of the conditional expectation function $E(y_i \mid x_{2i})$ does not endow the OLS estimates of the parameters of this linear function with any causal significance.

Classical linear regression models

To establish properties of the OLS estimator in particular versions of the linear model, we need to make some more specific assumptions about the model. We start by considering assumptions under which we can derive properties that hold in small samples (exact finite sample properties), as well as in large samples. Specific versions of the linear model that allow us to derive these exact finite sample properties are sometimes referred to as classical linear regression models.

Later we will see that the OLS estimator may still have some desirable properties in large samples (asymptotic properties) under weaker assumptions.

The classical linear regression model with fixed regressors

Some treatments start by assuming that the error term ($u_i$) and the outcome variable ($y_i$) are random variables, or are stochastic, but that the explanatory variables ($x_{1i}, \dots, x_{Ki}$) are not random variables, or are non-stochastic. The explanatory variables are then described as being fixed in repeated samples, or fixed. This reflects the development of statistical methods for linear regression models to analyze the outcomes of controlled experiments (e.g. drug trials), in which the level of the dose is controlled by the researcher.

This assumption is rarely appropriate for the kinds of data used in applied econometrics, which are mostly not generated by controlled experiments (except perhaps when considering laboratory data in experimental economics, or randomised control trials in some contexts). For example, a labour economist may be interested in modelling labour supply as a function of wages, but may also be interested in modelling the determinants of wages. In the first case the wage variable is an explanatory variable; in the second case the wage variable is the outcome of interest.

Similarly, a macroeconomist may be interested in the effects of exchange rates on imports and exports, but may also be interested in modelling exchange rates. It makes no sense to assume that the data on wages or exchange rates are stochastic when these variables appear as dependent variables, but to treat them as non-stochastic when they appear as explanatory variables in other models.

The classical linear regression model with fixed regressors can be stated as

$y = X\beta + u, \quad E(u) = 0, \quad Var(u) = \sigma^2 I, \quad X \text{ non-stochastic and full rank}$

or equivalently as

$E(y) = X\beta, \quad Var(y) = \sigma^2 I, \quad X \text{ non-stochastic and full rank}$

Notice that $Var(u) = \sigma^2 I$ implies $Cov(u_i, u_j) = 0$ for all $i \neq j$.

Under these assumptions we can show that:

$E(\hat\beta_{OLS}) = \beta$, so the OLS estimator $\hat\beta_{OLS}$ is said to be an unbiased estimator of the parameter vector $\beta$

$Var(\hat\beta_{OLS}) = \sigma^2 (X'X)^{-1}$

$Var(\hat\beta_{OLS}) \leq Var(\tilde\beta)$ for any other unbiased estimator $\tilde\beta$ that is a linear function of the random vector $y$

NB. For $K > 1$, this means that the $K \times K$ matrix $Var(\tilde\beta) - Var(\hat\beta_{OLS}) \geq 0$ (i.e. $Var(\tilde\beta) - Var(\hat\beta_{OLS})$ is positive semi-definite).

The OLS estimator $\hat\beta_{OLS}$ is said to be the efficient estimator in the class of linear, unbiased estimators. This last result is known as the Gauss-Markov theorem, establishing that $\hat\beta_{OLS}$ is the minimum variance linear unbiased estimator of the parameter vector $\beta$, or the best linear unbiased estimator (BLUE), in the classical linear regression model with fixed regressors. Recall that

$\hat\beta_{OLS} = (X'X)^{-1}X'y = Ay$

where in this setting $A = (X'X)^{-1}X'$ is a non-stochastic $K \times n$ matrix. $\hat\beta_{OLS} = Ay$ is thus a linear function of the random vector $y$; this is what is meant here by the term linear estimator.

Note that, as a linear function of the random vector $y$, $\hat\beta_{OLS}$ is also a random vector. The random vector $\hat\beta_{OLS}$ has an expected value equal to the true parameter vector ($E(\hat\beta_{OLS}) = \beta$), where the expectation is taken over the (unspecified) distribution of the errors $u$, or equivalently (since $y = X\beta + u$ and $X\beta$ is non-stochastic) over the (unspecified) distribution of the outcomes $y$. Equivalently, since the true value of $\beta$ is not stochastic, $E(\hat\beta_{OLS} - \beta) = 0$, so the expected deviation from the true parameter vector is zero.

The random vector $\hat\beta_{OLS}$ has a variance ($Var(\hat\beta_{OLS}) = \sigma^2 (X'X)^{-1}$) which is smaller (in a matrix sense) than that of any other linear, unbiased estimator. Recalling that

$Var(\hat\beta_{OLS}) = E[(\hat\beta_{OLS} - E(\hat\beta_{OLS}))(\hat\beta_{OLS} - E(\hat\beta_{OLS}))'] = E[(\hat\beta_{OLS} - \beta)(\hat\beta_{OLS} - \beta)']$

the expected value of the squared deviation from the true parameter vector is minimised, within this class of estimators.

The results that the OLS estimator $\hat\beta_{OLS}$ is unbiased, and efficient in the class of linear, unbiased estimators, hold for any sample size $n > K$ (finite sample properties). This does not mean that having more data would not be useful: adding more observations to the sample will generally lower $Var(\hat\beta_{OLS})$, giving a more precise estimate of the parameter vector $\beta$, since the term $(X'X)^{-1}$ will generally be smaller (in a matrix sense).

What do these properties mean? Imagine conducting a sequence of controlled experiments, with the same sample size $n$, the same values for the explanatory variables $x_1, \dots, x_n$, and the same true parameter vector $\beta$. Different draws of the errors $u_1, \dots, u_n$ (in the different experiments), or equivalently different draws of the outcomes $y_1, \dots, y_n$, would give different values for $\hat\beta_{OLS}$, from the same (fixed) values of $x_1, \dots, x_n$ and the same true parameter vector $\beta$.

We would observe a distribution of the OLS estimates $\hat\beta_{OLS}$ of the parameter vector $\beta$ (we can do this using generated data in the context of Monte Carlo simulations). If we repeated the experiment enough times, with independent draws of $u_1, \dots, u_n$ (so that the sample mean converges to the expected value), then on average we would estimate the true value of the parameter vector, whatever the sample size $n$ used in each experiment: this is the unbiasedness property. As well as being correct on average, the distribution of the OLS estimates would also have the lowest possible variance, in the class of linear, unbiased estimators, in these repeated experiments: this is the efficiency property.
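
This thought experiment can be mimicked with a small Monte Carlo simulation (a sketch assuming NumPy, not from the original notes; the design and parameter values are illustrative): hold $X$ and $\beta$ fixed, draw new errors in each replication, and look at the mean and variance of the OLS estimates across replications.

```python
import numpy as np

rng = np.random.default_rng(5)
n, R = 50, 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed across replications
beta = np.array([1.0, 2.0])
sigma = 1.5

estimates = np.empty((R, 2))
for r in range(R):
    u = sigma * rng.normal(size=n)          # new draw of the errors in each replication
    y = X @ beta + u
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(estimates.mean(axis=0))                           # close to (1.0, 2.0): unbiasedness
print(estimates.var(axis=0, ddof=1))                    # close to the diagonal of
print(np.diag(sigma**2 * np.linalg.inv(X.T @ X)))       # sigma^2 (X'X)^{-1}
```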

In practice, we typically observe only one sample, and we are typically not in a position to conduct repeated, controlled experiments. Nevertheless, if we want the estimator of the parameter vector that we compute from our sample to have no expected deviation from the true parameter vector, and to have the smallest possible variance within the class of linear estimators with this property, we know that the OLS estimator $\hat\beta_{OLS}$ has these properties in the classical linear regression model (with fixed regressors).

For any $n \times 1$ random vector $y$, non-stochastic $K \times 1$ vector $a$, and non-stochastic $K \times n$ matrix $B$, the $K \times 1$ random vector $z = a + By$ has

$E(z) = a + BE(y) \qquad Var(z) = B\,Var(y)\,B'$

Using these facts, we can easily show that

$E(\hat\beta_{OLS}) = AE(y) = AX\beta = (X'X)^{-1}X'X\beta = \beta$

and

$Var(\hat\beta_{OLS}) = A\,Var(y)\,A' = A(\sigma^2 I)A' = \sigma^2 AA' = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2 (X'X)^{-1}$

noting that $(X'X)^{-1}$ is a symmetric matrix, so that $[(X'X)^{-1}]' = (X'X)^{-1}$.

We will not prove the Gauss-Markov theorem for this model, but rather show that similar results can be established in a version of the classical linear regression model that is more relevant for analyzing most kinds of data encountered in applied econometric research. From now on, some or all of the explanatory variables in the matrix $X$ will be assumed to be stochastic, i.e. to be random variables (unless stated otherwise).

The classical linear regression model with stochastic regressors

The classical linear regression model with stochastic regressors can be stated as

$y = X\beta + u, \quad E(u \mid X) = 0, \quad Var(u \mid X) = \sigma^2 I, \quad X \text{ has full rank (with probability one)}$

or equivalently as

$E(y \mid X) = X\beta, \quad Var(y \mid X) = \sigma^2 I, \quad X \text{ has full rank (with probability one)}$

It may be instructive to derive this specification, starting with a linear model for observation $i$

$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_K x_{Ki} + u_i = x_i'\beta + u_i \quad \text{for } i = 1, 2, \dots, n$

First assume that the conditional expectation of $y_i$ given $x_i$ is linear, giving $E(y_i \mid x_i) = x_i'\beta$, or equivalently $E(u_i \mid x_i) = 0$. This is a linear conditional expectation assumption. Also assume that the conditional variance of $y_i$ given $x_i$ is common to all the observations $i = 1, 2, \dots, n$, giving $Var(y_i \mid x_i) = \sigma^2$, or equivalently $Var(u_i \mid x_i) = \sigma^2$. This is a conditional homoskedasticity assumption.

Further assume that the observations on $(y_i, x_i)$ are independent over $i = 1, 2, \dots, n$, so that

$E(y_i \mid x_1, x_2, \dots, x_n) = E(y_i \mid x_i) = x_i'\beta \quad \text{for all } i = 1, 2, \dots, n$

or

$E(u_i \mid x_1, x_2, \dots, x_n) = E(u_i \mid x_i) = 0 \quad \text{for all } i = 1, 2, \dots, n$

and

$Var(y_i \mid x_1, x_2, \dots, x_n) = Var(y_i \mid x_i) = \sigma^2 \quad \text{for all } i = 1, 2, \dots, n$

or

$Var(u_i \mid x_1, x_2, \dots, x_n) = Var(u_i \mid x_i) = \sigma^2 \quad \text{for all } i = 1, 2, \dots, n$

Independence also implies that

$Cov(y_i, y_j \mid x_1, x_2, \dots, x_n) = 0 \quad \text{for all } i \neq j$

or

$Cov(u_i, u_j \mid x_1, x_2, \dots, x_n) = 0 \quad \text{for all } i \neq j$

Collecting these implications of independence gives the stronger forms of the linear conditional expectation assumption, $E(y \mid X) = X\beta$ or $E(u \mid X) = 0$, and of the conditional homoskedasticity assumption, $Var(y \mid X) = \sigma^2 I$ or $Var(u \mid X) = \sigma^2 I$, required for the classical linear regression model with stochastic regressors.

Adding the assumption that $X$ has full rank completes the specification. Since the $n \times K$ matrix $X$ now contains random variables, some authors prefer to state the full rank assumption in the form $\text{rank}(X) = K$ with probability one, although since this condition cannot be guaranteed with random draws from interesting distributions, even this is something of a fudge. Again we can note that the case of perfect multicollinearity (or rank deficiency) is easily avoided in practice.

Stated in the form

$E(u_i \mid x_1, x_2, \dots, x_n) = 0 \quad \text{for } i = 1, 2, \dots, n$

the linear conditional expectation assumption is sometimes referred to as strict exogeneity. This implies that the error term $u_i$ for observation $i$ is uncorrelated with the explanatory variables in $x_j$ for all observations $i, j = 1, 2, \dots, n$, not only that $u_i$ is uncorrelated with $x_i$. Similarly, this can be shown to imply that the explanatory variables in $x_i$ for observation $i$ are uncorrelated with the error terms $u_j$ for all observations $i, j = 1, 2, \dots, n$, not only that $x_i$ is uncorrelated with $u_i$.

Notice that, in a time series setting, this strict exogeneity assumption rules out the presence of lagged dependent variables among the set of explanatory variables, since the dependent variable for observation $i - 1$ is necessarily correlated with the error term for observation $i - 1$. To show this more formally, we temporarily switch notation from $i = 1, \dots, n$ to $t = 1, \dots, T$ to denote the observations. Suppose that $K = 2$, with $x_{1t} = 1$ for all $t$ (intercept term) and $x_{2t} = y_{t-1}$ (lagged dependent variable), giving the first-order autoregressive (AR(1)) model

$y_t = \beta_1 + \beta_2 y_{t-1} + u_t$

We can write

$y_t = \beta_1 + \beta_2 y_{t-1} + u_t = (1, y_{t-1}) \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + u_t = x_t'\beta + u_t$

and

$y_{t+1} = \beta_1 + \beta_2 y_t + u_{t+1} = (1, y_t) \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + u_{t+1} = x_{t+1}'\beta + u_{t+1}$

In this case, $x_{t+1}$ contains $y_t$, so that

$E(y_t \mid x_1, \dots, x_{t+1}, \dots, x_T) = y_t \neq E(y_t \mid x_t) = x_t'\beta = \beta_1 + \beta_2 y_{t-1}$
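
A simulation illustrates the violation (a sketch assuming NumPy, not from the original notes; the parameter values are arbitrary): in the AR(1) model the regressor for period $t+1$ is $y_t$, which is correlated with the period-$t$ error $u_t$, so strict exogeneity fails even though $u_t$ is uncorrelated with $x_t = (1, y_{t-1})'$.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 100_000
beta1, beta2 = 0.5, 0.8

u = rng.normal(size=T)
y = np.empty(T)
y[0] = beta1 / (1 - beta2) + u[0]            # start near the unconditional mean
for t in range(1, T):
    y[t] = beta1 + beta2 * y[t - 1] + u[t]   # AR(1) process

# x_{2,t+1} = y_t, so strict exogeneity would require u_t to be uncorrelated with y_t
print(np.corrcoef(u[1:], y[1:])[0, 1])   # clearly non-zero (about 0.6 for these values)
# In contrast, u_t is uncorrelated with the period-t regressor x_{2t} = y_{t-1}
print(np.corrcoef(u[1:], y[:-1])[0, 1])  # approximately zero
```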

We cannot maintain the independence assumption that was used to derive the classical linear regression model, and a different specification will be needed to handle dynamic models with lagged dependent variables in a time series setting. NB. We will not be able to obtain exact finite sample properties for the OLS estimator in such models, but we will still be able to obtain useful large sample properties.

The assumption that $Cov(u_i, u_j \mid x_1, x_2, \dots, x_n) = 0$ for all $i \neq j$ also rules out the case of serially correlated errors in time series models, where for example we may want to specify that

$Cov(u_t, u_{t-1} \mid x_1, x_2, \dots, x_T) \neq 0$

allowing for forms of temporal dependence or persistence in the error terms. In a cross-section setting, the strict exogeneity assumption rules out any correlation between the error terms for different observations, or any cross-section dependence.

Combined with the assumption of conditional homoskedasticity, this is sometimes stated more strongly as the requirement that, conditional on the explanatory variables in $X$, the error terms for different observations are independently and identically distributed (iid errors). Notice, though, that we have not specified the form of the distribution function from which these error terms are drawn; in particular, we have not assumed that the errors are drawn from a normal distribution.

We now consider the finite sample properties of the OLS estimator in the setting of the classical linear regression model with stochastic regressors. We first use the facts that, for any $n \times 1$ vector $y$ that is stochastic (i.e. random) conditional on $X$, any $K \times 1$ vector $a$ that is non-stochastic (i.e. known) conditional on $X$, and any $K \times n$ matrix $B$ that is non-stochastic conditional on $X$, the $K \times 1$ vector $z = a + By$ is stochastic conditional on $X$, with

$E(z \mid X) = a + BE(y \mid X) \qquad Var(z \mid X) = B\,Var(y \mid X)\,B'$

Recall that $\hat\beta_{OLS} = (X'X)^{-1}X'y = Ay$. Since $A = (X'X)^{-1}X'$ is known (i.e. not stochastic) conditional on $X$, while $y$ is a random vector conditional on $X$, we have that $\hat\beta_{OLS}$ is a random vector conditional on $X$. The conditional expectation of $\hat\beta_{OLS}$ is

$E(\hat\beta_{OLS} \mid X) = AE(y \mid X) = AX\beta = (X'X)^{-1}X'X\beta = \beta$

since $A = (X'X)^{-1}X'$ is not stochastic conditional on $X$. This shows that the OLS estimator $\hat\beta_{OLS}$ is unbiased, conditional on the realized values of the stochastic regressors $X$ observed in our sample.

Again this implies that $E(\hat\beta_{OLS} - \beta \mid X) = 0$, so that given the realized values of the regressors, we have zero expected deviation from the true parameter vector. The expectation is taken over the (unspecified) conditional distribution of $u \mid X$, or equivalently (since $y = X\beta + u$ and $X\beta$ is not stochastic conditional on $X$) over the (unspecified) conditional distribution of $y \mid X$. The thought experiment here is that we fix the value of $X$ and calculate $\hat\beta_{OLS}$ for many different (independent) draws of $u \mid X$ or $y \mid X$; on average, we estimate the true value of the parameter vector.

We can also show that $\hat\beta_{OLS}$ is unbiased in a weaker, unconditional sense, using the Law of Iterated Expectations

$E(z) = E[E(z \mid X)]$

where the outer expectation $E[\,\cdot\,]$ is taken over the distribution of $X$. This gives the unconditional expectation of $\hat\beta_{OLS}$ as

$E(\hat\beta_{OLS}) = E[E(\hat\beta_{OLS} \mid X)] = E[\beta] = \beta$

since the true parameter vector $\beta$ is not stochastic.

This unconditional unbiasedness property is more relevant in most economic contexts, since it is rarely meaningful to think in terms of fixing the values of the explanatory variables (this is why we want to treat the regressors as stochastic). In practice, different samples we could use to estimate $\beta$ will contain different values of the explanatory variables $X$, as well as different values of the outcome variable $y$. Again we can state this property as $E(\hat\beta_{OLS} - \beta) = 0$, so the expected deviation from the true parameter vector is zero.

The conditional variance of $\hat\beta_{OLS}$ is

$Var(\hat\beta_{OLS} \mid X) = A\,Var(y \mid X)\,A' = A(\sigma^2 I)A' = \sigma^2 AA' = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2 (X'X)^{-1}$

noting that $(X'X)^{-1}$ is a symmetric matrix, as before. The Law of Iterated Expectations can be used to show that the unconditional variance is $Var(\hat\beta_{OLS}) = \sigma^2 E[(X'X)^{-1}]$, but the expression for the conditional variance $Var(\hat\beta_{OLS} \mid X) = \sigma^2 (X'X)^{-1}$ is more useful here.

This will be used in the development of hypothesis tests about the true value of (elements of) the parameter vector, which are conducted conditional on the observed values of the regressors in our sample. We also have a conditional version of the Gauss-Markov theorem, stating that $Var(\hat\beta_{OLS} \mid X) \leq Var(\tilde\beta \mid X)$ for any other unbiased estimator $\tilde\beta$ that is a linear function of the random vector $y$ conditional on $X$. Given the realized values of $X$, the OLS estimator $\hat\beta_{OLS}$ is efficient in the class of linear, unbiased estimators.

To prove this version of the Gauss-Markov theorem, we let $\tilde\beta = \tilde A y$ for a $K \times n$ matrix $\tilde A$ that is non-stochastic conditional on $X$ (i.e. $\tilde\beta$ is a linear function of $y$ conditional on $X$). For $\tilde\beta$ to be unbiased conditional on $X$, we require

$E(\tilde\beta \mid X) = \tilde A E(y \mid X) = \tilde A X\beta = \beta$

Hence we require $\tilde A X = I$ (here a $K \times K$ identity matrix). We also have

$Var(\tilde\beta \mid X) = \tilde A\,Var(y \mid X)\,\tilde A' = \sigma^2 \tilde A \tilde A'$

Write $\tilde A = A + D$, where $A = (X'X)^{-1}X'$ and the $K \times n$ matrix $D = \tilde A - A$. Consider

$\tilde A X = (A + D)X = AX + DX$

We know that $AX = (X'X)^{-1}X'X = I$, which implies $\tilde A X = I + DX$. For $\tilde\beta$ to be unbiased conditional on $X$, we require $\tilde A X = I$, which implies $DX = 0$ (a $K \times K$ matrix with every element zero). Note that $DX = 0$ implies $DX(X'X)^{-1} = DA' = 0$ (using the symmetry of $(X'X)^{-1}$), which in turn implies that $[DA']' = AD' = 0$.

Now consider

$\tilde A \tilde A' = (A + D)(A + D)' = AA' + AD' + DA' + DD'$

If $\tilde\beta$ is unbiased conditional on $X$, we have $AD' = DA' = 0$, and this simplifies to $\tilde A \tilde A' = AA' + DD'$. Now since $Var(\tilde\beta \mid X) = \sigma^2 \tilde A \tilde A'$, we have

$Var(\tilde\beta \mid X) = \sigma^2 AA' + \sigma^2 DD'$

Recall that $Var(\hat\beta_{OLS} \mid X) = \sigma^2 (X'X)^{-1} = \sigma^2 AA'$. So we have

$Var(\tilde\beta \mid X) = Var(\hat\beta_{OLS} \mid X) + \sigma^2 DD'$

For any $K \times n$ matrix $D$, the matrix $DD'$ is positive semi-definite. Since $\sigma^2 = Var(u_i \mid X) > 0$, we also have that $\sigma^2 DD'$ is positive semi-definite. Thus $Var(\tilde\beta \mid X) - Var(\hat\beta_{OLS} \mid X)$ is positive semi-definite, which is what is meant by the matrix inequality statement $Var(\tilde\beta \mid X) \geq Var(\hat\beta_{OLS} \mid X)$.
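
The argument can also be checked numerically (a sketch assuming NumPy, not from the original notes; building $D$ by multiplying an arbitrary matrix by the annihilator $M$ is just one convenient way to generate a matrix with $DX = 0$):

```python
import numpy as np

rng = np.random.default_rng(7)
n, K = 40, 3
sigma2 = 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

A = np.linalg.inv(X.T @ X) @ X.T             # OLS weights: beta_hat = A y
M = np.eye(n) - X @ A                        # annihilator, M X = 0

D = rng.normal(size=(K, n)) @ M              # any such D satisfies D X = 0
A_tilde = A + D                              # weights of an alternative linear unbiased estimator

var_ols = sigma2 * A @ A.T                   # sigma^2 (X'X)^{-1}
var_alt = sigma2 * A_tilde @ A_tilde.T       # sigma^2 (A A' + D D'), since A D' = 0

diff = var_alt - var_ols
print(np.diag(diff))                               # non-negative diagonal elements
print(np.all(np.linalg.eigvalsh(diff) >= -1e-10))  # positive semi-definite (up to rounding)
```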

Two implications of this result can also be noted.

$Var(\hat\beta_{OLS} \mid X) = \begin{pmatrix} Var(\hat\beta_1 \mid X) & Cov(\hat\beta_1, \hat\beta_2 \mid X) & \cdots & Cov(\hat\beta_1, \hat\beta_K \mid X) \\ Cov(\hat\beta_2, \hat\beta_1 \mid X) & Var(\hat\beta_2 \mid X) & \cdots & Cov(\hat\beta_2, \hat\beta_K \mid X) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(\hat\beta_K, \hat\beta_1 \mid X) & Cov(\hat\beta_K, \hat\beta_2 \mid X) & \cdots & Var(\hat\beta_K \mid X) \end{pmatrix}$

where $\hat\beta_k$ denotes the $k$-th element of $\hat\beta_{OLS}$ for $k = 1, \dots, K$. Variance matrices are symmetric, since $Cov(\hat\beta_j, \hat\beta_k \mid X) = Cov(\hat\beta_k, \hat\beta_j \mid X)$.

Similarly, $Var(\tilde\beta \mid X)$ and $Var(\tilde\beta \mid X) - Var(\hat\beta_{OLS} \mid X)$ are symmetric matrices. A symmetric matrix that is positive semi-definite has non-negative numbers on its main diagonal. So $Var(\tilde\beta \mid X) - Var(\hat\beta_{OLS} \mid X)$ positive semi-definite implies that the diagonal elements satisfy

$Var(\tilde\beta_k \mid X) - Var(\hat\beta_k \mid X) \geq 0 \quad \text{or} \quad Var(\tilde\beta_k \mid X) \geq Var(\hat\beta_k \mid X) \quad \text{for each } k = 1, \dots, K$

For each element of the parameter vector, the OLS estimator has the smallest variance in the class of estimators that are linear and unbiased, conditional on $X$.

Any linear combination of the $\beta_k$ parameters can be expressed in the form $\theta = h'\beta$, for some non-stochastic $K \times 1$ vector $h$. For example, $\beta_2 - \beta_1$ is obtained using $h = (-1, 1, 0, \dots, 0)'$, so that $h'\beta = -\beta_1 + \beta_2 = \beta_2 - \beta_1$. The estimator $\hat\theta_{OLS}$ based on $\hat\beta_{OLS}$ is $\hat\theta_{OLS} = h'\hat\beta_{OLS}$; the estimator $\tilde\theta$ based on $\tilde\beta$ is $\tilde\theta = h'\tilde\beta$. We have $Var(\hat\theta_{OLS} \mid X) = h'Var(\hat\beta_{OLS} \mid X)h$ and $Var(\tilde\theta \mid X) = h'Var(\tilde\beta \mid X)h$, so that

$Var(\tilde\theta \mid X) - Var(\hat\theta_{OLS} \mid X) = h'[Var(\tilde\beta \mid X) - Var(\hat\beta_{OLS} \mid X)]h$

For any $K \times 1$ vector $h$ and positive semi-definite matrix $G$, the scalar $h'Gh \geq 0$. Hence

$Var(\tilde\theta \mid X) - Var(\hat\theta_{OLS} \mid X) \geq 0 \quad \text{or} \quad Var(\tilde\theta \mid X) \geq Var(\hat\theta_{OLS} \mid X)$

For any linear combination of the $\beta_k$ parameters, the OLS estimator has the smallest variance in the class of estimators that are linear and unbiased, conditional on $X$. Our previous result for individual elements of the parameter vector $\beta$ can also be obtained as a special case of this result; for example, using $h = (1, 0, \dots, 0)'$ gives $h'\beta = \beta_1$.

Fitted values

The fitted value for observation $i$ in our sample is $\hat y_i = x_i'\hat\beta_{OLS}$. The vector of fitted values for all $n$ observations in our sample is

$\hat y = X\hat\beta_{OLS} = X(X'X)^{-1}X'y = Py$

where the $n \times n$ matrix $P = X(X'X)^{-1}X'$. The matrix $P$ has the property that $PX = X$, and is sometimes called the projection matrix.

Residuals

The residual for observation $i$ is $\hat u_i = y_i - x_i'\hat\beta_{OLS} = y_i - \hat y_i$. The vector of residuals for all $n$ observations in our sample is

$\hat u = y - X\hat\beta_{OLS} = y - \hat y = y - X(X'X)^{-1}X'y = (I - P)y = My$

where the $n \times n$ matrix $M = I - P = I - X(X'X)^{-1}X'$. The matrix $M$ has the property that $MX = 0$, and is sometimes called the annihilator.

The $K$ normal equations used to obtain the OLS estimator imply that, for the OLS residuals in our sample, we have $X'\hat u = 0$, or

$\begin{pmatrix} \sum_{i=1}^n x_{1i}\hat u_i \\ \sum_{i=1}^n x_{2i}\hat u_i \\ \vdots \\ \sum_{i=1}^n x_{Ki}\hat u_i \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$

This implies that, by construction, the OLS residuals $\hat u_1, \dots, \hat u_n$ in our sample are uncorrelated with each of the $K$ explanatory variables included in the model.
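
These properties of $P$, $M$ and the OLS residuals are easy to verify numerically (a sketch assuming NumPy, not from the original notes, with simulated data):

```python
import numpy as np

rng = np.random.default_rng(8)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T      # projection matrix
M = np.eye(n) - P                         # annihilator

y_hat = P @ y                             # fitted values
u_hat = M @ y                             # residuals

print(np.allclose(P @ X, X))              # True: P X = X
print(np.allclose(M @ X, 0))              # True: M X = 0
print(np.allclose(X.T @ u_hat, 0))        # True: residuals orthogonal to the regressors
print(np.allclose(y_hat + u_hat, y))      # True: y = fitted values + residuals
```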

This sample property of the OLS residuals follows directly from the definition of the OLS estimator, and tells us nothing about the validity of the linear conditional expectation assumption $E(u_i \mid x_i) = E(u_i \mid x_{1i}, x_{2i}, \dots, x_{Ki}) = 0$, or the (stronger) strict exogeneity assumption $E(u_i \mid x_1, x_2, \dots, x_n) = 0$, made in the classical linear regression model with stochastic regressors.

Estimation of $\sigma^2$

The variance parameter $\sigma^2$ is usually unknown. For the result that $Var(\hat\beta_{OLS} \mid X) = \sigma^2 (X'X)^{-1}$ to be useful in practice, we need an estimator for $\sigma^2$. The OLS estimator is

$\hat\sigma^2_{OLS} = \frac{\hat u'\hat u}{n - K} = \frac{\sum_{i=1}^n \hat u_i^2}{n - K}$

where $\hat u = y - X\hat\beta_{OLS}$ is the vector of OLS residuals.

The estimator $\hat\sigma^2_{OLS}$ can be shown to be an unbiased estimator of $\sigma^2$ in the classical linear regression models; with stochastic regressors, this holds both conditional on $X$ and unconditionally. Recall that in the special case with $K = 1$ and $x_{1i} = 1$ for $i = 1, 2, \dots, n$ (that is, in the model $y_i = \beta_1 + u_i$), the OLS estimator is $\hat\beta_1 = \bar y$, so that the OLS residuals are $\hat u_i = y_i - \bar y$. In this case, $\hat\sigma^2_{OLS}$ simplifies to the sample variance

$s^2 = \frac{1}{n - 1} \sum_{i=1}^n (y_i - \bar y)^2$

For independent observations with $E(y_i) = \mu$ and $Var(y_i) = \sigma^2$, the sample variance is an unbiased estimator of $\sigma^2$.
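
A short simulation illustrates the role of the degrees-of-freedom correction (a sketch assuming NumPy, not from the original notes; the design is illustrative): dividing the sum of squared residuals by $n - K$ gives an average across replications close to $\sigma^2$, while dividing by $n$ understates it.

```python
import numpy as np

rng = np.random.default_rng(9)
n, K, R = 20, 3, 20_000
sigma2 = 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 0.5, -1.0])

s2_unbiased, s2_naive = np.empty(R), np.empty(R)
for r in range(R):
    y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)
    u_hat = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    s2_unbiased[r] = u_hat @ u_hat / (n - K)   # divide by n - K
    s2_naive[r] = u_hat @ u_hat / n            # divide by n

print(s2_unbiased.mean())   # close to 4.0
print(s2_naive.mean())      # closer to 4.0 * (n - K) / n = 3.4
```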

OLS and Maximum Likelihood

The OLS estimator that we derived by minimizing the sum of the squared errors can also be obtained as a (conditional) Maximum Likelihood (ML) estimator in a particular case of the linear model. If the $n \times 1$ random vector $y$ has the normal distribution $y \sim N(\mu, \Sigma)$, its probability density function is

$f_y(y = a) = f_y(a) = (2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left( -\frac{w}{2} \right)$

where $w = u'\Sigma^{-1}u$, $u = a - \mu$, and $|\Sigma| = \det(\Sigma)$.

If we add to our specification of the classical linear regression model with stochastic regressors the further assumption that $u \mid X \sim N(0, \sigma^2 I)$ or, equivalently, that $y \mid X \sim N(X\beta, \sigma^2 I)$, we obtain the conditional probability density function

$f_{y \mid X}(a \mid X) = (2\pi)^{-\frac{n}{2}} (\sigma^2)^{-\frac{n}{2}} \exp\left( -\frac{u'u}{2\sigma^2} \right)$

where $u = a - X\beta$, $\Sigma^{-1} = \left(\frac{1}{\sigma^2}\right) I$, $w = \frac{u'u}{\sigma^2}$ and $|\Sigma| = (\sigma^2)^n$. Viewed as a conditional density function, the argument is $a$, with both the realized value of $X$ and the true values of the parameters $\beta$ and $\sigma^2$ taken as given.

But we can also view the expression $(2\pi)^{-\frac{n}{2}} (\sigma^2)^{-\frac{n}{2}} \exp\left( -\frac{u'u}{2\sigma^2} \right)$ as a function of the parameters $\beta$ and $\sigma^2$, with $u = y - X\beta$, and the data on $y$ and $X$ now taken as given. This gives the (conditional) likelihood function for the sample data on $(y, X)$ as

$L(\beta, \sigma^2) = (2\pi)^{-\frac{n}{2}} (\sigma^2)^{-\frac{n}{2}} \exp\left( -\frac{u'u}{2\sigma^2} \right)$

(Conditional) maximum likelihood estimators find the values of the parameters $\beta$ and $\sigma^2$ that maximize this (conditional) likelihood function given the sample data, or equivalently, which maximize the (conditional) log-likelihood function

$\mathcal{L}(\beta, \sigma^2) = \ln L(\beta, \sigma^2) = -\left(\frac{n}{2}\right) \ln 2\pi - \left(\frac{n}{2}\right) \ln(\sigma^2) - \left(\frac{1}{2\sigma^2}\right) u'u$

with $u = y - X\beta$. Notice that only the final term in the log-likelihood function depends on $\beta$. Maximizing $\mathcal{L}(\beta, \sigma^2)$ with respect to $\beta$ is thus equivalent to minimizing the sum of squared errors $u'u = \sum_{i=1}^n u_i^2$ with respect to $\beta$. Hence the (conditional) maximum likelihood estimator $\hat\beta_{ML}$ in the classical linear regression model with normally distributed errors is identical to the OLS estimator $\hat\beta_{OLS}$.
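
This equivalence can be verified numerically (a sketch assuming NumPy and SciPy, not from the original notes; the optimizer settings and parametrization of $\sigma^2$ via its log are illustrative choices): maximize the conditional log-likelihood over $(\beta, \ln\sigma^2)$ and compare $\hat\beta_{ML}$ with $\hat\beta_{OLS}$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(10)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

def neg_loglik(theta):
    beta, log_sigma2 = theta[:K], theta[K]
    sigma2 = np.exp(log_sigma2)              # parametrize sigma^2 > 0 via its logarithm
    u = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi) + 0.5 * n * log_sigma2 + 0.5 * (u @ u) / sigma2

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
result = minimize(neg_loglik, x0=np.zeros(K + 1), method="BFGS")

print(np.allclose(result.x[:K], beta_ols, atol=1e-3))              # True, up to optimizer tolerance
print(np.exp(result.x[K]))                                         # sigma^2_ML = u'u / n, which
print((y - X @ beta_ols) @ (y - X @ beta_ols) / n)                 # differs from u'u / (n - K)
```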

Strictly speaking, this is a conditional maximum likelihood estimator, since we have only specified the conditional distribution of $y \mid X$, or equivalently the conditional distribution of $u \mid X$; but this is commonly shortened to maximum likelihood estimator. [Starting from the joint density of $(y, X)$, $f_{y,X}$, we can always factorise this joint density into the conditional density $f_{y \mid X}$ times the marginal density $f_X$. The distinction between (true) maximum likelihood estimators based on $f_{y,X}$ and conditional maximum likelihood estimators based on $f_{y \mid X}$ only matters in (unusual) cases where $f_{y \mid X}$ and $f_X$ have parameters in common.]

OLS and Generalized Method of Moments

The OLS estimator can also be derived as a (Generalized) Method of Moments (GMM) estimator in the linear model formulated as

$y_i = x_i'\beta + u_i \quad \text{with } E(u_i) = 0 \text{ and } E(x_i u_i) = 0 \text{ for } i = 1, \dots, n$

Method of Moments estimators find the value of $\beta$ that sets the sample analogue of the population moment condition $E(x_i u_i) = 0$ equal to zero. The sample analogue of an expected value is the sample mean.

So we find the value of $\beta$ that sets

$\frac{1}{n} \sum_{i=1}^n x_i u_i(\beta) = 0, \quad \text{where } u_i(\beta) = y_i - x_i'\beta$

But $\frac{1}{n} \sum_{i=1}^n x_i u_i(\beta) = \left(\frac{1}{n}\right) X'u(\beta)$, and the value of $\beta$ that sets $X'u(\beta) = 0$ is again the OLS estimator $\hat\beta_{OLS}$. So we also have $\hat\beta_{OLS} = \hat\beta_{GMM}$ in this linear model. We will look at properties of ML and GMM estimators in more general settings later in the course. Both approaches produce an estimator with some desirable properties in versions of the linear model.
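
As a closing numerical check (a sketch assuming NumPy, not from the original notes), the sample moment conditions $\frac{1}{n}\sum_{i=1}^n x_i u_i(\beta) = 0$ are exactly the OLS normal equations, so solving them reproduces $\hat\beta_{OLS}$ and the moments evaluate to (numerically) zero at $\hat\beta_{OLS}$.

```python
import numpy as np

rng = np.random.default_rng(11)
n, K = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

def sample_moments(beta):
    return X.T @ (y - X @ beta) / n                   # (1/n) sum_i x_i u_i(beta)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_mm = np.linalg.solve(X.T @ X / n, X.T @ y / n)   # solve the K sample moment conditions

print(np.allclose(beta_mm, beta_ols))                 # True: the method of moments estimator is OLS
print(sample_moments(beta_ols))                       # numerically zero
```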