ECONOMETRICS (I) MEI-YUAN CHEN. Department of Finance National Chung Hsing University. July 17, 2003


© Mei-Yuan Chen. The LaTeX source file is ec471.tex.

Contents

1 Introduction
2 Reviews of Statistics
  2.1 Random Variables
  2.2 Estimation
  2.3 Hypothesis Testing
3 Random Sampling Model, Projection, and Regression
  3.1 Random Sampling
  3.2 Regression
  3.3 Linear Models
  3.4 Linear Projection
  3.5 Assumptions on the Regression Errors
4 Classical Linear Regression Models
  4.1 Simple Linear Regression (Hypothesis Testing; Prediction)
  4.2 Multiple Linear Regression
  4.3 Geometric Interpretations
  4.4 Measures of Goodness of Fit
5 Properties of the OLS Estimators
  5.1 Bias
  5.2 Variance-Covariance Matrix of Regression Error
  5.3 Variance-Covariance Matrix of OLS Estimator
  5.4 Gauss-Markov Theorem
  5.5 OLS Estimation of Error Variance
  5.6 Gaussian Quasi-MLE and MVUE
  5.7 Distribution of β̂ in Normal Regression Models
6 Method of Moments Estimation
7 Asymptotic Distribution Theory (Some Basic Mathematical Concepts; Some Inequalities; Modes of Convergence; Order Notations)
Consistency and Asymptotic Normality of OLS Estimators (Consistency; Asymptotic Normality)
Linear Hypothesis Testing: Finite Sample and Large Sample Tests (Finite Sample Tests: t- and F-Tests; An Alternative Approach; Large Sample Tests: t- and F-Tests, Wald Test, Lagrange Multiplier Test; Confidence Regions; Power of the Tests; Multicollinearity; Near Multicollinearity; Digress: Dummy Variables)
Generalized Least Squares Theory (GLS Estimators; Feasible GLS; Testing for Heteroskedasticity; Consistent Estimation of Covariance Matrices)
Generalized Method of Moments (Endogeneity; Instrumental Variables; Reduced Form; Identification; Instrumental Variables Estimation; GMM Estimator; 2SLS Estimator; Distribution of the GMM Estimator; Optimal Weight Matrix; Estimation of the Efficient Weight Matrix)
Nonlinear Regression Models (NLLS Estimation; Concentration; Computation Using Linearization; Asymptotic Distribution)
Regression Models with Limited Dependent Variables (A Binary Dependent Variable: the Linear Probability Model; Logit and Probit Models for Binary Response)
The Bootstrap (An Example; Definition of the Bootstrap; The Empirical Distribution Function)
References

1 Introduction

Economists have proposed numerous theories to characterize the relationships between economic variables; whether these theories are supported by real-world data is an empirical issue. By econometrics we mean the application of statistical and mathematical methods to the analysis of economic data, with the purpose of verifying or refuting economic theories. One of the most commonly used econometric techniques is regression analysis. In the nineteenth century, Sir Francis Galton studied the relationship between the heights of children and their parents. He observed that although tall parents tended to have tall children and short parents tended to have short children, there was a tendency for children's heights to converge toward the average. He termed this a "regression toward mediocrity." Contemporary regression analysis is concerned with describing and evaluating the relationship between a dependent variable and one or more explanatory variables. This involves formulating an econometric model, estimating its unknown parameters, and drawing statistical inference from the estimated results.

2 Reviews of Statistics

2.1 Random Variables

A random variable is a variable whose values are determined by an experiment of chance (i.e., governed by a probability distribution). We use a capital letter to denote a random variable and lower case to denote its value.

1. Discrete random variable X. Probability: P{X = x}. Probability distribution: P{X ≤ a} = Σ_{i: x_i ≤ a} P{X = x_i}.

2. Continuous random variable X. Probability density function (p.d.f.): f(x). Cumulative distribution function (c.d.f.): F(a) = P{X ≤ a} = ∫_{−∞}^{a} f(x) dx.

The behavior of a random variable is completely determined by its probability density function. Moments are numerical measures summarizing certain aspects of the behavior of a random variable, e.g., the expected value and the variance.

1. Expected value: E(X) = µ. If X is discrete, E(X) = Σ_i x_i P{X = x_i}. If X is continuous, E(X) = ∫ x f(x) dx. If c is nonstochastic, E(c) = c and E(cX) = c E(X).

2. Variance: var(X) = E(X − µ)² = E(X²) − µ² = σ². If X is discrete, var(X) = Σ_i (x_i − µ)² P{X = x_i}. If X is continuous, var(X) = ∫ (x − µ)² f(x) dx. If c is nonstochastic, var(c) = 0 and var(cX) = c² var(X).

The behavior of two (or more) random variables is determined by their joint probability density function.

1. Joint p.d.f.: f_XY(x, y) = P{X = x, Y = y} (in the discrete case).
2. Joint c.d.f.: F_XY(a, b) = P{X ≤ a, Y ≤ b} = ∫_{−∞}^{b} ∫_{−∞}^{a} f_XY(x, y) dx dy.
3. Marginal p.d.f.: f_X(x) = ∫ f_XY(x, y) dy; f_Y(y) = ∫ f_XY(x, y) dx.
4. Conditional p.d.f.: f(x|y) = f_XY(x, y)/f_Y(y); f(y|x) = f_XY(x, y)/f_X(x).
5. If f_XY(x, y) = f_X(x) f_Y(y), then X and Y are said to be independent.

The linear association between two random variables is characterized by their covariance (or correlation).

1. cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E(XY) − µ_X µ_Y = σ_XY.
2. corr(X, Y) = σ_XY/(σ_X σ_Y) = ρ_XY and −1 ≤ ρ_XY ≤ 1. Note that corr(X, Y) is nothing but the covariance between Z_X and Z_Y, where Z_X = [X − E(X)]/√var(X) and Z_Y = [Y − E(Y)]/√var(Y) are the Z-scores of X and Y.

3. If X and Y are independent, then cov(X, Y) = 0; the converse is not true.
4. E(X + Y) = E(X) + E(Y); var(X + Y) = var(X) + var(Y) + 2 cov(X, Y).

Some frequently used random variables are:

Normal random variable: X ~ N(µ, σ²), and (X − µ)/σ ~ N(0, 1).
If X_1, ..., X_m are independent N(0, 1), then Z = Σ_{i=1}^{m} X_i² ~ χ²_m.
If X ~ N(0, 1) and Y ~ χ²_m are independent, then W = X/√(Y/m) ~ t_m.
If X ~ χ²_n and Y ~ χ²_m are independent, then U = (X/n)/(Y/m) ~ F_{n,m}.

2.2 Estimation

Typically, we do not know the population characteristics θ (e.g., the mean and variance) because we do not know the probabilistic structure governing the random variable. Hence, we collect data to estimate these unknown parameters.

1. Point estimation: An estimator is a function (a rule) of sample data; an estimate is its particular value. Given a sample x_1, ..., x_n, an estimator of θ can be represented as θ̂ = g(x_1, ..., x_n). Examples (see the numerical sketch at the end of this section), given a sample (x_1, y_1), ..., (x_n, y_n):
   An estimator of the mean: the sample average x̄ = Σ_{i=1}^{n} x_i/n.
   An estimator of the variance: the sample variance Σ_{i=1}^{n} (x_i − x̄)²/(n − 1).
   An estimator of the covariance: the sample covariance Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)/(n − 1).
   An estimator of the correlation: the sample correlation Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / {[Σ_{i=1}^{n} (x_i − x̄)²]^{1/2} [Σ_{i=1}^{n} (y_i − ȳ)²]^{1/2}}.

2. Criteria to evaluate an estimator:
   Unbiasedness: E(θ̂) = θ.
   Efficiency: If θ̂_1 and θ̂_2 are both unbiased estimators, then θ̂_1 is said to be more efficient than θ̂_2 if var(θ̂_1) < var(θ̂_2).

   Mean squared error: E(θ̂ − θ)². This criterion allows us to compare biased estimators as well.

3. Interval estimation: Instead of providing a particular estimate of an unknown parameter, it may be desirable to provide a range of values which may contain the true parameter. To do this, we first specify a confidence coefficient γ, which is a probability, say 0.95, and then construct two functions g_1(x_1, ..., x_n) and g_2(x_1, ..., x_n) such that

P{g_1(x_1, ..., x_n) ≤ θ ≤ g_2(x_1, ..., x_n)} = γ.

The interval (g_1, g_2) is called the confidence interval. In words, we are 95% sure that this interval would contain the parameter θ.

Example: The x_i are drawn independently from N(µ, 1). Let γ = 0.95 and consider the estimator x̄ = Σ_{i=1}^{n} x_i/n for µ. It can be verified that x̄ ~ N(µ, 1/n), so that √n(x̄ − µ) ~ N(0, 1). From the table of the standard normal random variable, P{−1.96 < √n(x̄ − µ) < 1.96} = 0.95. Hence, the 95% confidence interval for µ is (x̄ − 1.96/√n, x̄ + 1.96/√n).

2.3 Hypothesis Testing

Theory (or prior belief) may suggest that the true parameter θ equals a particular value a. Hence, we may be interested in testing the null hypothesis H_0: θ = a against the alternative hypothesis H_a: θ ≠ a (or θ > a).

1. Test statistic T: it typically involves the difference between the estimate and the hypothesized value, e.g., T = √n(x̄ − a) is used to test H_0: µ = a. A test statistic is a random variable and hence has a distribution from which we can assess its probability, e.g., T ~ N(0, 1) under the null hypothesis. A large value of |T| is considered improbable under the null and hence suggests rejection of the null hypothesis.

2. Significance level α: the probability of incorrectly rejecting the null hypothesis that we are willing to tolerate. (This probability is also known as the type I error.)

Given α, a critical value c_α is chosen such that P{|T| > c_α} = α. We reject H_0 if the observed value T = t is such that |t| > c_α, and we say that T = t is significant at the level α.

3. Power: the probability of rejecting the null hypothesis when it is indeed false. The type II error is the probability of incorrectly accepting the null hypothesis. Hence, the power of a test is 1 − (type II error).

4. p-value: given an observed test statistic T = t, the probability of observing a value at least as extreme (i.e., beyond |t| and −|t|). That is, the p-value is the α at which T = t is just significant.
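The following is a minimal numerical sketch of these ideas in Python with NumPy. The simulated sample, the hypothesized value a, and the assumption of a known unit variance are illustrative choices, not part of the text.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Hypothetical sample drawn from N(2, 1); in practice (x_i, y_i) are observed data.
n = 100
x = rng.normal(2.0, 1.0, size=n)
y = 0.5 * x + rng.normal(0.0, 1.0, size=n)

# Point estimators from Section 2.2
x_bar = x.mean()                        # sample average
s2_x = x.var(ddof=1)                    # sample variance, divisor n - 1
s_xy = np.cov(x, y, ddof=1)[0, 1]       # sample covariance
r_xy = np.corrcoef(x, y)[0, 1]          # sample correlation

# 95% confidence interval for mu when the variance is known to be 1:
ci = (x_bar - 1.96 / sqrt(n), x_bar + 1.96 / sqrt(n))

# Test H0: mu = a against H_a: mu != a with T = sqrt(n)(x_bar - a) ~ N(0, 1) under H0.
a = 2.0
t_stat = sqrt(n) * (x_bar - a)
p_value = 2 * (1 - 0.5 * (1 + erf(abs(t_stat) / sqrt(2))))   # two-sided p-value

print(x_bar, s2_x, s_xy, r_xy)
print(ci, t_stat, p_value)
```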

3 Random Sampling Model, Projection, and Regression

3.1 Random Sampling

Suppose an econometrician has the observational data {w_i, i = 1, ..., n} = {w_1, w_2, ..., w_n}, where each w_i is a vector of numerical values representing the characteristics of an individual. Typically, the data can be written as

w_1 = (y_1, x_1'), w_2 = (y_2, x_2'), ..., w_n = (y_n, x_n'), with y_i ∈ R and x_i ∈ R^k,

or, stacking the observations,

[ y_1  x_11  x_12 ... x_1k ]
[ y_2  x_21  x_22 ... x_2k ]
[  .     .     .        .  ]
[ y_n  x_n1  x_n2 ... x_nk ]  = (y, x_1, x_2, ..., x_k),

where each column contains the observations on one variable. If the data are cross-sectional (the w_i, i = 1, ..., n, were observed at a given time and i indexes individuals), it is reasonable to assume they are mutually independent (spatial data are an exception). Furthermore, if the data are gathered symmetrically (e.g., randomly), it is also reasonable to model each observation as a random draw from the same probability distribution. Thus, the data are independent and identically distributed, or i.i.d. We call this a random sample.

3.2 Regression

In regression, we want to find the central tendency of the conditional distribution of y given x = x_i. A standard measure of central tendency is the mean; the conditional analog is the conditional mean. Let f(y, x) denote the joint density of (y, x). Then the conditional density

f(y|x = x_i) = f(y, x = x_i)/f_x(x = x_i)

exists, where f_x(x = x_i) = ∫ f(y, x = x_i) dy is the marginal density of x at x_i.

The conditional mean is defined as the function

m(x_i) = E(y|x = x_i) = ∫ y f(y|x = x_i) dy.

Note that this definition requires the existence of densities. The conditional mean m(x_i) = E(y|x = x_i) is a function, meaning that when x equals x_i, the expected value of y is m(x_i). Clearly, it is a random variable since it is a function of the random variable x_i. The regression error e_i is defined to be the difference between y_i given x = x_i and its conditional mean:

e = (y|x = x_i) − m(x_i).

By construction, this yields the formula

(y|x = x_i) = m(x_i) + e. (1)

For the jointly observed data (x_i, y_i), i = 1, ..., n, the regression under consideration can be expressed as

y_i = m(x_i) + e_i, i = 1, ..., n.

It is worth emphasizing that no assumptions have been imposed to develop (1), other than that (y, x) have a joint distribution and E|y| < ∞.

Proposition 3.1 (Properties of the regression errors e_i)
1. E(e_i|x_i) = 0.
2. E(e_i) = 0.
3. E[h(x_i)e_i] = 0 for any function h(·).
4. E(x_i e_i) = 0.

Proof:

1. By the definition of e_i and the linearity of conditional expectation,

E(e_i|x_i) = E[(y_i − m(x_i))|x_i] = E(y_i|x_i) − E[m(x_i)|x_i] = m(x_i) − m(x_i) = 0,

as E[m(x_i)|x_i] = m(x_i).

2. By the law of iterated expectations and the first result, E(e_i) = E[E(e_i|x_i)] = E[0] = 0.

3. By essentially the same argument, E[h(x_i)e_i] = E{E[h(x_i)e_i|x_i]} = E{h(x_i)E[e_i|x_i]} = E{h(x_i) · 0} = 0.

4. Follows from the third result by setting h(x_i) = x_i.

The final result implies that e_i and x_i are uncorrelated. It is important to understand that, despite being uncorrelated, e_i need not in general be independent of x_i. The equations

y_i = m(x_i) + e_i,   E(e_i|x_i) = 0,   for all i,

are often stated jointly as the regression framework. It is important to understand that this is a framework, not a model, because no restrictions have been placed on the joint distribution of the data; these equations hold true by definition. A regression model imposes further restrictions on the joint distribution, most typically restrictions on the permissible class of regression functions m(x).
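A small simulation can illustrate Proposition 3.1. The design below (a known conditional mean m(x) = 1 + x² and heteroskedastic noise) is a hypothetical assumption chosen only so that the regression error can be computed directly; the sample moments then mirror the population properties.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical design with a known conditional mean m(x) = 1 + x^2,
# so the regression error e = y - m(x) is observable in the simulation.
x = rng.uniform(-1.0, 1.0, size=n)
m = 1.0 + x**2
y = m + rng.normal(0.0, 0.5, size=n) * (1.0 + 0.5 * np.abs(x))  # heteroskedastic noise

e = y - m  # regression error, by construction

# Sample analogues of Proposition 3.1: each should be close to zero.
print(e.mean())                      # E(e) = 0
print((x * e).mean())                # E(x e) = 0
print((np.sin(3 * x) * e).mean())    # E(h(x) e) = 0 for h(x) = sin(3x)
```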

3.3 Linear Models

While m(x) in general can take any shape, a parametric family {m(x, β): β ∈ R^k} is typically chosen to simplify estimation and interpretation. Sometimes the form of m(x, β) is given by an economic theory or model. Most often, however, we adopt a linear form for convenience and data coherence. A linear model for m(x) is written as

m(x_i) = β_1 + β_2 x_{i2} + ... + β_k x_{ik},

where β = (β_1, ..., β_k)' is the parameter vector. In matrix notation, m(x_i) = x_i'β, where x_i = (1, x_{i2}, ..., x_{ik})'. Then the linear regression model becomes

y_i = x_i'β + e_i, (2)
E(e_i|x_i) = 0.

This is a model because m(·) has been restricted to the linear form. While linearity is substantively restrictive, it still allows a great deal of flexibility. For example, if x_i is real-valued and

m(x_i) = β_1 + x_i β_2 + x_i² β_3 + ... + x_i^{k−1} β_k

is a polynomial, then a linear regression model still holds, by redefining x_i as (1, x_i, x_i², ..., x_i^{k−1})'. The linear conditional mean model is illustrated in Figure 1.

3.4 Linear Projection

The linear regression model (2) implies E(x_i e_i) = 0, as

E(x_i e_i) = E[x_i(y_i − x_i'β)] = E{E[x_i(y_i − x_i'β)|x_i]} = E{x_i[E(y_i|x_i) − x_i'β]} = E(x_i · 0) = 0.

[Figure 1: An illustration of the linear conditional mean y = α_0 + β_0 x, with E(y|x = x_1), E(y|x = x_2), E(y|x = x_3) marked on the line.]

This condition is sufficient for many asymptotic results. It is interesting to observe that in linear models there is always a vector β such that this equation holds. This vector β may be called the linear projection coefficient, and x_i'β the linear predictor.

Proposition 3.2 For any random variables (y_i, x_i), let

β = [E(x_i x_i')]^{-1} E(x_i y_i) (3)

and e_i = y_i − x_i'β. Then E(x_i e_i) = 0.

Proof:

E(x_i e_i) = E[x_i(y_i − x_i'β)] = E(x_i y_i) − E(x_i x_i')β = E(x_i y_i) − E(x_i x_i')[E(x_i x_i')]^{-1} E(x_i y_i) = 0.

If β is defined as in (3), then E(x_i e_i) = 0 holds by construction. It does not necessarily follow that E(e_i|x_i) = 0; this holds only if the true conditional mean of y_i is x_i'β, i.e., m(x_i) = x_i'β, which is a substantive restriction. Thus the linear regression assumption E(e_i|x_i) = 0 is more restrictive than the linear projection construction. It turns out that for most issues in statistical inference the projection assumption is sufficient, and therefore the more general assumption E(x_i e_i) = 0 is often adopted. For econometric practice, however, it is typically desirable for x_i'β to represent the conditional mean of y_i rather than a simple linear projection. So while the conditional mean assumption is not necessary for inference on β, it may be necessary for inference on an economic relationship of interest.

3.5 Assumptions on the Regression Errors

While the regression motivation leads naturally to the model (2), at times it is more convenient to adopt assumptions which are either more or less restrictive. The standard types of models considered by econometricians, and their strengths and weaknesses, are discussed as follows. All the models are based on the decomposition

y_i = x_i'β + e_i. (4)

In addition, all models normalize the error so that E(e_i) = 0 and presume a finite variance E(e_i²) = σ² < ∞.

Definition 3.1 The Linear Projection Model is (4) plus E(x_i e_i) = 0.

The advantage of the linear projection model is that it is true by construction, and many inferential results hold under this broad condition. The disadvantage is that the coefficient vector β may not have a useful economic interpretation without additional structure. A sample analogue of the projection coefficient (3) is sketched below.
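This sketch computes the sample analogue of (3), β̂ = (Σ_i x_i x_i')^{-1} Σ_i x_i y_i, for a hypothetical data set in which y is deliberately nonlinear in the regressor, so x'β is a projection rather than the conditional mean; the moment condition E(x_i e_i) = 0 nevertheless holds in the sample by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical regressors (a constant and one covariate); y need not have a linear
# conditional mean for the projection coefficient to be well defined.
z = rng.normal(size=n)
y = np.exp(0.3 * z) + rng.normal(size=n)     # nonlinear in z
X = np.column_stack([np.ones(n), z])         # x_i' = (1, z_i)

# Sample analogue of beta = [E(x_i x_i')]^{-1} E(x_i y_i)
Sxx = X.T @ X / n
Sxy = X.T @ y / n
beta = np.linalg.solve(Sxx, Sxy)

e = y - X @ beta
print(beta)
print(X.T @ e / n)   # sample version of E(x_i e_i); zero (up to rounding) by construction
```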

Definition 3.2 The Linear Regression Model is (4) plus E(e_i|x_i) = 0.

This model follows naturally from the derivation of the conditional mean function. Its primary advantage is that the parameter β is easily interpretable.

Definition 3.3 The Homoskedastic Regression Model is the Linear Regression Model plus

E(e_i²|x_i) = σ². (5)

This model adds the auxiliary assumption (5) that the regression is conditionally homoskedastic. This assumption greatly simplifies many theoretical arguments and calculations, and it is therefore very useful in illustrative arguments. Many formulae simplify under this assumption, and as a result alternative estimators and techniques are utilized. The danger in this assumption is that these simplifications produce incorrect answers and inferences if the homoskedasticity assumption is in fact false. Another justification for assumption (5) is that while it may not be precisely true in the data, it may be approximately true, and in some applications the cost of imposing homoskedasticity on the estimates may be less than the cost of using the more general techniques appropriate for the linear regression model.

Definition 3.4 The Classical Regression Model is (4) plus the assumption that e_i is independent of x_i. Usually, x_i is assumed to be nonstochastic.

This model is more restrictive than the homoskedastic regression model and is the common starting point in classical econometrics textbooks.

Definition 3.5 The Normal Regression Model is (4) plus the assumption that e_i is independent of x_i and distributed as N(0, σ²).

The above five models are strictly nested, with the first (the linear projection model) the least restrictive and the last (the normal regression model) the most restrictive. The conditional variance function is

var(y_i|x = x_i) = E(e_i²|x = x_i) = σ²(x_i),

which is (potentially) a function of x_i. Just as the conditional mean function may take any form, so may the conditional variance function (other than the restriction that it be non-negative). Given the random variable x_i, the conditional variance is σ_i² = σ²(x_i).

In the general case where σ²(x_i) is not a constant function, so that σ_i² differs across i, we say that the error e_i is heteroskedastic. On the other hand, when the function σ²(x_i) is a constant, so that the conditional variances σ_i² all equal the same constant value σ², we say that the error e_i is homoskedastic.

4 Classical Linear Regression Models

In classical analysis, the tools of regression have been applied to study how a response variable y is affected by the independent variables x in a laboratory experiment. Usually, the values of x are designed by the scientist so that they are controllable; therefore, the x are also called the controlled variables or design variables. Thus, the explanatory variables are assumed to be nonstochastic in classical regression analysis.

4.1 Simple Linear Regression

It is typical and convenient to describe an economic relationship using a linear model. Hence, given a set of economic data, one would like to find a linear equation (a straight line) that best fits the data. We know that a good estimator is one with small mean squared error. In regression analysis, we are trying to estimate the dependent variable y with a set of explanatory variables x; that is, we want to find an estimator ŷ = f(x) of y. The mean squared error of ŷ is mse(ŷ) = E[(ŷ − y)²]. As the arithmetic average of sample observations is nothing but an expected value evaluated with the sample relative frequencies as probabilities, the sample counterpart of mse(ŷ) is the arithmetic average Σ_{i=1}^{n} (y_i − f(x_i))²/n. In the linear regression context, the estimator f(x) is restricted to a linear functional form, i.e., f(x) = β_1 x_1 + β_2 x_2 + ... + β_k x_k. The discussion of simple linear regression focuses on k = 2. Thus, we want to find an estimator which makes Σ_{i=1}^{n} (y_i − α − βx_i)²/n as small as possible. This is the ordinary least squares (OLS) estimator we are going to discuss.

Given that the linear conditional mean E(y|x) = α_0 + β_0 x is assumed (usually it is unknown) from an economic or financial theory, a linear regression model is specified as y = α + βx + u. Under the belief that the obtained sample observations {x_i, y_i}, i = 1, ..., n, are representative, the relations y_i = α_0 + β_0 x_i + e_i, i = 1, ..., n, are believed to hold, and the regression model is appropriate for the sample observations, i.e., y_i = α + βx_i + u_i, i = 1, ..., n. The relations y_i = α_0 + β_0 x_i + e_i, i = 1, ..., n, are sometimes called the identification of the relation between y and x, denoted ID 1 hereafter. That is,

ID 1: y_i = α_0 + β_0 x_i + e_i, i = 1, ..., n.

The OLS estimators of α and β are obtained by minimizing the average of the squared errors:

f(α, β) = (1/n) Σ_{i=1}^{n} (y_i − α − βx_i)².

The first order conditions are

∂f(α, β)/∂α = −2 (1/n) Σ_{i=1}^{n} (y_i − α̂_n − β̂_n x_i) = 0, (6)
∂f(α, β)/∂β = −2 (1/n) Σ_{i=1}^{n} (y_i − α̂_n − β̂_n x_i) x_i = 0, (7)

which are also called the normal equations. From (6) we obtain

α̂_n = (1/n) Σ_{i=1}^{n} y_i − β̂_n (1/n) Σ_{i=1}^{n} x_i = ȳ_n − β̂_n x̄_n, (8)

and by plugging this α̂_n into (7) we get

(1/n) Σ_{i=1}^{n} y_i x_i = (ȳ_n − β̂_n x̄_n)(1/n) Σ_{i=1}^{n} x_i + β̂_n (1/n) Σ_{i=1}^{n} x_i²,

so that

β̂_n Σ_{i=1}^{n} x_i (x_i − x̄_n) = Σ_{i=1}^{n} x_i (y_i − ȳ_n). (9)

It follows from (8) and (9) that the OLS estimators of α and β are

β̂_n = Σ_{i=1}^{n} (y_i − ȳ_n)(x_i − x̄_n) / Σ_{i=1}^{n} (x_i − x̄_n)², (10)
α̂_n = ȳ_n − β̂_n x̄_n. (11)

Note that β̂_n exists uniquely if Σ_{i=1}^{n} (x_i − x̄_n)² is not equal to zero deterministically. Obviously Σ_{i=1}^{n} (x_i − x̄_n)² = 0 if all the x_i are constant, and it could also be zero when x_i is stochastic. Therefore, we impose the following assumption so that β̂_n exists uniquely and deterministically:

A1: x_i, i = 1, ..., n, are nonstochastic and not all constant.

The equation ŷ = α̂_n + β̂_n x is the regression line. The values ŷ_i are called fitted values, and e_i = y_i − ŷ_i are called residuals. Note that by the normal equation (6), Σ_{i=1}^{n} e_i = 0, so that Σ_{i=1}^{n} y_i = Σ_{i=1}^{n} ŷ_i and ȳ_n equals the sample average of the ŷ_i. Also, by (7), Σ_{i=1}^{n} x_i e_i = 0, so that Σ_{i=1}^{n} ŷ_i e_i = 0.
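A minimal Python sketch of formulas (10) and (11); the simulated sample and the values α_0 = 1, β_0 = 2 are illustrative assumptions, not from the text.

```python
import numpy as np

def simple_ols(x, y):
    """OLS estimates for y_i = alpha + beta * x_i + e_i via formulas (10) and (11)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    beta_hat = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
    alpha_hat = y_bar - beta_hat * x_bar
    residuals = y - alpha_hat - beta_hat * x
    return alpha_hat, beta_hat, residuals

# Hypothetical data consistent with ID 1 (alpha_0 = 1, beta_0 = 2 are illustrative).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=50)

a_hat, b_hat, e = simple_ols(x, y)
print(a_hat, b_hat)
print(e.sum(), (x * e).sum())   # both ~ 0 by the normal equations (6) and (7)
```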

The OLS estimators have the following properties under appropriate assumptions.

1. Given A1, the OLS estimators are linear estimators in y_i, i.e., β̂_n = Σ_{i=1}^{n} k_i y_i and α̂_n = Σ_{i=1}^{n} h_i y_i:

β̂_n = Σ_{i=1}^{n} (x_i − x̄_n) y_i / Σ_{i=1}^{n} (x_i − x̄_n)²
     = Σ_{i=1}^{n} [(x_i − x̄_n) / Σ_{j=1}^{n} (x_j − x̄_n)²] y_i
     = Σ_{i=1}^{n} k_i y_i,

and

α̂_n = ȳ_n − β̂_n x̄_n
     = Σ_{i=1}^{n} y_i/n − x̄_n Σ_{i=1}^{n} k_i y_i
     = Σ_{i=1}^{n} (1/n − k_i x̄_n) y_i
     = Σ_{i=1}^{n} h_i y_i.

Note that Σ_{i=1}^{n} k_i = 0 and Σ_{i=1}^{n} k_i² = 1/Σ_{i=1}^{n} (x_i − x̄_n)².

2. Given ID 1 (y_i = α_0 + β_0 x_i + e_i, i = 1, ..., n) and A1, the OLS estimators are conditionally unbiased. First observe that, denoting X = (x_1, x_2, ..., x_n),

β̂_n = Σ_{i=1}^{n} (x_i − x̄_n) y_i / Σ_{i=1}^{n} (x_i − x̄_n)²
     = Σ_{i=1}^{n} (x_i − x̄_n)(α_0 + β_0 x_i + e_i) / Σ_{i=1}^{n} (x_i − x̄_n)²   (under ID 1)
     = α_0 Σ_{i=1}^{n} (x_i − x̄_n) / Σ_{i=1}^{n} (x_i − x̄_n)² + β_0 Σ_{i=1}^{n} (x_i − x̄_n) x_i / Σ_{i=1}^{n} (x_i − x̄_n)² + Σ_{i=1}^{n} (x_i − x̄_n) e_i / Σ_{i=1}^{n} (x_i − x̄_n)²
     = β_0 + Σ_{i=1}^{n} (x_i − x̄_n) e_i / Σ_{i=1}^{n} (x_i − x̄_n)²
     = β_0 + Σ_{i=1}^{n} k_i e_i.

To prove unbiasedness, take expectations of both sides of the above equation:

E(β̂_n|X) = E(β_0 + Σ_{i=1}^{n} (x_i − x̄_n) e_i / Σ_{i=1}^{n} (x_i − x̄_n)² | X)
          = β_0 + Σ_{i=1}^{n} (x_i − x̄_n) E(e_i|X) / Σ_{i=1}^{n} (x_i − x̄_n)²
          = β_0.

By iterated expectations, E(β̂_n) = E[E(β̂_n|X)] = E(β_0) = β_0; that is, β̂_n is also unconditionally unbiased. Besides, it can be seen that

E(α̂_n|X) = E(ȳ_n − β̂_n x̄_n|X)
          = E(Σ_{i=1}^{n} y_i/n − β̂_n x̄_n|X)
          = E(Σ_{i=1}^{n} (α_0 + β_0 x_i + ε_i)/n − β̂_n x̄_n|X)
          = E(α_0 + (β_0 − β̂_n) x̄_n + Σ_{i=1}^{n} ε_i/n | X)
          = α_0 + E(β_0 − β̂_n|X) x̄_n + Σ_{i=1}^{n} E(ε_i|X)/n
          = α_0,

since β̂_n is conditionally unbiased for β_0, so that E(β_0 − β̂_n|X) = 0. Alternatively, as E(ȳ_n|X) = α_0 + β_0 x̄_n, it follows that E(α̂_n|X) = E(ȳ_n − β̂_n x̄_n|X) = α_0. Note that

α̂_n = α_0 + (β_0 − β̂_n) x̄_n + Σ_{i=1}^{n} ε_i/n
     = α_0 − Σ_{i=1}^{n} k_i x̄_n ε_i + Σ_{i=1}^{n} ε_i/n
     = α_0 + Σ_{i=1}^{n} [1/n − k_i x̄_n] ε_i
     = α_0 + Σ_{i=1}^{n} h_i ε_i.

3. Under the homoskedastic linear model, i.e., ID 1 plus E(ε_i) = 0, var(ε_i) = σ_0², and E(ε_i ε_j) = 0 for i ≠ j, it can be shown that

σ²_{α̂_n} := var(α̂_n) = σ_0² (1/n + x̄_n² / Σ_{i=1}^{n} (x_i − x̄_n)²),
σ²_{β̂_n} := var(β̂_n) = σ_0² / Σ_{i=1}^{n} (x_i − x̄_n)²,
σ_{α̂_n β̂_n} := cov(α̂_n, β̂_n) = −σ_0² x̄_n / Σ_{i=1}^{n} (x_i − x̄_n)².

First, we observe that

E(y_i) = E(α_0 + β_0 x_i + ε_i) = α_0 + β_0 x_i,
var(y_i) = E[(y_i − E(y_i))²] = E[(α_0 + β_0 x_i + ε_i − α_0 − β_0 x_i)²] = E(ε_i²) = σ_0²,
cov(y_i, y_j) = E[(y_i − E(y_i))(y_j − E(y_j))] = E(ε_i ε_j) = 0 for i ≠ j.

Then, we prove the above results as follows:

σ²_{β̂_n} = var(Σ_{i=1}^{n} k_i y_i)
          = Σ_{i=1}^{n} k_i² var(y_i) + 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} k_i k_j cov(y_i, y_j)
          = σ_0² Σ_{i=1}^{n} k_i²
          = σ_0² / Σ_{i=1}^{n} (x_i − x̄_n)².

Besides,

σ²_{α̂_n} = var(Σ_{i=1}^{n} h_i y_i)
          = Σ_{i=1}^{n} h_i² var(y_i) + 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} h_i h_j cov(y_i, y_j)
          = σ_0² Σ_{i=1}^{n} h_i²
          = σ_0² Σ_{i=1}^{n} (1/n − k_i x̄_n)²
          = σ_0² Σ_{i=1}^{n} (1/n² − 2 k_i x̄_n/n + k_i² x̄_n²)
          = σ_0² [1/n − (2 x̄_n/n) Σ_{i=1}^{n} k_i + x̄_n² Σ_{i=1}^{n} k_i²]
          = σ_0² (1/n + x̄_n² / Σ_{i=1}^{n} (x_i − x̄_n)²).

Finally,

σ_{α̂_n β̂_n} = cov(Σ_{i=1}^{n} k_i y_i, Σ_{i=1}^{n} h_i y_i)
             = E{[Σ_{i=1}^{n} k_i (y_i − E(y_i))][Σ_{i=1}^{n} h_i (y_i − E(y_i))]}
             = Σ_{i=1}^{n} h_i k_i var(y_i) + Σ_{i≠j} k_i h_j cov(y_i, y_j)
             = σ_0² Σ_{i=1}^{n} k_i (1/n − k_i x̄_n)
             = (σ_0²/n) Σ_{i=1}^{n} k_i − σ_0² x̄_n Σ_{i=1}^{n} k_i²
             = −σ_0² x̄_n / Σ_{i=1}^{n} (x_i − x̄_n)².

4. (Gauss-Markov Theorem) This result says that, given ID 1 and A1 (together with the error assumptions used in Property 3), α̂_n and β̂_n have the smallest variance (are the most efficient) among all linear unbiased estimators of α_0 and β_0; i.e., they are the Best Linear Unbiased Estimators (BLUE).

Proof: Let β̃_n be any linear estimator in y_i other than β̂_n, so that it can be written as

β̃_n = Σ_{i=1}^{n} (k_i + c_i) y_i
     = Σ_{i=1}^{n} (k_i + c_i)(α_0 + β_0 x_i + ε_i)
     = α_0 Σ_{i=1}^{n} (k_i + c_i) + β_0 Σ_{i=1}^{n} (k_i + c_i) x_i + Σ_{i=1}^{n} (k_i + c_i) ε_i,

where the c_i are not all zero. Unbiasedness of β̃_n requires Σ_{i=1}^{n} (k_i + c_i) = 0 and Σ_{i=1}^{n} (k_i + c_i) x_i = 1. Thus, given Σ_{i=1}^{n} k_i = 0 and Σ_{i=1}^{n} k_i x_i = 1, we need Σ_{i=1}^{n} c_i = 0 and Σ_{i=1}^{n} x_i c_i = 0. As

var(β̃_n) = var(Σ_{i=1}^{n} (k_i + c_i) y_i)
          = Σ_{i=1}^{n} (k_i + c_i)² var(y_i) + 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} (k_i + c_i)(k_j + c_j) cov(y_i, y_j)
          = Σ_{i=1}^{n} (k_i + c_i)² var(y_i)
          = σ_0² Σ_{i=1}^{n} k_i² + σ_0² Σ_{i=1}^{n} c_i² + 2 σ_0² Σ_{i=1}^{n} k_i c_i,

and

Σ_{i=1}^{n} k_i c_i = Σ_{i=1}^{n} (x_i − x̄_n) c_i / Σ_{i=1}^{n} (x_i − x̄_n)² = [Σ_{i=1}^{n} x_i c_i − x̄_n Σ_{i=1}^{n} c_i] / Σ_{i=1}^{n} (x_i − x̄_n)² = 0,

we have

var(β̃_n) = σ_0² / Σ_{i=1}^{n} (x_i − x̄_n)² + σ_0² Σ_{i=1}^{n} c_i² ≥ var(β̂_n).

Therefore, β̂_n has the smallest variance among linear unbiased estimators.

5. σ̂_n² = Σ_{i=1}^{n} e_i²/(n − 2) is unbiased for σ_0². As

e_i = y_i − ŷ_i = y_i − α̂_n − β̂_n x_i
    = (α_0 + β_0 x_i + ε_i) − (ȳ_n − β̂_n x̄_n) − β̂_n x_i
    = (α_0 + β_0 x_i + ε_i) − (Σ_{j=1}^{n} (α_0 + β_0 x_j + ε_j)/n − β̂_n x̄_n) − β̂_n x_i
    = β_0 x_i + ε_i − β_0 x̄_n − ε̄_n + β̂_n x̄_n − β̂_n x_i
    = −(β̂_n − β_0)(x_i − x̄_n) + (ε_i − ε̄_n),

we have

Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (ε_i − ε̄_n)² + (β̂_n − β_0)² Σ_{i=1}^{n} (x_i − x̄_n)² − 2(β̂_n − β_0) Σ_{i=1}^{n} (ε_i − ε̄_n)(x_i − x̄_n).

Observe that

E[Σ_{i=1}^{n} (ε_i − ε̄_n)²] = E(Σ_{i=1}^{n} ε_i² − n ε̄_n²) = Σ_{i=1}^{n} var(ε_i) − n var(ε̄_n) = n σ_0² − n(σ_0²/n) = (n − 1) σ_0².

And,

E[(β̂_n − β_0)² Σ_{i=1}^{n} (x_i − x̄_n)²] = Σ_{i=1}^{n} (x_i − x̄_n)² E[(β̂_n − β_0)²] = Σ_{i=1}^{n} (x_i − x̄_n)² [σ_0² / Σ_{i=1}^{n} (x_i − x̄_n)²] = σ_0².

Finally, as β̂_n = β_0 + Σ_{i=1}^{n} k_i ε_i,

E[(β̂_n − β_0) ε_i] = E[(Σ_{j=1}^{n} k_j ε_j) ε_i] = k_i E(ε_i²) = k_i σ_0²,
E[(β̂_n − β_0) ε̄_n] = E[(Σ_{i=1}^{n} k_i ε_i)(Σ_{j=1}^{n} ε_j/n)] = (1/n) Σ_{i=1}^{n} k_i σ_0² = 0.

This implies

E[2(β̂_n − β_0) Σ_{i=1}^{n} (ε_i − ε̄_n)(x_i − x̄_n)]
  = 2 Σ_{i=1}^{n} (x_i − x̄_n) E[(β̂_n − β_0) ε_i] − 2 Σ_{i=1}^{n} (x_i − x̄_n) E[(β̂_n − β_0) ε̄_n]
  = 2 Σ_{i=1}^{n} (x_i − x̄_n) k_i σ_0²
  = 2 σ_0².

Thus,

E(Σ_{i=1}^{n} e_i²) = (n − 1) σ_0² + σ_0² − 2 σ_0² = (n − 2) σ_0²,

and we have proved that E(σ̂_n²) = σ_0².

6. As σ_0² is unknown, var(α̂_n) and var(β̂_n) can be estimated by

s²_{α̂_n} := v̂ar(α̂_n) = σ̂_n² (1/n + x̄_n² / Σ_{i=1}^{n} (x_i − x̄_n)²),
s²_{β̂_n} := v̂ar(β̂_n) = σ̂_n² / Σ_{i=1}^{n} (x_i − x̄_n)²,
s_{α̂_n β̂_n} := ĉov(α̂_n, β̂_n) = −σ̂_n² x̄_n / Σ_{i=1}^{n} (x_i − x̄_n)².

Since σ̂_n² is unbiased for σ_0², s²_{α̂_n}, s²_{β̂_n}, and s_{α̂_n β̂_n} are unbiased for σ²_{α̂_n}, σ²_{β̂_n}, and σ_{α̂_n β̂_n}, respectively.

Note that the identification condition plus the assumptions mentioned previously are usually called the classical assumptions:

A1: y_i = α_0 + β_0 x_i + ε_i, i = 1, ..., n.
A2: x_i are nonstochastic and not all constant.
A3: E(ε_i) = 0.
A4: E(ε_i²) = σ_0², and E(ε_i ε_j) = 0 for all i ≠ j.
A5: ε_i are i.i.d. N(0, σ_0²).

Hypothesis Testing

To perform hypothesis testing, we now assume that assumption A5 (ε_i are i.i.d. N(0, σ_0²)) holds. As assumption A5 implies assumptions A3 and A4, the previous results remain valid. From assumption A5 we have:

1. The y_i are independent N(α_0 + β_0 x_i, σ_0²).
2. α̂_n ~ N(α_0, σ²_{α̂_n}) and β̂_n ~ N(β_0, σ²_{β̂_n}).
3. (α̂_n − α_0)/σ_{α̂_n} ~ N(0, 1) and (β̂_n − β_0)/σ_{β̂_n} ~ N(0, 1).

4. Σ_{i=1}^{n} e_i²/σ_0² = (n − 2) σ̂_n²/σ_0² ~ χ²_{n−2}. Also, α̂_n and β̂_n are independent of σ̂_n².

Proof: As e_i = −(β̂_n − β_0)(x_i − x̄_n) + (ε_i − ε̄_n), we have

Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (ε_i − ε̄_n)² + (β̂_n − β_0)² Σ_{i=1}^{n} (x_i − x̄_n)² − 2(β̂_n − β_0) Σ_{i=1}^{n} (x_i − x̄_n)(ε_i − ε̄_n). (12)

For the first term in (12), we know

Σ_{i=1}^{n} (ε_i − ε̄_n)² = Σ_{i=1}^{n} [(ε_i − E(ε_i)) − (ε̄_n − E(ε_i))]²
  = Σ_{i=1}^{n} [ε_i − E(ε_i)]² + n[ε̄_n − E(ε_i)]² − 2(ε̄_n − E(ε_i)) (Σ_{i=1}^{n} ε_i − n E(ε_i))
  = Σ_{i=1}^{n} [ε_i − E(ε_i)]² + n[ε̄_n − E(ε_i)]² − 2(ε̄_n − E(ε_i)) (n ε̄_n − n E(ε_i))
  = Σ_{i=1}^{n} [ε_i − E(ε_i)]² − n[ε̄_n − E(ε_i)]².

Thus,

Σ_{i=1}^{n} (ε_i − ε̄_n)²/σ_0² = Σ_{i=1}^{n} [(ε_i − E(ε_i))/σ_0]² − [(ε̄_n − E(ε_i))/(σ_0/√n)]² ~ χ²(n) − χ²(1) = χ²(n − 1).

Next, for the second term in (12),

(β̂_n − β_0)² Σ_{i=1}^{n} (x_i − x̄_n)²/σ_0² = [(β̂_n − β_0) / √(σ_0²/Σ_{i=1}^{n} (x_i − x̄_n)²)]² ~ χ²(1).

Finally, for the last term in (12), as β̂_n − β_0 = Σ_{i=1}^{n} k_i ε_i and Σ_{i=1}^{n} (x_i − x̄_n) ε̄_n = 0,

2(β̂_n − β_0) Σ_{i=1}^{n} (x_i − x̄_n)(ε_i − ε̄_n)/σ_0²
  = 2(β̂_n − β_0) Σ_{i=1}^{n} (x_i − x̄_n) ε_i/σ_0²
  = 2(β̂_n − β_0) [Σ_{i=1}^{n} (x_i − x̄_n)²] (β̂_n − β_0)/σ_0²
  = 2 [(β̂_n − β_0) / √(σ_0²/Σ_{i=1}^{n} (x_i − x̄_n)²)]²
  ~ 2 χ²(1).

Therefore,

(n − 2) σ̂_n²/σ_0² = Σ_{i=1}^{n} e_i²/σ_0² ~ χ²(n − 1) + χ²(1) − 2χ²(1) = χ²(n − 2).

5. (α̂_n − α_0)/s_{α̂_n} ~ t_{n−2} and (β̂_n − β_0)/s_{β̂_n} ~ t_{n−2}.

Proof:

(α̂_n − α_0)/s_{α̂_n} = (α̂_n − α_0) / √(σ̂_n² [1/n + x̄_n²/Σ_{i=1}^{n} (x_i − x̄_n)²])
  = {(α̂_n − α_0) / √(σ_0² [1/n + x̄_n²/Σ_{i=1}^{n} (x_i − x̄_n)²])} / √{[(n − 2) σ̂_n²/σ_0²]/(n − 2)}
  ~ N(0, 1) / √(χ²(n − 2)/(n − 2)) = t(n − 2).

Similarly,

(β̂_n − β_0)/s_{β̂_n} = (β̂_n − β_0) / √(σ̂_n²/Σ_{i=1}^{n} (x_i − x̄_n)²)
  = {(β̂_n − β_0) / √(σ_0²/Σ_{i=1}^{n} (x_i − x̄_n)²)} / √{[(n − 2) σ̂_n²/σ_0²]/(n − 2)}
  ~ N(0, 1) / √(χ²(n − 2)/(n − 2)) = t(n − 2).

To test the null hypothesis H_0: α_0 = a or H_0: β_0 = b we can use t-tests.

1. One-sided test, H_a: β_0 > b. Under the null hypothesis, τ_{β̂_n} = (β̂_n − b)/s_{β̂_n} ~ t_{n−2}. Given the significance level γ and degrees of freedom n − 2, the critical value c_{γ,n−2} can be found in a t-table, and we reject H_0 if τ_{β̂_n} > c_{γ,n−2}. Similarly, we can test against H_a: α_0 > a by checking whether τ_{α̂_n} = (α̂_n − a)/s_{α̂_n} > c_{γ,n−2}. For H_a: β_0 < b, reject H_0 if τ_{β̂_n} < −c_{γ,n−2}.

2. Two-sided test: For H_a: β_0 ≠ b, reject H_0 if τ_{β̂_n} > c_{γ/2,n−2} or τ_{β̂_n} < −c_{γ/2,n−2}. For H_a: α_0 ≠ a, reject H_0 if τ_{α̂_n} > c_{γ/2,n−2} or τ_{α̂_n} < −c_{γ/2,n−2}.

3. The (1 − γ) confidence intervals for β_0 and α_0 are

(β̂_n − s_{β̂_n} c_{γ/2,n−2}, β̂_n + s_{β̂_n} c_{γ/2,n−2}),   (α̂_n − s_{α̂_n} c_{γ/2,n−2}, α̂_n + s_{α̂_n} c_{γ/2,n−2}).
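A short sketch of these tests and intervals in Python; the simulated sample is illustrative, and the critical value below is only an approximation for t(48) at the 5% two-sided level (in practice it would be taken from a t-table or computed with scipy.stats.t.ppf).

```python
import numpy as np

def simple_ols_inference(x, y, b=0.0):
    """t statistic for H0: beta_0 = b and standard errors in the simple regression."""
    n = len(x)
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    beta_hat = np.sum((y - y.mean()) * (x - x_bar)) / sxx
    alpha_hat = y.mean() - beta_hat * x_bar
    e = y - alpha_hat - beta_hat * x
    sigma2_hat = np.sum(e ** 2) / (n - 2)              # unbiased estimator of sigma_0^2
    s_beta = np.sqrt(sigma2_hat / sxx)                 # standard error of beta_hat
    s_alpha = np.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / sxx))
    tau_beta = (beta_hat - b) / s_beta                 # ~ t(n-2) under H0
    return beta_hat, s_beta, tau_beta, alpha_hat, s_alpha

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=50)          # illustrative sample

beta_hat, s_beta, tau, *_ = simple_ols_inference(x, y, b=0.0)
crit = 2.01                                            # approx. c_{0.025, 48}
reject = abs(tau) > crit                               # two-sided test at the 5% level
ci = (beta_hat - crit * s_beta, beta_hat + crit * s_beta)
print(tau, reject, ci)
```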

Prediction

Based on the regression line estimated with n observations, we can predict ŷ_{n+1} = α̂_n + β̂_n x_{n+1}, provided that the new information x_{n+1} is available. Observe that the prediction error has mean zero,

E(ŷ_{n+1} − y_{n+1}) = E[(α̂_n + β̂_n x_{n+1}) − (α_0 + β_0 x_{n+1} + ε_{n+1})] = 0,

and variance

E(ŷ_{n+1} − y_{n+1})² = E[(α̂_n − α_0) + (β̂_n − β_0) x_{n+1} − ε_{n+1}]²
  = var(α̂_n) + x²_{n+1} var(β̂_n) + σ_0² + 2 x_{n+1} cov(α̂_n, β̂_n)
  = σ_0² (1 + 1/n + (x_{n+1} − x̄_n)² / Σ_{i=1}^{n} (x_i − x̄_n)²).

Hence, we have a better prediction if x_{n+1} is close to x̄_n.

4.2 Multiple Linear Regression

More generally, we may postulate a linear model with k explanatory variables to represent the identification equation of y:

y = β_10 x_1 + β_20 x_2 + ... + β_k0 x_k + e.

Given a sample of observations, this specification can also be expressed as the identification condition

y = Xβ_0 + e, (13)

where β_0 = (β_10, β_20, ..., β_k0)' is the vector of unknown parameters, and y and X contain all the observations of the dependent and explanatory variables, i.e., y = (y_1, y_2, ..., y_T)' is T × 1 and

X = [ x_11  x_12 ... x_1k ]
    [ x_21  x_22 ... x_2k ]
    [   .     .        .  ]
    [ x_T1  x_T2 ... x_Tk ]

is T × k, where each column vector of X contains the observations on one explanatory variable. The basic identifiability requirement of this specification is that the number of regressors, k, is strictly less than the number of observations, T, and that the matrix X is of full column rank k; that is, the model does not contain any redundant regressor. It is also typical to set the first explanatory variable to the constant one, so that the first column vector of X is a T × 1 vector of ones, l. To summarize, the identification condition is

ID 1: y_t = β_10 + β_20 x_t2 + β_30 x_t3 + ... + β_k0 x_tk + e_t, t = 1, ..., T.

Our objective now is to find a k-dimensional regression hyperplane that best fits the data (y, X). In the light of Section 4.1, we minimize the average of the sum of squared errors:

Q(β) := (1/T) (y − Xβ)'(y − Xβ). (14)

The first order conditions for the OLS minimization problem are

∇_β Q(β) = ∇_β (y'y − 2y'Xβ + β'X'Xβ)/T = −2X'(y − Xβ)/T = 0;

the last equality can also be written as

X'Xβ = X'y,

which is known as the system of normal equations. To have a unique solution of this system for β, (X'X)^{-1} has to exist. This is the first assumption that has to be satisfied for β to be determined uniquely:

[A2] The T × k data matrix X is of full column rank.

Given that X is of full column rank, X'X is p.d. and hence invertible. The solution to the normal equations can then be expressed as

β̂_T = (X'X)^{-1} X'y. (15)

It is easy to see that the second order condition is also satisfied because ∇²_β Q(β) = 2(X'X)/T is p.d. Hence, β̂_T is the minimizer of the OLS criterion function and is known as the OLS estimator of β. As the matrix inverse is unique, the OLS estimator is also unique. The vector of OLS fitted values is ŷ = Xβ̂_T, and the vector of OLS residuals is ê = y − ŷ. By the normal equations, X'ê = 0, so that ŷ'ê = 0. When the first regressor is the constant one, X'ê = 0 implies that l'ê = Σ_t ê_t = 0. It follows that Σ_t y_t = Σ_t ŷ_t, and the sample average of the data y_t is the same as the sample average of the fitted values ŷ_t.
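A minimal NumPy sketch of the matrix formula (15); the data-generating values below are illustrative assumptions, not from the text. Solving the normal equations with np.linalg.solve avoids forming the explicit inverse.

```python
import numpy as np

def ols(X, y):
    """OLS estimator beta_hat = (X'X)^{-1} X'y, fitted values, and residuals (eq. (15))."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations X'X b = X'y
    fitted = X @ beta_hat
    resid = y - fitted
    return beta_hat, fitted, resid

# Hypothetical data: T = 200 observations, a constant and two regressors.
rng = np.random.default_rng(5)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=T), rng.uniform(size=T)])
beta0 = np.array([1.0, 0.5, -2.0])
y = X @ beta0 + rng.normal(size=T)

b_hat, y_hat, e_hat = ols(X, y)
print(b_hat)
print(X.T @ e_hat)               # ~ 0: the normal equations imply X'e_hat = 0
print(y.mean(), y_hat.mean())    # equal when a constant is included
```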

If X is not of full column rank, then its column vectors satisfy an exact linear relationship; this is known as the problem of exact multicollinearity. In this case, without loss of generality we can write x_1 = γ_2 x_2 + ... + γ_k x_k, where x_i is the i-th column of X and γ_2, ..., γ_k are not all zero. Then, for any number a ≠ 0,

β_1 x_1 = (1 − a)β_1 x_1 + aβ_1(γ_2 x_2 + ... + γ_k x_k).

The linear specification (13) is thus observationally equivalent to

Xβ := (1 − a)β_1 x_1 + (β_2 + aβ_1 γ_2)x_2 + ... + (β_k + aβ_1 γ_k)x_k,

where the elements of β vary with a and therefore could be anything. That is, the parameter vector β is not identified when exact multicollinearity is present.

Practically, when X is not of full column rank, X'X is not invertible, and there are infinitely many solutions to the normal equations X'Xβ = X'y. Consequently, the OLS estimator β̂_T cannot be computed as in (15). Exact multicollinearity usually arises from an inappropriate model specification. For example, including total income, total wage income, and total non-wage income as regressors results in exact multicollinearity because total income is, by definition, the sum of wage and non-wage income.

It is also easy to verify that the magnitudes of the coefficient estimates β̂_i are affected by the measurement units of the variables. Thus, a larger coefficient estimate does not necessarily imply that the associated explanatory variable is more important in explaining the behavior of y. In fact, the coefficient estimates are not comparable in general.

Remark: The OLS estimator is derived without resorting to knowledge of the true relationship between y and X. That is, whether y is indeed generated according to our linear specification is irrelevant to the computation of the OLS estimator; it does, however, affect the properties of the OLS estimator.

4.3 Geometric Interpretations

The OLS estimation result has nice geometric interpretations. The vector of OLS fitted values can be written as

ŷ = X(X'X)^{-1}X'y = P_X y,

where, here and in what follows, P_X = X(X'X)^{-1}X' is an orthogonal projection matrix. Hence, ŷ is the orthogonal projection of y onto span(X). The OLS residual vector is thus

ê = y − ŷ = (I − P_X)y,

which is the orthogonal projection of y onto the orthogonal complement of span(X) and is orthogonal to ŷ and X. Consequently, ŷ is the best approximation of y given the information contained in X. Figure 2 illustrates the simple case where the model contains only two explanatory variables.

Let X = [X_1 X_2], where X_1 is T × k_1 and X_2 is T × k_2, with k_1 + k_2 = k. We can write y = X_1β_1 + X_2β_2 + random error, and β̂_T = (β̂_1', β̂_2')'. Let P_{X_1} = X_1(X_1'X_1)^{-1}X_1' and P_{X_2} = X_2(X_2'X_2)^{-1}X_2' denote the orthogonal projection matrices onto span(X_1) and span(X_2), respectively. We have the following result.

[Figure 2: The orthogonal projection of y onto span(x_1, x_2): P_X y = x_1 β̂_1 + x_2 β̂_2 and ê = (I − P_X)y.]

Theorem 4.1 (Frisch-Waugh-Lovell) Given a vector y, (I − P_{X_2})y and (I − P_{X_1})y can each be uniquely decomposed into two orthogonal components:

(I − P_{X_2})y = (I − P_{X_2})X_1 β̂_1 + (I − P_X)y,
(I − P_{X_1})y = (I − P_{X_1})X_2 β̂_2 + (I − P_X)y.

Proof: As span(X_2) ⊆ span(X), we have P_X P_{X_2} = P_{X_2} P_X = P_{X_2}, and hence (I − P_{X_2})(I − P_X) = I − P_X. Therefore,

(I − P_{X_2})y = (I − P_{X_2})P_X y + (I − P_{X_2})(I − P_X)y
              = (I − P_{X_2})X_1 β̂_1 + (I − P_{X_2})X_2 β̂_2 + (I − P_X)y
              = (I − P_{X_2})X_1 β̂_1 + (I − P_X)y,

and these two components are orthogonal because

y'P_X(I − P_{X_2})(I − P_X)y = y'P_X(I − P_X)y = 0.

The second assertion follows similarly.

An implication of Theorem 4.1 is that (I − P_{X_2})X_1 β̂_1 = (I − P_{X_2})P_X y is the orthogonal projection of (I − P_{X_2})y onto span((I − P_{X_2})X_1). Thus, we can write

β̂_1 = [X_1'(I − P_{X_2})X_1]^{-1} X_1'(I − P_{X_2})y,

as can be verified directly from (15) using the partitioned matrix inversion formula. That is, β̂_1 can also be obtained by regressing (I − P_{X_2})y on (I − P_{X_2})X_1, where (I − P_{X_2})y and (I − P_{X_2})X_1 are, respectively, the residual vectors from two purging regressions: y on X_2 and X_1 on X_2. Moreover, the residual vector from regressing (I − P_{X_2})y on (I − P_{X_2})X_1 is the same as the residual vector from regressing y on X. Similarly,

β̂_2 = [X_2'(I − P_{X_1})X_2]^{-1} X_2'(I − P_{X_1})y

can be obtained by regressing (I − P_{X_1})y on (I − P_{X_1})X_2. Note that β̂_1 is not the same as the OLS estimator from regressing y on X_1 alone, and that β̂_2 is not the same as the OLS estimator from regressing y on X_2 alone, except when X_1 is orthogonal to X_2.

From Theorem 4.1 we can rewrite (I − P_{X_2})y = (I − P_{X_2})P_X y + (I − P_X)y as P_{X_2}y = P_{X_2}P_X y. Thus, a second implication of Theorem 4.1 is that projecting y directly onto span(X_2) is equivalent to iterated projections of y, first onto span(X) and then onto span(X_2). Similarly, P_{X_1}y = P_{X_1}P_X y. For an illustration of Theorem 4.1 see Figure 3; see also Davidson & MacKinnon (1993) for more details.

As an application, consider the model with X = [X_1 X_2], where X_1 contains the constant term and a time trend variable t, and X_2 includes the k_2 other explanatory variables. Then, the OLS estimates of the coefficients of X_2 are the same as those obtained by regressing detrended y on detrended X_2, where detrended y and X_2 are obtained by regressing y and X_2 on X_1, respectively.
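A numerical check of the Frisch-Waugh-Lovell result is straightforward; the data below are a hypothetical illustration, and np.linalg.lstsq is used simply as an OLS solver.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 300
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])   # k1 = 2 columns
X2 = rng.normal(size=(T, 2))                              # k2 = 2 columns
X = np.hstack([X1, X2])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=T)

def residual_maker(Z):
    """Apply I - P_Z to a vector or matrix: v -> v - Z (Z'Z)^{-1} Z'v."""
    return lambda v: v - Z @ np.linalg.solve(Z.T @ Z, Z.T @ v)

M2 = residual_maker(X2)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]                 # regress y on [X1 X2]
beta_fwl = np.linalg.lstsq(M2(X1), M2(y), rcond=None)[0]         # purged regression

print(beta_full[:2])   # coefficients on X1 from the full regression
print(beta_fwl)        # identical (up to rounding) by Theorem 4.1
```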

[Figure 3: An illustration of the Frisch-Waugh-Lovell Theorem.]

4.4 Measures of Goodness of Fit

We have learned that for a given linear specification, the OLS method yields the best fit of the data. In practice, one may postulate different linear models with different regressors and try to choose a particular one among them. It is therefore of interest to compare the performance across models. In this section we discuss how to measure the goodness of fit of models.

A natural goodness-of-fit measure is the regression variance σ̂² = ê'ê/(T − k). This measure, however, is not invariant with respect to the measurement units of the dependent variable. Instead, the following relative measures of goodness of fit are adopted in linear regression analysis. Recall that

Σ_t y_t² (TSS) = Σ_t ŷ_t² (RSS) + Σ_t ê_t² (ESS),

where TSS, RSS, and ESS denote the total, regression, and error sums of squares, respectively. The non-centered coefficient of determination (or non-centered R²) is defined as the proportion of TSS that can be explained by the regression hyperplane:

R² = RSS/TSS = 1 − ESS/TSS. (16)

Clearly, 0 ≤ R² ≤ 1, and the larger the R², the better the model fits the data. In particular, a model has a perfect fit if R² = 1, and it does not account for any variation of y if R² = 0.

Note that R² is non-decreasing in the number of variables in the model; that is, adding more variables to a model will not reduce its R². As ŷ'ŷ = ŷ'y, we can also write

R² = ŷ'ŷ/(y'y) = (ŷ'y)²/[(y'y)(ŷ'ŷ)] = cos²θ,

where θ is the angle between y and ŷ. That is, R² is a measure of the linear association between these two vectors. It is also easily verified that, when the model contains a constant term,

Σ_t (y_t − ȳ)² (Centered TSS) = Σ_t (ŷ_t − ȳ)² (Centered RSS) + Σ_t ê_t² (ESS),

where the sample average of the fitted values equals ȳ = Σ_t y_t/T. Analogous to (16), the centered coefficient of determination (or centered R²) is defined as

Centered R² = Centered RSS/Centered TSS = 1 − ESS/Centered TSS. (17)

This measure also takes values between 0 and 1 and is non-decreasing in the number of variables in the model. In contrast with the non-centered R², this measure excludes the effect of the constant term and is hence invariant with respect to the addition of a constant. If the model does not contain a constant term, the centered R² may be negative. As Σ_t (y_t − ȳ)(ŷ_t − ȳ) = Σ_t (ŷ_t − ȳ)², we immediately get

Centered R² = Σ_t (ŷ_t − ȳ)²/Σ_t (y_t − ȳ)² = [Σ_t (y_t − ȳ)(ŷ_t − ȳ)]² / {[Σ_t (y_t − ȳ)²][Σ_t (ŷ_t − ȳ)²]}.

That is, the centered R² is also the squared sample correlation coefficient of y_t and ŷ_t.

If R² were the only criterion for determining model adequacy, one would tend to select the model with more explanatory variables. The adjusted R², R̄², is the centered R² adjusted for the degrees of freedom:

R̄² = 1 − [ê'ê/(T − k)] / [(y'y − Tȳ²)/(T − 1)].

It can also be shown that

R̄² = 1 − [(T − 1)/(T − k)](1 − R²) = R² − [(k − 1)/(T − k)](1 − R²).

That is, R̄² is the centered R² with a penalty term depending on model complexity and explanatory ability. Clearly, R̄² < R² except when k = 1 or R² = 1. Note also that R̄² need not increase with the number of explanatory variables; in fact, R̄² < 0 when R² < (k − 1)/(T − 1).

Remark: Models for different dependent variables are not comparable in terms of their R² because their total variations (i.e., TSS) are different. For example, the R² of a model for y and that of a model for log y are not comparable.
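The three measures are easy to compute directly from the fitted values; the example data below are illustrative assumptions.

```python
import numpy as np

def goodness_of_fit(y, y_hat, k):
    """Non-centered R^2 (16), centered R^2 (17), and adjusted R^2 for a fitted linear model."""
    T = len(y)
    ess = np.sum((y - y_hat) ** 2)
    r2_noncentered = 1.0 - ess / np.sum(y ** 2)
    centered_tss = np.sum((y - y.mean()) ** 2)
    r2_centered = 1.0 - ess / centered_tss
    r2_adjusted = 1.0 - (ess / (T - k)) / (centered_tss / (T - 1))
    return r2_noncentered, r2_centered, r2_adjusted

# Hypothetical example: T = 100 observations, k = 3 regressors including the constant.
rng = np.random.default_rng(7)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(size=T)
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

print(goodness_of_fit(y, y_hat, k))
```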

5 Properties of the OLS Estimators

5.1 Bias

Proposition 5.1 If y = Xβ_0 + e, then β̂_T − β_0 = (X'X)^{-1}X'e.

Proof: Since y = Xβ_0 + e,

β̂_T = (X'X)^{-1}X'y = (X'X)^{-1}X'(Xβ_0 + e) = (X'X)^{-1}X'Xβ_0 + (X'X)^{-1}X'e = β_0 + (X'X)^{-1}X'e.

For β̂_T to be unbiased for β_0, the following assumption has to be imposed:

[A3] E(e|X) = 0.

Proposition 5.2 Given ID 1, [A2], and [A3], E(β̂_T − β_0|X) = 0 and E(β̂_T) = β_0.

Proof: By the previous result,

E[(β̂_T − β_0)|X] = E[(X'X)^{-1}X'e|X] = (X'X)^{-1}X'E(e|X) = 0,

so E(β̂_T|X) = β_0. Applying the law of iterated expectations, E(β̂_T) = E[E(β̂_T|X)] = E(β_0) = β_0.

Thus β̂_T is unbiased for β_0. Indeed, it is conditionally unbiased (conditional upon X), which is a stronger result.

5.2 Variance-Covariance Matrix of the Regression Error

The conditional variance-covariance matrix of the regression error vector e is D = var(e|X) = E(ee'|X), the T × T matrix whose (t, s) element is E(e_t e_s|X). When the data are a random sample, (x_t, e_t) is independent of (x_s, e_s) for t ≠ s; thus

E(e_t²|X) = E(e_t²|x_t) = σ_t²,
E(e_t e_s|X) = E(e_t e_s|x_t, x_s) = E(e_t|x_t) E(e_s|x_s) = 0 for t ≠ s.

Thus, in general, when the data are a random sample,

D = var(e|X) = diag(σ_1², σ_2², ..., σ_T²). (18)

Under the homoskedasticity restriction (5), E(e_t²|x_t) = σ_0² for all t, so

D = σ_0² I_T, (19)

which is the classical assumption for the linear regression model, that is,

[A4] var(e|X) = σ_0² I_T.

5.3 Variance-Covariance Matrix of the OLS Estimator

The conditional variance-covariance matrix of β̂_T is

V = E[(β̂_T − β_0)(β̂_T − β_0)'|X].

Since β̂_T − β_0 = (X'X)^{-1}X'e,

V = E[(X'X)^{-1}X'ee'X(X'X)^{-1}|X] = (X'X)^{-1}X'E(ee'|X)X(X'X)^{-1} = (X'X)^{-1}X'DX(X'X)^{-1},

where D is defined in (18). It may be helpful to observe that X'DX = Σ_t x_t x_t' σ_t². In the special case of (5), σ_t² = σ_0², D = σ_0² I, and X'DX = σ_0² X'X, so V simplifies to

V = (X'X)^{-1}X'X(X'X)^{-1} σ_0² = σ_0²(X'X)^{-1}.

Theorem 5.1 In the linear regression model,

V = (X'X)^{-1}X'DX(X'X)^{-1}. (20)

If (5), E(e_t²|x_t) = σ_0², holds,

V = σ_0²(X'X)^{-1}. (21)

The expression V = (X'X)^{-1}X'DX(X'X)^{-1} is often called a sandwich formula, because the central variance matrix X'DX is sandwiched between the moment matrices (X'X)^{-1}.
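The sketch below computes both (21) and a feasible version of the sandwich form (20). The text leaves D unknown; replacing each σ_t² with the squared OLS residual gives a White-type (heteroskedasticity-robust) estimate, which is introduced here only as an illustration, not as the estimator defined in the text.

```python
import numpy as np

def ols_covariances(X, y):
    """Homoskedastic covariance (21) and a White-type sandwich estimate of (20)."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    sigma2_hat = e @ e / (T - k)
    V_homo = sigma2_hat * XtX_inv                    # eq. (21)
    XDX_hat = (X * (e ** 2)[:, None]).T @ X          # sum_t x_t x_t' e_t^2 (estimates X'DX)
    V_sandwich = XtX_inv @ XDX_hat @ XtX_inv         # eq. (20) with D replaced by residuals
    return V_homo, V_sandwich

# Illustrative heteroskedastic design: error variance depends on the regressor.
rng = np.random.default_rng(8)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=T)])
e = rng.normal(size=T) * (0.5 + np.abs(X[:, 1]))
y = X @ np.array([1.0, 2.0]) + e

V_h, V_s = ols_covariances(X, y)
print(np.sqrt(np.diag(V_h)))   # conventional standard errors
print(np.sqrt(np.diag(V_s)))   # heteroskedasticity-robust standard errors
```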

5.4 Gauss-Markov Theorem

Theorem 5.2 (Gauss-Markov) In the (homoskedastic) linear regression model, β̂_T is the best linear unbiased estimator of β_0.

Proof: Consider an arbitrary linear estimator β̌_T = Ay = [(X'X)^{-1}X' + C]y, where C is an arbitrary non-zero matrix. Since

β̌_T = [(X'X)^{-1}X' + C](Xβ_0 + e) = β_0 + CXβ_0 + (X'X)^{-1}X'e + Ce

and E[β̌_T|X] = β_0 + CXβ_0, the estimator β̌_T is unbiased if and only if CX = 0. It follows that, when β̌_T is unbiased,

var(β̌_T|X) = E[(β̌_T − β_0)(β̌_T − β_0)'|X]
            = E{[(X'X)^{-1}X'e + Ce][(X'X)^{-1}X'e + Ce]'|X}
            = (X'X)^{-1}X'(σ_0² I)X(X'X)^{-1} + σ_0²(X'X)^{-1}X'C' + σ_0² CX(X'X)^{-1} + σ_0² CC'
            = σ_0²(X'X)^{-1} + σ_0² CC',

where the cross terms vanish because CX = 0 (hence X'C' = 0). The first term on the right-hand side is var(β̂_T|X) and the second term is clearly p.s.d. Thus, for any linear unbiased estimator β̌_T, var(β̌_T|X) − var(β̂_T|X) is a p.s.d. matrix.

5.5 OLS Estimation of the Error Variance

Under the restriction (5), E(e_t²|x_t) = σ_0², the error variance σ_0² is another parameter to be estimated. The OLS estimator of σ_0² is

σ̂_T² = ê'ê/(T − k) = (1/(T − k)) Σ_t ê_t²,

where k is the number of regressors. It is clear that σ̂_T² is not linear in y.

Theorem 5.3 In the homoskedastic regression model, σ̂_T² is an unbiased estimator of σ_0².

Proof: Recall that I − P_X projects onto the orthogonal complement of span(X). Then

ê = (I − P_X)y = (I − P_X)(Xβ_0 + e) = (I − P_X)e.

Hence

E(ê'ê|X) = E[e'(I − P_X)e|X] = E[trace(ee'(I − P_X))|X].

As the trace and expectation operators can be interchanged, we have

E(ê'ê|X) = trace(E[ee'(I − P_X)|X]) = trace[D(I − P_X)].

By the facts that trace(I − P_X) = rank(I − P_X) = T − k and D = σ_0² I, it follows that E(ê'ê) = (T − k)σ_0² and that E(σ̂_T²) = E(ê'ê)/(T − k) = σ_0², proving the unbiasedness of σ̂_T².

The OLS estimate of the variance-covariance matrix of β̂_T in the homoskedastic regression model is then

v̂ar(β̂_T) = σ̂_T²(X'X)^{-1},

which is unbiased for var(β̂_T) = σ_0²(X'X)^{-1} provided σ̂_T² is unbiased for σ_0².
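A small Monte Carlo experiment (purely illustrative; the design values are assumptions) makes the unbiasedness of ê'ê/(T − k) concrete, and also previews the bias of the divisor-T estimator discussed in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(9)
T, k, sigma0_sq = 30, 3, 4.0
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta0 = np.array([1.0, -0.5, 2.0])

draws_unbiased, draws_mle = [], []
for _ in range(20_000):
    y = X @ beta0 + rng.normal(scale=np.sqrt(sigma0_sq), size=T)
    e_hat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = e_hat @ e_hat
    draws_unbiased.append(ssr / (T - k))   # sigma_hat^2, divisor T - k
    draws_mle.append(ssr / T)              # divisor T (the Gaussian MLE)

print(np.mean(draws_unbiased))   # ~ 4.0 = sigma_0^2
print(np.mean(draws_mle))        # ~ 4.0 * (T - k) / T = 3.6, i.e. biased downward
```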

5.6 Gaussian Quasi-MLE and MVUE

In the normal regression model, e_t|x_t ~ N(0, σ²), and the likelihood of a single observation is

L_t(β, σ²) = (2πσ²)^{-1/2} exp(−(y_t − x_t'β)²/(2σ²)).

Then the log-likelihood function for the full sample y^T = (y_1, ..., y_T)' is

L_T(y^T; β, σ²) = Σ_t log L_t(y_t; β, σ²) = −(T/2) log(2π) − (T/2) log(σ²) − (1/(2σ²))(y − Xβ)'(y − Xβ).

The MLEs (β̃_T, σ̃_T²) maximize L_T. The first order conditions of the maximization problem are

∂L_T/∂β = X'(y − Xβ)/σ² = 0,
∂L_T/∂σ² = −T/(2σ²) + (y − Xβ)'(y − Xβ)/(2σ⁴) = 0,

which yield the MLEs of β_0 and σ²:

β̃_T = (X'X)^{-1}X'y,
σ̃_T² = (y − Xβ̃_T)'(y − Xβ̃_T)/T = ê'ê/T.

Clearly, the MLE β̃_T is the same as the OLS estimator β̂_T, but the MLE σ̃_T² differs from σ̂_T². In fact, σ̃_T² is a biased estimator because E(σ̃_T²) = σ_0²(T − k)/T ≠ σ_0².

Theorem 5.4 (Minimum Variance Unbiased Estimator, MVUE) In normal regression models, the OLS estimators β̂_T and σ̂_T² are the minimum variance unbiased estimators (MVUE).

Consider a collection of independent random variables z^T = (z_1, ..., z_T), where z_t has density function f_t(z_t, θ) with θ an r × 1 vector of parameters. Let the joint log-likelihood function of z^T be L_T(z^T; θ) = log f_T(z^T; θ). Then the score function

s_T(z^T; θ) := ∇_θ log f_T(z^T; θ) = [1/f_T(z^T; θ)] ∇_θ f_T(z^T; θ)

is the r × 1 vector of first order derivatives of log f_T with respect to θ. Under regularity conditions, differentiation and integration can be interchanged. When the postulated density function f_T is the true density function of z^T, we have

E[s_T(z^T; θ)] = ∫ [∇_θ f_T(z^T; θ)/f_T(z^T; θ)] f_T(z^T; θ) dz^T = ∇_θ (∫ f_T(z^T; θ) dz^T) = 0.

That is, s_T(z^T; θ) has mean zero. The variance of s_T is Fisher's information matrix:

B_T(θ) := var[s_T(z^T; θ)] = E[s_T(z^T; θ) s_T(z^T; θ)'].

Consider the r × r Hessian matrix of second order derivatives of log f_T:

H_T(z^T; θ) := ∇²_θ log f_T(z^T; θ)
            = ∇_θ ([1/f_T(z^T; θ)] ∇_θ f_T(z^T; θ))
            = [1/f_T(z^T; θ)] ∇²_θ f_T(z^T; θ) − [1/f_T(z^T; θ)²] [∇_θ f_T(z^T; θ)][∇_θ f_T(z^T; θ)]'.
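The closed-form MLEs are easy to verify numerically: evaluating the full-sample Gaussian log-likelihood at (β̃_T, σ̃_T²) and at perturbed values shows that the former attains the larger value. The data below are an illustrative assumption.

```python
import numpy as np

def gaussian_loglik(beta, sigma2, X, y):
    """Full-sample Gaussian log-likelihood L_T(beta, sigma^2) of the normal regression model."""
    T = len(y)
    u = y - X @ beta
    return -0.5 * T * np.log(2 * np.pi) - 0.5 * T * np.log(sigma2) - (u @ u) / (2 * sigma2)

rng = np.random.default_rng(10)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=T)

# Closed-form MLEs: beta_tilde equals the OLS estimator; sigma2_tilde uses divisor T.
beta_tilde = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_tilde
sigma2_tilde = e @ e / T

# Any perturbation of the MLEs lowers the log-likelihood.
print(gaussian_loglik(beta_tilde, sigma2_tilde, X, y))
print(gaussian_loglik(beta_tilde + 0.1, sigma2_tilde, X, y))
print(gaussian_loglik(beta_tilde, sigma2_tilde * 1.5, X, y))
```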


More information

Applied Econometrics (QEM)

Applied Econometrics (QEM) Applied Econometrics (QEM) based on Prinicples of Econometrics Jakub Mućk Department of Quantitative Economics Jakub Mućk Applied Econometrics (QEM) Meeting #3 1 / 42 Outline 1 2 3 t-test P-value Linear

More information

Empirical Economic Research, Part II

Empirical Economic Research, Part II Based on the text book by Ramanathan: Introductory Econometrics Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 7, 2011 Outline Introduction

More information

Lecture 3: Multiple Regression

Lecture 3: Multiple Regression Lecture 3: Multiple Regression R.G. Pierse 1 The General Linear Model Suppose that we have k explanatory variables Y i = β 1 + β X i + β 3 X 3i + + β k X ki + u i, i = 1,, n (1.1) or Y i = β j X ji + u

More information

Topics in Probability and Statistics

Topics in Probability and Statistics Topics in Probability and tatistics A Fundamental Construction uppose {, P } is a sample space (with probability P), and suppose X : R is a random variable. The distribution of X is the probability P X

More information

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data July 2012 Bangkok, Thailand Cosimo Beverelli (World Trade Organization) 1 Content a) Classical regression model b)

More information

Linear Models and Estimation by Least Squares

Linear Models and Estimation by Least Squares Linear Models and Estimation by Least Squares Jin-Lung Lin 1 Introduction Causal relation investigation lies in the heart of economics. Effect (Dependent variable) cause (Independent variable) Example:

More information

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models University of Illinois Fall 2016 Department of Economics Roger Koenker Economics 536 Lecture 7 Introduction to Specification Testing in Dynamic Econometric Models In this lecture I want to briefly describe

More information

The Simple Linear Regression Model

The Simple Linear Regression Model The Simple Linear Regression Model Lesson 3 Ryan Safner 1 1 Department of Economics Hood College ECON 480 - Econometrics Fall 2017 Ryan Safner (Hood College) ECON 480 - Lesson 3 Fall 2017 1 / 77 Bivariate

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

4. Distributions of Functions of Random Variables

4. Distributions of Functions of Random Variables 4. Distributions of Functions of Random Variables Setup: Consider as given the joint distribution of X 1,..., X n (i.e. consider as given f X1,...,X n and F X1,...,X n ) Consider k functions g 1 : R n

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Linear models and their mathematical foundations: Simple linear regression

Linear models and their mathematical foundations: Simple linear regression Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction

More information

General Linear Model: Statistical Inference

General Linear Model: Statistical Inference Chapter 6 General Linear Model: Statistical Inference 6.1 Introduction So far we have discussed formulation of linear models (Chapter 1), estimability of parameters in a linear model (Chapter 4), least

More information

Simple Linear Regression: The Model

Simple Linear Regression: The Model Simple Linear Regression: The Model task: quantifying the effect of change X in X on Y, with some constant β 1 : Y = β 1 X, linear relationship between X and Y, however, relationship subject to a random

More information

Empirical Market Microstructure Analysis (EMMA)

Empirical Market Microstructure Analysis (EMMA) Empirical Market Microstructure Analysis (EMMA) Lecture 3: Statistical Building Blocks and Econometric Basics Prof. Dr. Michael Stein michael.stein@vwl.uni-freiburg.de Albert-Ludwigs-University of Freiburg

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Linear Regression with 1 Regressor. Introduction to Econometrics Spring 2012 Ken Simons

Linear Regression with 1 Regressor. Introduction to Econometrics Spring 2012 Ken Simons Linear Regression with 1 Regressor Introduction to Econometrics Spring 2012 Ken Simons Linear Regression with 1 Regressor 1. The regression equation 2. Estimating the equation 3. Assumptions required for

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Heteroskedasticity and Autocorrelation

Heteroskedasticity and Autocorrelation Lesson 7 Heteroskedasticity and Autocorrelation Pilar González and Susan Orbe Dpt. Applied Economics III (Econometrics and Statistics) Pilar González and Susan Orbe OCW 2014 Lesson 7. Heteroskedasticity

More information

ECON 3150/4150, Spring term Lecture 6

ECON 3150/4150, Spring term Lecture 6 ECON 3150/4150, Spring term 2013. Lecture 6 Review of theoretical statistics for econometric modelling (II) Ragnar Nymoen University of Oslo 31 January 2013 1 / 25 References to Lecture 3 and 6 Lecture

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

Practical Econometrics. for. Finance and Economics. (Econometrics 2)

Practical Econometrics. for. Finance and Economics. (Econometrics 2) Practical Econometrics for Finance and Economics (Econometrics 2) Seppo Pynnönen and Bernd Pape Department of Mathematics and Statistics, University of Vaasa 1. Introduction 1.1 Econometrics Econometrics

More information

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM Subject Business Economics Paper No and Title Module No and Title Module Tag 8, Fundamentals of Econometrics 3, The gauss Markov theorem BSE_P8_M3 1 TABLE OF CONTENTS 1. INTRODUCTION 2. ASSUMPTIONS OF

More information

Econ 510 B. Brown Spring 2014 Final Exam Answers

Econ 510 B. Brown Spring 2014 Final Exam Answers Econ 510 B. Brown Spring 2014 Final Exam Answers Answer five of the following questions. You must answer question 7. The question are weighted equally. You have 2.5 hours. You may use a calculator. Brevity

More information

STAT 540: Data Analysis and Regression

STAT 540: Data Analysis and Regression STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State

More information

3 Multiple Linear Regression

3 Multiple Linear Regression 3 Multiple Linear Regression 3.1 The Model Essentially, all models are wrong, but some are useful. Quote by George E.P. Box. Models are supposed to be exact descriptions of the population, but that is

More information

We begin by thinking about population relationships.

We begin by thinking about population relationships. Conditional Expectation Function (CEF) We begin by thinking about population relationships. CEF Decomposition Theorem: Given some outcome Y i and some covariates X i there is always a decomposition where

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is

More information

1. The Multivariate Classical Linear Regression Model

1. The Multivariate Classical Linear Regression Model Business School, Brunel University MSc. EC550/5509 Modelling Financial Decisions and Markets/Introduction to Quantitative Methods Prof. Menelaos Karanasos (Room SS69, Tel. 08956584) Lecture Notes 5. The

More information

Econometrics I KS. Module 1: Bivariate Linear Regression. Alexander Ahammer. This version: March 12, 2018

Econometrics I KS. Module 1: Bivariate Linear Regression. Alexander Ahammer. This version: March 12, 2018 Econometrics I KS Module 1: Bivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: March 12, 2018 Alexander Ahammer (JKU) Module 1: Bivariate

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

The Multiple Regression Model Estimation

The Multiple Regression Model Estimation Lesson 5 The Multiple Regression Model Estimation Pilar González and Susan Orbe Dpt Applied Econometrics III (Econometrics and Statistics) Pilar González and Susan Orbe OCW 2014 Lesson 5 Regression model:

More information

Motivation for multiple regression

Motivation for multiple regression Motivation for multiple regression 1. Simple regression puts all factors other than X in u, and treats them as unobserved. Effectively the simple regression does not account for other factors. 2. The slope

More information

Bias Variance Trade-off

Bias Variance Trade-off Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]

More information

HT Introduction. P(X i = x i ) = e λ λ x i

HT Introduction. P(X i = x i ) = e λ λ x i MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 6 Jakub Mućk Econometrics of Panel Data Meeting # 6 1 / 36 Outline 1 The First-Difference (FD) estimator 2 Dynamic panel data models 3 The Anderson and Hsiao

More information

Random vectors X 1 X 2. Recall that a random vector X = is made up of, say, k. X k. random variables.

Random vectors X 1 X 2. Recall that a random vector X = is made up of, say, k. X k. random variables. Random vectors Recall that a random vector X = X X 2 is made up of, say, k random variables X k A random vector has a joint distribution, eg a density f(x), that gives probabilities P(X A) = f(x)dx Just

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

In the bivariate regression model, the original parameterization is. Y i = β 1 + β 2 X2 + β 2 X2. + β 2 (X 2i X 2 ) + ε i (2)

In the bivariate regression model, the original parameterization is. Y i = β 1 + β 2 X2 + β 2 X2. + β 2 (X 2i X 2 ) + ε i (2) RNy, econ460 autumn 04 Lecture note Orthogonalization and re-parameterization 5..3 and 7.. in HN Orthogonalization of variables, for example X i and X means that variables that are correlated are made

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Notes on the Multivariate Normal and Related Topics

Notes on the Multivariate Normal and Related Topics Version: July 10, 2013 Notes on the Multivariate Normal and Related Topics Let me refresh your memory about the distinctions between population and sample; parameters and statistics; population distributions

More information

BIOS 2083 Linear Models c Abdus S. Wahed

BIOS 2083 Linear Models c Abdus S. Wahed Chapter 5 206 Chapter 6 General Linear Model: Statistical Inference 6.1 Introduction So far we have discussed formulation of linear models (Chapter 1), estimability of parameters in a linear model (Chapter

More information

STAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method.

STAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method. STAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method. Rebecca Barter May 5, 2015 Linear Regression Review Linear Regression Review

More information

where x and ȳ are the sample means of x 1,, x n

where x and ȳ are the sample means of x 1,, x n y y Animal Studies of Side Effects Simple Linear Regression Basic Ideas In simple linear regression there is an approximately linear relation between two variables say y = pressure in the pancreas x =

More information

ECON 4160, Autumn term Lecture 1

ECON 4160, Autumn term Lecture 1 ECON 4160, Autumn term 2017. Lecture 1 a) Maximum Likelihood based inference. b) The bivariate normal model Ragnar Nymoen University of Oslo 24 August 2017 1 / 54 Principles of inference I Ordinary least

More information

Problem Selected Scores

Problem Selected Scores Statistics Ph.D. Qualifying Exam: Part II November 20, 2010 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. Problem 1 2 3 4 5 6 7 8 9 10 11 12 Selected

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

The Statistical Property of Ordinary Least Squares

The Statistical Property of Ordinary Least Squares The Statistical Property of Ordinary Least Squares The linear equation, on which we apply the OLS is y t = X t β + u t Then, as we have derived, the OLS estimator is ˆβ = [ X T X] 1 X T y Then, substituting

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation.

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation. Statistical Computation Math 475 Jimin Ding Department of Mathematics Washington University in St. Louis www.math.wustl.edu/ jmding/math475/index.html October 10, 2013 Ridge Part IV October 10, 2013 1

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

Multiple Regression Analysis. Part III. Multiple Regression Analysis

Multiple Regression Analysis. Part III. Multiple Regression Analysis Part III Multiple Regression Analysis As of Sep 26, 2017 1 Multiple Regression Analysis Estimation Matrix form Goodness-of-Fit R-square Adjusted R-square Expected values of the OLS estimators Irrelevant

More information

Econ 2120: Section 2

Econ 2120: Section 2 Econ 2120: Section 2 Part I - Linear Predictor Loose Ends Ashesh Rambachan Fall 2018 Outline Big Picture Matrix Version of the Linear Predictor and Least Squares Fit Linear Predictor Least Squares Omitted

More information

Econometrics Master in Business and Quantitative Methods

Econometrics Master in Business and Quantitative Methods Econometrics Master in Business and Quantitative Methods Helena Veiga Universidad Carlos III de Madrid Models with discrete dependent variables and applications of panel data methods in all fields of economics

More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Max. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes

Max. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes Maximum Likelihood Estimation Econometrics II Department of Economics Universidad Carlos III de Madrid Máster Universitario en Desarrollo y Crecimiento Económico Outline 1 3 4 General Approaches to Parameter

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Regression. ECO 312 Fall 2013 Chris Sims. January 12, 2014

Regression. ECO 312 Fall 2013 Chris Sims. January 12, 2014 ECO 312 Fall 2013 Chris Sims Regression January 12, 2014 c 2014 by Christopher A. Sims. This document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License What

More information

Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16)

Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) 1 2 Model Consider a system of two regressions y 1 = β 1 y 2 + u 1 (1) y 2 = β 2 y 1 + u 2 (2) This is a simultaneous equation model

More information