Econometrics I. Introduction to Regression Analysis (Outline of the course). Marijus Radavičius


Econometrics I: Introduction to Regression Analysis (Outline of the course). Marijus Radavičius. Vilnius, 2012

Contents

1 Introduction
  1.1 Motivation
  1.2 Terminology
2 Least squares method
  2.1 Least squares problem
  2.2 Linear regression
  2.3 Least squares for linear regression
  2.4 Geometrical interpretation of the least squares
3 Regression function and prediction
  3.1 Definition of regression function
  3.2 Best linear prediction and Gaussian systems
4 Gauss-Markov model
  4.1 Gauss-Markov conditions
  4.2 Properties of least squares estimator
5 Gaussian Regression
  5.1 Definition of Gaussian Regression model
  5.2 Hypothesis testing
6 When Gauss-Markov model does not hold

1 Introduction

1.1 Motivation

Econometrics is based upon the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating and implementing government and business policy. The most common application of econometrics is the forecasting of such important macroeconomic variables as interest rates, inflation rates, and gross domestic product. While forecasts of economic indicators are highly visible and often widely published, econometric methods can be used in economic areas that have nothing to do with macroeconomic forecasting. For example, we will study the effects of political campaign expenditures on voting outcomes, and we will consider the effect of school spending on student performance in the field of education.

The general purpose of the course "Econometrics I" is to teach students to apply the least squares method, to construct and select regression models, and to develop statistical perception and the ability to formalize real (economic) problems.

Let us cite William H. Greene ([2], p. 7): "The linear regression model is the single most useful tool in the econometrician's kit. Though to an increasing degree in the contemporary literature, it is often only the departure point for the full analysis, it remains the device used to begin almost all empirical research."

1.2 Terminology

If a variable y of interest is influenced by variables x_1, ..., x_k, this can be expressed mathematically by the equation

    y = f(x),  x := (x_1, ..., x_k)',   (1)

where f is some (usually unknown) function and A' denotes the transpose of a matrix A (thus x is a column vector). Actually, in economics, as opposed to the natural sciences, any economic index is influenced by or related to lots of indices and factors. It is impossible to take them all into account and to measure them. Thus, in practice, equation (1) is not valid; however, it can hold approximately, provided x contains all essential variables and the impact of the rest is comparatively small.
This leads to the following model, called the regression model:

    y = f(x) + ε,   (2)

where y is the endogenous variable (the response), f(x) is the regression function (the trend or signal), and ε collects the errors (the noise). The term f(x) is the systematic component of the model, and ε is the random component.

The variables y and x have several different names. According to an old-fashioned tradition, y is called the dependent variable, while x is called the independent variable. The variable y is also called the explained variable, the response (variable), the predicted variable, or the regressand. The variable x is also called the explanatory variable, the control variable, the predictor (variable), the regressor, or the covariate. The terms dependent variable and independent variable are still rather frequently used in econometrics. But be aware that the label "independent" here does not refer to the statistical notion of independence between random variables. The terms explained and explanatory variables are probably the most descriptive. Response and control are used mostly in the experimental sciences (biometrics, technometrics), where the variable x is under the experimenter's control.

The systematic component describes the expected value of the endogenous variable y, i.e., the "typical" or expected behavior of an (economic) system under conditions determined by the exogenous variable x, while the random component ε describes the uncertainty or random deviation of the system from its "typical" value. The degree of "uncertainty" or extent of random deviation is expressed by the corresponding variances and covariances.

When predicting y by x, the error term ε in (2) is omitted since it is "unpredictable", and we come back to equation (1). Hence, the function f determines the prediction rule of y through x. It is called the regression function (of y by x).

Thus, the general problem of econometric analysis can be stated as follows:

- On the basis of the available data, to specify (identify) the structure of the systematic component and the covariance structure of the random component.
- To choose a convenient parametrization for the specified structures and to estimate the unknown parameters.
- To make econometric inference and to give an (economic) interpretation of the results.
When dealing with structures, it is important to be familiar with the different forms of their representation: symbolic specification, analytical formulas, representation of the available data in matrix language, and graphical representation (graphical models).

2 Least squares method

2.1 Least squares problem

The least squares method is based on:

1. Available data

       D = D_N(x, y) := {(x(t), y(t)), t = 1, ..., N},   (3)

   where y(t) ∈ R and x(t) ∈ R^k (R^k is the k-dimensional Euclidean space) are the observations of the explained variable y and of the k-dimensional explanatory variable x, respectively, in case t, N being the sample size.

2. A given class F of possible regression functions (prediction rules) f : R^k → R, f a measurable function.

3. A goodness-of-fit measure defined as the (empirical) quadratic risk, i.e., the average losses with the quadratic loss function (u − v)², u, v ∈ R:

       R̂(f) = R̂_2(f | D) = Σ_{t=1}^N [y(t) − f(x(t))]².   (4)

   It is a special case of the risk with a general loss function l(u; v), u, v ∈ R:

       R̂_l(f) = R̂_l(f | D) = Σ_{t=1}^N l(y(t); f(x(t))).   (5)

Least squares problem: find f̂ ∈ F, f̂ = f̂(· | D), which minimizes the risk R̂(f | D) over the class F. In short,

    f̂ := arg min_{f ∈ F} R̂(f | D).   (6)

Example 1. Suppose F = {f : f ≡ const}. Then f̂(x) ≡ ȳ := (1/N) Σ_{t=1}^N y(t), i.e., the average value of the y's, a well-known central characteristic of the data.

Example 2. Suppose F is the same as in the previous example, but the risk is defined with absolute losses l_1(u; v) := |u − v|, u, v ∈ R. Hence

    R̂_1(f) := R̂_{l_1}(f | D) := Σ_{t=1}^N |y(t) − f(x(t))|   (7)

and the problem is to find

    f̂_1 := arg min_{f ∈ F} R̂_1(f | D).   (8)
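Examples 1 and 2 can be checked numerically. The sketch below (illustrative, with simulated data; numpy is assumed) minimizes the quadratic risk (4) and the absolute-loss risk (7) over constant prediction rules by a grid search: the minimizers land on the sample mean and the sample median, respectively, as stated in the examples.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=101)        # a sample of y's; odd N makes the median unique

grid = np.linspace(y.min(), y.max(), 2001)                    # candidate constants f = c
quad_risk = ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)   # R_hat(f), formula (4)
abs_risk = np.abs(y[:, None] - grid[None, :]).sum(axis=0)     # R_hat_1(f), formula (7)

c_quad = grid[quad_risk.argmin()]   # grid minimizer of the quadratic risk
c_abs = grid[abs_risk.argmin()]     # grid minimizer of the absolute-loss risk

print(abs(c_quad - y.mean()) < 0.01)      # True: quadratic risk is minimized at the mean
print(abs(c_abs - np.median(y)) < 0.01)   # True: absolute risk is minimized at the median
```

The agreement is only up to the grid spacing, of course; the exact statements are the ones proved in the text.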

It is straightforward to check that f̂_1(x) ≡ median(y(t), t = 1, ..., N), i.e., another well-known central characteristic of the data.

Example 3. (Multiple regression) Suppose F = F_L, where

    F_L := {f : f(x) = β_0 + β'x, β_0 ∈ R, β ∈ R^k}   (9)

is the class of all (continuous) linear mappings f : R^k → R. Example 3 describes the main object of interest of this course.

2.2 Linear regression

Let D_N(x, y) be the data given in (3). The multiple linear regression model with an intercept (MLR) for D_N(x, y) is determined by the equations

    y(t) = β_0 + β'x(t) + ε(t),  t = 1, ..., N.   (10)

Sometimes it is natural to assume that the expected value of y is zero when all x's are zero. This implies that the intercept β_0 = 0. The multiple linear regression model without an intercept (MLR0) for D_N(x, y) is determined by the equations

    y(t) = β'x(t) + ε(t),  t = 1, ..., N.   (11)

In matrix form, models (10) and (11) are written as

    Y = X̄β̄ + E   (12)

and

    Y = Xβ + E,   (13)

respectively. Here X is the design matrix in the model (MLR0),

    X := (X_1 X_2 ... X_k),   (14)
    X_i := (x_i(1), x_i(2), ..., x_i(N))',  i = 1, ..., k,   (15)

and X̄ is the design matrix in the model (MLR),

    X̄ := (1_N X) = (1_N X_1 X_2 ... X_k),   (16)
    1_N := (1, 1, ..., 1)' ∈ R^N  (N times),   (17)
    β̄ := (β_0, β')' ∈ R^{k+1}.   (18)
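A small sketch of the matrix notation above, with made-up numbers (numpy assumed): building the (MLR0) design matrix X, extending it to the (MLR) design matrix X̄ = (1_N, X), and checking that the case-by-case equations (10) and the matrix form (12) give the same fitted values.

```python
import numpy as np

# hypothetical data: N = 4 cases, k = 2 explanatory variables
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])          # design matrix of (MLR0), columns X_1, X_2
N = X.shape[0]
X_bar = np.column_stack([np.ones(N), X])   # design matrix (16) of (MLR): (1_N, X)

beta0, beta = 0.5, np.array([1.0, -1.0])   # illustrative coefficients
eps = np.zeros(N)                          # errors set to zero for the comparison

y_cases = beta0 + X @ beta + eps                      # equations (10), case by case
y_matrix = X_bar @ np.concatenate([[beta0], beta])    # matrix form (12), Y = X_bar beta_bar
print(np.allclose(y_cases, y_matrix))                 # True: the two forms agree
```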

2.3 Least squares for linear regression

Consider the (MLR0) model in matrix form (13). For f ∈ F_L with f(x) = f_β(x) := β'x, let us denote R̂(β) := R̂(f_β). Then the problem is

    R̂(β) = R̂(β | D) = (1/N) ||Y − Xβ||² → min over β ∈ R^k.   (19)

The least squares estimator (LSE) β̂ of β in model (13), assuming that X'X is non-degenerate, is given by the equation

    β̂ = (X'X)^{−1} X'Y.   (20)

Similarly, in model (12), the LSE of β̄ is equal to

    β̄̂ = (β̂_0, β̂')' = (X̄'X̄)^{−1} X̄'Y.   (21)

A more transparent representation of β̄̂ is the following:

    β̂_0 = ȳ − β̂'x̄,   (22)
    β̂ = (Ẋ'Ẋ)^{−1} Ẋ'Ẏ.   (23)

Here ȳ is the average of the y's, x̄ is the vector of averages of the explanatory variables x,

    Ẋ := (Ẋ_1 Ẋ_2 ... Ẋ_k),   (24)

and Ẏ and Ẋ_i are the vectors of centered values of the response variable y and of the i-th explanatory variable x_i, respectively:

    Ẏ := (y(1) − ȳ, y(2) − ȳ, ..., y(N) − ȳ)',   (25)
    Ẋ_i := (x_i(1) − x̄_i, x_i(2) − x̄_i, ..., x_i(N) − x̄_i)',  i = 1, ..., k.   (26)

2.4 Geometrical interpretation of the least squares

It follows from (22) that, in the (MLR) model (the model with the intercept!), the hyperplane determined by the least squares estimate f̂(x) = β̂_0 + β̂'x of the regression function f goes through the point (x̄, ȳ), i.e., f̂(x̄) = ȳ. It is clear that the best linear prediction (least squares approximation) of Y is

    Ŷ = X̄β̄̂.   (27)

Let

    L = L_X := span{X_1, ..., X_k} = {Xβ, β ∈ R^k}   (28)
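The equivalence of the direct formula (21) and the centered representation (22)-(23) can be verified numerically. The sketch below (simulated data, numpy assumed) computes the LSE with an intercept both ways and checks that the results coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 200, 3
X0 = rng.normal(size=(N, k))                  # explanatory variables (no intercept column)
beta_true = np.array([1.0, -2.0, 0.5])
y = 0.7 + X0 @ beta_true + 0.1 * rng.normal(size=N)

# direct LSE (21): beta_bar_hat = (X_bar' X_bar)^{-1} X_bar' Y
X_bar = np.column_stack([np.ones(N), X0])
beta_bar_hat = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ y)

# centered representation: slopes from (23), intercept from (22)
Xc = X0 - X0.mean(axis=0)                     # X-dot: centered regressors
yc = y - y.mean()                             # Y-dot: centered response
slopes = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
intercept = y.mean() - slopes @ X0.mean(axis=0)

print(np.allclose(beta_bar_hat, np.concatenate([[intercept], slopes])))  # True
```

This is an exact algebraic identity, not an approximation, so the check holds up to floating-point rounding only.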

denote the linear subspace generated by the N-dimensional column vectors X_1, ..., X_k. Then

    Ŷ = ΠY,   (29)

where Π = Π_X denotes the orthogonal projector onto the linear subspace L_X. In the case of the (MLR) model, formulas analogous to (28) and (29) are valid for the centered variables as well: it suffices to replace all vectors and matrices in (28) and (29) by their centered versions. Pythagoras' theorem implies

    ||Y||² = ||Ŷ||² + ||Y − Ŷ||²   (30)

and

    ||Ẏ||² = ||Ẏ̂||² + ||Ê||².   (31)

Here Ê := Y − Ŷ = Ẏ − Ẏ̂ stands for the vector of least squares residuals. Thus we derive the variance analysis identity, or the formula of variance decomposition:

    σ²_y = σ²_ŷ + σ²_ε̂.   (32)

Here:

- σ²_y is the total variance of y.
- σ²_ŷ is the variance of ŷ, the predicted values of y. It represents the part of the variation of y explained by the variation of the explanatory variables x using the fitted MLR model.
- σ²_ε̂ is the variance of the residuals ε̂. It represents the part of the variation of y which cannot be explained by the variation of the explanatory variables x using an MLR model.

The variance analysis identity is often presented via the sums of squares of the corresponding terms in (32) (i.e., in non-normalized form):

    SST = SSM + SSR.

Here:

- SST = ||Ẏ||² is the total sum of squares.
- SSM = ||Ẏ̂||² is the model sum of squares, i.e., the sum of squares of the (centered) least squares approximations (predictions).
- SSR = ||Ê||² is the residual sum of squares.
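The decomposition SST = SSM + SSR can be observed numerically. In the sketch below (simulated data, numpy assumed), the model includes an intercept, so the mean of the fitted values equals ȳ and the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
X0 = rng.normal(size=(N, 2))
y = 1.0 + X0 @ np.array([2.0, -1.0]) + rng.normal(size=N)

X_bar = np.column_stack([np.ones(N), X0])             # design matrix with intercept
beta_hat = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ y)
y_hat = X_bar @ beta_hat                              # least squares predictions
e_hat = y - y_hat                                     # residual vector E-hat

yc = y - y.mean()                                     # centered response Y-dot
yhat_c = y_hat - y_hat.mean()                         # centered predictions (mean of y_hat = mean of y)

SST = yc @ yc
SSM = yhat_c @ yhat_c
SSR = e_hat @ e_hat
print(np.isclose(SST, SSM + SSR))   # True: Pythagoras in R^N
```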

The R-square, or determination coefficient,

    R² := SSM/SST = σ²_ŷ / σ²_y = 1 − σ²_ε̂ / σ²_y = 1 − SSR/SST ∈ [0, 1]   (33)

represents the (relative) part of the variance of y explained by the variation of the explanatory variables x via the fitted MLR model. R² = 1 implies that the MLR model fits the data precisely, without any errors (all residuals are zero). R² = 0 implies that β̂ = 0_k, i.e., the explanatory variables x explain nothing and the best prediction of the y's is the constant ȳ.

If the (MLR0) model is applied, the corresponding formulas yield the noncentered R-square, or noncentered determination coefficient, R²_0. Using the geometrical interpretation of Ẏ̂ and Ê, it is easy to derive the following convenient formulas:

    R² = β̂'(Ẋ'Ẏ) / ||Ẏ||² = Ẏ'Ẏ̂ / ||Ẏ||² = ||Ẏ̂||² / ||Ẏ||²,   (34)
    R²_0 = β̂'(X'Y) / ||Y||² = ||Ŷ||² / ||Y||².   (35)

Notice that, for the centered prediction Ẏ̂ = Ẋβ̂, we have Ẏ'Ẏ̂ = Ẏ̂'Ẏ̂ = ||Ẏ̂||², the denominator Ẏ'Ẏ = ||Ẏ||², and Ẋ'Ẏ has already been calculated in formula (23) for β̂.
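As a check of formula (33), the sketch below (simulated data, numpy assumed) computes R² both as 1 − SSR/SST and as SSM/SST and confirms the two agree and lie in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(3)
N = 150
x = rng.normal(size=N)
y = 3.0 + 2.0 * x + rng.normal(size=N)

X_bar = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ y)
resid = y - X_bar @ beta_hat

yc = y - y.mean()
SST = yc @ yc
SSR = resid @ resid
R2_from_ssr = 1.0 - SSR / SST           # R^2 = 1 - SSR/SST

yhat_c = X_bar @ beta_hat - y.mean()    # centered predictions (mean of y_hat = mean of y)
R2_from_ssm = (yhat_c @ yhat_c) / SST   # R^2 = SSM/SST

print(np.isclose(R2_from_ssr, R2_from_ssm))   # True
print(0.0 <= R2_from_ssr <= 1.0)              # True
```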

3 Regression function and prediction

We would like to investigate properties of the least squares estimator. The first step in this direction could be replacing the empirical characteristics with the corresponding theoretical characteristics (the plug-in principle) and studying the (theoretical) properties they imply. This also enables us to formalize the definition of the regression function in a natural way.

3.1 Definition of regression function

It is clear that the theoretical substitute of the (empirical) quadratic risk R̂(f) (see (4)) is the corresponding theoretical quadratic risk

    R(f) = E[y − f(x)]²,   (36)

where (x, y), x ∈ R^k, y ∈ R, is a random pair.

Definition 1. (Regression function) A measurable function f*, f* : R^k → R, is called a regression function of y with respect to x ∈ R^k if it minimizes the quadratic risk R(f) in the class of all measurable functions f : R^k → R.

Proposition 1. Let (x, y) be a random pair with Ey² < ∞. Then for any measurable function f : R^k → R with Ef²(x) < ∞ the following equality holds:

    E[y − f(x)]² = E[y − f*(x)]² + E[f(x) − f*(x)]².   (37)

Here, for brevity,

    f*(x) := E[y | x]   (38)

denotes the conditional expectation of y given x.

Corollary 1. The measurable function E[y | x] is the regression function.

Definition 2. (Regression function) The regression function of y with respect to x ∈ R^k is the conditional expectation E[y | x] of y given x.

The correlation ratio η² is defined as the maximal (squared) correlation between y and any measurable function of x:

    η² := max over measurable f of COR²(y, f(x)).   (39)

From Proposition 1, the definition (39) of η², and (38), we derive that

    η² = COR²(y, f*(x)) = VAR(f*(x)) / VAR(y).   (40)

It is clear from (39) and (40) that η² ∈ [0, 1]; η² = 1 implies y ≡ f*(x) almost surely, and η² = 0 implies f*(x) ≡ const = E(y) almost surely.

3.2 Best linear prediction and Gaussian systems

When seeking the best prediction rule, one can impose some (reasonable) restrictions on these rules, thus replacing the set of all measurable rules by some proper subset. The choice F = F_L, with F_L the class of all linear mappings defined in (9), yields the best linear prediction rule

    f_L := arg min_{f ∈ F_L} R(f).   (41)

Actually, f_L is the least squares prediction rule, and hence it can be expressed through the mean and the covariance matrix of the random vector (x', y)' as follows:

    f_L(x) := E(y) + COV(y, x) (COV(x, x))^{−1} (x − E(x)).   (42)

Moreover, the variance of the best linear prediction error ε := y − f_L(x) is given by

    VAR(ε) = E(y − f_L(x))² = VAR(y) − COV(y, x) (COV(x, x))^{−1} COV(x, y).   (43)

The theoretical (or expected) determination coefficient R²_E is defined as the maximal (squared) correlation between y and any linear function of x:

    R²_E := max_{f ∈ F_L} COR²(y, f(x)).   (44)

Similarly as in the case of the (empirical) determination coefficient (R-square) R², it is established that

    R²_E = COR²(y, f_L(x)) = VAR(f_L(x)) / VAR(y).   (45)

It is obvious that η² ≥ R²_E, and in general the equality does not hold. However, it does hold if (x, y) is a Gaussian system.

Definition 3. (Gaussian system) A pair (ξ, η) of random vectors ξ and η is called a Gaussian system iff the distribution of the random vector (ξ', η')' is Gaussian.

Proposition 2. (Regression function for a Gaussian system) Suppose (x, y) is a Gaussian system. Then

    E[y | x] = E(y) + COV(y, x) (COV(x, x))^{−1} (x − E(x))   (46)

and, for VAR[y | x] := E[(y − E[y | x])² | x],

    VAR[y | x] = VAR(y) − COV(y, x) (COV(x, x))^{−1} COV(x, y).   (47)

In other words, for a Gaussian system (x, y), the regression function E[y | x] = f_L(x) and η² = R²_E.
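Formulas (42), (43), and (45) involve only the mean vector and covariance matrix of (x', y)', so they can be evaluated directly once those moments are given. A sketch with a made-up (positive definite) covariance matrix, numpy assumed:

```python
import numpy as np

# hypothetical Gaussian system (x1, x2, y) with known first and second moments
mu = np.array([1.0, 2.0, 0.5])            # means of (x1, x2, y)
Sigma = np.array([[2.0, 0.3, 0.8],
                  [0.3, 1.0, -0.4],
                  [0.8, -0.4, 1.5]])      # joint covariance of (x1, x2, y)

Sxx = Sigma[:2, :2]                        # COV(x, x)
Sxy = Sigma[:2, 2]                         # COV(x, y)
Syy = Sigma[2, 2]                          # VAR(y)

w = np.linalg.solve(Sxx, Sxy)              # COV(x, x)^{-1} COV(x, y)

def f_L(x):
    """Best linear prediction (42): E(y) + COV(y, x) COV(x, x)^{-1} (x - E(x))."""
    return mu[2] + w @ (x - mu[:2])

var_eps = Syy - Sxy @ w                    # prediction-error variance (43)
R2_E = (Syy - var_eps) / Syy               # theoretical determination coefficient (45)

print(f_L(mu[:2]))        # 0.5: at x = E(x), the prediction is E(y)
print(var_eps > 0, 0.0 <= R2_E <= 1.0)
```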

4 Gauss-Markov model

The Gauss-Markov model (GMm) is a simple but rather general model which ensures good properties of the least squares estimator of the linear regression coefficients.

4.1 Gauss-Markov conditions

Let D_N(x, y) be given data. The data D_N(x, y) are said to satisfy the Gauss-Markov model (GMm) if the following assumptions are fulfilled:

(GM1) Model specification. The (MLR0) model specification holds:

    Y = Xβ + E,  Y ∈ R^N,  β ∈ R^k.   (48)

In fact, this is not a real restriction, but rather a definition of the parameters β and the errors E.

(GM2) Mean and covariances of errors.
1. Mean: EE = 0_N.
2. Covariance: COV(E, E) = σ²I_N.
Here σ² > 0 is an additional (to β) unknown parameter of the model, and I_N is the identity matrix of order N.

(GM3) Independence of regressors and errors. The design matrix X and the error vector E are independent. In particular, this assumption is satisfied if X is non-random.

(GM4) Uniqueness. The matrix X'X is (almost surely) non-degenerate, i.e., rank(X) = k. This condition ensures that the least squares estimator β̂ of β exists and is unique.

(GM5) The set of unknown regression coefficients. The set of possible values of the unknown parameter β (denoted B) is the whole Euclidean space R^k. A rather frequent alternative to B = R^k is B = B_A, where B_A := {β ∈ R^k : Aβ = a}, A is a given m × k matrix with rank(A) = m < k, and a ∈ R^m is a given vector.

Remark. The Gauss-Markov model is a partial (or semiparametric) probabilistic model. It determines only the moments of the observations up to second order, i.e., only means, variances, and covariances. Thus, the Gauss-Markov model does not allow one to calculate any probabilities. One can only estimate probabilities by making use of Chebyshev-type inequalities.

4.2 Properties of least squares estimator

In this section we discuss the following properties of the least squares estimator β̂ in the Gauss-Markov model: unbiasedness, consistency, and efficiency.

Unbiasedness. The least squares estimator β̂ in the Gauss-Markov model is unbiased.

Consistency. The least squares estimator β̂ in the Gauss-Markov model is consistent if

    tr((X'X)^{−1}) → 0,  N → ∞.   (49)

Here tr(A) stands for the trace of a square matrix A (i.e., the sum of the diagonal elements of A).

Efficiency. Let A be a given m × k matrix and θ := Aβ.

Gauss-Markov Theorem. Suppose that the Gauss-Markov model holds. The estimator θ̂ := Aβ̂ of θ, where β̂ is the least squares estimator of β, is efficient in the class of all linear unbiased estimators of θ.

The covariance matrix of the least squares estimator β̂ in the Gauss-Markov model is

    COV(β̂, β̂) = σ²(X'X)^{−1}.   (50)

In the Gauss-Markov model with the intercept β_0, the covariance matrix of the least squares estimator β̂ of the regression coefficients β is

    COV(β̂, β̂) = (σ²/N) (cov(x, x))^{−1} = σ²(Ẋ'Ẋ)^{−1},   (51)

where cov(x, x) is the empirical k × k covariance matrix of the explanatory variables x, and Ẋ is the design matrix of centered values of the explanatory variables x.
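Unbiasedness and the covariance formula (50) can be illustrated by Monte Carlo simulation. The sketch below (numpy assumed; design, coefficients, and error variance are all made up) fixes a design matrix, repeatedly draws error vectors satisfying (GM2), and compares the empirical mean and covariance of the least squares estimates with β and with σ²(X'X)^{−1}.

```python
import numpy as np

rng = np.random.default_rng(4)
N, k = 50, 2
X = rng.normal(size=(N, k))            # fixed design, so (GM3) holds trivially
beta = np.array([1.5, -0.5])
sigma2 = 0.25

XtX_inv = np.linalg.inv(X.T @ X)
cov_theory = sigma2 * XtX_inv          # formula (50)

reps = 4000
noise = np.sqrt(sigma2) * rng.normal(size=(N, reps))   # i.i.d. errors, mean 0, variance sigma2
Ys = (X @ beta)[:, None] + noise                       # each column is one simulated sample
estimates = (XtX_inv @ X.T @ Ys).T                     # LSE for every replication, shape (reps, k)

print(np.abs(estimates.mean(axis=0) - beta).max())     # small: unbiasedness
print(np.abs(np.cov(estimates.T) - cov_theory).max())  # small: covariance matches (50)
```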

5 Gaussian Regression

The Gauss-Markov model is only a partial (or semiparametric) probabilistic model. It determines only the moments of the observed data up to second order, i.e., only means, variances, and covariances. Hence, the Gauss-Markov model does not allow one to calculate any probabilities and thereby to use likelihoods, construct confidence intervals, or test hypotheses. A natural additional condition to the Gauss-Markov model leads to the Gaussian regression model (GRm), a complete probabilistic (regression) model.

5.1 Definition of Gaussian Regression model

Let us assume that for the observation data D = D_N(x, y) := {(x(t), y(t)), t = 1, ..., N} the Gauss-Markov model is valid with the condition (GM2) replaced by (tightened to) the condition (GR2):

(GR2) Distribution of errors. The vector of random errors E is normally distributed with zero means and a spherical covariance matrix:

    E ~ Normal_N(0_N, σ²I_N),  σ² > 0.   (52)

The brief mathematical form of the Gaussian regression model (without making use of the unobservable error vector E) is the following:

    [Y | X] ~ Normal_N(Xβ, σ²I_N).

Let β̂_ML and σ̂²_ML denote the maximum likelihood estimators (MLe) of the unknown GRm parameters β and σ², respectively.

Proposition 3. (Maximum likelihood estimator) Suppose that the data D_N(x, y) satisfy the Gaussian regression model. Then β̂_ML = β̂, where β̂ is the least squares estimator of β, and σ̂²_ML is the empirical variance of the residuals Ê := Y − Xβ̂:

    σ̂²_ML := (1/N) Σ_{t=1}^N (y(t) − β̂'x(t))² = (1/N) ||Y − Xβ̂||².   (53)

The general asymptotic theory of maximum likelihood estimation implies that β̂_ML and σ̂²_ML are asymptotically (as N → ∞) efficient (in the class of all statistics).
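A quick numerical sketch of (53) and its relation to the unbiased variance estimator s² introduced in the next section: σ̂²_ML divides the residual sum of squares by N, while s² divides it by N − k, so s² = σ̂²_ML · N/(N − k) exactly. (Simulated data, numpy assumed.)

```python
import numpy as np

rng = np.random.default_rng(5)
N, k = 80, 3
X = rng.normal(size=(N, k))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.5 * rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # LSE = ML estimator of beta in the GRm
resid = y - X @ beta_hat                       # residual vector E-hat

sigma2_ml = resid @ resid / N                  # ML estimator (53), biased downward
s2 = resid @ resid / (N - k)                   # unbiased estimator of sigma^2

print(np.isclose(s2, sigma2_ml * N / (N - k)))   # True: exact relation between the two
print(sigma2_ml < s2)                            # True: the ML estimator is the smaller one
```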

5.2 Hypothesis testing

Consider the general linear hypothesis testing problem

    H_0: Aβ = a  versus  H_1: Aβ ≠ a.   (54)

Here A is a given m × k matrix with rank(A) = m, and a ∈ R^m is a given vector. Many problems important from the econometric viewpoint can be expressed as a general linear hypothesis testing problem. The standard ones are the following two:

Testing statistical insignificance of a regression model. Assume the Gauss-Markov model without an intercept. The testing problem states

    H_0: β = 0_k  versus  H_1: β ≠ 0_k.   (55)

It follows from (54) with A = I_k and a = 0_k.

Testing statistical insignificance of separate regressors. For each i ∈ {1, ..., k}, the testing problem states

    H_0^(i): β_i = 0  versus  H_1^(i): β_i ≠ 0.   (56)

For testing (54), the following criteria can be used: likelihood ratio, Fisher, and Student. It is convenient to introduce a new parameter θ := Aβ. Denote θ̂ := Aβ̂ and

    s² := (N/(N − k)) σ̂²_ML = (1/(N − k)) Σ_{t=1}^N (y(t) − β̂'x(t))² = (1/(N − k)) ||Y − Xβ̂||².   (57)

The statistic s² is an unbiased estimator of σ².

Fisher criterion. The general form of the Fisher statistic is

    T_F := (1/m) (θ − θ̂)' (ĈOV(θ̂, θ̂))^{−1} (θ − θ̂).   (58)

Here ĈOV(θ̂, θ̂) is an (unbiased) estimator of COV(θ̂, θ̂) obtained by estimating σ² with s²:

    ĈOV(θ̂, θ̂) = s² A (X'X)^{−1} A'.   (59)

Thus,

    T_F := (Aβ − Aβ̂)' (A(X'X)^{−1}A')^{−1} (Aβ − Aβ̂) / (m s²).   (60)

Under H_0, the Fisher statistic T_F has the Fisher distribution with (m, N − k) degrees of freedom.

The critical set of the Fisher criterion for a significance level α is

    K_F^α := {T_F > F_{(m, N−k)}(1 − α)}.   (61)

Here F_{(m, N−k)}(p) denotes the p-quantile of the Fisher distribution with (m, N − k) degrees of freedom.

Student criterion. If m = 1, i.e., θ ∈ R (is a scalar), the Fisher statistic

    T_F := (θ − θ̂)² / ĈOV(θ̂, θ̂)

can be replaced by the corresponding Student statistic

    T_S := (θ − θ̂) / √(ĈOV(θ̂, θ̂)),   (62)

which preserves information about the direction of deviations from the null hypothesis H_0. It is easy to derive that

    T_S := A(β − β̂) / (s λ_A),  λ²_A := A(X'X)^{−1}A',   (63)

and the critical set of the Student criterion for a significance level α is

    K_S^α := {|T_S| > t_{N−k}(1 − α/2)}.   (64)

Here t_{N−k}(p) denotes the p-quantile of the Student distribution with N − k degrees of freedom. Formula (64) is a consequence of the well-known fact that the Student statistic T_S, under H_0, has the Student distribution with N − k degrees of freedom.

Likelihood ratio (LR) criterion. This criterion is usually introduced in a standard course on mathematical statistics. Here we would only like to stress that the Fisher criterion in the Gaussian regression model is asymptotically equivalent to the LR criterion (for the same testing problem) as N → ∞.

Remark 1. Confidence intervals and regions for the parameters β and θ can be constructed by the well-known relation between criteria for hypothesis testing and the construction of confidence regions or confidence intervals.
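For m = 1, the Fisher and Student statistics are tied together by T_F = T_S². The sketch below (simulated data, numpy assumed) computes both statistics from formulas (60) and (63) for the hypothesis H_0: β_2 = 0 (i.e., A = (0, 1), a = 0) and verifies the relation.

```python
import numpy as np

rng = np.random.default_rng(6)
N, k = 60, 2
X = rng.normal(size=(N, k))
beta_data = np.array([1.0, 0.0])             # the second coefficient really is zero
y = X @ beta_data + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (N - k)                 # unbiased estimator (57) of sigma^2

# H_0: beta_2 = 0, so A = (0, 1), a = 0, m = 1
A = np.array([0.0, 1.0])
lam2 = A @ XtX_inv @ A                       # lambda_A^2 from (63)
T_S = (A @ beta_hat - 0.0) / np.sqrt(s2 * lam2)   # Student statistic (63), theta = a = 0
T_F = (A @ beta_hat - 0.0) ** 2 / (s2 * lam2)     # Fisher statistic (60) with m = 1

print(np.isclose(T_F, T_S ** 2))   # True: the two criteria agree for a scalar hypothesis
```

Comparing T_S with the quantile t_{N−k}(1 − α/2) would then complete the test; quantiles are omitted here to keep the sketch dependency-free.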

6 When the Gauss-Markov model does not hold

In this section, we discuss the problems which arise when some assumptions of the Gauss-Markov model are violated. We try to answer the following questions:

- What are the consequences of the violation of a certain assumption of the Gauss-Markov model?
- How can these violations be detected (identified)?
- What are the remedies for the undesirable consequences of these violations?

This covers the following topics:

- Regression model misspecification: violations of (GM1) include nonlinearity of the regression model, omission of relevant regressors, and inclusion of irrelevant regressors.
- Correlated and heteroscedastic errors: this item discusses violations of (GM2), the generalized and feasible (generalized) least squares methods, and transformations of the response variable.
- Stochastic regressors: violation of (GM3) is usually related to violations of exogeneity, e.g., errors in regressors.
- Multicollinearity: violation of (GM4) is usually referred to as multicollinearity, which, in a strict sense, occurs rather rarely in practice but often takes latent forms, leading to computational instability and statistical inconsistency.

References

1. Creel, M. (2006). Econometrics (version 0.80).
2. Greene, W. H. Econometric Analysis (5th ed.). Upper Saddle River, New Jersey: Prentice-Hall.
3. Hill, R. C., Griffiths, W. E., Lim, G. C. (2011). Principles of Econometrics (4th ed.). John Wiley & Sons, Inc.
4. Peracchi, F. (2001). Econometrics. John Wiley & Sons, Inc.
5. Verbeek, M. (2004). A Guide to Modern Econometrics (2nd ed.). John Wiley & Sons, Inc.
6. Wooldridge, J. M. (2006). Introductory Econometrics: A Modern Approach (3rd ed.). South-Western College Publishing, Cincinnati, Ohio.


Advanced Quantitative Methods: ordinary least squares

Advanced Quantitative Methods: ordinary least squares Advanced Quantitative Methods: Ordinary Least Squares University College Dublin 31 January 2012 1 2 3 4 5 Terminology y is the dependent variable referred to also (by Greene) as a regressand X are the

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

Linear models and their mathematical foundations: Simple linear regression

Linear models and their mathematical foundations: Simple linear regression Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 3 Jakub Mućk Econometrics of Panel Data Meeting # 3 1 / 21 Outline 1 Fixed or Random Hausman Test 2 Between Estimator 3 Coefficient of determination (R 2

More information

MLE and GMM. Li Zhao, SJTU. Spring, Li Zhao MLE and GMM 1 / 22

MLE and GMM. Li Zhao, SJTU. Spring, Li Zhao MLE and GMM 1 / 22 MLE and GMM Li Zhao, SJTU Spring, 2017 Li Zhao MLE and GMM 1 / 22 Outline 1 MLE 2 GMM 3 Binary Choice Models Li Zhao MLE and GMM 2 / 22 Maximum Likelihood Estimation - Introduction For a linear model y

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical

More information

LECTURE 2 LINEAR REGRESSION MODEL AND OLS

LECTURE 2 LINEAR REGRESSION MODEL AND OLS SEPTEMBER 29, 2014 LECTURE 2 LINEAR REGRESSION MODEL AND OLS Definitions A common question in econometrics is to study the effect of one group of variables X i, usually called the regressors, on another

More information

Econometric Analysis of Cross Section and Panel Data

Econometric Analysis of Cross Section and Panel Data Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND

More information

STAT5044: Regression and Anova. Inyoung Kim

STAT5044: Regression and Anova. Inyoung Kim STAT5044: Regression and Anova Inyoung Kim 2 / 47 Outline 1 Regression 2 Simple Linear regression 3 Basic concepts in regression 4 How to estimate unknown parameters 5 Properties of Least Squares Estimators:

More information

Multiple Regression Analysis. Part III. Multiple Regression Analysis

Multiple Regression Analysis. Part III. Multiple Regression Analysis Part III Multiple Regression Analysis As of Sep 26, 2017 1 Multiple Regression Analysis Estimation Matrix form Goodness-of-Fit R-square Adjusted R-square Expected values of the OLS estimators Irrelevant

More information

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors: Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley Panel Data Models James L. Powell Department of Economics University of California, Berkeley Overview Like Zellner s seemingly unrelated regression models, the dependent and explanatory variables for panel

More information

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives

More information

Ma 3/103: Lecture 24 Linear Regression I: Estimation

Ma 3/103: Lecture 24 Linear Regression I: Estimation Ma 3/103: Lecture 24 Linear Regression I: Estimation March 3, 2017 KC Border Linear Regression I March 3, 2017 1 / 32 Regression analysis Regression analysis Estimate and test E(Y X) = f (X). f is the

More information

ECONOMETRICS. Bruce E. Hansen. c2000, 2001, 2002, 2003, University of Wisconsin

ECONOMETRICS. Bruce E. Hansen. c2000, 2001, 2002, 2003, University of Wisconsin ECONOMETRICS Bruce E. Hansen c2000, 200, 2002, 2003, 2004 University of Wisconsin www.ssc.wisc.edu/~bhansen Revised: January 2004 Comments Welcome This manuscript may be printed and reproduced for individual

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 6 Jakub Mućk Econometrics of Panel Data Meeting # 6 1 / 36 Outline 1 The First-Difference (FD) estimator 2 Dynamic panel data models 3 The Anderson and Hsiao

More information

Advanced Econometrics I

Advanced Econometrics I Lecture Notes Autumn 2010 Dr. Getinet Haile, University of Mannheim 1. Introduction Introduction & CLRM, Autumn Term 2010 1 What is econometrics? Econometrics = economic statistics economic theory mathematics

More information

Economics 620, Lecture 9: Asymptotics III: Maximum Likelihood Estimation

Economics 620, Lecture 9: Asymptotics III: Maximum Likelihood Estimation Economics 620, Lecture 9: Asymptotics III: Maximum Likelihood Estimation Nicholas M. Kiefer Cornell University Professor N. M. Kiefer (Cornell University) Lecture 9: Asymptotics III(MLE) 1 / 20 Jensen

More information

Econ 583 Final Exam Fall 2008

Econ 583 Final Exam Fall 2008 Econ 583 Final Exam Fall 2008 Eric Zivot December 11, 2008 Exam is due at 9:00 am in my office on Friday, December 12. 1 Maximum Likelihood Estimation and Asymptotic Theory Let X 1,...,X n be iid random

More information

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria SOLUTION TO FINAL EXAM Friday, April 12, 2013. From 9:00-12:00 (3 hours) INSTRUCTIONS:

More information

Finite Sample Performance of A Minimum Distance Estimator Under Weak Instruments

Finite Sample Performance of A Minimum Distance Estimator Under Weak Instruments Finite Sample Performance of A Minimum Distance Estimator Under Weak Instruments Tak Wai Chau February 20, 2014 Abstract This paper investigates the nite sample performance of a minimum distance estimator

More information

Estadística II Chapter 4: Simple linear regression

Estadística II Chapter 4: Simple linear regression Estadística II Chapter 4: Simple linear regression Chapter 4. Simple linear regression Contents Objectives of the analysis. Model specification. Least Square Estimators (LSE): construction and properties

More information

The regression model with one stochastic regressor (part II)

The regression model with one stochastic regressor (part II) The regression model with one stochastic regressor (part II) 3150/4150 Lecture 7 Ragnar Nymoen 6 Feb 2012 We will finish Lecture topic 4: The regression model with stochastic regressor We will first look

More information

Simple and Multiple Linear Regression

Simple and Multiple Linear Regression Sta. 113 Chapter 12 and 13 of Devore March 12, 2010 Table of contents 1 Simple Linear Regression 2 Model Simple Linear Regression A simple linear regression model is given by Y = β 0 + β 1 x + ɛ where

More information

New York University Department of Economics. Applied Statistics and Econometrics G Spring 2013

New York University Department of Economics. Applied Statistics and Econometrics G Spring 2013 New York University Department of Economics Applied Statistics and Econometrics G31.1102 Spring 2013 Text: Econometric Analysis, 7 h Edition, by William Greene (Prentice Hall) Optional: A Guide to Modern

More information

ECNS 561 Multiple Regression Analysis

ECNS 561 Multiple Regression Analysis ECNS 561 Multiple Regression Analysis Model with Two Independent Variables Consider the following model Crime i = β 0 + β 1 Educ i + β 2 [what else would we like to control for?] + ε i Here, we are taking

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Multiple Linear Regression CIVL 7012/8012

Multiple Linear Regression CIVL 7012/8012 Multiple Linear Regression CIVL 7012/8012 2 Multiple Regression Analysis (MLR) Allows us to explicitly control for many factors those simultaneously affect the dependent variable This is important for

More information

Vector Auto-Regressive Models

Vector Auto-Regressive Models Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Applied Regression. Applied Regression. Chapter 2 Simple Linear Regression. Hongcheng Li. April, 6, 2013

Applied Regression. Applied Regression. Chapter 2 Simple Linear Regression. Hongcheng Li. April, 6, 2013 Applied Regression Chapter 2 Simple Linear Regression Hongcheng Li April, 6, 2013 Outline 1 Introduction of simple linear regression 2 Scatter plot 3 Simple linear regression model 4 Test of Hypothesis

More information

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM Subject Business Economics Paper No and Title Module No and Title Module Tag 8, Fundamentals of Econometrics 3, The gauss Markov theorem BSE_P8_M3 1 TABLE OF CONTENTS 1. INTRODUCTION 2. ASSUMPTIONS OF

More information

VAR Models and Applications

VAR Models and Applications VAR Models and Applications Laurent Ferrara 1 1 University of Paris West M2 EIPMC Oct. 2016 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Introduction to Estimation Methods for Time Series models Lecture 2

Introduction to Estimation Methods for Time Series models Lecture 2 Introduction to Estimation Methods for Time Series models Lecture 2 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 2 SNS Pisa 1 / 21 Estimators:

More information

The Linear Regression Model

The Linear Regression Model The Linear Regression Model Carlo Favero Favero () The Linear Regression Model 1 / 67 OLS To illustrate how estimation can be performed to derive conditional expectations, consider the following general

More information

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8 Contents 1 Linear model 1 2 GLS for multivariate regression 5 3 Covariance estimation for the GLM 8 4 Testing the GLH 11 A reference for some of this material can be found somewhere. 1 Linear model Recall

More information

Multivariate Regression Analysis

Multivariate Regression Analysis Matrices and vectors The model from the sample is: Y = Xβ +u with n individuals, l response variable, k regressors Y is a n 1 vector or a n l matrix with the notation Y T = (y 1,y 2,...,y n ) 1 x 11 x

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression In simple linear regression we are concerned about the relationship between two variables, X and Y. There are two components to such a relationship. 1. The strength of the relationship.

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

The regression model with one fixed regressor cont d

The regression model with one fixed regressor cont d The regression model with one fixed regressor cont d 3150/4150 Lecture 4 Ragnar Nymoen 27 January 2012 The model with transformed variables Regression with transformed variables I References HGL Ch 2.8

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Economic modelling and forecasting

Economic modelling and forecasting Economic modelling and forecasting 2-6 February 2015 Bank of England he generalised method of moments Ole Rummel Adviser, CCBS at the Bank of England ole.rummel@bankofengland.co.uk Outline Classical estimation

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 2 Jakub Mućk Econometrics of Panel Data Meeting # 2 1 / 26 Outline 1 Fixed effects model The Least Squares Dummy Variable Estimator The Fixed Effect (Within

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

Business Statistics. Tommaso Proietti. Model Evaluation and Selection. DEF - Università di Roma 'Tor Vergata'

Business Statistics. Tommaso Proietti. Model Evaluation and Selection. DEF - Università di Roma 'Tor Vergata' Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Model Evaluation and Selection Predictive Ability of a Model: Denition and Estimation We aim at achieving a balance between parsimony

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

Rewrap ECON November 18, () Rewrap ECON 4135 November 18, / 35

Rewrap ECON November 18, () Rewrap ECON 4135 November 18, / 35 Rewrap ECON 4135 November 18, 2011 () Rewrap ECON 4135 November 18, 2011 1 / 35 What should you now know? 1 What is econometrics? 2 Fundamental regression analysis 1 Bivariate regression 2 Multivariate

More information

Lecture 4: Heteroskedasticity

Lecture 4: Heteroskedasticity Lecture 4: Heteroskedasticity Econometric Methods Warsaw School of Economics (4) Heteroskedasticity 1 / 24 Outline 1 What is heteroskedasticity? 2 Testing for heteroskedasticity White Goldfeld-Quandt Breusch-Pagan

More information

Large Sample Properties & Simulation

Large Sample Properties & Simulation Large Sample Properties & Simulation Quantitative Microeconomics R. Mora Department of Economics Universidad Carlos III de Madrid Outline Large Sample Properties (W App. C3) 1 Large Sample Properties (W

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

1. The Multivariate Classical Linear Regression Model

1. The Multivariate Classical Linear Regression Model Business School, Brunel University MSc. EC550/5509 Modelling Financial Decisions and Markets/Introduction to Quantitative Methods Prof. Menelaos Karanasos (Room SS69, Tel. 08956584) Lecture Notes 5. The

More information

Chapter 2: simple regression model

Chapter 2: simple regression model Chapter 2: simple regression model Goal: understand how to estimate and more importantly interpret the simple regression Reading: chapter 2 of the textbook Advice: this chapter is foundation of econometrics.

More information

INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y

INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y Predictor or Independent variable x Model with error: for i = 1,..., n, y i = α + βx i + ε i ε i : independent errors (sampling, measurement,

More information

Econometrics - 30C00200

Econometrics - 30C00200 Econometrics - 30C00200 Lecture 11: Heteroskedasticity Antti Saastamoinen VATT Institute for Economic Research Fall 2015 30C00200 Lecture 11: Heteroskedasticity 12.10.2015 Aalto University School of Business

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

Empirical Economic Research, Part II

Empirical Economic Research, Part II Based on the text book by Ramanathan: Introductory Econometrics Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 7, 2011 Outline Introduction

More information

Quick Review on Linear Multiple Regression

Quick Review on Linear Multiple Regression Quick Review on Linear Multiple Regression Mei-Yuan Chen Department of Finance National Chung Hsing University March 6, 2007 Introduction for Conditional Mean Modeling Suppose random variables Y, X 1,

More information

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication G. S. Maddala Kajal Lahiri WILEY A John Wiley and Sons, Ltd., Publication TEMT Foreword Preface to the Fourth Edition xvii xix Part I Introduction and the Linear Regression Model 1 CHAPTER 1 What is Econometrics?

More information

3. Linear Regression With a Single Regressor

3. Linear Regression With a Single Regressor 3. Linear Regression With a Single Regressor Econometrics: (I) Application of statistical methods in empirical research Testing economic theory with real-world data (data analysis) 56 Econometrics: (II)

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

The Slow Convergence of OLS Estimators of α, β and Portfolio. β and Portfolio Weights under Long Memory Stochastic Volatility

The Slow Convergence of OLS Estimators of α, β and Portfolio. β and Portfolio Weights under Long Memory Stochastic Volatility The Slow Convergence of OLS Estimators of α, β and Portfolio Weights under Long Memory Stochastic Volatility New York University Stern School of Business June 21, 2018 Introduction Bivariate long memory

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Estimation - Theory Department of Economics University of Gothenburg December 4, 2014 1/28 Why IV estimation? So far, in OLS, we assumed independence.

More information

Statistics 910, #5 1. Regression Methods

Statistics 910, #5 1. Regression Methods Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 1 Jakub Mućk Econometrics of Panel Data Meeting # 1 1 / 31 Outline 1 Course outline 2 Panel data Advantages of Panel Data Limitations of Panel Data 3 Pooled

More information

Quantitative Methods I: Regression diagnostics

Quantitative Methods I: Regression diagnostics Quantitative Methods I: Regression University College Dublin 10 December 2014 1 Assumptions and errors 2 3 4 Outline Assumptions and errors 1 Assumptions and errors 2 3 4 Assumptions: specification Linear

More information

AMS-207: Bayesian Statistics

AMS-207: Bayesian Statistics Linear Regression How does a quantity y, vary as a function of another quantity, or vector of quantities x? We are interested in p(y θ, x) under a model in which n observations (x i, y i ) are exchangeable.

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information