An overview of applied econometrics

Jo Thori Lind

September 4, 2011

(Preliminary and incomplete version. Comments are welcome.)

1 Introduction

This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical analyses. It pays very little attention to proving statistical properties of estimators and tests, but attempts to provide some background for interpreting them. The focus is mostly on cross-sectional techniques. Sections with more difficult material are marked with a *.

2 The bivariate model

We first consider the case where we have two variables x and y, and for the time being we assume that we know that x causes y and not vice versa (this assumption is discussed further in Section 5). We want to study the relationship between the two variables, and the simplest approach is to assume that the relationship is linear. However, there will usually also be some noise, either from pure measurement error in y or from other factors we have not included in the model. Call this noise ε, and assume it has mean zero and constant variance σ². (The zero-mean assumption is innocent: if the mean were not zero, we could simply move it into α.) Say we have N observations indexed by i; then we have

    y_i = α + β x_i + ε_i,    (1)

where α and β are two numbers we want to estimate from the data.

The interpretation of β in this model is that a one unit increase in x leads to a β unit increase in y. The remainder term ε_i is assumed to contain factors not explained by x as well as measurement error in y. A crucial assumption is that x and ε are uncorrelated (cf. Section 5).

If we have guesses α̂ and β̂, our best guess of y if we only know x would be ŷ = α̂ + β̂ x. We want this guess to be as good as possible, so we want the errors y_i − ŷ_i to be small. One approach would be to choose α̂ and β̂ to minimize the sum of the absolute deviations of the errors, i.e. to minimize

    Σ_{i=1}^{N} |y_i − ŷ_i|

with respect to α̂ and β̂. As the absolute value is non-differentiable at zero, this leads to a quite complicated problem, and for a long time it was considered infeasible for large data sets.
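Anticipating the point made below that this is now easy to do numerically, here is a minimal sketch of the absolute-deviation criterion. It is my own addition in Python (the note itself only refers to Stata), and the data, parameter values and variable names are illustrative assumptions, not taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate N observations from y_i = alpha + beta*x_i + eps_i (illustrative values)
N, alpha, beta = 200, 1.0, 0.5
x = rng.normal(size=N)
y = alpha + beta * x + rng.normal(size=N)

# Sum of absolute deviations as a function of the guesses (a_hat, b_hat)
def sad(params):
    a_hat, b_hat = params
    return np.sum(np.abs(y - a_hat - b_hat * x))

# The objective is not differentiable where a residual is zero, so use a
# derivative-free method such as Nelder-Mead
res = minimize(sad, x0=[0.0, 0.0], method="Nelder-Mead")
print("LAD estimates:", res.x)  # should be close to (1.0, 0.5)
```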

Today, this is fairly easy to do with a computer, and most statistical packages have a routine to perform it (in Stata, for instance, this is done with the command qreg). However, the most common approach is instead to solve

    min_{α̂, β̂} Σ_{i=1}^{N} (y_i − ŷ_i)² = min_{α̂, β̂} Σ_{i=1}^{N} (y_i − α̂ − β̂ x_i)²,

which yields the solutions

    α̂ = ȳ − β̂ x̄,
    β̂ = Σ_i (y_i − ȳ)(x_i − x̄) / Σ_i (x_i − x̄)²,

where ȳ and x̄ are the means of y and x. These are the ordinary least squares (OLS) estimates, and they are calculated by everything from slightly sophisticated pocket calculators to any decent statistical software package. If an article reports regression results without any other qualifications, it refers to the OLS estimates.

Once we have the estimates of α and β, it would also be nice to know something about how precise these estimates are. To do this, it is commonplace to calculate their variances. In the classical approach, x is assumed to be non-stochastic (hence with variance 0) and ε to have variance σ² and no covariance between individuals (cov(ε_i, ε_j) = 0 for i ≠ j). From this it is fairly straightforward to calculate expressions for var(α̂) and var(β̂). These are found in most econometrics books, but we do not need the formulae for our purposes.

A difficulty with this is that it is usually not meaningful to say that x is non-stochastic when we are working with economic data. That would imply that the values of x have been chosen by the econometrician, as would be the case in the experimental sciences. But as we most of the time have to go out and collect data from a world we cannot design ourselves, it is more reasonable to treat x as stochastic. This means that there are two stochastic elements in (1), and the usual formulae for the variances of the estimates break down. There are two answers to this:

1. We can condition on the given realizations of the x values, i.e. say that for worlds with the x-values we have, this is the variance of the estimates.

2. If N is large, there are some nice theorems in asymptotic theory that tell us that the variance converges towards the usual expressions. (Asymptotic theory is the theory of estimation when the number of observations approaches infinity. This is an unrealistic assumption, but it makes it possible to analyse a large number of cases that are intractable with finite samples. We then take it that the results hold approximately for finite but large samples. What counts as large differs between cases, but if N > 100 it is usually quite reasonable to start invoking asymptotic theory.)

When we have the variance, there are two things we would want to do: test hypotheses and construct confidence intervals. Both these concepts are explained in most introductory statistics books, so I will not cover them in any detail.
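To make the OLS formulas concrete, here is a minimal Python sketch (again my addition, with simulated, illustrative data) that computes α̂ and β̂ from the closed-form expressions, together with the classical standard error of β̂, and checks the numbers against a standard OLS routine.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Illustrative data generated from y_i = alpha + beta*x_i + eps_i
N, alpha, beta = 200, 1.0, 0.5
x = rng.normal(size=N)
y = alpha + beta * x + rng.normal(size=N)

# Closed-form OLS estimates
beta_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Classical variance of beta_hat: sigma^2 / sum (x_i - x_bar)^2,
# with sigma^2 estimated from the residuals using N - 2 degrees of freedom
resid = y - alpha_hat - beta_hat * x
sigma2_hat = np.sum(resid ** 2) / (N - 2)
se_beta = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))

# The same numbers from a standard OLS routine
ols = sm.OLS(y, sm.add_constant(x)).fit()
print(alpha_hat, beta_hat, se_beta)
print(ols.params, ols.bse)
```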

For small N, we assume that ε_i ~ N(0, σ²). Then one can show that if var(β̂) = σ²_β, then β̂ ~ N(β, σ²_β), so the test statistic

    T = (β̂ − β) / √(σ²_β) ~ N(0, 1).

The same also holds for α, but we are usually not that interested in α. As the variance of ε, σ², has to be estimated, we end up with

    T̂ = (β̂ − β) / √(σ̂²_β) ~ t_{N−2},

where t is the t-distribution, in this case with N − 2 degrees of freedom. Recall that the t-distribution looks much like the standard normal distribution, but with heavier tails. As the degrees of freedom increase, it converges to the normal distribution. To test some hypothesis we may have, for instance that β = 0, we insert the hypothesized value of β, calculate T̂, and reject the hypothesis if |T̂| is above the critical value corresponding to the chosen level of significance.

In a number of papers you will see authors reporting t-values. Those are the test statistics for the test of the parameter being zero. This is usually a test of major interest, as β = 0 would mean that x does not affect y. If we have a theory of how x affects y, then this is a test of one prediction of the theory, and hence one way to check the plausibility of the theory.

Confidence intervals are not reported that often. A confidence interval with confidence level 1 − α is constructed from the expression

    CI = β̂ ± t_{α/2} √(σ̂²_β).

Interpreting confidence intervals is not that straightforward. A fairly correct interpretation is that intervals constructed in this way cover the true parameter with probability 1 − α.

2.1 R² and friends

It is common to report some more information than simply the regression coefficients and the standard errors. Very often you will see the R² reported. R² is a measure of the fit of the model. It goes from 0 (x doesn't explain anything) to 1 (x explains everything), and it can be interpreted as the fraction of the total variation in y explained by x.

A related number is the adjusted R², often written R̄². This is R² with an adjustment for the number of explanatory variables (there can be more than one; see below). It does not have the same clear interpretation as a fraction explained, but it is read in the same way.
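Continuing the illustrative Python example from above (my addition, not part of the note), the t-values, p-values, confidence intervals and R² discussed here can all be read directly off a fitted OLS object.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
N = 200
x = rng.normal(size=N)
y = 1.0 + 0.5 * x + rng.normal(size=N)

res = sm.OLS(y, sm.add_constant(x)).fit()

# t-values and p-values for the hypothesis that each coefficient is zero
print(res.tvalues)
print(res.pvalues)

# 95% confidence intervals: beta_hat +/- t_{alpha/2} * se(beta_hat)
print(res.conf_int(alpha=0.05))

# Measures of fit: R^2 and adjusted R^2
print(res.rsquared, res.rsquared_adj)
```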

2.2 Specification error I: Heteroskedasticity

Above we made strong assumptions on the econometric model, particularly on the residuals ε_i. If some of these assumptions are violated, we may get flawed results. In this section we look at some things that are likely to go wrong.

We assumed above that ε has the same variance σ² for all individuals. This is not always the case: it could either be that we know that some individuals have larger variances than others (such as small countries versus large countries), or it could be that the variance is increasing in x. This is called heteroskedasticity. The estimated coefficients are still unbiased and consistent (although inefficiently estimated; it would be optimal to give higher weight to observations with low variance), but the estimated standard errors of the estimates are now wrong. This also implies that most tests of the parameters are wrong. If we know how the variance varies between individuals, we can implement an optimal weighting to solve the problem, but this is usually not possible. However, there are techniques to estimate correct standard errors when we have heteroskedasticity. These are usually referred to as robust standard errors, White-corrected standard errors, or sometimes something involving a sandwich estimator.

2.3 Specification error II: Correlated errors

It was also assumed above that the error terms for different observations are uncorrelated (or, strictly speaking, independent). This is not always the case. The most common situation where this fails is time series data, that is, data where the observations are made over time. Then it is not uncommon that ε does not change too much over a short period of time. This is referred to as autocorrelation, and a number of tests to detect it and procedures for estimation when it occurs have been developed. Details can be found in most econometrics books.

Another case can occur in cross-sectional data where the individuals in some way belong to different groups. One example could be a sample of individuals who live in different villages. Those belonging to the same village share the same village characteristics, and this may create correlation between individuals from the same village. This is called clustering of the error terms, and there are techniques to produce correct estimates of the standard errors in the presence of clustered errors.

3 Non-linear models

In a lot of cases the world is not linear as in (1), and we would then make a mistake by assuming linearity. In some cases we can do some tricks to make the model linear and then proceed with OLS; in other cases this is not possible.

3.1 Models that can be linearized

In some cases we can transform either x or y to make a seemingly non-linear model linear. The most famous case is the Cobb-Douglas model

    y_i = A x_i^β e_i,

where A and β are parameters to be estimated and e_i is the error term. If we take logs on both sides, we get

    ln y_i = ln A + β ln x_i + ln e_i = α + β ln x_i + ε_i,

where α = ln A and ε_i = ln e_i. Hence by regressing the log of y on the log of x, we can estimate the parameters of the Cobb-Douglas model.
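As a sketch combining this log-log transformation with the corrected standard errors of Sections 2.2 and 2.3, here is an illustrative Python example (my addition; the Cobb-Douglas data, the "village" grouping and all parameter values are simulated assumptions) that estimates the model on logs and reports classical, White-corrected, and cluster-robust standard errors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Illustrative Cobb-Douglas data y_i = A * x_i^beta * e_i, with individuals
# grouped into "villages" so we can also show clustered standard errors
N, A, beta = 300, 2.0, 0.7
village = rng.integers(0, 30, size=N)            # 30 groups
x = np.exp(rng.normal(size=N))
e = np.exp(rng.normal(scale=0.3, size=N) + 0.1 * rng.normal(size=30)[village])
df = pd.DataFrame({"y": A * x ** beta * e, "x": x, "village": village})

# Log-log regression: ln y = alpha + beta * ln x + eps, with alpha = ln A
model = smf.ols("np.log(y) ~ np.log(x)", data=df)

classical = model.fit()                                    # usual standard errors
robust = model.fit(cov_type="HC1")                         # White-corrected
clustered = model.fit(cov_type="cluster",
                      cov_kwds={"groups": df["village"]})  # clustered by village

print(classical.params)                   # intercept near ln(2), slope near 0.7
print(classical.bse, robust.bse, clustered.bse)
```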

Notice also that in models where we regress the log of y on the log of x, the regression parameter β has an elasticity interpretation: if x increases by 1%, y increases by β%. If we regress the log of y on x in levels, a one unit increase in x leads to a 100·β% increase in y, and the other way round if we regress y in levels on x in logs.

3.2 Models that cannot be linearized

In most cases it is not possible to transform the expression into something linear. We can still choose α̂ and β̂ so that they minimize the sum of squares, and this estimator has some nice properties. The largest problem is that there is no simple way to solve for α̂ and β̂; this has to be done by numerical techniques on a computer. Also, one can only show that the estimator has nice properties asymptotically, i.e. for a large sample size. For small sample sizes, the estimates may be far from the true values and should be handled with caution. Simply minimizing the sum of squares in this way is called non-linear least squares (NLS), and it is included in most statistical packages.

4 Multiple regression

So far we have only considered the regression of a single variable y on a single explanatory variable x. In most real life examples we have several explanatory variables. This is called multiple regression, and a typical model looks like

    y_i = β_0 + β_1 x_1i + β_2 x_2i + ... + β_M x_Mi + ε_i,

where M is the number of explanatory variables. As above, we choose the βs to minimize the sum of squares. One slight change is the interpretation of the coefficients: now β_m is the effect on y of a one unit change in x_m, keeping the other x's constant. Although the formulae are more involved, standard errors are produced as before and have the same properties. For testing hypotheses about a single parameter we can use the t-test as before.

A new possibility is that we want to test hypotheses involving several parameters. One possible null hypothesis could be β_1 = β_2 = β_3. This is a hypothesis in two parts: first β_1 = β_2 and second β_2 = β_3. We might think that β_1 = β_3 would be a third constraint, but that constraint is redundant, as it is implied by the first two. Hence this hypothesis involves two constraints. This is important, as it tells us the distribution of the test statistic. In cases with more than one constraint we use the F-test. The F-test is described by the degrees of freedom in the numerator and the denominator: the degrees of freedom in the numerator equals the number of constraints, and the degrees of freedom in the denominator equals the sample size minus the number of estimated parameters. As with other tests, it will often be reported with a p-value, which has the usual interpretation.
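A minimal Python sketch of such an F-test (my addition, on simulated data): the hypothesis β_1 = β_2 = β_3 is written as the two constraints β_1 − β_2 = 0 and β_2 − β_3 = 0, giving an F statistic with 2 numerator degrees of freedom and N minus the number of estimated parameters in the denominator.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Illustrative multiple regression with three explanatory variables
N = 500
X = rng.normal(size=(N, 3))
y = 1.0 + 0.4 * X[:, 0] + 0.4 * X[:, 1] + 0.4 * X[:, 2] + rng.normal(size=N)

res = sm.OLS(y, sm.add_constant(X)).fit()

# H0: beta_1 = beta_2 = beta_3, written as two constraints:
#   beta_1 - beta_2 = 0  and  beta_2 - beta_3 = 0.
# Coefficient order is (const, beta_1, beta_2, beta_3).
R = np.array([[0.0, 1.0, -1.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])
ftest = res.f_test(R)
print(ftest)  # F statistic with 2 and N-4 degrees of freedom, and its p-value
```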

4.1 Collinearity

One of the major reasons for using multiple regression is to remove the effect of nuisance variables. Assume we have three variables y, x, and z. Variable x is correlated with z, and both have an effect on y. We want the effect of x alone on y. Then we need to include z in the regression as well; otherwise we would get both the direct effect of x and the indirect effect through z. Take an example: we want to find the effect of corruption on growth, and we have measures of both for a sample of countries. But we fear that initial income also has an impact on growth, and poor countries are on average more corrupt than rich ones. If we don't include initial income, our variable for corruption will pick up both the true effect of corruption and the indirect effect of initial income. Including initial income is often referred to as controlling for initial income. A challenge arises if we need to control for something we cannot observe. This is discussed under instrumental variables below.

If some variables are highly correlated we have multicollinearity. This is not a problem in the strict sense, but it may be annoying. If x and z are highly correlated and we regress y on x and z, the problem is that it is difficult to say whether it is x or z that is causing changes in y. This often leads to low t-values for both variables, so they individually look insignificant, but a joint test on both variables will be significant. Also, parameter estimates may fluctuate widely when a few new data points or a new variable is added.

5 Instrumental variables and causality

The most common application of instrumental variables in applied papers is probably the attempt to ensure that we estimate a causal relationship. Assume we want the causal effect of x on y, but we fear that there may be two-way causality: x affects y, but y also affects x. A classical example is the relationship between prices and quantities. The standard solution is to find an instrument for x. A valid instrument, call it z, for x satisfies two conditions:

1. The instrument z is correlated with x.

2. The instrument z does not have a direct effect on y.

The first condition is usually simple to fulfil. The second is the difficult part. We cannot check it by checking whether z and y are correlated, because they should be correlated through the indirect effect via x. So we need an a priori justification for the validity of z. It is not possible to simply test whether z is a valid instrument! (Some believe they can do this, but they are mostly wrong. A popular approach is an overidentification test. The problem with overidentification tests is that they are easy to manipulate and only allow us to test whether one instrument is valid conditional on another being valid; both could easily be invalid without the test telling us. There are some other approaches, for instance based on reverse regressions, that perform better. But these can still only give us an idea of the validity, and the case should still mostly rest on a good argument.)

Once we have the instrument, we can use standard statistical packages to run the estimation for us. The method is somewhat changed, but for the end user things are essentially as before.
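To illustrate what the packages do behind the scenes, here is a minimal two-stage least squares sketch done by hand (my addition, with simulated data and made-up parameter values). Note that the point estimates match 2SLS, but the standard errors printed by the naive second-stage OLS are not the correct IV standard errors; dedicated IV routines compute those properly.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Illustrative data with an endogenous regressor: x is correlated with the
# error u, and z is a valid instrument (correlated with x, no direct effect on y)
N = 1000
z = rng.normal(size=N)
u = rng.normal(size=N)
x = 0.8 * z + 0.5 * u + rng.normal(size=N)   # endogenous: depends on u
y = 1.0 + 0.5 * x + u

# Naive OLS is biased because cov(x, u) != 0
print(sm.OLS(y, sm.add_constant(x)).fit().params)

# Two-stage least squares by hand:
# Stage 1: regress x on z and keep the fitted values
x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues
# Stage 2: regress y on the fitted values from stage 1
second = sm.OLS(y, sm.add_constant(x_hat)).fit()
print(second.params)   # slope should be close to the true 0.5
```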

Instrumental variables can also be used if there is an unobserved effect we would have liked to control for. Assume

    y_i = α + β x_i + γ w_i + ε_i,

where x and w are correlated but w is not observed. Then simply regressing y on x will give biased results. If we can find a valid instrument for x, however, we can solve the problem. A valid instrument here is a variable z that is correlated with x, but uncorrelated with w and with no direct effect on y.

An example of this is the effect of years of schooling on wages. We believe that ability also has an impact on wages (more able people are more efficient and hence have higher wages), but ability is also correlated with schooling (more able people get more schooling). The problem is that we cannot observe ability. If we simply regress wages on schooling, we risk picking up both the effect of schooling and part of the effect of ability. So we need an instrument that is correlated with schooling but uncorrelated with ability. A number of more or less fanciful suggestions have been made. One suggestion by Josh Angrist is to use draft numbers from the Vietnam lottery. During the Vietnam War, young US men could avoid conscription by studying instead. Conscription was based on a lottery (hence random), and the time before a person was sent off to war was drawn at random and made public to the person. So those with a low number knew they would have to go to war soon unless they started studying. In the data, there is a correlation between a low number in the lottery and more schooling. As this number should be quite uncorrelated with ability, it satisfies the requirements for a valid instrument.

6 Panel data

So far we have looked at samples where each unit is observed only once. Sometimes we have data where each unit is observed on several occasions. This is called a panel. An example is a sample of countries observed at different dates, for instance each year from 1990 to 2000. Panel data econometrics is a large field, and we will only scratch the surface. A typical panel data model for individuals i = 1, ..., N observed in time periods t = 1, ..., T is

    y_it = α_i + β x_it + ε_it.

(Here I assume a balanced panel, i.e. every individual is observed in every period. In the real world, unbalanced panels, where some individuals are not observed in some periods, are very common. Most of the time this has no big impact on how to use the data.) The new thing is that the intercept α_i is now allowed to differ across individuals. In a way, this allows individuals to be different (heterogeneous) in ways that are not observed by the econometrician.

The two standard approaches to this model are the random effects estimator and the fixed effects estimator. The random effects estimator assumes that α_i is a stochastic variable (with unknown mean and variance). The important assumption is that α_i is independent of x, so they are not allowed to be correlated. As this is often not the case, this estimator is not useful in all situations, but one can show that it is more efficient than the fixed effects estimator. The fixed effects estimator treats the α_i's as parameters that have to be estimated. This means that in addition to β, we estimate one intercept term α_i for each individual in the sample.
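A minimal Python sketch of the fixed effects estimator (my addition, on a simulated panel where α_i is correlated with x, so random effects would be inappropriate): subtracting individual means (the "within" transformation) gives the same slope as including one dummy per individual.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)

# Illustrative balanced panel: N individuals, T periods, individual effects
# alpha_i correlated with x
N, T, beta = 100, 5, 0.5
alpha = rng.normal(size=N)
i = np.repeat(np.arange(N), T)
x = alpha[i] + rng.normal(size=N * T)          # x correlated with alpha_i
y = alpha[i] + beta * x + rng.normal(size=N * T)
df = pd.DataFrame({"i": i, "x": x, "y": y})

# Fixed effects ("within") estimator: subtract individual means, then run OLS
demeaned = df[["x", "y"]] - df.groupby("i")[["x", "y"]].transform("mean")
fe_within = sm.OLS(demeaned["y"], demeaned["x"]).fit()
print(fe_within.params)            # close to the true beta = 0.5

# Equivalent: include one dummy (intercept) per individual
dummies = pd.get_dummies(df["i"]).astype(float)
fe_dummies = sm.OLS(df["y"], pd.concat([df["x"], dummies], axis=1)).fit()
print(fe_dummies.params["x"])      # same slope as the within estimator
```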

In most applied empirical research, the main reason for using panel data is that it allows us, to some extent, to control for unobserved heterogeneity. Say there is some variable z_i that we believe is correlated with x_it. If we do not include z in the analysis, we know that we are likely to get biased results. The problem arises when z is not observed. One approach is to use an instrument, but then we need to find a good instrument. A second approach is to use panel data techniques. This requires two things: that we have a panel (and that x changes over time), and that it is reasonable to assume that z does not change over time, so it can be absorbed by an individual-specific fixed effect.

One complication arises if one of the explanatory variables (the x's) is a lagged value of y, i.e. y observed in an earlier period. This occurs for instance in growth regressions where we want to control for convergence by including GDP per capita at the beginning of the period. If we use fixed effects panel data estimation on such data, the estimates are biased and only consistent if the number of time periods gets large. As we often only have a few periods, this is a problem. The usual solution is to use Arellano and Bond's GMM estimator (the really sophisticated people use Arellano and Bover instead). The details of the procedure are outside the scope of this note, but it gives nice results even if the number of periods is small, and the interpretation of the coefficients and standard errors is as in other fixed effects panel data models.
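The Arellano-Bond estimator itself is beyond a short sketch, but the core idea behind such dynamic panel estimators can be illustrated with a simpler first-difference IV estimator in the spirit of Anderson and Hsiao: first-difference the model to remove α_i, then instrument the differenced lag with an earlier level. The sketch below is my own addition (not the Arellano-Bond procedure), with simulated data and illustrative parameter values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Illustrative dynamic panel: y_it = alpha_i + rho * y_i,t-1 + eps_it
N, T, rho = 500, 6, 0.5
alpha = rng.normal(size=N)
y = np.zeros((N, T))
y[:, 0] = alpha + rng.normal(size=N)
for t in range(1, T):
    y[:, t] = alpha + rho * y[:, t - 1] + rng.normal(size=N)

df = pd.DataFrame({"i": np.repeat(np.arange(N), T),
                   "t": np.tile(np.arange(T), N),
                   "y": y.ravel()})

# First-difference the model to remove alpha_i:
#   dy_it = rho * dy_i,t-1 + d eps_it,
# then instrument dy_i,t-1 with the level y_i,t-2, which is uncorrelated
# with the differenced error.
df["dy"] = df.groupby("i")["y"].diff()
df["dy_lag"] = df.groupby("i")["dy"].shift(1)
df["y_lag2"] = df.groupby("i")["y"].shift(2)
d = df.dropna()

# Just-identified IV estimate with one instrument (Anderson-Hsiao-type)
rho_iv = np.sum(d["y_lag2"] * d["dy"]) / np.sum(d["y_lag2"] * d["dy_lag"])
print(rho_iv)   # close to the true rho = 0.5
```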