Chapter 1 Introduction. What are longitudinal and panel data? Benefits and drawbacks of longitudinal data Longitudinal data models Historical notes


1.1 What are longitudinal and panel data?
With regression data, we collect a cross-section of subjects. The interest is in comparing characteristics of the subjects, that is, investigating relationships among the variables. In contrast, with time series data, we identify one or more subjects and observe them over time. This allows us to study relationships over time, the so-called dynamic aspect of a problem.
Longitudinal/panel data represent a marriage of regression and time series data. As with regression, we collect a cross-section of subjects. With panel data, we also observe each subject over time. The descriptor "panel data" comes from surveys of individuals; a panel is a group of individuals surveyed repeatedly over time.

Example 1.1 - Divorce rates
Figure 1.1 shows the 1965 divorce rates versus AFDC (Aid to Families with Dependent Children) payments for the fifty states. The correlation is -0.37. Counter-intuitive? We might expect a positive relationship between welfare payments (AFDC) and divorce rates.
[Figure 1.1: 1965 divorce rates versus AFDC payments. Vertical axis: divorce rate; horizontal axis: AFDC payments.]

Example 1.1 - Divorce rates (continued)
- A similar figure shows a negative relationship for 1975 (the correlation is -0.425).
- Figure 1.2 shows both the 1965 and 1975 data, with a line connecting the two observations for each state.
- Each line represents a change over time (dynamic), not a cross-sectional relationship.
- Each line displays a positive relationship: as welfare payments increase, so do divorce rates.
- This is not to argue for a causal relationship between welfare payments and divorce rates. The data are still observational.
- The dynamic relationship between divorce and AFDC is different from the cross-sectional relationship.

[Figure 1.2: 1965 and 1975 divorce rates versus AFDC payments, with a line connecting the two observations for each state. Vertical axis: divorce rate; horizontal axis: AFDC payments.]

Some notation
Longitudinal/panel data are regression data with double subscripts. Let $y_{it}$ be the response for the $i$th subject during the $t$th time period. We observe the $i$th subject over $t = 1, \ldots, T_i$ time periods, for each of $i = 1, \ldots, n$ subjects.
- First subject: $(y_{11}, y_{12}, \ldots, y_{1T_1})$
- Second subject: $(y_{21}, y_{22}, \ldots, y_{2T_2})$
- ...
- The $n$th subject: $(y_{n1}, y_{n2}, \ldots, y_{nT_n})$
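The double-subscript notation maps naturally onto a nested data structure. The following is a minimal sketch, assuming a plain Python dictionary keyed by subject; the subject labels, the values, and the helper name `response` are invented for illustration and are not from the slides.

```python
# Hypothetical unbalanced panel: subject i -> responses (y_i1, ..., y_iT_i).
# The numbers are made up purely to illustrate the notation.
panel = {
    1: [2.1, 2.4, 2.9],        # T_1 = 3
    2: [1.0, 1.2],             # T_2 = 2
    3: [3.3, 3.1, 3.6, 3.8],   # T_3 = 4
}

def response(panel, i, t):
    """Return y_it, using the 1-based subject and time indices of the text."""
    return panel[i][t - 1]

n = len(panel)                                # number of subjects
T = {i: len(ys) for i, ys in panel.items()}   # T_i for each subject
N = sum(T.values())                           # total number of observations

print("n =", n, "T_i =", T, "N =", N)
print("y_32 =", response(panel, 3, 2))
```

Because each subject carries its own list, the number of time periods $T_i$ is free to differ across subjects, which is the unbalanced case the notation allows for.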

Prevalence of panel data analysis
Importance in the literature:
- Panel data are also known as cross-section time series data in the social sciences.
- The analysis is referred to as longitudinal data analysis in the biological sciences.
- ABI/INFORM: 326 articles in 2002 and 2003.
- The ISI Web of Science: 879 articles in 2002 and 2003.
Important panel databases:
- Historically: the Panel Study of Income Dynamics (PSID) and the National Longitudinal Survey of Labor Market Experience (NLS).
- Financial and accounting: Compustat, CRSP, NAIC.
- Market scanner databases.
- See Appendix F.

Appendix F. Selected Longitudinal and Panel Data Sets
- Table F.1: 20 international household panel studies
- Table F.2: 5 studies focused on youth and education
- Table F.3: 4 studies focused on the elderly and retirement
- Table F.4: 7 miscellaneous studies, including election data, manufacturing data, medical expenditure data and insurance company data

1.2 Benefits and drawbacks of longitudinal data
Longitudinal data have several advantages compared to data that are either purely cross-sectional (regression) or purely time series. Having longitudinal data allows us to:
- study dynamic relationships
- study heterogeneity
- reduce omitted variable bias
With longitudinal data, one can also argue that:
- estimators are more efficient
- the causal nature of relationships can be addressed
Main drawback: attrition.

Dynamic relationships
Static versus dynamic relationships:
- Figure 1.1 showed a cross-sectional (static) relationship. We estimate a decrease of 0.95% in divorce rates for each $100 increase in AFDC payments.
- Figure 1.2 showed a temporal (dynamic) relationship. We estimate an increase of 2.9% in divorce rates for each $100 increase in AFDC payments.
- From 1965 to 1975, AFDC payments increased by an average of $59 and divorce rates increased by 2.5%.
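A static and a dynamic relationship can have opposite signs in the same data set. The sketch below illustrates this with three hypothetical subjects observed in two years; the numbers are invented so that the pattern is easy to see and are not the divorce/AFDC data. The helper `ols_slope` is an illustrative name.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys on xs, with an intercept."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

# Invented data for three subjects in two years (not the AFDC data).
x_1965 = [50.0, 100.0, 150.0]
y_1965 = [4.0, 3.0, 2.0]
x_1975 = [100.0, 150.0, 200.0]
y_1975 = [5.0, 4.0, 3.0]

# Static view: one cross-sectional regression per year.
static_1965 = ols_slope(x_1965, y_1965)
static_1975 = ols_slope(x_1975, y_1975)

# Dynamic view: the slope of the line joining each subject's two points.
dynamic = [
    (y1 - y0) / (x1 - x0)
    for x0, y0, x1, y1 in zip(x_1965, y_1965, x_1975, y_1975)
]

print("cross-sectional slopes:", static_1965, static_1975)
print("within-subject slopes: ", dynamic)
```

In this made-up example both cross-sectional slopes are negative while every within-subject slope is positive, mirroring the qualitative contrast between Figures 1.1 and 1.2.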

Historical approach
In early panel data studies, pooled cross-sectional data were analyzed in two steps: cross-sectional parameters were estimated using regression, and time series methods were then used to model the regression parameter estimates, treating the estimates as known with certainty. Theil and Goldberger (1961) provide an early discussion of the advantages of estimating these two aspects simultaneously.

Dynamic relationships and time series analysis
When studying dynamic relationships, univariate time series methods are the most well-developed. However, these methods do not account for relationships among different subjects. Multivariate time series methods account for relationships among a limited number of different subjects. Time series methods require a fair number of observations (generally, at least 30) to make reliable inferences.

Panel data as repeated time series
With panel data, we observe several (repeated) subjects for each time period. By taking averages over subjects, our statistics are more reliable, so we require fewer time series observations to estimate dynamic patterns. For repeated subjects, the model is
$y_{it} = \mu + \varepsilon_{it}$, $t = 1, \ldots, T_i$, $i = 1, \ldots, n$.
Here, $\mu$ is the overall mean and $\varepsilon_{it}$ represents subject-specific dynamic patterns.
Unfortunately, we don't get identical repeated looks. We hope to control for differences among subjects by introducing explanatory variables, or covariates. A basic model is
$y_{it} = \alpha + x_{it} \beta + \varepsilon_{it}$,
where $x_{it}$ is the explanatory variable. Introducing explanatory variables leaves us with only subject-specific dynamic patterns, that is,
$y_{it} - (\alpha + x_{it} \beta) = \varepsilon_{it}$.

Heterogeneity
Subjects are unique. In cross-sectional analysis, we use $y_{it} = \alpha + x_{it} \beta + \varepsilon_{it}$ and ascribe the uniqueness to $\varepsilon_{it}$. In panel data, we have an opportunity to model this uniqueness. The model $y_{it} = \alpha_i + x_{it} \beta + \varepsilon_{it}$ is unidentifiable in cross-sectional regression. In panel data, we can estimate $\beta$ and $\alpha_1, \ldots, \alpha_n$. Subject-specific parameters, such as $\alpha_i$, provide an important mechanism for controlling for heterogeneity of individuals.
Vocabulary:
- When $\{\alpha_i\}$ are fixed, unknown parameters to be estimated, we call this a fixed effects model.
- When $\{\alpha_i\}$ are drawn from an unknown population, that is, are random variables, we call this a model with random effects.
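To make the claim that panel data let us estimate $\beta$ and $\alpha_1, \ldots, \alpha_n$ concrete, here is a minimal sketch of one common least-squares approach for a single explanatory variable: regress deviations from subject means to get the slope, then back out each intercept. This choice of estimator is an assumption made for illustration (it gives the same estimates as a regression with one indicator variable per subject); the slides do not specify an estimation method. The data are invented and noise-free, and `fit_fixed_effects` is a hypothetical helper name.

```python
def fit_fixed_effects(x, y):
    """Least-squares fit of y_it = alpha_i + x_it * beta (one regressor).

    x, y: dicts mapping subject -> list of values over time.
    Returns (beta_hat, {subject: alpha_hat_i}).
    """
    sxy = 0.0
    sxx = 0.0
    means = {}
    for i in x:
        xbar = sum(x[i]) / len(x[i])
        ybar = sum(y[i]) / len(y[i])
        means[i] = (xbar, ybar)
        # Accumulate cross-products of deviations from the subject's own means.
        sxy += sum((xv - xbar) * (yv - ybar) for xv, yv in zip(x[i], y[i]))
        sxx += sum((xv - xbar) ** 2 for xv in x[i])
    beta = sxy / sxx
    alpha = {i: ybar - beta * xbar for i, (xbar, ybar) in means.items()}
    return beta, alpha

# Invented, noise-free data generated with beta = 2 and alpha = (1, 5, -2).
true_alpha = {1: 1.0, 2: 5.0, 3: -2.0}
x = {1: [1.0, 2.0, 3.0], 2: [2.0, 4.0], 3: [0.0, 1.0, 2.0, 3.0]}
y = {i: [true_alpha[i] + 2.0 * xv for xv in x[i]] for i in x}

beta_hat, alpha_hat = fit_fixed_effects(x, y)
print("beta_hat =", beta_hat)
print("alpha_hat =", alpha_hat)
```

With a single cross-section (one observation per subject) the subject means equal the observations themselves, every deviation is zero, and the slope is undefined, which is the identification problem described above.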

Heterogeneity bias
Suppose that a data analyst mistakenly uses the model $y_{it} = \alpha + x_{it} \beta + \varepsilon_{it}$ when $y_{it} = \alpha_i + x_{it} \beta + \varepsilon_{it}$ is the true model. This is an example of heterogeneity bias, or a problem with aggregation of data. Similarly, one could have different (heterogeneous) slopes, $y_{it} = \alpha + x_{it} \beta_i + \varepsilon_{it}$, or different intercepts and slopes, $y_{it} = \alpha_i + x_{it} \beta_i + \varepsilon_{it}$.

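[Figure: three panels plotting y against x, each showing lines for three subjects: heterogeneous intercepts with a common slope ($y = \alpha_i + \beta x$), a common intercept with heterogeneous slopes ($y = \alpha + \beta_i x$), and heterogeneous intercepts and slopes ($y = \alpha_i + \beta_i x$), for $i = 1, 2, 3$.]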
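Heterogeneity bias can be seen in a small numerical sketch. The data below are invented and noise-free, generated from a heterogeneous-intercept model with a common slope of 1, with the intercepts deliberately chosen to fall as x rises. The function and variable names are illustrative only.

```python
def slope(xs, ys):
    """Least-squares slope of ys on xs, with a single common intercept."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    return sxy / sxx

# Invented data from y_it = alpha_i + 1.0 * x_it, no error term.
true_beta = 1.0
alpha = {1: 10.0, 2: 4.0, 3: -2.0}
x = {1: [1.0, 2.0], 2: [5.0, 6.0], 3: [9.0, 10.0]}
y = {i: [alpha[i] + true_beta * xv for xv in x[i]] for i in x}

# Misspecified model: pool all observations and fit one intercept.
pooled_x = [xv for i in x for xv in x[i]]
pooled_y = [yv for i in x for yv in y[i]]
pooled_beta = slope(pooled_x, pooled_y)

# Allowing subject-specific intercepts: use deviations from subject means.
dx = [xv - sum(x[i]) / len(x[i]) for i in x for xv in x[i]]
dy = [yv - sum(y[i]) / len(y[i]) for i in x for yv in y[i]]
within_beta = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

print("pooled slope:", pooled_beta)
print("slope with subject intercepts:", within_beta)
```

In this constructed example the pooled slope is negative even though every subject's own line has slope 1, because the common-intercept model attributes the differences among the $\alpha_i$ to x.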

Omitted variables
Panel data serve to reduce omitted variable bias. When omitted variables are time constant, we can still get reliable estimates. Consider the true model
$y_{it} = \alpha + x_{it} \beta + z_i \gamma + \varepsilon_{it}$.
Unfortunately, we cannot measure $z_i$ (or have not thought to). It is lurking, or latent. By considering the changes
$y_{it}^* = y_{it} - y_{i,t-1}$
$= (\alpha + x_{it} \beta + z_i \gamma + \varepsilon_{it}) - (\alpha + x_{i,t-1} \beta + z_i \gamma + \varepsilon_{i,t-1})$
$= (x_{it} - x_{i,t-1}) \beta + (\varepsilon_{it} - \varepsilon_{i,t-1})$
$= x_{it}^* \beta + \varepsilon_{it}^*$,
we do not need to worry about the bias that ordinarily arises from the latent variable $z_i$. Introducing the subject-specific variable $\alpha_i$ accounts for the presence of many types of latent variables.
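The differencing argument can be checked numerically. The sketch below generates invented, noise-free data from the true model with a time-constant $z_i$ that is correlated with x, then compares a regression that omits $z_i$ with a regression on first differences. All parameter values and names are assumptions chosen for illustration.

```python
# Invented parameters for y_it = alpha + x_it*beta + z_i*gamma (no error term).
alpha, beta, gamma = 1.0, 0.5, 3.0
z = {1: 0.0, 2: 2.0, 3: 4.0}   # latent, time-constant, and larger when x is larger
x = {1: [1.0, 2.0, 4.0], 2: [3.0, 5.0, 6.0], 3: [6.0, 7.0, 9.0]}
y = {i: [alpha + beta * xv + gamma * z[i] for xv in x[i]] for i in x}

def first_differences(series):
    """Return the changes s_t - s_{t-1} for t = 2, ..., T."""
    return [b - a for a, b in zip(series, series[1:])]

# Differencing removes alpha and z_i*gamma, so regress changes in y on
# changes in x without an intercept.
dx = [d for i in x for d in first_differences(x[i])]
dy = [d for i in x for d in first_differences(y[i])]
beta_fd = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

# For comparison: regress y on x in levels, omitting z.
xs = [xv for i in x for xv in x[i]]
ys = [yv for i in x for yv in y[i]]
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
beta_omit = (
    sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    / sum((a - xbar) ** 2 for a in xs)
)

print("first-difference estimate:", beta_fd)
print("levels estimate omitting z:", beta_omit)
```

The first-difference estimate recovers the slope used to generate the data, while the levels regression that omits $z_i$ overstates it, since here $z_i$ and x move together and $\gamma$ is positive.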

Efficiency of estimators
- Subject-specific variables $\alpha_i$ also account for a large portion of the variability in many data sets. This reduces the mean square error and increases the efficiency (that is, reduces the standard errors) of our parameter estimators.
- With panel data, we generally have more observations than with time series or regression.
- A longitudinal data design may yield more efficient estimators than estimators based on a comparable amount of data from alternative designs.
Suppose that the interest is in assessing the average change in a response over time, such as the divorce rate. Let $\bar{y}_1$ and $\bar{y}_2$ denote the average responses in the two periods.
A repeated cross-section yields
$\mathrm{Var}(\bar{y}_1 - \bar{y}_2) = \mathrm{Var}\,\bar{y}_1 + \mathrm{Var}\,\bar{y}_2$.
A longitudinal data design yields
$\mathrm{Var}(\bar{y}_1 - \bar{y}_2) = \mathrm{Var}\,\bar{y}_1 + \mathrm{Var}\,\bar{y}_2 - 2\,\mathrm{Cov}(\bar{y}_1, \bar{y}_2)$.
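A small calculation shows how much the covariance term can matter. The sketch below adds assumptions that are not in the slides: n subjects per period, a common response variance `sigma2` in both periods, independence across subjects, and, in the longitudinal design, a correlation `rho` between a subject's two responses. Under those assumptions each average has variance `sigma2 / n` and the covariance of the two averages is `rho * sigma2 / n`.

```python
def var_change_repeated_cross_section(sigma2, n):
    """Variance of the change in averages when the two samples are independent."""
    return sigma2 / n + sigma2 / n

def var_change_longitudinal(sigma2, n, rho):
    """Variance of the change in averages when the same n subjects are reused.

    rho is the assumed correlation between a subject's two responses.
    """
    cov = rho * sigma2 / n
    return sigma2 / n + sigma2 / n - 2 * cov

# Illustrative values only.
sigma2, n, rho = 4.0, 100, 0.75
v_rcs = var_change_repeated_cross_section(sigma2, n)
v_long = var_change_longitudinal(sigma2, n, rho)

print("repeated cross-section:", v_rcs)
print("longitudinal:          ", v_long)
print("ratio:                 ", v_long / v_rcs)
```

Under these assumptions the ratio of the two variances is `1 - rho`, so the longitudinal design is more efficient for estimating average change whenever a subject's responses are positively correlated over time.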

Causality and correlation
Three ingredients are necessary for establishing causality, taken from the sociology literature:
1. A statistically significant relationship is required.
2. The association between two variables must not be due to another, omitted, variable.
3. The causal variable must precede the other variable in time.
Longitudinal data are based on measurements taken over time and thus address the third requirement of a temporal ordering of events. Moreover, longitudinal data models provide additional strategies for accommodating omitted variables that are not available in purely cross-sectional data.

Drawbacks: Sampling design (attrition)
Selection bias:
- May occur when a rule other than simple random sampling is used to select observational units.
- Example: endogenous decisions by agents to join a labor pool or participate in a social program.
Missing data:
- Because we follow the same subjects over time, nonresponse typically increases through time.
- Example: US Panel Study of Income Dynamics (PSID). In the first year (1968), the nonresponse rate was 24%. By 1985, the nonresponse rate was about 50%.

1.3 Longitudinal data models
Types of inference:
- Primary. We are interested in the effect that an (exogenous) explanatory variable has on a response, controlling for other variables (including omitted variables).
- Forecasting. We would like to predict future values of the response from a specific subject.
- Conditional means. We would like to predict the expected value of a future response from a specific subject. Here, the conditioning is on latent (unobserved) characteristics associated with the subject.
Types of applications: many.

Social science statistical modeling
A model based on data characteristics is known as a sampling-based model. The model arises from a data generating process. In contrast, a structural model is a statistical model that represents causal relationships, as opposed to relationships that simply capture statistical associations.
Why bother with an extra layer of theory when considering statistical models? Manski (1992) offers three reasons:
- Interpretation: the primary purpose of many statistical analyses is to assess relationships generated by theory from a scientific field.
- Explanation: structural models utilize additional information from an underlying functional field. If this information is utilized correctly, then in some sense the structural model should provide a better representation than a model without this information.
- Extrapolation: particularly for public policy analysis, the goal of a statistical analysis is to infer the likely behavior of data outside of those realized.

Modeling issues
- With subject-specific parameters, there can be many parameters that describe the model.
- Fixed versus random effects models.
- Incorporating dynamic structure is important: econometric dynamic models (lagged endogenous variables) versus a serial correlation approach.
- Linear versus nonlinear (generalized linear) models.
- Marginal versus hierarchical estimation approaches.
- Parametric versus semiparametric models.
We wish to separate the effects of:
- the mean
- the cross-sectional variance
- the serial correlation structure

1.4 Historical notes
The term "panel study" was coined in a marketing context when Lazarsfeld and Fiske (1938) considered the effect of radio advertising on product sales. People who buy a product might be more likely to hear the advertisement, or vice versa. They proposed repeatedly interviewing a set of people (the "panel") to clarify the issue.
Econometrics: early economics applications include Kuh (1959), Johnson (1960), Mundlak (1961) and Hoch (1962).
Biostatistics: Wishart (1938), Rao (1959, 1965), and Potthoff and Roy (1964) used multivariate analysis to consider the problem of polynomial growth curves of serial measurements from a single group of subjects. Grizzle and Allen (1969) introduced covariates.