Chapter 1. Introduction
1.1 What are longitudinal and panel data?
1.2 Benefits and drawbacks of longitudinal data
1.3 Longitudinal data models
1.4 Historical notes

1.1 What are longitudinal and panel data?
With regression data, we collect a cross-section of subjects. The interest is in comparing characteristics of the subjects, that is, investigating relationships among the variables. In contrast, with time series data, we identify one or more subjects and observe them over time. This allows us to study relationships over time, the so-called dynamic aspect of a problem. Longitudinal/panel data represent a marriage of regression and time series data. As with regression, we collect a cross-section of subjects. Unlike regression, with panel data we also observe each subject over time. The descriptor "panel data" comes from surveys of individuals; a panel is a group of individuals surveyed repeatedly over time.

Example 1.1 - Divorce rates
Figure 1.1 shows the 1965 divorce rates versus AFDC (Aid to Families with Dependent Children) payments for the fifty states. The correlation is -0.37. This may seem counterintuitive: we might expect a positive relationship between welfare payments (AFDC) and divorce rates.
[Figure 1.1: 1965 divorce rates (vertical axis) versus AFDC payments (horizontal axis).]

Example 1.1 - Divorce rates (continued)
A similar figure shows a negative relationship for 1975 (the correlation is -0.425). Figure 1.2 shows both the 1965 and 1975 data, with a line connecting the two observations for each state. Each line represents a change over time (dynamic), not a cross-sectional relationship. Each line displays a positive relationship: as welfare payments increase, so do divorce rates. This is not to argue for a causal relationship between welfare payments and divorce rates; the data are still observational. The point is that the dynamic relationship between divorce and AFDC is different from the cross-sectional relationship.

[Figure 1.2: 1965 and 1975 divorce rates (vertical axis) versus AFDC payments (horizontal axis).]

Some notation
Longitudinal/panel data are regression data with double subscripts. Let $y_{it}$ be the response for the $i$th subject during the $t$th time period. We observe the $i$th subject over $t = 1, \ldots, T_i$ time periods, for each of $i = 1, \ldots, n$ subjects.
First subject: $(y_{11}, y_{12}, \ldots, y_{1T_1})$
Second subject: $(y_{21}, y_{22}, \ldots, y_{2T_2})$
...
The $n$th subject: $(y_{n1}, y_{n2}, \ldots, y_{nT_n})$
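The double-subscript notation maps naturally onto a small data structure. The sketch below is only an illustration: the numbers are made up, and the names `panel`, `T`, `N`, `y`, and `long_rows` are assumptions introduced here, not part of the text. It stores an unbalanced panel (different $T_i$ per subject) and converts it to the stacked layout with one row per $(i, t)$ pair.

```python
# Hypothetical unbalanced panel: subject index i -> responses (y_i1, ..., y_iT_i).
# All values are invented for illustration only.
panel = {
    1: [2.1, 2.4, 2.9],        # T_1 = 3
    2: [3.0, 3.2],             # T_2 = 2
    3: [1.5, 1.7, 1.6, 2.0],   # T_3 = 4
}

n = len(panel)                                  # number of subjects
T = {i: len(y_i) for i, y_i in panel.items()}   # T_i for each subject
N = sum(T.values())                             # total number of observations


def y(i, t):
    """Return y_it using the 1-based indices of the notation above."""
    return panel[i][t - 1]


# Stacked ("long") layout: one (i, t, y_it) row per observation.
long_rows = [(i, t, y(i, t)) for i in panel for t in range(1, T[i] + 1)]
```

The stacked layout is the form in which panel data are usually passed to regression routines, with the subject and time indices kept as separate columns.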

Prevalence of panel data analysis
Importance in the literature:
- Panel data are also known as cross-section time series data in the social sciences.
- The analysis is referred to as longitudinal data analysis in the biological sciences.
- ABI/INFORM: 326 articles in 2002 and 2003.
- The ISI Web of Science: 879 articles in 2002 and 2003.
Important panel databases:
- Historically, we have the Panel Study of Income Dynamics (PSID) and the National Longitudinal Survey of Labor Market Experience (NLS).
- Financial and accounting: Compustat, CRSP, NAIC.
- Market scanner databases.
- See Appendix F.

Appendix F. Selected Longitudinal and Panel Data Sets
- Table F.1: 20 international household panel studies
- Table F.2: 5 studies focused on youth and education
- Table F.3: 4 studies focused on the elderly and retirement
- Table F.4: 7 miscellaneous studies, including election data, manufacturing data, medical expenditure data, and insurance company data

1.2 Benefits and drawbacks of longitudinal data
Longitudinal data have several advantages over data that are either purely cross-sectional (regression) or purely time series. Having longitudinal data allows us to:
- Study dynamic relationships
- Study heterogeneity
- Reduce omitted variable bias
With longitudinal data, one can also argue that:
- Estimators are more efficient
- The causal nature of relationships can be addressed
Main drawback: attrition.

Dynamic relationships
Static versus dynamic relationships:
- Figure 1.1 showed a cross-sectional (static) relationship. We estimate a decrease of 0.95% in divorce rates for each $100 increase in AFDC payments.
- Figure 1.2 showed a temporal (dynamic) relationship. We estimate an increase of 2.9% in divorce rates for each $100 increase in AFDC payments.
- From 1965 to 1975, AFDC payments increased an average of $59 and divorce rates increased 2.5%.

Historical approach In early panel data studies, pooled cross-sectional data were analyzed by estimating cross-sectional parameters using regression and using time series methods to model the regression parameter estimates, treating the estimates as known with certainty. Theil and Goldberger (1961) provide an early discussion on the advantages of estimating these two aspects simultaneously.

Dynamic relationships and time series analysis
When studying dynamic relationships, univariate time series methods are the most developed. However, these methods do not account for relationships among different subjects. Multivariate time series methods account for relationships among a limited number of different subjects. Time series methods require a fair number of observations (generally, at least 30) to make reliable inferences.

Panel data as repeated time series
With panel data, we observe several (repeated) subjects for each time period. By taking averages over subjects, our statistics are more reliable, and we require fewer time series observations to estimate dynamic patterns. For repeated subjects, the model is $y_{it} = \mu + \varepsilon_{it}$, $t = 1, \ldots, T_i$, $i = 1, \ldots, n$. Here, $\mu$ is the overall mean and $\varepsilon_{it}$ represents subject-specific dynamic patterns. Unfortunately, we don't get identical repeated looks. We hope to control for differences among subjects by introducing explanatory variables, or covariates. A basic model is $y_{it} = \alpha + x_{it}\beta + \varepsilon_{it}$, where $x_{it}$ is the explanatory variable. Introducing explanatory variables leaves us with only subject-specific dynamic patterns, that is, $y_{it} - (\alpha + x_{it}\beta) = \varepsilon_{it}$.

Heterogeneity
Subjects are unique. In cross-sectional analysis, we use $y_{it} = \alpha + x_{it}\beta + \varepsilon_{it}$ and ascribe the uniqueness to $\varepsilon_{it}$. In panel data, we have an opportunity to model this uniqueness. The model $y_{it} = \alpha_i + x_{it}\beta + \varepsilon_{it}$ is unidentifiable in cross-sectional regression. In panel data, we can estimate $\beta$ and $\alpha_1, \ldots, \alpha_n$. Subject-specific parameters, such as $\alpha_i$, provide an important mechanism for controlling heterogeneity of individuals.
Vocabulary: When the $\{\alpha_i\}$ are fixed, unknown parameters to be estimated, we call this a fixed effects model. When the $\{\alpha_i\}$ are drawn from an unknown population, that is, are random variables, we call this a model with random effects.
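As a minimal sketch of how $\beta$ and $\alpha_1, \ldots, \alpha_n$ can be estimated in the fixed effects model, the code below applies least squares after subtracting each subject's own averages (often called the within transformation). The function name `within_estimator`, the data values, and the noise-free setup ($\varepsilon_{it} = 0$) are assumptions chosen for illustration, so that the estimates can be checked against the known parameters.

```python
def within_estimator(x, y):
    """Least squares for y_it = alpha_i + x_it * beta + e_it with one covariate.

    x, y: dict mapping subject -> list of values (equal lengths per subject).
    Returns (b, a), where b estimates beta and a[i] estimates alpha_i.
    """
    num = den = 0.0
    xbar, ybar = {}, {}
    for i in x:
        xbar[i] = sum(x[i]) / len(x[i])
        ybar[i] = sum(y[i]) / len(y[i])
        # Deviations from the subject's own averages remove alpha_i.
        for x_it, y_it in zip(x[i], y[i]):
            num += (x_it - xbar[i]) * (y_it - ybar[i])
            den += (x_it - xbar[i]) ** 2
    b = num / den
    a = {i: ybar[i] - b * xbar[i] for i in x}
    return b, a


# Hypothetical, noise-free data generated with beta = 2 and these intercepts.
true_alpha = {1: 5.0, 2: -1.0, 3: 10.0}
x = {1: [1.0, 2.0, 3.0], 2: [2.0, 4.0], 3: [0.0, 1.0, 2.0, 3.0]}
y = {i: [true_alpha[i] + 2.0 * x_it for x_it in x[i]] for i in x}

b, a = within_estimator(x, y)
print(b, a)
```

With a single cross-section ($T_i = 1$ for every subject), each deviation from the subject average is zero and the slope is undefined, which is one way to see why the model is unidentifiable without repeated observations.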

Heterogeneity bias
Suppose that a data analyst mistakenly uses the model $y_{it} = \alpha + x_{it}\beta + \varepsilon_{it}$ when $y_{it} = \alpha_i + x_{it}\beta + \varepsilon_{it}$ is the true model. This is an example of heterogeneity bias, or a problem with aggregation of data. Similarly, one could have different (heterogeneous) slopes, $y_{it} = \alpha + x_{it}\beta_i + \varepsilon_{it}$, or different intercepts and slopes, $y_{it} = \alpha_i + x_{it}\beta_i + \varepsilon_{it}$.

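[Figure: three panels plotting $y$ against $x$ for three subjects. Heterogeneous intercepts: $y = \alpha_i + \beta x$. Heterogeneous slopes: $y = \alpha + \beta_i x$. Heterogeneous intercepts and slopes: $y = \alpha_i + \beta_i x$, for $i = 1, 2, 3$.]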
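Heterogeneity bias can be seen in a small constructed example. In the sketch below, every subject follows $y_{it} = \alpha_i + x_{it}\beta$ with the same slope $\beta = 1$, but subjects with larger $x$ values have smaller intercepts. The data and the helper `ols_slope` are assumptions made up for illustration; they are not the divorce data of Example 1.1, although the pattern (negative pooled slope, positive slope within each subject) is qualitatively similar.

```python
def ols_slope(x, y):
    """Slope from a simple least squares regression of y on x (with intercept)."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data: common slope 1, intercepts 10, 4, and -2.
x = {"A": [1.0, 2.0], "B": [4.0, 5.0], "C": [7.0, 8.0]}
alpha = {"A": 10.0, "B": 4.0, "C": -2.0}
y = {i: [alpha[i] + 1.0 * x_it for x_it in x[i]] for i in x}

# Pooled regression ignores the subject-specific intercepts.
pooled_x = [v for i in x for v in x[i]]
pooled_y = [v for i in y for v in y[i]]
pooled_slope = ols_slope(pooled_x, pooled_y)

# Regressions run separately for each subject recover the common slope.
subject_slopes = {i: ols_slope(x[i], y[i]) for i in x}

print(pooled_slope, subject_slopes)
```

The pooled slope is negative even though each subject's own slope equals 1, because the pooled fit attributes the differences in intercepts to $x$.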

Omitted variables
Panel data serve to reduce omitted variable bias. When omitted variables are constant over time, we can still get reliable estimates. Consider the true model $y_{it} = \alpha + x_{it}\beta + z_i\gamma + \varepsilon_{it}$. Unfortunately, we cannot (or did not think to) measure $z_i$. It is lurking, or latent. By considering the changes
$$
\begin{aligned}
y_{it}^* = y_{it} - y_{i,t-1} &= (\alpha + x_{it}\beta + z_i\gamma + \varepsilon_{it}) - (\alpha + x_{i,t-1}\beta + z_i\gamma + \varepsilon_{i,t-1}) \\
&= (x_{it} - x_{i,t-1})\beta + (\varepsilon_{it} - \varepsilon_{i,t-1}) \\
&= x_{it}^*\beta + \varepsilon_{it}^*,
\end{aligned}
$$
we do not need to worry about the bias that ordinarily arises from the latent variable $z_i$. Introducing the subject-specific variable $\alpha_i$ accounts for the presence of many types of latent variables.
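The differencing argument can be checked numerically. The sketch below uses made-up, noise-free data in which the latent $z_i$ is larger for the subject with larger $x$ values, so a pooled regression that omits $z_i$ overstates $\beta$, while a regression of the changes $y_{it}^*$ on $x_{it}^*$ (through the origin, since $\alpha$ cancels) recovers it. The parameter values, data, and helper names are assumptions for illustration only.

```python
# Hypothetical parameters of y_it = alpha + x_it*beta + z_i*gamma (no noise).
alpha, beta, gamma = 1.0, 2.0, 3.0
x = {1: [1.0, 2.0, 4.0], 2: [6.0, 7.0, 9.0]}
z = {1: 0.0, 2: 5.0}          # latent, time-constant, unobserved by the analyst
y = {i: [alpha + x_it * beta + z[i] * gamma for x_it in x[i]] for i in x}


def ols_slope(xs, ys):
    """Slope from a simple least squares regression of ys on xs (with intercept)."""
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    return sxy / sxx


# Pooled regression that omits z_i: the slope absorbs part of z_i's effect.
pooled_slope = ols_slope([v for i in x for v in x[i]],
                         [v for i in y for v in y[i]])

# First differences within each subject: x*_it and y*_it.
dx = [x[i][t] - x[i][t - 1] for i in x for t in range(1, len(x[i]))]
dy = [y[i][t] - y[i][t - 1] for i in y for t in range(1, len(y[i]))]

# Regression through the origin on the differenced data.
diff_slope = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

print(pooled_slope, diff_slope)
```

Differencing uses up one observation per subject, so it requires at least two time periods for each subject.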

Efficiency of estimators
Subject-specific variables $\alpha_i$ also account for a large portion of the variability in many data sets. This reduces the mean square error and increases the efficiency (that is, reduces the standard errors) of our parameter estimators. With panel data, we generally have more observations than with time series or regression data. A longitudinal data design may yield more efficient estimators than estimators based on a comparable amount of data from alternative designs. Suppose that the interest is in assessing the average change in a response over time, such as the divorce rate. A repeated cross-section yields
$$\mathrm{Var}(\bar{y}_2 - \bar{y}_1) = \mathrm{Var}\,\bar{y}_2 + \mathrm{Var}\,\bar{y}_1 ,$$
whereas a longitudinal data design yields
$$\mathrm{Var}(\bar{y}_2 - \bar{y}_1) = \mathrm{Var}\,\bar{y}_2 + \mathrm{Var}\,\bar{y}_1 - 2\,\mathrm{Cov}(\bar{y}_1, \bar{y}_2) ,$$
so a positive covariance between the two period averages reduces the variance of the estimated change.
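A small numerical illustration of the two variance formulas follows. The standard deviations and the correlation of 0.75 between the period averages are assumed values chosen only to make the arithmetic easy to follow, and `var_of_change` is a hypothetical helper name.

```python
def var_of_change(var1, var2, cov=0.0):
    """Variance of (ybar_2 - ybar_1) given the two variances and their covariance."""
    return var1 + var2 - 2.0 * cov


# Assumed values: unit variance for each period average, correlation 0.75.
sd1, sd2, rho = 1.0, 1.0, 0.75
cov = rho * sd1 * sd2

# Repeated cross-sections: different subjects each period, taken here to be
# independent samples, so the covariance term is zero.
repeated_cross_section = var_of_change(sd1 ** 2, sd2 ** 2)

# Longitudinal design: the same subjects each period, so the averages are correlated.
longitudinal = var_of_change(sd1 ** 2, sd2 ** 2, cov)

print(repeated_cross_section, longitudinal)
```

Under these assumed values, the longitudinal design gives a smaller variance for the estimated change; the gain disappears as the correlation between periods approaches zero.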

Causality and correlation
Three ingredients are necessary for establishing causality, taken from the sociology literature: (1) a statistically significant relationship is required; (2) the association between two variables must not be due to another, omitted, variable; and (3) the causal variable must precede the other variable in time. Longitudinal data are based on measurements taken over time and thus address the third requirement of a temporal ordering of events. Moreover, longitudinal data models provide additional strategies for accommodating omitted variables that are not available in purely cross-sectional data.

Drawbacks: Sampling design (attrition)
Selection bias may occur when a rule other than simple random sampling is used to select observational units. Example: endogenous decisions by agents to join a labor pool or participate in a social program.
Missing data: because we follow the same subjects over time, nonresponse typically increases through time. Example: in the US Panel Study of Income Dynamics (PSID), the nonresponse rate was 24% in the first year (1968). By 1985, the nonresponse rate was about 50%.

1.3 Longitudinal data models
Types of inference:
- Primary. We are interested in the effect that an (exogenous) explanatory variable has on a response, controlling for other variables (including omitted variables).
- Forecasting. We would like to predict future values of the response from a specific subject.
- Conditional means. We would like to predict the expected value of a future response from a specific subject. Here, the conditioning is on latent (unobserved) characteristics associated with the subject.
Types of applications: many.

Social science statistical modeling
A model based on data characteristics is known as a sampling-based model; the model arises from a data generating process. In contrast, a structural model is a statistical model that represents causal relationships, as opposed to relationships that simply capture statistical associations. Why bother with an extra layer of theory when considering statistical models? Manski (1992) offers three reasons:
- Interpretation: the primary purpose of many statistical analyses is to assess relationships generated by theory from a scientific field.
- Explanation: structural models utilize additional information from an underlying functional field. If this information is utilized correctly, then in some sense the structural model should provide a better representation than a model without this information.
- Extrapolation: particularly for public policy analysis, the goal of a statistical analysis is to infer the likely behavior of data outside of those realized.

Modeling issues
- With subject-specific parameters, there can be many parameters that describe the model.
- Fixed versus random effects models.
- Incorporating dynamic structure is important: econometric dynamic models (lagged endogenous) versus the serial correlation approach.
- Linear versus nonlinear (generalized linear) models.
- Marginal versus hierarchical estimation approaches.
- Parametric versus semiparametric models.
We wish to separate the effects of the mean, the cross-sectional variance, and the serial correlation structure.

1.4 Historical notes
The term "panel study" was coined in a marketing context when Lazarsfeld and Fiske (1938) considered the effect of radio advertising on product sales. People who buy a product would be more likely to hear the advertisement, or vice versa. They proposed repeatedly interviewing a set of people (the "panel") to clarify the issue.
Econometrics: early economics applications include Kuh (1959), Johnson (1960), Mundlak (1961), and Hoch (1962).
Biostatistics: Wishart (1938), Rao (1959, 1965), and Potthoff and Roy (1964) used multivariate analysis to consider the problem of polynomial growth curves of serial measurements from a single group of subjects. Grizzle and Allen (1969) introduced covariates.