Longitudinal and Panel Data: Analysis and Applications for the Social Sciences. Table of Contents

August, 2003

Table of Contents

Preface

1. Introduction
  1.1 What are longitudinal and panel data?
  1.2 Benefits and drawbacks of longitudinal data
  1.3 Longitudinal data models
  1.4 Historical notes

PART I - LINEAR MODELS

2. Fixed Effects Models
  2.1 Basic fixed effects model
  2.2 Exploring longitudinal data
  2.3 Estimation and inference
  2.4 Model specification and diagnostics
    2.4.1 Pooling test
    2.4.2 Added variable plots
    2.4.3 Influence diagnostics
    2.4.4 Cross-sectional correlation
    2.4.5 Heteroscedasticity
  2.5 Model extensions
    2.5.1 Serial correlation
    2.5.2 Subject-specific slopes
    2.5.3 Robust estimation of standard errors
  Further reading
  Appendix 2A - Least squares estimation
    2A.1 Basic Fixed Effects Model Ordinary Least Squares Estimation
    2A.2 Fixed Effects Model Generalized Least Squares Estimation
    2A.3 Diagnostic Statistics
    2A.4 Cross-sectional Correlation
  2. Exercises and Extensions

3. Models with Random Effects
  3.1 Error components / random intercepts model
  3.2 Example: Income tax payments
  3.3 Mixed effects models
    3.3.1 Linear mixed effects model
    3.3.2 Mixed linear model
  3.4 Inference for regression coefficients
  3.5 Variance component estimation
    3.5.1 Maximum likelihood estimation
    3.5.2 Restricted maximum likelihood
    3.5.3 MIVQUE estimators
  Further reading
  Appendix 3A REML calculations
    3A.1 Independence Of Residuals And Least Squares Estimators
    3A.2 Restricted Likelihoods
    3A.3 Likelihood Ratio Tests And REML
  3. Exercises and Extensions

4. Prediction and Bayesian Inference
  4.1 Prediction for one-way ANOVA models
  4.2 Best linear unbiased predictors (BLUP)
  4.3 Mixed model predictors
    4.3.1 Linear mixed effects model
    4.3.2 Linear combinations of global parameters and subject-specific effects
    4.3.3 BLUP residuals
    4.3.4 Predicting future observations
  4.4 Example: Forecasting Wisconsin lottery sales
    4.4.1 Sources and characteristics of data
    4.4.2 In-sample model specification
    4.4.3 Out-of-sample model specification
    4.4.4 Forecasts
  4.5 Bayesian inference
  4.6 Credibility theory
    4.6.1 Credibility theory models
    4.6.2 Credibility theory ratemaking
  Further reading
  Appendix 4A Linear unbiased prediction
    4A.1 Minimum Mean Square Predictor
    4A.2 Best Linear Unbiased Predictor
    4A.3 BLUP Variance
  4. Exercises and Extensions

5. Multilevel Models
  5.1 Cross-sectional multilevel models
    5.1.1 Two-level models
    5.1.2 Multiple level models
    5.1.3 Multiple level modeling in other fields
  5.2 Longitudinal multilevel models
    5.2.1 Two-level models
    5.2.2 Multiple level models
  5.3 Prediction
  5.4 Testing variance components
  Further reading
  Appendix 5A High order multilevel models
  5. Exercises and Extensions

6. Random Regressors
  6.1 Stochastic regressors in non-longitudinal settings
    6.1.1 Endogenous stochastic regressors
    6.1.2 Weak and strong exogeneity
    6.1.3 Causal effects
    6.1.4 Instrumental variable estimation
  6.2 Stochastic regressors in longitudinal settings
    6.2.1 Longitudinal data models without heterogeneity terms
    6.2.2 Longitudinal data models with heterogeneity terms and strictly exogenous regressors
  6.3 Longitudinal data models with heterogeneity terms and sequentially exogenous regressors
  6.4 Multivariate responses
    6.4.1 Multivariate regressions
    6.4.2 Seemingly unrelated regressions
    6.4.3 Simultaneous equations models
    6.4.4 Systems of equations with error components
  6.5 Simultaneous equation models with latent variables
    6.5.1 Cross-sectional models
    6.5.2 Longitudinal data applications
  Further reading
  Appendix 6A Linear projections

7. Modeling Issues
  7.1 Heterogeneity
  7.2 Comparing fixed and random effects estimators
    7.2.1 A special case
    7.2.2 General case
  7.3 Omitted variables
    7.3.1 Models of omitted variables
    7.3.2 Augmented regression estimation
  7.4 Sampling, selectivity bias, attrition
    7.4.1 Incomplete and rotating panels
    7.4.2 Unplanned nonresponse
    7.4.3 Non-ignorable missing data
  7. Exercises and Extensions

8. Dynamic Models
  8.1 Introduction
  8.2 Serial correlation models
    8.2.1 Covariance structures
    8.2.2 Nonstationary structures
    8.2.3 Continuous time correlation models
  8.3 Cross-sectional correlations and time-series cross-section models
  8.4 Time-varying coefficients
    8.4.1 The model
    8.4.2 Estimation
    8.4.4 Forecasting
  8.5 Kalman filter approach
    8.5.1 Transition equations
    8.5.2 Observation set
    8.5.3 Measurement equations
    8.5.4 Initial conditions
    8.5.5 Kalman filter algorithm
  8.6 Example: Capital asset pricing model
  Appendix 8A Inference for the time-varying coefficient model
    8A.1 The Model
    8A.2 Estimation
    8A.3 Prediction

PART II - NONLINEAR MODELS

9. Binary Dependent Variables
  9.1 Homogeneous models
    9.1.1 Logistic and probit regression models
    9.1.2 Inference for logistic and probit regression models
    9.1.3 Example: Income tax payments and tax preparers
  9.2 Random effects models
  9.3 Fixed effects models
  9.4 Marginal models and GEE
  Further reading
  Appendix 9A Likelihood calculations
    9A.1 Consistency Of Likelihood Estimators
    9A.2 Computing Conditional Maximum Likelihood Estimators
  9. Exercises and Extensions

10. Generalized Linear Models
  10.1 Homogeneous models
    10.1.1 Linear exponential families of distributions
    10.1.2 Link functions
    10.1.3 Estimation
  10.2 Example: Tort filings
  10.3 Marginal models and GEE
  10.4 Random effects models
  10.5 Fixed effects models
    10.5.1 Maximum likelihood estimation for canonical links
    10.5.2 Conditional maximum likelihood estimation for canonical links
    10.5.3 Poisson distribution
  10.6 Bayesian inference
  Further reading
  Appendix 10A Exponential families of distributions
    10A.1 Moment Generating Functions
    10A.2 Sufficiency
    10A.3 Conjugate Distributions
    10A.4 Marginal Distributions
  10. Exercises and Extensions

11. Categorical Dependent Variables and Survival Models
  11.1 Homogeneous models
    11.1.1 Statistical inference
    11.1.2 Generalized logit
    11.1.3 Multinomial (conditional) logit
    11.1.4 Random utility interpretation
    11.1.5 Nested logit
    11.1.6 Generalized extreme value distribution
  11.2 Multinomial logit models with random effects
  11.3 Transition (Markov) models
  11.4 Survival models
  Appendix 11A. Conditional likelihood estimation for multinomial logit models with random effects

APPENDICES

Appendix A. Elements of Matrix Algebra
  A.1 Basic Definitions
  A.2 Basic Operations
  A.3 Further Definitions
  A.4 Matrix Decompositions
  A.5 Partitioned Matrices
  A.6 Kronecker (Direct) Products
Appendix B. Normal Distribution
Appendix C. Likelihood-Based Inference
  C.1 Characteristics of Likelihood Functions
  C.2 Maximum Likelihood Estimators
  C.3 Iterated Reweighted Least Squares
  C.4 Profile Likelihood
  C.5 Quasi-Likelihood
  C.6 Estimating Equations
  C.7 Hypothesis Tests
  C.8 Information Criteria
  C.9 Goodness of Fit Statistics
Appendix D. Kalman Filter
  D.1 Basic State Space Model
  D.2 Kalman Filter Algorithm
  D.3 Likelihood Equations
  D.4 Extended State Space Model and Mixed Linear Models
  D.5 Likelihood Equations for Mixed Linear Models
Appendix E. Symbols and Notation
Appendix F. Selected Longitudinal and Panel Data Sets
Appendix G. References
Index

Preface

Intended Audience and Level

This text focuses on models and data that arise from repeated measurements taken from a cross-section of subjects. These models and data have found substantive applications in many disciplines within the biological and social sciences. The breadth and scope of applications appear to be increasing over time. However, this widespread interest has spawned a hodgepodge of terms; many different terms are used to describe the same concept. To illustrate, even the subject title takes on different meanings in different literatures; sometimes this topic is referred to as longitudinal data and sometimes as panel data. To welcome readers from a variety of disciplines, I use the cumbersome yet more inclusive descriptor longitudinal and panel data.

This text is primarily oriented to applications in the social sciences. Thus, the data sets considered here are from different areas of social science including business, economics, education and sociology. The methods introduced in the text are oriented towards handling observational data, in contrast to data arising from experimental situations, which are the norm in the biological sciences.

Even with this social science orientation, one of my goals in writing this text is to introduce methodology that has been developed in the statistical and biological sciences, as well as the social sciences. That is, important methodological contributions have been made in each of these areas; my goal is to synthesize the results that are important for analyzing social science data, regardless of their origins. Because many terms and notations that appear in this book are also found in the biological sciences (where panel data analysis is known as longitudinal data analysis), this book may also appeal to researchers interested in the biological sciences.
Although longitudinal and panel data analysis has a forty-year history and is widely used, a survey of the literature shows that the quality of applications is uneven. Perhaps this is because the methodology has developed in separate fields of inquiry; what is widely known and accepted in one field is given little prominence in a related field. To provide a treatment that is accessible to researchers from a variety of disciplines, this text introduces the subject using relatively sophisticated quantitative tools, including regression and linear model theory. Knowledge of calculus, as well as matrix algebra, is also assumed. For Chapter 8 on dynamic models, a time series course would also be useful.

With this level of prerequisite mathematics and statistics, I hope that the text is accessible to quantitatively oriented graduate social science students, who are my primary audience. To help students work through the material, the text features several analytical and empirical exercises. Moreover, detailed appendices on different mathematical and statistical supporting topics should help students develop their knowledge of the topic as they work the exercises. I also hope that the textbook style, such as the boxed procedures and an organized set of symbols and notation, will appeal to applied researchers who would like a reference text on longitudinal and panel data modeling.

Organization

The beginning chapter sets the stage for the book. Chapter 1 introduces longitudinal and panel data as repeated observations from a subject and cites examples from many disciplines in which longitudinal data analysis is used. This chapter outlines important benefits of longitudinal data analysis, including the ability to handle the heterogeneity and dynamic features of the data. The chapter also acknowledges some important drawbacks of this scientific methodology, particularly the problem of attrition.
Furthermore, Chapter 1 provides an overview of the several types of models used to handle longitudinal data; these models are considered in greater detail in subsequent chapters. This chapter should be read at the beginning and end of one's introduction to longitudinal data analysis.

When discussing heterogeneity in the context of longitudinal data analysis, we mean that observations from different subjects tend to be dissimilar, whereas observations from the same subject tend to be similar. One way of modeling heterogeneity is to use fixed parameters that vary by individual; this formulation is known as a fixed effects model and is described in Chapter 2. A useful pedagogic feature of fixed effects models is that they can be introduced using standard linear model theory. Linear model and regression theory is widely known among research analysts; building on this solid foundation, fixed effects models provide a desirable starting point for introducing longitudinal data models. This text is written assuming that readers are familiar with linear model and regression theory at the level of, for example, Draper and Smith (1995) or Greene (1993). Chapter 2 provides an overview of linear models with a heavy emphasis on analysis of covariance techniques that are useful for longitudinal and panel data analysis. Moreover, the Chapter 2 fixed effects models provide a solid framework for introducing many graphical and diagnostic techniques.

Another way of modeling heterogeneity is to use parameters that vary by individual yet are represented as random quantities; these quantities are known as random effects and are described in Chapter 3. Because models with random effects generally include fixed effects to account for the mean, models that incorporate both fixed and random quantities are known as linear mixed effects models. Just as a fixed effects model can be thought of in the linear model context, a linear mixed effects model can be expressed as a special case of the mixed linear model.
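
As a minimal sketch of how subject-specific parameters capture heterogeneity, the following Python example simulates a small panel and compares two slope estimates. Everything in it is an assumption made for illustration and is not taken from the text: the simulated data, the true slope of 2.0, and the function names `slope`, `demean_by_subject` and `simulate_panel`. It uses the standard result that, in a linear model with one intercept per subject, the least squares slope can be computed by subtracting each subject's mean from its observations before regressing.

```python
import random


def slope(x, y):
    """Least squares slope from regressing y on x with a common intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def demean_by_subject(values, subject):
    """Subtract each subject's own mean from that subject's observations."""
    totals, counts = {}, {}
    for v, s in zip(values, subject):
        totals[s] = totals.get(s, 0.0) + v
        counts[s] = counts.get(s, 0) + 1
    return [v - totals[s] / counts[s] for v, s in zip(values, subject)]


def simulate_panel(n_subjects=200, n_periods=5, beta=2.0, seed=0):
    """Hypothetical panel: many subjects, few time periods per subject."""
    rng = random.Random(seed)
    subject, x, y = [], [], []
    for i in range(n_subjects):
        alpha_i = rng.gauss(0.0, 3.0)  # subject-specific intercept
        for _ in range(n_periods):
            x_it = alpha_i + rng.gauss(0.0, 1.0)  # regressor related to alpha_i
            y_it = alpha_i + beta * x_it + rng.gauss(0.0, 1.0)
            subject.append(i)
            x.append(x_it)
            y.append(y_it)
    return subject, x, y


subject, x, y = simulate_panel()

# Pooled regression ignores heterogeneity: one intercept for all subjects.
beta_pooled = slope(x, y)

# Fixed effects regression: one intercept per subject, via subject demeaning.
beta_within = slope(demean_by_subject(x, subject), demean_by_subject(y, subject))

print("pooled slope:       ", round(beta_pooled, 3))
print("fixed effects slope:", round(beta_within, 3))
```

In this simulated setting the pooled slope absorbs part of the subject-to-subject variation, while the fixed effects slope stays close to the value used to generate the data. The gap arises only because the simulation makes the regressor depend on the subject-specific intercept.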
Because mixed linear model theory is not as widely known as regression, Chapter 3 provides more details on the estimation and other inferential aspects than the corresponding development in Chapter 2. Still, the good news for applied researchers is that, by writing linear mixed effects models as mixed linear models, widely available statistical software can be used to analyze linear mixed effects models.

By appealing to linear model and mixed linear model theory in Chapters 2 and 3, we will be able to handle many applications of longitudinal and panel data models. Still, the special structure of longitudinal data raises additional inference questions and issues that are not commonly addressed in the standard introductions to linear model and mixed linear model theory. One such set of questions deals with the problem of estimating random quantities, known as prediction. Chapter 4 introduces the prediction problem in the longitudinal data context and shows how to estimate residuals, conditional means and future values of a process. Chapter 4 also shows how to use Bayesian inference as an alternative method for prediction.

To provide additional motivation and intuition for Chapters 3 and 4, Chapter 5 introduces multilevel modeling. Multilevel models are widely used in educational sciences and developmental psychology where one assumes that complex systems can be modeled hierarchically; that is, modeling one level at a time, each level conditional on lower levels. Many multilevel models can be written as linear mixed effects models; thus, the inference properties of estimation and prediction that we develop in Chapters 3 and 4 can be applied directly to the Chapter 5 multilevel models.

Chapter 6 returns to the basic linear mixed effects model but now adopts an econometric perspective. In particular, this chapter considers situations where the explanatory variables are stochastic and may be influenced by the response variable.
In such circumstances, the explanatory variables are known as endogenous. Difficulties associated with endogenous explanatory variables, and methods for addressing these difficulties, are well known for cross-sectional data. Because not all readers will be familiar with the relevant econometric literature, Chapter 6 reviews these difficulties and methods. Moreover, Chapter 6 describes the more recent literature on similar situations for longitudinal data.

Chapter 7 analyzes several issues that are specific to a longitudinal or panel data study. One issue is the choice of the representation to model heterogeneity. The many choices include fixed effects, random effects and serial correlation models. Section 7.1 reviews important identification issues when trying to decide upon the appropriate model for heterogeneity. Another issue is the comparison of fixed and random effects models, which has received substantial attention in the econometrics literature. As described in Chapter 7, this comparison involves interesting discussions of the omitted variables problem. Briefly, we will see that time-invariant omitted variables can be captured through the parameters used to represent heterogeneity, thus handling two problems at the same time. Chapter 7 concludes with a discussion of sampling and selectivity bias. Panel data surveys, with repeated observations on a subject, are particularly susceptible to a type of selectivity problem known as attrition, where individuals leave a panel survey.

Longitudinal and panel data applications are typically long in the cross-section and short in the time dimension. Hence, the development of these methods stems primarily from regression-type methodologies such as linear model and mixed linear model theory. Chapters 2 and 3 introduce some dynamic aspects, such as serial correlation, where the primary motivation is to provide improved parameter estimators. For many important applications, the dynamic aspect is the primary focus, not an ancillary consideration. Further, for some data sets, the temporal dimension is long, thus providing opportunities to model the dynamic aspect in detail. For these situations, longitudinal data methods are closer in spirit to multivariate time series analysis than to cross-sectional regression analysis. Chapter 8 introduces dynamic models, where the time dimension is of primary importance.

Chapters 2 through 8 are devoted to analyzing data that may be represented using models that are linear in the parameters, including linear and mixed linear models.
In contrast, Chapters 9 through 11 are devoted to analyzing data that can be represented using nonlinear models. The collection of nonlinear models is vast. To provide a concentrated discussion that relates to the applications orientation of this book, we focus on models where the distribution of the response cannot be reasonably approximated by a normal distribution and alternative distributions must be considered.

We begin in Chapter 9 with a discussion of modeling responses that are dichotomous; we call these binary dependent variable models. Because not all readers with a background in regression theory have been exposed to binary dependent variable models such as logistic regression, Chapter 9 begins with an introductory section under the heading of homogeneous models; these are simply the usual cross-sectional models without heterogeneity parameters. Then, Chapter 9 introduces the issues associated with random and fixed effects models to accommodate the heterogeneity. Unfortunately, random effects model estimators are difficult to compute and the usual fixed effects model estimators have undesirable properties. Thus, Chapter 9 introduces an alternative modeling strategy that is widely used in biological sciences based on a so-called marginal model. This model employs generalized estimating equation (GEE), or generalized method of moments (GMM), estimators that are simple to compute and have desirable properties.

Chapter 10 extends the Chapter 9 discussion to generalized linear models (GLMs). This class of models handles the normal-based models of Chapters 2 through 8, the binary models of Chapter 9, as well as additional important applied models. Chapter 10 focuses on count data through the Poisson distribution, although the general arguments can also be used for other distributions. Like Chapter 9, we begin with the homogeneous case to provide a review for readers who have not been introduced to GLMs.
The next section is on marginal models, which are particularly useful for applications. Chapter 10 follows with an introduction to random and fixed effects models.

Using the Poisson distribution as a basis, Chapter 11 extends the discussion to multinomial models. These models are particularly useful in economic choice models that have seen broad applications in the marketing research literature. Chapter 11 provides a brief overview of the economic basis for these choice models and then shows how to apply these to random effects multinomial models.
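
To make the three homogeneous response types concrete (binary, count and multinomial), here is a short Python sketch of the standard mean functions under the logit link, the log link and the multinomial logit. The function names and the value 0.4 for the systematic component are hypothetical and chosen only for illustration. The sketch covers the cross-sectional case without heterogeneity parameters and does not reproduce any routine from the text.

```python
import math


def logistic_mean(eta):
    """Binary response: probability that y = 1 under a logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def poisson_mean(eta):
    """Count response: expected count under a log link."""
    return math.exp(eta)


def multinomial_logit_probs(etas):
    """Multinomial response: one probability per alternative.

    `etas` holds one systematic component per alternative.
    """
    largest = max(etas)  # subtract the maximum for numerical stability
    weights = [math.exp(e - largest) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]


eta = 0.4  # hypothetical value of the systematic component

print(logistic_mean(eta))
print(poisson_mean(eta))
# With two alternatives and the second component set to zero,
# the multinomial logit reduces to the binary logistic model.
print(multinomial_logit_probs([eta, 0.0]))
```

The last line illustrates why the binary model can be read as a special case of the multinomial one: with two alternatives, the first choice probability coincides with the logistic probability.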

Statistical Software

My goal in writing this text is to reach a broad group of researchers. Thus, to avoid excluding large segments of individuals, I have chosen not to integrate any specific statistical software package into the text. Nonetheless, because of the applications orientation, it is critical that the methodology presented be easily accomplished using readily available packages. For the course taught at the University of Wisconsin, I use the statistical package SAS. (However, many of my students opt to use alternative packages such as STATA and R; I encourage free choice!) In my mind, this is the analog of an existence theorem. If a procedure is important and can be readily accomplished by one package, then it is (or will soon be) available through its competitors. On the book web site,

http://research.bus.wisc.edu/jfrees/book/pdatabook.htm

users will find routines written in SAS for the methods advocated in the text, thus proving that they are readily available to applied researchers. Routines written for STATA and R are also available on the web site. For more information on SAS, STATA and R, visit their web sites:

http://www.sas.com
http://www.stata.com
http://www.r-project.org

References Codes

In keeping with my goal of reaching a broad group of researchers, I have attempted to integrate contributions from different fields that regularly study longitudinal and panel data techniques. To this end, Appendix G contains the references, which are subdivided into six sections. This subdivision is maintained to emphasize the breadth of longitudinal and panel data analysis and the impact that it has made on several scientific fields.
I refer to these sections using the following coding scheme:

B    Biological Sciences Longitudinal Data
E    Econometrics Panel Data
EP   Educational Science and Psychology
O    Other Social Sciences
S    Statistical Longitudinal Data
G    General Statistics

For example, I use Neyman and Scott (1948E) to refer to an article written by Neyman and Scott, published in 1948, that appears in the Econometrics Panel Data portion of the references.

Approach

This book grew out of lecture notes for a course offered at the University of Wisconsin. The pedagogic approach of the manuscript evolved from the course. Each chapter consists of an introduction to the main ideas in words and then as mathematical expressions. The concepts underlying the mathematical expressions are then reinforced with empirical examples; these data are available to the reader at the Wisconsin book web site. Most chapters conclude with exercises that are primarily analytic; some are designed to reinforce basic concepts for (mathematically) novice readers. Others are designed for (mathematically) sophisticated readers and constitute extensions of the theory presented in the main body of the text. The beginning chapters (2-5) also include empirical exercises that allow readers to develop their data analysis skills in a longitudinal data context. Selected solutions to the exercises are also available from the author through the web site.

Readers will find that the text becomes more mathematically challenging as it progresses. Chapters 1-3 describe the fundamentals of longitudinal data analysis and are prerequisites for the remainder of the text. Chapter 4 is prerequisite reading for Chapters 5 and 8. Chapter 6 contains important elements necessary for reading Chapter 7. As described above, a time series analysis course would also be useful for mastering Chapter 8, particularly the Section 8.5 Kalman filter approach.

Chapter 9 begins the section on nonlinear modeling. Only Chapters 1-3 are necessary background for the section. However, because it deals with nonlinear models, the requisite level of mathematical statistics is higher than that of Chapters 1-3. Chapters 10 and 11 continue the development of these models. I do not assume prior background on nonlinear models. Thus, in Chapters 9-11, the first section introduces the chapter topic in a non-longitudinal context that I call a homogeneous model.

Despite the emphasis placed on applications and interpretations, I have not shied from using mathematics to express the details of longitudinal and panel data models. There are many students with excellent training in mathematics and statistics who need to see the foundations of longitudinal and panel data models. Further, a number of texts and summary articles are now available (and cited throughout the text) that place a heavier emphasis on applications. However, applications-oriented texts tend to be field-specific; studying only from such a source can mean that an economics student will be unaware of important developments in educational sciences (and vice versa). My hope is that many readers will choose to use this text as a technical supplement to an applications-oriented text from their own field.

The students in my course come from a wide variety of backgrounds in mathematical statistics. To develop longitudinal and panel data analysis tools and achieve a common set of notation, most chapters contain a short appendix that develops mathematical results cited in the chapter.
Further, there are four appendices at the end of the text that expand mathematical developments used throughout the text. A fifth appendix, on symbols and notation, further summarizes the set of notation used throughout the text. The sixth appendix provides a brief description of selected longitudinal and panel data sets that are used in several disciplines throughout the world.