Causal Analysis: Options for Analyzing Panel Data

What are the following sessions about?

| Session | Date | Topic |
|---|---|---|
| 1 | 09.04.2008 | Introduction and overview |
| 2 | 16.04.2008 | General linear model |
| 3 | 23.04.2008 | Pooled cross-sectional data I |
| - | 30.04.2008 | no session |
| 4 | 07.05.2008 | Pooled cross-sectional data II |
| 5 | 14.05.2008 | Options for analyzing panel data (held despite the Pentecost break) |
| 6 | 21.05.2008 | Panel data analysis of continuous dependent variables I |
| 7 | 28.05.2008 | Panel data analysis of continuous dependent variables II |
| 8 | 04.06.2008 | Panel data analysis of continuous dependent variables III |
| 9 | 11.06.2008 | Panel data analysis of categorical dependent variables I |
| 10 | 18.06.2008 | Panel data analysis of categorical dependent variables II |
| 11 | 25.06.2008 | Panel data analysis of categorical dependent variables III |
| 12 | 02.07.2008 | Event history analysis I |
| 13 | 09.07.2008 | Event history analysis II |
| 14 | 16.07.2008 | Event history analysis III |
| | 22.07.2008 | Written exam (60 minutes) |

Topics
1. Introduction
2. How to manage panel data
3. Independent observations?
4. Describing panel data
5. Explaining panel data
6. Modeling panel data
7. How to estimate models for panel data

Aims
- analysis of panel data is a rapidly growing field in statistics
- terminological heterogeneity: novice users are easily lost
- I want to
  - illustrate problems of panel data analysis
  - give basic orientation on first-choice methods
  - introduce the most important technical terms
- audience: beginners in panel analysis

Important distinction
- categorical dependent variables
- continuous dependent variables

Categorical dependent variables
- few discrete values
- examples: employment status, political attitudes, number of children
- model the probability of observing a certain value (category) of the variable
- Does the probability of being unemployed change with labor force experience?

Continuous dependent variables
- many different values (on a continuous scale)
- examples: income, firm size, gross domestic product, amount of social expenditures
- model certain distributional characteristics of these variables (e.g., expected value)
- Does income on average increase with educational attainment?

2. How to manage panel data

Definition: panel data
- repeated observations of the same individuals over time
- units: individuals, firms, nations, etc.
- 3 dimensions
  - units: i = 1, ..., n
  - variables: ν = 1, ..., V (time-constant, time-dependent)
  - measurements (panel waves): t = 1, ..., T
- How to put this into a 2-dimensional data matrix?

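Wide format

| ID | Kids84 | Kids85 | Educ84 | Educ85 |
|---|---|---|---|---|
| 1 | 0 | 0 | 12 | 12 |
| 2 | 2 | 2 | 9 | 9 |
| 3 | 0 | 1 | 10 | 11 |
| 4 | 1 | 2 | 8 | 8 |
| 5 | 3 | 3 | 13 | 13 |
| 6 | 2 | 2 | 15 | 15 |
| 7 | 0 | 1 | 9 | 10 |
| ... | ... | ... | ... | ... |

- suited to few measurements
- n individuals, T measurements
- size of data matrix
  - n rows (units)
  - V · T columns (variables)
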
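Long format

| ID | Jahr | Kids | Educ |
|---|---|---|---|
| 1 | 1984 | 0 | 12 |
| 1 | 1985 | 0 | 12 |
| ... | ... | ... | ... |
| 2 | 1984 | 2 | 9 |
| 2 | 1985 | 2 | 9 |
| ... | ... | ... | ... |
| 3 | 1984 | 0 | 10 |
| 3 | 1985 | 1 | 11 |
| ... | ... | ... | ... |
| 4 | 1984 | 1 | 8 |
| 4 | 1985 | 2 | 8 |
| ... | ... | ... | ... |
| 5 | 1984 | 3 | 13 |
| 5 | 1985 | 3 | 13 |
| ... | ... | ... | ... |
| 6 | 1984 | 2 | 15 |
| 6 | 1985 | 2 | 15 |
| ... | ... | ... | ... |
| 7 | 1984 | 0 | 9 |
| 7 | 1985 | 1 | 10 |
| ... | ... | ... | ... |
| 7 | 2000 | 2 | 13 |

(Jahr = year)

- suited to many measurements
- n individuals, T measurements
- size of data matrix
  - N = n · T rows (observations)
  - V columns (variables)
- Which observations belong to the same unit?
  - Stata: tsset id jahr
- hierarchical data set: observations clustered within units
- Attention! N = n · T makes it look like you have a lot of observations
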
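The step from wide to long format can be sketched in a few lines of Python. This is only an illustration using the seven rows listed on the wide-format slide; the function name `wide_to_long`, its parameters, and the assumption that the suffixes 84 and 85 mean 1984 and 1985 are choices made here, not part of the original material.

```python
# Wide-format rows copied from the wide-format slide: one row per unit,
# one column per variable and wave (suffix 84 = 1984, 85 = 1985).
wide = [
    {"ID": 1, "Kids84": 0, "Kids85": 0, "Educ84": 12, "Educ85": 12},
    {"ID": 2, "Kids84": 2, "Kids85": 2, "Educ84": 9, "Educ85": 9},
    {"ID": 3, "Kids84": 0, "Kids85": 1, "Educ84": 10, "Educ85": 11},
    {"ID": 4, "Kids84": 1, "Kids85": 2, "Educ84": 8, "Educ85": 8},
    {"ID": 5, "Kids84": 3, "Kids85": 3, "Educ84": 13, "Educ85": 13},
    {"ID": 6, "Kids84": 2, "Kids85": 2, "Educ84": 15, "Educ85": 15},
    {"ID": 7, "Kids84": 0, "Kids85": 1, "Educ84": 9, "Educ85": 10},
]


def wide_to_long(rows, stubs=("Kids", "Educ"), waves=(84, 85), century=1900):
    """Return one record per unit and wave (n * T rows, V columns)."""
    long_rows = []
    for row in rows:
        for wave in waves:
            # Hypothetical convention: two-digit suffix + century = calendar year.
            record = {"ID": row["ID"], "Jahr": century + wave}
            for stub in stubs:
                record[stub] = row[f"{stub}{wave}"]
            long_rows.append(record)
    return long_rows


long_data = wide_to_long(wide)
for record in long_data[:4]:
    print(record)
```

With n = 7 units and T = 2 waves the result has N = 14 rows, and the `ID` column is what tells the software which rows belong to the same unit.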
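Pooling

Pooled time series (sorted by unit)

| ID | Jahr | Kids | Educ |
|---|---|---|---|
| 1 | 1984 | 0 | 12 |
| 1 | 1985 | 0 | 12 |
| ... | ... | ... | ... |
| 2 | 1984 | 2 | 9 |
| 2 | 1985 | 2 | 9 |
| ... | ... | ... | ... |
| 3 | 1984 | 0 | 10 |
| 3 | 1985 | 1 | 11 |
| ... | ... | ... | ... |
| 4 | 1984 | 1 | 8 |
| 4 | 1985 | 2 | 8 |
| ... | ... | ... | ... |
| 5 | 1984 | 3 | 13 |
| 5 | 1985 | 3 | 13 |
| ... | ... | ... | ... |
| 6 | 1984 | 2 | 15 |
| 6 | 1985 | 2 | 15 |
| ... | ... | ... | ... |

Pooled cross-sections (sorted by year)

| ID | Jahr | Kids | Educ |
|---|---|---|---|
| 1 | 1984 | 0 | 12 |
| 2 | 1984 | 2 | 9 |
| 3 | 1984 | 0 | 10 |
| 4 | 1984 | 1 | 8 |
| 5 | 1984 | 3 | 13 |
| 6 | 1984 | 2 | 15 |
| 7 | 1984 | 0 | 9 |
| ... | ... | ... | ... |
| 1 | 1985 | 0 | 12 |
| 2 | 1985 | 2 | 9 |
| 3 | 1985 | 1 | 11 |
| 4 | 1985 | 2 | 8 |
| 5 | 1985 | 3 | 13 |
| 6 | 1985 | 2 | 15 |
| 7 | 1985 | 1 | 10 |
| ... | ... | ... | ... |
| 7 | 2000 | 2 | 13 |
| ... | ... | ... | ... |

Panel data = pooled time series and cross-section data (TSCS)
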
Micro and macro panels
- macro panel
  - OECD data (countries over time)
  - T ≈ n or even T > n
- micro panel
  - household panel studies (e.g., GSOEP)
  - n >> T
- Why is this distinction important?
  - macro: more information to analyze the time dimension
- I focus on micro panels!

3. Independent observations?

Example 1: y continuous
- 545 males observed 1980-1987 (NLS Youth Sample)
- n = 545, T = 8, y = log hourly wage
- Source: Vella and Verbeek (1998)

Log hourly wage: original data

| Year | n | Mean log(y) | Mean y | Sd | Serial correlation (t, t-1) | Serial correlation (t, t=1) |
|---|---|---|---|---|---|---|
| 1980 | 545 | 1.393 | 4.03 | 0.558 | | |
| 1981 | 545 | 1.513 | 4.54 | 0.531 | 0.454 | 0.454 |
| 1982 | 545 | 1.572 | 4.81 | 0.497 | 0.611 | 0.432 |
| 1983 | 545 | 1.619 | 5.05 | 0.481 | 0.690 | 0.408 |
| 1984 | 545 | 1.690 | 5.42 | 0.524 | 0.675 | 0.316 |
| 1985 | 545 | 1.739 | 5.69 | 0.523 | 0.664 | 0.356 |
| 1986 | 545 | 1.800 | 6.05 | 0.515 | 0.632 | 0.297 |
| 1987 | 545 | 1.866 | 6.47 | 0.467 | 0.693 | 0.310 |

- high serial dependence
- decreases with the time lag between measurements

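A serial correlation of the kind reported in the (t, t-1) column is simply the Pearson correlation between the values of the same units in two waves. The wage micro-data are not reproduced in these slides, so the sketch below uses the seven Kids84/Kids85 values from the wide-format slide instead. It only demonstrates the mechanics; the helper name `pearson` is an assumption made here.

```python
import math

# Number of children in 1984 and 1985 for the seven units listed on the
# wide-format slide (NOT the wage data of Example 1, which are not shown).
kids84 = [0, 2, 0, 1, 3, 2, 0]
kids85 = [0, 2, 1, 2, 3, 2, 1]


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Serial correlation between wave t = 1985 and wave t-1 = 1984
r = pearson(kids84, kids85)
print(round(r, 3))
```

Repeating the same calculation for every pair of adjacent waves gives the (t, t-1) column, and correlating each wave with the first wave gives the (t, t=1) column.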
Example 2: y categorical
- 700 females observed 1970-1973 (NLS Women Sample)
- n = 700, T = 4, y = union membership (no, yes)
- Source: Stata Manual

State probabilities, first-order transition matrix (t, t+1), and higher-order transition matrix (t=1, t+1); transition entries in percent

| Year t | Union membership | n | % | First-order: no member | First-order: member | Higher-order: no member | Higher-order: member |
|---|---|---|---|---|---|---|---|
| 1970 | no member | 535 | 76.43 | 91.96 | 8.04 | 91.96 | 8.04 |
| 1970 | member | 165 | 23.57 | 25.45 | 74.55 | 25.45 | 74.55 |
| 1971 | no member | 534 | 76.29 | 92.88 | 7.12 | 91.59 | 8.41 |
| 1971 | member | 166 | 23.71 | 23.49 | 76.51 | 27.27 | 72.73 |
| 1972 | no member | 535 | 76.43 | 93.08 | 6.92 | 90.84 | 9.16 |
| 1972 | member | 165 | 23.57 | 26.67 | 73.33 | 33.94 | 66.06 |
| 1973 | no member | 542 | 77.43 | | | | |
| 1973 | member | 158 | 22.57 | | | | |

- high serial dependence
- decreases with the time lag between measurements

Consequences
- conventional statistical methods assume independent observations
- consequently, estimated standard errors tend to be too low
  - test statistics are too high
  - p-values are too low
- significance tests may lead to erroneous conclusions

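A small worked formula shows why the standard errors come out too low. It rests on a simplifying assumption that is not part of the slides: a balanced panel in which every observation has variance $\sigma^2$, any two observations of the same unit have the same correlation $\rho$, and different units are independent. The variance of the overall mean is then

$$
\operatorname{Var}(\bar{y}) = \frac{\sigma^2}{nT}\,\bigl[1 + (T-1)\rho\bigr],
$$

whereas the conventional formula for independent observations gives only $\sigma^2/(nT)$. As a purely hypothetical illustration, with $T = 8$ and $\rho = 0.5$ the factor in brackets is $1 + 7 \cdot 0.5 = 4.5$, so the conventional standard error would be too small by a factor of $\sqrt{4.5} \approx 2.12$.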
4. Describing panel data

Simple techniques

| | Trend | Serial dependence | Sequence |
|---|---|---|---|
| Continuous | mean, standard deviation | correlation | graphs? tables? |
| Categorical | proportion | transition probability | graphs? tables? |

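Example 2: y categorical

| 1970 | Sequence | n | % | Total |
|---|---|---|---|---|
| no union member | 0000 | 442 | 63.14 | 535 |
| | 0001 | 24 | 3.43 | |
| | 0010 | 19 | 2.71 | |
| | 0011 | 7 | 1.00 | |
| | 0100 | 21 | 3.00 | |
| | 0101 | 3 | 0.43 | |
| | 0110 | 4 | 0.57 | |
| | 0111 | 15 | 2.14 | |
| union member | 1000 | 27 | 3.86 | 165 |
| | 1001 | 3 | 0.43 | |
| | 1010 | 4 | 0.57 | |
| | 1011 | 8 | 1.14 | |
| | 1100 | 8 | 1.14 | |
| | 1101 | 7 | 1.00 | |
| | 1110 | 17 | 2.43 | |
| | 1111 | 91 | 13.00 | |

- for example: four alternatives to get from y70 = 0 to y73 = 1
- sequences are too detailed, especially with large T
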
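The sequence counts contain everything needed to compute transition matrices between any two waves. The following sketch tabulates origin state against destination state and converts each row to proportions. The function name `transition_matrix` and the 0-based wave positions are choices made here for illustration.

```python
# Sequence counts copied from the slide: each key gives the membership
# state (0 = no member, 1 = member) in 1970, 1971, 1972 and 1973.
sequence_counts = {
    "0000": 442, "0001": 24, "0010": 19, "0011": 7,
    "0100": 21, "0101": 3, "0110": 4, "0111": 15,
    "1000": 27, "1001": 3, "1010": 4, "1011": 8,
    "1100": 8, "1101": 7, "1110": 17, "1111": 91,
}


def transition_matrix(counts, origin, destination):
    """Row-normalised transition proportions between two wave positions (0-based)."""
    cells = [[0, 0], [0, 0]]
    for sequence, n in counts.items():
        cells[int(sequence[origin])][int(sequence[destination])] += n
    return [[cell / sum(row) for cell in row] for row in cells]


first_order = transition_matrix(sequence_counts, 0, 1)   # 1970 -> 1971
higher_order = transition_matrix(sequence_counts, 0, 2)  # 1970 -> 1972

for label, row in zip(("no member", "member"), first_order):
    print(label, [round(100 * p, 2) for p in row])
```

For 1970 to 1971 this yields 91.96 / 8.04 percent for non-members and 25.45 / 74.55 percent for members, which agrees with the first-order transition matrix reported for Example 2.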
... continued
- alternatively, focus on one origin state
  - plot the probability of survival in the origin state
  - alternatively, plot the conditional transition probability

Figure: Union membership over time and region (y-axis: probability of survival, 0.00 to 1.00; x-axis: wave, 0 to 4; groups: region = other, region = South)

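The probability of survival in the origin state can be computed from the sequence counts of Example 2. The sketch below does this for all 700 women together. The figure additionally splits the sample by region, which is not possible here because region is not part of the sequence table. The function name `survival_curve` is an assumption made for this illustration.

```python
# Sequence counts from Example 2 (0 = no member, 1 = member; 1970-1973).
sequence_counts = {
    "0000": 442, "0001": 24, "0010": 19, "0011": 7,
    "0100": 21, "0101": 3, "0110": 4, "0111": 15,
    "1000": 27, "1001": 3, "1010": 4, "1011": 8,
    "1100": 8, "1101": 7, "1110": 17, "1111": 91,
}


def survival_curve(counts, origin_state="0"):
    """Share of units starting in origin_state that have never left it, per wave."""
    n_waves = len(next(iter(counts)))
    at_risk = sum(n for seq, n in counts.items() if seq[0] == origin_state)
    curve = []
    for wave in range(1, n_waves + 1):
        # A unit has "survived" up to this wave if all states so far equal the origin.
        stayed = sum(
            n for seq, n in counts.items() if seq[:wave] == origin_state * wave
        )
        curve.append(stayed / at_risk)
    return curve


non_member_survival = survival_curve(sequence_counts, "0")
print([round(p, 4) for p in non_member_survival])
```

The conditional transition probability mentioned on the slide is the complement of the ratio of two successive survival values, for example 1 - (442/535) / (466/535) for leaving non-membership between 1972 and 1973.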
Example 1: y continuous

| 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | Mean |
|---|---|---|---|---|---|---|---|
| 2.937 | 3.473 | 2.971 | 3.777 | 4.052 | 3.293 | 3.096 | 3.174 |
| 2.541 | 3.086 | 3.229 | 2.620 | 2.691 | 2.606 | 2.236 | 2.685 |
| 1.972 | 2.232 | 2.554 | 2.905 | 2.966 | 2.990 | 3.132 | 2.625 |
| 2.338 | 2.416 | 2.514 | 2.648 | 2.636 | 2.719 | 2.769 | 2.550 |
| 2.510 | 2.593 | 2.564 | 2.465 | 2.613 | 2.645 | 2.282 | 2.530 |
| 2.391 | 2.356 | 2.562 | 2.636 | 2.643 | 2.807 | 2.813 | 2.523 |
| 2.412 | 2.315 | 2.508 | 2.599 | 2.700 | 2.741 | 2.663 | 2.518 |
| 1.962 | 2.276 | 2.195 | 2.428 | 2.723 | 2.966 | 3.065 | 2.454 |
| 1.574 | 1.442 | 2.547 | 2.991 | 3.011 | 3.099 | 3.097 | 2.437 |
| 2.531 | 2.171 | 2.197 | 2.455 | 2.069 | 2.429 | 2.602 | 2.377 |
| 0.030 | 0.688 | 0.676 | 0.647 | 0.955 | 1.564 | 1.301 | 0.793 |
| 0.898 | 0.970 | 0.804 | 1.306 | 1.477 | -0.981 | 0.791 | 0.782 |
| 0.541 | 0.195 | 0.383 | 1.104 | 1.066 | 1.202 | 1.046 | 0.777 |
| 0.289 | 0.906 | 0.339 | 1.213 | 1.502 | 1.277 | -0.191 | 0.764 |
| 1.079 | 0.840 | 1.508 | 1.021 | 0.543 | 0.115 | 0.313 | 0.763 |
| 0.075 | 0.172 | 0.948 | 1.227 | 0.435 | 0.639 | 1.678 | 0.760 |
| 0.606 | 0.865 | 0.512 | 0.684 | 1.000 | 1.435 | 1.177 | 0.738 |
| -1.417 | -0.670 | 0.703 | 0.713 | 0.820 | 1.069 | 1.039 | 0.414 |
| -0.149 | 0.181 | -0.036 | 0.397 | 0.973 | 0.563 | 0.759 | 0.333 |

- no duplicate patterns
- simple data listing
- no structure visible

... continued
- alternative: use line plots for each unit
- limited technique with many units

Figure: Income trajectories for 19 men from high and low income groups (y-axis: log hourly wage, -2 to 4; x-axis: year, 1980 to 1988)

5. Explaining panel data

Why different income trajectories?

[Figure: log hourly wage (-2 to 4) by year, 1980 to 1988]

... continued
Explanatory factors (unit, context)
- time-constant variables Z, e.g., ethnicity, national language
- time-dependent variables X, e.g., labor force experience, economic growth
- time t, e.g., calendar time, time since an event (birth, labor force entry)
Notes
- misunderstanding: Z explains the level, X explains change
- few variables are time-constant by nature
- often only time-constant information is available
- time is an indicator rather than a causal factor

Why serial dependence?
1. time-constant variables Z, e.g., ethnicity and income
2. time-dependent variables X: more complicated, but X is often similar next year (serial dependence of the X)
3. Y is influenced by former values of Y, e.g., bureaucratic behavior
Technical terms
- spurious state dependence: (1, 2)
- true state dependence: (3)
- dynamic models: (3)

6. Modeling panel data

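Types of longitudinal models (independent variables → dependent variable: example)
- level X, Z → level Y: human capital theory
- level X, Z → change ΔY: gender-specific mortality
- change ΔX → level Y: adaptive behavior
- change ΔX → change ΔY: many change processes
- level Y (as independent variable): learning processes
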
Mathematics: y continuous

Level:

$$
E(y_{it}) = \underbrace{\beta_0(t) + \beta_1 x_{1it} + \dots + \beta_k x_{kit}}_{\text{time-dependent part}} + \underbrace{\gamma_1 z_{1i} + \dots + \gamma_j z_{ji}}_{\text{time-constant part}}
$$

- a model in levels is easily transformed into a model of change
- both models are conceptually equivalent
- however, the empirical estimates differ

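The transformation from levels to change can be written out by subtracting the level equation for wave $t-1$ from the one for wave $t$. This is a sketch under the assumption, already implicit in the level model, that the coefficients $\beta$ and $\gamma$ do not vary over time; the notation $\Delta$ for the difference between adjacent waves is introduced here.

$$
E(y_{it}) - E(y_{i,t-1}) = \bigl[\beta_0(t) - \beta_0(t-1)\bigr] + \beta_1 (x_{1it} - x_{1i,t-1}) + \dots + \beta_k (x_{kit} - x_{ki,t-1})
$$

or, more compactly,

$$
E(\Delta y_{it}) = \Delta\beta_0(t) + \beta_1 \Delta x_{1it} + \dots + \beta_k \Delta x_{kit}.
$$

The time-constant part $\gamma_1 z_{1i} + \dots + \gamma_j z_{ji}$ is the same in both waves and cancels. The change model therefore contains the same coefficients $\beta$ as the level model, but it carries no information about the $\gamma$.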
Mathematics: y categorical

Level:

$$
\Pr(y_{it} = k) = G(\beta_0(t) + \beta_1 x_{1it} + \dots + \beta_k x_{kit} + \gamma_1 z_{1i} + \dots + \gamma_j z_{ji})
$$

- G(·) is a suitable distribution function
  - normal: probit
  - logistic: logistic regression

Change:

$$
\Pr(y_{it} = k \mid y_{i1} = j, \dots, y_{i,t-1} = j) = G(\beta_0(t) + \beta_1 x_{1it} + \dots + \gamma_1 z_{1i} + \dots)
$$

- conditional transition probability
- logistic regression for discrete event histories

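The role of G(·) is to map the linear predictor, which can be any real number, to a probability between 0 and 1. The two choices named on the slide can be sketched as follows; the function names and the example values of the linear predictor are hypothetical.

```python
import math


def logistic_cdf(eta):
    """Logistic distribution function: G used in logistic regression."""
    return 1.0 / (1.0 + math.exp(-eta))


def normal_cdf(eta):
    """Standard normal distribution function: G used in the probit model."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


# Hypothetical values of the linear predictor beta_0(t) + beta_1 * x_1it + ...
for eta in (-2.0, 0.0, 2.0):
    print(eta, round(logistic_cdf(eta), 3), round(normal_cdf(eta), 3))
```

Both functions are increasing and return 0.5 at a linear predictor of zero. They differ in scale, so logit and probit coefficients are not directly comparable in size.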
7. Estimating models for panel data

What to do?
- switch from E(y) and Pr(y) to observed data (include an error term)
- proceed with one of two alternative options:
  1. treat serial dependence as a nuisance
  2. explicitly model serial dependence

$$
y_{it} = \underbrace{\beta_0(t) + \beta_1 x_{1it} + \dots + \beta_k x_{kit} + e_{it}}_{\text{time-dependent part}} + \underbrace{\gamma_1 z_{1i} + \dots + \gamma_j z_{ji} + u_i}_{\text{time-constant part}}
$$

- u_i (unit-specific effect) accounts for the fact that observations for unit i have something in common that is not captured by X and Z

Strategies of estimation

| Strategy | Continuous | Categorical |
|---|---|---|
| control serial dependence | robust standard errors; generalized estimating equations | robust standard errors; generalized estimating equations |
| level Y | linear regression with fixed or random (unit) effects | logistic regression with fixed or random (unit) effects |
| change ΔY | linear regression with first differences | event history analysis |

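To make "linear regression with first differences" concrete, the following sketch regresses the change in Kids on the change in Educ for the seven units listed on the wide-format slide. It is a purely mechanical illustration with T = 2 and one regressor, not a substantive analysis. The function name `first_difference_ols` is an assumption made here, and the intercept stands for the change in the time effect between the two waves.

```python
# (Kids84, Kids85, Educ84, Educ85) for the seven units on the wide-format slide.
wide = [
    (0, 0, 12, 12), (2, 2, 9, 9), (0, 1, 10, 11), (1, 2, 8, 8),
    (3, 3, 13, 13), (2, 2, 15, 15), (0, 1, 9, 10),
]


def first_difference_ols(rows):
    """OLS of the change in y on the change in x, including an intercept."""
    dy = [kids85 - kids84 for kids84, kids85, _, _ in rows]
    dx = [educ85 - educ84 for _, _, educ84, educ85 in rows]
    n = len(rows)
    mean_dx, mean_dy = sum(dx) / n, sum(dy) / n
    sxy = sum((x - mean_dx) * (y - mean_dy) for x, y in zip(dx, dy))
    sxx = sum((x - mean_dx) ** 2 for x in dx)
    slope = sxy / sxx
    intercept = mean_dy - slope * mean_dx
    return intercept, slope


intercept, slope = first_difference_ols(wide)
print(round(intercept, 3), round(slope, 3))
```

Differencing removes everything that is constant within a unit, including the unit-specific effect, so the slope is identified only by units whose Educ value actually changed between the two waves.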
Finally
What I did not talk about
- statistical assumptions
- dynamic models
- models with reciprocal causation
- measurement error
- panel attrition
- missing data
- ...

Important technical terms
- categorical / continuous variable
- (conditional) transition probability
- dynamic model
- event history analysis
- first differences (FD)
- fixed effects (FE)
- generalized estimating equations (GEE)
- hierarchical data
- macro / micro panel
- pooling
- random effects (RE)
- robust standard errors
- sequence
- serial correlation / dependence
- spurious / true state dependence
- survival probability
- transition matrix
- unit-specific effect
- wide / long format

More info
- Introductory textbook: Chapters 13 and 14 of Wooldridge, J. (2005): Introductory Econometrics: A Modern Approach. South-Western College Publishing.