Estimation of Dynamic Panel Data Models with Sample Selection

Anastasia Semykina*
Department of Economics, Florida State University, Tallahassee, FL 32306-2180
asemykina@fsu.edu

Jeffrey M. Wooldridge
Department of Economics, Michigan State University, East Lansing, MI 48824-1038
wooldri1@msu.edu

March 2, 2011

* Correspondence to: Anastasia Semykina, Department of Economics, Florida State University, Tallahassee, FL 32306-2180, USA. E-mail: asemykina@fsu.edu. Phone: 850-644-4557. Fax: 850-644-4535. We thank the editor M. Hashem Pesaran and three anonymous referees for their useful comments.

Summary

We propose a new method for estimating dynamic panel data models with selection. The method uses backward substitution for the lagged dependent variable, which leads to an estimating equation that requires correcting for contemporaneous selection only. The estimator is valid under relatively weak assumptions about errors and permits avoiding the weak instruments problem associated with differencing. We also propose a simple test for selection bias that is based on the addition of a selection term to the first-difference equation and subsequent testing for significance of this term. The methods are applied to estimating dynamic earnings equations for women.

Key words: Sample selection, Panel data, Dynamic models, Two-step estimation.
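The backward substitution mentioned in the summary can be illustrated numerically. The sketch below is only an illustration under simplifying assumptions that are not in the paper: a single scalar regressor, one cross-section unit, arbitrary parameter values, and the hypothetical helper names `simulate_recursive` and `closed_form`. It checks that a first-order dynamic equation $y_t = \rho y_{t-1} + x_t\beta + c + u_t$ can be rewritten in terms of the initial value $y_0$ and current and lagged $x$ and $u$ only, with no lagged $y$ on the right-hand side.

```python
import numpy as np

def simulate_recursive(rho, beta, x, c, u, y0):
    """Generate y_t = rho*y_{t-1} + x_t*beta + c + u_t for t = 1..T (scalar x, illustrative)."""
    T = len(x)
    y = np.empty(T)
    prev = y0
    for t in range(T):
        prev = rho * prev + x[t] * beta + c + u[t]
        y[t] = prev
    return y

def closed_form(rho, beta, x, c, u, y0):
    """Same path after substituting back to y0, so no lagged y appears."""
    T = len(x)
    y = np.empty(T)
    for t in range(1, T + 1):
        j = np.arange(t)      # j = 0, ..., t-1
        w = rho ** j          # weights rho^j
        lag = t - 1 - j       # 0-based positions of x_{t-j} and u_{t-j}
        y[t - 1] = (rho ** t) * y0 + (w @ x[lag]) * beta + c * w.sum() + w @ u[lag]
    return y

# Arbitrary illustrative values (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
T = 6
x, u = rng.normal(size=T), rng.normal(size=T)
y_rec = simulate_recursive(0.8, 1.5, x, 0.3, u, y0=2.0)
y_sub = closed_form(0.8, 1.5, x, 0.3, u, y0=2.0)
max_gap = float(np.max(np.abs(y_rec - y_sub)))
print(max_gap)  # difference between the two representations, up to rounding error
```

Because the substituted form involves only $y_0$ and exogenous terms, the dependent variable is needed only in the current period, which is why a selection correction for the current period alone can suffice.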

1 Introduction

Recently developed methods for estimating dynamic unobserved effects panel data models have become widely used in applied economics research. In the present paper, we contribute to the literature by developing a new estimation method for models in which the panel is not balanced due to nonrandom selection.

In the absence of selection, the traditional approach to estimating dynamic panel data models is to remove the unobserved effect by first-differencing and then use instrumental variables methods for estimating the differenced equation. This approach was initially proposed by Anderson and Hsiao (1981) and was later considered within a more efficient generalized method of moments (GMM) framework by Holtz-Eakin, Newey and Rosen (1988), Arellano and Bond (1991), Ahn and Schmidt (1995), and others. Blundell and Bond (1998) raised the problem of weak instruments in the context of the first-differenced GMM estimation. This problem arises when the series are highly persistent, which happens in a simple AR(1) model with the autoregressive coefficient close to unity.[1] Blundell and Bond show that imposing restrictions on the initial condition results in additional linear moments that can help to improve the performance of the GMM estimator. As an alternative solution, they model the relationship between the unobserved effect and initial condition through a linear function and suggest using the generalized least squares estimator on the extended model, where the initial value is included in the conditioning set.

[1] Binder, Hsiao and Pesaran (2005) show that the same problem arises in panel vector autoregressive models.

Several previous studies considered estimation of dynamic panel data models with selectivity; most of them use differencing to remove the unobserved effect.[2] Ziliak and Kniesner (1998) and Wooldridge (2002) propose a solution to the selection problem that arises because of nonrandom attrition. Given the nature of attrition as an absorbing state (if the unit is observed in the current period, it is also observed in the previous period), Ziliak and Kniesner, and Wooldridge show that accounting for the current period selection in the differenced equation results in consistent estimation. Under the assumption that errors in the selection equation are normally distributed, the selection correction term is the inverse Mills ratio.

[2] Dynamic panel data models with censoring are considered, for example, by Honoré and Hu (2004), Hu (2002) and Labeaga (1999). See also Bover and Arellano (1997).

Arellano, Bover and Labeaga (1999) consider autoregressive panel data models with sample selection. They model the conditional expectation of the unobserved effect as a linear function of the past values of the dependent variable and consider the distribution of the dependent variable conditional on its past. For each t, the resulting reduced-form equation is estimated on a sub-sample of data, which includes cross-section units without missing past values. Arellano, Bover and Labeaga assume normality of the error terms in both primary and selection equations and use the inverse Mills ratio to account for the fact that only the sub-samples with observed past values are used. The structural autoregressive coefficient is then recovered from the reduced-form coefficients using the restrictions imposed on parameters.

Another solution to the incidental truncation problem in dynamic panel data models was proposed by Kyriazidou (2001), who suggested taking differences between any two periods in which the selection index for the given unit is the same or similar. Under the assumption that the vector of errors is independent and identically distributed over time conditional on the exogenous variables, differencing eliminates both the unobserved effect and selection effect. For consistency, it is crucial that the assumptions of strict stationarity and conditional serial independence of the errors hold.
Moreover, the estimator converges at a rate that is slower than the usual square root of the cross-section sample size.

Another semiparametric estimator was proposed by Gayle and Viauroux (2007), who consider a three-step sieve estimator. In the first step the selection probabilities in each

period are estimated nonparametrically by a kernel estimator. In the second step the inverse probability function is linearized, the unobserved effect is removed by differencing and the parameters in the linearized specification of the inverse probability function are estimated using a sieve minimum distance estimator (a GMM estimator with series used to approximate unknown functions). In the third step the GMM estimator is used to estimate the differenced primary equation augmented by the correction term, where the differenced correction term is again approximated by series estimators.

As mentioned above, most earlier studies use differencing. A benefit of differencing for unbalanced panels is that it removes additive heterogeneity, and therefore any selection is allowed to be arbitrarily correlated with the heterogeneity in the levels equation. Unfortunately, if selection depends on the idiosyncratic shocks, consistent estimation either requires imposing relatively strong assumptions on the properties of error distributions or necessitates derivation of a complicated selection correction term that accounts for selection in several consecutive periods. As noted by Blundell and Bond (1998), differencing may also lead to a weak instruments problem. Furthermore, in the case of incidental truncation, such as labor force participation, units may drop out and appear again in any period; therefore, the use of first-differencing or otherwise conditioning on observability of the dependent variable in multiple consecutive periods in dynamic panel data models with arbitrary selection patterns implies that much of the data is lost.

In this paper we consider an alternative method for estimating dynamic panel data models with selection, which does not rely on differencing. One of the key assumptions is that the initial condition is observed for all cross-section units. To account for unobserved heterogeneity, rather than using differencing we follow Blundell and Bond (1998) and Chamberlain (1980, 1982, 1984), and model the conditional expectation of the unobserved effect as a linear function of the exogenous variables and initial condition. Then, backward substitution for the lagged dependent variable is used to obtain the equation

that contains the lags of the exogenous explanatory variables (which are assumed to be always observed) and the initial condition, but no lags of the dependent variable. As a result, selection correction reduces to a contemporaneous selection problem of the type studied in Wooldridge (1995) with strictly exogenous variables. The ability to focus on selection period-by-period greatly simplifies the derivation of the correction term while allowing general serial correlation in the error of the selection equation. The simplest approach relies on the assumption that the error terms in the selection equation are normally distributed, but we also briefly discuss the possibility of semiparametric estimation. Once the correction term is obtained, the augmented equation can be consistently estimated by nonlinear least squares (NLS) or GMM.

The new estimation methods have several important advantages. Modeling the unobserved effects allows us to estimate the equation of interest in levels, thereby avoiding the weak instruments problem often associated with the estimators that use differencing. In the discussed context the error terms in both primary and selection equations may be heterogeneously distributed over time, and the error in the selection equation may be arbitrarily serially dependent. We also discuss how estimation can be modified, so that the observability of the initial condition is not required, and serial dependence in the error terms is permitted in both equations. Additionally, the approach proposed here makes use of all cross-section units observed at least once after the initial period, which helps to avoid losing data.

2 The Model

Consider a dynamic panel data model with unobserved heterogeneity:

$$ y_{it} = \rho y_{i,t-1} + x_{it}\beta + c_{i1} + u_{it1}, \qquad t = 1, \ldots, T, \tag{1} $$

where $x_{it}$ is a $1 \times K$ vector of time-varying variables, $\beta$ is a $K \times 1$ vector of parameters, $\rho$ is a scalar parameter, $c_{i1}$ is a time-constant unobserved effect, and $u_{it1}$ is an idiosyncratic error. Variables in $x_{it}$ are assumed to be strictly exogenous conditional on the unobserved effect, but may be correlated with $c_{i1}$.

Selection occurs because of the partial observability of the dependent variable, $y_{it}$. This is modeled by specifying a selection rule

$$ s_{it} = 1[z_{it}\delta_{2t} + c_{i2} + u_{it2} > 0], \qquad t = 1, \ldots, T, \tag{2} $$

where $s_{it}$ is a selection indicator that equals one if $y_{it}$ is observed and is zero otherwise, $c_{i2}$ is a time-constant unobserved effect, $u_{it2}$ is an idiosyncratic error, $z_{it}$ is a $1 \times L$ ($L > K$) vector of variables that are strictly exogenous conditional on the unobserved effect, and $\delta_{2t}$ is an $L \times 1$ vector of parameters. In what follows, it is assumed that $z_{it}$ contains all of the regressors from the primary equation, but must also contain at least one additional time-varying variable. Additional variables may be the factors that affect selection but not the dependent variable in the primary equation. Alternatively, if selection is partly determined by the lagged values of $y_{it}$ (as in some labor supply models, for example), vector $z_{it}$ would include lagged values of $x_{it}$.

Given the selection problem, estimation of equation (1) by differencing is complicated for several reasons. First, we need to observe the dependent variable and explanatory variables in the current and previous periods. Because of the lagged dependent variable, we would only be able to use observations where $y_{it}$ is observed in three consecutive periods. Moreover, any selection correction term would involve conditioning on observability in three different periods, making its derivation and estimation difficult.

We can avoid these problems by substituting back for $y_{i,t-1}$ and expressing $y_{it}$ through

the current and lagged values of the explanatory variables and the initial condition, $y_{i0}$:

$$ y_{it} = \rho^{t} y_{i0} + \left( \sum_{j=0}^{t-1} \rho^{j} x_{i,t-j} \right) \beta + c_{i1} \sum_{j=0}^{t-1} \rho^{j} + \sum_{j=0}^{t-1} \rho^{j} u_{i,t-j,1}, \qquad t = 1, \ldots, T. \tag{3} $$

Denote $z_i \equiv (z_{i1}, z_{i2}, \ldots, z_{iT})$. Given (3), the estimating equation can be derived under the following assumption:

ASSUMPTION 2.1
(i) $y_{i0}$ and $z_i$ are always observed, while $y_{it}$, $t = 1, \ldots, T$, are observed only for $s_{it} = 1$.
(ii) $E(u_{it1} \mid x_{it}, y_{i,t-1}, x_{i,t-1}, \ldots, y_{i0}, c_{i1}) = 0$, so that $Cov(u_{it1}, u_{is1}) = 0$ for all $s \neq t$.
(iii) $E(u_{it1} \mid z_i, y_{i0}, c_{i1}) = 0$, $t = 1, \ldots, T$.
(iv) $c_{i1} = \eta_1 + \sum_{s=1}^{T} \xi_s z_{is} + \gamma_1 y_{i0} + a_{i1}$, $E(a_{i1} \mid z_i, y_{i0}) = 0$.
(v) $c_{i2} = \eta_2 + \sum_{s=1}^{T} \psi_s z_{is} + \gamma_2 y_{i0} + a_{i2}$.
(vi) For $v_{it2} = a_{i2} + u_{it2}$, $v_{it2} \mid z_i, y_{i0} \sim \text{Normal}(0, \sigma_{2t}^{2})$, $t = 1, \ldots, T$.
(vii) For $v_{it1} = \sum_{j=0}^{t-1} \rho^{j} (u_{i,t-j,1} + a_{i1})$, $E(v_{it1} \mid z_i, y_{i0}, v_{it2}) = \phi_{2t} v_{it2}$, $t = 1, \ldots, T$.

According to part (ii) of Assumption 2.1, the conditional mean in equation (1) is assumed to be dynamically complete, which is a rather standard assumption in the literature. This part of the assumption ensures that $y_{i0}$ is exogenous with respect to the final error in (3). At the end of this section we discuss an alternative set of assumptions and the corresponding estimating equation, where the dynamic completeness assumption is dropped, so that $\{u_{it1}\}$ may be serially correlated.

Part (iv) of Assumption 2.1 uses Chamberlain's (1980, 1982, 1984) device to model the conditional mean of the unobserved effect, $c_{i1}$, as a linear function of exogenous variables (see also Blundell and Bond, 1998). This approach was used by Wooldridge (2005) in

the context of nonlinear dynamic panel data models with balanced panels. In general, $z_{it}$ may contain time-constant variables; of course, the leads and lags of such variables would not be included in the conditional mean of $c_{i1}$. A non-zero correlation between the time-constant variables and $c_{i1}$ implies that the effect of these variables cannot be distinguished from that of the unobserved heterogeneity. However, it may still be useful to include the time-invariant characteristics in $z_{it}$ because controlling for more variables can help to improve on the precision of the estimator.

Under Assumption 2.1, parts (i)-(iv), the primary equation can be written as

$$ y_{it} = \rho^{t} y_{i0} + \left( \sum_{j=0}^{t-1} \rho^{j} x_{i,t-j} \right) \beta + \left( \frac{1-\rho^{t}}{1-\rho} \right) \left( \eta_1 + \sum_{s=1}^{T} \xi_s z_{is} + \gamma_1 y_{i0} \right) + v_{it1}, \qquad E(v_{it1} \mid z_i, y_{i0}) = 0, \quad t = 1, \ldots, T, \tag{4} $$

where $v_{it1} = \sum_{j=0}^{t-1} \rho^{j} (u_{i,t-j,1} + a_{i1})$, $t = 1, \ldots, T$, are the new error terms, which will be serially correlated even though the initial idiosyncratic errors were not. Equation (4) can be used to estimate the parameters when the panel is balanced.[3]

[3] We thank the anonymous referee for bringing this fact to our attention. The referee also noted that an interesting question is whether our approach is less efficient than the Blundell and Bond (1998) approach. This is difficult to say, as the two approaches make different assumptions about the initial condition.

Estimating equation (4) by NLS or GMM can serve as an alternative to traditional estimators that combine first differencing with instrumental variables methods. As mentioned in the introduction, a GMM estimator that uses first-differenced data suffers from the weak instruments problem when the series are highly persistent. Specifically, for a sequentially exogenous variable $\omega_{it}$, such as a lagged dependent variable, we can write the data generating process as $\omega_{it} = \rho\,\omega_{i,t-1} + \epsilon_{it}$, where $Cov(\epsilon_{is}, \epsilon_{it}) = 0$ for $s \neq t$. In the extreme case, where $\rho = 1$, $\Delta\omega_{it} = \epsilon_{it}$, so that past values $(\omega_{i,t-1}, \ldots, \omega_{i1})$ are not correlated with $\Delta\omega_{it}$ and hence cannot be used as instruments. When $\rho$ is close to one, the lagged values are correlated with $\Delta\omega_{it}$, but the correlation is weak, which results in the weak instruments

problem. It is important to note, however, that this problem arises only when the estimation method is GMM. Binder, Hsiao and Pesaran (2005) proposed a quasi maximum likelihood estimator that uses differencing to remove unobserved heterogeneity, but does not suffer from the weak instruments problem. Similarly, Hsiao, Pesaran and Tahmiscioglu (2002) propose a transformed likelihood approach and show that their maximum likelihood estimator that uses differenced data performs better than the GMM estimator.

In equation (4), the weak instruments problem does not arise. Because all variables in (4) are in levels, all of them are exogenous under Assumption 2.1 parts (ii)-(iv) and hence are used as their own instruments. Although the estimator relies on time variation in the variables, the source of this variation does not matter. Even if $\rho = 1$, the parameters in (4) can be consistently estimated by NLS or GMM, as long as $Var(\epsilon_{it}) \neq 0$. As is true for all panel data models with large N and fixed T, the autoregressive coefficient can be identified from the cross-sectional variation in the data.

In the context of an unbalanced panel, under Assumption 2.1, parts (v) and (vi), the selection equation can be written as

$$ s_{it} = 1\left[ \eta_2 + z_{it}\delta_{2t} + \sum_{s=1}^{T} \psi_s z_{is} + \gamma_2 y_{i0} + v_{it2} > 0 \right], \qquad t = 1, \ldots, T, \tag{5} $$

$$ v_{it2} \mid z_i, y_{i0} \sim \text{Normal}(0, \sigma_{2t}^{2}), \qquad t = 1, \ldots, T, \tag{6} $$

where Chamberlain's modeling device is used to model the distribution of the time-constant unobserved effect, $c_{i2}$. Note that due to the presence of the unobserved effect, the composite errors, $v_{it2} = u_{it2} + a_{i2}$, $t = 1, \ldots, T$, are necessarily serially correlated. Also, error variances are allowed to vary over time. The normality assumption is not crucial for estimating the selection equation. As long as $v_{it2}$ is independent of $(z_i, y_{i0})$ and the appropriate regularity conditions hold, parameters in (5) can be consistently estimated using a semiparametric estimator (see, for example, Ichimura 1993, Klein and

Spady 1991). However, as discussed below, the derivation of the selection correction term is substantially simplified if Assumption 2.1(vi) holds.

To correct for the selection bias, we consider a two-step estimator and use assumptions similar to those in the standard selection literature in a cross-sectional context; see, for example, Wooldridge (2002, Chapter 17). Specifically, from Assumption 2.1(vii) it follows that

$$ E[v_{it1} \mid z_i, y_{i0}, s_{it} = 1] = E[E(v_{it1} \mid v_{it2}) \mid z_i, y_{i0}, s_{it} = 1] = E[\phi_{2t} v_{it2} \mid z_i, y_{i0}, s_{it} = 1] = h_t(z_i, y_{i0}) \equiv h_{it}, \qquad t = 1, \ldots, T, \tag{7} $$

where $h_{it} \equiv h_t(\eta_2 + z_{it}\delta_{2t} + \sum_{s=1}^{T} \psi_s z_{is} + \gamma_2 y_{i0})$, and $h_t(\cdot)$ is an unknown function. From (7), it follows that for $s_{it} = 1$, equation (4) can be written as

$$ y_{it} = \rho^{t} y_{i0} + \left( \sum_{j=0}^{t-1} \rho^{j} x_{i,t-j} \right) \beta + \left( \frac{1-\rho^{t}}{1-\rho} \right) \left( \eta_1 + \sum_{s=1}^{T} \xi_s z_{is} + \gamma_1 y_{i0} \right) + h_{it} + e_{it1}, \qquad E(e_{it1} \mid z_i, y_{i0}, s_{it} = 1) = 0, \quad t = 1, \ldots, T. \tag{8} $$

It is possible to estimate equation (8) semiparametrically. A semiparametric estimator would be appropriate if either the error distribution in the selection equation is not normal, or $E(v_{it1} \mid z_i, y_{i0}, v_{it2})$ is a nonlinear function of $v_{it2}$, or both. However, it is also useful to consider a fully parametric approach that would lead to a simple estimation routine and would help to avoid computational difficulties typically associated with semiparametric methods. Therefore, in what follows we focus on the parametric case.

Under Assumption 2.1, parts (vi) and (vii), function $h_t$ is given by

$$ h_t(\cdot) = \phi_{2t}\, \frac{\varphi(\cdot)}{\Phi(\cdot)} \equiv \phi_{2t}\, \lambda(\cdot), \tag{9} $$

where $\varphi(\cdot)$ and $\Phi(\cdot)$ are standard normal pdf and cdf, respectively, and $\lambda(\cdot)$ is the inverse

Mills ratio. Thus, with some abuse of notation we can write the primary equation for the selected sample as

$$ y_{it} = \rho^{t} y_{i0} + \left( \sum_{j=0}^{t-1} \rho^{j} x_{i,t-j} \right) \beta + \left( \frac{1-\rho^{t}}{1-\rho} \right) \left( \eta_1 + \sum_{s=1}^{T} \xi_s z_{is} + \gamma_1 y_{i0} \right) + \phi_{2t} \lambda_{it2} + e_{it1}, \qquad E(e_{it1} \mid z_i, y_{i0}, s_{it} = 1) = 0, \quad t = 1, \ldots, T, \tag{10} $$

where $\lambda_{it2} \equiv \lambda(\eta_2 + z_{it}\delta_{2t} + \sum_{s=1}^{T} \psi_s z_{is} + \gamma_2 y_{i0})$. Under Assumption 2.1, equation (10) is the final estimating equation that can be consistently estimated by NLS or GMM.

As an alternative approach, one could treat the initial condition as an unobserved effect and model its conditional expectation as a linear function of exogenous variables, as suggested by Chamberlain (1984).[4] In this case, the dynamic completeness of the conditional mean in equation (1) is not needed (and most likely will not hold), so that the idiosyncratic errors in (1) may be serially correlated. Formally, the set of assumptions can be summarized as follows:

ASSUMPTION 2.2
(i) $y_{i0}$ is not observed, $z_i$ is always observed, and $y_{it}$, $t = 1, \ldots, T$, are observed only for $s_{it} = 1$.
(ii) $E(u_{it1} \mid z_i) = 0$, $t = 1, \ldots, T$.
(iii) $y_{i0} = \sum_{s=1}^{T} \kappa_s z_{is} + b_i$, $E(b_i \mid z_i) = 0$.
(iv) $c_{i1} = \eta_1 + \sum_{s=1}^{T} \xi_s z_{is} + \gamma_1 y_{i0} + a_{i1} = \eta_1 + \sum_{s=1}^{T} (\xi_s + \gamma_1 \kappa_s) z_{is} + a_{i1} + \gamma_1 b_i$, $E(a_{i1} \mid z_i) = 0$.
(v) $c_{i2} = \eta_2 + \sum_{s=1}^{T} \psi_s z_{is} + \gamma_2 y_{i0} + a_{i2} = \eta_2 + \sum_{s=1}^{T} (\psi_s + \gamma_2 \kappa_s) z_{is} + a_{i2} + \gamma_2 b_i$.
(vi) For $v_{it2} = a_{i2} + \gamma_2 b_i + u_{it2}$, $v_{it2} \mid z_i \sim \text{Normal}(0, \sigma_{2t}^{2})$, $t = 1, \ldots, T$.
(vii) For $v_{it1} = \rho^{t} b_i + \sum_{j=0}^{t-1} \rho^{j} (u_{i,t-j,1} + a_{i1} + \gamma_1 b_i)$, $E(v_{it1} \mid z_i, v_{it2}) = \phi_{2t} v_{it2}$, $t = 1, \ldots, T$.

[4] We thank the anonymous referee for suggesting that we consider this approach.

Under Assumption 2.2, for $s_{it} = 1$, the primary equation can be written as

$$ y_{it} = \rho^{t} \sum_{s=1}^{T} \kappa_s z_{is} + \left( \sum_{j=0}^{t-1} \rho^{j} x_{i,t-j} \right) \beta + \left( \frac{1-\rho^{t}}{1-\rho} \right) \left[ \eta_1 + \sum_{s=1}^{T} \tilde{\xi}_s z_{is} \right] + \phi_{2t} \lambda_{it2} + e_{it1}, \qquad E(e_{it1} \mid z_i, s_{it} = 1) = 0, \quad t = 1, \ldots, T, \tag{11} $$

where $\lambda_{it2} \equiv \lambda(\eta_2 + z_{it}\delta_{2t} + \sum_{s=1}^{T} \tilde{\psi}_s z_{is})$, $\tilde{\psi}_s = \psi_s + \gamma_2 \kappa_s$ and $\tilde{\xi}_s = \xi_s + \gamma_1 \kappa_s$. Similarly to (10), parameters in equation (11) can be consistently estimated by NLS or GMM, as discussed in the following two sections. Alternatively, one can estimate the reduced-form equation and then obtain structural coefficients, $\rho$ and $\beta$, using nonlinear restrictions on parameters. In (11), it is possible to test for the presence of the observed dynamics. If only the unobserved dynamics is present, the lags of the exogenous variables would not appear in equation (11), i.e. $\rho$ would be zero.

Specifying the estimating equation as in (11) has the advantage of allowing serial correlation in the idiosyncratic errors in equation (1). However, it also requires that the model necessarily contains exogenous time-varying explanatory variables and ignores the dynamics that is due to unobserved factors that are not included in the model. In what follows, we focus on the approach where the initial condition appears in the conditioning set, and the conditional mean in (1) is assumed to be dynamically complete, so that $u_{it1}$ are serially uncorrelated. We emphasize, however, that equation (11) can also be estimated using the proposed methods.

3 NLS Estimation

A simple way to obtain a consistent estimator of parameters in equation (10) is to replace $\lambda_{it2}$ with its consistent estimator and estimate the parameters in two steps. Under

Assumption 2.1(v) and (vi), equation (5) can be consistently estimated by probit after the error variance is normalized to equal unity. Since error variances may differ across time periods, it is most appropriate to estimate the selection equation separately for each time period. Denote the first-step estimators $\hat{\pi}_t = (\hat{\eta}_{t2}, \hat{\psi}_{1t}, \ldots, \widehat{\delta_{2t} + \psi_{tt}}, \ldots, \hat{\psi}_{Tt}, \hat{\gamma}_{t2})$, $\hat{\pi} = (\hat{\pi}_1, \ldots, \hat{\pi}_T)$, and the first-step vector of regressors $q_{it} = (1, z_{i1}, \ldots, z_{iT}, y_{i0})$. These can be used to obtain $\hat{\lambda}_{it2} \equiv \lambda(q_{it}\hat{\pi}_t)$, and then $\hat{\lambda}_{it2}$ can be used instead of $\lambda_{it2}$ in equation (10).

Denote the $1 \times [K + LT + T + 3]$ vector of the parameters $\theta \equiv (\rho, \beta, \eta_1, \xi_1, \ldots, \xi_T, \gamma_1, \phi_{21}, \ldots, \phi_{2T})$. Parameters in $\theta$ can be consistently estimated by pooled nonlinear least squares (NLS) on the selected sample. Define the conditional expectation of $y_{it}$:

$$ m_{it}(\theta) \equiv m(z_i, y_{i0}, s_{it} = 1; \theta) = E(y_{it} \mid z_i, y_{i0}, s_{it} = 1), \tag{12} $$

where

$$ m(z_i, y_{i0}, s_{it} = 1; \theta) = \rho^{t} y_{i0} + \left( \sum_{j=0}^{t-1} \rho^{j} x_{i,t-j} \right) \beta + \left( \frac{1-\rho^{t}}{1-\rho} \right) \left( \eta_1 + \sum_{s=1}^{T} \xi_s z_{is} + \gamma_1 y_{i0} \right) + \phi_{2t} \lambda_{it2}. \tag{13} $$

The correction term, $\lambda_{it2}$, is not available, but it can be replaced by a consistent estimator mentioned above. In general, let $m_{it}(\theta, \hat{\pi})$ be a conditional expectation obtained using the estimators of the parameters in the selection equation. Then, the pooled NLS estimator of $\theta$ is the solution to the minimization problem

$$ \min_{\theta} \; \frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \left[ y_{it} - m_{it}(\theta, \hat{\pi}) \right]^{2}, \tag{14} $$

where one half is used as a multiplier for convenience. The first-order condition for this problem is

$$ \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it}\, \nabla_{\theta} m_{it}(\hat{\theta}, \hat{\pi})' \left[ y_{it} - m_{it}(\hat{\theta}, \hat{\pi}) \right] = 0, \tag{15} $$

which can be solved for $\hat{\theta}$ using iterative procedures. As is standard in panel data models, for identification it is necessary that $T \geq 2$.

In summary, if Assumption 2.1 holds, a consistent estimator of $\theta$ can be obtained from the following two-step procedure:

PROCEDURE 3.1
1. For each $t = 1, \ldots, T$, estimate separate probit models of $s_{it}$ on $1, z_{i1}, \ldots, z_{iT}, y_{i0}$, $i = 1, \ldots, N$, and compute the inverse Mills ratios, $\hat{\lambda}_{it2}$.
2. For $s_{it} = 1$, estimate equation (10) with $\lambda_{it2}$ replaced by $\hat{\lambda}_{it2}$ by pooled NLS. Estimate the asymptotic variance as described in Appendix A.

From Procedure 3.1 it is apparent that one needs at least one additional exogenous variable in the selection equation ($L > K$). Although the inverse Mills ratio, $\hat{\lambda}_{it2}$, is a nonlinear function of its argument, it is approximately linear over most of its range, which may lead to multicollinearity. Thus, it is necessary to have at least one exclusion restriction in order to make the estimation convincing.

Even though the resulting estimator is consistent, it is not efficient. From equations (3) and (4) it is seen that the error terms in (10) are serially correlated. In addition, the errors are going to be heteroskedastic because of selection. A nonlinear analog of the seemingly unrelated regressions estimator (see Wooldridge 2002, Problem 12.7) cannot be

used in this context because selection is not strictly exogenous in the selection equation. However, one can improve efficiency by using a GMM estimator, as discussed in the next section.

4 GMM Estimation

The efficiency of the two-step estimator can be improved by using GMM at the second step. Equation (10) is linear in regressors, but nonlinear in parameters, which results in overidentification and permits obtaining a more efficient estimator than pooled NLS. To specify a GMM estimator, define a $1 \times (LT + 3)$ vector of instruments $\hat\omega_{it} \equiv \omega_{it}(\hat\pi_t) \equiv (1, y_{i0}, z_{i1}, \ldots, z_{iT}, \hat\lambda_{it2})$, $t = 1, \ldots, T$, and a $T \times T(LT + 3)$ matrix of instruments $\hat W_i$,

$$\hat W_i \equiv W_i(\hat\pi) \equiv \begin{pmatrix} \hat\omega_{i1} & 0 & \cdots & 0 \\ 0 & \hat\omega_{i2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat\omega_{iT} \end{pmatrix}. \tag{16}$$

Here $0$ denotes a $1 \times (LT + 3)$ vector of zeros. Define a $T \times 1$ vector $\hat g_i \equiv g_i(\theta, \hat\pi) \equiv (\hat g_{i1}, \ldots, \hat g_{iT})'$, where

$$\hat g_{it} \equiv g_{it}(\theta, \hat\pi_t) \equiv s_{it}\,[y_{it} - m_{it}(\theta, \hat\pi)], \quad t = 1, \ldots, T. \tag{17}$$

From equation (10) it follows that the following moment conditions are available:

$$E[W_i(\pi)'\,g_i(\theta, \pi)] = 0. \tag{18}$$

Since the conditional expectation of $y_{it}$ is different in each time period, equation (18) implies $T(LT + 3)$ moment conditions. Moreover, because $m_{it}(\theta, \hat\pi)$ is nonlinear in $\theta$,

these conditions are not redundant and can be used to enhance efficiency. The GMM estimator of $\theta$ is the solution to the minimization problem

$$\min_{\theta} \left(\sum_{i=1}^{N} W_i(\hat\pi)'\,g_i(\theta, \hat\pi)\right)' \hat\Omega^{-1} \left(\sum_{i=1}^{N} W_i(\hat\pi)'\,g_i(\theta, \hat\pi)\right), \tag{19}$$

where $\hat\Omega^{-1}$ is a consistent estimator of a $T(LT + 3) \times T(LT + 3)$ positive semidefinite weighting matrix $\Omega^{-1}$. The first-order condition for this problem is given by

$$\left[\sum_{i=1}^{N} W_i(\hat\pi)'\,\nabla_\theta g_i(\hat\theta, \hat\pi)\right]' \hat\Omega^{-1} \left[\sum_{i=1}^{N} W_i(\hat\pi)'\,g_i(\hat\theta, \hat\pi)\right] = 0. \tag{20}$$

Then, $\theta$ can be consistently estimated using a procedure similar to Procedure 3.1, where the GMM estimator is used instead of the pooled NLS estimator.

Notice that the pooled NLS estimator is identical to a GMM estimator that exploits the moment conditions

$$\sum_{t=1}^{T} E[\nabla_\theta g_{it}(\theta, \pi)'\,g_{it}(\theta, \pi)] = 0 \tag{21}$$

and uses the weighting matrix

$$\left\{\sum_{t=1}^{T} E[\nabla_\theta g_{it}(\theta, \pi)'\,\nabla_\theta g_{it}(\theta, \pi)]\right\}^{-1}. \tag{22}$$

Thus, in the NLS estimation, the instruments are stacked on top of each other, and each time period receives an equal weight. In contrast, a general GMM estimator that uses a block-diagonal matrix of instruments, as in equation (16), assigns different weights to each time period, which can be used to improve efficiency. In the discussion below, "the GMM estimator" refers to the solution of the minimization problem (19).

The proposed GMM estimator will be consistent for any positive definite matrix $\Omega$; however, a particular form is preferred. Specifically, we formulate an additional assumption:

ASSUMPTION 4.1
(i) $\Lambda$ is the asymptotic variance of $W_i(\hat\pi)'\,g_i(\theta, \hat\pi)$.
(ii) $\Omega = \Lambda$.
(iii) $\hat\Omega \xrightarrow{p} \Omega$.

Appendix A provides a formula for $\hat\Omega$ that satisfies Assumption 4.1. Following a standard argument for the relative efficiency of the GMM estimator, the GMM estimator that employs the weighting matrix $\hat\Omega$ as specified in Assumption 4.1 is asymptotically more efficient than pooled NLS and results in a relatively simple expression for the asymptotic variance of $\hat\theta$. Specifically, denote $G \equiv E[W_i(\pi)'\,\nabla_\theta g_i(\theta, \pi)]$. If $\Omega$ satisfies Assumption 4.1, then the asymptotic variance of the described GMM estimator is

$$\text{Avar}(\hat\theta) = (G'\Omega^{-1}G)^{-1}/N, \tag{23}$$

which can be estimated as $(\hat G'\hat\Omega^{-1}\hat G)^{-1}/N$, using the formulae provided in Appendix A.

We can now summarize a two-step estimation procedure. Let Assumptions 2.1 and 4.1 hold. Then, an estimator of $\theta$ that is asymptotically more efficient than the estimator discussed in Section 3 can be obtained using the procedure:

PROCEDURE 4.1
1. For each $t = 1, \ldots, T$, estimate separate probit models of $s_{it}$ on $1, z_{i1}, \ldots, z_{iT}, y_{i0}$, $i = 1, \ldots, N$, and compute the inverse Mills ratios, $\hat\lambda_{it2}$.

2. In equation (10), replace $\lambda_{it2}$ with $\hat\lambda_{it2}$. For $s_{it} = 1$, estimate the equation by GMM that uses moment conditions (18) and the weighting matrix that satisfies Assumption 4.1. Estimate the asymptotic variance as described in Appendix A.

It is important to note that more moment conditions are available in addition to those specified in equation (18). Equation (10) implies that $e_{it1}$ is uncorrelated with any function of $z_i$ and $y_{i0}$. Therefore, any nonlinear functions of the exogenous variables and the initial condition should be valid instruments and can be used to obtain additional moment conditions.

The proposed two-step estimator can also be formulated as a joint GMM estimator of $(\theta, \pi)$. As suggested by Newey and McFadden (1994, Section 6.1), such an estimator can be obtained by stacking the moment conditions from the two steps. The moment conditions from the second step are given in (18), while the first-order conditions from the first-step estimation generate the additional moment conditions:

$$E\left[\{\Phi(q_{it}\pi_t)[1 - \Phi(q_{it}\pi_t)]\}^{-1}\,\phi(q_{it}\pi_t)\,q_{it}'\,[y_{it2} - \Phi(q_{it}\pi_t)]\right] = 0, \quad t = 1, \ldots, T. \tag{24}$$

The conditions in (18) and (24) can be used to form a vector of moment conditions for the joint GMM estimation. In that way the additional conditions can be used for estimating $\theta$, which can help to improve efficiency. However, since the first-step equations are exactly identified, the efficiency gain may be modest or even absent. Moreover, the two-step GMM estimator appears to be computationally more tractable than the joint GMM estimator in applications where the number of first-step moment conditions is large, for example, because $T$ is relatively large.

To study the properties of the proposed estimators in finite samples we performed Monte Carlo experiments.
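As an illustration of the first-step quantities that enter Procedures 3.1 and 4.1 and moment condition (24), the following minimal sketch evaluates the inverse Mills ratio $\lambda(a) = \phi(a)/\Phi(a)$ and the probit score for a single observation. The function names, the flat-list layout of the regressors, and the toy inputs are hypothetical; the coefficients are assumed to come from a first-step probit that has already been estimated, and the 0/1 selection indicator is passed as `s`.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def inverse_mills(a):
    """Inverse Mills ratio lambda(a) = phi(a) / Phi(a)."""
    return _STD_NORMAL.pdf(a) / _STD_NORMAL.cdf(a)


def probit_score(q, pi, s):
    """Probit score for one observation (the term inside the expectation in (24)).

    q  : regressors (1, z_i1, ..., z_iT, y_i0) as a flat list (assumed layout)
    pi : first-step coefficients for the same period
    s  : observed 0/1 selection indicator
    """
    index = sum(qk * pk for qk, pk in zip(q, pi))
    cdf = _STD_NORMAL.cdf(index)
    pdf = _STD_NORMAL.pdf(index)
    weight = pdf / (cdf * (1.0 - cdf))
    return [weight * qk * (s - cdf) for qk in q]


if __name__ == "__main__":
    # Toy values only: an index of zero and a selected observation.
    print(inverse_mills(0.0))
    print(probit_score([1.0, 2.0], [0.0, 0.0], 1))
```

At the first-step probit estimates, the sample average of `probit_score` over all cross-section units is zero for each period, which is the sample analog of (24).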
(Footnote 5: A detailed description of the experiments and all results are provided in the supplement to the paper, which is available from the authors upon request.) In the experiments, among the three estimators that account

for the selection bias (two-step NLS, two-step GMM and joint GMM that uses the moment conditions for both equations) the two-step NLS estimator has the smallest standard deviations and root mean square errors (RMSEs) in small samples ($N = 200$), which is likely because the GMM estimators use estimated weighting matrices, $\hat\Omega$, that cannot be precisely estimated in small samples. However, in large samples ($N = 4000$) both GMM estimators are more efficient than the two-step NLS estimator. The joint GMM estimator tends to have slightly smaller standard deviations and RMSEs than the two-step GMM estimator, but the differences are minor and virtually disappear when $N$ is large ($N = 4000$).

The two-step NLS, two-step GMM and joint GMM estimators also perform reasonably well when testing simple hypotheses about parameters. Although for all three estimators the true null is rejected too often in small samples (with the over-rejection being most severe for the two-step GMM estimator), the computed size gets closer to the nominal size as $N$ grows. Both the two-step GMM and joint GMM estimators outperform the two-step NLS estimator in terms of the power of the tests.

5 Testing for Selection Bias

It is possible to test for selection bias by testing the hypothesis $H_0: \varphi_{2t} = 0$ in equation (10). A variety of tests for GMM estimators described in Newey and McFadden (1994, Section 9) can be used for this purpose. However, such tests require estimation of either the restricted or the unrestricted model, or both, prior to testing. Since estimation of equation (10) may be computationally costly due to nonlinearity in the parameters, it is useful to have a simple alternative.

A simple test can be developed based on the initial linear model (1). To construct a test, introduce a new selection indicator which identifies observability of $y_{it}$ in three

consecutive periods, and nominally assume that this new indicator follows an index model with unobserved heterogeneity:

$$d_{it} = 1[s_{it}\,s_{i,t-1}\,s_{i,t-2} = 1] = 1[z_{it}\delta_{30t} + z_{i,t-1}\delta_{31t} + z_{i,t-2}\delta_{32t} + c_{i3} + u_{it3} > 0], \quad t = 3, \ldots, T, \tag{25}$$

where $c_{i3}$ is the unobserved effect and $u_{it3}$ is the idiosyncratic error. Moreover, (nominally) assume that $u_{it3}$ is normally distributed and independent of the explanatory variables and unobserved effect,

$$u_{it3} \mid z_i, c_{i3} \sim \text{Normal}(0, 1). \tag{26}$$

Using Chamberlain's approach and assuming normality, write the unobserved effect as

$$c_{i3} = \eta_3 + \sum_{s=1}^{T} z_{is}\zeta_s + a_{i3}, \quad a_{i3} \mid z_i \sim \text{Normal}(0, \sigma_{3t}), \quad t = 3, \ldots, T. \tag{27}$$

Combining (25), (26), and (27) gives

$$d_{it} = 1\left[\eta_3 + z_{it}\delta_{30t} + z_{i,t-1}\delta_{31t} + z_{i,t-2}\delta_{32t} + \sum_{s=1}^{T} z_{is}\zeta_s + v_{it3} > 0\right], \quad v_{it3} \mid z_i \sim \text{Normal}(0, 1 + \sigma_{3t}), \quad t = 3, \ldots, T, \tag{28}$$

where $v_{it3} \equiv a_{i3} + u_{it3}$ is a new composite error term. With regard to the error terms in the primary equation, assume

$$E(\Delta u_{it1} \mid z_i, v_{it3}) = E(\Delta u_{it1} \mid v_{it3}) = \varphi_{3t}\,v_{it3}, \quad t = 3, \ldots, T, \tag{29}$$

which, when combined with the normality assumption, gives

$$E(\Delta u_{it1} \mid z_i, d_{it} = 1) = \varphi_{3t}\,E(v_{it3} \mid z_i, d_{it} = 1) = \varphi_{3t}\,\lambda\left(\eta_3 + z_{it}\delta_{30t} + z_{i,t-1}\delta_{31t} + z_{i,t-2}\delta_{32t} + \sum_{s=1}^{T} z_{is}\zeta_s\right) \equiv \varphi_{3t}\,\lambda_{it3}, \quad t = 3, \ldots, T. \tag{30}$$

After applying first differencing to equation (1), with some abuse of notation we can write the differenced primary equation for $d_{it} = 1$ as

$$\Delta y_{it} = \rho\,\Delta y_{i,t-1} + \Delta x_{it}\beta + \varphi_{3t}\lambda_{it3} + \epsilon_{it1}, \quad E(\epsilon_{it1} \mid z_i, d_{it} = 1) = 0, \quad t = 1, \ldots, T. \tag{31}$$

Thus, the unobserved effect is removed by first differencing and $\varphi_{3t}\lambda_{it3}$ captures the selection effect. Naturally, time-constant variables drop out from the equation. The test is then performed using the following procedure:

PROCEDURE 5.1
1. For each of $t = 3, \ldots, T$, run a probit regression of $d_{it}$ on $1, z_{i1}, \ldots, z_{iT}$, $i = 1, \ldots, N$, and compute the inverse Mills ratios, $\hat\lambda_{it3}$.
2. For $d_{it} = 1$, augment the first-differenced primary equation by $\hat\lambda_{it3}$ and its interactions with time dummies and estimate the augmented equation by pooled two stage least squares or GMM using $y_{i,t-2}$ and leads and lags of $z_{it}$ as instruments for $\Delta y_{i,t-1}$ ($\Delta x_{it}$, $\hat\lambda_{it3}$ and the interaction terms should be used as their own instruments). Use the Wald test to test the hypothesis $\varphi_{31} = \ldots = \varphi_{3T} = 0$.
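To make the sample construction behind Procedure 5.1 concrete, the sketch below builds the three-period indicator $d_{it} = s_{it}\,s_{i,t-1}\,s_{i,t-2}$ for one cross-section unit and collects the first-differenced observations for which $d_{it} = 1$, together with the level $y_{i,t-2}$ that instruments $\Delta y_{i,t-1}$. The function names, the list-based data layout, and the toy numbers are hypothetical, and the sketch ignores the initial condition $y_{i0}$ and the exogenous covariates for brevity.

```python
def three_period_indicator(s):
    """Return {t: d_t} for t = 3, ..., T, where d_t = s_t * s_{t-1} * s_{t-2}.

    s is a list of 0/1 selection indicators; s[t - 1] holds period t.
    """
    T = len(s)
    return {t: s[t - 1] * s[t - 2] * s[t - 3] for t in range(3, T + 1)}


def differenced_rows(y, s):
    """Build first-differenced observations for the periods with d_t = 1.

    y[t - 1] holds y_t (None when unobserved). Each row carries the
    differenced outcome, the differenced lag, and the level y_{t-2},
    which serves as the instrument for the differenced lag.
    """
    rows = []
    for t, d in three_period_indicator(s).items():
        if d == 1:
            y_t, y_l1, y_l2 = y[t - 1], y[t - 2], y[t - 3]
            rows.append(
                {"t": t, "dy": y_t - y_l1, "dy_lag": y_l1 - y_l2, "iv": y_l2}
            )
    return rows


if __name__ == "__main__":
    # Toy unit observed in periods 1-3 and 5-7, missing in period 4.
    s = [1, 1, 1, 0, 1, 1, 1]
    y = [1.0, 1.5, 1.75, None, 2.0, 2.5, 2.25]
    print(three_period_indicator(s))
    print(differenced_rows(y, s))
```

Requiring three consecutive observed periods guarantees that $\Delta y_{it}$, $\Delta y_{i,t-1}$ and the instrument $y_{i,t-2}$ are all available for every retained observation.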

As an extension to the proposed procedure, it is possible to impose a restriction of equal variances in the selection equation and estimate equation (28) by pooled probit. Similarly, one may assume that the effect of selection is the same in all time periods and omit the interaction terms in the second-step estimation. A test for selection bias in that case is a usual t-test of the significance of the coefficient on $\hat\lambda_{it3}$. Note that for testing, the usual variance-covariance matrix should be used; there is no need to adjust for the first-step estimation.

If in some period $t - j$ (for $j = 3, \ldots, t - 1$), $y_{i,t-j}$ is observed for all cross-section units, then $y_{i,t-j}$ can be used as an additional instrument in the second-step estimation. Otherwise, if there are missing values for at least some $i$, then the observable variable is $(s_{i,t-j}\,y_{i,t-j})$, and this is not a valid instrument, since we did not account for selection in period $t - j$ when constructing $\hat\lambda_{it3}$.

Importantly, the proposed test is valid regardless of whether or not the model in (25) is correct and whether or not the normality assumption holds. All we need for testing is a reasonable proxy for the selection effect, and the correct specification of the selection term is not essential. If a selection problem is present, it should still be captured by a non-zero coefficient on the inverse Mills ratio in the differenced equation. As with the estimators discussed above, having additional variables in $z_{it}$ that are not also in $x_{it}$ helps to make the test more reliable.

When the hypothesis of no selection bias is not rejected, the pooled two stage least squares or GMM estimation of the first-differenced equation with $\Delta x_{it}$, $y_{i,t-2}$, and leads and lags of strictly exogenous variables used as instruments will produce consistent estimators. More distant lags can be used as additional instruments if observed for all cross-section units.
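The two test statistics mentioned above can be sketched in a few lines. The code below computes a Wald statistic $\hat\varphi' \hat V^{-1} \hat\varphi$ for the joint hypothesis that all selection coefficients are zero, and a two-sided p-value for the single-coefficient t-test based on the standard normal approximation. The function names and the numbers in the usage example are hypothetical; `V_hat` is assumed to be the estimated variance-covariance matrix of the coefficient estimates `phi_hat` from the second step.

```python
from statistics import NormalDist


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def wald_statistic(phi_hat, V_hat):
    """Wald statistic phi' V^{-1} phi for H0: phi = 0."""
    v_inv_phi = solve(V_hat, phi_hat)
    return sum(p * w for p, w in zip(phi_hat, v_inv_phi))


def t_test_p_value(coef, se):
    """Two-sided p-value for one coefficient (normal approximation)."""
    t = coef / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


if __name__ == "__main__":
    # Toy numbers only, not estimates from the paper.
    print(wald_statistic([1.0, 1.0], [[2.0, 1.0], [1.0, 2.0]]))
    print(t_test_p_value(2.0, 1.0))
```

Under the null, the Wald statistic is asymptotically chi-square distributed with degrees of freedom equal to the number of restrictions tested; with a single restriction it equals the squared t-statistic.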
However, if the null is rejected, Procedure 5.1 will be a valid correction procedure only if all the assumptions specified in this section are correct. Given that the model for $d_{it}$ in equation (25) is quite restrictive, Procedure 5.1 is unlikely to perform

well as a correction method. Therefore, the methodology described in the previous two sections should be used instead.

6 Empirical Application

This section illustrates the proposed methodology with an empirical example by applying the new methods to the estimation of dynamic earnings equations for females. This example is appropriate because earnings are largely determined by different historical factors and tend to be correlated over time. The data come from the Panel Study of Income Dynamics (PSID), years 1980 to 1992. The sample consists of white females who were followed over the considered period. (Footnote 6: We consider working-age women (ages 18-65) who were either household heads or wives, have completed their education and are neither self-employed nor agricultural workers. A woman was excluded from the analysis if her self-reported age exceeded the age constructed using information on the year of birth by more than two years, if her self-reported age was smaller than the constructed age by more than one year, or if she reported positive work hours and zero earnings.) Because the initial condition must be observed when estimating equation (10), we keep only those females for whom 1980 earnings are available. The final sample consists of 579 women, or 6,948 observations over the 12-year period (1981-1992). For this period, the earnings sample comprises 5,891 observations. Thus, about 15% of earnings data are missing due to non-participation. Because we define the population as women working in 1980, this exercise should be viewed as an evaluation of the effects of movement in and out of the labor force on estimated earnings equations. Such a question is of considerable interest in labor economics.

The dependent variable in the primary equation is the natural logarithm of the average annual hourly earnings, while the independent variables include age, age squared and time dummies. We assume that age is strictly exogenous and is not correlated with the

unobserved effect. This assumption implies that the mean ability of women born in different years is about the same. Our sample is restricted to women who have completed their education (i.e. years of schooling do not vary over time); hence, the effect of education is not separable from unobserved heterogeneity. Therefore, we only include education as part of the unobserved effect. Additionally, to control for unobserved heterogeneity, we include the number of children in all time periods (i.e. the number of children is assumed to belong to $z_{it}$, but not $x_{it}$).

The selection rule is for labor force participation. A woman is considered to be a participant if she reports positive work hours in a given year. When estimating selection equations, in the probit regressions in each time period we include education, age, age squared, and the number of children in all time periods, where the number of children may have a direct effect on labor force participation. Log of hourly earnings in 1980 is included depending on whether the methodology of Sections 2-4 or the methodology of Section 5 is used for the analysis.

Before applying the more advanced methods developed in Sections 2 through 4, we first estimate equation (1) using the simple approach of Section 5. From the total 1980-1992 sample we keep observations for which earnings data are available in three consecutive periods and use first differencing to remove the unobserved effect. As a result, the sample size reduces to 5,033 observations; age and education drop out from the equation. Then, we estimate the first-differenced equation by pooled instrumental variables using the log of hourly earnings in $t - 2$ as an instrument for $\Delta y_{i,t-1}$. We call this estimator the first difference instrumental variables (FD-IV) estimator.

The estimates for the log earnings equations are reported in Table 1. The first column of the table contains the estimates from FD-IV regressions without inverse Mills ratios.
The second column contains the test of selection bias in the first-differenced equation using the results in Section 5. The estimate of $\rho$ is rather similar in the two

columns; it is about 0.15-0.16 and is statistically significant at the 1% level. However, the test suggests that selection bias may be present. The null of no selection is rejected at the 8% significance level. Thus, one might conclude from the test using the FD equation that selection into the work force may be systematically related to idiosyncratic shocks to earnings.

The estimates obtained using the methods discussed in Sections 2-4 are reported in the remaining three columns of Table 1. Columns (3) and (4) show estimation results from regressions where the NLS estimator is used at the second step. Column (5) contains the estimates obtained using Procedure 4.1, which employs GMM at the second step. The estimates for the augmented log earnings equation are reported in columns (4) and (5). Based on the Wald tests of the joint significance of the selection terms, the hypothesis of no selection bias is rejected at the 5% level in both cases. Thus, we again find evidence of selection bias.

The NLS and GMM estimates of $\rho$ are very similar in all three regressions. The estimate is about 0.6 and is significant at the 1% level, which provides evidence of state dependence in earnings offers. This estimate is rather different from the one obtained using first-differencing. Interestingly, similar results were obtained in Monte Carlo simulations, where the FD-IV estimator had substantially larger biases than the NLS estimator that did not account for selection. For all coefficient estimates, standard errors are smaller when the GMM estimator is used at the second step.

Columns (3)-(5) show an estimated effect of another year of schooling of about 3%, which is statistically significant at the 1% level. We emphasize, however, that this effect is not distinguishable from unobserved heterogeneity.
Moreover, the coefficient on years of schooling in these regressions is not a true return to education because education has an additional effect on earnings through the autoregressive earnings term. The coefficients on age and age squared reveal the usual U-shaped profile, although

the corresponding estimates are less precise, particularly in the NLS regressions.

As a robustness check, we re-estimated the earnings equation using the data from years 1981-1992. The sample was restricted to only include women who reported earnings in 1981 (583 women). (Footnote 7: The cross-section sample size increased because more women were working in 1981 than in 1980.) The resulting coefficient estimates and standard errors were very similar to the ones reported in Table 1. The only noticeable change was observed for the two-step GMM estimates of the coefficients on age and age squared, which became somewhat smaller and statistically insignificant. Based on the results of the joint Wald tests, the null of no selection bias could not be rejected; however, several selection correction terms were individually significant. Specifically, in the FD-IV regression the inverse Mills ratios for years 1984, 1985 and 1991 were significant at the 5% significance level. The correction term for year 1991 was also significant at the 5% level in the two-step NLS and two-step GMM regressions. The table with detailed estimation results is available from the authors upon request.

Returning to the discussion of the estimating equation in Section 2, we note that one could also estimate the parameters using equation (11). In such a case, identification would rely on time variation in the strictly exogenous variables, age and age squared. Moreover, the autoregressive coefficient, $\rho$, would only capture the observed dynamics. In applications where there are no time-varying strictly exogenous variables in the model (i.e. $x_{it}$ is empty), the data would not provide a distinction between the observed and unobserved dynamics. (Footnote 8: We thank the anonymous referee for suggesting that we include the discussion of this issue.)

7 Conclusions

In this paper, new methods for estimating dynamic panel data models with selectivity were proposed. A distinctive feature of the new estimators is that they do not rely on

differencing when treating the unobserved heterogeneity. This feature makes it possible to avoid the weak instruments problem, which arises in the context of differencing if series are highly persistent or close to a unit root. The proposed correction is relatively simple because the method requires correcting for selection in the current period only. The errors in both selection and primary equations may be heterogeneously distributed. The errors in the selection equation may also be serially dependent, and a general form of heteroskedasticity is allowed in the primary equation. Additionally, this paper develops a simple test for sample selection bias.

The proposed methods are applied to the estimation of dynamic earnings equations for females using the Panel Study of Income Dynamics data. Evidence of selection bias is found in both the first-differenced equation and the equation obtained after backsubstitution. The NLS and GMM estimation based on the new methodology produces an estimate of the stability parameter of about 0.6, which is rather different from the estimate obtained from the instrumental variables estimation of the first-differenced equation.

The proposed correction procedure is parametric and assumes normality of the errors in the selection equation. An important topic for future research is developing a semiparametric estimator, which would not require parametric assumptions regarding the error distributions. Such an estimator can be implemented within the framework of this paper using methods similar to those considered in Semykina and Wooldridge (2010).

Appendix A

This section starts with a derivation of the variance of the GMM estimator. The derivation of the variance of the pooled NLS estimator follows by analogy. Using the notation from Section 3, let $\hat\pi_t = (\hat\eta_{t2}, \hat\psi_{1t}, \ldots, \widehat{\delta_{2t} + \psi_{tt}}, \ldots, \hat\psi_{Tt}, \hat\gamma_{t2})$, $\hat\pi = (\hat\pi_1, \ldots, \hat\pi_T)$, be the first-step estimators, and let $q_{it} = (1, z_{i1}, \ldots, z_{iT}, y_{i0})$ be the first-step vector of regressors.
Also,

denote the vector of the parameters $\theta \equiv (\rho, \beta, \eta_1, \xi_1, \ldots, \xi_T, \gamma_1, \varphi_{21}, \ldots, \varphi_{2T})$. Under the standard regularity conditions given, for example, in Wooldridge (2002, Theorem 14.1), the GMM estimator, $\hat\theta$, is consistent when $\pi$ is known. If $\hat\pi$ is a consistent estimator of $\pi$, the first stage estimation will not affect consistency of $\hat\theta$.

By definition, if $\hat\Omega$ is a consistent estimator of a positive definite matrix $\Omega$, then $\hat\Omega \xrightarrow{p} \Omega$. Also, by consistency of $\hat\theta$ and $\hat\pi$, and the weak law of large numbers, $N^{-1}\sum_{i=1}^{N} W_i(\hat\pi)'\,\nabla_\theta g_i(\hat\theta, \hat\pi) \xrightarrow{p} G$, where $G \equiv E[W_i(\pi)'\,\nabla_\theta g_i(\theta, \pi)]$ was also defined earlier. Recall that the first-order condition for the GMM estimator is as in equation (20). After normalizing by the number of observations, taking the appropriate probability limits, and expanding $N^{-1}\sum_{i=1}^{N} W_i(\hat\pi)'\,g_i(\hat\theta, \hat\pi)$ around $\theta$, we obtain

$$\begin{aligned} & G'\Omega^{-1} N^{-1/2}\sum_{i=1}^{N} W_i(\hat\pi)'\,g_i(\theta, \hat\pi) + G'\Omega^{-1} N^{-1/2}\sum_{i=1}^{N} W_i(\hat\pi)'\,\nabla_\theta g_i(\theta, \hat\pi)(\hat\theta - \theta) + o_p(1) = 0, \\ & \sqrt{N}(\hat\theta - \theta) = -C^{-1} G'\Omega^{-1} N^{-1/2}\sum_{i=1}^{N} W_i(\hat\pi)'\,g_i(\theta, \hat\pi) + o_p(1), \end{aligned} \tag{32}$$

where $C \equiv G'\Omega^{-1}G$.

Next, we need to account for the first-stage estimation of $\pi$. In equation (32), both the matrix of instruments and the function $g_i$ depend on $\hat\pi$. However, as is known, the use of generated instruments does not affect the asymptotic variance of the GMM estimator. This result follows from the conditional moment restrictions in equation (10), which imply that $E[g_i(\theta, \pi) \mid x_i, y_{i0}, s_{it} = 1] = 0$, so that $g_i(\theta, \pi)$ is uncorrelated with any function of $(x_i, y_{i0})$ conditional on $s_{it} = 1$. Therefore, the mean-value expansion of $N^{-1/2}\sum_{i=1}^{N} W_i(\hat\pi)'\,g_i(\theta, \hat\pi)$

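around $\pi$ gives

$$N^{-1/2}\sum_{i=1}^{N} W_i(\hat\pi)'\,g_i(\theta, \hat\pi) = N^{-1/2}\sum_{i=1}^{N} W_i(\pi)'\,g_i(\theta, \pi) + F\sqrt{N}(\hat\pi - \pi) + o_p(1), \tag{33}$$

where $F \equiv E[W_i(\pi)'\,\nabla_\pi g_i(\theta, \pi)]$, and $\nabla_\pi g_i(\theta, \pi)$ is a block-diagonal matrix,

$$\nabla_\pi g_i(\theta, \pi) = \begin{pmatrix} s_{i1}\,q_{i1}\,\varphi_{21}\lambda_{i12}(q_{i1}\pi_1 + \lambda_{i12}) & 0 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & s_{iT}\,q_{iT}\,\varphi_{2T}\lambda_{iT2}(q_{iT}\pi_T + \lambda_{iT2}) \end{pmatrix}. \tag{34}$$

Here we used the fact that the derivative of the inverse Mills ratio is equal to $-q_{it}\lambda_{it2}(q_{it}\pi_t + \lambda_{it2})$ [see, for example, Wooldridge 2002, p. 522]. Since $\hat\pi_t$, $t = 1, \ldots, T$, are maximum likelihood estimators, $\hat\pi$ satisfies

$$\sqrt{N}(\hat\pi - \pi) = N^{-1/2}\sum_{i=1}^{N} d_i(\pi) + o_p(1), \tag{35}$$

where $d_i(\pi) \equiv (d_{i1}(\pi_1)', \ldots, d_{iT}(\pi_T)')'$,

$$\begin{aligned} d_{it}(\pi_t) &= A_t^{-1}\{\Phi(q_{it}\pi_t)[1 - \Phi(q_{it}\pi_t)]\}^{-1}\phi(q_{it}\pi_t)\,q_{it}'\,[s_{it} - \Phi(q_{it}\pi_t)], \\ A_t &\equiv E[-H_{it}(\pi_t)], \\ H_{it}(\pi_t) &= -\{\Phi(q_{it}\pi_t)[1 - \Phi(q_{it}\pi_t)]\}^{-1}[\phi(q_{it}\pi_t)]^2\,q_{it}'q_{it}. \end{aligned} \tag{36}$$

Combining equations (32), (33), and (35), we can write

$$\sqrt{N}(\hat\theta - \theta) = -C^{-1} G'\Omega^{-1} N^{-1/2}\sum_{i=1}^{N} [W_i(\pi)'\,g_i(\theta, \pi) + F d_i(\pi)] + o_p(1), \tag{37}$$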