A Two-Regime CRC Model with Nonnegative Dependent Variable


Myoung-Jin Keay

October 23, 2012

Abstract

This paper presents a method for estimating the average treatment effect (ATE) when (1) the dependent variable is a count variable and (2) the coefficients on the covariates are random variables that are correlated with the binary treatment variable. The identifying assumptions are stated and the estimating equation based on them is derived. Simulations show that, in large samples, the proposed estimator usually has smaller bias and larger variance than linear methods. An application to fertility in Botswana is given.

Keywords: correlated random coefficient model, average treatment effect, count data, fertility

JEL Classification: C15, C25, C34, C35, J13

Department of Finance and Economics, Georgia Southern University. mkeay@georgiasouthern.edu. I am grateful to Jeff Wooldridge, Tim Vogelsang and Todd Elder for their support and guidance. I also thank Otavio Bartalotti, Valentin Verdier and other seminar participants at Michigan State University for their helpful comments.

1 Introduction

The correlated random coefficient (CRC) model, first introduced by Heckman and Vytlacil (1998), provides an alternative way to model individual heterogeneity. Since its inception, the main focus has been on clarifying the conditions for identifying the parameters of interest in the presence of CRC (Wooldridge 1997, 2003, 2005; Card 2001); more recent work extends the model to cases where the support of the dependent variable is limited. Wooldridge (2007) suggested a method for estimating average partial effects (APE) when the dependent variable is a count and the random coefficients are correlated with the variable of interest. The case considered in this article is very similar, except that the starting point is a binary variable within a counterfactual causal model. Specifically, the goal of this paper is to provide an estimation method for the average treatment effect (ATE) when the dependent variable is a count variable and all the covariates have CRCs that are correlated with the binary variable.

This paper is organized as follows. Section 2 reviews the previous discussion of the CRC model. Section 3 discusses the various CRC models proposed so far in the context of the ATE. Section 4 is the core of this article, where the ATE estimation method for the nonlinear two-regime CRC model is provided. Section 5 discusses two specification tests: the test for endogeneity is designed to detect whether the random coefficients are correlated, and the model selection test chooses between the models with and without random coefficients. The method in this article depends on distributional assumptions, and the sensitivity of the estimator's performance to them is examined by Monte Carlo simulations in Section 6. Section 7 is an empirical application of this method, in which the performance of competing estimators is compared.
Lastly, Section 8 offers concluding remarks.

2 Previous Literature

Random coefficients typically arise from a model in which an unobserved variable interacts with one or more observed variables. The random nature of the coefficients on the observed variables then calls for different approaches. Because the coefficients are random, the conventional partial effect is random as well, and the quantities of interest are usually the means of the random coefficients. In other words, we would like to estimate not the partial effect but the average partial effect. If a regressor is mean independent of the unobserved variables, then the APE of the regressor can be consistently estimated by OLS (see Amemiya,

1985). Let the unobserved heterogeneity be denoted by $q$ and the $1 \times K$ vector of exogenous variables by $x$. When $E[q \mid x] \neq 0$, even if $E[q \mid x]$ can be appropriately modeled as a function of $x$, the APE of a particular variable in $x$ cannot be consistently estimated as long as that function contains $x$ with degree one. In such cases, $q$ needs to be expressed by proxy variables that do not contain $x$ in order to estimate the APE of a variable in $x$ consistently.

Now consider a CRC model in which $E(q \mid x) \neq 0$ and proxy variables are not available:

$$y = \alpha_0 + (\alpha_1 + q_1)x + u.$$

In the CRC context, no correlation means mean independence, not the usual orthogonality. Assume for now that the regressor of interest is exogenous, i.e., orthogonal to the structural error. The model then has the regressor itself inside the composite error, $q_1 x + u$. Unlike in the mean independent case, this composite error does not vanish under the conditional expectation operator, which renders the regressor effectively endogenous. Heckman and Vytlacil (1998) and Wooldridge (2003) have shown that, under some assumptions, the instrumental variables method can be used to estimate $\alpha_1$ consistently.

In sum, when the random coefficients are uncorrelated, OLS is consistent under mean independence of $q$ conditional on the regressor. In CRC models, where mean independence is violated, even a regressor that is independent of the structural error $u$ in the population effectively becomes endogenous because of the random part of the coefficient. In what follows, we assume for simplicity that the regressor is independent of the structural error. Restrictive as this assumption might seem, the solutions for the CRC usually work in the nonindependent cases as well (see Wooldridge, 2010). It must be noted that, given a particular regressor $x$, the coefficients on other regressors can be both random and correlated with $x$.
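The endogeneity created by a correlated random coefficient can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the parameter values, the normal distributions, and the linear form $E[q_1 \mid x] = c(x - \mu)$ are hypothetical choices and are not taken from the paper. In this particular design the OLS slope converges to $\alpha_1 + c\mu$ rather than $\alpha_1$, even though $u$ is independent of $x$.

```python
import random

def ols_slope(xs, ys):
    """Slope from a simple OLS regression of ys on xs (with intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate_slope(c, n=100_000, seed=0):
    """Draw y = a0 + (a1 + q1) x + u with E[q1 | x] = c (x - mu); return the OLS slope."""
    rng = random.Random(seed)
    a0, a1, mu = 0.5, 1.0, 2.0  # hypothetical values, not from the paper
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(mu, 1.0)
        q1 = c * (x - mu) + rng.gauss(0.0, 0.5)  # mean-zero random part of the slope
        u = rng.gauss(0.0, 1.0)                  # structural error, independent of x
        xs.append(x)
        ys.append(a0 + (a1 + q1) * x + u)
    return ols_slope(xs, ys)

slope_uncorrelated = simulate_slope(c=0.0)  # q1 mean independent of x: slope near a1
slope_correlated = simulate_slope(c=0.5)    # q1 correlated with x: slope near a1 + c * mu
```

With `c = 0` the random coefficient is mean independent of the regressor and OLS recovers the mean slope; with `c = 0.5` the same regression is shifted away from it, which is the situation the IV and control function methods discussed in this section are meant to handle.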
The first case to consider is the model in which the CRC with respect to a specific regressor is only on that regressor, i.e., the coefficients on all other covariates are assumed to be constant. The problem in this case is that the usual IV conditions are inadequate: even if mean independence between the IV and the structural error and between the IV and the CRC is established, orthogonality between that IV and the composite error is not guaranteed. For the IV to be effective, it has to satisfy the additional condition that $E[q_1 x \mid z]$ is constant (Wooldridge, 2003). [1]

[1] Although this condition is for IV estimation in CRC models, it is still needed for a random coefficient that is uncorrelated with the regressor. Even if $E[q_1 \mid x] = 0$, $E[q_1 x \mid z]$ is not necessarily constant. Again, no correlation in CRC is about mean independence, not orthogonality.

Although this condition is not very strong, its

validity may be questioned when $x$ is not continuous (Card, 2001; Wooldridge, 2005). For cases where the conditions for IV are not met, there is another method that uses the control function approach with stronger assumptions. Heckman (1976) provides a solution for a binary regressor, whereas Garen (1984) does so for a continuous one. Consider the equation

$$y_1 = \eta_1 + z_1\delta_1 + a_1 y_2 + y_2 z_1\gamma_1 + u_1, \qquad (1)$$

where $z_1$ is a $1 \times L_1$ vector of exogenous variables and $\delta_1$ and $\gamma_1$ are $L_1 \times 1$ coefficient vectors. The convention is that Greek letters denote constants and Roman letters denote random variables. Regressors may or may not have random coefficients, which may or may not be correlated with the structural error. The variables $z_1$, which have constant coefficients and no correlation with $u_1$, are called exogenous variables throughout this article. Any variable that is correlated with the structural error, or at least has a correlated random coefficient, is called an endogenous variable; in the above equation, for instance, the endogenous variable $y_2$ is not only correlated with $u_1$ but also has CRC. One can also enrich the model by introducing interaction terms between exogenous and endogenous variables, such as $y_2 z_1$ in equation (1).

Let us impose some restrictions: suppose that $y_2$ is a binary endogenous variable and that the coefficient on the interaction term, $\gamma_1$, is constant. Then this model becomes the Linear Full Endogenous Switching Regression (LFES) model (Keay, 2011), with two regimes between which an individual can switch. In this model all the coefficients in each regime are nonrandom. For the LFES model it is also possible to introduce CRC on the exogenous variables in each equation (Wooldridge, 2007). Alternatively, one can equivalently construct such a two-regime CRC model, i.e.,
a two-regime regression model in which the covariates in each equation have random coefficients correlated with $y_2$, by directly introducing CRC not only on the endogenous variable but also on the exogenous variables and the interaction. This gives an equivalence result: the two-regime CRC model can be constructed by putting CRC on all regressors in the equation, including the interaction terms.

The discussion so far has concerned models with a continuous structural error and hence a continuous dependent variable. In this paper I provide a model in which the dependent variable takes nonnegative values or, more specifically, natural numbers. Such cases arise very often in practice, for example when the dependent variable is commuting frequency (Terza, 1998) or the number of children (Keay, 2011). In addition to the nonlinearity in the first-stage binary choice, the nonlinearity in the dependent variable of the structural equation is taken into account. Although Terza (2009) has already considered such a nonlinear model with two regimes without

CRC, its extension with CRC on all exogenous variables has not been proposed so far. In the following, the former model is called the Nonlinear Full Endogenous Switching Regression (NFES) model and the latter the Nonlinear Two-regime CRC (NTCRC) model. The linear version of the two-regime CRC model, already proposed by Wooldridge (2007), is called the Linear Two-regime CRC (LTCRC) model. In what follows, the Poisson distribution is used to model the nonnegative response in the structural model. A natural question is whether all the results for the linear model remain valid to the same extent. The model itself does not change much, although some quantities of interest will not be identified, as discussed in the following sections.

The problem with the NFES approach is that the intercept term inside the exponential function in the structural equation is not identified on its own; instead, Terza (2009) estimates the average treatment effect for the nonlinear model, and the same approach is employed for the two-regime CRC model in this paper. As mentioned above, CRC on a binary endogenous variable yields a two-regime switching regression model. The objective of this paper is to set out an NTCRC model by bringing CRC once again to the two-regime nonlinear switching model, in which the average treatment effect is identified. Incidentally, this is equivalent to the model in which all the variables, including the interaction, have CRC. The above discussion is summarized in the analogy scheme in Figure 1.

[Figure 1: Analogy Structure. Starting from the linear and the nonlinear model with interaction, CRC on y gives LFES and NFES, respectively; CRC on z applied to these, or CRC on all variables applied to the starting models, gives LTCRC and NTCRC.]

3 Various Models of CRC

3.1 Continuous Endogenous Variable

Let the variable with which the random coefficient is correlated be denoted by $w$. By the unobserved heterogeneity argument, $w$ is usually regarded as endogenous, which is

also the case here. The CRC can be either only on $w$ or on all other variables as well. In the population regression model, $w$ and the other covariates can be interacted, which creates several possible situations: CRC only on $w$ with no interaction, CRC only on $w$ with interaction, CRC on all variables with no interaction, and CRC on all variables with interaction. Consider the regression model

$$y = x\beta + \tau w + \gamma q + q x\delta + u,$$

where $x$ is a $1 \times K$ vector of exogenous variables and $\beta$ and $\delta$ are $K \times 1$ vectors. The variables $q$ and $w$ are assumed to be scalars. The former is an unobserved variable that is correlated with the latter and interacts with $x$. One characteristic of this model is that $w$ is correlated with the random coefficients on $x$, i.e., $\beta + q\delta$, and at the same time is an endogenous variable in the traditional sense because it is correlated with the composite error $\gamma q + u$. If an interaction term between $x$ and $w$ is included in this model, and if that interaction term is in turn interacted with the unobserved variable $q$, then, given that $w$ is binary, the resulting model is the two-regime CRC model discussed below:

$$y = x(\beta + \delta_0 q) + w(\tau + \delta_1 q) + wx(\kappa + \delta_2 q) + \gamma q + u.$$

This is how a random coefficient model arises in which the coefficients on the exogenous variables are correlated with an endogenous variable. The model turns out to be a very good description of education-fertility behavior, which is discussed in later sections. In addition, it is very general in that many previous models can be treated as its special cases. When $w$ is an endogenous binary variable, the restriction $\delta_0 = \delta_2 = 0$ turns the model into an ordinary two-regime switching model, and the additional restriction $\delta_1 = \kappa = 0$ turns it into the Endogenous Treatment Model of Terza (1998, 2009). There are many other possibilities that are nonetheless special cases of the model given above.
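To make the two-regime reading of the general model explicit, one can evaluate it at the two values of a binary $w$. The following is a direct substitution into the general model stated above, not an additional result from the paper:

$$w = 0: \quad y = x(\beta + \delta_0 q) + \gamma q + u,$$

$$w = 1: \quad y = \tau + x\big(\beta + \kappa + (\delta_0 + \delta_2) q\big) + (\gamma + \delta_1) q + u.$$

In each regime the slope on $x$ is random through $q$, and because $q$ is correlated with $w$, these random slopes are correlated with the regime indicator. Setting $\delta_0 = \delta_2 = 0$ makes the slopes constant within each regime ($\beta$ and $\beta + \kappa$), and additionally setting $\delta_1 = \kappa = 0$ leaves only the intercept shift $\tau$ between regimes.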
Let us focus on a continuous $w$ here and defer the binary case to subsequent sections. With a count dependent variable, suppose the conditional expectation can be modeled with an exponential link function. This approach explicitly treats the dependent variable of the structural model as a count variable that does not take negative values. As the simplest case, consider a model with CRC only on $w$ and no interaction:

$$E(y_1 \mid z, w, a_1, r_1) = \exp(x\delta_1 + a_1 w + r_1)$$

$$w = z\delta_2 + v_2,$$

where $x$ and $z$ are $1 \times L_1$ and $1 \times L$ vectors of exogenous variables, with $x \subset z$, that are independent of all the random variables generated in the model. The coefficient vectors $\delta_1$ and $\delta_2$ are $L_1 \times 1$ and $L \times 1$, respectively. Also assume that

$$a_1 = \alpha_1 + d_1$$

$$d_1 = \psi v_2 + e_d, \qquad v_2 \perp e_d$$

$$r_1 = \theta v_2 + e_r, \qquad v_2 \perp e_r.$$

In the first equation, $\alpha_1 = E[a_1]$ and $d_1 = a_1 - \alpha_1$. The second and third equations are essentially the linear projections of each random variable on $v_2$. However, to facilitate identification we need the stronger condition that the components of these orthogonal decompositions are not only uncorrelated but also independent. The regressor $w$ suffers from endogeneity unless $\psi = \theta = 0$. Along with these assumptions, we impose multivariate normality on the random variables $(d_1, r_1, v_2)$. Under these assumptions, Wooldridge (2007) derived the conditional mean function

$$E(y_1 \mid z, v_2) = \exp\left(x\delta_1 + \alpha_1 w + \psi v_2 w + \theta v_2 + \frac{\sigma_r^2 + 2\sigma_{dr} w + \sigma_d^2 w^2}{2}\right).$$

Although the above equation is estimable once $v_2$ is estimated in the first-stage regression, the semi-elasticity of $w$, i.e., $\alpha_1$, is not identified because of the fraction term, and this remains the case even if $w$ is exogenous, where $\psi = 0$ and $\theta = 0$. To identify $\alpha_1$, the random coefficient on $w$ must be uncorrelated not only with $w$ but also with $r_1$. Although the semi-elasticity is not identified, one can still identify the APE of $w$ over $z$, $w$ and $v_2$. From the above equation, the partial effect of $w$ is

$$\frac{\partial E(y_1 \mid z, w, v_2)}{\partial w} = \exp\left(x\delta_1 + \alpha_1 w + \psi v_2 w + \theta v_2 + \frac{\sigma_r^2 + 2\sigma_{dr} w + \sigma_d^2 w^2}{2}\right)\left(\alpha_1 + \sigma_{dr} + \psi v_2 + \sigma_d^2 w\right).$$

By taking the average over the whole population, the APE of $w$ is identified. Wooldridge (2007) also extends the discussion to the case where the coefficients on the other covariates are correlated with $w$ as well. The results are essentially the same: the APE is identified although the semi-elasticity is not.

3.2 Binary Endogenous Variable

Recall Figure 1.
It should be noted that we are already familiar with models with CRC only on $y$, which are equivalent to the two-regime endogenous switching regression (see Maddala,

1986). For the linear and nonlinear cases, one can use the correction methods proposed by Heckman (1978) and Terza (1998), respectively. As an extension of the linear model, the model with CRC not only on the endogenous variable but also on all other variables, i.e., the Linear Two-regime CRC model (hereafter LTCRC), was explored by Wooldridge (2007). Consider the regression model

$$y_1 = x d_1 + a_1 y_2 + x y_2 g_1 + u,$$

where $x$ is a $1 \times K$ vector of exogenous covariates and $d_1$ and $g_1$ are $K \times 1$ vectors of CRCs. We also have a vector of all exogenous variables $z$ such that $x \subset z$. As in Keay (2011), it is assumed without loss of generality that $x$ is demeaned. Wooldridge (2007) shows that under the assumptions

$$E(a_1 \mid z, v_2) = \alpha_1 + \phi_1 v_2$$

$$E(d_1 \mid z, v_2) = \delta_1 + \psi_1 v_2$$

$$E(g_1 \mid z, v_2) = \xi_1 + \omega_1 v_2,$$

the estimating equation is

$$E(y_1 \mid z, y_2) = x\delta_1 + \alpha_1 y_2 + y_2 x\xi_1 + \rho_1 h_2(y_2, z\delta_2) + \phi_1 h_2(y_2, z\delta_2) y_2 + h_2(y_2, z\delta_2) x\psi_1 + h_2(y_2, z\delta_2) y_2 x\omega_1,$$

where

$$h_2(y_2, z\delta_2) = y_2\lambda(z\delta_2) - (1 - y_2)\lambda(-z\delta_2)$$

and $\lambda(\cdot)$ is the inverse Mills ratio. From this model and estimating equation, one can identify the ATE of $y_2$ on $y_1$, i.e., $\alpha_1$. The next section presents the main contribution of this article, the Nonlinear Two-regime CRC (NTCRC) model and its estimation method.

4 Nonlinear Two-Regime CRC Model

4.1 Model

In what follows we allow the variable of interest $w$ to be binary, which calls for an analysis of the ATE under the counterfactual framework. In this setting, the first-stage binary selection equation is assumed to follow a probit model, which makes the estimation more efficient. For continuous $w$, the average of the semi-elasticity of $w$, i.e., $\alpha_1$, was not identified, while

the APE was. Analogously, although the average semi-elasticity of $w$ is not identified, the ATE will be. Consider a model stated in terms of the conditional expectation functions

$$E(y_0 \mid z, w, a_0, b_0, a_1, b_1) = E(y_0 \mid x, a_0, b_0) = \exp(a_0 + x b_0)$$

$$E(y_1 \mid z, w, a_0, b_0, a_1, b_1) = E(y_1 \mid x, a_1, b_1) = \exp(a_1 + x b_1) \qquad (2)$$

$$w = 1[z\gamma + v > 0],$$

where the $a_g$ are scalar errors, $x$ is a $1 \times K$ vector of covariates, and the $b_g$ are $K \times 1$ random coefficient vectors. The vector $z$ contains all the available exogenous variables, so that $x \subset z$. For the random coefficients, the quantities of interest are the means of the random variables. In what follows, we use the notation $\mathrm{var}(x) \equiv \sigma^2(x)$ and $\mathrm{cov}(x, y) \equiv \sigma(x, y)$. The regime subscript is suppressed whenever it is obvious. The assumptions for the discussion that follows are listed below.

Assumptions

1. Unconfoundedness: As already stated in equation (2), $w$ is redundant conditional on $a$ and $b$, which summarize all the information about the determined regime, i.e., $E(y_1 \mid z, w, a_0, b_0, a_1, b_1) = E(y_1 \mid x, a_1, b_1)$, and likewise for Regime 0. Although the main focus is on $w$, this equation also assumes the same for the unobserved heterogeneity of the other regime.

2. Multivariate Normality: For each regime in equation (2), we have

$$a = \alpha + e$$

$$b = \beta + d,$$

where $\alpha$ and $\beta$ are the means of $a$ and $b$. The vector $(e_0, d_0, e_1, d_1, v)$, whose dimension is $(2K + 3) \times 1$, follows a multivariate normal distribution, i.e., $(e_0, d_0, e_1, d_1, v) \sim N(0, V)$, where $V$ is the appropriate variance-covariance matrix. Let the linear projections of $e$ and $d$ on $v$ be

$$e = \sigma(e, v)v + \epsilon$$

$$d = \sigma(d, v)v + \delta.$$

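Note that $b$, $d$, $\delta$ and $\sigma(d, v)$ are $K \times 1$ vectors. Specifically, $\sigma(d, v) = [\sigma(d_1, v), \ldots, \sigma(d_K, v)]'$, where $b_j = \beta_j + d_j$ is the $j$-th element of $b$. Since $v$ is orthogonal to $\epsilon$ and $\delta$, they are independent under multivariate normality. Another implication of the above assumption is that the selection equation follows a probit model.

3. Conditional Independence: $v$ and $\epsilon + x\delta$ are independent conditional on $z$ and $w$, i.e., $v \perp \epsilon + x\delta \mid z, w$.

Under these assumptions, one can obtain the following lemma.

Lemma 1. Under Assumptions 2 and 3,

$$E[\exp(e_1 + x d_1) \mid z, w = 1] = \exp\left[\left(\sigma^2(e_1) + \sum_{j=1}^{K}\sigma^2(d_{1j})x_j^2 + 2\sum_{j=1}^{K}\sigma(e_1, d_{1j})x_j + \sum_{j=1}^{K}\sum_{r \neq j}\sigma(d_{1j}, d_{1r})x_j x_r\right)\Big/2\right] \frac{\Phi\left(z\gamma + \sigma(e_1, v) + \sum_{j=1}^{K}\sigma(d_{1j}, v)x_j\right)}{\Phi(z\gamma)}$$

for Regime 1, and

$$E[\exp(e_0 + x d_0) \mid z, w = 0] = \exp\left[\left(\sigma^2(e_0) + \sum_{j=1}^{K}\sigma^2(d_{0j})x_j^2 + 2\sum_{j=1}^{K}\sigma(e_0, d_{0j})x_j + \sum_{j=1}^{K}\sum_{r \neq j}\sigma(d_{0j}, d_{0r})x_j x_r\right)\Big/2\right] \frac{\Phi\left(-z\gamma - \sigma(e_0, v) - \sum_{j=1}^{K}\sigma(d_{0j}, v)x_j\right)}{\Phi(-z\gamma)}$$

for Regime 0.

Proof. Consider Regime 1 only. For notational simplicity the regime subscripts are suppressed. The derivation for Regime 0 is almost identical. Note that

$$E[\exp(e + xd) \mid z, w = 1] = E\left[\exp\big(\sigma(e, v)v + \epsilon + x\sigma(d, v)v + x\delta\big) \mid z, w = 1\right]$$

$$= E\left[\exp\big((\sigma(e, v) + x\sigma(d, v))v\big)\exp(\epsilon + x\delta) \mid z, w = 1\right]$$

$$= E\left[\exp\big((\sigma(e, v) + x\sigma(d, v))v\big) \mid z, w = 1\right] E[\exp(\epsilon + x\delta) \mid z, w = 1]$$

$$= E\left[\exp\big((\sigma(e, v) + x\sigma(d, v))v\big) \mid z, w = 1\right] E[\exp(\epsilon + x\delta) \mid z]$$

$$= \frac{\Phi\big(z\gamma + \sigma(e, v) + x\sigma(d, v)\big)}{\Phi(z\gamma)} \exp\left(\frac{(\sigma(e, v) + x\sigma(d, v))^2}{2}\right) E[\exp(\epsilon + x\delta) \mid z].$$

For the second term on the right-hand side,

$$(\sigma(e, v) + x\sigma(d, v))^2 = \left(\sigma(e, v) + \sum_{j=1}^{K}\sigma(d_j, v)x_j\right)^2 = \sigma(e, v)^2 + 2\sigma(e, v)\sum_{j=1}^{K}\sigma(d_j, v)x_j + \sum_{j=1}^{K}\sigma(d_j, v)^2 x_j^2 + \sum_{j=1}^{K}\sum_{r \neq j}\sigma(d_j, v)\sigma(d_r, v)x_j x_r.$$

Now consider the third term on the right-hand side. From the normality of $\epsilon + x\delta$, which is guaranteed by the multivariate normal assumption,

$$E[\exp(\epsilon + x\delta) \mid z] = \exp\big(\mathrm{var}[\epsilon + x\delta \mid x]/2\big),$$

where

$$\mathrm{var}[\epsilon + x\delta \mid x] = \sigma^2(\epsilon) + \sigma^2(x\delta) + 2\sigma(\epsilon, x\delta) = \sigma^2(\epsilon) + \sum_{j=1}^{K}\sigma^2(\delta_j)x_j^2 + \sum_{j=1}^{K}\sum_{r \neq j}\sigma(\delta_j, \delta_r)x_j x_r + 2\sum_{j=1}^{K}\sigma(\epsilon, \delta_j)x_j.$$

Because $v$ has unit variance in the probit selection equation, the linear projections imply $\sigma^2(e) = \sigma(e, v)^2 + \sigma^2(\epsilon)$, $\sigma^2(d_j) = \sigma(d_j, v)^2 + \sigma^2(\delta_j)$, $\sigma(e, d_j) = \sigma(e, v)\sigma(d_j, v) + \sigma(\epsilon, \delta_j)$ and $\sigma(d_j, d_r) = \sigma(d_j, v)\sigma(d_r, v) + \sigma(\delta_j, \delta_r)$. Collecting these terms gives the stated result.

Remark. Terza (1998) used the multivariate normal property directly to solve for the conditional expectation, i.e., $E[\exp(\epsilon) \mid z, v] = \exp\left(\rho\sigma v + \frac{1}{2}\sigma^2(1 - \rho^2)\right)$ was derived under joint normality of $\epsilon$ and $v$. Although that approach is also applicable here, it does not give a convenient expression for estimating the average treatment effect. In the above derivation I used the linear projections of $e$ and $d$ on $v$, which give an estimating equation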

whose coefficients can be used directly to find the ATE. As discussed below, structural parameters such as $\beta$ are not separately identified, but the coefficients on $x$ and $x^2$ in the estimating equation derived from the linear projection coincide with the corresponding coefficients in the estimating equation for the ATE.

Now let us derive an estimating equation for the model in (2). Note that by Assumptions 1 and 2,

$$E[y \mid z, w, a_0, b_0, a_1, b_1] = (1 - w)E[y_0 \mid z, w, a_0, b_0, a_1, b_1] + wE[y_1 \mid z, w, a_0, b_0, a_1, b_1]$$

$$= (1 - w)E[y_0 \mid x, a_0, b_0] + wE[y_1 \mid x, a_1, b_1]$$

$$= (1 - w)\exp\big(\alpha_0 + e_0 + x(\beta_0 + d_0)\big) + w\exp\big(\alpha_1 + e_1 + x(\beta_1 + d_1)\big),$$

so that

$$E[y \mid z, w] = (1 - w)\exp(\alpha_0 + x\beta_0)E[\exp(e_0 + x d_0) \mid z, w = 0] + w\exp(\alpha_1 + x\beta_1)E[\exp(e_1 + x d_1) \mid z, w = 1].$$

By Lemma 1,

$$E[y \mid z, w] = w\exp\left[\left(2\alpha_1 + \sigma^2(e_1) + \sum_{j=1}^{K}\sigma^2(d_{1j})x_j^2 + 2\sum_{j=1}^{K}[\beta_{1j} + \sigma(e_1, d_{1j})]x_j + \sum_{j=1}^{K}\sum_{r \neq j}\sigma(d_{1j}, d_{1r})x_j x_r\right)\Big/2\right] \frac{\Phi\left(z\gamma + \sigma(e_1, v) + \sum_{j=1}^{K}\sigma(d_{1j}, v)x_j\right)}{\Phi(z\gamma)}$$

$$\quad + (1 - w)\exp\left[\left(2\alpha_0 + \sigma^2(e_0) + \sum_{j=1}^{K}\sigma^2(d_{0j})x_j^2 + 2\sum_{j=1}^{K}[\beta_{0j} + \sigma(e_0, d_{0j})]x_j + \sum_{j=1}^{K}\sum_{r \neq j}\sigma(d_{0j}, d_{0r})x_j x_r\right)\Big/2\right] \frac{\Phi\left(-z\gamma - \sigma(e_0, v) - \sum_{j=1}^{K}\sigma(d_{0j}, v)x_j\right)}{\Phi(-z\gamma)}. \qquad (3)$$

The above equation identifies only $\sigma^2(d_j)$, $\sigma(e, v)$ and $\sigma(d_j, v)$ separately, and the average semi-elasticity, i.e., $\beta$, is not identified because of the nuisance parameters $\sigma(e, d_j)$. The equation can be estimated in two steps using the probit model for the selection equation: given the estimated index $z\gamma$, it can be estimated either by NLS or by quasi-ML under the Poisson distribution. Alternatively, all parameters can be estimated simultaneously by a single-step method. See Keay (2011) for a detailed discussion of the estimation procedures and their asymptotic distributions. Since the structural parameters $\alpha$ and $\beta$ are not separately identified, our interest should lie in average treatment effects.
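The Regime 1 expression of Lemma 1, which drives the first term of equation (3), can be checked numerically. The sketch below does this by Monte Carlo for a single covariate ($K = 1$). All parameter values are hypothetical, the function names are illustrative, and the projection errors are drawn independently of each other purely for convenience; none of this is taken from the paper.

```python
import math
import random

def norm_cdf(t):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def lemma1_closed_form(x, zg, s_ev, s_dv, var_e, var_d, cov_ed):
    """Regime-1 expression of Lemma 1 for a single covariate (K = 1)."""
    log_normal_part = math.exp((var_e + var_d * x ** 2 + 2.0 * cov_ed * x) / 2.0)
    correction = norm_cdf(zg + s_ev + s_dv * x) / norm_cdf(zg)
    return log_normal_part * correction

def lemma1_monte_carlo(x, zg, s_ev, s_dv, sd_eps, sd_delta, n=200_000, seed=1):
    """Average exp(e + x d) over the draws selected into Regime 1 (zg + v > 0)."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)                  # probit selection error, unit variance
        e = s_ev * v + rng.gauss(0.0, sd_eps)    # e = sigma(e, v) v + epsilon
        d = s_dv * v + rng.gauss(0.0, sd_delta)  # d = sigma(d, v) v + delta
        if zg + v > 0:
            total += math.exp(e + x * d)
            count += 1
    return total / count

# Hypothetical values chosen only for illustration.
x, zg = 0.5, 0.2
s_ev, s_dv, sd_eps, sd_delta = 0.3, 0.4, 0.5, 0.5
var_e = s_ev ** 2 + sd_eps ** 2
var_d = s_dv ** 2 + sd_delta ** 2
cov_ed = s_ev * s_dv  # epsilon and delta are drawn independently here

closed = lemma1_closed_form(x, zg, s_ev, s_dv, var_e, var_d, cov_ed)
simulated = lemma1_monte_carlo(x, zg, s_ev, s_dv, sd_eps, sd_delta)
```

The two numbers should agree up to simulation noise. Setting `s_ev` and `s_dv` to zero makes the correction factor equal to one, which is the restriction examined by the endogeneity test in Section 5.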

4.2 Identification of ATE

Note that the ATE is $E[y_1 - y_0]$. One of the easiest ways to identify it is to use the law of iterated expectations, i.e., $E[y_1 - y_0] = E\{E[y_1 - y_0 \mid x]\}$, as long as the expectation conditional on $x$ can be derived. To that end, the following lemma is derived.

Lemma 2. Under the assumptions given above,

$$E[\exp(e + xd) \mid x] = \exp\left[\left(\sigma^2(e) + \sum_{j=1}^{K}\sigma^2(d_j)x_j^2 + 2\sum_{j=1}^{K}\sigma(e, d_j)x_j + \sum_{j=1}^{K}\sum_{r \neq j}\sigma(d_j, d_r)x_j x_r\right)\Big/2\right].$$

Proof. Note that

$$E(e + xd \mid x) = E(e \mid x) + E(xd \mid x) = 0$$

$$\mathrm{var}(e + xd \mid x) = \mathrm{var}(e \mid x) + \mathrm{var}(xd \mid x) + 2\,\mathrm{cov}(e, xd \mid x) = \sigma^2(e) + \sum_{j=1}^{K}\sigma^2(d_j)x_j^2 + \sum_{j=1}^{K}\sum_{r \neq j}\sigma(d_j, d_r)x_j x_r + 2\sum_{j=1}^{K}\sigma(e, d_j)x_j.$$

For any fixed value of $x$, $e + xd$ is normally distributed because it is a linear combination of multivariate normal random variables, as stated in Assumption 2. Given the mean and variance of this normal random variable, the mean of the corresponding log-normal variable follows immediately.

The ATE conditional on $x$ is

$$E[y_1 - y_0 \mid x] = \exp(\alpha_1 + x\beta_1)E[\exp(e_1 + x d_1) \mid x] - \exp(\alpha_0 + x\beta_0)E[\exp(e_0 + x d_0) \mid x].$$

Thus, from the above lemma,

$$E[y_1 - y_0 \mid x] = \exp\left[\left(2\alpha_1 + \sigma^2(e_1) + \sum_{j=1}^{K}\sigma^2(d_{1j})x_j^2 + 2\sum_{j=1}^{K}[\beta_{1j} + \sigma(e_1, d_{1j})]x_j + \sum_{j=1}^{K}\sum_{r \neq j}\sigma(d_{1j}, d_{1r})x_j x_r\right)\Big/2\right]$$

$$\quad - \exp\left[\left(2\alpha_0 + \sigma^2(e_0) + \sum_{j=1}^{K}\sigma^2(d_{0j})x_j^2 + 2\sum_{j=1}^{K}[\beta_{0j} + \sigma(e_0, d_{0j})]x_j + \sum_{j=1}^{K}\sum_{r \neq j}\sigma(d_{0j}, d_{0r})x_j x_r\right)\Big/2\right]. \qquad (4)$$

Note the similarity of this equation to the estimating equation (3): except for the correction functions, the expressions inside the exponential functions are identical, which makes the

identification of ATE possible. The estimator is the sample average of equation (4) over the values of $x$. The fact that the individual parameters are not identified poses no problem for identification of the ATE. The asymptotic distribution of this ATE estimator is essentially the same as that presented in Terza (2008).

5 Specification Tests

5.1 Tests for Endogeneity

In the previous section, we derived the estimating equation (3) and the ATE conditional on covariates (4). No restrictions were imposed on these equations, and estimating the unrestricted model, in which the number of identifiable composite parameters is no less than $(K + 4)(K + 1)/2$, might cause difficulty in numerical optimization. One can apply the Lagrange Multiplier (LM) test to test for the presence of correlated random coefficients. Another test, the Variable Addition Test (VAT), is asymptotically equivalent to the LM test but easier to apply. This test constructs a conditional mean by adding appropriately defined variables, so that the score of the resulting likelihood function under the restriction is the same as the one used in the LM test (see Wooldridge, 20XX). The actual test is a Wald test on the significance of the coefficients on the added variables.

Revisit the estimating equation (3). Inside the exponential function we have each covariate, the square of each covariate, and their cross products. Let the vector of these functions of the covariates, together with a constant, be denoted by $\tilde{x}$, and rewrite the estimating equation for Regime 1 as

$$E[y \mid z, w = 1] = \exp(\tilde{x}\theta_1)\frac{\Phi(z\gamma + \theta_2 + x\theta_3)}{\Phi(z\gamma)}, \qquad (5)$$

where $\theta_2 = \sigma(e_1, v)$ and $\theta_3 = \sigma(d_1, v)$. As already mentioned, $\theta_2$ is a scalar, $\theta_3$ is a $K \times 1$ vector and $\theta_1$ is a $(K + 2)(K + 1)/2 \times 1$ vector. The estimating equation for Regime 0 is the same except for the minus signs inside $\Phi(\cdot)$ in the correction function. To perform the VAT, we need to construct a conditional mean function whose score is identical to that of equation (5) under the restriction.
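The structure of the Regime 1 equation (5) and of the ATE estimator based on equation (4) can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the ordering of the expanded covariate vector (constant, levels, squares, cross products) is an arbitrary choice, and the coefficient vectors are taken as given rather than estimated.

```python
import math
from itertools import combinations

def norm_cdf(t):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def expand_covariates(x):
    """Return [1, x_j ..., x_j^2 ..., x_j * x_r for j < r ...] for a covariate list x."""
    squares = [xj * xj for xj in x]
    crosses = [xj * xr for xj, xr in combinations(x, 2)]
    return [1.0] + list(x) + squares + crosses

def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def regime1_mean(x, zg, theta1, theta2, theta3):
    """Equation (5): exp(x_tilde theta1) * Phi(zg + theta2 + x theta3) / Phi(zg)."""
    index = dot(expand_covariates(x), theta1)
    return math.exp(index) * norm_cdf(zg + theta2 + dot(x, theta3)) / norm_cdf(zg)

def ate_estimate(xs, theta1_regime1, theta1_regime0):
    """Sample average of the difference of the two exponential terms, as in (4)."""
    diffs = []
    for x in xs:
        xt = expand_covariates(x)
        diffs.append(math.exp(dot(xt, theta1_regime1)) - math.exp(dot(xt, theta1_regime0)))
    return sum(diffs) / len(diffs)
```

For $K$ covariates the expanded vector has $1 + 2K + K(K-1)/2 = (K + 2)(K + 1)/2$ elements, matching the dimension of $\theta_1$ stated above. The ATE function uses only the coefficients inside the exponential functions and drops the correction factors, which is why the individual structural parameters do not need to be identified.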
Let the restriction, or equivalently the null hypothesis, be $H_0: \theta_2 = \theta_3 = 0$. In other words, there is no correlation between the selection error $v$ and any of the other errors in the structural equations. An LM test can be applied that does not require estimation of the complicated unrestricted model. The VAT is a device that facilitates this LM test by using

an auxiliary regression for Regime 1,

$$\exp\big(\tilde{x}\theta_1 + \theta_2\lambda(z\gamma) + \lambda(z\gamma)x\theta_3\big), \qquad (6)$$

where $\lambda(\cdot)$ is the inverse Mills ratio. The auxiliary regression for Regime 0 has $-z\gamma$ in place of $z\gamma$. It can be easily verified that the likelihood and score functions derived from (6) under the restrictions are the same as those from (5). Therefore the score tests on the significance of the coefficients in the above auxiliary regression are equivalent to the LM tests on (5). Thus the Wald tests for (6) give a simple and asymptotically equivalent way to test the null without estimating the unrestricted model. In the actual test, $\lambda(z\gamma)$ has to be estimated through the first-stage regression.

5.2 Model Selection Test

What we have considered above is whether the random coefficients, taken as given, are correlated with the selection error. Even when the null hypothesis is not rejected, this does not warrant reverting to the Nonlinear Full Endogenous Switching (NFES) Regression of Terza (1998). In the current model, the presence of CRC is assumed in both $a$ and $b$ in equation (2). As a natural consequence, one also assumes multivariate normality between those errors and $v$. By contrast, in NFES the coefficient $b$ is assumed to be constant and the multivariate normal joint error distribution is assumed only between $a$ and $v$. It should be emphasized that the VAT tests for endogeneity under the assumption that $b$ is random, and thus a failure to reject the null hypothesis does not imply acceptance of NFES. A practical consequence of this observation is that one has to include the squares and cross products of the covariates, which are not included in NFES, even when there turns out to be no correlation between those random coefficients. How, then, can one perform a test in which the alternative is the NFES model? The NFES model assumes that the coefficient $b$ in equation (2) is constant and thus $d = 0$, as well as $\sigma^2(d_j) = \sigma(\cdot, d_j) = 0$.
From equation (3), the null hypothesis is the zero restriction on the coefficients of the square and cross-product terms inside the exponential functions and on the coefficients of $x_j$ inside the correction functions.

6 Monte Carlo Simulation

The purpose of the simulation is twofold. First, it compares the performance of this nonlinear ATE estimator with the linear method proposed by Wooldridge (2007). This will

show the advantage of using the nonlinear model for estimating the ATE. Second, it examines the impact of violations of the assumptions used to derive the estimating equation in the previous section.

6.1 Data Generating Processes

The derivation of the estimating equations depends on the three assumptions listed in the above section. In this section, we examine the performance of the ATE estimator under violations of the first and second assumptions only. To check robustness to those assumptions, a series of simulations is run using the following data generating processes (DGPs):

$$x \sim \text{Uniform}[15, 50]$$

$$z \sim \text{Binomial}(1, 1/2)$$

$$w = 1\big[\,[\ldots]\,x\ [\ldots]\ 0.3z + v > 0\,\big]$$

$$E(y_1 \mid x, z, e_1, d_1) = \exp\big([\ldots] + e_1^{*} + ([\ldots] + d_1^{*})x\big)$$

$$E(y_0 \mid x, z, e_0, d_0) = \exp\big([\ldots] + e_0^{*} + ([\ldots] + d_0^{*})x\big),$$

where $y_1$ and $y_0$ follow Poisson distributions with the means specified above. This DGP was designed to make the population ATE approximately equal to unity. There are three sets of simulations. DGP 1 deals with the ideal case in which the errors follow a multivariate normal distribution and $v \perp \epsilon + x\delta$. The joint distribution of the errors for DGP 1 is

$$(e_0, d_0, e_1, d_1, v)' \sim N(0, V),$$

where $V$ has ones on the diagonal and off-diagonal elements equal to either $\rho$ or 0. This variance-covariance matrix is in fact a correlation matrix, since the variances are all equal to unity. For the simulation, $e_0^{*} = e_0/100$, $e_1^{*} = e_1/100$, $d_0^{*} = d_0/100$ and $d_1^{*} = d_1/100$ were used. For the correlation $\rho$, the values 0.3 and 0.5 were used. This is the case in which the above estimator should perform best, since the actual DGP conforms to the assumptions on which the estimating equation is based.

DGP 2 generates a process in which the linear projection errors $\epsilon$ and $\delta$ are uncorrelated with $v$ but dependent on it. For a given $v$, $\epsilon$ is set equal to either $v$ or $-v$, each with probability one half. The scatter plot then looks like the letter X: the covariance of the two variables is zero, but they are very strongly dependent on each other.

Another important assumption for Lemma 1 is that $v$ follows a normal distribution, which brings the normal cumulative distribution function $\Phi(\cdot)$ into the equation. To check the robustness of the estimator to that assumption, DGP 3 uses the $\chi^2(5)$ distribution to create skewed errors. The variance-covariance matrix is the same as in the above equation. However, here we drop the multivariate normal assumption; all the marginal distributions of the errors follow an adjusted $\chi^2$ distribution with the predetermined correlation. [2]

6.2 Simulation Results

Tables I and II list the simulation results for $\rho = 0.3$ and $0.5$, respectively. In each table, columns I, II and III list the results for DGP 1, 2 and 3, for both the linear and the nonlinear estimator. Let us first consider the performance of the nonlinear estimator. In both Table I and Table II, column I for DGP 1 shows that the ATE estimator performs well, as expected. The results show a nontrivial number of outlying values under the sample size of [...]. Numerical optimization usually does not work well in smaller samples and gives many dubious estimates. In Tables I and II, outlying values greater than the maximum or smaller than the minimum obtained for the sample size of 3000 were discarded from the generated data. Roughly 8 to 10% of the total data is removed in this way. However, the whole data set is kept for the median and MAD calculations, which are not sensitive to extreme values.
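The DGP 2 construction described earlier in this section, in which the projection error equals $v$ or $-v$ with equal probability, can be sketched in a few lines to show that zero covariance does not imply independence. The sample size, the seed and the standard normal distribution for $v$ are assumptions made only for this illustration, and the function names are hypothetical.

```python
import random

def draw_x_shaped(n=100_000, seed=2):
    """Draw v ~ N(0, 1) and epsilon = v or -v, each with probability 1/2."""
    rng = random.Random(seed)
    vs, eps = [], []
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        vs.append(v)
        eps.append(sign * v)
    return vs, eps

def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

vs, eps = draw_x_shaped()
cov_levels = covariance(vs, eps)  # close to zero: uncorrelated
cov_squares = covariance([v * v for v in vs],
                         [e * e for e in eps])  # far from zero: dependent
```

The covariance in levels is approximately zero, while the covariance of the squares is not, since the square of the projection error equals the square of $v$ exactly. This is the sense in which the two errors are uncorrelated yet strongly dependent.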
There seems to be no discernible difference between ρ = 0.3 and 0.5. The column for DGP 2 shows that, in both tables, the results give no clear sign of an adverse impact from the violation of the independence assumption. Therefore we can conclude that the violation of the independence assumption is more or less practically negligible.

The last column is for DGP 3, where the multivariate normal assumption is violated. In this case, it is clearly visible that the estimator suffers from larger variance and bias even with 5000 observations. This happens because Lemma 1 uses the fact that E[exp(u)] = exp(σ²/2) under normality. If normality fails, the whole estimating equation is misspecified, which causes inconsistency.

A comparison between the linear and nonlinear estimators shows that the nonlinear estimator has larger variance, particularly when the sample size is small. However, it is also clear that the linear estimators do not show any sign of consistency. While the nonlinear estimator keeps approaching the true value, i.e. one, as the sample size grows from 3000 to 5000, the linear estimator does not show any convergence. Another interesting aspect is that the skewed distribution of errors affects not only the nonlinear estimator but also the linear one. This is particularly true in Table II; in column III, the linear estimator actually moves away from the true value of one.

In sum, although it has larger variance, the nonlinear estimator is better than the linear one in terms of consistency. It is preferable to use the nonlinear ATE estimator when the sample size is sufficiently large.

² There might be infinitely many such joint distributions, and the multivariate χ² distribution, which is not used here, is just one of them. The multivariate χ² distribution is not used because 1) it is hard to generate DGP 3 with it in Stata, and 2) what matters is just the failure of normality for each marginal distribution and its impact on the estimation. In DGP 3, I generated the χ² distributions by first generating many basis standard normal variables. The correlations were created by sharing some common basis standard normal variables. The Stata code will be provided upon request.

7 An application: the effect of elementary school education on fertility in Botswana

In this section, the ATE estimator derived above is applied to the Botswana fertility data set from Wooldridge (2010, Chapter 21). Detailed descriptions of the data set are already given in Keay (2011) and thus are omitted here. Before the regression results are presented, some justification is warranted for why the data can be analyzed within the CRC model framework. Consider the regression equation below.

$$\text{children} = \beta_0 + b_1\,\text{age} + \beta_2\,\text{agesq} + u$$

For the time being, let's focus on the motivation for CRC modeling and ignore other covariates that might also be present in the structural equation.
The coefficient b_1 is interpreted as the marginal number of births per year, which is determined by the opportunity cost of childbearing (Becker and Barro, 1988; Barro and Becker, 1989). Since education increases human capital, more education would lead to a lower b_1. For now, just assume that β_2 is constant. Then b_1 is a decreasing function of years of education, and the above equation becomes

$$\text{children} = \beta_0 + b_1(\text{educ})\,\text{age} + \beta_2\,\text{agesq} + u.$$

It is assumed in the above model that agesq does not have a CRC. In other words, all the heterogeneity in the effect of age is assumed to be absorbed into the coefficient of age only. This assumption lets us dispense with age⁴, which might cause considerable trouble in an exponential regression.

Now suppose educ is continuous. Then there are uncountably many regression equations, one for each value of educ, each with constant coefficients. Let's divide the support of educ at the threshold 7, i.e. the duration of primary education in Botswana, to obtain the model below.

$$\text{children}_0 = \beta_{00} + b_{01}(\text{educ})\,\text{age} + \beta_{02}\,\text{agesq} + u_0$$

$$\text{children}_1 = \beta_{10} + b_{11}(\text{educ})\,\text{age} + \beta_{12}\,\text{agesq} + u_1$$

The upper equation is defined on {ω | educ(ω) < 7}, whereas the lower one is defined on the complement set. Next suppose educ is not available but only educ7 is. The information reduction follows the rule

educ7 = 1 if educ ≥ 7
educ7 = 0 if educ < 7,

where educ = zδ + v. With the threshold absorbed into the constant term of zδ, the above model can be written as

$$\text{children}_0 = \beta_{00} + b_{01}(v)\,\text{age} + \beta_{02}\,\text{agesq} + u_0$$

$$\text{children}_1 = \beta_{10} + b_{11}(v)\,\text{age} + \beta_{12}\,\text{agesq} + u_1$$

$$\text{educ7} = 1[z\delta + v > 0]$$

Considering the nonnegativity of children,

$$E[\text{children}_0 \mid \text{age}, u_0] = \exp\big(\beta_{00} + b_{01}(v)\,\text{age} + \beta_{02}\,\text{agesq} + u_0\big)$$

$$E[\text{children}_1 \mid \text{age}, u_1] = \exp\big(\beta_{10} + b_{11}(v)\,\text{age} + \beta_{12}\,\text{agesq} + u_1\big)$$

$$\text{educ7} = 1[z\delta + v > 0]$$

It is expected that there are negative correlations between b_{g1} and v; in other words, the covariances σ_{d_g v} in equation (3) might be negative. Higher education can reduce fertility either by increasing the opportunity cost or by providing people with more information on, say, contraception. Although omitted for simplicity so far, other variables such as evermarr, urban, and electric, which might have CRCs too, will also be included in the following regression. Therefore all the variables except agesq have CRCs, and their squares and cross products will be included in the estimating equations.

Before running the regression using the NTCRC model, it might be a good idea to test for endogeneity by VAT. The test statistic for both regimes jointly, which follows χ²(10), is [...] with p-value [...]. The test statistics for Regime 1 and Regime 0, each of which follows χ²(5), are [...] with p-value .0001 and [...] with p-value [...]. Thus the null hypothesis of no endogeneity is rejected and we will run the regression with the full-blown model.

The regression results for the LFES and NFES models are summarized in Table III, and those for the LTCRC and NTCRC models in Table IV. The first and second columns of Table III show the results from OLS and 2SLS. The next columns are those from LFES estimated by Heckman's correction method. The NFES model is estimated by both Poisson quasi-ML and NLS. Although both NLS and Poisson QMLE are also available for NTCRC, NLS does not give a reasonable value; the estimated ATE from NLS is 5.28, and thus only the QMLE results are listed in Table IV.

All the standard errors in Tables III and IV are bootstrap standard errors based on 500 replications. Among those, the bootstrap standard error for the ATE in the NTCRC model was particularly large. Since the maximum number of children in the data set was 13, it is highly unlikely that the absolute value of the ATE is greater than 13. By that reasoning, all the bootstrap replications whose estimated ATE exceeded 13 in absolute value were discarded to obtain the bootstrap standard error presented in the table.

In Table IV, each basic covariate is numerically labeled, such as age (1), evermarr (2) and so forth. With these labels, cov(1) = cov(δ_1, v), cov(2) = cov(δ_2, v), and so on.

Let's now see the estimation results for the NTCRC model. As expected in theory, there are indeed negative, although insignificant, correlations between the selection error and the coefficients of age for both regimes. The estimated ATE is [...], which is very close to the estimates from the NFES model in Table III. We now turn to the LTCRC estimates in Table IV.
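The trimming rule used for the NTCRC bootstrap standard error can be sketched as follows. This is a minimal illustration rather than the code used for the paper: the function name and the replicate values are hypothetical, and the cutoff is applied to the absolute value of each replicate, following the reasoning about the maximum of 13 children.

```python
import numpy as np


def trimmed_bootstrap_se(boot_estimates, bound=13.0):
    """Bootstrap standard error after dropping replicates with |estimate| > bound.

    Returns the standard deviation of the kept replicates and how many were kept.
    """
    est = np.asarray(boot_estimates, dtype=float)
    kept = est[np.abs(est) <= bound]
    return kept.std(ddof=1), kept.size


# Hypothetical replicate ATE estimates; one of them is an implausible outlier.
replicates = [-1.0, -2.0, 40.0, -3.0]
se, n_kept = trimmed_bootstrap_se(replicates)
print("bootstrap SE after trimming:", se, "replicates kept:", n_kept)
```

A standard error computed this way is conditional on the trimming rule, so the share of discarded replicates is worth reporting alongside it.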
The LTCRC estimate of the ATE is substantially large and highly significant, which is indicative of inconsistency of the estimator. Remember that both the LTCRC and NTCRC estimating equations are based on the conditional expectation, and thus these two estimators cannot be consistent at the same time. As long as we are convinced that the effect of education on fertility is negative, the lesson is that the inconsistency of LTCRC can be substantial, which supports the usefulness of the NTCRC model provided in this article.

Finally, one can perform the model selection test discussed in Section 5.2. The null hypothesis that the model is in fact NFES states that the coefficients of all the pairwise products, as well as the covariances between δ and v, are equal to zero. The Wald statistics for this null hypothesis for Regime 1, Regime 0 and both regimes are [...] (df=10), [...] (df=10) and [...] (df=20), with p-values all equal to .0000, which justifies the NTCRC specification.

8 Conclusion

We have considered the endogenous switching regression in which a random coefficient is correlated with the switching variable and the dependent variable is a count. The count dependent variable requires nonlinear modeling with an exponential conditional mean function. Although it is impossible to identify the mean of each CRC, we have seen that the ATE, which might be of more interest, can be identified. Simulations show that this ATE estimator based on the nonlinear two-regime CRC model performs well when the sample size is large.

References

Amemiya, T. (1985), Advanced Econometrics, Cambridge: Harvard University Press.
Barro, R., and G. Becker (1989), Fertility Choice in a Model of Economic Growth, Econometrica 57.
Becker, G., and R. Barro (1988), A Reformulation of the Economic Theory of Fertility, Quarterly Journal of Economics, 1-25.
Card, D. (2001), Estimating the Return to Schooling: Progress on Some Persistent Econometric Problems, Econometrica 69.
Garen, J. (1984), The Returns to Schooling: A Selectivity Bias Approach with a Continuous Choice Variable, Econometrica 52.
Heckman, J. J. (1976), The Common Structure of Statistical Models of Truncation, Sample Selection, and Limited Dependent Variables and a Simple Estimator for Such Models, Annals of Economic and Social Measurement 5.
Heckman, J. J., and E. Vytlacil (1998), Instrumental Variables Methods for the Correlated Random Coefficient Model, Journal of Human Resources 33.
Keay, M. J. (2011), Estimating Average Treatment Effect by Nonlinear Endogenous Switching Regression with an Application in Botswana Fertility, Michigan State University.
Maddala, G. S. (1986), Limited-Dependent and Qualitative Variables in Econometrics, Cambridge: Cambridge University Press.
Terza, J. (1998), Estimating Count Models with Endogenous Switching: Sample Selection and Endogenous Treatment Effects, Journal of Econometrics 84.
Terza, J. (2008), Parametric Regression and Health Policy Analysis: Estimation and Inference in the Presence of Endogeneity, Department of Economics, University of Florida.
Terza, J. (2009), Parametric Nonlinear Regression with Endogenous Switching, Econometric Reviews 28(6).
Wooldridge, J. (1997), On Two Stage Least Squares Estimation of the Average Treatment Effect in a Random Coefficient Model, Economics Letters 56.
Wooldridge, J. (2003), Further Results on Instrumental Variables Estimation of Average Treatment Effect in the Correlated Random Coefficient Model, Economics Letters 79.
Wooldridge, J. (2005), Fixed Effects and Related Estimators for Correlated Random Coefficient and Treatment Effect Panel Data Models, Review of Economics and Statistics 87.
Wooldridge, J. (2007), Control Function and Related Methods, What's New in Econometrics?, National Bureau of Economic Research.
Wooldridge, J. (2010), Econometric Analysis of Cross Section and Panel Data, second edition, Cambridge: The MIT Press.
Wooldridge, J. (2011), Quasi-Maximum Likelihood Estimation and Testing for Nonlinear Models with Endogenous Explanatory Variables, Department of Economics, Michigan State University.

(Table I) Simulation Results for ρ = 0.3

Columns: DGP I, II and III, each with a linear and a nonlinear estimator.
Rows, for each of n = 1000, 3000 and 5000: mean, mc. st. dev., RMSE, median, MAD, IR. For n = 1000, the mean, mc. st. dev. and RMSE rows each carry the marker * on three entries.
[The numerical entries of this table did not survive in this copy.]

Note: The Monte Carlo standard deviation is substantially large for the sample size of 1000. The * indicates that the outlying values greater than the maximum or smaller than the minimum of the values for the sample size of 3000 were discarded from the generated data.

(Table II) Simulation Results for ρ = 0.5

Columns: DGP I, II and III, each with a linear and a nonlinear estimator.
Rows, for each of n = 1000, 3000 and 5000: mean, mc. st. dev., RMSE, median, MAD, IR. For n = 1000, the mean, mc. st. dev. and RMSE rows each carry the marker * on three entries.
[The numerical entries of this table did not survive in this copy.]
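The row statistics of Tables I and II can be computed from a vector of Monte Carlo ATE estimates roughly as in the Python sketch below. It is an illustration under stated assumptions, not the author's code: MAD is taken to be the median absolute deviation from the median, and IR is read as the interquartile range, neither of which is confirmed by the text. The function name and the example values are hypothetical. Following the table note, the mean, standard deviation and RMSE use the trimmed estimates, while the median, MAD and IR use all estimates.

```python
import numpy as np


def mc_summary(estimates, truth=1.0, lo=None, hi=None):
    """Summarize Monte Carlo estimates of the ATE.

    lo and hi are the trimming bounds (for example the minimum and maximum of
    the estimates obtained at a larger sample size); if either is None, nothing
    is discarded.
    """
    est = np.asarray(estimates, dtype=float)
    trimmed = est
    if lo is not None and hi is not None:
        trimmed = est[(est >= lo) & (est <= hi)]

    med = np.median(est)
    q25, q75 = np.percentile(est, [25, 75])
    return {
        # Moment-based rows: computed after discarding outlying estimates.
        "mean": trimmed.mean(),
        "mc_sd": trimmed.std(ddof=1),
        "rmse": np.sqrt(np.mean((trimmed - truth) ** 2)),
        # Robust rows: computed on all estimates.
        "median": med,
        "mad": np.median(np.abs(est - med)),  # ASSUMPTION: median absolute deviation
        "ir": q75 - q25,                      # ASSUMPTION: interquartile range
        "share_discarded": 1.0 - trimmed.size / est.size,
    }


# Hypothetical estimates with one dubious value from a failed optimization.
summary = mc_summary([0.5, 1.0, 1.5, 50.0], truth=1.0, lo=0.0, hi=2.0)
for name, value in summary.items():
    print(name, value)
```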

(Table III) Regression Results for Linear and NFES models

Columns: Linear Model: OLS, 2SLS, LFES (Hkt) with R1 and R0; NFES: PQML with R1 and R0, NLS with R1 and R0.
[The coefficient estimates did not survive in this copy. For each row, the significance markers and the bootstrap standard errors are listed in the order in which they appear.]

ATE: *** ***; (0.278) (0.315)
educ: *** * **; (0.049) (0.596) (0.528)
age: *** *** *** *** *** *** *** ***; (0.017) (0.019) (0.020) (0.043) (0.020) (0.016) (0.022) (0.018)
agesq: *** *** *** *** *** *** *** ***; (0.000) (0.000) (0.000) (0.001) (0.000) (0.000) (0.000) (0.000)
evermarr: *** *** * *** *** *** *** ***; (0.052) (0.080) (0.078) (0.138) (0.047) (0.043) (0.055) (0.046)
urban: *** ** *** *** *** * ***; (0.046) (0.084) (0.064) (0.120) (0.051) (0.040) (0.059) (0.044)
electric: *** * * *** ** *; (0.066) (0.161) (0.093) (0.224) (0.079) (0.113) (0.090) (0.127)
constant: *** *** *** ** *** ***; (0.245) (0.620) (0.268) (0.478) (0.097) (0.154) (0.327) (0.469)
cov(ɛ, v): *** ** ***; (0.304) (0.605) (1.398) (0.192) (0.295) (0.218)
Further rows: L-likelihood, R, σ.

Note: All the standard errors presented above are bootstrap standard errors. *, ** and *** indicate significance at the 10%, 5% and 1% levels respectively.

(Table IV) Regression Results for LTCRC and NTCRC models

Columns: LTCRC with R1 and R0; NTCRC with R1 and R0.
[The coefficient estimates did not survive in this copy. For each row, the significance markers and the bootstrap standard errors are listed in the order in which they appear.]

ATE: (1.633)
educ: ***; (2.347)
age (1): *** *** ***; (0.030) (0.040) (0.040) (0.088)
agesq: ** *** ***; (0.001) (0.002) (0.001) (0.001)
evermarr (2): *** *** **; (0.164) (0.299) (0.844) (0.321)
urban (3): ***; (0.155) (0.287) (0.382) (0.319)
electric (4): ** **; (0.175) (0.919) (0.335) (2.471)
age*evermarr: **; (0.041) (0.008)
age*urban: (0.025) (0.008)
age*electric: (0.010) (0.054)
evermarr*urban: **; (0.427) (0.066)
evermarr*electric: *; (0.272) (0.178)
urban*electric: (0.287) (0.248)
constant: *** *** ***; (0.345) (0.502) (0.749) (1.842)
cov(ɛ, v): *** **; (0.460) (0.553) (1.068) (2.435)
cov(1): ***; (0.033) (0.043) (0.074) (0.121)
cov(2): ** ** **; (0.287) (0.313) (0.414) (3.374)
cov(3): *; (0.263) (0.279) (0.182) (2.127)
cov(4): *; (0.583) (0.563) (0.680) (1.240)
Further rows: L-likelihood, R, σ.

Note: The bootstrap standard error for the NTCRC ATE estimator is calculated by excluding all the replications with ATE values over 13. cov(1) is the covariance between v and the coefficient of variable (1), i.e. age. The variable number for each basic covariate is indicated in the variable list. cov(2) and the other numbered covariances are defined similarly.
