Relaxing Conditional Independence in a Semi-Parametric Endogenous Binary Response Model


Relaxing Conditional Independence in a Semi-Parametric Endogenous Binary Response Model

Alyssa Carlson

Dated: October 5, 2017

Abstract

Expanding on the works of Rivers and Vuong (1988), Blundell and Powell (2004), and Rothe (2009), this paper presents a new flexible conditional maximum likelihood estimator that is able to address issues previously ignored in the literature. This estimator follows the standard two-step control function approach to address endogeneity of a continuous random variable and is semi-parametric in the standard sense of having a preliminary infinite-dimensional nuisance parameter. By relaxing the Conditional Independence assumption that was previously used for identification, the proposed estimator is more robust in certain respects. For instance, this estimation procedure allows for parametric specifications of heteroskedasticity that the Blundell and Powell and Rothe estimators can address only in restricted forms. In addition, following the work of Kim and Petrin (working paper), the model allows for a more flexible (although parametrically specified) control function. Standard asymptotic results for the estimator are derived, including consistency, √n-asymptotic normality, and an estimator for the asymptotic variance. Simulation results on parameter estimates, Average Partial Effects estimates, and Average Structural Function estimates are provided for two different specifications. The data generating process for the simulations models the empirical data given in Blundell and Powell (2004) and Rothe (2009) to give some economic context to the results. This paper concludes that there is a trade-off between a flexible specification and a structural interpretation, and that the consequences of assuming Conditional Independence cannot be ignored.

Keywords: Binary Response, Probit, Endogeneity, Control Function, Sieve Estimation, Heteroskedasticity

PhD student at the Economics Department of Michigan State University. I would like to thank Professor Jeffrey Wooldridge, Professor Kyoo Il Kim, and Professor Ben Zou for all their direction, helpful comments, and advice.

Contents

1 Introduction
2 Model Set Up
3 Identification
  3.1 Multiplicative Heteroskedasticity
  3.2 Conditional Mean Restriction
  3.3 Conditional Mean and Variance Functions are Nonlinear in Parameters
4 Estimation
  4.1 Average Structural Function
  4.2 Average Partial Effects
5 Asymptotics
  5.1 Consistency
  5.2 √n-Asymptotic Normality
6 Simulation
  6.1 Conditional Independence does not hold
  6.2 Linear Index
7 Conclusion
A Equations and Notation
B Consistent Variance Estimation
C Proofs
  C.1 Proof of Theorem
  C.2 Proof of Lemma B
D Simulation Implementation
  D.1 Conditional Independence does not hold
  D.2 Linear Index

1 Introduction

In investigating causal effects in a binary response framework, a predominant concern is the ability to address endogeneity. For instance, in models of labor participation, one is usually concerned about the endogeneity of non-wage income, as it is usually simultaneously determined within the household. Rivers and Vuong (1988), Blundell and Powell (2004), and Rothe (2009) are a series of papers that propose a control function method to address endogeneity of a continuous regressor. However, all three papers either explicitly or implicitly assume Conditional Independence (CI) to simplify identification. Following the notation of Blundell and Powell (2004),

y_{1i} = \begin{cases} 1 & y_{1i}^* \ge 0 \\ 0 & y_{1i}^* < 0 \end{cases}, \qquad y_{1i}^* = x_i\beta_o + u_{1i}, \qquad y_{2i} = \pi_o(z_i) + v_{2i}   (1)

where x_i includes the exogenous regressors (z_{1i}) and the continuous endogenous regressor (y_{2i}). Then the CI assumption is

u_{1i} \mid v_{2i}, z_i \sim u_{1i} \mid v_{2i}   (2)

in other words, conditional on v_{2i}, z_i is independent of u_{1i}. In Rivers and Vuong (1988), CI is a consequence of their assumption that u_1 and v_2 are jointly independent of z. Blundell and Powell (2004) and Rothe (2009) state that the CI assumption is necessary for identification. But CI may be too stringent an assumption in many empirical contexts.

This paper proposes an estimator that utilizes the same control function technique but extends the model by relaxing the CI assumption. Following the literature on Non-Parametric Instrumental Variables (NPIV)[1] and Kim and Petrin (working paper), the CI assumption can, and in many cases should, be relaxed. Kim and Petrin (working paper) relax CI in the non-parametric case where the unobserved heterogeneity is additively separable. They also provide several examples where CI is too restrictive, such as returns to education, production functions, and supply and demand frameworks where the reduced form for price is non-separable. This paper extends the generalization to a binary response model where the error is not additively separable.

Footnote 1: Ai and Chen (2003), Newey and Powell (2003), Hall et al. (2005), and Blundell et al. (2007).

Alternative estimators that do not require the CI assumption in the context of the control function approach are the special regressor estimator proposed in Lewbel (2000) and Dong and Lewbel (2015) and the minimum distance estimator proposed by Hong and Tamer (2003). However, both of these methods require alternative conditional restrictions. The special regressor estimator requires a strictly exogenous regressor that has large support, while the minimum distance estimator requires a conditional median restriction: that the median of the error u_{1i} conditional on the instruments z_i is zero. Under these alternative assumptions, those proposed estimators are valid. This paper aims to propose an alternative estimator that falls under the control function method with weaker assumptions than conditional independence.

The remainder of this paper is organized as follows. Section 2 describes the set-up of the model and how it relates to and differs from previous models and assumptions in the literature. The generalization proposed in this paper does have its costs, in that identification is slightly more difficult to obtain, so the following section discusses what restrictions are needed to obtain local and global identification. Section 4 gives instructions on how the estimator would be implemented and discusses the usual functions of interest, the Average Structural Function (ASF) and Average Partial Effects (APE), and how to estimate them. Standard asymptotic results are presented in Section 5; this section relies heavily on the work in Newey (1994). Section 6 presents the results from a simulation study comparing the proposed estimator with current estimation procedures under two scenarios. The data generating process for the simulations models the empirical data given in Blundell and Powell (2004) and Rothe (2009) to give some economic context to the results. Finally, the paper concludes with a short discussion on possible directions to expand this research.

2 Model Set Up

Consider the triangular system described by equation (1), where y_{1i} is a binary response variable, z_i = (z_{1i}, z_{2i}) is a 1 × (k_1 + k_2) vector of covariates, y_{2i} is a single continuous endogenous regressor, and x_i

is a 1 × K_x vector where each element is a function of (z_{1i}, y_{2i}). Let u_{1i} and v_{2i} be mean-zero unobserved heterogeneity, and let the function π_o(·) be an unknown function. Unlike in the linear case (see Chapter 6.2 in Wooldridge (2010) for reference), where constructing the control function relies only on linear projection arguments, in the nonlinear case stronger assumptions are needed. In particular, the function π_o(·) will need to be the true conditional mean. This will be discussed further in the section on identification.

The distributional assumptions for u_{1i} and v_{2i} determine the estimation procedure and the consistency of estimates. For example, if one were to assume u_{1i} | v_{2i}, z_i ~ N(0, 1), then there is no endogeneity and no heteroskedasticity, and therefore a standard probit MLE procedure will yield consistent estimates. On the other hand, if u_{1i} | v_{2i}, z_i ~ N(0, exp(2 z_i δ)), then heteroskedasticity is present and the standard probit MLE procedure would be inconsistent, but a het-probit MLE procedure would be consistent. If u_{1i} | v_{2i}, z_i ~ N(ρ v_{2i}, 1), then the two-step CMLE procedure developed by Smith and Blundell (1986) and further explored by Rivers and Vuong (1988) would be consistent, and other methods that ignore the endogeneity would be inconsistent. More generally, if the CI assumption holds, so that u_{1i} | v_{2i}, z_i ~ u_{1i} | v_{2i} with some unknown distribution, Blundell and Powell (2004) (for the remainder of the paper referred to as BP) and Rothe (2009) provide semi-parametric methods that estimate the parameters consistently. As a first step in relaxing the CI assumption, the proposed estimation procedure will require the conditional distribution to be normal with known conditional mean and conditional variance functions, as stated in Assumption 2.1; further generalizations to an unknown conditional mean, conditional variance, or distribution are left to future research.

Assumption 2.1. From the set-up in equation (1), let the unobserved heterogeneity have the following conditional distribution:

u_{1i} \mid z_i, v_{2i}, y_{2i} = u_{1i} \mid z_i, v_{2i} \sim N\big(h(v_{2i}, z_i; \gamma_o),\ \exp(2\, g(v_{2i}, z_i; \delta_o))\big)

where z_i = (z_{1i}, z_{2i}), and h(v_{2i}, z_i; γ_o) and g(v_{2i}, z_i; δ_o) are known functions up to a finite number of unknown parameters (γ_o, δ_o).

Under Assumption 2.1, the conditional mean of y_{1i} is

E(y_{1i} \mid z_i, y_{2i}, v_{2i}) = \Phi\left(\frac{x_i\beta_o + h(v_{2i}, z_i; \gamma_o)}{\exp(g(v_{2i}, z_i; \delta_o))}\right)   (3)

This result should be unsurprising, as it appears to be a heteroskedastic probit model that adjusts for endogeneity using the control function approach, both of which have been discussed extensively in the literature before. Since the normal distribution is indexed by its mean and variance, relaxing CI under a normality assumption simply allows for heteroskedasticity and a flexible control function in a probit model. As a result, estimation will follow a simple two-step approach: the conditional mean of y_{2i} is estimated in the first stage to obtain residuals (v̂_{2i}) that are then plugged into a second-step heteroskedastic probit using the conditional mean given in equation (3). This will be discussed in more detail in Section 3.

There are several important implications of Assumption 2.1 in comparison to the standard CI assumption. First, Assumption 2.1 allows the control function to be a general function of (v_{2i}, z_i), whereas CI implies that the control function cannot be a function of z_i. In the linear model this was not an issue, since the control function was derived using a projection argument. Consider the example of the demand model from Kim and Petrin (working paper), where the outcome (y_{1i}) is demand for the product, the endogenous variable (y_{2i}) is price, the exogenous variables (z_i) are observable characteristics, and the latent error (u_{1i}) can be interpreted as the unobservable characteristics. Then CI requires the expectation of the unobservable characteristics of a product, conditional on the price and observable characteristics, to be just a function of the price. One could imagine how the observable characteristics may interact with price in the conditional expectation. For example, consider the demand for purchasing a home where the unobservable characteristic is the quality of the neighbors (i.e., are they loud neighbors?). Then observable characteristics such as proximity to the neighbors (i.e., apartment versus spaced-out homes) and affluence of the neighborhood would have interactive effects on the expected price of the home. Therefore CI may be too strong an assumption.

Second, CI implies that u_1 conditional on v_2 cannot be heteroskedastic in z. Assumption 2.1 allows for general forms of heteroskedasticity through the function g(v_{2i}, z_i; δ_o).[3] This is particularly relevant in the nonlinear model, where unaccounted heteroskedasticity can result in inconsistent parameter estimates.

Footnote 2: The normality assumption could easily be generalized to any known distribution with CDF G(·). This would then allow for the logistic distribution, and therefore this paper would generalize to the logit model as well.

Footnote 3: Note that assuming the conditional variance is an exponential function is not restrictive; it merely enforces non-negativity of the variance and allows g(v_{2i}, z_i; δ_o) to be unrestricted, which eases estimation.
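To make the structure of equation (3) concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available) of the conditional mean under one hypothetical linear-in-parameters choice of h and g; the particular functions, variable names, and parameter values are illustrative only, not the specifications used later in the paper.

```python
import numpy as np
from scipy.stats import norm

def conditional_mean_y1(x, z, v2, beta, gamma, delta):
    """Conditional mean in equation (3): Phi((x*beta + h) / exp(g)).

    Hypothetical specification (illustration only):
        h(v2, z; gamma) = gamma[0]*v2 + gamma[1]*z[:, 0]*v2   (control function)
        g(v2, z; delta) = delta[0]*z[:, 0] + delta[1]*v2      (heteroskedasticity)
    Both depend on z, which the CI assumption would rule out.
    """
    h = gamma[0] * v2 + gamma[1] * z[:, 0] * v2
    g = delta[0] * z[:, 0] + delta[1] * v2
    return norm.cdf((x @ beta + h) / np.exp(g))

# Evaluate the conditional mean on a few artificial observations.
rng = np.random.default_rng(0)
x = np.column_stack([np.ones(5), rng.normal(size=5)])   # x_i includes a constant
z = rng.normal(size=(5, 2))
v2 = rng.normal(size=5)
print(conditional_mean_y1(x, z, v2, beta=np.array([0.2, -0.5]),
                          gamma=np.array([0.3, 0.1]), delta=np.array([0.2, 0.0])))
```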

BP and Rothe claim that although they impose CI, their estimation procedure should be able to handle limited forms of heteroskedasticity. This proposition will be addressed further in the section on deriving the ASF and explored in the simulation study.

The third implication is the distributional assumption and possible misspecification of the probit functional form. This is particularly pertinent in contrast to the estimators of BP and Rothe, which impose no distributional assumptions. Although the distributional assumption may not hold in empirical contexts, having the probit functional form simplifies identification and estimation. This could be relaxed in future research. In the following section, identification will be shown for u_{1i} | v_{2i}, z_i normally distributed and the functions h(v_{2i}, z_i; γ) and g(v_{2i}, z_i; δ) known. To see why we cannot relax both CI and the distributional assumption, consider the most general case: let F(·; v_{2i}, z_i) be the conditional CDF of −u_{1i} given (v_{2i}, z_i); then

E(y_{1i} \mid v_{2i}, z_i) = P(-u_{1i} \le x_i\beta_o \mid v_{2i}, z_i) = F(x_i\beta_o; v_{2i}, z_i)

where F(·; v_{2i}, z_i) would be estimated non-parametrically. But since y_{2i} is in x_i and is perfectly determined by z_i and v_{2i}, identification would not hold. Therefore some structure must be imposed in order to generalize. There will always be instances where one can argue which assumption should take precedence. However, one of the benefits of assuming a known distributional functional form is the computational ease of implementation. In particular, the proposed estimation procedure can be executed using preprogrammed commands in common statistical packages and can be computed quite quickly, compared to the estimation procedures described in BP and Rothe (2009), which are much more computationally demanding.[4]

Footnote 4: Rothe provides code in R for his estimator, which makes it fairly easy to implement.

3 Identification

Since the estimation process is done in two stages, identification needs to be shown for both steps. Identification of the first stage is a standard application of non-parametric identification of a conditional mean.

The exogeneity condition for the reduced form of y_{2i} is a decomposition of its conditional mean (π_o(z_{1i}, z_{2i})) and unobserved heterogeneity (v_{2i}). Therefore the reduced form is a model without endogeneity.[5] This is stated in the following lemma.

Lemma 3.1 (First Stage Identification). Consider the set-up in equation (1). Let E(v_{2i} | z_i) = 0, so that E(y_{2i} | z_i) = π_o(z_{1i}, z_{2i}); then the function π_o(z_{1i}, z_{2i}) is non-parametrically identified.

Notice that the assumption in this lemma is much stronger than the usual projection argument when the control function approach is used in linear regression. When using a projection argument, one may always write y_{2i} = z_iρ + v_{2i} with E(z_i'v_{2i}) = 0, but z_iρ is not necessarily the conditional mean.

Identification of the second stage parameters β_o, γ_o, and δ_o requires more thought and depends on the functional forms of h(v_{2i}, z_i; γ_o) and g(v_{2i}, z_i; δ_o). Let's first consider identification in the most common setting, in which h(v_{2i}, z_i; γ_o) = h(v_{2i}, z_i)γ_o and g(v_{2i}, z_i; δ_o) = g(v_{2i}, z_i)δ_o are linear in parameters. There are two major issues that need to be addressed for identification: multiplicative heteroskedasticity and the conditional mean restriction. The next two subsections address both issues in the linear case.

3.1 Multiplicative Heteroskedasticity

Carlson (working paper) addresses the issue of identification with exponential multiplicative heteroskedasticity and provides sufficient conditions for identification. From Theorem 1 of Carlson (working paper), the following is sufficient for identification of the parameters in the heteroskedastic probit model.

Assumption 3.1. In the set-up in equation (1), where h(v_{2i}, z_i; γ_o) = h(v_{2i}, z_i)γ_o and g(v_{2i}, z_i; δ_o) = g(v_{2i}, z_i)δ_o are linear in parameters,
(i) E[(x_i, h(v_{2i}, z_i))'(x_i, h(v_{2i}, z_i))] is non-singular and x_i includes a constant;
(ii) E[g(v_{2i}, z_i)'g(v_{2i}, z_i)] is non-singular and g(v_{2i}, z_i) does not include a constant;
(iii) the joint support of (x_i, h(v_{2i}, z_i), g(v_{2i}, z_i)) has at least three points;
(iv) the parameter space of (β_o, γ_o) does not allow the coefficient on the constant to be 0 or the coefficients on the remainder of the terms to all be 0.

Parts (i) and (ii) are fairly standard in the literature. Parts (iii) and (iv) are needed to ensure there is no manipulation of the support or heteroskedastic transformation that would prevent separate identification of the mean parameters β_o and γ_o and the heteroskedastic parameters δ_o.

Footnote 5: As discussed in Chen (2007), this implies the true parameter π_o is identified as the unique minimizer of Q(π) = E[(y_{2i} − π(z_i))²].

Although part (i) does not appear to be restrictive, it does make assumptions on the relationship between x_i and h(v_{2i}, z_i). The next section discusses this issue in more detail.

3.2 Conditional Mean Restriction

The random variables that compose the elements of x_i and h(v_{2i}, z_i) are the same, since y_{2i} is a function of z_i and v_{2i}; therefore it is quite likely that even if E[x_i'x_i] and E[h(v_{2i}, z_i)'h(v_{2i}, z_i)] are non-singular, Assumption 3.1 (i) may not be satisfied. This section aims to develop a sufficient condition on the construction of the control function to ensure identification. Since Assumption 3.1 (i) essentially requires that none of the elements of x_i and h(v_{2i}, z_i) can be written as linear combinations of the other elements, a sufficient assumption is the Conditional Mean Restriction (CMR).

Assumption 3.2. Given the set-up in Assumption 2.1, with the function h(v_{2i}, z_i; γ_o) = h(v_{2i}, z_i)γ_o linear in its parameters,
(i) E(x_i'x_i) is non-singular;
(ii) E(h(v_{2i}, z_i)'h(v_{2i}, z_i)) is non-singular;
(iii) (CMR) E(h(v_{2i}, z_i) | z_i) is a zero vector.

The Conditional Mean Restriction is from Kim and Petrin (working paper) and the literature on non-parametric IV (Newey and Powell (2003), Hall et al. (2005), Blundell et al. (2007), and others). Kim and Petrin (working paper) use a similar assumption to show identification in a non-parametric triangular system with an additively separable error. The CMR can be interpreted as a way to distinguish the endogeneity of y_{2i} from the exogeneity of z_i. Previous papers have utilized the CI assumption (Newey et al. (1999), Blundell and Powell (2004), Rothe (2009)), equation (2), and as a result the distribution of u_{1i} conditional on v_{2i} and z_i cannot be a function of z_i. CI ensures identification since v_{2i} (and any function of v_{2i}) is linearly independent of x_i by construction. After removing the CI assumption, z_i and u_{1i} are allowed to have a relationship, but the CMR restricts that relationship so that z_i cannot be interpreted as an endogenous variable. By the law of iterated expectations,

E(u_{1i} \mid z_i) = E\big(E(u_{1i} \mid z_i, v_{2i}) \mid z_i\big) = E\big(h(v_{2i}, z_i)\gamma_o \mid z_i\big) = 0

The middle equality holds by the specification provided in Assumption 2.1, and the last equality holds by the CMR. As a result, the CMR merely requires that z_i be mean independent of u_{1i}.

To provide some intuition for the implications, the CMR requires the elements of h_i to be random and to be demeaned. For instance, v_{2i}² could not be an element of h_i, but v_{2i}² − E(v_{2i}² | z_{1i}, z_{2i}) could be. In addition, no element can be a function of (z_{1i}, z_{2i}) alone; such terms can only enter as interactions with functions of v_{2i}. This prevents any issues of linear dependence between the elements of x_i and h(v_{2i}, z_i). To show this, let ξ be a non-random vector such that

\big(x_i \;\; h(v_{2i}, z_i)\big)\begin{pmatrix}\xi_1 \\ \xi_2\end{pmatrix} = 0 \iff x_i\xi_1 + h(v_{2i}, z_i)\xi_2 = 0

Taking the conditional expectation with respect to z_i,

E(x_i \mid z_i)\xi_1 + E(h_i \mid z_i)\xi_2 = 0 \implies E(x_i \mid z_i)\xi_1 = 0

By standard exclusion and relevance conditions on the instrument z_{2i} and linear independence of x_i, E(x_i | z_i) is also linearly independent. Therefore ξ_1 is a zero vector, and it follows that ξ_2 is a zero vector. Therefore Assumption 3.2 is sufficient for Assumption 3.1 (i).

The next subsection extends this discussion to the nonlinear case. Since the specifications for the control function and heteroskedastic function are left general, identification cannot be shown without knowing the functional forms. Consequently, the next subsection shows how one would go about showing identification given a known nonlinear functional form.

3.3 Conditional Mean and Variance Functions are Nonlinear in Parameters

If h(v_{2i}, z_i; γ_o) and g(v_{2i}, z_i; δ_o) are nonlinear in the parameters, showing identification follows the works of Rothenberg (1971) and Komunjer (2012). Using Theorem 1 of Rothenberg (1971), local identification requires verification that the information matrix is full rank, along with some other regularity conditions.[6]

Footnote 6: The standard regularity assumptions are that the parameter space is open, the support of the random variables does not depend on the values of the parameters, the functions h(v_{2i}, z_i; γ_o) and g(v_{2i}, z_i; δ_o) are continuously differentiable in γ_o and δ_o respectively, and θ_o is a regular point of the information matrix I(θ), where θ_o = (β_o', γ_o', δ_o')'.

Let θ_o = (β_o', γ_o', δ_o')' and let S_i(θ_o, π_o) denote the score of the log-likelihood for a probit model derived using the conditional mean given in equation (3). Then the information matrix can be written as

I(\theta_o) = E[S_i(\theta_o, \pi_o) S_i(\theta_o, \pi_o)']
           = E\left[\omega(x_i, z_i; \theta_o, \pi_o)\begin{pmatrix} x_i'x_i & x_i'h_\gamma & (x_i\beta_o + h(v_{2i}, z_i; \gamma_o))\,x_i'g_\delta \\ h_\gamma'x_i & h_\gamma'h_\gamma & (x_i\beta_o + h(v_{2i}, z_i; \gamma_o))\,h_\gamma'g_\delta \\ (x_i\beta_o + h(v_{2i}, z_i; \gamma_o))\,g_\delta'x_i & (x_i\beta_o + h(v_{2i}, z_i; \gamma_o))\,g_\delta'h_\gamma & (x_i\beta_o + h(v_{2i}, z_i; \gamma_o))^2\,g_\delta'g_\delta \end{pmatrix}\right]   (4)

where h_γ and g_δ are the row vectors of partial derivatives of the functions h(v_{2i}, z_i; γ) and g(v_{2i}, z_i; δ) with respect to γ and δ (respectively), evaluated at γ_o and δ_o (respectively), and ω(x_i, z_i; θ_o, π_o) is a non-zero scalar function of its arguments.[7] Verification of full rank reduces to verification of linear independence of the random row vectors (x_i, h_γ) and g_δ, where an analogue of the CMR will be used, as well as no element of g_δ being a constant term.

This can be extended to global identification following Theorem 2 of Komunjer (2012). However, this does require that the determinant of the information matrix be non-positive for all values of the parameters in the parameter space and that the expectation of the score be a proper function of the parameters θ. This is much more difficult to show, or to provide intuition for, without the functional forms of h(v_{2i}, z_i; γ_o) and g(v_{2i}, z_i; δ_o).

Footnote 7: For it to be non-zero, β_o is required to be non-zero, as well as sufficient support of x_i so that x_iβ_o is non-zero. This assumption is paralleled in the linear case, where (β_o, γ_o) is required to be non-zero to identify the heteroskedastic component.

4 Estimation

The estimator follows a two-step procedure, as an application of a semi-parametric two-step estimator. First, estimate the conditional mean function E(y_{2i} | z_i) = π_o(z_{1i}, z_{2i}) non-parametrically. This can be done via the sieve method, where Π denotes a space of functions that includes π_o(·), with a metric induced by the norm ‖π‖. Let {p_{lL_n}(z_i), l = 1, 2, ..., L_n} be a sequence of basis functions of z_i. Letting Z denote the support of z_i, the sieve spaces are defined as

\Pi_{L_n} = \Big\{\pi : \pi(z) = \sum_{l < L_n} p_{lL_n}(z)\lambda_l,\ \lambda_l \in \mathbb{R}^{\dim(p_{lL_n}(z))},\ z \in \mathcal{Z}\Big\}   (5)

where L_n → ∞ and L_n/n → 0, so that Π_L ⊆ Π_{L+1} ⊆ ... ⊆ Π. Consider a non-parametric multivariate least squares regression where the population criterion function is

Q(\pi) = E[(y_{2i} - \pi(z_i))^2]   (6)

Then the series estimator minimizes the sample criterion function:

\hat{\pi} = \arg\min_{\pi \in \Pi_{L_n}} \hat{Q}_n(\pi) \equiv \arg\min_{\pi \in \Pi_{L_n}} \frac{1}{n}\sum_{i=1}^n (y_{2i} - \pi(z_i))^2   (7)

In practice this is as simple as OLS estimation: if P_{L_n}(z_i) = (p_{1L_n}(z_i), p_{2L_n}(z_i), ..., p_{L_nL_n}(z_i)), then the estimate of the conditional mean is

\hat{\pi}(z_i) = P_{L_n}(z_i)\left(\sum_{k=1}^n P_{L_n}(z_k)'P_{L_n}(z_k)\right)^{-1}\sum_{j=1}^n P_{L_n}(z_j)'y_{2j} = P_{L_n}(z_i)\hat{\lambda}

and the residuals, constructed as v̂_{2i} = y_{2i} − π̂(z_i), are used in the second stage.

In determining a sieve space, suppose for a moment that z_i is a scalar; if the support of z_i is the unit interval [0, 1], one could consider a simple linear sieve such as polynomials.[8] The polynomial sieve space is

\Pi^{Poly}_{L_n} = \Big\{\pi : \pi(z) = \sum_{l < L_n} \lambda_l z^l,\ \lambda_l \in \mathbb{R},\ z \in [0, 1]\Big\}   (8)

This can be extended to a multi-dimensional space using tensor products. One could also consider splines or orthogonal wavelets in place of the polynomial sieve if there is bounded support. However, if z_i has unbounded support, it would be more appropriate to use a Hermite polynomial or Laguerre polynomial sieve space. In Section 5, where the asymptotics are derived, it will be assumed that a polynomial sieve space is utilized. Since only the convergence rate of the first stage estimator comes into play when deriving √n-asymptotic normality for the second stage parameters, any non-parametric first stage estimator, such as a kernel based method, can be considered in place of the series estimator.

Footnote 8: Of course, any bounded random variable with known bounds can be transformed into a random variable with support [0, 1].
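As an illustration of the first stage, the following Python sketch runs the series regression above using a tensor-product polynomial basis and returns the fitted conditional mean π̂(z_i) and the residuals v̂_{2i}. The basis order, variable names, and placeholder data are hypothetical; in practice L_n would grow with the sample size as discussed above.

```python
import numpy as np
from itertools import product

def polynomial_basis(z, order):
    """Tensor-product polynomial basis P_Ln(z) for an (n x k) matrix z."""
    n, k = z.shape
    columns = []
    # all exponent combinations with total degree <= order (includes the constant term)
    for exps in product(range(order + 1), repeat=k):
        if sum(exps) <= order:
            columns.append(np.prod(z ** np.array(exps), axis=1))
    return np.column_stack(columns)

def first_stage(y2, z, order=3):
    """Series (OLS) estimate of pi_o(z) = E(y2 | z) and the residuals v2_hat."""
    P = polynomial_basis(z, order)
    lam, *_ = np.linalg.lstsq(P, y2, rcond=None)   # least squares coefficients lambda_hat
    pi_hat = P @ lam                               # fitted conditional mean pi_hat(z_i)
    return pi_hat, y2 - pi_hat                     # residuals v2_hat for the second step

# Usage with placeholder simulated data.
rng = np.random.default_rng(1)
z = rng.normal(size=(500, 2))
y2 = 1.0 + z[:, 0] - 0.5 * z[:, 1] ** 2 + rng.normal(size=500)
pi_hat, v2_hat = first_stage(y2, z, order=3)
```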

Then, in the second step, one maximizes the following likelihood with respect to β, γ, and δ to obtain estimates of the parameters:

L(y_{1i}, x_i, z_i; \theta, \hat{\pi}) = \frac{1}{n}\sum_{i=1}^n \left[y_{1i}\log\Phi\!\left(\frac{x_i\beta + h(\hat{v}_{2i}, z_i; \gamma)}{\exp(g(\hat{v}_{2i}, z_i; \delta))}\right) + (1 - y_{1i})\log\!\left(1 - \Phi\!\left(\frac{x_i\beta + h(\hat{v}_{2i}, z_i; \gamma)}{\exp(g(\hat{v}_{2i}, z_i; \delta))}\right)\right)\right]   (9)

This can be as simple as running a heteroskedastic probit, a standard preprogrammed command in many statistical packages. However, standard errors need to be adjusted to account for the variation from using the first stage estimates; a simple solution is to bootstrap the standard errors. In many instances the parameters themselves are not of much interest, but rather the average structural function and the partial effects. The next two subsections provide estimators for the ASF and APE and compare them with other estimators standard in the literature.

4.1 Average Structural Function

Consider a general, possibly nonlinear, model y = m(x, u). Then, from BP, the ASF for y is E_u[m(x^o, u)], evaluated at the non-random vector x^o and averaged over the unobserved heterogeneity u. The ASF for the model considered here can be derived using the law of iterated expectations:

ASF(x^o) = E\left[\Phi\!\left(\frac{x^o\beta_o + h(v_{2i}, z_i; \gamma_o)}{\exp(g(v_{2i}, z_i; \delta_o))}\right)\right]   (10)

which can be estimated by approximating the expectation (with respect to v_{2i} and z_i) with a sample average and plugging in the parameter estimates for β_o, γ_o, and δ_o and the first stage residuals for v_{2i}.

Now I would like to compare this to the ASFs derived in BP and Rothe. Recall that they assume CI, so that equation (2) holds. By doing so, they need no distributional assumptions, whereas this paper imposes a normality assumption. Following the same set-up as in equation (1), let G(·; v_{2i}) be the unknown CDF of −u_{1i} given v_{2i}. Because of the CI assumption, u_{1i} is independent of z_i conditional on v_{2i}; therefore G(·; v_{2i}) is also the conditional CDF of −u_{1i} given v_{2i}, z_i. Fixing x_i = x^o, averaging out the unobserved heterogeneity u_{1i}, and using the law of iterated expectations,

ASF(x^o) = E\big(G(x^o\beta; v_{2i})\big)   (11)

where the expectation is with respect to v_{2i}. Comparing equations (10) and (11) highlights the differences in the two approaches.
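To illustrate the second step, the sketch below maximizes the likelihood in equation (9) (by minimizing its negative) for one hypothetical linear-in-parameters specification of h and g, and then computes the sample analogue of the ASF in equation (10) at a fixed vector x^o. The function names, the specification of h and g, and the starting values are illustrative assumptions only; standard errors would still need to be bootstrapped or otherwise corrected for the first-stage estimation as noted above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def make_index(theta, x, z, v2):
    """Standardized index (x*beta + h) / exp(g) for a hypothetical specification:
    h = gamma[0]*v2 + gamma[1]*z[:,0]*v2 and g = delta[0]*z[:,0]."""
    kx = x.shape[1]
    beta, gamma, delta = theta[:kx], theta[kx:kx + 2], theta[kx + 2:]
    h = gamma[0] * v2 + gamma[1] * z[:, 0] * v2
    g = delta[0] * z[:, 0]
    return (x @ beta + h) / np.exp(g)

def neg_loglik(theta, y1, x, z, v2):
    """Negative of the sample log-likelihood in equation (9)."""
    p = np.clip(norm.cdf(make_index(theta, x, z, v2)), 1e-10, 1 - 1e-10)
    return -np.mean(y1 * np.log(p) + (1 - y1) * np.log(1 - p))

def fit_second_stage(y1, x, z, v2_hat):
    """Heteroskedastic probit with control function terms, using first-stage residuals."""
    theta0 = np.zeros(x.shape[1] + 3)                  # crude starting values
    res = minimize(neg_loglik, theta0, args=(y1, x, z, v2_hat), method="BFGS")
    return res.x

def asf(theta_hat, x_o, z, v2_hat):
    """Sample analogue of the ASF in equation (10) at the fixed row vector x_o."""
    x_rep = np.tile(x_o, (z.shape[0], 1))              # hold x fixed at x_o
    return np.mean(norm.cdf(make_index(theta_hat, x_rep, z, v2_hat)))
```

Given residuals v2_hat from the first-stage sketch, fit_second_stage returns the estimate of θ = (β', γ', δ')' and asf evaluates the estimated ASF at any chosen x^o.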

The proposed approach requires the normality assumption, but by doing so it is able to specify more flexible functions for the conditional mean and variance of u_{1i} | v_{2i}, z_i. However, if the normality assumption fails (which may be very likely), then this method is no longer valid. A small consolation is derived by Ruud (1983): if E(x_i | x_iβ) is linear in x_iβ (which is true for x_i distributed multivariate normal), then MLE assuming normality will produce consistent estimates of a scaled β_o. Consequently, the APE estimates would still be consistent.

The BP approach relaxes the distributional assumption and, in doing so, requires that the unobserved heterogeneity u_{1i} be conditionally independent of the instruments z_i. This generally rules out z_i entering the control function and any heteroskedasticity in z_i. But BP argue that their estimation method could technically allow for heteroskedasticity in terms of the linear index x_iβ, where the CI assumption is relaxed to

u_{1i} \mid v_{2i}, z_i \sim u_{1i} \mid v_{2i}, x_i\beta   (12)

(equivalent to equation 2.2b in Rothe). This is because the function G(x_iβ_o; v_{2i}) is merely estimated as an approximation over its two arguments and therefore could represent any function of x_iβ_o and v_{2i}, not restricted to those that impose CI. It is important to note that this is still a fairly strong restriction on the conditional distribution of u_{1i} | v_{2i}, z_i.

For a moment, imagine the best case scenario in which CI does not hold but the linear index condition (equation (12)) does hold. Then the estimators in BP and Rothe (2009) should consistently estimate the function G(x_iβ_o; v_{2i}), which is the conditional distribution of −u_{1i} given v_{2i} and z_i. Now re-examine the ASF in equation (11) and what is being averaged out. Since it is only averaging out the second argument, v_{2i}, it would not be averaging out the effect of the heterogeneity due to the linear index (i.e., the parts of the conditional CDF P(−u_{1i} < c | v_{2i}, x_iβ_o) that involve the linear index, but not at the point of evaluation). Therefore, even if their estimation procedure could handle limited forms of heteroskedasticity and a flexible control function, I would argue that their estimate of the Average Structural Function no longer captures what they stress is important.[9] This will be reflected in the simulation study presented in Section 6.2.

Footnote 9: Rothe (2009) considers this case in his simulation study with Design III, where there is heteroskedasticity in the latent variable error in terms of the linear index Xβ, but he only reports results on coefficient estimates and not ASF or APE estimates.

4.2 Average Partial Effects

The APE captures the causal effect that a regressor has on the outcome variable, averaged over the distribution of the explanatory variables. At first impression it would seem appropriate, as many statistical packages do, to calculate average partial effects from the conditional mean,

APE = E\left(\frac{\partial E(y \mid x)}{\partial x}\right)   (13)

This definition is correct in simple cases such as linear regression and a probit model without endogeneity or heteroskedasticity. In more complicated nonlinear models, such as a binary response model where the distribution of the unobserved heterogeneity depends on the explanatory variables, for instance through its mean (endogeneity) or through its variance (heteroskedasticity), the above definition of the APE is incorrect. As clearly put in Lin and Wooldridge (2015), if E(Y | X) were the object of interest, then decades of published research on accounting for [endogenous explanatory variables] in econometric models would be irrelevant. They go on to argue that an APE derived from the partial effect on the ASF is intuitive, can be derived via counterfactual reasoning, and has the desirable property that it corresponds to the parameter of interest in linear models with endogeneity. As a result, this paper derives the APE from the ASF.

First consider the partial derivative of the average structural function with respect to the jth element of x^o, assuming regularity conditions that allow passing the derivative through the integral:

\frac{\partial}{\partial x^o_j} E\left[\Phi\!\left(\frac{x^o\beta_o + h(v_{2i}, z_i; \gamma_o)}{\exp(g(v_{2i}, z_i; \delta_o))}\right)\right]
= E\left[\frac{\partial}{\partial x^o_j}\Phi\!\left(\frac{x^o\beta_o + h(v_{2i}, z_i; \gamma_o)}{\exp(g(v_{2i}, z_i; \delta_o))}\right)\right]
= E\left[\phi\!\left(\frac{x^o\beta_o + h(v_{2i}, z_i; \gamma_o)}{\exp(g(v_{2i}, z_i; \delta_o))}\right)\frac{\beta_{j,o}}{\exp(g(v_{2i}, z_i; \delta_o))}\right]

Then the APE averages this partial derivative over the distribution of x. Using sample averages and consistent estimates of the parameters, the APEs can be estimated with

\widehat{APE}_j = \frac{1}{n}\sum_{k=1}^n \frac{1}{n}\sum_{i=1}^n \phi\!\left(\frac{x_k\hat{\beta} + h(\hat{v}_{2i}, z_i; \hat{\gamma})}{\exp(g(\hat{v}_{2i}, z_i; \hat{\delta}))}\right)\frac{\hat{\beta}_j}{\exp(g(\hat{v}_{2i}, z_i; \hat{\delta}))}   (14)

Notice that the inner sum is taken with respect to the explanatory variables in the control function and heteroskedasticity, and the outer sum averages with respect to the explanatory variables in the structural function.
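The double average in equation (14) can be computed directly. The sketch below is a minimal implementation under the same hypothetical specification of h and g used in the earlier sketches, where j indexes the element of x whose partial effect is of interest; it is an illustration of the formula, not the paper's own code.

```python
import numpy as np
from scipy.stats import norm

def ape_j(theta_hat, x, z, v2_hat, j):
    """Estimated APE for the j-th element of x, as in equation (14).

    Hypothetical specification (same as the earlier sketches):
        h = gamma[0]*v2 + gamma[1]*z[:,0]*v2,   g = delta[0]*z[:,0].
    """
    kx = x.shape[1]
    beta, gamma, delta = theta_hat[:kx], theta_hat[kx:kx + 2], theta_hat[kx + 2:]
    h = gamma[0] * v2_hat + gamma[1] * z[:, 0] * v2_hat   # control function terms h(v2_i, z_i)
    scale = np.exp(delta[0] * z[:, 0])                    # exp(g(v2_i, z_i))
    total = 0.0
    for k in range(x.shape[0]):                           # outer average over x_k
        index = (x[k] @ beta + h) / scale                 # inner average over (v2_i, z_i)
        total += np.mean(norm.pdf(index) * beta[j] / scale)
    return total / x.shape[0]
```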

5 Asymptotics

This section presents the standard asymptotic results for the proposed estimator. Useful equations and notation are defined in Appendix A. Proofs of theorems and lemmas that are not direct restatements of previous results in the literature are given in Appendix C. In general, since this is a two-step semi-parametric estimator with a conditional mean non-parametrically estimated in the first step, consistency and √n-asymptotic normality follow from Newey (1994). This does require that the score of the log-likelihood be continuous in the second stage parameters over the entire parameter space. Although it is beyond the scope of this paper, this assumption can be relaxed following Theorem 1 of Chen et al. (2003). Moreover, it is assumed that the first stage is estimated using a polynomial sieve. This could be extended to other sieve spaces or other non-parametric estimation techniques as long as their convergence rates are known.

5.1 Consistency

For now, we will assume consistency of the first stage estimates. Later, when deriving asymptotic normality, we will derive the convergence rate of the first stage estimates, so deriving consistency separately would be redundant. Consistency of the second stage estimates follows directly from Lemma 5.2 of Newey (1994). Let θ = (β', γ', δ')' have support Θ and W_i = (y_{1i}, y_{2i}, z_i) have support W, where the maximum likelihood estimation can be formatted to fit a generalized method of moments framework. Recall the sample log-likelihood function in equation (9), where the first stage parameter π enters the log-likelihood through v_{2i}, and let S_i(θ, π) denote the score of the likelihood (explicitly written in Appendix A). Note that the MLE parameter estimates also solve the sample minimization problem

\min_{\theta \in \Theta}\ \Big\|\frac{1}{n}\sum_{i=1}^n S_i(\theta, \hat{\pi})\Big\|^2   (15)

where the norm ‖·‖ is defined as ‖A‖ = (tr(A'A))^{1/2} for any matrix A. This follows the framework of Newey (1994), where the weighting matrix is the identity matrix. Moreover, let ‖·‖_d denote the Sobolev norm, defined as

\|f(x)\|_d = \sup_{|\lambda| \le d}\ \sup_{x \in \mathcal{X}} \big\|\partial^{\lambda} f(x)\big\|

The following two assumptions correspond to Assumptions 5.4 and 5.5 of Newey (1994) and are needed to apply Lemma 5.2.

Assumption 5.1 (5.4 of Newey (1994)). There are ε, b(W_i), b̃(W_i) > 0 such that (i) for all θ ∈ Θ, S_i(θ, π_o) is continuous in θ with probability 1 and ‖S_i(θ, π_o)‖ ≤ b(W_i); (ii) ‖S_i(θ, π) − S_i(θ, π_o)‖ ≤ b̃(W_i)(‖π − π_o‖_0)^ε.

This assumption provides sufficient conditions for uniform convergence, that is, sup_{θ∈Θ} ‖(1/n)Σ_{i=1}^n S_i(θ, π̂) − E(S_i(θ, π_o))‖ →_p 0. Part (i) should be easily verifiable as long as h(v_{2i}, z_i; γ) and g(v_{2i}, z_i; δ) are continuous. Note that part (ii) is a smoothness condition on S_i(θ, π) in π; to derive any lower level assumptions, one would need to know more specifically how π (or equivalently v_{2i}) enters the control function and the heteroskedastic function. The following is a standard assumption for identification.

Assumption 5.2 (5.5 of Newey (1994)). (i) E(S_i(θ, π_o)) = 0 has a unique solution on Θ at θ_o; (ii) Θ is compact.

Section 3 discussed sufficient (lower level) assumptions for part (i) (global identification). The following theorem gives consistency of the second stage estimators.

Theorem 5.1 (Consistency of θ̂). Suppose that {W_i}_{i=1}^n are i.i.d., ‖π̂ − π_o‖_0 = o_p(1), and that Assumptions 5.1 and 5.2 hold. Then ‖θ̂ − θ_o‖ = o_p(1).

The proof is a direct application of Lemma 5.2 of Newey (1994); Assumptions 5.1 and 5.2 correspond to Assumptions 5.4 and 5.5 of Newey (1994).

5.2 √n-Asymptotic Normality

This section derives the asymptotic normality of the second stage estimates using Lemma 5.3 and Theorem 6.1 of Newey (1994). Lemma 5.3 is the general √n-asymptotic normality result for two-step semi-parametric estimators, while Theorem 6.1 applies Lemma 5.3 to the case where the first stage is a least squares projection using a series estimation method. Since the estimator is based on a general model where the functions h(v_{2i}, z_i; γ) and g(v_{2i}, z_i; δ) are not specified, some of the assumptions are instead conditions that should be verified given particular specifications of the functions.

Denote the derivative of the score with respect to θ as ∇_θ S_i(θ, π) and define Γ ≡ E[∇_θ S_i(θ_o, π_o)] (both explicitly defined in Appendix A). The following set of assumptions are (or are sufficient for) Assumptions 5.6 and 6.1-6.7 of Newey (1994), which are needed to apply Theorem 6.1 of Newey (1994) to obtain √n-asymptotic normality. The first corresponds to Assumption 5.6 of Newey (1994).

Assumption 5.3 (5.6 of Newey (1994)). (i) θ_o ∈ int(Θ); (ii) there is an ε > 0, a d ≥ 0, and a neighborhood N of θ_o such that for all ‖π − π_o‖_d < ε, S_i(θ, π) is differentiable in θ on N; (iii) Γ is non-singular; (iv) E[‖S_i(θ_o, π_o)‖²] < ∞; (v) Assumption 5.1 is satisfied with S_i(θ, π) equal to a column of ∇_θ S_i(θ, π).

This assumption is sufficient for uniform convergence of the Jacobian terms. Generally part (i) is assumed, while parts (ii), (iv), and (v) can be verified given specifications of h(v_{2i}, z_i; γ) and g(v_{2i}, z_i; δ). Recall from the discussion on identification that part (iii) of this assumption is needed for the second stage parameters to be locally identified.

The next set of assumptions is used to derive the convergence rates of the first stage estimator under the Sobolev norm.

Assumption 5.4 (6.1 and 6.2(i) of Newey (1994)). (i) E[(y_{2i} − π_o(z_i))² | z_i] < ∞; (ii) the smallest eigenvalue of E[P_{L_n}(z_i)'P_{L_n}(z_i)] is bounded away from 0.

The first part is standard in the literature and difficult to relax without affecting the convergence rates. The second part holds when there is no perfect multicollinearity in the sequence of polynomial basis functions of z_i.

The next assumption provides the remaining necessary conditions for deriving the convergence rates for power series.

Assumption 5.5 (Assumptions 8 and 9 of Newey (1997)). (i) The support of z_i is a Cartesian product of compact connected intervals on which z_i has a probability density function that is bounded away from zero; (ii) π_o(z_i) = E(y_{2i} | z_i) is continuously differentiable of order s on the support of z_i.

The following is a restatement of Theorem 4 of Newey (1997) and provides the convergence rate results.

Lemma 5.1 (Theorem 4 of Newey (1997)). Suppose {(y_{2i}, z_i)}_{i=1}^n is i.i.d. If Assumptions 5.4 and 5.5 are satisfied and L_n^3/n → 0, then

\|\hat{\pi} - \pi_o\|_0 = O_p\Big(L_n\big[\sqrt{L_n/n} + L_n^{-s/(K_1+K_2)}\big]\Big)   (16)

where K_1 + K_2 is the dimension of z_i.

The condition L_n^3/n → 0 is needed to place limits on the growth of the number of series terms. The remaining assumptions needed to show asymptotic normality place conditions on the linearization D(W_i, π; θ, π̄). This function is linear in π, so that one could write D(W_i, π; θ, π̄) = D(W_i; θ, π̄)π, and in this context it is the path-wise derivative of the score with respect to the function π(·) (explicitly written in Appendix A). The next assumption states conditions that the linearization D(W_i, π; θ, π̄) needs to satisfy in order to be considered a good approximation of S_i(θ_o, π̂) − S_i(θ_o, π_o), primitive conditions for stochastic equicontinuity of the linear function D(W_i, π; θ, π̄), and lower level conditions for mean square continuity that allow the first stage estimator to be √n-consistent.

Assumption 5.6 (Newey (1994)). (i) There are ε, b(W_i) > 0 such that for all ‖θ − θ_o‖ < ε and ‖π − π_o‖_0 < ε, ‖S_i(θ, π) − S_i(θ, π_o) − D(W_i, π − π_o; θ, π_o)‖ < b(W_i)(‖π − π_o‖_0)², with E[b(W_i)] < ∞; (ii) there is a b̃(W_i) > 0 such that E[b̃(W_i)²] < ∞ and ‖D(W_i, π; θ_o, π_o)‖ ≤ b̃(W_i)‖π‖_0; (iii) let d(z_i) be defined as

d(z_i) = E\left[E\left[l_{xx}\!\left(y_{1i}, \frac{x_i\beta_o + h(\gamma_o)}{\exp(g(\delta_o))}\right)\,\Big|\, y_{2i}, z_i\right]\frac{h_v(\gamma_o) + g_v(\delta_o)(x_i\beta_o + h(\gamma_o))}{\exp(2g(\delta_o))}\Big(x_i,\ h_\gamma(\gamma_o),\ (x_i\beta_o + h(\gamma_o))\,g_\delta(\delta_o)\Big)\,\Big|\, z_i\right]

where E[l_{xx}(y_{1i}, (x_iβ_o + h(γ_o))/exp(g(δ_o))) | y_{2i}, z_i] is defined in equation (27), and let d(z_i) be continuously differentiable of order s̄ on the support of z_i such that

\sqrt{n}\,L_n^{-(s+\bar{s})/(K_1+K_2)} \to 0 \quad \text{and} \quad L_n^{-2\bar{s}/(K_1+K_2)} \to 0   (17)

This assumption should be verifiable given functional forms for h(v_{2i}, z_i; γ) and g(v_{2i}, z_i; δ). Part (i) corresponds to 6.4(i) of Newey (1994), and showing 6.4(ii) requires

\sup_{z \in \mathcal{Z}} \|P_{L_n}(z)\|\,\big[(L_n/n)^{1/2} + L_n^{-s/(K_1+K_2)}\big] \to 0   (18)

\sqrt{n}\,\big[\sup_{z \in \mathcal{Z}} \|P_{L_n}(z)\|\big]^2\,\big[L_n/n + L_n^{-2s/(K_1+K_2)}\big] \to 0   (19)

Part (ii) corresponds to the first part of 6.5 of Newey (1994), whereas the second part requires

L_n^{-2s/(K_1+K_2)} \to 0   (20)

\Big(\sum_{l=1}^{L_n} \|p_{lL_n}(z_i)\|_0^2\Big)^{1/2}\,\big[(L_n/n)^{1/2} + L_n^{-s/(K_1+K_2)}\big] \to 0   (21)

Equations (18)-(21) will hold under the rate and smoothness conditions specified in Theorem 5.2. Part (iii) is sufficient for 6.6(ii) of Newey (1994), while 6.6(i) assumes the existence of the correction term α(W_i). Since the first stage is estimating a conditional mean, by Proposition 4 of Newey (1994) the correction term will be of the form α(W_i) = d(z_i)[y_{2i} − π_o(z_i)], where d(z_i) satisfies

E[D(W_i, \pi; \theta_o, \pi_o)] = E[d(z_i)\pi(z_i)]   (22)

The existence of such a function d(z_i) follows from the Riesz Representation Theorem. The proposed d(z_i) from Assumption 5.6 is shown to satisfy equation (22) in Appendix A. Last, we define the asymptotic variance,

V \equiv \mathrm{Avar}\big(\sqrt{n}(\hat{\theta} - \theta_o)\big) = \Gamma^{-1} + \Gamma^{-1}E\big[d(z_i)'[y_{2i} - \pi_o(z_i)]^2 d(z_i)\big]\Gamma^{-1}   (23)

which follows from equation (5.2) of Newey (1994). Finally, the general √n-asymptotic normality result is stated below.

Theorem 5.2 (Asymptotic Normality of θ̂). Suppose that θ_o ∈ Θ satisfies E[S_i(θ_o, π_o)] = 0 (or that the specification in Assumption 2.1 holds) and {W_i}_{i=1}^n is i.i.d. Then, under Assumptions 5.1-5.6 and the following growth rate and smoothness conditions

(i) L_n^6/n → 0,
(ii) s/(K_1 + K_2) > 5/2,

and for V defined in equation (23),

\sqrt{n}(\hat{\theta} - \theta_o) \to_d N(0, V)

Conditions (i) and (ii) are the growth rate conditions on L_n under which equations (18)-(19) and (20)-(21) can be shown to hold. First, since the estimation procedure uses a polynomial sieve, by Lemma A.15 of Newey (1995), sup_{z∈Z} ‖P_{L_n}(z)‖ < C L_n for some constant C and ‖p_{lL_n}(z_i)‖_0 < L_n^{1/2}. Then equations (18)-(21) become

(L_n^3/n)^{1/2} + L_n^{1 - s/(K_1+K_2)} \to 0   (18')

\sqrt{n}\,\big[L_n^3/n + L_n^{2 - 2s/(K_1+K_2)}\big] \to 0   (19')

L_n^{-2s/(K_1+K_2)} \to 0   (20')

(L_n^3/n)^{1/2} + L_n^{1 - s/(K_1+K_2)} \to 0   (21')

which easily hold given the rate and smoothness conditions (i) and (ii) of Theorem 5.2. An estimator for the asymptotic variance is provided in Appendix B, along with the regularity conditions needed for the estimator to be consistent. Because of the numerical equivalence results of Newey (1994), Chen (2007), and Ackerberg et al. (2012), the consistent estimator for the asymptotic variance is equivalent to the estimator that would be used if the first stage parameters were finite dimensional. As a result, implementing the calculations for correct standard errors is fairly simple.

6 Simulation

I will consider two different simulation designs in order to highlight the impact of the CI assumption. The first design specifies the conditional distribution of u_{1i} to be normal, but CI does not hold in the conditional mean (control function) or conditional variance (heteroskedasticity). The second design examines in more detail the ability of the BP and Rothe estimators to handle relaxing the CI assumption in terms of the linear

index x_iβ_o. In both designs, I will compare the proposed estimator to a sieve variation on the BP and Rothe estimators, as well as standard parametric approaches: probit, probit with a control variable, and linear probability models such as OLS and 2SLS.

6.1 Conditional Independence does not hold

To give this simulation some context, consider the BP application to the labor force participation of British men (without college education) during 1985 to 1990. Consider the latent variable setting as described in equation (1), where x_i = (z_{1i}, y_{2i}). In this application, y_{1i} is labor force participation, z_{1i} is the education level of the husband (and possibly other factors that can influence the market wage level), and y_{2i} is the log of other (non-wage) income. The usual argument for endogeneity is that other non-wage income is contemporaneously determined with labor force participation. They consider two instruments: the first, z_{21i}, is the potential welfare benefit entitlement of the family if neither spouse was working,[10] and the second, z_{22i}, is the wife's education level. The education variables, z_{1i} and z_{22i}, are indicators for whether or not the individual stayed in school past the minimum school leaving age of 16. As usual, the argument is that potential welfare benefit entitlement and the wife's education level should have no direct impact on the husband's labor force participation except through non-wage income.

Why should CI not hold in this set-up? Consider the scenario where individual i has a particularly large shock to his outside income (inheritance, lottery, etc.). I would argue that the probability that he is in the labor force depends on his education level. For instance, someone with higher education is more likely pursuing a passion, and therefore a positive shock to income would likely not dissuade him from working. In addition, one would expect more variability among lower educated individuals than higher educated ones. As a result, I construct the conditional distribution of u_{1i} conditional on z_i and v_{2i} as

u_{1i} \mid z_i, v_{2i} \sim N\big(v_{2i}\gamma_1 + (v_{2i}^2 - \sigma_v^2)\gamma_2 + z_{1i}v_{2i}\gamma_3,\ \exp(2(z_{1i}\delta_1 + z_{22i}\delta_2))\big)   (24)

so that CI does not hold. Furthermore, the conditional distribution cannot be written as a function of v_{2i} and the linear index x_iβ alone; therefore we would expect the BP and Rothe estimator to fail in both the ASF and APE estimates as well as the coefficient estimates.

Footnote 10: In the application of BP, this variable is constructed from local welfare benefit rules, the demographic structure of the family, the geographic location, and housing costs.
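As a rough sketch of how a sample could be drawn under design (24), the code below generates one simulated data set; the instrument distributions, the reduced form for y_{2i}, and all parameter values are placeholders chosen for illustration (the actual construction and calibration, which match the BP summary statistics, are described in Appendix D).

```python
import numpy as np

def simulate_design_1(n, rng, gamma=(0.5, 0.2, -0.3), delta=(-0.4, 0.3)):
    """Draw one sample in which CI fails as in equation (24); all values are placeholders."""
    z1 = (rng.random(n) < 0.4).astype(float)           # husband education indicator (placeholder)
    z21 = rng.normal(size=n)                            # ln(benefit income) instrument (placeholder)
    z22 = (rng.random(n) < 0.4).astype(float)           # wife education indicator (placeholder)
    sigma_v = 1.0
    v2 = rng.normal(scale=sigma_v, size=n)
    y2 = 0.5 + 0.3 * z1 + 0.4 * z21 + 0.2 * z22 + v2    # reduced form for ln(other income)
    # u1 | z, v2 as in equation (24): CI fails through both the mean and the variance
    mean_u = gamma[0] * v2 + gamma[1] * (v2**2 - sigma_v**2) + gamma[2] * z1 * v2
    sd_u = np.exp(delta[0] * z1 + delta[1] * z22)
    u1 = mean_u + sd_u * rng.normal(size=n)
    x = np.column_stack([np.ones(n), z1, y2])           # x_i = (1, z1_i, y2_i)
    beta = np.array([0.5, 0.3, -0.5])                   # placeholder structural coefficients
    y1 = (x @ beta + u1 >= 0).astype(float)             # binary outcome from the latent index
    return y1, y2, z1, z21, z22

rng = np.random.default_rng(123)
y1, y2, z1, z21, z22 = simulate_design_1(1606, rng)    # sample size matching the application
```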

Table 1: Comparison of Summary Statistics. Means and standard deviations of Work (y_1), Education > 16 (z_1), ln(other income) (y_2), ln(benefit income) (z_{21}), and Education (spouse) (z_{22}) for the Blundell and Powell data and for the simulated data; based on 1,000 simulations with a sample size of 1,606.

The construction of the random variables z_{1i}, z_{21i}, z_{22i}, and y_{2i} is discussed in Appendix D. Generating 1,000 simulations with a sample size of 1,606, Table 1 presents the summary statistics from BP as well as the summary statistics from the simulated samples. The parameter values in the construction of the random variables z_{1i}, z_{21i}, z_{22i}, and y_{2i} are calibrated so that the summary statistics of the simulated data match those from BP. Table 2 presents the results for parametric specifications (corresponding to Table 4.2 of BP).

Table 2: Comparison of Parametric Estimators. Reduced Form, Probit, and Probit (CF) estimates, in columns (1)-(3) for the Blundell and Powell data and columns (4)-(6) for the simulated data, for Education (z_1), ln(other income) (y_2), ln(benefit income) (z_{21}), and Education (spouse) (z_{22}), along with the R². Standard errors (for Blundell and Powell) and standard deviations (for the simulated data) are given in parentheses. Omitted are the estimates of the intercept and the coefficient on the control function (v̂_{2i}) in columns (3) and (6).

The simulated data is not an exact replica of the BP data, but it still delivers the main takeaway: the impact of adjusting for endogeneity is quite dramatic. This is reflected in the BP data in the comparison of the coefficient estimates in columns (2) vs (3). In the simulated data we see a similar effect when comparing columns (5) and (6).
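For the parametric comparisons in Table 2, a minimal sketch of the probit and probit-with-control-function columns, using statsmodels and the placeholder sample from the DGP sketch above, might look as follows (a linear reduced form is used here purely for brevity).

```python
import numpy as np
import statsmodels.api as sm

# Probit and Probit (CF) on one simulated sample; illustrative only.
# Uses y1, y2, z1, z21, z22 from the DGP sketch above.
Z = sm.add_constant(np.column_stack([z1, z21, z22]))
v2_hat = y2 - Z @ sm.OLS(y2, Z).fit().params           # linear reduced form residuals

X = sm.add_constant(np.column_stack([z1, y2]))
probit = sm.Probit(y1, X).fit(disp=0)                   # ignores endogeneity
probit_cf = sm.Probit(y1, np.column_stack([X, v2_hat])).fit(disp=0)   # adds the control function
print(probit.params, probit_cf.params)
```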

Table 3: APE Results and Simulated Distribution (True APE = …). Reported for each specification: the mean and standard deviation of the APE estimates over the simulations and the 10%, 25%, 50%, 75%, and 90% quantiles of their empirical distribution. Specifications: Het-2SCML, Het-2SCML (true first stage), BP and Rothe (Sieve), Probit, Probit (CF), Lin. Prob. (OLS), and Lin. Prob. (2SLS). Het-2SCML is the proposed estimator; Het-2SCML (true first stage) is the proposed estimator using the true values of the control variable (v_{2i}) instead of the reduced form residuals; and BP and Rothe is the sieve analogue of the BP and Rothe estimators, estimated using polynomials of (x_iβ, v̂_{2i}) up to order 3.

Table 3 provides the APE estimates using different estimation specifications on the simulated data. The first column provides the mean of the APE estimates over the simulations, the second column is the standard deviation, and the last five columns are the quantiles of the empirical distribution of the simulated APE estimates. The first step (when applicable) uses a polynomial sieve up to order three. The first three rows are semi-parametric estimators, the next two are parametric probit models, and the last two are linear probability models.

The most prominent result is that addressing endogeneity in any of the specifications makes the largest impact (compare probit vs probit (CF) and lin. prob. (OLS) vs lin. prob. (2SLS)). In addition, the proposed Het-2SCML performs better than all the other specifications. This is unsurprising, since CI does not hold, so the BP and Rothe estimator should not perform well. To give the difference some context: the Het-2SCML would estimate that for every 10% increase in other income, the probability of being employed decreases by … while BP and Rothe would estimate a decrease of …. There is also a much larger spread in the simulated distribution of the BP and Rothe estimated APE, in which the 90th quantile is positive. As a result, the BP and Rothe estimated APE would not be able to reject the null that the APE is 0 under standard confidence levels. In addition, we see that the probit estimator with the control function performs quite well even though there is heteroskedasticity and a more complex control function than what was included. It is concerning that even though the proposed estimator performed the best, there seems to be an upward


More information

Econometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 8 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 25 Recommended Reading For the today Instrumental Variables Estimation and Two Stage

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

Lecture 11/12. Roy Model, MTE, Structural Estimation

Lecture 11/12. Roy Model, MTE, Structural Estimation Lecture 11/12. Roy Model, MTE, Structural Estimation Economics 2123 George Washington University Instructor: Prof. Ben Williams Roy model The Roy model is a model of comparative advantage: Potential earnings

More information

Chapter 6. Panel Data. Joan Llull. Quantitative Statistical Methods II Barcelona GSE

Chapter 6. Panel Data. Joan Llull. Quantitative Statistical Methods II Barcelona GSE Chapter 6. Panel Data Joan Llull Quantitative Statistical Methods II Barcelona GSE Introduction Chapter 6. Panel Data 2 Panel data The term panel data refers to data sets with repeated observations over

More information

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Arthur Lewbel Boston College December 2016 Abstract Lewbel (2012) provides an estimator

More information

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Arthur Lewbel Boston College Original December 2016, revised July 2017 Abstract Lewbel (2012)

More information

CENSORED DATA AND CENSORED NORMAL REGRESSION

CENSORED DATA AND CENSORED NORMAL REGRESSION CENSORED DATA AND CENSORED NORMAL REGRESSION Data censoring comes in many forms: binary censoring, interval censoring, and top coding are the most common. They all start with an underlying linear model

More information

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Estimation - Theory Department of Economics University of Gothenburg December 4, 2014 1/28 Why IV estimation? So far, in OLS, we assumed independence.

More information

Estimating Panel Data Models in the Presence of Endogeneity and Selection

Estimating Panel Data Models in the Presence of Endogeneity and Selection ================ Estimating Panel Data Models in the Presence of Endogeneity and Selection Anastasia Semykina Department of Economics Florida State University Tallahassee, FL 32306-2180 asemykina@fsu.edu

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

Applied Economics. Regression with a Binary Dependent Variable. Department of Economics Universidad Carlos III de Madrid

Applied Economics. Regression with a Binary Dependent Variable. Department of Economics Universidad Carlos III de Madrid Applied Economics Regression with a Binary Dependent Variable Department of Economics Universidad Carlos III de Madrid See Stock and Watson (chapter 11) 1 / 28 Binary Dependent Variables: What is Different?

More information

IV estimators and forbidden regressions

IV estimators and forbidden regressions Economics 8379 Spring 2016 Ben Williams IV estimators and forbidden regressions Preliminary results Consider the triangular model with first stage given by x i2 = γ 1X i1 + γ 2 Z i + ν i and second stage

More information

Estimation of Dynamic Panel Data Models with Sample Selection

Estimation of Dynamic Panel Data Models with Sample Selection === Estimation of Dynamic Panel Data Models with Sample Selection Anastasia Semykina* Department of Economics Florida State University Tallahassee, FL 32306-2180 asemykina@fsu.edu Jeffrey M. Wooldridge

More information

Missing dependent variables in panel data models

Missing dependent variables in panel data models Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units

More information

Short Questions (Do two out of three) 15 points each

Short Questions (Do two out of three) 15 points each Econometrics Short Questions Do two out of three) 5 points each ) Let y = Xβ + u and Z be a set of instruments for X When we estimate β with OLS we project y onto the space spanned by X along a path orthogonal

More information

Comments on: Panel Data Analysis Advantages and Challenges. Manuel Arellano CEMFI, Madrid November 2006

Comments on: Panel Data Analysis Advantages and Challenges. Manuel Arellano CEMFI, Madrid November 2006 Comments on: Panel Data Analysis Advantages and Challenges Manuel Arellano CEMFI, Madrid November 2006 This paper provides an impressive, yet compact and easily accessible review of the econometric literature

More information

Linear Regression. Junhui Qian. October 27, 2014

Linear Regression. Junhui Qian. October 27, 2014 Linear Regression Junhui Qian October 27, 2014 Outline The Model Estimation Ordinary Least Square Method of Moments Maximum Likelihood Estimation Properties of OLS Estimator Unbiasedness Consistency Efficiency

More information

Introduction to Econometrics

Introduction to Econometrics Introduction to Econometrics T H I R D E D I T I O N Global Edition James H. Stock Harvard University Mark W. Watson Princeton University Boston Columbus Indianapolis New York San Francisco Upper Saddle

More information

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data July 2012 Bangkok, Thailand Cosimo Beverelli (World Trade Organization) 1 Content a) Classical regression model b)

More information

The Influence Function of Semiparametric Estimators

The Influence Function of Semiparametric Estimators The Influence Function of Semiparametric Estimators Hidehiko Ichimura University of Tokyo Whitney K. Newey MIT July 2015 Revised January 2017 Abstract There are many economic parameters that depend on

More information

ECONOMETRICS FIELD EXAM Michigan State University May 9, 2008

ECONOMETRICS FIELD EXAM Michigan State University May 9, 2008 ECONOMETRICS FIELD EXAM Michigan State University May 9, 2008 Instructions: Answer all four (4) questions. Point totals for each question are given in parenthesis; there are 00 points possible. Within

More information

Lecture 8. Roy Model, IV with essential heterogeneity, MTE

Lecture 8. Roy Model, IV with essential heterogeneity, MTE Lecture 8. Roy Model, IV with essential heterogeneity, MTE Economics 2123 George Washington University Instructor: Prof. Ben Williams Heterogeneity When we talk about heterogeneity, usually we mean heterogeneity

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models University of Illinois Fall 2016 Department of Economics Roger Koenker Economics 536 Lecture 7 Introduction to Specification Testing in Dynamic Econometric Models In this lecture I want to briefly describe

More information

ECON Introductory Econometrics. Lecture 16: Instrumental variables

ECON Introductory Econometrics. Lecture 16: Instrumental variables ECON4150 - Introductory Econometrics Lecture 16: Instrumental variables Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 12 Lecture outline 2 OLS assumptions and when they are violated Instrumental

More information

Locally Robust Semiparametric Estimation

Locally Robust Semiparametric Estimation Locally Robust Semiparametric Estimation Victor Chernozhukov Juan Carlos Escanciano Hidehiko Ichimura Whitney K. Newey The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper

More information

Nonlinear Regression Functions

Nonlinear Regression Functions Nonlinear Regression Functions (SW Chapter 8) Outline 1. Nonlinear regression functions general comments 2. Nonlinear functions of one variable 3. Nonlinear functions of two variables: interactions 4.

More information

Final Exam. Economics 835: Econometrics. Fall 2010

Final Exam. Economics 835: Econometrics. Fall 2010 Final Exam Economics 835: Econometrics Fall 2010 Please answer the question I ask - no more and no less - and remember that the correct answer is often short and simple. 1 Some short questions a) For each

More information

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, 2015-16 Academic Year Exam Version: A INSTRUCTIONS TO STUDENTS 1 The time allowed for this examination paper is 2 hours. 2 This

More information

Jeffrey M. Wooldridge Michigan State University

Jeffrey M. Wooldridge Michigan State University Fractional Response Models with Endogenous Explanatory Variables and Heterogeneity Jeffrey M. Wooldridge Michigan State University 1. Introduction 2. Fractional Probit with Heteroskedasticity 3. Fractional

More information

Generated Covariates in Nonparametric Estimation: A Short Review.

Generated Covariates in Nonparametric Estimation: A Short Review. Generated Covariates in Nonparametric Estimation: A Short Review. Enno Mammen, Christoph Rothe, and Melanie Schienle Abstract In many applications, covariates are not observed but have to be estimated

More information

Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16)

Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) 1 2 Model Consider a system of two regressions y 1 = β 1 y 2 + u 1 (1) y 2 = β 2 y 1 + u 2 (2) This is a simultaneous equation model

More information

Working Paper No Maximum score type estimators

Working Paper No Maximum score type estimators Warsaw School of Economics Institute of Econometrics Department of Applied Econometrics Department of Applied Econometrics Working Papers Warsaw School of Economics Al. iepodleglosci 64 02-554 Warszawa,

More information

8. Instrumental variables regression

8. Instrumental variables regression 8. Instrumental variables regression Recall: In Section 5 we analyzed five sources of estimation bias arising because the regressor is correlated with the error term Violation of the first OLS assumption

More information

Greene, Econometric Analysis (6th ed, 2008)

Greene, Econometric Analysis (6th ed, 2008) EC771: Econometrics, Spring 2010 Greene, Econometric Analysis (6th ed, 2008) Chapter 17: Maximum Likelihood Estimation The preferred estimator in a wide variety of econometric settings is that derived

More information

13 Endogeneity and Nonparametric IV

13 Endogeneity and Nonparametric IV 13 Endogeneity and Nonparametric IV 13.1 Nonparametric Endogeneity A nonparametric IV equation is Y i = g (X i ) + e i (1) E (e i j i ) = 0 In this model, some elements of X i are potentially endogenous,

More information

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2016 Instructor: Victor Aguirregabiria

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2016 Instructor: Victor Aguirregabiria ECOOMETRICS II (ECO 24S) University of Toronto. Department of Economics. Winter 26 Instructor: Victor Aguirregabiria FIAL EAM. Thursday, April 4, 26. From 9:am-2:pm (3 hours) ISTRUCTIOS: - This is a closed-book

More information

UNIVERSITY OF CALIFORNIA Spring Economics 241A Econometrics

UNIVERSITY OF CALIFORNIA Spring Economics 241A Econometrics DEPARTMENT OF ECONOMICS R. Smith, J. Powell UNIVERSITY OF CALIFORNIA Spring 2006 Economics 241A Econometrics This course will cover nonlinear statistical models for the analysis of cross-sectional and

More information

AGEC 661 Note Fourteen

AGEC 661 Note Fourteen AGEC 661 Note Fourteen Ximing Wu 1 Selection bias 1.1 Heckman s two-step model Consider the model in Heckman (1979) Y i = X iβ + ε i, D i = I {Z iγ + η i > 0}. For a random sample from the population,

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Rewrap ECON November 18, () Rewrap ECON 4135 November 18, / 35

Rewrap ECON November 18, () Rewrap ECON 4135 November 18, / 35 Rewrap ECON 4135 November 18, 2011 () Rewrap ECON 4135 November 18, 2011 1 / 35 What should you now know? 1 What is econometrics? 2 Fundamental regression analysis 1 Bivariate regression 2 Multivariate

More information

Binary Models with Endogenous Explanatory Variables

Binary Models with Endogenous Explanatory Variables Binary Models with Endogenous Explanatory Variables Class otes Manuel Arellano ovember 7, 2007 Revised: January 21, 2008 1 Introduction In Part I we considered linear and non-linear models with additive

More information

Chapter 11. Regression with a Binary Dependent Variable

Chapter 11. Regression with a Binary Dependent Variable Chapter 11 Regression with a Binary Dependent Variable 2 Regression with a Binary Dependent Variable (SW Chapter 11) So far the dependent variable (Y) has been continuous: district-wide average test score

More information

Warwick Economics Summer School Topics in Microeconometrics Instrumental Variables Estimation

Warwick Economics Summer School Topics in Microeconometrics Instrumental Variables Estimation Warwick Economics Summer School Topics in Microeconometrics Instrumental Variables Estimation Michele Aquaro University of Warwick This version: July 21, 2016 1 / 31 Reading material Textbook: Introductory

More information

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2014 Instructor: Victor Aguirregabiria

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2014 Instructor: Victor Aguirregabiria ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2014 Instructor: Victor guirregabiria SOLUTION TO FINL EXM Monday, pril 14, 2014. From 9:00am-12:00pm (3 hours) INSTRUCTIONS:

More information

Lecture #8 & #9 Multiple regression

Lecture #8 & #9 Multiple regression Lecture #8 & #9 Multiple regression Starting point: Y = f(x 1, X 2,, X k, u) Outcome variable of interest (movie ticket price) a function of several variables. Observables and unobservables. One or more

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

Models of Qualitative Binary Response

Models of Qualitative Binary Response Models of Qualitative Binary Response Probit and Logit Models October 6, 2015 Dependent Variable as a Binary Outcome Suppose we observe an economic choice that is a binary signal. The focus on the course

More information

ECON3327: Financial Econometrics, Spring 2016

ECON3327: Financial Econometrics, Spring 2016 ECON3327: Financial Econometrics, Spring 2016 Wooldridge, Introductory Econometrics (5th ed, 2012) Chapter 11: OLS with time series data Stationary and weakly dependent time series The notion of a stationary

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

More information

2. Linear regression with multiple regressors

2. Linear regression with multiple regressors 2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions

More information

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Vadim Marmer University of British Columbia Artyom Shneyerov CIRANO, CIREQ, and Concordia University August 30, 2010 Abstract

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

Thoughts on Heterogeneity in Econometric Models

Thoughts on Heterogeneity in Econometric Models Thoughts on Heterogeneity in Econometric Models Presidential Address Midwest Economics Association March 19, 2011 Jeffrey M. Wooldridge Michigan State University 1 1. Introduction Much of current econometric

More information

Review of Econometrics

Review of Econometrics Review of Econometrics Zheng Tian June 5th, 2017 1 The Essence of the OLS Estimation Multiple regression model involves the models as follows Y i = β 0 + β 1 X 1i + β 2 X 2i + + β k X ki + u i, i = 1,...,

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data?

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai Department of Politics Center for Statistics and Machine Learning Princeton University

More information

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity Zhengyu Zhang School of Economics Shanghai University of Finance and Economics zy.zhang@mail.shufe.edu.cn

More information

A Goodness-of-fit Test for Copulas

A Goodness-of-fit Test for Copulas A Goodness-of-fit Test for Copulas Artem Prokhorov August 2008 Abstract A new goodness-of-fit test for copulas is proposed. It is based on restrictions on certain elements of the information matrix and

More information

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Songnian Chen a, Xun Lu a, Xianbo Zhou b and Yahong Zhou c a Department of Economics, Hong Kong University

More information

Chapter 1. GMM: Basic Concepts

Chapter 1. GMM: Basic Concepts Chapter 1. GMM: Basic Concepts Contents 1 Motivating Examples 1 1.1 Instrumental variable estimator....................... 1 1.2 Estimating parameters in monetary policy rules.............. 2 1.3 Estimating

More information

Introduction: structural econometrics. Jean-Marc Robin

Introduction: structural econometrics. Jean-Marc Robin Introduction: structural econometrics Jean-Marc Robin Abstract 1. Descriptive vs structural models 2. Correlation is not causality a. Simultaneity b. Heterogeneity c. Selectivity Descriptive models Consider

More information

Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances

Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances Discussion Paper: 2006/07 Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances J.S. Cramer www.fee.uva.nl/ke/uva-econometrics Amsterdam School of Economics Department of

More information

A Discontinuity Test for Identification in Nonparametric Models with Endogeneity

A Discontinuity Test for Identification in Nonparametric Models with Endogeneity A Discontinuity Test for Identification in Nonparametric Models with Endogeneity Carolina Caetano 1 Christoph Rothe 2 Nese Yildiz 1 1 Department of Economics 2 Department of Economics University of Rochester

More information

We begin by thinking about population relationships.

We begin by thinking about population relationships. Conditional Expectation Function (CEF) We begin by thinking about population relationships. CEF Decomposition Theorem: Given some outcome Y i and some covariates X i there is always a decomposition where

More information

Semiparametric Estimation of a Sample Selection Model in the Presence of Endogeneity

Semiparametric Estimation of a Sample Selection Model in the Presence of Endogeneity Semiparametric Estimation of a Sample Selection Model in the Presence of Endogeneity Jörg Schwiebert Abstract In this paper, we derive a semiparametric estimation procedure for the sample selection model

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

ivporbit:an R package to estimate the probit model with continuous endogenous regressors

ivporbit:an R package to estimate the probit model with continuous endogenous regressors MPRA Munich Personal RePEc Archive ivporbit:an R package to estimate the probit model with continuous endogenous regressors Taha Zaghdoudi University of Jendouba, Faculty of Law Economics and Management

More information

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS Olivier Scaillet a * This draft: July 2016. Abstract This note shows that adding monotonicity or convexity

More information

Econometrics II - EXAM Answer each question in separate sheets in three hours

Econometrics II - EXAM Answer each question in separate sheets in three hours Econometrics II - EXAM Answer each question in separate sheets in three hours. Let u and u be jointly Gaussian and independent of z in all the equations. a Investigate the identification of the following

More information

Next, we discuss econometric methods that can be used to estimate panel data models.

Next, we discuss econometric methods that can be used to estimate panel data models. 1 Motivation Next, we discuss econometric methods that can be used to estimate panel data models. Panel data is a repeated observation of the same cross section Panel data is highly desirable when it is

More information

Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation

Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation Maria Ponomareva University of Western Ontario May 8, 2011 Abstract This paper proposes a moments-based

More information

WISE International Masters

WISE International Masters WISE International Masters ECONOMETRICS Instructor: Brett Graham INSTRUCTIONS TO STUDENTS 1 The time allowed for this examination paper is 2 hours. 2 This examination paper contains 32 questions. You are

More information

Specification testing in panel data models estimated by fixed effects with instrumental variables

Specification testing in panel data models estimated by fixed effects with instrumental variables Specification testing in panel data models estimated by fixed effects wh instrumental variables Carrie Falls Department of Economics Michigan State Universy Abstract I show that a handful of the regressions

More information