Identi cation and Estimation of Semiparametric Two Step Models

Size: px

Start display at page:

Download "Identi cation and Estimation of Semiparametric Two Step Models"

Sharon Harrison
6 years ago
Views:

1 Identi cation and Estimation of Semiparametric Two Step Models Juan Carlos Escanciano y Indiana University David Jacho-Chávez z Indiana University Arthur Lewbel x Boston College original May 2010, revised April 2011 Abstract Let g 0 (X) be a function of some observable vector X that is identi ed and can be nonparametrically estimated. This paper provides new results on the identi cation and estimation of the function F and the vector 0 when E(Y jx) = F [X > 0 ; g 0 (X)]. Many models t this framework, including latent index models with an endogenous regressor, and nonlinear models with sample selection. Our results show that identi cation based on functional form, without exclusions or instruments, extends to this semiparametric model. The asymptotic properties of the proposed estimator are obtained allowing for data dependent bandwidths and random trimming. We include Monte Carlo simulations and an empirical application. Keywords: Double index models; two step estimators; semiparametric regression; control function estimators; sample selection models; empirical process theory; limited dependent variables; migration. JEL classi cation: C13; C14; C21; D24 We especially want to thank Yingying Dong, for providing data and other assistance on our empirical application, and Je rey Racine for his advice on how to use the np package in our reported estimation results. We would also like to thank Hidehiko Ichimura, Ingrid Van Keilegom, Adonis Yatchew, Simon Lee, and participants of the International Symposium on Econometric Theory and Applications (SETA 2009) for many helpful comments. We acknowledge the usage of the Big Red high performance cluster at Indiana University where all computations were performed. All errors are our own. y Department of Economics, Indiana University, 105 Wylie Hall, 100 South Woodlawn Avenue, Bloomington, IN , USA. Phone: +1 (812) Fax: +1 (812) jescanci@indiana.edu. Web Page: jescanci/ Research funded by the Spanish Plan Nacional de I+D+I, reference number SEJ z Department of Economics, Indiana University, 105 Wylie Hall, 100 South Woodlawn Avenue, Bloomington, IN , USA. Phone: +1 (812) Fax: +1 (812) djachoch@indiana.edu. Web Page: djachoch/ x Corresponding Author: Department of Economics, Boston College, 140 Commonwealth Avenue, Chesnut Hill, MA 02467, USA. lewbel@bc.edu. Web Page: lewbel/ 1

2 1 Introduction We provide new identi cation and estimation results for two step estimators with a nonparametric rst step. Given an observable vector X, suppose g 0 (X) is some identi ed function that can be nonparametrically estimated, e.g. g 0 (X) could be a conditional mean, quantile, distribution, or density function. This paper then considers identi cation of the function F and the vector 0 where E(Y jx) = M(X) = F [X > 0 ; g 0 (X)]. (1.1) for some observed outcome Y. Equation (1.1) is a double index model, with one linear index X > 0 and one general index function g 0 (X). If g 0 (X) represents a conditional mean, the type of estimators we consider begin with a uniformly consistent rst step estimator bg of g 0, e.g. the Nadaraya-Watson estimator. Let F b be a kernel estimator of the conditional mean function of Y given X > and bg. In a second step, we construct estimators b and F b by minimizing a weighted sum of squares of Y F b. For example, in the standard Heckman (1979) selection model, F [X > 0 ; g 0 (X)] = X > 0 + g 0 (X) 0, where g 0 (X) is the inverse mills ratio of a rst step selection regression, and estimation of and F corresponds to a linear (possibly weighted) least squares regression for the coe cients and. As described below, many common econometric models involving either selection or an endogenous regressor t this framework of two step estimation. For estimation, we use a new uniform-in-bandwidth expansion for sample means of weighted semiparametric residuals recently obtained in Escanciano, Jacho-Chávez and Lewbel (2011), to establish the asymptotic normality of the proposed two-step semiparametric least squares estimator with datadriven bandwidths, random trimming and estimated weights. Unlike other well-known approaches such as Newey and McFadden (1994, p. 2197), Chen, Linton, and van Keilegom (2003) and Ichimura and Lee (2010), results in Escanciano, Jacho-Chávez and Lewbel (2011) avoid functional derivatives, which are specially di cult to deal with in the current setting, see e.g. Rothe (2009). For identi cation, we show that standard assumptions such as exclusion restrictions (e.g., knowing that some elements of 0 are zero, corresponding to having instruments for g) are stronger than necessary. Essentially, we show that what is usually considered to be identi cation based on functional form can in fact apply to a semiparametric setting like ours where nothing more than linearity of X > 0 is parameterized. We apply these identi cation and estimation results in an empirical application. We estimate a migration decision model, which is a control function speci cation of a binary choice model with an endogenous regressor. Our new identi cation result is required because exclusions are not economically plausible in this application. Our new limiting distribution theory for this application accounts for data dependent bandwidth choice and random trimming. The rest of the paper is organized as follows: Section 2 provides some background information, including examples of models that t our framework. Section 3 gives our general identi cation theorems and Section 4.2 provides the asymptotic theory for our estimator. A Monte Carlo experiment and an empirical application to migration data is presented in Section 5. Section 6 concludes. All proofs are gathered into the Appendix. 2

3 2 Background Models of the form M(X) = F [r 0 (X); g 0 (X)] where M(X) can be estimated are called double index models. Examples of estimators of double or multiple index models include Ichimura and Lee (1991), Pinkse (2001), and Lewbel and Linton (2007). Additional models where both r 0 (X) and g 0 (X) are linear indices include the Sliced Inverse Regression models of Li (1991) and arti cial neural networks. This paper focuses on the case where just one of the two indices is linear, and hence parameterized as X > 0. This is an appropriate assumption in contexts where X > 0 arises as part of a structural model, while g 0 (X) is a nuisance function associated with selection or endogeneity of a regressor. In the semiparametric literature, when one or more of the functions g 0, r 0, F are not parameterized, identi cation is generally obtained by exclusion restrictions, that is, some element of X is assumed to drop out of either r 0 (X) or g 0 (X). In models where all of the unknown functions are parameterized, exclusion restrictions are generally not required for identi cation, and we may instead obtain identi- cation by functional form. For example, the Heckman (1979) selection model does not require an exclusion restriction for identi cation, since identi cation is obtained by parameterization of the joint error distribution in the selection and outcome equations. Empirically, identi cation by exclusion restrictions is generally considered more reliable than identi cation based on functional form, but it is often the case that exclusions are hard to nd or to plausibly impose. Identi cation by functional form also provides a way to test exclusion restrictions, since it nests models with exclusions. This paper shows that identi cation by functional form extends to the semiparametric model (1.1). In particular, it is shown that 0 and F can be identi ed without exclusion restrictions under some relatively mild regularity conditions (essentially, nonlinearity in g 0 and some inequalities su ce). So for example in Heckman s model identi cation by functional form does not actually require a parameterized functional form for the error distributions or for the selection index g 0 (X). One large class of models that t in this paper s framework are endogenous regressor models. Suppose Y = H(Z > 0 + W 0 ; e) for some possibly unknown function H, where W is an endogenous regressor with W = s 0 (Z) + u, and e and u are unobserved correlated error terms. Let E(ujZ) = 0 and assume the endogeneity takes the control function form of e = h 0 (u; v) for some function h 0, where v is an unobserved error that is independent of W and Z. Another way to describe the endogeneity is to say ejz; u = eju. De ne X > 0 := Z > 0 + W 0, and g 0 (X) := W s 0 (Z). Then equation (1.1) holds with F de ned by E(Y jx) = E[H(X > 0 ; h 0 (g 0 (X) ; v))jx] =: F X > 0 ; g 0 (X). An example of this type of endogenous regressor model is Rivers and Vuong (1988). More generally, for Y binary this is Blundell and Powell s (2004) semiparametric binary choice model with an endogenous regressor, so this paper s identi cation results show that Blundell and Powell s control function model is generally identi ed (and it can be estimated using their estimator or Rothe s 2009 estimator), without the exclusion restrictions they impose for identi cation and estimation. Our empirical application to estimation of a migration equation is an example of this model. Another large class of models that t this paper s framework are limited dependent variable models with selection. Suppose Y = H(X > 0 + e) for some function H, e.g., H could be the identity function so Y = X > 0 + e is a linear model, or Y could be a binomial response with Y = I(X > 0 + e > 0) 3

4 for I being the indicator function that equals one if its argument is true and zero otherwise, or H could be a censored regression such as Y = max(0; X > 0 + e). Suppose in addition to this possibly limited dependent variable we also have nonrandom selection, so we do not observe Y but instead observe (Y; D; X > ) where Y = Y D = H(X > 0 +e)d, implying that Y is only observed when D = 1. Suppose D = I[s 0 (X) + u > 0] where the conditional distribution u given e is continuous. Nonrandom selection arises because the unobserved errors e and u, though independent of X, are correlated with each other. It then follows that g 0 (X) = E(DjX) = F u [s 0 (X)] where F u is marginal distribution function of u, and equation (1.1) holds with F de ned by E(Y jx) = E[H X > 0 + e I(F 1 u [g 0 (X)] + u > 0)jX] =: F X > 0 ; g 0 (X). This includes a wide class of selection models, including standard Heckman type selection models, extensions of tobit models like double hurdle models, and censored binary choice models, among others. 3 Identi cation Here we provide su cient conditions for identifying the function F and the parameter vector 0 in the semiparametric double index model M(X) = F X > 0 ; g 0 (X), where M (X) and g 0 (X) are assumed to be identi ed. Our estimators are based on M (X) = E (Y jx) and g 0 (X) also representing another conditional expectation, but this is not necessary for identi cation. For example, identi cation would be the same if M (X) were an identi ed density, distribution, or quantile function rather than a conditional mean. To obtain identi cation under general conditions, we provide two di erent sets of identifying assumptions, each of which can be applied to di erent regressors in the same model. We therefore let X > 0 = V > 0 + Z > 0 and write the model as M(V; Z) = F V > 0 + Z > 0 ; g 0 (V; Z). Here V is a vector of one or more continuous regressors, while Z is a vector that can include covariates which are discrete, continuous, continuous with mass points, etc. Theorem 3.1 below identi es 0 using di erentiability and inequality constraints, while Theorem 3.2 below identi es 0 exploiting support and invertibility restrictions instead. Z could be empty so only Theorem 3.1 would be needed, or V could just have one element with a coe cient normalized to equal one, in which case only Theorem 3.2 would be needed, or both Theorems can be applied to identify models where di erent sets of regressors satisfy di erent assumptions. Assumption 1 Assume V is a K 1-vector for some positive integer K. Assume there exists a function F and a vector 0 such that m (V ) = F V > 0 ; g (V ). Assume functions m (V ) and g (V ) are identi ed. Let 0;i and V i denote the i-th element of 0 and V respectively, for i = 1; : : : ; K. Assume 0;1 = 1. Assumption 1 applies to the general double index model M(V; Z) = F V > 0 + Z > 0 ; g 0 (V; Z) de ning m (V ) = M(V; 0) and g (V ) = g 0 (V; 0) ; and provided the J 1 vector of zeroes is in the support of Z. If Z is empty then Theorem 3.1 based on Assumptions 1 and 2 below will identify the entire model, otherwise it will identify just the 0 coe cients, and F only on the supports of V > 0 and 4

5 g 0 (V; 0). If K = 1, then 0 just equals 0;1 = 1 and so is known by the scale normalization assumption in that case. The scaling of 0 is arbitrary, since changes in scaling can be freely absorbed into F. Assumption 1 imposes the convenient scale normalization that V 1, the rst element of V, has a coe cient of 0;1 = 1. This is a free normalization if one knows that this regressor has a positive e ect on F through the rst index. This is also the most natural normalization in some contexts, e.g., if in a binary choice model Y is a purchase decision and V 1 is the price, then with the normalization 0;1 = 1 the remainder of the index V > 0, that is, P K k=2 V k 0;k, equals (up to location) the willingness to pay for the product (see e.g. Lewbel, Linton, and McFadden, forthcoming). Assumption 2 Assume g and F are di erentiable and de r F (r; g) (r; g) =@r. De Vi g (V ) (V ) =@V i Vi m (V ) (V ) =@V i for i = 1; :::; K. Assume there exists two vectors v and ev on the support V such that the Vi g (V ) Vi m (V ) are identi ed at V = v and V = ev for i = 1; : : : ; K, and there exist two elements k and j of the set f1; :::; Kg such that the following inequalities r F (ev > 0 ; g (ev)) 6= 0 r F (v > 0 ; g (v)) 6= Vj g (ev) V1 g (ev) 0;j Vk g (ev) V1 g (ev) Vi g (v) V1 g (v) 0;i for i = 2; :::; Vj g Vk g Vj g Vk g (v) 6= [@ V1 g Vk g V1 g Vk g (v)] g Vj g V1 g Vj g (v) 0;k. The inequalities in Assumption 2 essentially require F to depend on V > 0, and require some variation in g (V ) that distinguishes it from V > 0. These inequalities will not hold if g (V ) equals a transformation of a single index in V for example (see Chamberlain, 1986), but otherwise it is very di cult to construct examples that violate the inequalities of Assumption 2. The identi cation of derivatives of m and g at the two points v and ev generally requires that V be continuously distributed at those points, so Theorem 3.1 below cannot be applied to discrete regressors. Theorem 3.2 later will provide identi cation for other regressor distributions, including discrete regressors. Theorem 3.1 If Assumptions 1 and 2 hold, then the vector 0 and the function F (on the supports of V > 0 and g 0 (V; 0)) are identi ed. Theorem 3.1 obtains identi cation without an exclusion assumption, that is, all of the covariates V can appear in both the index V > 0 and in the function g (V ). Identi cation can also be obtained from Theorem 3.1 using exclusions restrictions to satisfy Assumption 2. However, with an exclusion restriction, identi cation can be obtained more simply as follows: Let Assumption 1 hold with 0;K = 0 and assume g(v ) varies with V K (so V K is in g(v ) but not in the linear index V > 0 and hence V K is the excluded regressor). Then if F 0 is di erentiable in its rst element it follows from equation (A-1) 5

6 that 0 is identi ed by 0;k [m (V ) jv 1,...,V K 1 ; g (V k [m (V ) jv 1,...,V K 1 ; g (V 1, for k = 2; :::; K 1, evaluated at any value of V that makes the derivative in the denominator of the above expression nonzero. 3.1 Discrete Regressors Theorem 3.1 does not identify the coe cients of discrete regressors, but it can identify the coe cients of the continuous regressors when both continuous and discrete regressors are present. Given both continuous and discrete regressors, Theorem 3.2 below can be combined with Theorem 3.1 to identify the coe cients of the remaining discrete regressors, or of other regressors that satisfy the alternative regularity conditions provided in Assumptions 3 and 4. If all regressors satisfy these alternative regularity conditions, then Theorem 3.2 alone can be used for identi cation with both types of regressors. Assumption 4 imposes support restrictions and local invertibility of F, instead of the di erentiability and inequality constraints in Assumption 2. Assumption 3 Let M(V; Z) = F [V > 0 + Z > 0 ; g 0 (V; Z)]. Assume g 0 (V; Z) and M (V; Z) are identi ed. Assume Z is a J 1 vector and that 0 is identi ed. If V is a scalar, then 0 is identi ed by the free (up to sign) normalization 0;1 = 1 as before. Alternatively, if V is a vector then 0 is identi ed by Theorem 3.1 as long as Assumptions 1 and 2 hold with m (V ) = M(V; 0) and g (V ) = g 0 (V; 0). Having V be a vector instead of a scalar makes the support requirements in Assumption 4 below less restrictive. These support conditions are particularly mild when Z consists only of discrete regressors, which would then require V to contain all the continuous covariates in the model. Assumption 4 Assume the J 1 vector of zeroes is in the support of Z. Let ez j denote the J 1 vector that has element j equal to z j and all other elements equal to zero. Assume for some z j 6= 0 in the support of Z j, there exists v (z j ) in the support of V such that v (z j ) > 0 + z j j is in the support of V > 0 and g 0 (v (z j ) ; z j ) is in the support of g 0 (V; 0). Assume F (r; eg) is invertible on its rst element at the point r = v (z j ) > 0 + z j j, eg = g 0 (v (z j ) ; z j ). A su cient condition for Assumption 4 to hold is if V > 0 and g 0 (V; 0) have support on the entire real line and if F is strictly monotonic in its rst element. Alternatively, if Z is discrete, then only a limited range of values of V > 0 and g 0 (V; 0) are required, e.g., if there is a v such that v > 0 can take on the value z j j, then letting v (z j ) equal that v makes v (z j ) > 0 + z j j lie in the support of V > 0, and a similar analysis applies to g 0. Theorem 3.2 If Assumptions 3 and 4 hold for j = 1; :::; J then 0, 0, and F at all points on the support of V > 0 + Z > 0 ; g 0 (V; Z) are identi ed. We now illustrate these identi cation theorems with two examples, a double hurdle model and a binary choice control function model with an endogenous regressor. 6

7 3.2 A Semiparametric Double Hurdle Model Suppose a latent binary variable Y satis es the standard xed (at zero) censored regression model Y = X > 0 e I X > 0 e 0 = max 0; X > 0 e with e independent of X and the distribution function of e, F e, may be unknown. Now consider the case where we only observe Y for some subset of the population, indexed by a binary variable D, i.e. we only observe Y = Y D. This is a sample selection model with a censored regression outcome. For example, Y could indicate the quantity of a good an individual might want to purchase, D indicates whether the good is available for purchase where the individual lives, and Y indicates the quantity of the good the individual purchases, which is nonzero only when both X > 0 e > 0 and D = 1. Now consider identi cation of this censored regression with selection model without assuming selection on observables, so D and Y remain correlated after conditioning on observables X, as in various Heckman-type selection models. The econometrician is assumed to know relatively little about selection D other than that it is binary, so assume D is given by a nonparametric threshold crossing model D = I [g 0 (X) u 0] where u? X and both the function g 0 (X) and the distribution of u are unknown. Based on Matzkin (1992), we may without loss of generality assume g 0 (X) = E(DjX) and u has a uniform distribution, since then Pr(D = 1jX) = Pr [u g 0 (X)] = g 0 (X). We then have the model D = I [g 0 (X) u 0] (3.1) Y = max(0; X > 0 e)d (3.2) The latent error terms e and u are not independent of each other, so the model does not have selection on observables. When g 0 and the joint distribution of e and u are parameterized, Cragg (1971) and later authors call this a double hurdle model, because two hurdles must be crossed, g 0 (X) u and X > 0 e before a positive quantity of Y can be observed. We therefore call equations (3.1) and (3.2) the semiparametric double hurdle model. Let (e; u) > be drawn from the joint distribution function F e;u (e; u) with e; u independent of X. Then we have g 0 (X) = E(DjX) and we can de ne Z M (X) := E (Y jx) = max 0; X > 0 e I(g 0 (X) support (e;u) u 0)dF e;u (e; u) =: F [X > 0 ; g 0 (X)] so equation (1.1) holds, and by construction both M (X) and g 0 (X) are identi ed as conditional expectations. Identi cation of F and 0 is obtained by having the assumptions of Theorems 3.1 or 3.2 hold. Di erentiability of F corresponds to e and u being continuously distributed. 3.3 Binary Choice With an Endogenous Regressor Control Function Without Instruments Suppose Y = I(Z > 0 + W 0 e 0) where W is an endogenous regressor with W = s 0 (Z) + u, and e and u are unobserved possibly correlated error terms. In our empirical application, Y will indicate whether an individual moves (migrates) from one state to another in the United States, and W will be 7

8 logged income. People often move to nd better jobs, and unobservables that a ect the willingness to relocate are likely to be related to unobservables a ecting income, so W will generally be an endogenous regressor. We assume to have a parametric model for migration, but do not know much about how income is determined, so the model for W is nonparametric. Let E(ujZ) = 0 and assume the endogeneity takes the control function form ejz; u = eju. De ne X > 0 := Z > 0 + W 0, and g 0 (X) := W s 0 (Z). Then equation (1.1) holds with F de ned by E(Y jx) = E[I X > 0 h 0 (g 0 (X) ; v) 0 jx] =: F X > 0 ; g 0 (X). In this application we will have some discrete regressors, and a continuous exogenous regressor which is age. The conditions of Theorem 3.2 are sensible here. In particular, among working age individual s migration probabilities decrease steadily with age (since the gains in lifetime expected earnings from migrating to a better paying job decrease linearly with age), and empirically income is highly non-linear in age, providing the necessary nonlinearity in g 0 (X). As for Assumption 4, F is monotonic in its rst element since it equals a conditional distribution function, and migration probabilities vary greatly with age as required by the support assumptions. 4 Estimation 4.1 Two-Step Semiparametric Least Squares Estimation Given identi cation, we now describe in detail our estimator for the general double index model E(Y jx) = F [W j W ]; where X may contain both continuous and discrete regressors, W (X > 0 ; g 0 (X)) and g 0 (X) = E[DjX]. We assume that a random sample f(y i ; X i > ; D i)g n is observed from the joint distribution of (Y; X > ; D) > taking values in X Y X X X D 2 R d+2. For any candidate value of and conditional mean function g, let W (; g) := [X >,g (X)] >, and set F (wj ; g) := E [Y j W (; g) = w], w 2 R 2. The regression function F ( wj ; g) can be consistently estimated by the leave-one-out nonparametric Nadaraya-Watson kernel estimator bf i (wj ; g) := T b i (wj ; g) = f b i (wj ; g), where for i = 1; :::; n, bt i (wj ; g) := 1 Y j K bhn (W j (; g) w), n bf i (wj ; g) := 1 n j=1;j6=i j=1;j6=i K bhn (W j (; g) w), W i (; g) := [X > i,g (X i)] >, K h (w) = k h (w 1 )k h (w 2 ), w = (w 1 ; w 2 ); k h (u) = h 1 k (u=h), k () is a kernel function and b h n denotes a possibly data-dependent bandwidth parameter; see Assumptions 9 and 10 below. Our two-step semiparametric least squares estimator (SLS) is b := arg min 2 S n (; bg) 1 n 8 h Y i b Fi (W i (; bg) j; bg)i 2 bai, (4.1)

9 where bg represents the Nadaraya-Watson estimator of g 0 satisfying Assumption 15 below, and ba i := I( b f i n ), with b f i b f i (W i ( b ; bg)j b ; bg) for i = 1; : : : ; n, is a trimming function introduced here to keep bf i (W i (; bg)j ; bg) away from zero, with n! 0, as n! 1 at a suitable rate. The preliminary consistent estimator, b, of 0 can be obtained by M -estimation with xed trimming, i.e. ba i := I(X i 2 A) for a compact set A X X, see e.g. Delecroix, Hristache and Patilea (2006). Notice that the estimator de ned by (4.1) is like Ichimura s (1993) estimator after plugging in bg, and is also considered in Ichimura and Lee (2010), except that here g 0 is not assumed to have an index structure, and we are allowing for a random trimming function with possibly data-dependent bandwidth. The asymptotic properties of (4.1) are established by repeated applications of Escanciano, Jacho-Chávez and Lewbel s (2011) uniform-in-bandwidth result. 4.2 Asymptotic Theory De ne for any vector (a 1 ; :::; a d ) of d integers the di erential a x jaj =@x a 1 1 : : d d, where jaj := P d a i. Let X X be the support of X; and assume that X X X ; where X is a convex, bounded subset of R d, with non-empty interior. For any smooth function h : X X R d! R and some > 0, let be the largest integer smaller than, and khk 1; := max sup jajx2x j@ xh(x)j a + max sup j@ xh(x a 1 xh(x a 2 )j jaj=x 1 6=x 2 kx 1 x 2 k. Further, let C M (X ) be the set of all continuous functions h : X Rd! R with khk 1; M. De ne the sup-norm khk 1 := sup x2x jh(x)j. Let f X (xj w; ; g) be the density of X conditional on W (; g) = w. Let h Z (zj w; ; g) be the conditional density function of Z (Y; X > ) > given W (; g) = w with respect to a - nite measure (). Note that Y does not need to be absolutely continuous as we do not require () to be the Lebesgue measure. Let G be the class of functions where g belongs. De ne the set W as W := fw 2 R 2 : 9x 2 X X, g 2 G, 2 s.t. w = (x > ; g (x)) > g. The following assumptions will be needed in our subsequent analysis. Escanciano, Jacho-Chávez and Lewbel (2011) provide su cient primitive conditions for some of our high-level assumptions in terms of the density of X and the class G: Assumption 5 The sample observations fy i ; X i > gn are a sequence of independent and identically distributed (iid) variables, distributed as fy; X > g and satisfying E[jY j p ] < 1; for some p > 2, and E[Y jx] = E[Y jw ] a.s. Assumption 6 The parameter space is a compact subset of R d and 0 is an element of its interior. The class G C g M (X ), for some g > d. Assumption 7 For any g 2 G and 2 the distribution of the random vector W (; g) admits a continuous density function f (wj ; g) with respect to the Lebesgue measure, such that for any compact set I of W R 2 there exists some " > 0, with " inf g2g;2;w2i f (wj ; g) sup g2g;2;w2i f (wj ; g) " 1 : Furthermore, sup (;g;w)2gw E[jY j p jw (; g) = w] < 1; for some p > 2: 9

10 Assumption 8 For all g 2 G, 2 : (i) with q (wj x; ; g) f (wj ; g) ; F (wj ; g) or f X (xj w; ; g), q ( wj x; ; g) is (r + 1)-times continuously di erentiable in w, with uniformly bounded derivatives: sup g2g;2;x2x ;w2w j@ wq (wj x; ; g) j < 1, 8 jj r + 1, where r is as in Assumption 9 below; and (ii) f (wj ; g) and F (wj ; g) are continuously di erentiable in, for all w 2 W: Conditions 5 8 are standard in the literature. Assumptions 7 and 8 are needed for establishing uniform convergence rates for kernel estimators, and for a uniform expansion of nonparametric residualmarked empirical processes. Note that by consistency of bg we can take G to be contained in an arbitrary neighborhood of g 0. Assumption 9 The kernel function k (t) : R! R is integrable, r-times continuously di erentiable and satis es the following conditions: R k (t) dt = 1, R t l k (t) dt = 0 for 0 < l < r, and R jt r k (t)j dt < 1, for some r 2; is of bounded variation with j@k(t)=@tj C; and for some v > 1; j@k(t)=@tj C juj v for juj > L; for some positive L < 1: Assumption 10 The possibly data-dependent bandwidth b h n satis es P (a n b h n b n )! 1 as n! 1, for deterministic sequences of positive numbers a n and b n such that: (i) b n! 0 and a 2 nn= log n! 1; (ii) nb 2r n! 0. Assumption 9 is standard in the literature of nonparametric kernel estimation, while Assumption 10 permits data-dependent bandwidths, as in Andrews (1995). In particular, our theory allows for plug-in bandwidths of the form b h n = bch n with bc stochastic and h n a suitable deterministic sequence converging to zero as n! 1. Andrews (1995) points out that this condition holds in many cases for other common data-dependent bandwidth selection procedures, such as cross-validation, and generalized cross-validation. Obviously, our results also apply to deterministic sequences. In particular if b h n is of the form b h n = cn, for some constant c > 0; then Assumption 10(ii) requires that 1=2r < < 1=2, so r needs to be greater than 1: We now introduce two classes of functions that will serve as parameter spaces for the functions F (w (x; ; g) j; g) and f(w (x; ; g) j; g); respectively, where henceforth w (x; ; g) := [x > ; g (x)] >. Let T M be a class of uniformly bounded and measurable functions fx! q (w (x; ; g)j ; g) : 2 ; g 2 G; q 2 T g (4.2) such that for a universal constant C L ; all g j 2 G; j 2, j = 1; 2; and all q 2 T ; jq (wj 1 ; g 1 ) q (wj 2 ; g 2 )j C L fj 1 2 j + kg 1 g 2 k 1 g : (4.3) Moreover, we assume that there exist a convex, bounded subset of R 2, say C W W; with non-empty interior such that for each g 2 G; 2 ; it holds that q (j ; g) 2 C M (C W); for some > 1. Assumption 11 F 2 T F M, f 2 T F M F (w (; 0; g 0 ) j 0 ; g 0 ) 2 T F M ; P ( b F i b hn T F M )! 1; P ([ b f i b hn T F M )! 1, for some F > d and M > 0: 2 T F M )! 1, P ( f b i b hn 2 f] 2 T F M )! 1; P (@> b F i (w(; ; b bg)j ; b bg) 2 T F M )! 1 and P ( f b i (w(; b ; bg)j b ; bg) 2 10

11 Assumption 12 Let n be a sequence of positive numbers satisfying: (i) for all u 2 R and all su - ciently small > 0; P (u 2 f(w jw ) u+2) C; C > 0; (ii) p n sup W 2W P (f (W j W ) < 2 n )! 0, n! 0; a 4 nnn 4! 1 and nn 4 b 4r n! 0. Assumption 11 follows from certain smoothness in the underlying functions, as emphasized in?. The continuity condition in Assumption 12(i) is used to handle the random trimming. Assumption 12(ii) is similar to that considered in Robinson (1988) for example. Assumption 13 The conditional density function f X (xj ; g) is such that for all x 2 X X ; and g 1 2 G; 1 2 ; sup jf X (xj 1 ; g 1 ) f X (xj 2 ; g 2 )j '(xj 1 ; g 1 ) for all > 0, ( 2 ;g 2 ):j 1 2 j<;kg 1 g 2 k 1 < where '(xj 1 ; g 1 ) is a real-valued function such that R '(xj 1 ; g 1 )d (x) < C, with C denoting an absolute constant. Assumption 14 The function F (wjw ) is (+1)-times continuously di erentiable in w 2 with bounded derivatives, for all w 1. The function g 0 and the density f X () of X are -times continuously di erentiable in x, with bounded derivatives, where is as in Assumption 15 below. Assumption 15 (a) The regression g 0 is estimated by a NW kernel estimator with a kernel function satisfying Assumption 9 with r = and a possibly stochastic bandwidth b h gn satisfying P (c n b h gn d n )! 1 as n! 1, for deterministic sequences of positive numbers c n and d n such that: (i) d n! 0 and c d nn= log n! 1; (ii) nb 2 n! 0 and c n n 1 1=! 1; (b) Moreover, E D 2 < 1; kbg g 0 k 1 = o P (n 1=2 ), g 0 2 G and P (bg 2 G)! 1. Assumption 16 The density f (wj ; g) is uniformly continuous in ( 0 ; g 0 ); i.e. k f (j n ; g n ) f(j 0 ; g 0 ) k 1 = o(1) for any sequence fg n ; n g satisfying kg n g 0 k 1 = o(1) and j n 0 j = o(1): Assumption 17 The matrix 0 := E[@ F (W )@ > F (W )] is positive de nite. Assumption 13 is a high-level condition and it implies that conditional moments of bounded functions '(X) conditional on W (; g) are Lipschitz in W (; g): Assumptions 14 and 15 are standard in the literature. Examples of random bandwidths that satisfy our assumptions are plug-in bandwidths of the form b h gn = bch gn with bc is bounded in probability and h n a suitable deterministic sequence. If h gn = cn, for some constant c > 0; then Assumption 15(a) requires that 1=2 < < minf1=d; 1 needs to be greater than d=2 and 1=2 + 1=2( in the literature that assume kbg 1=g, so 1): Assumption 15 is weaker than related conditions g 0 k 1 = o P (n 1=4 ). This is an additional advantage of our method of proof based on continuity arguments over alternative methods of proof based on di erentiability arguments. However, in general we do still require further conditions on bg to satisfy P (bg 2 G)! 1. For example, these conditions have been veri ed to hold under primitive regularity assumptions for the local polynomial estimator in Neumeyer and van Keilegom (2010) when all the X s are continuous. 11

12 The smoothness condition in Assumption 16 is standard. Finally, Assumption 17 is also standard and it ensures the non-singularity of the asymptotic covariance matrix of the nal estimator. After de ning g F (W i ) (w 1 ; w 2 )jw 0 )=@w 2 j w=wi ; and i := Y i F (W i jw ) (D i g 0 (X i ))@ g F (W i ), 1;i := F (W i jw ); the following theorem establishes the consistency and asymptotic normality of the proposed estimator. Theorem 4.1 Assume 0 is identi ed and that Assumptions 5 17 hold, then b is consistent and asymptotically normal where 0 := E[ 1;i > 1;i ]: p n( b 0 )! d N(0; ), Estimation of g 0 has an impact in the asymptotic variance of b through the term (D i g 0 (X i ))@ g F (W i ); however, the estimator has an oracle property with respect to F, as its asymptotic properties are not a ected by the lack of knowledge of F, see e.g. Ichimura and Lee (2010). The asymptotic variance of b can be easily estimated by the analogue principle or by bootstrap methods. 5 Numerical Results In this section we discuss the numerical implementation of our estimator (4.1) in the context of a small Monte Carlo experiment and an empirical implementation. We make use of the np package by Hay eld and Racine (2008) in the statistical computing environment R. In particular, under its General Public License (GPL) we modify its estimating function called npindex(...,method="ichimura",...) to allow the inclusion of a second conditioning variable when calculating Ichimura s (1993) estimator and user-speci ed weights. We use the option optim.method="bfgs" and optim.method="nelder-mead" in each of the 20 times (option nmulti=20) we randomly restarted the optimization algorithm to avoid nding local minima in both the Monte Carlo and the empirical application respectively. We nd our numerical calculations to be more stable using a simple 2nd order Gaussian kernel as permitted by our asymptotic theory, along with bandwidths that are estimated jointly with the unknown 0, i.e. ( b > ; b h > ) = arg min ( > ;h > ) > 2R n [Y i Fi b (W i (; bg) j; bg)] 2 ba i, (5.1) see Marron (1994), Härdle, Hall, and Ichimura (1993), and Rothe (2009) for similar results regarding kernel and bandwidth choice. Random starting values in each restart and trimming function, ba i, follow the original implementation by the np package developers and maintainers. 12

13 5.1 Monte Carlo Experiments We assess the performance of the estimator with a Monte Carlo simulation of a binary choice model with an endogenous regressor control function, where no outside instruments are available. This is the same type of model that we apply in our empirical application. We generate samples of n 2 f250; 500; 1000g pseudo-random numbers in 300 replications. Variables fy i ; X 1i ; X 2i g n are generated from the model Y = I (X X 2 " 0), X 2 = s 0 (X 1 ) + U, where X 1 has a beta marginal distribution with shape parameters (2; 4), s 0 (u) = 0:5(10 u) where represents the cumulative distribution function of a standard normal random variable, and the error terms, " and U, were generated independently of X = [X 1 ; X 2 ] > from marginal standard normal distributions with correlation coe cient = f0; 1=2; 3=4g. We calculated 4 sets of estimators of our parameter of interest, 0 = 1: The standard Probit [1], Ichimura s (1993) estimator [2], the infeasible semiparametric estimator that uses the true s 0 [3] and the proposed (feasible) semiparametric least squares with s 0 estimated by the Nadaraya-Watson estimator with bandwidths chosen in each replication by Silverman s rule-of-thumb. Results are presented in Table 1. They show simulated bias (Bias), standard deviation (Std. Dev.), root mean square error (RMSE) and the mean absolute deviation (MAE). Both [1] and [2] are only consistent when = 0, because they are misspeci ed otherwise. They are useful as benchmarks to check if errors in our estimator are smaller than errors due to (sometimes) misspeci ed existing estimators. Notice that when = 0, the Probit model is correctly speci ed and the most e cient. Model [3] provides an infeasible standard in all scenarios against which the proposed estimator can be compared to measure the extent of the rst-step impact on the estimation precision of the proposed estimator, [4]. In all scenarios and sample sizes, estimators [3] and [4] performs well in terms of RMSE, and the Monte Carlo variance of [4] is always larger than [3] s as predicted by the asymptotic theory. When = 0, the simulated variance of the infeasible version of the proposed estimator, [3], is slightly larger than [2] s indicating a loss of e ciency when trying to control for non-existent endogeneity in the model. When the endogeneity problem is severe, i.e. = 3=4, both versions of the proposed estimator perform very well in comparison to the standard Probit and Ichimura s (1993) estimators which become largely biased. 5.2 Empirical Application As previously discussed, we now estimate a binary choice model for workers migration decisions based on a sample of years old male household heads who had completed education by the time of interview and who reported positive labor income during The sample is drawn from the 1990 wave of the Panel Study of Income Dynamics (PSID). The top 1% highest earning individuals are dropped to reduce the impact of outliers. Details regarding this data construction are in Dong (2010). Table 2 shows descriptive statistics of the resulting 4582 observations in the sample. 13

14 Let Y indicate if an individual moves (migrates) from one state to another in the United States in the years , and let W be their log labor income averaged over 1989 and Exogenous covariates Z are State (number of states ever lived in), Edu (dummy indicating college or above education), Size (family size), and Age. The model is Y = I(X > 0 e 0) = I(Z > 0 + W 0 e 0) where W is an endogenous regressor with W = s 0 (Z) + u, and e and u are unobserved error terms, correlated with each other but independent of Z. Then as shown earlier, E(Y jx) = F X > 0 ; g 0 (X) where g 0 (X) = g 0 (W; Z) = W s 0 (Z). We assume identi cation based on Theorem (3.2) without exclusion assumptions, with age as a continuous regressor. This requires that the latent variable driving the probability that Y equals one be linear in age (which as noted earlier is supported by the human capital theory of migration) and that this probability varies over a reasonably wide range with age. The nonlinearity of g 0 (X) = W s 0 (Z) required for identi cation will hold if s 0 (Z) is nonlinear in age. Figure 1 shows Nadaraya-Watson Kernel regression estimates of E (YjAge) and E(log(Income)jAge) with bandwidths chosen by Least-Squares cross-validation and 95% pointwise con dence intervals based on 399 bootstrapped replications. This gure provides empirical evidence supporting our above identifying assumptions. For example, E (YjAge) takes on a reasonably wide range of values, considering that the expected probability that a randomly chosen individual migrates in a three year period would not generally be high. Also the estimated migration probability is close to linear and certainly plausibly monotonic in age, while log(income) is highly nonlinear and nonmonotonic in age. We estimate the model using a Probit [I] which ignores endogeneity of labor income and assumes e is normal, Ichimura s (1993) estimator [II] which ignores endogeneity but does not assume the distribution of e and the proposed 2-Steps SLS. The coe cient of State is normalized to 1 in all speci cations for comparison purposes. Bandwidths in [II] and [III] were jointly estimated with the unknown coe cients to minimize least squares, and then used throughout to calculate related quantities such as asymptotic standard errors and marginal e ects. For our proposed estimator, [III], in the rst stage we nonparametrically regress the endogenous covariate log(income) on State, Edu, Size, and Age using generalized kernels as suggested in Racine and Li (2004) with smoothing parameters chosen by Least-Squares cross-validation. The resulting datadependent bandwidths are 1 for State, for Edu, for Size and for Age. We then estimate in the second step as in (5.1) where bg equals the residuals from the rst-step nonparametric regression. Table 3 shows the results. The table also w1f b evaluated at the sample mean of the estimated index in [II], and the sample mean of the estimated indexes corresponding to [III]. Marginal e ects for all models are shown in Table 4. The marginal e ects marked as (*) were obtained by multiplying the w1f b in Table 3 by their corresponding estimated coe cient. Using results in Sperlich (2009), the asymptotic standard errors for model [II] and [III] are approximated as the square root of cvar(@ w1f b ) b2 j w1f b 2 cvar( b j ). For discrete covariates Edu and Size, Table 4 also reports for each t s the corresponding change, F b t Fs b, where F b l represents the value of F b evaluated at the implied index where a particular discrete variable is set equal to l while keeping the remaining part of the index equal to its sample mean. Results in Schuster (1972) also allow us to approximate their 14

15 standard errors as the square root of cvar( F b t ) + cvar( F b s ). Although has the same normalization in all the estimators, estimated marginal e ects were still generally closer across speci cations than estimates of, suggesting that small sample biases in estimating and F may be o setting to some extent. Moreover, marginal e ects are more directly economically interpretable as the impacts of regressors on the probability of migrating. We therefore focus on summarizing marginal e ects. In all the estimates age has a negative impact on the probability of moving as expected by theory, however, the e ect of age is more than 50% larger in our estimators that take endogeneity of income into account, which suggests the importance of controlling for endogeneity. The endogenous regressor log income has a negative e ect in all the speci cations, consistent with higher wage individuals having less income incentive to move. It is not clear how education should a ect migration probabilities, and the estimates of this e ect vary inconclusively across models. The e ects of family size were not statistically signi cant, but our estimates controlling for income endogeneity suggest that larger families are more likely to move. 6 Conclusion and Extensions The new identi cation and estimation results in this paper are applicable to various models including latent index models with an endogenous regressor, and nonlinear models with sample selection. We discuss how identi cation based on functional form, without exclusion restrictions or instruments, extends to a semiparametric setting. For estimation, the asymptotic properties of the proposed semiparametric least squares estimators are derived with random trimming and data-driven bandwidths. The asymptotic properties of the proposed estimator is derived from a novel uniform expansion of sample means of weighted nonparametric residuals in Escanciano, Jacho-Chávez and Lewbel (2011). It is shown that the asymptotic properties of the estimator remain unchanged when using data-dependent bandwidths and random trimming. For bandwidth selection, we have proposed a numerically e ective procedure that, while ad hoc, performs well in the Monte Carlo experiments and in an empirical application to a migration decision model. The formal justi cation of this bandwidth selection remains an open problem. This and other possible extensions are deferred for future research. References Andrews, D. W. K. (1995): Nonparametric Kernel Estimation for Semiparametric Models, Econometric Theory, 11, Blundell, R. W., and J. L. Powell (2004): Endogeneity in Semiparametric Binary Response Models, Review of Economic Studies, 71, Chamberlain, G. (1986): Asymptotic E ciency in Semi-parametric Models with Censoring, Journal of Econometrics, 32(2),

16 Chen, X., O. B. Linton, and I. van Keilegom (2003): Estimation of Semiparametric Models when the Criterion Function Is Not Smooth, Econometrica, 71(5), Cragg, J. G. (1971): Some Statistical Models for Limited Dependent Variables with Application to the Demand for Durable Goods, Econometrica, 39(5), Delecroix, M., M. Hristache and V. Patilea (2006): On Semiparametric M -Estimation in Single-Index Regression, Journal of Statistical Planning and Inference, 136(3), Dong, Y. (2010): Endogenous Regressor Binary Choice Models Without Instruments, With an Application to Migration, Economics Letters, 107(1), Escanciano, J. C., D. T. Jacho-Chávez and A. Lewbel (2011): Uniform Convergence for Semiparametric Two Step Estimators and Tests, Unpublished manuscript. Härdle, W., P. Hall, and H. Ichimura (1993): Optimal Smoothing in Single-Index Models, The Annals of Statistics, 21(1), Hayfield, T., and J. S. Racine (2008): Nonparametric Econometrics: The np Package, Journal of Statistical Software, 27(5), Heckman, J. J. (1979): Sample Selection Bias as a Speci cation Error, Econometrica, 47(1), Ichimura, H., and L. Lee (1991): Semiparametric least squares estimation of multiple index models: single equation estimation", in Nonparametric and Semiparametric Methods in Econometrics and Statistics, ed. by W. A. Barnett, J. Powell, and G. Tauchen, pp Cambridge University Press. Ichimura, H. (1993): Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single Index Models, Journal of Econometrics, 58, Ichimura, H., and S. Lee (2010): Characterization of the asymptotic distribution of semiparametric M-estimators, Journal of Econometrics, 159(2), Lewbel, A., and O. B. Linton (2007): Nonparametric Matching and E cient Estimators of Homothetically Separable Functions, Econometrica, 75(4), Lewbel, A., O. B. Linton, and D. McFadden (forthcoming): Estimating Features of a Distribution from Binomial Data, forthcoming in Journal of Econometrics. Li, K.-C. (1991): Sliced Inverse Regression for Dimension Reduction, Journal of the American Statistical Association, 86(414), Marron, J. S. (1994): Visual Understanding of Higher-Order Kernels, Journal of Computational and Graphical Statistics, 3(4),

17 Matzkin, R. L. (1992): Nonparametric and Distribution-Free Estimation of the Binary Threshold Crossing and the Binary Choice Models, Econometrica, 60, Neumeyer, N., and Van Keilegom, I. (2010): Estimating the Error Distribution in Nonparametric Multiple Regression with Applications to Model Testing, Journal of Multivariate Analysis, 101, Newey, W. K., and D. McFadden (1994): Large Sample Estimation and Hypothesis Testing, in Handbook of Econometrics, ed. by D. McFadden, and R. F. Engle, vol. IV, pp Elsevier, North-Holland, Amsterdam. Pinkse, J. (2001): Nonparametric Regression Estimation using Weak Separability, Unpublished manuscript. Racine, J. S., and Q. Li (2004): Nonparametric Estimation of Regression Functions With Both Catagorical and Continuous Data, Journal of Econometrics, 119(1), Rivers, D., and Q. H. Vuong (1988): Limited information estimators and exogeneity tests for simultaneous probit models, Journal of Econometrics, 39(3), Robinson, P.M. (1988): Root-n-consistent Semiparametric Regression, Econometrica, 56(4), Rothe, C. (2009): Semiparametric Estimation of Binary Response Models with Endogenous Regressors, Journal of Econometrics, 153(1), Schuster, E. F. (1972): Joint Asymptotic Distribution of the Estimated Regression Function at a Finite Number of Distinct Points, Annals of Mathematical Statistics, 43(1), Sperlich, S. (2009): A note on non-parametric estimation with predicted variables, The Econometrics Journal, 12(2), Appendix A: Proofs of Main Results Proof of Theorem 3.1: Dropping V for now for clarity, de r F (r; g) =@r g F (r; g) =@g. Equating m with F and taking derivatives shows Vk m r F 0;k g Vk g. (A-1) Since 0;1 = 1 we have for each k = 2; : : : ; V1 Vk m! = V1 g Vk r g F!. 17

18 By Assumption Vk g V1 g 0;k so the matrix in the above equation is nonsingular. Inverting to solve r F g F r F V k g@ V1 V1 g@ Vk Vk V1 g g F V k V1 m Vk V1 g 0;k Equating the above expression r F based on coe cients indexed by k with the same expression evaluated at some other index i gives Vk g@ V1 V1 g@ Vk m V i g@ V1 V1 g@ Vi Vk V1 g Vi V1 g 0;i (@ Vk g@ V1 V1 g@ Vk m) (@ Vi V1 g 0;i ) = (@ Vi g@ V1 V1 g@ Vi m) (@ Vk V1 g 0;k ) which simpli es to (@ Vk g@ Vi Vi g@ Vk m) = (@ Vk g@ V1 V1 g@ Vk m) 0;i (@ Vi g@ V1 V1 g@ Vi m) 0;k (A-2) which is linear in 0;k and 0;i. The same equation is obtained if one equates the expression g F based on two indices k and i. Recalling the dependence of the functions above on V, equation (A-2) evaluated at V = v and at V = ev with i = j can be written as where the g Vj m Vk g Vj m (v) is given Vj g Vk m (ev) Vj g Vk m (v) V k g V1 m V1 g Vk m Vj g V1 m V1 g Vj m Vk g V1 m V1 g Vk m Vj g V1 m V1 g Vj m (v) 0;j 0;k!! (A-3) Using equation (A-1), each term in the matrix has the Vk g@ V1 V1 g@ Vk m Vk g (@ r F 0;1 g V1 V1 g (@ r F 0;k g Vk g) = (@ Vk g V1 r F (A-4) so rf (ev 0 ; g (ev)) 0 r F (v 0 ; g Vk g (ev) V1 g Vj g (ev) V1 g Vk g (v) V1 g Vj g (v) V1 g (v)! : Assumption 2 imposes su cient conditions to ensure that the determinants of each of the two matrices on the right above are nonzero, which shows that is nonsingular. 18

Identification and Estimation of Semiparametric Two Step Models

Identification and Estimation of Semiparametric Two Step Models Juan Carlos Escanciano Indiana University David Jacho-Chávez Emory University Arthur Lewbel Boston College First Draft: May 2010 This Draft: