Nonparametric Identification and Estimation of Nonclassical Errors-in-Variables Models Without Additional Information


Nonparametric Identification and Estimation of Nonclassical Errors-in-Variables Models Without Additional Information

Xiaohong Chen (Yale University), Yingyao Hu (Johns Hopkins University), Arthur Lewbel (Boston College)

First version: November 2006; revised July 2007.

Abstract. This paper considers identification and estimation of a nonparametric regression model with an unobserved discrete covariate. The sample consists of a dependent variable and a set of covariates, one of which is discrete and arbitrarily correlated with the unobserved covariate. The observed discrete covariate has the same support as the unobserved covariate, and can be interpreted as a proxy or mismeasure of the unobserved one, but with a nonclassical measurement error that has an unknown distribution. We obtain nonparametric identification of the model given monotonicity of the regression function and a rank condition that is directly testable given the data. Our identification strategy does not require additional sample information, such as instrumental variables or a secondary sample. We then estimate the model via the method of sieve maximum likelihood, and provide root-n asymptotic normality and semiparametric efficiency of smooth functionals of interest. Two small simulations are presented to illustrate the identification and the estimation results.

Keywords: Errors-in-variables (EIV); identification; nonclassical measurement error; nonparametric regression; sieve maximum likelihood.

We thank participants at the June 2007 North American Summer Meetings of the Econometric Society at Duke for helpful comments. Chen acknowledges support from the NSF. Author contacts: Department of Economics, Yale University, New Haven, CT, USA, xiaohong.chen@yale.edu; Department of Economics, Johns Hopkins University, Mergenthaler Hall, N. Charles Street, Baltimore, MD, USA, yhu@jhu.edu; Department of Economics, Boston College, Commonwealth Avenue, Chestnut Hill, MA, USA, lewbel@bc.edu.

1 INTRODUCTION

We consider identification and estimation of the nonparametric regression model

$$Y = m(X^*) + \varepsilon, \qquad E[\varepsilon \mid X^*] = 0, \qquad (1.1)$$

where $Y$ and $X^*$ are scalars and $X^*$ is not observed. We assume $X^*$ is discretely distributed, so for example $X^*$ could be categorical, qualitative, or count data. We observe a random sample of $Y$ and a scalar $X$, where $X$ could be arbitrarily correlated with the unobserved $X^*$, and $\varepsilon$ is independent of $X$ and $X^*$. The extension to $Y = m(X^*, W) + \varepsilon$, $E[\varepsilon \mid X^*, W] = 0$, where $W$ is an additional vector of observed error-free covariates, is immediate (and is included in the estimation section) because our assumptions and identification results for model (1.1) can all be restated as conditional upon $W$.

For convenience we will interpret $X$ as a measure of $X^*$ that is contaminated by measurement or misclassification error, but more generally $X^*$ could represent some latent, unobserved quantifiable discrete variable such as a health status or life expectancy quantile, and $X$ could be some observed proxy such as a body mass index quantile or the response to a health-related categorical survey question. We will require that $X$ and $X^*$ have the same support. Equation (1.1) can be interpreted as a latent factor model $Y = m^* + \varepsilon$, with two unobserved independent factors $m^*$ and $\varepsilon$, with identification based on observing the proxy $X$ and on existence of a measurable function $m(\cdot)$ such that $m^* = m(X^*)$. Regardless of whether $X$ is literally a flawed measure of $X^*$ or

just some related covariate, we may still define $X - X^*$ as a measurement error, though in general this error will be nonclassical, that is, it will have nonzero mean and will not be independent of $X^*$. Numerous data sets possess nonclassical measurement errors (see, e.g., Bound, Brown and Mathiowetz (2001) for a review); however, to the best of our knowledge, there is no published work that allows for nonparametric identification and estimation of nonparametric regression models with nonclassical measurement errors without parametric restrictions or additional sample information such as instrumental variables, repeated measurements, or validation data, which our paper provides. In short, we nonparametrically recover the conditional density $f_{Y|X^*}$ (or equivalently, the function $m$ and the density of $\varepsilon$) just from the observed joint distribution $f_{Y,X}$, while imposing minimal restrictions on the joint distribution $f_{X,X^*}$. The relationship between the latent model $f_{Y|X^*}$ and the observed density $f_{Y,X}$ is

$$f_{Y,X}(y, x) = \int f_{Y|X^*}(y|x^*) f_{X,X^*}(x, x^*)\, dx^*. \qquad (1.2)$$

All the existing published papers for identifying the latent model $f_{Y|X^*}$ use one of three methods: i) assuming the measurement error structure $f_{X|X^*}$ belongs to a parametric family (see, e.g., Fuller (1987), Bickel and Ritov (1987), Hsiao (1991), Murphy and Van der Vaart (1996), Wang, Lin, Gutierrez and Carroll (1998), Liang, Hardle and Carroll (1999), Taupin (2001), Hong and Tamer (2003), Liang and Wang (2005)); ii) assuming there exists an additional exogenous variable $Z$ in the sample (such as an instrument or a repeated measure) that does not enter the latent model $f_{Y|X^*}$, and exploiting assumed restrictions on $f_{Y|X^*,Z}$ and $f_{X,X^*,Z}$ to identify $f_{Y|X^*}$ given the joint distribution of $\{Y, X, Z\}$ (see, e.g., Hausman,

Ichimura, Newey and Powell (1991), Li and Vuong (1998), Li (2002), Wang (2004), Schennach (2004), Carroll, Ruppert, Crainiceanu, Tosteson and Karagas (2004), Mahajan (2006), Lewbel (2007), Hu (2006) and Hu and Schennach (2006)); or iii) assuming a secondary sample that provides information on $f_{X,X^*}$ to permit recovery of $f_{Y|X^*}$ from the observed $f_{Y,X}$ in the primary sample (see, e.g., Carroll and Stefanski (1990), Lee and Sepanski (1995), Chen, Hong, and Tamer (2005), Chen, Hong, and Tarozzi (2007), Hu and Ridder (2006)). Detailed reviews of existing approaches and results can be found in several recent books and surveys on measurement error models; see, e.g., Carroll, Ruppert, Stefanski and Crainiceanu (2006), Chen, Hong and Nekipelov (2007), Bound, Brown and Mathiowetz (2001), Wansbeek and Meijer (2000), and Cheng and Van Ness (1999).

In this paper, we obtain identification by exploiting nonparametric features of the latent model $f_{Y|X^*}$, such as independence of the regression error term $\varepsilon$. Our results are useful because many applications specify the latent model of interest $f_{Y|X^*}$, while often little is known about $f_{X|X^*}$, that is, about the nature of the measurement error or the exact relationship between the unobserved latent $X^*$ and a proxy $X$. In addition, our key rank condition for identification is directly testable from the data.

Our identification method is based on characteristic functions. Suppose $X$ and $X^*$ have support $\mathcal{X} = \{1, 2, \ldots, J\}$. Then by equation (1.1), $\exp(itY) = \exp(it\varepsilon) \sum_{j=1}^{J} 1(X^* = j) \exp[im(j)t]$ for any given constant $t$, where $1(\cdot)$ is the indicator function that equals one if its argument is true and zero otherwise. This equation and independence of $\varepsilon$ yield the moments

$$E[\exp(itY) \mid X = x] = E[\exp(it\varepsilon)] \sum_{x^*=1}^{J} f_{X^*|X}(x^*|x) \exp[im(x^*)t]. \qquad (1.3)$$
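To fix ideas, the left-hand side of (1.3) is just the empirical conditional characteristic function of $Y$ given each proxy value. Below is a minimal numpy sketch of that sample moment; the arrays `y` and `x` and the grid `ts` are hypothetical placeholders, not objects from the paper.

```python
import numpy as np

def ecf_given_x(y, x, ts, J):
    """Empirical counterpart of E[exp(itY) | X = j] for each t in ts and j = 1..J."""
    out = np.empty((len(ts), J), dtype=complex)   # rows: t values, columns: proxy values
    for col, j in enumerate(range(1, J + 1)):
        yj = y[x == j]                            # observations with proxy value j
        for row, t in enumerate(ts):
            out[row, col] = np.mean(np.exp(1j * t * yj))
    return out

# hypothetical usage, with (y, x) an observed sample and a 4-point latent support:
# moments = ecf_given_x(y, x, ts=np.linspace(0.1, 1.0, 10), J=4)
```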

Evaluating equation (1.3) for $t \in \{t_1, \ldots, t_K\}$ and $x \in \{1, 2, \ldots, J\}$ provides $KJ$ equations in $J^2 + J + K$ unknown constants. These unknown constants are the values of $f_{X^*|X}(x^*|x)$, $m(x^*)$, and $E[\exp(it\varepsilon)]$ for $t \in \{t_1, \ldots, t_K\}$, $x^* \in \{1, \ldots, J\}$, and $x \in \{1, \ldots, J\}$. Given a large enough value of $K$, these moments provide more equations than unknowns. We provide sufficient regularity assumptions to ensure existence of some set of constants $\{t_1, \ldots, t_K\}$ such that these equations do not have multiple solutions, and the resulting unique solution to these equations provides identification of $m(\cdot)$ and of $f_{X,X^*}$, and hence of $f_{Y|X^*}$.

As equation (1.3) shows, our identification results depend on many moments of $Y$, or equivalently of $\varepsilon$, rather than just on the conditional mean restriction $E[\varepsilon|X^*] = 0$ that would suffice for identification if $X^*$ were observed. Previous results exist that obtain identification based on higher moments as we do (without instruments, repeated measures, or validation data), for example Reiersol (1950), Kendall and Stuart (1979), and Lewbel (1997), but all of these previous results assume either classical measurement error and/or parametric restrictions. Equation (1.3) implies that independence of the regression error $\varepsilon$ is actually stronger than necessary for identification, since, e.g., we would obtain the same equations used for identification if we only had $\exp(it\varepsilon) \perp X \mid X^*$ for $t \in \{t_1, \ldots, t_K\}$. To illustrate this point further, we provide an alternative identification result for the dichotomous $X^*$ case without the independence assumption, and in this case we can solve for all the unknowns in closed form.

Estimation could be based directly on equation (1.3) using, for example, Hansen's (1982) Generalized Method of Moments (GMM). However, this would require knowing or choosing the constants $t_1, \ldots, t_K$. Moreover, under the independence assumption of $\varepsilon$ and $X^*$, we

have potentially infinitely many constants $t$ that solve equation (1.3); hence GMM estimation using finitely many such $t$'s will not be efficient in general. One could apply the infinite-dimensional Z-estimation described in Van der Vaart and Wellner (1996); here we instead apply the method of sieve Maximum Likelihood (ML) of Grenander (1981) and Geman and Hwang (1982), which does not require knowing or choosing the constants $t_1, \ldots, t_K$, and easily allows for an additional vector of error-free covariates $W$. The sieve ML estimator essentially replaces the unknown functions $f_\varepsilon$, $m$, and $f_{X^*|X,W}$ with polynomials, Fourier series, splines, wavelets, or other sieve approximators, and estimates the parameters of these approximations by maximum likelihood. By simple applications of the general theory on sieve MLE developed in Wong and Shen (1995), Shen (1997), Van de Geer (1993, 2000) and others, we provide consistency and a convergence rate of the sieve MLE, and root-n asymptotic normality and semiparametric efficiency of smooth functionals of interest, such as weighted average derivatives of the latent nonparametric regression function $m(X^*, W)$, or the finite-dimensional parameters $\theta$ in a semiparametric specification $m(X^*, W; \theta)$.

The rest of this paper is organized as follows. Section 2 provides the identification results. Section 3 describes the sieve ML estimator and presents its large sample properties. Section 4 provides two small simulation studies. Section 5 briefly concludes, and all the proofs are in the appendix.

2 NONPARAMETRIC IDENTIFICATION

Our basic nonparametric regression model is equation (1.1) with scalar $Y$ and $X^* \in \mathcal{X}^* = \{1, 2, \ldots, J\}$. We observe a random sample of $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, where $X$ is a proxy of $X^*$. The

goal is to consider restrictions on the latent model $f_{Y|X^*}$ that suffice to nonparametrically identify $f_{Y|X^*}$ and $f_{X^*|X}$ from $f_{Y|X}$.

Assumption 2.1. $X \perp \varepsilon \mid X^*$.

This assumption implies that the measurement error $X - X^*$ is independent of the dependent variable $Y$ conditional on the true value $X^*$. In other words, we have $f_{Y|X^*,X}(y|x^*,x) = f_{Y|X^*}(y|x^*)$ for all $(x, x^*, y) \in \mathcal{X} \times \mathcal{X}^* \times \mathcal{Y}$.

Assumption 2.2. $X^* \perp \varepsilon$.

This assumption implies that the regression error $\varepsilon$ is independent of the regressor $X^*$, so $f_{Y|X^*}(y|x^*) = f_\varepsilon(y - m(x^*))$. The relationship between the observed density and the latent ones is then

$$f_{Y,X}(y, x) = \sum_{x^*=1}^{J} f_\varepsilon(y - m(x^*)) f_{X,X^*}(x, x^*). \qquad (2.1)$$

Assumption 2.2 rules out heteroskedasticity or other heterogeneity of the regression error $\varepsilon$, but allows its density $f_\varepsilon$ to be completely unknown and nonparametric. Let $\phi$ denote a characteristic function (ch.f.). Equation (2.1) is equivalent to

$$\phi_{Y,X=x}(t) = \phi_\varepsilon(t) \sum_{x^*=1}^{J} \exp(itm(x^*)) f_{X,X^*}(x, x^*), \qquad (2.2)$$

for all real-valued $t$, where $\phi_{Y,X=x}(t) = \int \exp(ity) f_{Y,X}(y, x)\, dy$ and $x \in \mathcal{X}$. Since $\varepsilon$ may not be symmetric, $\phi_\varepsilon(t) = \int \exp(it\varepsilon) f_\varepsilon(\varepsilon)\, d\varepsilon$ need not be real-valued. We therefore write

$$\phi_\varepsilon(t) \equiv \nu_\varepsilon(t) \exp(i a_\varepsilon(t)),$$

where

$$\nu_\varepsilon(t) \equiv \sqrt{\mathrm{Re}\{\phi_\varepsilon(t)\}^2 + \mathrm{Im}\{\phi_\varepsilon(t)\}^2}, \qquad a_\varepsilon(t) \equiv \arccos\left(\frac{\mathrm{Re}\{\phi_\varepsilon(t)\}}{\nu_\varepsilon(t)}\right).$$

We then have, for any real-valued scalar $t$,

$$\phi_{Y,X=x}(t) = \nu_\varepsilon(t) \sum_{x^*=1}^{J} \exp(itm(x^*) + i a_\varepsilon(t)) f_{X,X^*}(x, x^*). \qquad (2.3)$$

Define

$$F_{X,X^*} = \begin{pmatrix} f_{X,X^*}(1,1) & f_{X,X^*}(1,2) & \cdots & f_{X,X^*}(1,J) \\ f_{X,X^*}(2,1) & f_{X,X^*}(2,2) & \cdots & f_{X,X^*}(2,J) \\ \vdots & \vdots & & \vdots \\ f_{X,X^*}(J,1) & f_{X,X^*}(J,2) & \cdots & f_{X,X^*}(J,J) \end{pmatrix},$$

and, for any real-valued vector $t = (0, t_2, \ldots, t_J)$, define

$$\Phi_{Y,X}(t) = \begin{pmatrix} f_X(1) & \phi_{Y,X=1}(t_2) & \cdots & \phi_{Y,X=1}(t_J) \\ f_X(2) & \phi_{Y,X=2}(t_2) & \cdots & \phi_{Y,X=2}(t_J) \\ \vdots & \vdots & & \vdots \\ f_X(J) & \phi_{Y,X=J}(t_2) & \cdots & \phi_{Y,X=J}(t_J) \end{pmatrix}, \qquad D_{|\phi_\varepsilon|}(t) = \mathrm{Diag}\big(1, \nu_\varepsilon(t_2), \ldots, \nu_\varepsilon(t_J)\big),$$

and define $m_j = m(j)$ for $j = 1, 2, \ldots, J$, and

$$\Phi_{m,a}(t) = \begin{pmatrix} 1 & \exp(it_2 m_1 + i a_\varepsilon(t_2)) & \cdots & \exp(it_J m_1 + i a_\varepsilon(t_J)) \\ 1 & \exp(it_2 m_2 + i a_\varepsilon(t_2)) & \cdots & \exp(it_J m_2 + i a_\varepsilon(t_J)) \\ \vdots & \vdots & & \vdots \\ 1 & \exp(it_2 m_J + i a_\varepsilon(t_2)) & \cdots & \exp(it_J m_J + i a_\varepsilon(t_J)) \end{pmatrix}.$$

With this matrix notation, for any real-valued vector $t$ equation (2.3) is equivalent to

$$\Phi_{Y,X}(t) = F_{X,X^*} \Phi_{m,a}(t) D_{|\phi_\varepsilon|}(t). \qquad (2.4)$$

Equation (2.4) relates the known parameters $\Phi_{Y,X}(t)$ (which may be interpreted as reduced form parameters of the model) to the unknown structural parameters $F_{X,X^*}$, $\Phi_{m,a}(t)$, and $D_{|\phi_\varepsilon|}(t)$. Equation (2.4) provides a sufficient number of equality constraints to identify the structural parameters given the reduced form parameters, so what is required are sufficient invertibility or rank restrictions to rule out multiple solutions of these equations. To provide these conditions, consider both the real and imaginary parts of $\Phi_{Y,X}(t)$. Since $D_{|\phi_\varepsilon|}(t)$ is real by definition, we have

$$\mathrm{Re}\{\Phi_{Y,X}(t)\} = F_{X,X^*} \mathrm{Re}\{\Phi_{m,a}(t)\} D_{|\phi_\varepsilon|}(t), \qquad (2.5)$$

and

$$\mathrm{Im}\{\Phi_{Y,X}(t)\} = F_{X,X^*} \mathrm{Im}\{\Phi_{m,a}(t)\} D_{|\phi_\varepsilon|}(t). \qquad (2.6)$$

The matrices $\mathrm{Im}\{\Phi_{Y,X}(t)\}$ and $\mathrm{Im}\{\Phi_{m,a}(t)\}$ are not invertible because their first columns are zeros, so we replace equation (2.6) with

$$\big(\mathrm{Im}\{\Phi_{Y,X}(t)\} + \Delta_X\big) = F_{X,X^*} \big(\mathrm{Im}\{\Phi_{m,a}(t)\} + \Delta_1\big) D_{|\phi_\varepsilon|}(t), \qquad (2.7)$$

where

$$\Delta_X = \begin{pmatrix} f_X(1) & 0 & \cdots & 0 \\ f_X(2) & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ f_X(J) & 0 & \cdots & 0 \end{pmatrix} \qquad \text{and} \qquad \Delta_1 = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 1 & 0 & \cdots & 0 \end{pmatrix}.$$

Equation (2.7) holds because $F_{X,X^*} \Delta_1 = \Delta_X$ and $\Delta_1 D_{|\phi_\varepsilon|}(t) = \Delta_1$. Let $C_t \equiv (\mathrm{Re}\{\Phi_{Y,X}(t)\})^{-1} (\mathrm{Im}\{\Phi_{Y,X}(t)\} + \Delta_X)$.

Assumption 2.3 (rank). There is a real-valued vector $t = (0, t_2, \ldots, t_J)$ such that: (i) $\mathrm{Re}\{\Phi_{Y,X}(t)\}$ and $(\mathrm{Im}\{\Phi_{Y,X}(t)\} + \Delta_X)$ are invertible; (ii) for any real-valued $J \times J$ diagonal matrices $D_k = \mathrm{Diag}(0, d_{k,2}, \ldots, d_{k,J})$, $k = 1, 2$, if $D_1 + C_t D_1 C_t + D_2 C_t - C_t D_2 = 0$ then $D_k = 0$ for $k = 1, 2$.

We call Assumption 2.3 the rank condition, because it is analogous to the rank condition for identification in linear models; in particular it implies identification of the two diagonal matrices

$$D_{\ln|\phi_\varepsilon|}(t) = \mathrm{Diag}\left(0, \frac{\partial}{\partial t} \ln \nu_\varepsilon(t)\Big|_{t_2}, \ldots, \frac{\partial}{\partial t} \ln \nu_\varepsilon(t)\Big|_{t_J}\right) \quad \text{and} \quad D_a(t) = \mathrm{Diag}\left(0, \frac{\partial}{\partial t} a_\varepsilon(t)\Big|_{t_2}, \ldots, \frac{\partial}{\partial t} a_\varepsilon(t)\Big|_{t_J}\right).$$

Assumption 2.3(ii) is rather complicated, but it can be replaced by some simpler sufficient alternatives, which we describe later. Note also that the rank condition, Assumption 2.3, is testable, since it is expressed entirely in terms of $f_X$ and the matrix $\Phi_{Y,X}(t)$, which, given a vector $t$, can be directly estimated from the data.
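Because Assumption 2.3 involves only $f_X$ and $\Phi_{Y,X}(t)$, its part (i) can be checked on a sample. The sketch below forms the sample analogue of $\Phi_{Y,X}(t)$ and $C_t$ for a candidate vector `t_vec` (with first entry zero); the condition-number threshold is an arbitrary illustrative choice, not a rule from the paper.

```python
import numpy as np

def phi_matrix(y, x, t_vec, J):
    """Sample analogue of Phi_{Y,X}(t): first column is f_X(j) (since t_1 = 0),
    remaining columns are the joint empirical c.f. phi_{Y,X=j}(t_k)."""
    Phi = np.empty((J, J), dtype=complex)
    for j in range(1, J + 1):
        mask = (x == j)
        Phi[j - 1, 0] = mask.mean()                 # f_X(j) = phi_{Y,X=j}(0)
        for k, t in enumerate(t_vec[1:], start=1):
            Phi[j - 1, k] = np.mean(np.exp(1j * t * y) * mask)
    return Phi

def check_rank_condition_i(Phi):
    """Assumption 2.3(i): Re(Phi) and Im(Phi) + Delta_X invertible; also returns C_t."""
    J = Phi.shape[0]
    DeltaX = np.zeros((J, J))
    DeltaX[:, 0] = Phi[:, 0].real                   # first column of Phi is f_X
    R, I = Phi.real, Phi.imag + DeltaX
    ok = (np.linalg.cond(R) < 1e8) and (np.linalg.cond(I) < 1e8)
    C_t = np.linalg.solve(R, I)
    return ok, C_t
```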

In the appendix, we show that

$$\big(\mathrm{Re}\,\Phi_{Y,X}(t)\big) A_t \big(\mathrm{Re}\,\Phi_{Y,X}(t)\big)^{-1} = F_{X|X^*} D_m F_{X|X^*}^{-1}, \qquad (2.8)$$

where $A_t$ on the left-hand side is identified once $D_{\ln|\phi_\varepsilon|}(t)$ and $D_a(t)$ are identified, $D_m = \mathrm{Diag}(m(1), \ldots, m(J))$, and

$$F_{X|X^*} = \begin{pmatrix} f_{X|X^*}(1|1) & f_{X|X^*}(1|2) & \cdots & f_{X|X^*}(1|J) \\ f_{X|X^*}(2|1) & f_{X|X^*}(2|2) & \cdots & f_{X|X^*}(2|J) \\ \vdots & \vdots & & \vdots \\ f_{X|X^*}(J|1) & f_{X|X^*}(J|2) & \cdots & f_{X|X^*}(J|J) \end{pmatrix}.$$

Equation (2.8) implies that $f_{X|X^*}(\cdot|x^*)$ and $m(x^*)$ are eigenfunctions and eigenvalues of an identified $J \times J$ matrix on the left-hand side. We may then identify $f_{X|X^*}(\cdot|x^*)$ and $m(x^*)$ under the following assumption:

Assumption 2.4. (i) $m(x^*) < \infty$ and $m(x^*) \neq 0$ for all $x^* \in \mathcal{X}^*$; (ii) $m(x^*)$ is strictly increasing in $x^* \in \mathcal{X}^*$.

Assumption 2.4(i) implies that each possible value of $X^*$ is relevant for $Y$, and the monotonicity assumption 2.4(ii) allows us to assign each eigenvalue $m(x^*)$ to its corresponding value $x^*$. If we only wish to identify the support of the latent factor $m^* = m(X^*)$, and not the regression function $m(\cdot)$ itself, then this monotonicity assumption can be dropped. Assumption 2.4 could be replaced by restrictions on $f_{X,X^*}$ instead (e.g., by exploiting knowledge about the eigenfunctions rather than the eigenvalues to properly assign each $m(x^*)$

to its corresponding value $x^*$), but assumption 2.4 is more in line with our other assumptions, which assume that we have information about our regression model but know very little about the relationship of the unobserved $X^*$ to the proxy $X$.

Theorem 2.1. Suppose that assumptions 2.1, 2.2, 2.3, and 2.4 hold in equation (1.1). Then the density $f_{Y,X}$ uniquely determines $f_{Y|X^*}$ and $f_{X,X^*}$.

Given our model, defined by assumptions 2.1 and 2.2, Theorem 2.1 shows that assumptions 2.3 and 2.4 guarantee that the sample of $(Y, X)$ is informative enough to identify $\phi_\varepsilon$, $m(x^*)$, and $f_{X,X^*}$ in equation (2.2), meaning that all the unknowns in equation (2.1) are nonparametrically identified. This identification is obtained without additional sample information such as an instrumental variable or a secondary sample. Of course, if we do have additional covariates such as instruments or repeated measures, they could be exploited along with Theorem 2.1. Our results can also be immediately applied if we observe an additional covariate vector $W$ that appears in the regression function, so $Y = m(X^*, W) + \varepsilon$, since our assumptions and results can all be restated as conditioned upon $W$.

Now consider some simpler sufficient conditions for assumption 2.3(ii) in Theorem 2.1. Denote $C_t^T$ as the transpose of $C_t$. Let the notation "$\circ$" stand for the Hadamard product, i.e., the element-wise product of two matrices.

Assumption 2.5. The real-valued vector $t = (0, t_2, \ldots, t_J)$ satisfying assumption 2.3(i) also satisfies: (i) $C_t \circ C_t^T + I$ is invertible; (ii) all the entries in the first row of the matrix $C_t$ are nonzero.

Assumption 2.5 implies assumption 2.3(ii), and is in fact stronger than assumption 2.3(ii), since if it holds then we may explicitly solve for $D_{\ln|\phi_\varepsilon|}(t)$ and $D_a(t)$ in simple closed form.
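Numerically, the constructive content of Theorem 2.1 via equation (2.8) reduces to an eigendecomposition. A minimal sketch, assuming the identified matrix on the left-hand side of (2.8) (here `LHS`) has already been estimated from the data:

```python
import numpy as np

def recover_m_and_misclassification(LHS):
    """Eigendecomposition of (Re Phi) A_t (Re Phi)^{-1} = F_{X|X*} D_m F_{X|X*}^{-1}.
    Eigenvalues are m(1) < ... < m(J); each eigenvector, rescaled so its entries
    sum to one, is the corresponding column f_{X|X*}(.|x*)."""
    eigvals, eigvecs = np.linalg.eig(LHS)
    order = np.argsort(eigvals.real)            # monotonicity (Assumption 2.4) pins down labels
    m = eigvals.real[order]
    F = eigvecs.real[:, order]
    F = F / F.sum(axis=0, keepdims=True)        # columns are conditional densities
    return m, F
```

The sort step is exactly where Assumption 2.4(ii) does its work: without monotonicity, the eigenvalues identify the support of $m^*$ but not which latent value each belongs to.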

Another alternative to assumption 2.3(ii) is the following:

Assumption 2.6 (symmetric rank). $a_\varepsilon(t) = 0$ for all $t$, and for any real-valued $J \times J$ diagonal matrix $D_1 = \mathrm{Diag}(0, d_{1,2}, \ldots, d_{1,J})$, if $D_1 + C_t D_1 C_t = 0$ then $D_1 = 0$.

The condition in assumption 2.6 that $a_\varepsilon(t) = 0$ for all $t$ is the same as assuming that the distribution of the error term $\varepsilon$ is symmetric. We call assumption 2.6 the symmetric rank condition because it implies our previous rank condition when $\varepsilon$ is symmetrically distributed.

Finally, as noted in the introduction, the assumption that the measurement error is independent of the regression error, assumption 2.2, is stronger than necessary. All that independence is used for is to obtain equation (1.3) for some given values of $t$. More formally, all that is required is that equation (2.4), and hence equations (2.6) and (2.7), hold for the vector $t$ in assumption 2.3. When there are covariates $W$ in the regression model, which we will use in the estimation, the requirement becomes that equation (2.4) hold for the vector $t$ in assumption 2.3 conditional on $W$. Therefore, Theorem 2.1 holds replacing assumption 2.2 with the following, strictly weaker assumption.

Assumption 2.7. For the known $t = (0, t_2, \ldots, t_J)$ that satisfies assumption 2.3, $\phi_{\varepsilon|X^*=x^*}(t_j) = \phi_{\varepsilon|X^*=1}(t_j)$ and $\frac{\partial}{\partial t}\phi_{\varepsilon|X^*=x^*}(t_j) = \frac{\partial}{\partial t}\phi_{\varepsilon|X^*=1}(t_j)$ for all $x^* \in \mathcal{X}^*$ and $j = 2, \ldots, J$.

This condition permits some correlation of the proxy $X$ with the regression error $\varepsilon$, and allows some moments of $\varepsilon$ to correlate with $X^*$.

2.1 THE DICHOTOMOUS CASE

We now show how the assumptions required for Theorem 2.1 can be relaxed and simplified in the special case where $X^*$ is a 0-1 dichotomous variable, i.e., $\mathcal{X}^* = \{0, 1\}$. Define $m_j = m(j)$

for $j = 0, 1$. Given just assumption 2.1, the relationship between the observed density and the latent ones becomes

$$f_{Y|X}(y|j) = f_{X^*|X}(0|j) f_{\varepsilon|X^*}(y - m_0 \mid 0) + f_{X^*|X}(1|j) f_{\varepsilon|X^*}(y - m_1 \mid 1) \quad \text{for } j = 0, 1. \qquad (2.9)$$

With assumption 2.2, equation (2.9) simplifies to

$$f_{Y|X}(y|j) = f_{X^*|X}(0|j) f_\varepsilon(y - m_0) + f_{X^*|X}(1|j) f_\varepsilon(y - m_1) \quad \text{for } j = 0, 1, \qquad (2.10)$$

which says that the observed density $f_{Y|X}(y|j)$ is a mixture of two distributions that only differ in their means. Studies on mixture models focus on parametric or nonparametric restrictions on $f_\varepsilon$ for a single value of $j$ that suffice to identify all the unknowns in this equation. For example, Bordes, Mottelet and Vandekerkhove (2006) show that all the unknowns in equation (2.10) are identified for each $j$ when the distribution of $\varepsilon$ is symmetric. In contrast, errors-in-variables models typically impose restrictions on $f_{X^*|X}$ (or exploit additional information regarding $f_{X^*|X}$, such as instruments or validation data) along with equation (2.9) or (2.10) to obtain identification with few restrictions on the distribution $f_\varepsilon$.

Now consider assumptions 2.3 or 2.5 in the dichotomous case. We then have, for any real-valued 2-vector $t = (0, t)$,

$$\Phi_{Y,X}(t) = \begin{pmatrix} f_X(0) & \phi_{Y|X=0}(t) f_X(0) \\ f_X(1) & \phi_{Y|X=1}(t) f_X(1) \end{pmatrix},$$

so that

$$\mathrm{Re}\{\Phi_{Y,X}(t)\} = \begin{pmatrix} f_X(0) & \mathrm{Re}\,\phi_{Y|X=0}(t)\, f_X(0) \\ f_X(1) & \mathrm{Re}\,\phi_{Y|X=1}(t)\, f_X(1) \end{pmatrix}, \qquad \det\big(\mathrm{Re}\{\Phi_{Y,X}(t)\}\big) = f_X(0) f_X(1) \big[\mathrm{Re}\,\phi_{Y|X=1}(t) - \mathrm{Re}\,\phi_{Y|X=0}(t)\big],$$

$$\mathrm{Im}\{\Phi_{Y,X}(t)\} + \Delta_X = \begin{pmatrix} f_X(0) & \mathrm{Im}\,\phi_{Y|X=0}(t)\, f_X(0) \\ f_X(1) & \mathrm{Im}\,\phi_{Y|X=1}(t)\, f_X(1) \end{pmatrix}, \qquad \det\big(\mathrm{Im}\{\Phi_{Y,X}(t)\} + \Delta_X\big) = f_X(0) f_X(1) \big[\mathrm{Im}\,\phi_{Y|X=1}(t) - \mathrm{Im}\,\phi_{Y|X=0}(t)\big].$$

Also,

$$C_t \equiv \big(\mathrm{Re}\{\Phi_{Y,X}(t)\}\big)^{-1} \big(\mathrm{Im}\{\Phi_{Y,X}(t)\} + \Delta_X\big) = \begin{pmatrix} 1 & \dfrac{f_X(0) f_X(1)\big[\mathrm{Im}\,\phi_{Y|X=0}(t)\,\mathrm{Re}\,\phi_{Y|X=1}(t) - \mathrm{Re}\,\phi_{Y|X=0}(t)\,\mathrm{Im}\,\phi_{Y|X=1}(t)\big]}{\det(\mathrm{Re}\{\Phi_{Y,X}(t)\})} \\ 0 & \dfrac{\det(\mathrm{Im}\{\Phi_{Y,X}(t)\} + \Delta_X)}{\det(\mathrm{Re}\{\Phi_{Y,X}(t)\})} \end{pmatrix},$$

thus

$$C_t \circ C_t^T + I = \mathrm{Diag}\left(2,\; 1 + \left[\frac{\det(\mathrm{Im}\{\Phi_{Y,X}(t)\} + \Delta_X)}{\det(\mathrm{Re}\{\Phi_{Y,X}(t)\})}\right]^2\right),$$

which is always invertible. Therefore, for the dichotomous case, assumption 2.3 and assumption 2.5 become the same, and can be expressed as the following rank condition for binary data:

Assumption 2.8 (binary rank). (i) $f_X(0) f_X(1) > 0$; (ii) there exists a real-valued scalar $t$ such that $\mathrm{Re}\,\phi_{Y|X=0}(t) \neq \mathrm{Re}\,\phi_{Y|X=1}(t)$, $\mathrm{Im}\,\phi_{Y|X=0}(t) \neq \mathrm{Im}\,\phi_{Y|X=1}(t)$, and $\mathrm{Im}\,\phi_{Y|X=0}(t)\,\mathrm{Re}\,\phi_{Y|X=1}(t) \neq \mathrm{Re}\,\phi_{Y|X=0}(t)\,\mathrm{Im}\,\phi_{Y|X=1}(t)$.

It should generally be easy to find a real-valued scalar $t$ that satisfies this binary rank condition. In the dichotomous case, instead of imposing Assumption 2.4, we may obtain the ordering of the $m_j$ from that of the observed $\mu_j \equiv E(Y|X=j)$ under the following assumption:

Assumption 2.9. (i) $\mu_1 > \mu_0$; (ii) $f_{X^*|X}(1|0) + f_{X^*|X}(0|1) < 1$.

Assumption 2.9(i) is not restrictive because one can always redefine $X$ as $1 - X$ if needed. Assumption 2.9(ii) reveals the ordering of $m_0$ and $m_1$, by making it the same as that of $\mu_0$ and $\mu_1$, because

$$\mu_1 - \mu_0 = \big[1 - f_{X^*|X}(1|0) - f_{X^*|X}(0|1)\big](m_1 - m_0),$$

so $m_1 > m_0$. Assumption 2.9(ii) says that the sum of misclassification probabilities is less than one, meaning that, on average, the observations $X$ are more accurate predictions of $X^*$ than pure guesses. See Lewbel (2007) for further discussion of this assumption. The following corollary is a direct application of Theorem 2.1; hence we omit its proof.

Corollary 2.2. Suppose that $\mathcal{X}^* = \{0, 1\}$, equations (1.1) and (2.10), and assumptions 2.8 and 2.9 hold. Then the density $f_{Y,X}$ uniquely determines $f_{Y|X^*}$ and $f_{X,X^*}$.
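The binary rank condition is a set of simple inequality checks on the two empirical conditional characteristic functions. A short sketch (the arrays `y`, `x` are hypothetical; `np.isclose` tolerances stand in for a formal test):

```python
import numpy as np

def binary_rank_ok(y, x, t):
    """Check Assumption 2.8(ii) at a scalar t: the two conditional e.c.f.'s must
    differ in real part, in imaginary part, and in the cross product Im*Re - Re*Im."""
    phi0 = np.mean(np.exp(1j * t * y[x == 0]))
    phi1 = np.mean(np.exp(1j * t * y[x == 1]))
    c1 = not np.isclose(phi0.real, phi1.real)
    c2 = not np.isclose(phi0.imag, phi1.imag)
    c3 = not np.isclose(phi0.imag * phi1.real, phi0.real * phi1.imag)
    return c1 and c2 and c3
```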

2.1.1 IDENTIFICATION WITHOUT INDEPENDENCE

We now show how to obtain identification in the dichotomous case without the independent regression error assumption 2.2. Given just assumption 2.1, equation (2.9) implies that the observed density $f_{Y|X}(y|j)$ is a mixture of the two conditional densities $f_{\varepsilon|X^*}(y - m_0 \mid 0)$ and $f_{\varepsilon|X^*}(y - m_1 \mid 1)$. Instead of assuming independence between $X^*$ and $\varepsilon$, we impose the following weaker assumption:

Assumption 2.10. $E(\varepsilon^k \mid X^*) = E(\varepsilon^k)$ for $k = 2, 3$.

Only these two moment restrictions are needed because we only need to solve for two unknowns, $m_0$ and $m_1$. Identification could also be obtained using other, similar restrictions such as quantiles or modes. For example, one of the moments in assumption 2.10 might be replaced with the assumption that the density $f_{\varepsilon|X^*=0}$ has zero median. Equation (2.9) then implies that

$$0.5 = \frac{\mu_1 - m_0}{\mu_1 - \mu_0} \int_{-\infty}^{m_0} f_{Y|X=0}(y)\, dy - \frac{\mu_0 - m_0}{\mu_1 - \mu_0} \int_{-\infty}^{m_0} f_{Y|X=1}(y)\, dy,$$

which may uniquely identify $m_0$ under some testable assumptions. An advantage of Assumption 2.10 is that we obtain a closed-form solution for $m_0$ and $m_1$. Define

$$v_j \equiv E\big[(Y - \mu_j)^2 \mid X = j\big], \qquad s_j \equiv E\big[(Y - \mu_j)^3 \mid X = j\big],$$

$$C_1 \equiv \frac{(v_1 + \mu_1^2) - (v_0 + \mu_0^2)}{\mu_1 - \mu_0}, \qquad C_2 \equiv \frac{1}{2}\left[\frac{s_1 - s_0}{\mu_1 - \mu_0} - C_1^2 + 3C_1(\mu_1 + \mu_0) - 2(\mu_1^2 + \mu_1\mu_0 + \mu_0^2)\right].$$

We leave the detailed proof to the appendix and present the result as follows:

Theorem 2.3. Suppose that $\mathcal{X}^* = \{0, 1\}$, equations (1.1) and (2.9), and assumptions 2.9 and 2.10 hold. Then the density $f_{Y|X}$ uniquely determines $f_{Y|X^*}$ and $f_{X^*|X}$. To be specific, we have

$$m_0 = \frac{C_1}{2} - \sqrt{\frac{C_1^2}{4} - C_2}, \qquad m_1 = \frac{C_1}{2} + \sqrt{\frac{C_1^2}{4} - C_2},$$

$$f_{X^*|X}(1|0) = \frac{\mu_0 - m_0}{\sqrt{C_1^2 - 4C_2}}, \qquad f_{X^*|X}(0|1) = \frac{m_1 - \mu_1}{\sqrt{C_1^2 - 4C_2}},$$

and

$$f_{\varepsilon|X^*=j}(y) = \frac{\mu_1 - m_j}{\mu_1 - \mu_0}\, f_{Y|X=0}(y + m_j) + \frac{m_j - \mu_0}{\mu_1 - \mu_0}\, f_{Y|X=1}(y + m_j), \qquad j = 0, 1.$$
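Theorem 2.3 is fully constructive, so a direct sample analogue is immediate: plug empirical conditional moments into the closed forms above. A sketch, using the expressions for $C_1$ and $C_2$ given before the theorem (the arrays `y`, `x` are hypothetical):

```python
import numpy as np

def closed_form_binary(y, x):
    """Closed-form estimates of (m0, m1) and the misclassification probabilities
    from conditional moments, following Theorem 2.3 (dichotomous X*)."""
    mu, v, s = np.empty(2), np.empty(2), np.empty(2)
    for j in (0, 1):
        yj = y[x == j]
        mu[j] = yj.mean()
        v[j] = np.mean((yj - mu[j]) ** 2)
        s[j] = np.mean((yj - mu[j]) ** 3)
    C1 = ((v[1] + mu[1] ** 2) - (v[0] + mu[0] ** 2)) / (mu[1] - mu[0])
    C2 = ((s[1] - s[0]) / (mu[1] - mu[0]) - C1 ** 2
          + 3 * C1 * (mu[1] + mu[0])
          - 2 * (mu[1] ** 2 + mu[1] * mu[0] + mu[0] ** 2)) / 2
    root = np.sqrt((C1 / 2) ** 2 - C2)
    m0, m1 = C1 / 2 - root, C1 / 2 + root
    q = (mu[0] - m0) / (m1 - m0)          # f_{X*|X}(1|0)
    p = (m1 - mu[1]) / (m1 - m0)          # f_{X*|X}(0|1)
    return m0, m1, p, q
```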

3 SIEVE MAXIMUM LIKELIHOOD ESTIMATION

This section considers estimation of the nonparametric regression model

$$Y = m(X^*, W) + \varepsilon,$$

where the function $m(\cdot)$ is unknown, $W$ is a vector of error-free covariates, and $\varepsilon$ is independent of $(X^*, W)$. Let $\{Z_t \equiv (Y_t, X_t, W_t)\}_{t=1}^{n}$ denote a random sample of $Z \equiv (Y, X, W)$. We have shown that $f_{Y|X^*,W}$ and $f_{X^*|X,W}$ are identified from $f_{Y|X,W}$. Let $\alpha_0 \equiv (f_1, f_2, f_3)^T \equiv (f_\varepsilon, f_{X^*|X,W}, m)^T$ be the true parameters of interest. Then the observed likelihood of $Y$ given $(X, W)$ (or the likelihood for $\alpha_0$) is

$$\prod_{t=1}^{n} f_{Y|X,W}(Y_t \mid X_t, W_t) = \prod_{t=1}^{n} \left\{ \sum_{x^* \in \mathcal{X}^*} f_\varepsilon\big(Y_t - m(x^*, W_t)\big)\, f_{X^*|X,W}(x^* \mid X_t, W_t) \right\}.$$

Before we present a sieve ML estimator $\hat\alpha$ for $\alpha_0$, we need to impose some mild smoothness restrictions on the unknown functions $(f_\varepsilon, f_{X^*|X,W}, m)^T$. The sieve method allows for unknown functions belonging to many different function spaces, such as Sobolev space, Besov space, and others; see, e.g., Shen and Wong (1994), Wong and Shen (1995), Shen (1997), and Van de Geer (1993, 2000). But, for the sake of concreteness and simplicity, we consider the widely used Hölder space of functions. Let $\xi = (\xi_1, \ldots, \xi_d)^T \in \mathbb{R}^d$, let $a = (a_1, \ldots, a_d)^T$ be a vector of non-negative integers, and let

$$\nabla^a h(\xi) \equiv \frac{\partial^{|a|} h(\xi_1, \ldots, \xi_d)}{\partial \xi_1^{a_1} \cdots \partial \xi_d^{a_d}}$$

denote the $|a| = a_1 + \cdots + a_d$-th derivative. Let $\|\cdot\|_E$ denote the Euclidean norm. Let $\mathcal{V} \subseteq \mathbb{R}^d$ and let $\gamma_0$ be the largest integer satisfying $\gamma > \gamma_0$. The Hölder space $\Lambda^\gamma(\mathcal{V})$ of order $\gamma > 0$ is the space of functions $h: \mathcal{V} \to \mathbb{R}$ such that the first $\gamma_0$ derivatives are continuous and bounded, and the $\gamma_0$-th derivatives are Hölder continuous with exponent $\gamma - \gamma_0 \in (0, 1]$. The Hölder space $\Lambda^\gamma(\mathcal{V})$ becomes a Banach space under the Hölder norm:

$$\|h\|_{\Lambda^\gamma} = \max_{|a| \le \gamma_0} \sup_\xi |\nabla^a h(\xi)| + \max_{|a| = \gamma_0} \sup_{\xi \neq \xi'} \frac{|\nabla^a h(\xi) - \nabla^a h(\xi')|}{(\|\xi - \xi'\|_E)^{\gamma - \gamma_0}} < \infty.$$

Denote $\Lambda_c^\gamma(\mathcal{V}) \equiv \{h \in \Lambda^\gamma(\mathcal{V}) : \|h\|_{\Lambda^\gamma} \le c < \infty\}$ as a Hölder ball. Let $\varepsilon \in \mathbb{R}$ and $W \in \mathcal{W}$, with $\mathcal{W}$ a compact convex subset of $\mathbb{R}^{d_w}$. Also denote

$$\mathcal{F}_1 = \left\{ \sqrt{f_1}(\cdot) \in \Lambda_c^{\gamma_1}(\mathbb{R}) : f_1(\cdot) > 0,\ \int_{\mathbb{R}} f_1(\varepsilon)\, d\varepsilon = 1 \right\},$$

$$\mathcal{F}_2 = \left\{ \sqrt{f_2}(x^* \mid x, \cdot) \in \Lambda_c^{\gamma_2}(\mathcal{W}) : f_2(\cdot \mid \cdot, \cdot) > 0,\ \int_{\mathcal{X}^*} f_2(x^* \mid x, w)\, dx^* = 1 \text{ for all } x \in \mathcal{X},\ w \in \mathcal{W} \right\},$$

and

$$\mathcal{F}_3 = \left\{ f_3(x^*, \cdot) \in \Lambda_c^{\gamma_3}(\mathcal{W}) : f_3(i, w) > f_3(j, w) \text{ for all } i > j,\ i, j \in \mathcal{X}^*,\ w \in \mathcal{W} \right\}.$$

We impose the following smoothness restrictions on the densities:

Assumption 3.1. (i) All the assumptions in Theorem 2.1 hold; (ii) $\sqrt{f_\varepsilon}(\cdot) \in \mathcal{F}_1$ with $\gamma_1 > 1/2$; (iii) $\sqrt{f_{X^*|X,W}}(x^* \mid x, \cdot) \in \mathcal{F}_2$ with $\gamma_2 > d_w/2$ for all $x^*, x \in \mathcal{X} \equiv \{1, \ldots, J\}$; (iv) $m(x^*, \cdot) \in \mathcal{F}_3$ with $\gamma_3 > d_w/2$ for all $x^* \in \mathcal{X}^*$.

Denote $\mathcal{A} = \mathcal{F}_1 \times \mathcal{F}_2 \times \mathcal{F}_3$ and $\alpha \equiv (f_1, f_2, f_3)^T$. Let $E[\cdot]$ denote the expectation with respect to the underlying true data generating process for $Z_t$. Then $\alpha_0 \equiv (f_{01}, f_{02}, f_{03})^T = \arg\max_{\alpha \in \mathcal{A}} E[\ell(Z_t; \alpha)]$, where

$$\ell(Z_t; \alpha) \equiv \ln \left\{ \sum_{x^* \in \mathcal{X}^*} f_1\big(Y_t - f_3(x^*, W_t)\big)\, f_2(x^* \mid X_t, W_t) \right\}. \qquad (3.1)$$

Let $\mathcal{A}_n = \mathcal{F}_1^n \times \mathcal{F}_2^n \times \mathcal{F}_3^n$ be a sieve space for $\mathcal{A}$, that is, a sequence of approximating spaces that are dense in $\mathcal{A}$ under some pseudo-metric. The sieve MLE $\hat\alpha_n = (\hat f_1, \hat f_2, \hat f_3)^T \in \mathcal{A}_n$ for $\alpha_0 \in \mathcal{A}$ is defined as

$$\hat\alpha_n = \arg\max_{\alpha \in \mathcal{A}_n} \sum_{t=1}^{n} \ell(Z_t; \alpha). \qquad (3.2)$$

We could apply infinite-dimensional approximating spaces as sieves $\mathcal{F}_j^n$ for $\mathcal{F}_j$, $j = 1, 2, 3$. However, in applications, we shall use finite-dimensional sieve spaces since they are easier to implement. For $j = 1, 2, 3$, let $p_j^{k_{j,n}}(\cdot)$ be a $k_{j,n} \times 1$ vector of known basis functions, such as power series, splines, Fourier series, etc. Then we denote the sieve spaces for $\mathcal{F}_j$, $j = 1, 2, 3$,

as follows:

$$\mathcal{F}_1^n = \left\{ \sqrt{f_1}(\cdot) = p_1^{k_{1,n}}(\cdot)^T \beta_1 \in \mathcal{F}_1 \right\},$$

$$\mathcal{F}_2^n = \left\{ \sqrt{f_2}(x^* \mid x, \cdot) = \sum_{k=1}^{J} \sum_{j=1}^{J} 1(x^* = k)\, 1(x = j)\, p_2^{k_{2,n}}(\cdot)^T \beta_{2,kj} \in \mathcal{F}_2 \right\},$$

$$\mathcal{F}_3^n = \left\{ f_3(x^*, \cdot) = \sum_{k=1}^{J} 1(x^* = k)\, p_3^{k_{3,n}}(\cdot)^T \beta_{3,k} \in \mathcal{F}_3 \right\}.$$

We note that the method of sieve MLE is very flexible, and we can easily impose prior information on the parameter space ($\mathcal{A}$) and the sieve space ($\mathcal{A}_n$). For example, if the functional form of the true regression function $m(x^*, w)$ is known up to some finite-dimensional parameters $\theta \in \mathcal{B}$, where $\mathcal{B}$ is a compact subset of $\mathbb{R}^{d_\theta}$, then we can take $\mathcal{A} = \mathcal{F}_1 \times \mathcal{F}_2 \times \mathcal{F}_{\mathcal{B}}$ and $\mathcal{A}_n = \mathcal{F}_1^n \times \mathcal{F}_2^n \times \mathcal{F}_{\mathcal{B}}$ with $\mathcal{F}_{\mathcal{B}} = \{f_3(x^*, w) = m(x^*, w; \theta) : \theta \in \mathcal{B}\}$. The sieve MLE becomes

$$\hat\alpha_n = \arg\max_{\alpha \in \mathcal{A}_n} \sum_{t=1}^{n} \ell(Z_t; \alpha), \qquad \text{with } \ell(Z_t; \alpha) = \ln \left\{ \sum_{x^* \in \mathcal{X}^*} f_1\big(Y_t - m(x^*, W_t; \theta)\big)\, f_2(x^* \mid X_t, W_t) \right\}. \qquad (3.3)$$
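To make (3.3) concrete, here is a compact sketch of the mixture log-likelihood for binary $X^*$. Everything implementation-specific below is an illustrative assumption, not the paper's exact recipe: the regression function is a stand-in parametric form, $f_2$ uses a logistic link in $(1, w)$, and $\sqrt{f_1}$ uses a squared Hermite-polynomial sieve normalized by a Riemann sum.

```python
import numpy as np
from scipy.optimize import minimize
from numpy.polynomial.hermite_e import hermevander

def neg_loglik(params, y, x, w, k1=4):
    theta = params[:3]                       # regression parameters
    b1 = params[3:3 + k1]                    # sieve coefficients for sqrt(f_eps)
    b2 = params[3 + k1:].reshape(2, 2)       # logistic-link coefficients, one row per X value
    grid = np.linspace(-8.0, 8.0, 401)
    dg = grid[1] - grid[0]

    def f_eps(e):
        # squared Hermite polynomial times a normal kernel, normalized numerically
        val = (hermevander(e, k1 - 1) @ b1) ** 2 * np.exp(-e ** 2 / 2.0)
        norm = np.sum((hermevander(grid, k1 - 1) @ b1) ** 2
                      * np.exp(-grid ** 2 / 2.0)) * dg
        return val / norm

    def m(xstar, wv, th):
        # stand-in regression function for illustration only
        return th[0] * wv + th[1] * (1 - xstar) * wv + th[2] * (1 - xstar)

    ll = 0.0
    for xv in (0, 1):                        # observed proxy value
        idx = (x == xv)
        z = b2[xv, 0] + b2[xv, 1] * w[idx]   # P(X* = 1 | X = xv, W) via logistic link
        p1 = 1.0 / (1.0 + np.exp(-z))
        dens = (1 - p1) * f_eps(y[idx] - m(0, w[idx], theta)) \
               + p1 * f_eps(y[idx] - m(1, w[idx], theta))
        ll += np.sum(np.log(np.maximum(dens, 1e-300)))
    return -ll

# hypothetical usage: res = minimize(neg_loglik, x0, args=(y, x, w), method="Nelder-Mead")
```

Any smooth sieve for $\sqrt{f_1}$ and any parameterization of $f_2$ respecting the simplex constraints would do equally well here; the essential structure is the mixture over latent $x^*$ inside the log.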

3.1 Consistency

The consistency of the sieve MLE $\hat\alpha_n$ can be established by applying either Geman and Hwang (1982) or Lemma A.1 of Newey and Powell (2003). First we define a norm on $\mathcal{A}$ as follows:

$$\|\alpha\|_s = \sup_\varepsilon \left| \sqrt{f_1(\varepsilon)}\, (1 + \varepsilon^2)^{\zeta/2} \right| + \sup_{x^*, x, w} \left| \sqrt{f_2(x^* \mid x, w)} \right| + \sup_{x^*, w} |f_3(x^*, w)| \quad \text{for some } \zeta > 0.$$

We assume:

Assumption 3.2. (i) $-\infty < E[\ell(Z_t; \alpha_0)] < \infty$, and $E[\ell(Z_t; \alpha)]$ is upper semicontinuous on $\mathcal{A}$ under the metric $\|\cdot\|_s$; (ii) there are a finite $\kappa > 0$ and a random variable $U(Z_t)$ with $E\{U(Z_t)\} < \infty$ such that $\sup_{\alpha \in \mathcal{A}_n : \|\alpha - \alpha_0\|_s \le \delta} |\ell(Z_t; \alpha) - \ell(Z_t; \alpha_0)| \le \delta^\kappa U(Z_t)$.

Assumption 3.3. (i) $p_1^{k_{1,n}}(\cdot)$ is a $k_{1,n} \times 1$ vector of spline wavelet basis functions on $\mathbb{R}$, and, for $j = 2, 3$, $p_j^{k_{j,n}}(\cdot)$ is a $k_{j,n} \times 1$ vector of tensor products of spline basis functions on $\mathcal{W}$; (ii) $k_n \equiv \max\{k_{1,n}, k_{2,n}, k_{3,n}\} \to \infty$ and $k_n/n \to 0$.

The following consistency lemma is a direct application of Lemma A.1 of Newey and Powell (2003) or Theorem 3.1 (or Remark 3.1(4), Remark 3.3) of Chen (2006); hence we omit its proof.

Lemma 3.1. Let $\hat\alpha_n$ be the sieve MLE. Under assumptions 3.1-3.3, we have $\|\hat\alpha_n - \alpha_0\|_s = o_p(1)$.

3.2 Convergence rate under the Fisher metric

Given Lemma 3.1, we can now restrict our attention to a shrinking $\|\cdot\|_s$-neighborhood around $\alpha_0$. Let $\mathcal{A}_s \equiv \{\alpha \in \mathcal{A} : \|\alpha - \alpha_0\|_s = o(1),\ \|\alpha\|_s \le c_0 < c\}$ and $\mathcal{A}_{sn} \equiv \{\alpha \in \mathcal{A}_n : \|\alpha - \alpha_0\|_s = o(1),\ \|\alpha\|_s \le c_0 < c\}$. Then, for the purpose of establishing a convergence rate under a pseudo-metric that is weaker than $\|\cdot\|_s$, we can treat $\mathcal{A}_s$ as the new parameter space and $\mathcal{A}_{sn}$ as its sieve space, and assume that both $\mathcal{A}_s$ and $\mathcal{A}_{sn}$ are convex parameter spaces. For any $\alpha_1, \alpha_2 \in \mathcal{A}_s$, we consider a continuous path $\{\alpha(\tau) : \tau \in [0, 1]\}$ in $\mathcal{A}_s$ such that $\alpha(0) = \alpha_1$ and $\alpha(1) = \alpha_2$. For simplicity we assume that, for any $\alpha, \alpha + v \in \mathcal{A}_s$,

$\{\alpha + \tau v : \tau \in [0, 1]\}$ is a continuous path in $\mathcal{A}_s$, and that $\ell(Z_t; \alpha + \tau v)$ is twice continuously differentiable at $\tau = 0$ for almost all $Z_t$ and any direction $v \in \mathcal{A}_s$. Define the pathwise first derivative as

$$\frac{d\ell(Z_t; \alpha)}{d\alpha}[v] \equiv \frac{d\ell(Z_t; \alpha + \tau v)}{d\tau}\bigg|_{\tau=0} \quad \text{a.s. } Z_t,$$

and the pathwise second derivative as

$$\frac{d^2\ell(Z_t; \alpha)}{d\alpha\, d\alpha^T}[v, v] \equiv \frac{d^2\ell(Z_t; \alpha + \tau v)}{d\tau^2}\bigg|_{\tau=0} \quad \text{a.s. } Z_t.$$

Define the Fisher metric $\|\cdot\|$ on $\mathcal{A}_s$ as follows: for any $\alpha_1, \alpha_2 \in \mathcal{A}_s$,

$$\|\alpha_1 - \alpha_2\|^2 \equiv E\left\{ \left( \frac{d\ell(Z_t; \alpha_0)}{d\alpha}[\alpha_1 - \alpha_2] \right)^2 \right\}.$$

We show that $\hat\alpha_n$ converges to $\alpha_0$ at a rate faster than $n^{-1/4}$ under the Fisher metric $\|\cdot\|$ with the following assumptions:

Assumption 3.4. (i) $\zeta > 1$; (ii) $\gamma \equiv \min\{\gamma_1, \gamma_2/d_w, \gamma_3/d_w\} > 1/2$.

Assumption 3.5. (i) $\mathcal{A}_s$ is convex at $\alpha_0$; (ii) $\ell(Z_t; \alpha)$ is twice continuously pathwise differentiable with respect to $\alpha \in \mathcal{A}_s$.

Assumption 3.6. $\sup_{\tilde\alpha \in \mathcal{A}_s} \sup_{\alpha \in \mathcal{A}_{sn}} \left| \frac{d\ell(Z_t; \tilde\alpha)}{d\alpha}\left[\frac{\alpha - \alpha_0}{\|\alpha - \alpha_0\|_s}\right] \right| \le U(Z_t)$ for a random variable $U(Z_t)$ with $E\{[U(Z_t)]^2\} < \infty$.

Assumption 3.7. (i) $\sup_{v \in \mathcal{A}_s : \|v\|_s = 1} E\left( \frac{d\ell(Z_t; \alpha_0)}{d\alpha}[v] \right)^2 \le c < \infty$; (ii) uniformly over $\tilde\alpha \in \mathcal{A}_s$ and $\alpha \in \mathcal{A}_{sn}$, we have

$$-E\left( \frac{d^2\ell(Z_t; \tilde\alpha)}{d\alpha\, d\alpha^T}[\alpha - \alpha_0, \alpha - \alpha_0] \right) = \|\alpha - \alpha_0\|^2 \{1 + o(1)\}.$$

Assumption 3.4 guarantees that the sieve approximation error under the strong norm $\|\cdot\|_s$ goes to zero at the rate $(k_n)^{-\gamma}$. Assumption 3.5 makes sure that the twice pathwise derivatives are well defined with respect to $\alpha \in \mathcal{A}_s$, hence the pseudo-metric $\|\cdot\|$ is well defined on $\mathcal{A}_s$. Assumption 3.6 imposes an envelope condition. Assumption 3.7(i) implies that $\|\alpha - \alpha_0\| \le \sqrt{c}\, \|\alpha - \alpha_0\|_s$ for all $\alpha \in \mathcal{A}_s$. Assumption 3.7(ii) implies that there are positive finite constants $c_1$ and $c_2$ such that, for all $\alpha \in \mathcal{A}_{sn}$, $c_1 \|\alpha - \alpha_0\|^2 \le E[\ell(Z_t; \alpha_0) - \ell(Z_t; \alpha)] \le c_2 \|\alpha - \alpha_0\|^2$; that is, $\|\alpha - \alpha_0\|^2$ is equivalent to the Kullback-Leibler discrepancy on the local sieve space $\mathcal{A}_{sn}$. The following convergence rate theorem is a direct application of Theorem 3.2 of Shen and Wong (1994) or Theorem 3.2 of Chen (2006) to the local parameter space $\mathcal{A}_s$ and the local sieve space $\mathcal{A}_{sn}$; hence we omit its proof.

Theorem 3.2. Under assumptions 3.1-3.7, we have

$$\|\hat\alpha_n - \alpha_0\| = O_P\left( \max\left\{ (k_n)^{-\gamma},\ \sqrt{\frac{k_n}{n}} \right\} \right) = O_P\left( n^{-\gamma/(2\gamma+1)} \right) \quad \text{if } k_n = O\left( n^{1/(2\gamma+1)} \right).$$

3.3 Asymptotic normality and semiparametric efficiency

Let $\mathbb{V}$ denote the closure of the linear span of $\mathcal{A}_s - \{\alpha_0\}$ under the Fisher metric $\|\cdot\|$. Then $(\mathbb{V}, \|\cdot\|)$ is a Hilbert space with the inner product

$$\langle v_1, v_2 \rangle \equiv E\left( \frac{d\ell(Z_t; \alpha_0)}{d\alpha}[v_1] \times \frac{d\ell(Z_t; \alpha_0)}{d\alpha}[v_2] \right).$$

We are interested in estimation of a functional $\rho(\alpha_0)$, where $\rho: \mathcal{A} \to \mathbb{R}$. It is known that the asymptotic properties of $\rho(\hat\alpha_n)$ depend on the smoothness of the functional $\rho$ and on the rate of convergence of the sieve MLE $\hat\alpha_n$. For any $v \in \mathbb{V}$, we denote

$$\frac{d\rho(\alpha_0)}{d\alpha}[v] \equiv \lim_{\tau \to 0} \frac{\rho(\alpha_0 + \tau v) - \rho(\alpha_0)}{\tau}$$

whenever the right-hand-side limit is well defined. We impose the following additional conditions for asymptotic normality of the plug-in sieve MLE $\rho(\hat\alpha_n)$:

Assumption 3.8. (i) For any $v \in \mathbb{V}$, $\rho(\alpha_0 + \tau v)$ is continuously differentiable in $\tau \in [0, 1]$ near $\tau = 0$, and

$$\left\| \frac{d\rho(\alpha_0)}{d\alpha} \right\| \equiv \sup_{v \in \mathbb{V} : \|v\| > 0} \frac{\left| \frac{d\rho(\alpha_0)}{d\alpha}[v] \right|}{\|v\|} < \infty;$$

(ii) there exist constants $c > 0$, $\omega > 0$, and a small $\bar\varepsilon > 0$ such that, for any $v \in \mathbb{V}$ with $\|v\| \le \bar\varepsilon$, we have

$$\left| \rho(\alpha_0 + v) - \rho(\alpha_0) - \frac{d\rho(\alpha_0)}{d\alpha}[v] \right| \le c \|v\|^\omega.$$

Under Assumption 3.8(i), by the Riesz representation theorem, there exists $v^* \in \mathbb{V}$ such

that

$$\langle v^*, v \rangle = \frac{d\rho(\alpha_0)}{d\alpha}[v] \quad \text{for all } v \in \mathbb{V} \qquad (3.4)$$

and

$$\|v^*\|^2 \equiv \left\| \frac{d\rho(\alpha_0)}{d\alpha} \right\|^2 = \sup_{v \in \mathbb{V} : \|v\| > 0} \frac{\left| \frac{d\rho(\alpha_0)}{d\alpha}[v] \right|^2}{\|v\|^2} < \infty. \qquad (3.5)$$

Under Theorem 3.2, we have $\|\hat\alpha_n - \alpha_0\| = O_P(\delta_n)$ with $\delta_n = n^{-\gamma/(2\gamma+1)}$. In the following we denote $\mathcal{N}_0 \equiv \{\alpha \in \mathcal{A}_s : \|\alpha - \alpha_0\| = O(\delta_n)\}$ and $\mathcal{N}_{0n} \equiv \{\alpha \in \mathcal{A}_{sn} : \|\alpha - \alpha_0\| = O(\delta_n)\}$.

Assumption 3.9. (i) $(\delta_n)^\omega = o(n^{-1/2})$; (ii) there is a $v_n^* \in \mathcal{A}_n - \{\alpha_0\}$ such that $\|v_n^* - v^*\| = o(1)$ and $\delta_n \times \|v_n^* - v^*\| = o(n^{-1/2})$.

Assumption 3.10. There is a random variable $U(Z_t)$ with $E\{[U(Z_t)]^2\} < \infty$ and a non-negative measurable function $\eta$ with $\lim_{\delta \to 0} \eta(\delta) = 0$ such that, for all $\alpha \in \mathcal{N}_{0n}$,

$$\sup_{\tilde\alpha \in \mathcal{N}_0} \left| \frac{d^2\ell(Z_t; \tilde\alpha)}{d\alpha\, d\alpha^T}[\alpha - \alpha_0, v_n^*] \right| \le U(Z_t) \cdot \eta(\|\alpha - \alpha_0\|_s).$$

Assumption 3.11. Uniformly over $\tilde\alpha \in \mathcal{N}_0$ and $\alpha \in \mathcal{N}_{0n}$,

$$E\left( \frac{d^2\ell(Z_t; \tilde\alpha)}{d\alpha\, d\alpha^T}[\alpha - \alpha_0, v_n^*] - \frac{d^2\ell(Z_t; \alpha_0)}{d\alpha\, d\alpha^T}[\alpha - \alpha_0, v_n^*] \right) = o\left( n^{-1/2} \right).$$

Assumption 3.8(i) is critical for obtaining $\sqrt{n}$ convergence of the plug-in sieve MLE $\rho(\hat\alpha_n)$ to $\rho(\alpha_0)$ and its asymptotic normality. If Assumption 3.8(i) is not satisfied, then the plug-in sieve MLE $\rho(\hat\alpha_n)$ is still consistent for $\rho(\alpha_0)$, but the best achievable convergence rate is slower than $\sqrt{n}$. Assumption 3.9 implies that the asymptotic bias of the Riesz representer is negligible. Assumptions 3.10 and 3.11 control the remainder term.

Applying Theorems 1 and 4 of Shen (1997), we immediately obtain:

Theorem 3.3. Suppose that assumptions 3.1-3.11 hold. Then the plug-in sieve MLE $\rho(\hat\alpha_n)$ is semiparametrically efficient, and $\sqrt{n}\big(\rho(\hat\alpha_n) - \rho(\alpha_0)\big) \to_d N(0, \|v^*\|^2)$.

Following Ai and Chen (2003), the asymptotic efficient variance $\|v^*\|^2$ of the plug-in sieve MLE $\rho(\hat\alpha_n)$ can be consistently estimated by

$$\hat\sigma_n^2 = \max_{v \in \mathcal{A}_n} \frac{\left| \frac{d\rho(\hat\alpha_n)}{d\alpha}[v] \right|^2}{\frac{1}{n} \sum_{t=1}^{n} \left( \frac{d\ell(Z_t; \hat\alpha_n)}{d\alpha}[v] \right)^2}.$$

Instead of estimating this asymptotic variance, one could also construct confidence intervals by applying likelihood ratio inference as in Murphy and Van der Vaart (1996, 2000).

4 Simulation

This section presents two small simulation studies: the first corresponds to the identification strategy, and the second checks the performance of the sieve MLE.

4.1 Moment-based estimation

This subsection applies the identification procedure to a simple nonlinear regression model with simulated data. We consider the regression model

$$y = 1 + 0.25 (x^*)^2 + 0.1 (x^*)^3 + \varepsilon,$$

where $\varepsilon \sim N(0, 1)$ is independent of $x^*$. The marginal distribution of $x^*$ is

$$\Pr(x^*) = 0.2\,[1(x^* = 1) + 1(x^* = 4)] + 0.3\,[1(x^* = 2) + 1(x^* = 3)],$$

and the misclassification probability matrices $F_{x|x^*}$ are in Tables 1-2. We consider two examples of the misclassification probability matrix. Example 1 considers a strictly diagonally dominant matrix $F_{x|x^*}$ as follows:

$$F_{x|x^*} = \begin{pmatrix} 0.6 & 0.2 & 0.1 & 0.1 \\ 0.2 & 0.6 & 0.1 & 0.1 \\ 0.1 & 0.1 & 0.7 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.7 \end{pmatrix}.$$

Example 2 has a misclassification probability matrix

$$F_{x|x^*} = 0.7\, F_u + 0.3\, I,$$

where $I$ is an identity matrix and $F_u = \left[ u_{ij} / \sum_k u_{kj} \right]_{ij}$ with the $u_{ij}$ independently drawn from a uniform distribution on $[0, 1]$. In each repetition, we directly follow the identification procedure shown in the proof of Theorem 2.1. The matrix $\Phi_{Y,X}$ is estimated by replacing the function $\phi_{Y,X=x}(t)$ with its empirical counterpart:

$$\hat\phi_{Y,X=x}(t) = \frac{1}{n} \sum_{j=1}^{n} \exp(itY_j)\, 1(X_j = x).$$
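The Example 1 design can be simulated in a few lines. The cubic regression function below follows the reconstruction above, and the sample size is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, F):
    """Draw (y, x, xstar) from the Example 1 design; F[i, j] = Pr(X = i+1 | X* = j+1)."""
    probs = np.array([0.2, 0.3, 0.3, 0.2])
    xstar = rng.choice([1, 2, 3, 4], size=n, p=probs)
    x = np.array([rng.choice([1, 2, 3, 4], p=F[:, j - 1]) for j in xstar])
    y = 1 + 0.25 * xstar ** 2 + 0.1 * xstar ** 3 + rng.standard_normal(n)
    return y, x, xstar

F1 = np.array([[0.6, 0.2, 0.1, 0.1],
               [0.2, 0.6, 0.1, 0.1],
               [0.1, 0.1, 0.7, 0.1],
               [0.1, 0.1, 0.1, 0.7]])   # columns sum to one: each column is f_{X|X*}(.|x*)
y, x, xstar = simulate(5000, F1)
```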

Since it is directly testable, assumption 2.3 is verified with the $t_j$ in the vector $t = (0, t_2, t_3, t_4)$ independently drawn from a uniform distribution until a desirable $t$ is found. The sample size is 5,000 and the number of repetitions is 1,000. The simulation results in Tables 1-2 include the estimates of the regression function $m(x^*)$, the marginal distribution $\Pr(x^*)$, and the estimated misclassification probability matrix $F_{x|x^*}$, together with standard errors of each element. As shown in Tables 1-2, the estimator following the identification procedure performs well with the simulated data.

4.2 Sieve MLE

This subsection applies the sieve ML procedure to a semiparametric model:

$$Y = \theta_1 W + \theta_2 (1 - X^*) W + \varepsilon,$$

where $\varepsilon$ is independent of $X^* \in \{0, 1\}$ and $W$. The unknowns include the parameter of interest $\theta = (\theta_1, \theta_2, \theta_3)$ and the nuisance functions $f_\varepsilon$ and $f_{X^*|X,W}$. We simulate the model from $\varepsilon \sim N(0, 1)$ and $X^* \in \{0, 1\}$ according to the marginal distribution $f_{X^*}(x^*) = 0.4\, 1(x^* = 0) + 0.6\, 1(x^* = 1)$. We generate the covariate $W$ as $W = \zeta (1 - 0.5 X^*)$, where $\zeta \sim N(0, 1)$ is independent of $X^*$. The observed mismeasured $X$ is generated as follows:

$$X = \begin{cases} 1 - X^* & \text{with probability } p(X^*) \\ X^* & \text{otherwise} \end{cases},$$

where $p(0) = 0.5$ and $p(1) = 0.3$.
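A sketch of this data-generating process follows. The regression-function form and the "true" parameter values below are placeholders keyed to the reconstruction above, not the paper's exact design (Table 3, which reports the true $\theta$, is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_semipar(n, p_flip=(0.5, 0.3)):
    """Section 4.2 design: binary X*, covariate W correlated with X*,
    and X a misclassified version of X*."""
    xstar = (rng.random(n) < 0.6).astype(int)             # f_{X*}(1) = 0.6
    w = rng.standard_normal(n) * (1 - 0.5 * xstar)        # W = zeta * (1 - 0.5 X*)
    flip = rng.random(n) < np.where(xstar == 0, p_flip[0], p_flip[1])
    x = np.where(flip, 1 - xstar, xstar)
    theta = (1.0, 1.0)                                    # placeholder true values
    y = theta[0] * w + theta[1] * (1 - xstar) * w + rng.standard_normal(n)
    return y, x, w, xstar

y, x, w, xstar = simulate_semipar(3000)
```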

The Monte Carlo simulation consists of 400 repetitions. In each repetition, we randomly draw 3,000 observations of $(Y, X, W)$, and then apply three ML estimators to compute the parameter of interest. All three estimators treat the true density $f_\varepsilon$ of the regression error $\varepsilon$ as unknown. The first estimator uses the contaminated sample $\{Y_i, X_i, W_i\}_{i=1}^n$ as if it were accurate; this estimator is inconsistent, and its bias should dominate its square root of mean square error (root MSE). The second estimator is the sieve MLE using the uncontaminated data $\{Y_i, X_i^*, W_i\}_{i=1}^n$; this estimator is consistent and most efficient. However, we call it the infeasible MLE, since $X_i^*$ is not observed in practice. The third estimator is the sieve MLE (3.3) presented in Section 3, using the sample $\{Y_i, X_i, W_i\}_{i=1}^n$ and allowing for arbitrary measurement error by treating $f_{X|X^*,W}$ as unknown. In this simulation study, all three estimators are computed by approximating the unknown $\sqrt{f_\varepsilon}$ with the same Hermite polynomial sieve; for the third estimator (the sieve MLE) we also approximate $\sqrt{f_{X|X^*,W}}$ by another Hermite polynomial sieve.

The Monte Carlo results in Table 3 show that the sieve MLE has a much smaller bias than the first estimator, which ignores the measurement error. Since the sieve MLE must estimate the additional unknown function $f_{X|X^*,W}$, its estimates $\hat\theta_j$, $j = 1, 2, 3$, may have larger standard errors than those of the other two estimators. In summary, our sieve MLE performs well in this Monte Carlo simulation.

5 Discussion

We have provided nonparametric identification and estimation of a regression model in the presence of a mismeasured discrete regressor, without the use of additional sample information such as instruments, repeated measurements, or validation data, and without parameterizing the distributions of the measurement error or of the regression error.

Identification mainly comes from the monotonicity of the regression function, the limited support of the mismeasured regressor, sufficient variation in the dependent variable, and from some independence-related assumptions regarding the regression model error. It may be possible to extend these results to continuously distributed nonclassically mismeasured regressors, by replacing many of our matrix-related assumptions and calculations with corresponding linear operators.

APPENDIX: MATHEMATICAL PROOFS

Proof (Theorem 2.1). Notice that $\nu_\varepsilon(0) = 1$ and $a_\varepsilon(0) = 0$. We define

$$\Phi'_{Y,X}(t) = \begin{pmatrix} iE[Y|X=1] f_X(1) & \frac{\partial}{\partial t}\phi_{Y,X=1}(t_2) & \cdots & \frac{\partial}{\partial t}\phi_{Y,X=1}(t_J) \\ iE[Y|X=2] f_X(2) & \frac{\partial}{\partial t}\phi_{Y,X=2}(t_2) & \cdots & \frac{\partial}{\partial t}\phi_{Y,X=2}(t_J) \\ \vdots & \vdots & & \vdots \\ iE[Y|X=J] f_X(J) & \frac{\partial}{\partial t}\phi_{Y,X=J}(t_2) & \cdots & \frac{\partial}{\partial t}\phi_{Y,X=J}(t_J) \end{pmatrix}.$$

By taking the derivative with respect to the scalar $t$, we have from equation (2.3)

$$\frac{\partial}{\partial t}\phi_{Y,X=x}(t) = \frac{\partial \nu_\varepsilon(t)}{\partial t} \sum_{x^*} \exp(itm(x^*) + i a_\varepsilon(t)) f_{X,X^*}(x, x^*) + i\,\nu_\varepsilon(t)\, \frac{\partial a_\varepsilon(t)}{\partial t} \sum_{x^*} \exp(itm(x^*) + i a_\varepsilon(t)) f_{X,X^*}(x, x^*) + i\,\nu_\varepsilon(t) \sum_{x^*} \exp(itm(x^*) + i a_\varepsilon(t))\, m(x^*) f_{X,X^*}(x, x^*). \qquad (A.1)$$

Equation (A.1) is equivalent to

$$\Phi'_{Y,X}(t) = F_{X,X^*} \Phi_{m,a}(t) D'_{|\phi_\varepsilon|}(t) + i\, F_{X,X^*} \Phi_{m,a}(t) D_{|\phi_\varepsilon|}(t) D_a(t) + i\, F_{X,X^*} D_m \Phi_{m,a}(t) D_{|\phi_\varepsilon|}(t), \qquad (A.2)$$

where

$$D_a(t) = \mathrm{Diag}\left(0, \frac{\partial a_\varepsilon(t_2)}{\partial t}, \ldots, \frac{\partial a_\varepsilon(t_J)}{\partial t}\right), \quad D'_{|\phi_\varepsilon|}(t) = \mathrm{Diag}\left(0, \frac{\partial \nu_\varepsilon(t_2)}{\partial t}, \ldots, \frac{\partial \nu_\varepsilon(t_J)}{\partial t}\right), \quad D_m = \mathrm{Diag}(m_1, m_2, \ldots, m_J).$$

Since, by definition, $D_{|\phi_\varepsilon|}(t)$ and $D_a(t)$ are real-valued, we have from equation (A.2)

$$\mathrm{Re}\{\Phi'_{Y,X}(t)\} = F_{X,X^*} \mathrm{Re}\{\Phi_{m,a}(t)\} D'_{|\phi_\varepsilon|}(t) - F_{X,X^*} \mathrm{Im}\{\Phi_{m,a}(t)\} D_{|\phi_\varepsilon|}(t) D_a(t) - F_{X,X^*} D_m \mathrm{Im}\{\Phi_{m,a}(t)\} D_{|\phi_\varepsilon|}(t). \qquad (A.3)$$

In order to replace the singular matrix $\mathrm{Im}\{\Phi_{m,a}(t)\}$ with the invertible $(\mathrm{Im}\{\Phi_{m,a}(t)\} + \Delta_1)$, we define

$$\Delta_{E[Y|X]} = \begin{pmatrix} E[Y|X=1] f_X(1) & 0 & \cdots & 0 \\ E[Y|X=2] f_X(2) & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ E[Y|X=J] f_X(J) & 0 & \cdots & 0 \end{pmatrix} = F_{X,X^*} D_m \Delta_1.$$

We then have

$$\mathrm{Re}\{\Phi'_{Y,X}(t)\} - \Delta_{E[Y|X]} = F_{X,X^*} \mathrm{Re}\{\Phi_{m,a}(t)\} D'_{|\phi_\varepsilon|}(t) - F_{X,X^*} \big(\mathrm{Im}\{\Phi_{m,a}(t)\} + \Delta_1\big) D_{|\phi_\varepsilon|}(t) D_a(t) - F_{X,X^*} D_m \big(\mathrm{Im}\{\Phi_{m,a}(t)\} + \Delta_1\big) D_{|\phi_\varepsilon|}(t), \qquad (A.4)$$

where we use $\Delta_1 D_{|\phi_\varepsilon|}(t) D_a(t) = 0$ and $\Delta_1 D_{|\phi_\varepsilon|}(t) = \Delta_1$. Similarly, we have

$$\mathrm{Im}\{\Phi'_{Y,X}(t)\} = F_{X,X^*} \mathrm{Im}\{\Phi_{m,a}(t)\} D'_{|\phi_\varepsilon|}(t) + F_{X,X^*} \mathrm{Re}\{\Phi_{m,a}(t)\} D_{|\phi_\varepsilon|}(t) D_a(t) + F_{X,X^*} D_m \mathrm{Re}\{\Phi_{m,a}(t)\} D_{|\phi_\varepsilon|}(t)$$
$$= F_{X,X^*} \big(\mathrm{Im}\{\Phi_{m,a}(t)\} + \Delta_1\big) D'_{|\phi_\varepsilon|}(t) + F_{X,X^*} \mathrm{Re}\{\Phi_{m,a}(t)\} D_{|\phi_\varepsilon|}(t) D_a(t) + F_{X,X^*} D_m \mathrm{Re}\{\Phi_{m,a}(t)\} D_{|\phi_\varepsilon|}(t),$$

where $\Delta_1 D'_{|\phi_\varepsilon|}(t) = 0$. Define $\Psi_{Y|X}(t) \equiv \Phi_{m,a}(t) D_{|\phi_\varepsilon|}(t)$. We then have

$$\mathrm{Re}\{\Psi_{Y|X}(t)\} = \mathrm{Re}\{\Phi_{m,a}(t)\} D_{|\phi_\varepsilon|}(t), \qquad \mathrm{Im}\{\Psi_{Y|X}(t)\} + \Delta_1 = \big(\mathrm{Im}\{\Phi_{m,a}(t)\} + \Delta_1\big) D_{|\phi_\varepsilon|}(t).$$

In summary, we have

$$\mathrm{Re}\{\Phi_{Y,X}(t)\} = F_{X,X^*} \mathrm{Re}\{\Psi_{Y|X}(t)\}, \qquad (A.5)$$
$$\big(\mathrm{Im}\{\Phi_{Y,X}(t)\} + \Delta_X\big) = F_{X,X^*} \big(\mathrm{Im}\{\Psi_{Y|X}(t)\} + \Delta_1\big), \qquad (A.6)$$
$$\mathrm{Re}\{\Phi'_{Y,X}(t)\} - \Delta_{E[Y|X]} = F_{X,X^*} \mathrm{Re}\{\Phi_{m,a}(t)\} D'_{|\phi_\varepsilon|}(t) - F_{X,X^*} \big(\mathrm{Im}\{\Psi_{Y|X}(t)\} + \Delta_1\big) D_a(t) - F_{X,X^*} D_m \big(\mathrm{Im}\{\Psi_{Y|X}(t)\} + \Delta_1\big), \qquad (A.7)$$
$$\mathrm{Im}\{\Phi'_{Y,X}(t)\} = F_{X,X^*} \big(\mathrm{Im}\{\Phi_{m,a}(t)\} + \Delta_1\big) D'_{|\phi_\varepsilon|}(t) + F_{X,X^*} \mathrm{Re}\{\Psi_{Y|X}(t)\} D_a(t) + F_{X,X^*} D_m \mathrm{Re}\{\Psi_{Y|X}(t)\}. \qquad (A.8)$$

The left-hand sides of these equations are all observed, while the right-hand sides contain all the unknowns. Assumption 2.3(i) also implies that $F_{X,X^*}$, $\mathrm{Re}\{\Psi_{Y|X}(t)\}$, and $(\mathrm{Im}\{\Psi_{Y|X}(t)\} + \Delta_1)$ are invertible in equations (2.5) and (2.7). Recall the definition of the observed matrix $C_t$, which by equations (A.5) and (A.6) equals

$$C_t \equiv \big(\mathrm{Re}\,\Phi_{Y,X}(t)\big)^{-1} \big(\mathrm{Im}\,\Phi_{Y,X}(t) + \Delta_X\big) = \big(\mathrm{Re}\,\Psi_{Y|X}(t)\big)^{-1} \big(\mathrm{Im}\,\Psi_{Y|X}(t) + \Delta_1\big).$$

Denote $A_t \equiv \big(\mathrm{Re}\,\Psi_{Y|X}(t)\big)^{-1} D_m\, \mathrm{Re}\,\Psi_{Y|X}(t)$. With equations (A.5) and (A.7), we consider

$$B_R \equiv \big(\mathrm{Re}\,\Phi_{Y,X}(t)\big)^{-1} \big(\mathrm{Re}\{\Phi'_{Y,X}(t)\} - \Delta_{E[Y|X]}\big) = \big(\mathrm{Re}\,\Psi_{Y|X}(t)\big)^{-1} \Big[ \mathrm{Re}\,\Phi_{m,a}(t)\, D'_{|\phi_\varepsilon|}(t) - \big(\mathrm{Im}\,\Psi_{Y|X}(t) + \Delta_1\big) D_a(t) - D_m \big(\mathrm{Im}\,\Psi_{Y|X}(t) + \Delta_1\big) \Big]$$
$$= \big[D_{|\phi_\varepsilon|}(t)\big]^{-1} D'_{|\phi_\varepsilon|}(t) - C_t D_a(t) - A_t C_t = D_{\ln|\phi_\varepsilon|}(t) - C_t D_a(t) - A_t C_t. \qquad (A.9)$$

Similarly, we have by equations (A.6) and (A.8)

$$B_I \equiv \big(\mathrm{Im}\,\Phi_{Y,X}(t) + \Delta_X\big)^{-1} \mathrm{Im}\{\Phi'_{Y,X}(t)\} = D_{\ln|\phi_\varepsilon|}(t) + C_t^{-1} D_a(t) + C_t^{-1} A_t. \qquad (A.10)$$

We eliminate the matrix $A_t$ from equations (A.9) and (A.10) to obtain

$$B_R + C_t B_I C_t = D_{\ln|\phi_\varepsilon|}(t) + C_t D_{\ln|\phi_\varepsilon|}(t) C_t + D_a(t) C_t - C_t D_a(t). \qquad (A.11)$$

Notice that both $D_{\ln|\phi_\varepsilon|}(t)$ and $D_a(t)$ are diagonal; Assumption 2.3(ii) implies that $D_{\ln|\phi_\varepsilon|}(t)$ and $D_a(t)$ are uniquely identified from equation (A.11), since the difference between any two solutions would satisfy the homogeneous version of this equation. Further, since the diagonal entries of $(D_a(t) C_t - C_t D_a(t))$ are zeros, we have

$$\mathrm{diag}\big(B_R + C_t B_I C_t\big) = \mathrm{diag}\big(D_{\ln|\phi_\varepsilon|}(t)\big) + \big(C_t \circ C_t^T\big)\, \mathrm{diag}\big(D_{\ln|\phi_\varepsilon|}(t)\big) = \big(C_t \circ C_t^T + I\big)\, \mathrm{diag}\big(D_{\ln|\phi_\varepsilon|}(t)\big),$$

where $\mathrm{diag}(\cdot)$ generates a vector of the diagonal entries of its argument and "$\circ$" stands for the Hadamard (element-wise) product. By assumption

2.5(i), we may solve for $D_{\ln|\phi_\varepsilon|}(t)$ as follows:

$$\mathrm{diag}\big(D_{\ln|\phi_\varepsilon|}(t)\big) = \big(C_t \circ C_t^T + I\big)^{-1} \mathrm{diag}\big(B_R + C_t B_I C_t\big). \qquad (A.12)$$

Furthermore, equation (A.11) implies that

$$U \equiv B_R + C_t B_I C_t - D_{\ln|\phi_\varepsilon|}(t) - C_t D_{\ln|\phi_\varepsilon|}(t) C_t = D_a(t) C_t - C_t D_a(t). \qquad (A.13)$$

Define the $J \times 1$ vector $e_1 = (1, 0, 0, \ldots, 0)^T$. The definition of $D_a(t)$ implies that $e_1^T D_a(t) = 0$. Therefore, equation (A.13) implies

$$e_1^T U = -e_1^T C_t D_a(t).$$

Assumption 2.5(ii) implies that all the entries in the row vector $e_1^T C_t$ are nonzero. Let $e_1^T C_t \equiv (c_1, c_2, \ldots, c_J)$. The vector $\mathrm{diag}(D_a(t))$ is then uniquely determined as

$$\mathrm{diag}\big(D_a(t)\big) = -\big[\mathrm{Diag}(c_1, c_2, \ldots, c_J)\big]^{-1} U^T e_1.$$

After $D_{\ln|\phi_\varepsilon|}(t)$ and $D_a(t)$ are identified, we may then identify the matrix $A_t \equiv (\mathrm{Re}\,\Psi_{Y|X}(t))^{-1} D_m\, \mathrm{Re}\,\Psi_{Y|X}(t)$ from equation (A.10):

$$A_t = C_t \big[B_I - D_{\ln|\phi_\varepsilon|}(t)\big] - D_a(t).$$

Notice that

$$\mathrm{Re}\,\Psi_{Y|X}(t) = (F_{X,X^*})^{-1}\, \mathrm{Re}\,\Phi_{Y,X}(t) = \Delta_{f_{X^*}}^{-1} F_{X|X^*}^{-1}\, \mathrm{Re}\,\Phi_{Y,X}(t),$$

where $F_{X|X^*}$ is as defined in Section 2 and $\Delta_{f_{X^*}} = \mathrm{Diag}\big(f_{X^*}(1), \ldots, f_{X^*}(J)\big)$, so that $F_{X,X^*} = F_{X|X^*} \Delta_{f_{X^*}}$. Therefore, since diagonal matrices commute, we have

$$\big(\mathrm{Re}\,\Phi_{Y,X}(t)\big) A_t \big(\mathrm{Re}\,\Phi_{Y,X}(t)\big)^{-1} = F_{X|X^*} \Delta_{f_{X^*}} D_m \Delta_{f_{X^*}}^{-1} F_{X|X^*}^{-1} = F_{X|X^*} D_m F_{X|X^*}^{-1}. \qquad (A.14)$$
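Steps (A.12) and (A.13) amount to solving two small linear systems. A numerical sketch, assuming estimates of $B_R$, $B_I$, and $C_t$ have already been formed from the sample:

```python
import numpy as np

def solve_diagonal_system(BR, BI, C):
    """Steps (A.12)-(A.13): recover diag(D_ln|phi|) and diag(D_a) from B_R, B_I, C_t."""
    J = C.shape[0]
    M = BR + C @ BI @ C
    # (A.12): Hadamard-product linear system for the log-modulus derivatives
    d_ln = np.linalg.solve(C * C.T + np.eye(J), np.diag(M))
    # (A.13): U = D_a C - C D_a; its first row pins down diag(D_a)
    U = M - np.diag(d_ln) - C @ np.diag(d_ln) @ C
    d_a = -U[0, :] / C[0, :]        # entries of e_1^T C_t must be nonzero (Assumption 2.5(ii))
    d_a[0] = 0.0                    # normalization at t_1 = 0
    return d_ln, d_a
```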

Equation (A.14) implies that the unknowns $m_j$ in the matrix $D_m$ are the eigenvalues of a directly estimatable matrix on the left-hand side, and each column of the matrix $F_{X|X^*}$ is an eigenvector. Assumption 2.4 guarantees that all the eigenvalues are distinct and nonzero in the diagonalization in equation (A.14). We may then identify the $m_j$ as the roots of

$$\det(A_t - m_j I) = 0.$$

To be specific, $m_j$ may be identified as the $j$-th smallest root. Equation (A.14) also implies that the $j$-th column of the matrix $F_{X|X^*}$ is the eigenvector corresponding to the eigenvalue $m_j$. Notice that each eigenvector is already normalized, because each column of $F_{X|X^*}$ is a conditional density and the sum of the entries in each column equals one. Therefore, each column of $F_{X|X^*}$ is identified as the normalized eigenvector corresponding to each eigenvalue $m_j$. Finally, we may identify $f_{Y,X^*}$ through equation (2.1) as follows, for any $y \in \mathcal{Y}$:

$$\big(f_{Y,X^*}(y, 1),\ f_{Y,X^*}(y, 2),\ \ldots,\ f_{Y,X^*}(y, J)\big)^T = F_{X|X^*}^{-1} \big(f_{Y,X}(y, 1),\ f_{Y,X}(y, 2),\ \ldots,\ f_{Y,X}(y, J)\big)^T. \qquad \text{Q.E.D.}$$

Proof (Theorem 2.3). First, we introduce notation as follows: for $j = 0, 1$,

$$m_j = m(j), \quad \mu_j = E(Y|X=j), \quad v_j = E\big[(Y - \mu_j)^2 \mid X = j\big], \quad s_j = E\big[(Y - \mu_j)^3 \mid X = j\big],$$
$$p = f_{X^*|X}(0|1), \quad q = f_{X^*|X}(1|0), \quad f_{Y|X=j}(y) = f_{Y|X}(y|j).$$

We start the proof with equation (2.9), which, using the notation above, is equivalent to

$$\begin{pmatrix} f_{Y|X=0}(y) \\ f_{Y|X=1}(y) \end{pmatrix} = \begin{pmatrix} 1-q & q \\ p & 1-p \end{pmatrix} \begin{pmatrix} f_{\varepsilon|X^*=0}(y - m_0) \\ f_{\varepsilon|X^*=1}(y - m_1) \end{pmatrix}. \qquad (A.15)$$

Since $E[\varepsilon|X^*] = 0$, we have $\mu_0 = (1-q) m_0 + q m_1$ and $\mu_1 = p m_0 + (1-p) m_1$. We may solve for $p$ and $q$ as follows:

$$p = \frac{m_1 - \mu_1}{m_1 - m_0} \qquad \text{and} \qquad q = \frac{\mu_0 - m_0}{m_1 - m_0}. \qquad (A.16)$$

We also have

$$1 - p - q = \frac{\mu_1 - \mu_0}{m_1 - m_0}.$$

Assumption 2.9 implies that $1 - p - q > 0$ and $m_1 > m_0$. Inverting (A.15) gives

$$\begin{pmatrix} f_{\varepsilon|X^*=0}(y - m_0) \\ f_{\varepsilon|X^*=1}(y - m_1) \end{pmatrix} = \frac{1}{1-p-q} \begin{pmatrix} 1-p & -q \\ -p & 1-q \end{pmatrix} \begin{pmatrix} f_{Y|X=0}(y) \\ f_{Y|X=1}(y) \end{pmatrix}.$$

Plugging in the expressions for $p$ and $q$ from equation (A.16), we have

$$\frac{1-p}{1-p-q} = \frac{\mu_1 - m_0}{\mu_1 - \mu_0}, \quad \frac{q}{1-p-q} = \frac{\mu_0 - m_0}{\mu_1 - \mu_0}, \quad \frac{p}{1-p-q} = \frac{m_1 - \mu_1}{\mu_1 - \mu_0}, \quad \frac{1-q}{1-p-q} = \frac{m_1 - \mu_0}{\mu_1 - \mu_0}.$$

In other words, we have, for $j = 0, 1$,

$$f_{\varepsilon|X^*=j}(y) = \frac{\mu_1 - m_j}{\mu_1 - \mu_0}\, f_{Y|X=0}(y + m_j) + \frac{m_j - \mu_0}{\mu_1 - \mu_0}\, f_{Y|X=1}(y + m_j). \qquad (A.17)$$

In summary, $f_{X^*|X}$ (or $p$ and $q$) and $f_{\varepsilon|X^*}$ are identified if we can identify $m_0$ and $m_1$. Next, we show that $m_0$ and $m_1$ are indeed identified. By assumption 2.10, we have $E(\varepsilon^k|X^*) = E(\varepsilon^k)$ for $k = 2, 3$. For $k = 2$, we consider

$$v_0 = E\big[(m(X^*) - \mu_0)^2 \mid X = 0\big] + E(\varepsilon^2) = (1-q) m_0^2 + q m_1^2 - \mu_0^2 + E(\varepsilon^2).$$

Similarly, we have

$$v_1 = p m_0^2 + (1-p) m_1^2 - \mu_1^2 + E(\varepsilon^2).$$

We eliminate $E(\varepsilon^2)$ to obtain

$$(v_1 + \mu_1^2) - (v_0 + \mu_0^2) = (1 - p - q)(m_1^2 - m_0^2).$$

We have shown that $1 - p - q = (\mu_1 - \mu_0)/(m_1 - m_0)$. Thus $m_0$ and $m_1$ satisfy the linear equation

$$m_0 + m_1 = \frac{(v_1 + \mu_1^2) - (v_0 + \mu_0^2)}{\mu_1 - \mu_0} = C_1.$$

This means we need one more restriction to identify $m_0$ and $m_1$. We consider

$$s_0 = E\big[(Y - \mu_0)^3 \mid X = 0\big] = E\big[(m(X^*) - \mu_0)^3 \mid X = 0\big] + E(\varepsilon^3) = (1-q)(m_0 - \mu_0)^3 + q(m_1 - \mu_0)^3 + E(\varepsilon^3)$$

and

$$s_1 = p(m_0 - \mu_1)^3 + (1-p)(m_1 - \mu_1)^3 + E(\varepsilon^3).$$

We eliminate $E(\varepsilon^3)$ from the two equations above, and plug in the expressions for $p$ and $q$ from equation (A.16); the two cubic terms then factor, giving

$$s_1 - s_0 = (m_1 - \mu_1)(\mu_1 - m_0)(m_0 + m_1 - 2\mu_1) - (m_1 - \mu_0)(\mu_0 - m_0)(m_0 + m_1 - 2\mu_0).$$

Using $m_0 + m_1 = C_1$ and $(m_1 - \mu_j)(\mu_j - m_0) = C_1 \mu_j - \mu_j^2 - m_0 m_1$ for $j = 0, 1$, and collecting terms, we obtain

$$s_1 - s_0 = C_1^2(\mu_1 - \mu_0) - 3C_1(\mu_1^2 - \mu_0^2) + 2(\mu_1^3 - \mu_0^3) + 2 m_0 m_1 (\mu_1 - \mu_0),$$

that is, $m_0 m_1 = C_2$, with $C_2$ as defined in Section 2.

Therefore $m_0$ and $m_1$ satisfy

$$m^2 - C_1 m + C_2 = 0,$$

which implies that $m_0$ and $m_1$ are the two roots of this quadratic equation, with $m_1 - m_0 = \sqrt{C_1^2 - 4C_2}$. Since $m_1 > m_0$, we have

$$m_0 = \frac{C_1}{2} - \sqrt{\frac{C_1^2}{4} - C_2}, \qquad m_1 = \frac{C_1}{2} + \sqrt{\frac{C_1^2}{4} - C_2}.$$

After we have identified $m_0$ and $m_1$, $p$ and $q$ (or $f_{X^*|X}$) are identified from equation (A.16), and the density $f_{\varepsilon|X^*}$ (or $f_{Y|X^*}$) is also identified from equation (A.17). Thus, we have identified the latent densities $f_{Y|X^*}$ and $f_{X^*|X}$ from the observed density $f_{Y|X}$. Q.E.D.

REFERENCES

Ai, C. and Chen, X. (2003), "Efficient Estimation of Models with Conditional Moment Restrictions Containing Unknown Functions," Econometrica, 71, 1795-1843.

Bickel, P. J. and Ritov, Y. (1987), "Efficient Estimation in the Errors in Variables Model," Annals of Statistics, 15, 513-540.

Bordes, L., Mottelet, S. and Vandekerkhove, P. (2006), "Semiparametric Estimation of a Two-Component Mixture Model," Annals of Statistics, 34, 1204-1232.

Bound, J., Brown, C. and Mathiowetz, N. (2001), "Measurement Error in Survey Data," in Handbook of Econometrics, Vol. 5, ed. by J. Heckman and E. Leamer, North Holland.

Carroll, R. J., Ruppert, D., Crainiceanu, C., Tosteson, T. and Karagas, M. (2004), "Nonlinear and Nonparametric Regression and Instrumental Variables," Journal of the American Statistical Association, 99(467), 736-750.

Carroll, R. J., Ruppert, D., Stefanski, L. and Crainiceanu, C. (2006), Measurement Error in Nonlinear Models: A Modern Perspective, Second Edition, Chapman & Hall/CRC.

Carroll, R. J. and Stefanski, L. A. (1990), "Approximate Quasi-likelihood Estimation in Models with Surrogate Predictors," Journal of the American Statistical Association, 85, 652-663.

Chen, X. (2006), "Large Sample Sieve Estimation of Semi-Nonparametric Models," in J. J. Heckman and E. E. Leamer (eds.), The Handbook of Econometrics, Vol. 6, North-Holland, Amsterdam, forthcoming.

Chen, X., Hong, H. and Nekipelov, D. (2007), "Measurement Error Models," working paper, New York University and Stanford University; survey prepared for the Journal of Economic Literature.

Chen, X., Hong, H. and Tamer, E. (2005), "Measurement Error Models with Auxiliary Data," Review of Economic Studies, 72, 343-366.

Chen, X., Hong, H. and Tarozzi, A. (2007), "Semiparametric Efficiency in GMM Models with Auxiliary Data," Annals of Statistics, forthcoming.

Cheng, C. L. and Van Ness, J. W. (1999), Statistical Regression with Measurement Error, Arnold, London.

Fuller, W. (1987), Measurement Error Models, New York: John Wiley & Sons.

Geman, S. and Hwang, C. (1982), "Nonparametric Maximum Likelihood Estimation by the Method of Sieves," The Annals of Statistics, 10, 401-414.

Grenander, U. (1981), Abstract Inference, New York: Wiley.

Hansen, L. P. (1982), "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 50, 1029-1054.


More information

The Asymptotic Variance of Semi-parametric Estimators with Generated Regressors

The Asymptotic Variance of Semi-parametric Estimators with Generated Regressors The Asymptotic Variance of Semi-parametric stimators with Generated Regressors Jinyong Hahn Department of conomics, UCLA Geert Ridder Department of conomics, USC October 7, 00 Abstract We study the asymptotic

More information

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria SOLUTION TO FINAL EXAM Friday, April 12, 2013. From 9:00-12:00 (3 hours) INSTRUCTIONS:

More information

Chapter 1. GMM: Basic Concepts

Chapter 1. GMM: Basic Concepts Chapter 1. GMM: Basic Concepts Contents 1 Motivating Examples 1 1.1 Instrumental variable estimator....................... 1 1.2 Estimating parameters in monetary policy rules.............. 2 1.3 Estimating

More information

The Econometrics of Unobservables: Applications of Measurement Error Models in Empirical Industrial Organization and Labor Economics

The Econometrics of Unobservables: Applications of Measurement Error Models in Empirical Industrial Organization and Labor Economics The Econometrics of Unobservables: Applications of Measurement Error Models in Empirical Industrial Organization and Labor Economics Yingyao Hu Johns Hopkins University July 12, 2017 Abstract This paper

More information

Parametric Inference on Strong Dependence

Parametric Inference on Strong Dependence Parametric Inference on Strong Dependence Peter M. Robinson London School of Economics Based on joint work with Javier Hualde: Javier Hualde and Peter M. Robinson: Gaussian Pseudo-Maximum Likelihood Estimation

More information

Cross-fitting and fast remainder rates for semiparametric estimation

Cross-fitting and fast remainder rates for semiparametric estimation Cross-fitting and fast remainder rates for semiparametric estimation Whitney K. Newey James M. Robins The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP41/17 Cross-Fitting

More information

GMM estimation of spatial panels

GMM estimation of spatial panels MRA Munich ersonal ReEc Archive GMM estimation of spatial panels Francesco Moscone and Elisa Tosetti Brunel University 7. April 009 Online at http://mpra.ub.uni-muenchen.de/637/ MRA aper No. 637, posted

More information

11. Bootstrap Methods

11. Bootstrap Methods 11. Bootstrap Methods c A. Colin Cameron & Pravin K. Trivedi 2006 These transparencies were prepared in 20043. They can be used as an adjunct to Chapter 11 of our subsequent book Microeconometrics: Methods

More information

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley Time Series Models and Inference James L. Powell Department of Economics University of California, Berkeley Overview In contrast to the classical linear regression model, in which the components of the

More information

Quantile Processes for Semi and Nonparametric Regression

Quantile Processes for Semi and Nonparametric Regression Quantile Processes for Semi and Nonparametric Regression Shih-Kang Chao Department of Statistics Purdue University IMS-APRM 2016 A joint work with Stanislav Volgushev and Guang Cheng Quantile Response

More information

The Influence Function of Semiparametric Estimators

The Influence Function of Semiparametric Estimators The Influence Function of Semiparametric Estimators Hidehiko Ichimura University of Tokyo Whitney K. Newey MIT July 2015 Revised January 2017 Abstract There are many economic parameters that depend on

More information

Efficiency of Profile/Partial Likelihood in the Cox Model

Efficiency of Profile/Partial Likelihood in the Cox Model Efficiency of Profile/Partial Likelihood in the Cox Model Yuichi Hirose School of Mathematics, Statistics and Operations Research, Victoria University of Wellington, New Zealand Summary. This paper shows

More information

Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix

Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix Yingying Dong and Arthur Lewbel California State University Fullerton and Boston College July 2010 Abstract

More information

Online publication date: 30 April 2010 PLEASE SCROLL DOWN FOR ARTICLE

Online publication date: 30 April 2010 PLEASE SCROLL DOWN FOR ARTICLE This article was downloaded by: [JHU John Hopkins University] On: 30 April 2010 Access details: Access Details: [subscription number 917268152] Publisher Taylor & Francis Informa Ltd Registered in England

More information

SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER. Donald W. K. Andrews. August 2011 Revised March 2012

SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER. Donald W. K. Andrews. August 2011 Revised March 2012 SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER By Donald W. K. Andrews August 2011 Revised March 2012 COWLES FOUNDATION DISCUSSION PAPER NO. 1815R COWLES FOUNDATION FOR

More information

Estimation and Inference with Weak Identi cation

Estimation and Inference with Weak Identi cation Estimation and Inference with Weak Identi cation Donald W. K. Andrews Cowles Foundation Yale University Xu Cheng Department of Economics University of Pennsylvania First Draft: August, 2007 Revised: March

More information

Chapter 2. Dynamic panel data models

Chapter 2. Dynamic panel data models Chapter 2. Dynamic panel data models School of Economics and Management - University of Geneva Christophe Hurlin, Université of Orléans University of Orléans April 2018 C. Hurlin (University of Orléans)

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

A note on L convergence of Neumann series approximation in missing data problems

A note on L convergence of Neumann series approximation in missing data problems A note on L convergence of Neumann series approximation in missing data problems Hua Yun Chen Division of Epidemiology & Biostatistics School of Public Health University of Illinois at Chicago 1603 West

More information

Notes on Asymptotic Theory: Convergence in Probability and Distribution Introduction to Econometric Theory Econ. 770

Notes on Asymptotic Theory: Convergence in Probability and Distribution Introduction to Econometric Theory Econ. 770 Notes on Asymptotic Theory: Convergence in Probability and Distribution Introduction to Econometric Theory Econ. 770 Jonathan B. Hill Dept. of Economics University of North Carolina - Chapel Hill November

More information

SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS

SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS JONG-MYUN MOON Abstract. This paper studies transformation models T (Y ) = X + " with an unknown monotone transformation T. Our focus is on the identi

More information

Economics 620, Lecture 19: Introduction to Nonparametric and Semiparametric Estimation

Economics 620, Lecture 19: Introduction to Nonparametric and Semiparametric Estimation Economics 620, Lecture 19: Introduction to Nonparametric and Semiparametric Estimation Nicholas M. Kiefer Cornell University Professor N. M. Kiefer (Cornell University) Lecture 19: Nonparametric Analysis

More information

Robust Con dence Intervals in Nonlinear Regression under Weak Identi cation

Robust Con dence Intervals in Nonlinear Regression under Weak Identi cation Robust Con dence Intervals in Nonlinear Regression under Weak Identi cation Xu Cheng y Department of Economics Yale University First Draft: August, 27 This Version: December 28 Abstract In this paper,

More information

Speci cation of Conditional Expectation Functions

Speci cation of Conditional Expectation Functions Speci cation of Conditional Expectation Functions Econometrics Douglas G. Steigerwald UC Santa Barbara D. Steigerwald (UCSB) Specifying Expectation Functions 1 / 24 Overview Reference: B. Hansen Econometrics

More information

Nonparametric Estimation of Wages and Labor Force Participation

Nonparametric Estimation of Wages and Labor Force Participation Nonparametric Estimation of Wages Labor Force Participation John Pepper University of Virginia Steven Stern University of Virginia May 5, 000 Preliminary Draft - Comments Welcome Abstract. Model Let y

More information

Introduction to Nonparametric and Semiparametric Estimation. Good when there are lots of data and very little prior information on functional form.

Introduction to Nonparametric and Semiparametric Estimation. Good when there are lots of data and very little prior information on functional form. 1 Introduction to Nonparametric and Semiparametric Estimation Good when there are lots of data and very little prior information on functional form. Examples: y = f(x) + " (nonparametric) y = z 0 + f(x)

More information

Minimax Estimation of a nonlinear functional on a structured high-dimensional model

Minimax Estimation of a nonlinear functional on a structured high-dimensional model Minimax Estimation of a nonlinear functional on a structured high-dimensional model Eric Tchetgen Tchetgen Professor of Biostatistics and Epidemiologic Methods, Harvard U. (Minimax ) 1 / 38 Outline Heuristics

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Bayesian Modeling of Conditional Distributions

Bayesian Modeling of Conditional Distributions Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings

More information

Supplemental Material 1 for On Optimal Inference in the Linear IV Model

Supplemental Material 1 for On Optimal Inference in the Linear IV Model Supplemental Material 1 for On Optimal Inference in the Linear IV Model Donald W. K. Andrews Cowles Foundation for Research in Economics Yale University Vadim Marmer Vancouver School of Economics University

More information

ECONOMETRICS FIELD EXAM Michigan State University May 9, 2008

ECONOMETRICS FIELD EXAM Michigan State University May 9, 2008 ECONOMETRICS FIELD EXAM Michigan State University May 9, 2008 Instructions: Answer all four (4) questions. Point totals for each question are given in parenthesis; there are 00 points possible. Within

More information

SIEVE INFERENCE ON SEMI-NONPARAMETRIC TIME SERIES MODELS. Xiaohong Chen, Zhipeng Liao and Yixiao Sun. February 2012

SIEVE INFERENCE ON SEMI-NONPARAMETRIC TIME SERIES MODELS. Xiaohong Chen, Zhipeng Liao and Yixiao Sun. February 2012 SIEVE INFERENCE ON SEMI-NONPARAMERIC IME SERIES MODELS By Xiaohong Chen, Zhipeng Liao and Yixiao Sun February COWLES FOUNDAION DISCUSSION PAPER NO. 849 COWLES FOUNDAION FOR RESEARCH IN ECONOMICS YALE UNIVERSIY

More information

NUCLEAR NORM PENALIZED ESTIMATION OF INTERACTIVE FIXED EFFECT MODELS. Incomplete and Work in Progress. 1. Introduction

NUCLEAR NORM PENALIZED ESTIMATION OF INTERACTIVE FIXED EFFECT MODELS. Incomplete and Work in Progress. 1. Introduction NUCLEAR NORM PENALIZED ESTIMATION OF IERACTIVE FIXED EFFECT MODELS HYUNGSIK ROGER MOON AND MARTIN WEIDNER Incomplete and Work in Progress. Introduction Interactive fixed effects panel regression models

More information

Notes on Generalized Method of Moments Estimation

Notes on Generalized Method of Moments Estimation Notes on Generalized Method of Moments Estimation c Bronwyn H. Hall March 1996 (revised February 1999) 1. Introduction These notes are a non-technical introduction to the method of estimation popularized

More information

Lecture Notes on Measurement Error

Lecture Notes on Measurement Error Steve Pischke Spring 2000 Lecture Notes on Measurement Error These notes summarize a variety of simple results on measurement error which I nd useful. They also provide some references where more complete

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han and Robert de Jong January 28, 2002 Abstract This paper considers Closest Moment (CM) estimation with a general distance function, and avoids

More information

GMM Estimation of SAR Models with. Endogenous Regressors

GMM Estimation of SAR Models with. Endogenous Regressors GMM Estimation of SAR Models with Endogenous Regressors Xiaodong Liu Department of Economics, University of Colorado Boulder E-mail: xiaodongliucoloradoedu Paulo Saraiva Department of Economics, University

More information

Informational Content of Special Regressors in Heteroskedastic Binary Response Models 1

Informational Content of Special Regressors in Heteroskedastic Binary Response Models 1 Informational Content of Special Regressors in Heteroskedastic Binary Response Models 1 Songnian Chen, Shakeeb Khan 3 and Xun Tang 4 This Version: April 3, 013 We quantify the informational content of

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han Victoria University of Wellington New Zealand Robert de Jong Ohio State University U.S.A October, 2003 Abstract This paper considers Closest

More information

UNIVERSITY OF CALIFORNIA Spring Economics 241A Econometrics

UNIVERSITY OF CALIFORNIA Spring Economics 241A Econometrics DEPARTMENT OF ECONOMICS R. Smith, J. Powell UNIVERSITY OF CALIFORNIA Spring 2006 Economics 241A Econometrics This course will cover nonlinear statistical models for the analysis of cross-sectional and

More information

ESTIMATION OF NONPARAMETRIC MODELS WITH SIMULTANEITY

ESTIMATION OF NONPARAMETRIC MODELS WITH SIMULTANEITY ESTIMATION OF NONPARAMETRIC MODELS WITH SIMULTANEITY Rosa L. Matzkin Department of Economics University of California, Los Angeles First version: May 200 This version: August 204 Abstract We introduce

More information

Closed-Form Estimation of Nonparametric Models. with Non-Classical Measurement Errors

Closed-Form Estimation of Nonparametric Models. with Non-Classical Measurement Errors Closed-Form Estimation of Nonparametric Models with Non-Classical Measurement Errors Yingyao Hu Johns Hopkins Yuya Sasaki Johns Hopkins November 30, 04 Abstract This paper proposes closed-form estimators

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Identification and Estimation of Regression Models with Misclassification

Identification and Estimation of Regression Models with Misclassification Identification and Estimation of Regression Models with Misclassification Aprajit Mahajan 1 First Version: October 1, 2002 This Version: December 1, 2005 1 I would like to thank Han Hong, Bo Honoré and

More information

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy Computation Of Asymptotic Distribution For Semiparametric GMM Estimators Hidehiko Ichimura Graduate School of Public Policy and Graduate School of Economics University of Tokyo A Conference in honor of

More information

Tests for Cointegration, Cobreaking and Cotrending in a System of Trending Variables

Tests for Cointegration, Cobreaking and Cotrending in a System of Trending Variables Tests for Cointegration, Cobreaking and Cotrending in a System of Trending Variables Josep Lluís Carrion-i-Silvestre University of Barcelona Dukpa Kim y Korea University May 4, 28 Abstract We consider

More information

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data Song Xi CHEN Guanghua School of Management and Center for Statistical Science, Peking University Department

More information

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. Linear-in-Parameters Models: IV versus Control Functions 2. Correlated

More information

Random Utility Models, Attention Sets and Status Quo Bias

Random Utility Models, Attention Sets and Status Quo Bias Random Utility Models, Attention Sets and Status Quo Bias Arie Beresteanu and Roee Teper y February, 2012 Abstract We develop a set of practical methods to understand the behavior of individuals when attention

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

Lecture 8: Basic convex analysis

Lecture 8: Basic convex analysis Lecture 8: Basic convex analysis 1 Convex sets Both convex sets and functions have general importance in economic theory, not only in optimization. Given two points x; y 2 R n and 2 [0; 1]; the weighted

More information

Testing Overidentifying Restrictions with Many Instruments and Heteroskedasticity

Testing Overidentifying Restrictions with Many Instruments and Heteroskedasticity Testing Overidentifying Restrictions with Many Instruments and Heteroskedasticity John C. Chao, Department of Economics, University of Maryland, chao@econ.umd.edu. Jerry A. Hausman, Department of Economics,

More information

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Arthur Lewbel Boston College December 2016 Abstract Lewbel (2012) provides an estimator

More information

Instrumental Variable Treatment of Nonclassical Measurement Error Models

Instrumental Variable Treatment of Nonclassical Measurement Error Models Instrumental Variable Treatment of Nonclassical Measurement Error Models Yingyao Hu Department of Economics The University of Texas at Austin 1 University Station C3100 BRB 1.116 Austin, TX 78712 hu@eco.utexas.edu

More information

Notes on Mathematical Expectations and Classes of Distributions Introduction to Econometric Theory Econ. 770

Notes on Mathematical Expectations and Classes of Distributions Introduction to Econometric Theory Econ. 770 Notes on Mathematical Expectations and Classes of Distributions Introduction to Econometric Theory Econ. 77 Jonathan B. Hill Dept. of Economics University of North Carolina - Chapel Hill October 4, 2 MATHEMATICAL

More information

Made available courtesy of Elsevier:

Made available courtesy of Elsevier: Binary Misclassification and Identification in Regression Models By: Martijn van Hasselt, Christopher R. Bollinger Van Hasselt, M., & Bollinger, C. R. (2012). Binary misclassification and identification

More information

Generated Covariates in Nonparametric Estimation: A Short Review.

Generated Covariates in Nonparametric Estimation: A Short Review. Generated Covariates in Nonparametric Estimation: A Short Review. Enno Mammen, Christoph Rothe, and Melanie Schienle Abstract In many applications, covariates are not observed but have to be estimated

More information

Conditions for the Existence of Control Functions in Nonseparable Simultaneous Equations Models 1

Conditions for the Existence of Control Functions in Nonseparable Simultaneous Equations Models 1 Conditions for the Existence of Control Functions in Nonseparable Simultaneous Equations Models 1 Richard Blundell UCL and IFS and Rosa L. Matzkin UCLA First version: March 2008 This version: October 2010

More information

Identi cation and Estimation of Semiparametric Two Step Models

Identi cation and Estimation of Semiparametric Two Step Models Identi cation and Estimation of Semiparametric Two Step Models Juan Carlos Escanciano y Indiana University David Jacho-Chávez z Indiana University Arthur Lewbel x Boston College original May 2010, revised

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Identi cation of Positive Treatment E ects in. Randomized Experiments with Non-Compliance

Identi cation of Positive Treatment E ects in. Randomized Experiments with Non-Compliance Identi cation of Positive Treatment E ects in Randomized Experiments with Non-Compliance Aleksey Tetenov y February 18, 2012 Abstract I derive sharp nonparametric lower bounds on some parameters of the

More information

Non-parametric Identi cation and Testable Implications of the Roy Model

Non-parametric Identi cation and Testable Implications of the Roy Model Non-parametric Identi cation and Testable Implications of the Roy Model Francisco J. Buera Northwestern University January 26 Abstract This paper studies non-parametric identi cation and the testable implications

More information

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2016 Instructor: Victor Aguirregabiria

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2016 Instructor: Victor Aguirregabiria ECOOMETRICS II (ECO 24S) University of Toronto. Department of Economics. Winter 26 Instructor: Victor Aguirregabiria FIAL EAM. Thursday, April 4, 26. From 9:am-2:pm (3 hours) ISTRUCTIOS: - This is a closed-book

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

Measuring robustness

Measuring robustness Measuring robustness 1 Introduction While in the classical approach to statistics one aims at estimates which have desirable properties at an exactly speci ed model, the aim of robust methods is loosely

More information

Estimating Semi-parametric Panel Multinomial Choice Models

Estimating Semi-parametric Panel Multinomial Choice Models Estimating Semi-parametric Panel Multinomial Choice Models Xiaoxia Shi, Matthew Shum, Wei Song UW-Madison, Caltech, UW-Madison September 15, 2016 1 / 31 Introduction We consider the panel multinomial choice

More information

Parametric identification of multiplicative exponential heteroskedasticity ALYSSA CARLSON

Parametric identification of multiplicative exponential heteroskedasticity ALYSSA CARLSON Parametric identification of multiplicative exponential heteroskedasticity ALYSSA CARLSON Department of Economics, Michigan State University East Lansing, MI 48824-1038, United States (email: carls405@msu.edu)

More information

Nonparametric Identification and Semiparametric Estimation of Classical Measurement Error Models Without Side Information

Nonparametric Identification and Semiparametric Estimation of Classical Measurement Error Models Without Side Information Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/jasa Nonparametric Identification and Semiparametric Estimation of Classical Measurement Error Models Without

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

We begin by thinking about population relationships.

We begin by thinking about population relationships. Conditional Expectation Function (CEF) We begin by thinking about population relationships. CEF Decomposition Theorem: Given some outcome Y i and some covariates X i there is always a decomposition where

More information

Pseudo panels and repeated cross-sections

Pseudo panels and repeated cross-sections Pseudo panels and repeated cross-sections Marno Verbeek November 12, 2007 Abstract In many countries there is a lack of genuine panel data where speci c individuals or rms are followed over time. However,

More information

Sieve Inference on Semi-nonparametric Time Series Models

Sieve Inference on Semi-nonparametric Time Series Models Sieve Inference on Semi-nonparametric ime Series Models Xiaohong Chen y, Zhipeng Liao z, and Yixiao Sun x First Draft: October 2009; his draft: August 202 Abstract his paper provides a general theory on

More information

Informational Content in Static and Dynamic Discrete Response Panel Data Models

Informational Content in Static and Dynamic Discrete Response Panel Data Models Informational Content in Static and Dynamic Discrete Response Panel Data Models S. Chen HKUST S. Khan Duke University X. Tang Rice University May 13, 2016 Preliminary and Incomplete Version Abstract We

More information

Semiparametric Estimation of Partially Linear Dynamic Panel Data Models with Fixed Effects

Semiparametric Estimation of Partially Linear Dynamic Panel Data Models with Fixed Effects Semiparametric Estimation of Partially Linear Dynamic Panel Data Models with Fixed Effects Liangjun Su and Yonghui Zhang School of Economics, Singapore Management University School of Economics, Renmin

More information