IDENTIFICATION OF THE BINARY CHOICE MODEL WITH MISCLASSIFICATION

Similar documents
UNIVERSITY OF CALIFORNIA Spring Economics 241A Econometrics

A Local Generalized Method of Moments Estimator

Simple Estimators for Semiparametric Multinomial Choice Models

Nonparametric Identi cation of Regression Models Containing a Misclassi ed Dichotomous Regressor Without Instruments

Econ 273B Advanced Econometrics Spring

Simple Estimators for Monotone Index Models

Course Description. Course Requirements

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

Nonparametric Identi cation of Regression Models Containing a Misclassi ed Dichotomous Regressor Without Instruments

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

A Note on the Closed-form Identi cation of Regression Models with a Mismeasured Binary Regressor

Course Description. Course Requirements

Estimation of Average Treatment Effects With Misclassification

An Overview of the Special Regressor Method

A Simple Way to Calculate Confidence Intervals for Partially Identified Parameters. By Tiemen Woutersen. Draft, September

Course Description. Course Requirements

Working Paper No Maximum score type estimators

Identification and Estimation of Regression Models with Misclassification

Nonparametric Identication of a Binary Random Factor in Cross Section Data and

Informational Content in Static and Dynamic Discrete Response Panel Data Models

Simple Estimators For Hard Problems: Endogeneity and Dependence in Binary Choice Related Models. and

Parametric identification of multiplicative exponential heteroskedasticity ALYSSA CARLSON

APEC 8212: Econometric Analysis II

Conditions for the Existence of Control Functions in Nonseparable Simultaneous Equations Models 1

Quantile methods. Class Notes Manuel Arellano December 1, Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be

A Simple Estimator for Binary Choice Models With Endogenous Regressors

Estimating Semi-parametric Panel Multinomial Choice Models

Increasing the Power of Specification Tests. November 18, 2018

Nonseparable multinomial choice models in cross-section and panel data

CENTER FOR LAW, ECONOMICS AND ORGANIZATION RESEARCH PAPER SERIES

Rank Estimation of Partially Linear Index Models

Nonparametric Identi cation of a Binary Random Factor in Cross Section Data

A Simple Estimator for Binary Choice Models With Endogenous Regressors

NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES. James J. Heckman Petra E.

ORTHOGONALITY TESTS IN LINEAR MODELS SEUNG CHAN AHN ARIZONA STATE UNIVERSITY ABSTRACT

16/018. Efficiency Gains in Rank-ordered Multinomial Logit Models. June 13, 2016

Non-linear panel data modeling

ESTIMATION OF NONPARAMETRIC MODELS WITH SIMULTANEITY

Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix

1. GENERAL DESCRIPTION

Syllabus. By Joan Llull. Microeconometrics. IDEA PhD Program. Fall Chapter 1: Introduction and a Brief Review of Relevant Tools

Estimating the Derivative Function and Counterfactuals in Duration Models with Heterogeneity

A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples

Identification in Nonparametric Limited Dependent Variable Models with Simultaneity and Unobserved Heterogeneity

Limited Dependent Variables and Panel Data

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity

Partial Rank Estimation of Transformation Models with General forms of Censoring

The Identification Zoo - Meanings of Identification in Econometrics: PART 3

Parametric Identification of Multiplicative Exponential Heteroskedasticity

Estimation of Average Treatment Effects With Misclassification

The relationship between treatment parameters within a latent variable framework

IDENTIFICATION OF MARGINAL EFFECTS IN NONSEPARABLE MODELS WITHOUT MONOTONICITY

Identification and Estimation of Semiparametric Two Step Models

Made available courtesy of Elsevier:

Simple Estimators for Binary Choice Models With Endogenous Regressors

The Identification Zoo - Meanings of Identification in Econometrics: PART 2

NONPARAMETRIC IDENTIFICATION AND ESTIMATION OF NONCLASSICAL ERRORS-IN-VARIABLES MODELS WITHOUT ADDITIONAL INFORMATION

7 Semiparametric Estimation of Additive Models

Discrete Time Duration Models with Group level Heterogeneity

Partial Identification and Inference in Binary Choice and Duration Panel Data Models

Microeconometrics. C. Hsiao (2014), Analysis of Panel Data, 3rd edition. Cambridge, University Press.

Identification of Regression Models with Misclassified and Endogenous Binary Regressor

A simple alternative to the linear probability model for binary choice models with endogenous regressors

Econometric Analysis of Cross Section and Panel Data

Exclusion Restrictions in Dynamic Binary Choice Panel Data Models

Fall, 2007 Nonlinear Econometrics. Theory: Consistency for Extremum Estimators. Modeling: Probit, Logit, and Other Links.

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS

A dynamic model for binary panel data with unobserved heterogeneity admitting a n-consistent conditional estimator

Birkbeck Working Papers in Economics & Finance

11. Bootstrap Methods

Kernel Logistic Regression and the Import Vector Machine

A SEMIPARAMETRIC MODEL FOR BINARY RESPONSE AND CONTINUOUS OUTCOMES UNDER INDEX HETEROSCEDASTICITY

Estimating Semi-parametric Panel Multinomial Choice Models using Cyclic Monotonicity

Unobserved Preference Heterogeneity in Demand Using Generalized Random Coefficients

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Rank-Ordered Logit Model with Unobserved Heterogeneity in Ranking Capabilities

Discussion Paper Series

DEPARTAMENTO DE ECONOMÍA DOCUMENTO DE TRABAJO. Nonparametric Estimation of Nonadditive Hedonic Models James Heckman, Rosa Matzkin y Lars Nesheim

Nonparametric Welfare Analysis for Discrete Choice

Semiparametric Estimation with Mismeasured Dependent Variables: An Application to Duration Models for Unemployment Spells

Lecture Notes on Measurement Error

Semiparametric Efficiency in Irregularly Identified Models

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity

Problem Set 3: Bootstrap, Quantile Regression and MCMC Methods. MIT , Fall Due: Wednesday, 07 November 2007, 5:00 PM

Irregular Identification, Support Conditions, and Inverse Weight Estimation

The random coefficients logit model is identified

Random Coefficients on Endogenous Variables in Simultaneous Equations Models

ESTIMATING PANEL DATA DURATION MODELS WITH CENSORED DATA

Unobserved Preference Heterogeneity in Demand Using Generalized Random Coefficients

Estimation of Treatment Effects under Essential Heterogeneity

leebounds: Lee s (2009) treatment effects bounds for non-random sample selection for Stata

Identification in Discrete Choice Models with Fixed Effects

A Bootstrap Test for Conditional Symmetry

Control Functions in Nonseparable Simultaneous Equations Models 1

CALCULATION METHOD FOR NONLINEAR DYNAMIC LEAST-ABSOLUTE DEVIATIONS ESTIMATOR

Comments on: Panel Data Analysis Advantages and Challenges. Manuel Arellano CEMFI, Madrid November 2006

Endogeneity in Semiparametric Binary Response Models

Endogenous binary choice models with median restrictions: A comment

Maximum score estimation of preference parameters for a binary choice model under uncertainty

Simple Estimators for Invertible Index Models

Transcription:

IDENTIFICATION OF THE BINARY CHOICE MODEL WITH MISCLASSIFICATION Arthur Lewbel Boston College December 19, 2000 Abstract MisclassiÞcation in binary choice (binomial response) models occurs when the dependent variable is measured with error, that is, when an actual one response is sometimes recorded as a zero, and vice versa. This paper shows that binary response models with misclassiþcation are semiparametrically identiþed, even when the probabilities of misclassiþcation depend in unknown ways on model covariates, and the distribution of the errors is unknown. JEL Codes: C14 C25 C13 Keywords: Binary Choice, MisclassiÞcation, Measurement error, Semiparametric, Latent Variable Department of Economics, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467 USA. Tel: (617) 552-3678. email: lewbel@bc.edu 1

1 Introduction This paper shows that binary response models with misclassiþcation of the dependent variable are semiparametrically identiþed, even when the probabilities of misclassiþcation depend in unknown ways on model covariates, and the distribution of the errors is unknown. Let x i be a vector of covariates that may affect both the response of observation i, and the probability that the response is observed incorrectly. For identiþcation, assume there exists a covariate L i that affects the true response but does not affect the probability of misclassiþcation. If more than one one such covariate exists, let L i be any one of the available candidates (that satisþes the regularity conditions listed below), and the others can without loss of generality be included in the vector x i Let yi be an unobserved latent variable associated with observation i, given by y i, L i + x i + e i where the e i are independently, identically distributed errors. The true response is given by Ay i, I y i 0 where I equals one if is true and zero otherwise. When Ay i is observed, this is the standard latent variable speciþcation of the binary response model (see, e.g., McFadden 1984). Now permit the true response (i.e., classiþcation of observation i) to be observed with error. Letting y i denote the observed binary dependent variable, the misclassiþcation probabilities are a x i, Pr y i, 1 Ay i, 0 x i a x i, Pr y i, 0 Ay i, 1 x i So a x i is the probability that an actual zero response is misclassiþed (i.e., incorrectly recorded) as a one, and a x i is the probability that a one response is misclassiþed as a zero. These misclassiþcation probabilitiesarepermittedtodependinanunknownwayonobservedcovariatesx i This framework encompasses models where misclassiþcation probabilities may also depend on variables that do not affect the true response, since any covariate x ji that affects a or a but not y is just a covariate that has a coefþcient j that equals zero. DeÞne b x as b x i, [1 a x i a x i ] and deþne the function g to be the conditional expectation of y, which in this model is 1 g L i x i, E y i L i x i, a x i + b x i F L i + x i 2

where F is the cumulative distribution function of the random variable e Another model that corresponds to equation (1) is when a fraction a x of respondents having characteristics x always answer one, a fraction a x always answer zero, and the remainder respond with I L + x + e 0 In this interpretation some respondents give natural responses that are due to factors other than the latent variable, while the other respondents follow the latent variable model. While this model is observationally equivalent to the misclassiþcation model, the interpretation of the natural response model (in particular, the implied marginal effects) is quite different. See, e.g., Finney (1964). Examples of recent papers that consider estimation of misclassiþcation model parameters or misclassiþcation probabilities include Manski (1985), Chua and Fuller (1987), Brown and Light (1992), Poterba and Summers (1995), Abrevaya and Hausman (1997), and Hausman, Abrevaya, and Scott-Morton (1998). These last two papers provide parametric (maximum likelihood) estimators of the model when the function F is known, and a semiparametric estimator for the case where F is unknown and the misclassiþcation probabilities a and a are constants (independent of all covariates). They also show that when F is unknown, the coefþcients of covariates that do not affect the misclassiþcation probabilities can be estimated. This paper shows that (given some regularity) the entire model is identiþed even when the functions a a,andf are unknown. ASSUMPTION A1. Assume for all x that 0 a x 0 a x and a x + a x 1 Assume L, conditional on x is continuously distributed. Assume that F M is three times differentiable with f M, df M dm $, 0and f M, df M dm. Assume,1and, for all $, prob [ f L + x f L + x ] $, E 0 The assumption that the sum of misclassiþcation probabilities be less than one is what Hausman et. al. (1998) call the monotonicity condition, and holds by construction in the natural response form of the model. Letting,1isanarbitrary free normalization, as long as $, 0 Only the covariate L is assumed to be continuous. The Þnal condition in Assumption A.1 is a parametric identiþcation assumption that would provide identiþcation of from the score function if f was a known function and there was no misclassiþcation. DeÞne the function L x by 2 L x, (2 g L x (L 2 (g L x (L x sign[e (g L ] (L Let r L x be any function such that r L x 0 sup r L x is Þnite, and E[r L x ], 1 Lemma 1 Given Assumption D A1, L x, f L +x f L +x, sign E[r L x (g L x (L] and, arg min E L x E[ L x L + x ] 2E Also,, E r L x [( L x (x] [( L x (L] 3

This Lemma shows identiþcation of the model coefþcients. Estimation based on this Lemma could proceed as follows. First, estimate >g as a nonparametric regression of y on L and x Next deþne > by equation (2), replacing g with >g and the expectation with a sample average. Then let > equal the sign of any weighted average derivative of E y L x with respect to L (using, e.g., the estimator of Powell, Stock and Stoker 1989). The Lemma suggests two different estimators for Let L + x, E[ L x L + x ] for any and let > L> + x be a nonparametric regression of > L x on L> + x The estimate > is then the value of that minimizes the sample average of [> L x > L> + x ] 2 This is essentially Ichimura s (1993) linear index model estimator, using > L x as the dependent variable. Another estimator for suggested by the Lemma is to let > equal the sample average of r L x [(> L x (x] [(> L This is an average derivative type estimator, which is only feasible for continuously distributed regressors because of the need to estimate the term (> L x (x. More generally, Lemma 1 shows that L x, L + x, so can be estimated using any of a variety of linear index model estimators, treating > L x as the dependent variable. For example, Powell, Stock, and Stoker (1989) could be used to estimate the coefþcients of the continuous regressors, and Horowitz and Härdle (1996) for the discrete regressors. The limiting distributions of these estimators will be affected by the use of an estimated dependent variable > L x instead of an observed one. However, all of these estimators involve unconditional expectations, estimated as averages of functions of nonparametric regressions. With sufþcient regularity (including judicious selection of the function r e.g., having r be a density function that equals zero wherever might be small), such expectations can typically be estimated at rate root n. See, e.g., Newey and McFadden (1994). Also, some relevant results on the uniform convergence and limiting distribution of nonparametric kernel estimators based on estimated (generated) variables include Andrews (1995) and Ahn (1997). DeÞne M, L + x, which by Lemma 1 is identiþed. Let f M M denote the unconditional probability density function of M DeÞne h by h M x, E y M x, a x + b x F M DeÞne the function by the indeþnite integral M, ( 2 h M x (M 2 E (h M x (M 3, exp I d M Let M and e denote the supports of M and e respectively. DeÞne the constant c by c, I M M dm. Lemma 2 Given Assumption A1, f M, M c and b x, E [(h M x (M] M x c If e is a subset of M then c, E[ M f M M ] This Lemma shows that the density function f M and the misclassiþcation function b x are identiþed 4

up to the constant c and the constant c is also identiþed (and can be estimated as a sample average), provided that the data generating process for M has sufþciently large support. Estimators based directly on Lemma 2 would consist of the following steps. First, construct >M, L> + x> and let >h >M x be a nonparametric regression of y on >M and x Next, let > >M be a nonparametric regression of [( 2 >h >M x ( >M 2 ] [(>h >M x (>M] on>m, anddeþne the function > M, exp I> M dm The scalar >c then equals the sample average of > >M >f M >M, where >f M is a nonparametric estimator (for example, a kernel estimator) of the density of >M Finally, >f M M, > M >c, and>b x equals >c times a nonparametric regression of [(>h >M x ( >M] > >M on x The resulting estimates should be consistent, as long as uniformly consistent nonparametric estimators are used at each stage. Note that consistency may require trimming (possibly asymptotic trimming) to a compact subset of M, because of division by the density f M The above Lemmas show that the marginal effects ( Pr Ay, 1 L x (x, f L + x and ( Pr Ay, 1 L x (L, f L + x are identiþed, and that the misclassiþcation error function b x is also identiþed. If a x, a x, that is, if the probability of misclassiþcation doesn t depend on Ay then Lemma 2 implies that the misclassiþcation probability a x, a x, [1 b x ] 2 is also identiþed. Instead of using Lemma 2, log derivatives of b x (and hence of a x and a x when they are equal) with respect to continuously distributed elements of x can be directly estimated, without requiring numerical integration, the large M support assumption, or the generated variable >M by the following Lemma. Lemma 3 Let x j be any continuously distributed element of x and let j be the corresponding element of Let Assumption A1 hold, and assume b x is differentiable in x j Then ( ln b x ( 2 g L x (L(x j, E L x (x j (g L x (L j x By Lemma 3, a nonparametric regression of [( 2 >g L x (L(x j ] [(>g L x (L] > L x > j on x is an estimator of ( ln b x (x j Dividing this estimate by 2 yields an estimate of ( ln a x (x j and ( ln a x (x j when a x, a x Next, consider identiþcation of a x and a x when they are not equal. Let F M M x denote the conditional cumulative distribution function of M given x and let f M M x, ( F M M x (M be the conditional probability density function of M given x and let M x denote the support of M given x Let x, 1 E[ f M F M M x f M M x x] Lemma 4 Let Assumption A1 hold, and assume that e is a subset of M x Then a x, E[h M x x] b x x,a x, b x 1 + a x and F M, E [h M x a x ] b x M Estimation of x requires extreme values of M given x, and hence of L to be observable. Some intuition for this result comes from the observation that g L x a x for very large L and g L x 5

1 a x for very small L Hence, analogous to the estimation of c, data in the tails are required for estimation of a x and a x. Estimation proceeds as in the previous Lemmas, that is, employing >M in place of M, nonparametric estimation of the density function f M M x, and nonparametric regression to estimate conditional expectations. Taken together, these Lemmas show that the entire model is identiþed. The parameters and can be consistently estimated (with regularity, at rate root n), and the functions a x a x and F M, can be consistently estimated nonparametrically. The estimators provided here are not likely to be very practical, since they involve up to third order derivatives and repeated applications of nonparametric regression, and they do not exploit some features of the model such as monotonicity of F. However, the demonstration that the entire model is identiþed suggests that the search for better estimators would be worthwhile. A Appendix Proof of Lemma 1: (g L x (L, b x f L + x b x 0 and f L + x 0 so, sign[(g L x (L] ( 2 g L x (L 2, b x f L + x 2 and 2, 1 so L x, f L + x f L + x Let L + x, E[ L x L + x ] It follows from the previous expression for that L x and the Þnal equality in Assumption A1 that Dprob[ L x, L + x ] 0 for all $,,and L x, L + x so, arg min E L x E[ L x L + x ] 2E The alternative expression, [( L x (x] [( L x (L] follows because depends on x and L only through L + x so E r L x [( L x (x] [( L x (L], E[r L x ], ProofofLemma2: (h M x (M, b x f M ( 2 h M x (M 2, b x f M so [( 2 h M x (M 2 ] [(h M x (M], f M f M, E [( 2 h M x (M 2 ] [(h M x (M] M Then, exp[i f f d ], f c where ln c is the constant of integration. E [(h M x (M] M x c, E [b x f M ] M x c, E [b x f M ] [ f M c] x c, b x E[ M f M M ], I M [ M f M M ] f M M dm, I M M dm, I M f M cdm, c where the last equality holds as long as M contains every value of e for which f e is nonzero. ProofofLemma3: ( 2 g L x (L(x j, f L+ x (b x (x j +b x f L+ x j so [( 2 g L x (L(x j ] [(g L x (L], [(b x (x j ] b x + [ f L + x f L + x ] j, ( ln b x (x j + L x j The lemma then follows immediately. 6

ProofofLemma4 Let ( M x denote the boundary of the support M x. Applying an integration by parts gives E[F M x], I M x F M f M M x dm, F M F M M x M,( M x I M x f M F M M x dm Having e be a subset of M x ensures that F M F M M x M,( M x, 1 and so x, 1 I M x [ f M F M M x f M M x ] f M M x dm, E[F M x] Therefore, E[h M x x], a x +b x E[F M x], a x +b x x which gives the identiþcation of a x a x, b x 1 + a x then follows from the deþnition of b x and h M x, a x + b x F M is then used to obtain F M ACKNOWLEDGEMENT This research was supported in part by the National Science Foundation through grant SBR-9514977. I would like to thank the associate editor and two referees for helpful comments and suggestions. Any errors are my own. References [1] ABREVAYA, J. AND J. A. HAUSMAN, (1997), Semiparametric Estimation with Mismeasured Dependent Variables: An Application to Panel Data on Employment Spells. Unpublished manuscript. [2] AHN, H., (1997), Semiparametric Estimation of Single Index Model With Nonparametric Generated Regressors, Econometric Theory, 13, 3-31. [3] ANDREWS, D. W. K., (1995), Nonparametric Kernel Estimation for Semiparametric Models, Econometric Theory, 11, 560 596 [4] BROWN, J. N., AND A. LIGHT, (1992), Interpreting Panel Data on Job Tenure, Journal of Labor Economics, 10, 219-257. [5] CHUA, T. C. AND W. A FULLER, (1987) A Model For Multinomial Response Error Applied to Labor Flows, Journal of the American Statistical Association, 82, 46-51. [6] FINNEY, D. J. (1964) Statistical Method in Biological Assay. Havner: New York. [7] HOROWITZ, J. L. AND W. HÄRDLE, (1996), Direct Semiparametric Estimation of Single-Index Models With Discrete Covariates, Journal of the American Statistical Association, 91, 1632-1640. [8] HAUSMAN, J. A., J. ABREVAYA, AND F. M. SCOTT-MORTON (1998), MisclassiÞcation of the Dependent Variable in a Discrete-Response Setting, Journal of Econometrics, 87, 239-269. 7

[9] ICHIMURA, H. (1993), Semiparametric Least Squares (SLS) and Weighted SLS estimation of Singleindex Models, Journal of Econometrics, 58, 71 120. [10] MCFADDEN, D., (1984), Econometric Analysis of Qualitative Response Models, In: Griliches, Z., Intriligator, M.D.(Eds.), Handbook of Econometrics, vol. 2. North-Holland, Amsterdam. [11] MANSKI, C. F. (1985) Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator, Journal of Econometrics, 82, 46-51. [12] NEWEY, W. K., AND MCFADDEN, D., (1994), Large Sample Estimation and Hypothesis Testing. In: Engle, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. North-Holland, Amsterdam. [13] POTERBA, J. M. AND L. H. SUMMERS (1995) Unemployment BeneÞts and Labor Market Transitions: A Multinomial Logit Model With Errors in ClassiÞcation, Review of Economics and Statistics, 77, 207-216. [14] POWELL, J.L.,J.H.STOCK, AND T. M. STOKER (1989), Semiparametric Estimation of Index CoefÞcients, Econometrica 57, 1403 1430. 8