CENSORED DATA AND CENSORED NORMAL REGRESSION


Data censoring comes in many forms: binary censoring, interval censoring, and top coding are the most common. They all start with an underlying linear model for y, the variable of interest:

y = xβ + u,  (1)
E(u|x) = 0,  (2)

where x is 1 × K with first element unity. Under (1) and (2), if we have random draws (x_i, y_i) from the population, then OLS is consistent and √N-asymptotically normal for the parameters of interest, β. But suppose what we observe is a censored version of y. In the top coding example, suppose that wealth is measured in thousands of dollars and it is top coded at $200,000.
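Before formalizing things, a small simulation sketch shows what top coding does to OLS. All numbers here are hypothetical: a linear wealth equation (in $1000s), with wealth top coded at 200.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Hypothetical wealth model (in $1000s): y = 20 + 40*x + u, u ~ N(0, 60^2)
x = rng.uniform(0, 5, n)
y = 20 + 40 * x + rng.normal(0, 60, n)
w = np.minimum(y, 200)                  # top coding at 200 ($200,000)

frac_censored = np.mean(w == 200)       # nontrivial pile-up at exactly 200

# OLS of the censored w on x understates the true slope of 40
X = np.column_stack([np.ones(n), x])
b_cens = np.linalg.lstsq(X, w, rcond=None)[0]
b_true = np.linalg.lstsq(X, y, rcond=None)[0]
print(frac_censored, b_cens[1], b_true[1])
```

The OLS slope on the censored w is noticeably attenuated relative to the slope on the uncensored y, which is why the censoring has to be modeled rather than ignored.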

Then we can define the censored version of wealth (for any unit that can be drawn from the population) as w = min(y, 200). For each random draw i, w_i = min(y_i, 200). What would the data set look like on (x_i, w_i), where w_i is called "wealth"? We should notice that the maximum value of the w_i in the sample is 200, with a nontrivial fraction of observations at exactly 200. Because there are no behavioral reasons to see a focal point for wealth at 200, let alone to observe no values greater than 200, we would recognize that the wealth variable has been top coded at 200.

BINARY CENSORING

Suppose we want to estimate the factors that affect willingness to pay (wtp). It is hard to elicit a precise figure, so present families with a cost and allow them to simply state whether their wtp is above the cost. The model for

the population is first assumed to be

wtp = xβ + u,  E(u|x) = 0,  (3)

where the first element of x is unity. Let r_i denote the cost of the project to household i. Presented with this cost, the household either says it is in favor of the project or not. Thus, along with x_i and r_i, we observe the binary response

w_i = 1[wtp_i > r_i],  (4)

where we assume the chance that wtp_i equals r_i is zero. We are given data on (x_i, r_i, w_i). What is the most natural way to proceed to estimate β? If we impose some strong assumptions on the underlying population and the nature of r_i, then we can proceed with maximum likelihood. In particular, assume

u_i | x_i, r_i ~ Normal(0, σ²).  (5)

Assumption (5) implies that (3) actually satisfies the classical linear model (CLM). It also requires that r_i is

independent of wtp_i conditional on x_i, that is,

D(wtp_i | x_i, r_i) = D(wtp_i | x_i).  (6)

This assumption is satisfied if r_i is randomized, or if r_i is chosen as a function of x_i, or some combination of these. Then

P(w_i = 1 | x_i, r_i) = P(wtp_i > r_i | x_i, r_i)
  = P(u_i/σ > (r_i − x_iβ)/σ | x_i, r_i)
  = 1 − Φ[(r_i − x_iβ)/σ] = Φ[(x_iβ − r_i)/σ].  (7)

So the estimating equation is a probit model with coefficients β/σ (on x_i) and −1/σ (on r_i). All parameters, including σ, are identified if x_i does not contain perfect collinearity and r_i varies across i (in a way not perfectly linearly related to x_i). If r_i is the same for all i, we cannot identify the parameters. Estimate β and σ by MLE.
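The probit MLE implied by (7) can be sketched directly. The parameter values below are hypothetical, and the probit is hand-rolled with scipy rather than a canned routine; the point is that β and σ are recovered from the probit coefficients β/σ (on x) and −1/σ (on r).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 4000
b0, b1, sigma = 5.0, 2.0, 3.0           # hypothetical true values
x = rng.normal(0, 1, n)
r = rng.uniform(1, 12, n)               # randomized project cost, so (6) holds
wtp = b0 + b1 * x + rng.normal(0, sigma, n)
w = (wtp > r).astype(float)             # all we observe besides (x, r)

# Probit index from (7): (b0 + b1*x - r)/sigma, so the probit coefficients
# are b0/sigma, b1/sigma, and -1/sigma
Z = np.column_stack([np.ones(n), x, r])

def negll(c):
    p = np.clip(norm.cdf(Z @ c), 1e-12, 1 - 1e-12)
    return -np.sum(w * np.log(p) + (1 - w) * np.log(1 - p))

c_hat = minimize(negll, np.zeros(3), method="BFGS").x
sigma_hat = -1.0 / c_hat[2]             # recover sigma from the cost coefficient
beta_hat = c_hat[:2] * sigma_hat        # then rescale to get beta
print(beta_hat, sigma_hat)
```

Note that if r_i did not vary across i, the coefficient on r would not be estimable and the rescaling step would fail, exactly as the identification discussion says.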

The costs of the binary censoring scheme are severe. If we could observe wtp_i, E(wtp_i | x_i) = x_iβ would suffice for consistent estimation of β; in fact, we could just specify a linear projection and use OLS. With censoring, we must assume the underlying model satisfies the CLM. Now, discussions of the deleterious effects of nonnormality and heteroskedasticity when using probit models make much more sense.

There are ways to estimate the slope parameters up to scale without placing strong restrictions on D(u_i | x_i, r_i). For example, if the distribution of (x_i, r_i, y_i) implies linear conditional expectations for all elements conditional on y_i, then the Chung and Goldberger (1984) results can be used for OLS. Manski's (1975, 1988) maximum score estimator requires only symmetry of D(u_i | x_i, r_i) (around zero), and Horowitz's (1992) smoothed version is more convenient for inference. But in every case these methods only estimate the slope coefficients up to scale (and the intercept cannot be estimated at all), and therefore we cannot learn the

magnitude of the effect of any element of x on willingness to pay, nor do we have a way of predicting willingness to pay for given values of the covariates.

An interesting puzzle: what if the appropriate population model for wtp is the type I Tobit,

wtp = max(0, xβ + u),  (8)

under assumption (5)? Now, for example, E(wtp | x) has the (nonlinear) Tobit form, and we can easily compute it because we have consistent estimates of β and σ. The estimation procedure is the same provided the r_i are strictly positive. Because we do not observe wtp_i, with our data (x_i, r_i, w_i) we cannot distinguish between (3) and (8). But if we believe that wtp is zero for some fraction of the population, any calculations should take that into account by using the type I Tobit formulas.
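For reference, the Tobit form of the conditional mean is E(wtp | x) = Φ(xβ/σ)·xβ + σ·φ(xβ/σ). A quick numerical sketch under hypothetical values of xβ and σ, with a Monte Carlo check of the formula:

```python
import numpy as np
from scipy.stats import norm

# Type I Tobit conditional mean: for wtp = max(0, xb + u), u ~ N(0, sigma^2),
#   E(wtp | x) = Phi(xb/sigma) * xb + sigma * phi(xb/sigma)
def tobit_mean(xb, sigma):
    z = xb / sigma
    return norm.cdf(z) * xb + sigma * norm.pdf(z)

print(tobit_mean(0.0, 3.0))             # sigma*phi(0) = 3/sqrt(2*pi), about 1.197

# Monte Carlo check at hypothetical values xb = 1.5, sigma = 3
rng = np.random.default_rng(2)
draws = np.maximum(0.0, 1.5 + rng.normal(0.0, 3.0, 200_000))
print(tobit_mean(1.5, 3.0), draws.mean())
```

The formula and the simulated average agree, and the mean is strictly positive even at xβ = 0, which is the corner-solution feature the linear model (3) cannot reproduce.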

Because estimation of β and σ² can be sensitive to heteroskedasticity and nonnormality, it makes sense to be flexible in specifying D(u|x).

INTERVAL CODING

We can generalize the WTP example to allow multiple intervals. Let y_i be the (unobserved) response for random draw i. We only know whether y_i falls into a specific interval. In many cases, these intervals are fixed across all i, such as surveys asking people about their income bracket. In this case we say we have interval-coded data. First let a_1 < a_2 < ... < a_J denote the known interval limits that are common across i; these are specified as part of the survey method. If

u | x ~ Normal(0, σ²),

we can estimate β and σ² by MLE, provided J ≥ 2. Not surprisingly, the structure of the problem is similar to that for an ordered probit for an ordered, qualitative response (such as a credit rating). But with ordered probit we

estimate the cut points, whereas here we know the intervals and hope to estimate the parameters of the underlying CLM. In fact, we can define

w = 0  if y ≤ a_1
w = 1  if a_1 < y ≤ a_2
  ...
w = J  if y > a_J  (9)

and easily obtain the conditional probabilities P(w = j | x) for j = 0, 1, ..., J. The log-likelihood is

l_i(β, σ²) = 1[w_i = 0] log Φ[(a_1 − x_iβ)/σ]
  + 1[w_i = 1] log{Φ[(a_2 − x_iβ)/σ] − Φ[(a_1 − x_iβ)/σ]}
  + ... + 1[w_i = J] log{1 − Φ[(a_J − x_iβ)/σ]}.  (10)

The maximum likelihood estimators, β̂ and σ̂², are often called interval regression estimators, with the understanding that the underlying distribution is normal.
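The interval regression MLE based on (10) is straightforward to code. A minimal sketch on simulated data with hypothetical parameter values and interval limits (Stata's intreg fits the same model):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 4000
beta0, beta1, sigma = 1.0, 2.0, 1.5     # hypothetical CLM parameters
x = rng.normal(0, 1, n)
y = beta0 + beta1 * x + rng.normal(0, sigma, n)

a = np.array([-2.0, 0.0, 2.0, 4.0])     # known interval limits a_1 < ... < a_J
w = np.searchsorted(a, y)               # w = j when a_j < y <= a_{j+1}, as in (9)

# Log-likelihood (10): probability that y falls in the observed interval
cuts = np.concatenate([[-np.inf], a, [np.inf]])

def negll(theta):
    b0, b1, s = theta[0], theta[1], np.exp(theta[2])   # s > 0 via exp
    xb = b0 + b1 * x
    lo = (cuts[w] - xb) / s
    hi = (cuts[w + 1] - xb) / s
    p = np.clip(norm.cdf(hi) - norm.cdf(lo), 1e-12, None)
    return -np.sum(np.log(p))

res = minimize(negll, np.zeros(3), method="BFGS")
b0_hat, b1_hat, s_hat = res.x[0], res.x[1], np.exp(res.x[2])
print(b0_hat, b1_hat, s_hat)
```

Even though only the bracket of y is observed, β and σ are recovered, which is exactly the sense in which interval regression lets us act as if we had run the regression of y on x.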

Importantly, when we obtain the interval regression estimates, we interpret the β̂ as if we had been able to run the regression of y_i on x_i, i = 1, ..., N. Imposing the assumptions of the classical linear model allows us to estimate the parameters in the distribution D(y|x) even though the data are censored by being put into intervals.

Sometimes in applications of interval regression the observed, censored variable, w, is set to some value within the interval that y belongs to. For example, if y is wealth, we might set w to the midpoint of the interval that y falls into. (Of course, we have to use some other rule if y ≤ a_1 or y > a_J.) Provided the definition of w determines the proper interval, the maximum likelihood estimators of β and σ will be the same. In Stata, for each observation we specify two dependent variables, which are the upper and lower bounds for each i.

When w is defined to have the same units as y, it is tempting to ignore the grouping of the data and just

run an OLS regression of w_i on x_i, i = 1, ..., N. Naturally, such a procedure is generally inconsistent for β. Nevertheless, the results of Chung and Goldberger apply: if E(x|y) is linear in y, then the OLS regression produces consistent estimators up to a common scale factor (at least for the slope coefficients).

If we allow the endpoints to depend on i, we encompass the case of binary censoring. More generally, we might have multiple endpoints that change across i. In Stata, we specify the two endpoints for each observation, that is, a lower bound and an upper bound of the interval that observation is known to fall into. The command specifies these bounds as dependent variables:

intreg lower upper x1 x2 ... xk

When the interval limits change across i, we assume they do so exogenously, namely,

D(y_i | x_i, a_i1, ..., a_iJ) = D(y_i | x_i).  (11)

In the WTP example, this holds because the single limit (J = 1), the cost r_i, is randomly assigned. Generally, the limits can be a function of x_i (because these are being conditioned on).

Because of the underlying normality assumption, we can use the Rivers-Vuong (1988) control function approach to test and correct for endogeneity of explanatory variables. It is the analog of the Smith-Blundell approach that we discussed for Tobit. The underlying model is linear:

y_1 = z_1 δ_1 + α_1 y_2 + u_1.  (12)

We would just like to use 2SLS on this model, but y_1 is interval coded (and we have the lower and upper limits for each observation).

reg y2 z1 z2 ... zl

predict v2h, resid
intreg lower1 upper1 z1 ... zl1 y2 v2h

where L_1 < L. We are just interested in δ_1 and α_1 here, because (12) is the equation of interest. The bootstrap is easy to implement for standard errors. Again, beware of schemes for discrete y_2 that involve plugging in fitted values.

RIGHT AND LEFT CENSORING

Now we consider the more common cases of left and right censoring (or censoring from below and above). In top coding cases (right censoring) and minimum wage or price floors (left censoring), the censoring point is typically fixed. But in other applications, particularly duration models, the censoring point changes with i. The right censoring case, where y_i is again the underlying variable of interest, is

y_i = x_iβ + u_i,  (13)
w_i = min(y_i, c_i),  (14)

where c_i > 0 is the censoring point for unit i. Sometimes the linear model is specified for y_i = log(durat_i), and so, of course, the censoring values have to be logs, too. If we assume exogenous censoring (and exogenous explanatory variables), that is,

D(u_i | x_i, c_i) = D(u_i) = Normal(0, σ²),  (15)

we can use censored normal regression (also called censored Tobit). (In Stata, however, the tobit command does not allow the limits to depend on i, so we need to use the cnreg command.) The log likelihood is similar to the type I Tobit:

l_i(β, σ²) = 1[w_i = c_i] log Φ[(x_iβ − c_i)/σ]
  + 1[w_i < c_i] log{φ[(w_i − x_iβ)/σ]/σ}.  (16)
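The likelihood in (16) is easy to maximize directly. A sketch with simulated data and unit-specific right-censoring points c_i (all parameter values hypothetical; this is the model cnreg fits):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 4000
beta0, beta1, sigma = 0.5, 1.0, 2.0     # hypothetical parameters
x = rng.normal(0, 1, n)
y = beta0 + beta1 * x + rng.normal(0, sigma, n)
c = rng.uniform(0.5, 3.0, n)            # unit-specific right-censoring points
w = np.minimum(y, c)                    # observed variable, as in (14)
cens = (y >= c)                         # censoring indicator

# Log-likelihood (16): censored obs contribute log Phi[(x*b - c)/s];
# uncensored obs contribute log{ phi[(w - x*b)/s] / s }
def negll(theta):
    b0, b1, s = theta[0], theta[1], np.exp(theta[2])
    xb = b0 + b1 * x
    ll = np.where(cens,
                  norm.logcdf((xb - c) / s),
                  norm.logpdf((w - xb) / s) - np.log(s))
    return -ll.sum()

res = minimize(negll, np.zeros(3), method="BFGS")
b0_hat, b1_hat, s_hat = res.x[0], res.x[1], np.exp(res.x[2])
print(b0_hat, b1_hat, s_hat)
```

Note the code only uses c_i for the censored observations, matching the point below that the censoring value need not be observed for uncensored units.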

MLE is straightforward, as before, and we can focus just on β. Note that we only have to observe c_i when y_i is actually censored; we just have to know which observations are censored. In some duration data sets, the censoring value is reported only when the duration is censored. This causes no problem for MLE (but does for certain semiparametric procedures).

In Stata, we need to have a variable that tells when an observation is censored (and whether it is left or right censored). So cens is −1 for left censoring, 0 for uncensored, and 1 for right censoring.

cnreg y x1 x2 ... xk, censored(cens)

Smith-Blundell applies directly: add the reduced form residuals to cnreg and test for endogeneity. No need to compute complicated partial effects, but we should bootstrap the standard errors.
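The two-step control function procedure just described can be sketched on simulated data (all parameter values hypothetical): first-stage OLS of y_2 on the instruments, then censored normal regression as in (16) with the first-stage residual v2h added as a regressor.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 4000
z1 = rng.normal(0, 1, n)                # exogenous regressor
z2 = rng.normal(0, 1, n)                # excluded instrument
v2 = rng.normal(0, 1, n)
u1 = 0.8 * v2 + rng.normal(0, 1, n)     # u1 correlated with v2: y2 is endogenous
y2 = 0.5 * z1 + 1.0 * z2 + v2           # first stage (reduced form)
y1 = 1.0 + 1.5 * z1 + 2.0 * y2 + u1     # structural equation of interest
c = rng.uniform(2.0, 8.0, n)            # unit-specific right-censoring points
w = np.minimum(y1, c)
cens = (y1 >= c)

# Step 1: first-stage OLS of y2 on (1, z1, z2); save residuals v2h
Z = np.column_stack([np.ones(n), z1, z2])
v2h = y2 - Z @ np.linalg.lstsq(Z, y2, rcond=None)[0]

# Step 2: censored normal regression of w on (1, z1, y2, v2h), as in (16)
X = np.column_stack([np.ones(n), z1, y2, v2h])

def negll(theta):
    b, s = theta[:4], np.exp(theta[4])
    xb = X @ b
    ll = np.where(cens,
                  norm.logcdf((xb - c) / s),
                  norm.logpdf((w - xb) / s) - np.log(s))
    return -ll.sum()

res = minimize(negll, np.zeros(5), method="BFGS")
b_hat = res.x[:4]
print(b_hat)
```

A coefficient on v2h (the last element) that is significantly different from zero is evidence of endogeneity of y_2; remember that the two-step standard errors should be bootstrapped.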

reg y2 z1 ... zl
predict v2h, resid
cnreg y1 z1 ... zl1 y2 v2h, censored(cens)

With fixed censoring limits, we can use the ivtobit command and use full MLE.

We can combine corner solution responses and data censoring. For example, suppose that in a survey on charitable giving we observe several zero outcomes, which represent no contributions, and, in addition, contributions are top coded at, say, $10,000. Estimation is with a two-limit tobit (with gift in $1000s):

tobit gift inc educ married fsize age, ll(0) ul(10)
predict gifth, ystar(0,.)

How come I didn't put 10 as the upper bound in ystar()? Because I want to estimate a model for gifts, which

follows a standard corner solution Tobit at zero; the top coding is a feature of the data collection, not of the population response.

For response variables that may have very large values (wealth, income, charitable contributions), we might want to intentionally right censor a variable to guard against outliers. Of course, we might use something like LAD on the original data (if the original variable is not a corner).

CENSORED LEAST ABSOLUTE DEVIATIONS FOR LEFT OR RIGHT DATA CENSORING

If we assume the model in (13) with right censoring as in (14), but now assume

Med(u_i | x_i, c_i) = 0,  (17)

then

Med(w_i | x_i, c_i) = min(x_iβ, c_i),  (18)

and we can use Powell's CLAD estimator of β:

min_b Σ_{i=1}^N |w_i − min(x_i b, c_i)|.  (19)

This requires observing a censoring value for all individuals, which can be a problem in duration applications, but not for top coding, where c_i is a known value fixed across i. Currently, Stata only allows a fixed upper limit or a fixed lower limit (not both):

clad nettfa inc incsq age agesq e401k, ul(20)

Remember, we are doing this to estimate the parameters in Med(nettfa | x) = xβ, unlike in the corner solution case where we wanted Med(hours | x) = max(0, xβ).
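Powell's CLAD criterion (19) can be minimized directly. Because the objective is nonsmooth, the sketch below uses a derivative-free optimizer started from OLS on the censored data. All parameter values are hypothetical, and the errors are deliberately heteroskedastic and heavy-tailed, which (17) allows:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n = 4000
beta0, beta1 = 1.0, 2.0                 # hypothetical true parameters
x = rng.normal(0, 1, n)
# Heteroskedastic, heavy-tailed errors with conditional median zero -- all (17) needs
u = rng.standard_t(3, n) * (1 + 0.5 * np.abs(x))
y = beta0 + beta1 * x + u
c = 4.0                                 # fixed top-coding point
w = np.minimum(y, c)

# Powell's CLAD criterion (19): sum_i | w_i - min(x_i b, c_i) |
def clad_obj(b):
    return np.sum(np.abs(w - np.minimum(b[0] + b[1] * x, c)))

# Start from OLS on the censored data, refine with a derivative-free search
X = np.column_stack([np.ones(n), x])
b_init = np.linalg.lstsq(X, w, rcond=None)[0]
b_hat = minimize(clad_obj, b_init, method="Nelder-Mead").x
print(b_hat)
```

Unlike the censored normal MLE, nothing here assumes normality or homoskedasticity, which is precisely CLAD's appeal; the cost is that only features of the conditional median, not the full distribution, are estimated.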