CENSORED DATA AND CENSORED NORMAL REGRESSION

Data censoring comes in many forms; binary censoring, interval censoring, and top coding are the most common. They all start with an underlying linear model for $y$, the variable of interest:

$$y = x\beta + u \tag{1}$$
$$E(u \mid x) = 0, \tag{2}$$

where $x$ is $1 \times K$ with first element unity. Under (1) and (2), if we have random draws $(x_i, y_i)$ from the population, then OLS is consistent and $\sqrt{N}$-asymptotically normal for the parameters of interest, $\beta$. But suppose what we observe is a censored version of $y$. In the top coding example, suppose that wealth is measured in thousands of dollars and it is top coded at

$200,000. Then we can define the censored version of wealth (for any unit that can be drawn from the population) as $w = \min(y, 200)$. For each random draw $i$, $w_i = \min(y_i, 200)$. What would the data set look like on $(x_i, w_i)$, where $w_i$ is called wealth? We should notice that the maximum value of the $w_i$ in the sample is 200, with a nontrivial fraction of observations at exactly 200. Because there are no behavioral reasons to see a focal point for wealth at 200 (let alone to observe no values greater than 200), we would recognize that the wealth variable has been top coded at 200.

BINARY CENSORING

Suppose we want to estimate the factors that affect willingness to pay (WTP). It is hard to elicit a precise figure, so we present families with a cost and allow them simply to state whether their WTP is above the cost. The model for

the population is first assumed to be

$$\mathit{wtp} = x\beta + u, \quad E(u \mid x) = 0, \tag{3}$$

where $x_1 \equiv 1$. Let $r_i$ denote the cost of the project to household $i$. Presented with this cost, the household either says it is in favor of the project or not. Thus, along with $x_i$ and $r_i$, we observe the binary response

$$w_i = 1[\mathit{wtp}_i > r_i], \tag{4}$$

where we assume the chance that $\mathit{wtp}_i$ equals $r_i$ is zero. We are given data on $(x_i, r_i, w_i)$. What is the most natural way to proceed to estimate $\beta$? If we impose some strong assumptions on the underlying population and the nature of $r_i$, then we can proceed with maximum likelihood. In particular, assume

$$u_i \mid x_i, r_i \sim \mathrm{Normal}(0, \sigma^2). \tag{5}$$

Assumption (5) implies that (3) actually satisfies the classical linear model (CLM). It also requires that $r_i$ is

independent of $\mathit{wtp}_i$ conditional on $x_i$, that is,

$$D(\mathit{wtp}_i \mid x_i, r_i) = D(\mathit{wtp}_i \mid x_i). \tag{6}$$

This assumption is satisfied if $r_i$ is randomized or if $r_i$ is chosen as a function of $x_i$, or some combination of these. Then

$$P(w_i = 1 \mid x_i, r_i) = P(\mathit{wtp}_i > r_i \mid x_i, r_i) = P[u_i/\sigma > (r_i - x_i\beta)/\sigma \mid x_i, r_i] = 1 - \Phi[(r_i - x_i\beta)/\sigma] = \Phi[(x_i\beta - r_i)/\sigma]. \tag{7}$$

So the estimating equation is a probit model with coefficients $\beta/\sigma$ on $x_i$ and $-1/\sigma$ on $r_i$. All parameters, including $\sigma$, are identified if $x_i$ does not contain perfect collinearity and $r_i$ varies across $i$ (in a way not perfectly linearly related to $x_i$). If $r_i$ is the same for all $i$, we cannot identify the parameters. Estimate $\beta$, $\sigma$ by MLE.

The costs of the binary censoring scheme are severe. If we could observe $\mathit{wtp}_i$, then $E(\mathit{wtp}_i \mid x_i) = x_i\beta$ would suffice for consistent estimation of $\beta$; in fact, we could just

specify a linear projection and use OLS. With censoring, we must assume the underlying model satisfies the CLM. Now, discussions of the deleterious effects of nonnormality and heteroskedasticity when using probit models make much more sense.

There are ways to estimate the slope parameters up to scale without placing strong restrictions on $D(u_i \mid x_i, r_i)$. For example, if the distribution of $(x_i, r_i, y_i)$ implies linear conditional expectations for all elements conditional on $y_i$, then the Chung and Goldberger (1984) results can be used for OLS. Manski's (1975, 1988) maximum score estimator requires only symmetry of $D(u_i \mid x_i, r_i)$ (around zero), and Horowitz's (1992) smoothed version is more convenient for inference. But in every case these methods only estimate the slope coefficients up to scale (and the intercept cannot be estimated at all), and therefore we cannot learn the

magnitude of the effect of any element of $x$ on willingness to pay, nor do we have a way of predicting willingness to pay for given values of the covariates.

An interesting puzzle: What if the appropriate population model for WTP is the type I Tobit,

$$\mathit{wtp} = \max(0, x\beta + u),$$

under assumption (5)? Now, for example, $E(\mathit{wtp} \mid x)$ has the (nonlinear) Tobit form, and we can easily compute it because we have consistent estimates of $\beta$ and $\sigma$. The estimation procedure is the same provided the $r_i$ are strictly positive. Because we do not observe $\mathit{wtp}_i$, with our data $(x_i, r_i, w_i)$ we cannot distinguish between (3) and the type I Tobit model. But, if we believe that $\mathit{wtp}$ is zero for some fraction of the population, any calculations should take that into account by using the type I Tobit formulas.

Because estimation of $\beta$ and $\sigma^2$ can be sensitive to heteroskedasticity and nonnormality, it makes sense to

be flexible in specifying $D(u \mid x)$.

INTERVAL CODING

We can generalize the WTP example to allow multiple intervals. Let $y_i$ be the (unobserved) response for random draw $i$. We only know whether $y_i$ falls into a specific interval. In many cases, these intervals are fixed across all $i$, such as surveys asking people about their income bracket. In this case we say we have interval-coded data.

First let $a_1 < a_2 < \dots < a_J$ denote the known interval limits that are common across $i$; these are specified as part of the survey method. If

$$u \mid x \sim \mathrm{Normal}(0, \sigma^2), \tag{8}$$

we can estimate $\beta$ and $\sigma^2$ by MLE, provided $J \geq 2$. Not surprisingly, the structure of the problem is similar to that for an ordered probit for an ordered, qualitative response (such as a credit rating). But with ordered probit we

estimate cut points, whereas here we know the intervals and hope to estimate the parameters of the underlying CLM. In fact, we can define

$$w = 0 \text{ if } y \leq a_1$$
$$w = 1 \text{ if } a_1 < y \leq a_2$$
$$\vdots$$
$$w = J \text{ if } y > a_J \tag{9}$$

and easily obtain the conditional probabilities $P(w = j \mid x)$ for $j = 0, 1, \dots, J$. The log-likelihood is

$$\ell_i(\beta, \sigma) = 1[w_i = 0]\log\{\Phi[(a_1 - x_i\beta)/\sigma]\} + 1[w_i = 1]\log\{\Phi[(a_2 - x_i\beta)/\sigma] - \Phi[(a_1 - x_i\beta)/\sigma]\} + \dots + 1[w_i = J]\log\{1 - \Phi[(a_J - x_i\beta)/\sigma]\}. \tag{10}$$

The maximum likelihood estimators, $\hat\beta$ and $\hat\sigma^2$, are often called interval regression estimators, with the understanding that the underlying distribution is normal.

Importantly, when we obtain the interval regression estimates, we interpret the $\hat\beta$ as if we had been able to

run the regression $y_i$ on $x_i$, $i = 1, \dots, N$. Imposing the assumptions of the classical linear model allows us to estimate the parameters in the distribution $D(y \mid x)$ even though the data are censored by being put into intervals.

Sometimes in applications of interval regression the observed, censored variable, $w$, is set to some value within the interval that $y$ belongs to. For example, if $y$ is wealth, we might set $w$ to the midpoint of the interval that $y$ falls into. (Of course, we have to use some other rule if $y \leq a_1$ or $y > a_J$.) Provided the definition of $w$ determines the proper interval, the maximum likelihood estimators of $\beta$ and $\sigma$ will be the same. In Stata, for each observation specify two dependent variables, which are the lower and upper bounds for each $i$.

When $w$ is defined to have the same units as $y$, it is tempting to ignore the grouping of the data and just to

run an OLS regression of $w_i$ on $x_i$, $i = 1, \dots, N$. Naturally, such a procedure is generally inconsistent for $\beta$. Nevertheless, the results of Chung and Goldberger apply: if $E(x \mid y)$ is linear in $y$, then the OLS regression produces consistent estimators up to a common scale factor (at least for the slope coefficients).

If we allow the endpoints to depend on $i$, we encompass the case of binary censoring. More generally, we might have multiple endpoints that change across $i$. In Stata, specify the two endpoints for each observation, that is, a lower bound and an upper bound of the interval that the observation is known to fall into. The command specifies these bounds as dependent variables:

    intreg lower upper x1 x2 ... xk

When the interval limits change across $i$, we assume they do so exogenously, namely,

$$D(y_i \mid x_i, a_{i1}, \dots, a_{iJ}) = D(y_i \mid x_i). \tag{11}$$

In the WTP example, this holds because the one limit value ($J = 1$) is randomly assigned. Generally, the limits can be a function of $x_i$ (because these are being conditioned on).

Because of the underlying normality assumption, we can use the Rivers-Vuong (1988) control function approach to test and correct for endogeneity of explanatory variables. It is the analog of the Smith-Blundell approach that we discussed for Tobit. The underlying model is linear:

$$y_1 = z_1\delta_1 + \alpha_1 y_2 + u_1. \tag{12}$$

We would just like to use 2SLS on this model, but $y_1$ is interval coded (and we have the lower and upper limits for each observation).

    reg y2 z1 z2 ... zl

    predict v2h, resid
    intreg lower1 upper1 z1 ... zl1 y2 v2h

where $L_1 < L$. We are just interested in $\delta_1$ and $\alpha_1$ here, because (12) is the equation of interest. The bootstrap is easy to implement for standard errors. Again, beware of schemes for discrete $y_2$ that involve plugging in fitted values.

RIGHT AND LEFT CENSORING

Now we consider the more common cases of left and right censoring (or censoring from below and above). In top coding cases (right censoring) and minimum wage or price floors (left censoring), the censoring point is typically fixed. But in other applications, particularly duration models, the censoring point changes with $i$. The right censoring case, where $y_i$ is again the underlying variable of interest, is

$$y_i = x_i\beta + u_i \tag{13}$$
$$w_i = \min(y_i, c_i), \tag{14}$$

where $c_i > 0$ is the censoring point for unit $i$. Sometimes the linear model is specified for $y_i = \log(\mathit{durat}_i)$, and so, of course, the censoring values have to be logs, too. If we assume exogenous censoring (and exogenous explanatory variables), that is,

$$D(u_i \mid x_i, c_i) = D(u_i) = \mathrm{Normal}(0, \sigma^2), \tag{15}$$

we can use censored normal regression (also called censored Tobit). (In Stata, however, the tobit command does not allow limits to depend on $i$, so we need to use the cnreg command.) The log likelihood is similar to type I Tobit:

$$\ell_i(\beta, \sigma^2) = 1[w_i = c_i]\log\{\Phi[(x_i\beta - c_i)/\sigma]\} + 1[w_i < c_i]\log\{\phi[(w_i - x_i\beta)/\sigma]/\sigma\}. \tag{16}$$
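As a rough illustration of how the log likelihood in (16) can be evaluated, here is a minimal Python sketch using only the standard library. The function name `censored_normal_loglik`, its argument layout, and the two-observation data set are assumptions made for this example; they are not part of the lecture or of any Stata command. The sketch takes an explicit censoring flag instead of testing $w_i = c_i$ numerically, and for censored observations it uses $w_i$ itself as the censoring point, since $w_i = c_i$ there.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: provides cdf() and pdf()


def censored_normal_loglik(beta, sigma, x_rows, w, censored):
    """Sum over i of the right-censored normal log likelihood in (16).

    beta     : list of K coefficients
    sigma    : positive standard deviation of u
    x_rows   : list of N rows, each a list of K regressors
    w        : list of N observed outcomes, w_i = min(y_i, c_i)
    censored : list of N booleans, True where w_i equals c_i
    """
    total = 0.0
    for x_i, w_i, is_cens in zip(x_rows, w, censored):
        xb = sum(b * x for b, x in zip(beta, x_i))
        if is_cens:
            # Censored: log P(y_i >= c_i | x_i) = log Phi((x_i*beta - c_i)/sigma),
            # with c_i = w_i for a censored observation.
            total += math.log(_STD_NORMAL.cdf((xb - w_i) / sigma))
        else:
            # Uncensored: log of the normal density of w_i given x_i.
            total += math.log(_STD_NORMAL.pdf((w_i - xb) / sigma) / sigma)
    return total


# Hypothetical data: intercept plus one regressor, second observation censored.
x_rows = [[1.0, 0.5], [1.0, 2.0]]
w = [1.2, 3.0]
censored = [False, True]
print(censored_normal_loglik([1.0, 1.0], 1.0, x_rows, w, censored))
```

Maximizing this function over $\beta$ and $\sigma$ with a numerical optimizer would give the censored normal regression estimates; the sketch stops at evaluating the criterion.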

MLE is straightforward, as before, and we can focus just on $\hat\beta$. Note that we only have to observe $c_i$ when $y_i$ is actually censored; we just have to know which observations are censored. In some duration data sets, the censoring value is reported only when the duration is censored. This causes no problem for MLE (but does for certain semiparametric procedures).

In Stata, we need a variable that tells when an observation is censored (and whether it is left or right censored). So, cens is -1 for left censoring, 0 for uncensored, and 1 for right censoring.

    cnreg y x1 x2 ... xk, censored(cens)

Smith-Blundell applies directly. Add the reduced form residuals to cnreg and test for endogeneity. There is no need to compute complicated partial effects, but we should bootstrap the standard errors.

    reg y2 z1 ... zl
    predict v2h, resid
    cnreg y1 z1 ... zl1 y2 v2h, censored(cens)

With fixed censoring limits, we can use the ivtobit command and use full MLE.

We can combine corner solution responses and data censoring. For example, suppose that in a survey on charitable giving we observe several zero outcomes, which represent no contributions, and, in addition, contributions are top coded at, say, $10,000. Estimation is with two-limit Tobit (with gift in $1000s):

    tobit gift inc educ married fsize age, ll(0) ul(10)
    predict gifth, ystar(0,.)

How come I didn't put 10 as the upper bound in ystar()? Because I want to estimate a model for gifts, which

follows a standard corner solution Tobit at zero.

For response variables that may have very large values (wealth, income, charitable contributions), we might want to intentionally right censor a variable to guard against outliers. Of course, we might use something like LAD on the original data (if the original variable is not a corner).

CENSORED LEAST ABSOLUTE DEVIATIONS FOR LEFT OR RIGHT DATA CENSORING

If we assume the model in (13) with right censoring as in (14), but now assume

$$\mathrm{Med}(u_i \mid x_i, c_i) = 0, \tag{17}$$

then

$$\mathrm{Med}(w_i \mid x_i, c_i) = \min(x_i\beta, c_i), \tag{18}$$

and we can use Powell's CLAD estimator of $\beta$:

$$\min_b \sum_{i=1}^{N} |w_i - \min(x_i b, c_i)|. \tag{19}$$

This requires observing a censoring value for all individuals, which can be a problem in duration applications. It is not a problem for top coding, where $c_i$ is a known value fixed across $i$. Currently, Stata only allows a fixed upper limit or a fixed lower limit (not both).

    clad nettfac inc incsq age agesq e401k, ul(20)

Remember, we are doing this to estimate the parameters in $\mathrm{Med}(\mathit{nettfa} \mid x) = x\beta$, unlike in the corner solution case where we wanted $\mathrm{Med}(\mathit{hours} \mid x) = \max(0, x\beta)$.
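To make the CLAD criterion in (19) concrete, the following Python sketch evaluates the objective for a candidate coefficient vector. The function name `clad_objective` and the small noise-free data set (top coded at 20) are hypothetical and chosen only for illustration; this is not the algorithm used by Stata's clad command, which also handles the minimization.

```python
def clad_objective(b, x_rows, w, c):
    """CLAD criterion (19): sum over i of |w_i - min(x_i b, c_i)|.

    b      : list of K candidate coefficients
    x_rows : list of N rows, each a list of K regressors
    w      : list of N observed (possibly right-censored) outcomes
    c      : list of N censoring points, needed for every observation
    """
    total = 0.0
    for x_i, w_i, c_i in zip(x_rows, w, c):
        xb = sum(bk * xk for bk, xk in zip(b, x_i))
        total += abs(w_i - min(xb, c_i))
    return total


# Hypothetical data: y = 2 + 3*x with no error, top coded at 20.
x_vals = [0.0, 1.0, 2.0, 4.0, 6.0, 8.0]
x_rows = [[1.0, x] for x in x_vals]
c = [20.0] * len(x_vals)
w = [min(2.0 + 3.0 * x, ci) for x, ci in zip(x_vals, c)]

print(clad_objective([2.0, 3.0], x_rows, w, c))  # at the true coefficients
print(clad_objective([0.0, 3.0], x_rows, w, c))  # at a wrong intercept
```

Note that the function needs a censoring value for every observation, censored or not, which is the data requirement mentioned above. The objective is not smooth in $b$, so in practice it is minimized with specialized routines and not with gradient-based optimizers.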
