University of Michigan School of Public Health

Size: px

Start display at page:

Download "University of Michigan School of Public Health"

Austin Hunt
6 years ago
Views:

1 University of Michigan School of Public Health The University of Michigan Deartment of Biostatistics Working Paer Series ear 003 Paer 5 Robust Likelihood-based Analysis of Multivariate Data with Missing Values Rod Little An Hyonggin University of Michigan, rlittle@umich.edu University of Michigan, hyongg@umich.edu This working aer is hosted by The Berkeley Electronic Press (beress) and may not be commercially reroduced without the ermission of the coyright holder. htt://biostats.beress.com/umichbiostat/aer5 Coyright c 003 by the authors.

2 Robust Likelihood-based Analysis of Multivariate Data with Missing Values Rod Little and An Hyonggin Abstract The model-based aroach to inference from multivariate data with missing values is reviewed. Regression rediction is most useful when the covariates are redictive of the missing values and the robability of being missing, and in these circumstances redictions are articularly sensitive to model missecification. The use of enalized slines of the roensity score is roosed to yield robust model-based inference under the missing at random (MAR) assumtion, assuming monotone missing data. Simulation comarisons with other methods suggest that the method works well in a wide range of oulations, with little loss of efficiency relative to arametric models when the latter are correct. Extensions to more general atterns are outlined.

3 Draft Dec 4, 003 Robust likelihood-based analysis of multivariate data with missing values Roderick Little and Hyonggin An University of Michigan Hosted by The Berkeley Electronic Press

4 Abstract The model-based aroach to inference from multivariate data with missing values is reviewed. Regression rediction is most useful when the covariates are redictive of the missing values and the robability of being missing, and in these circumstances redictions are articularly sensitive to model missecification. The use of enalized slines of the roensity score is roosed to yield robust model-based inference under the missing at random (MAR) assumtion, assuming monotone missing data. Simulation comarisons with other methods suggest that the method works well in a wide range of oulations, with little loss of efficiency relative to arametric models when the latter are correct. Extensions to more general atterns are outlined. KEWORDS: double robustness, incomlete data, enalized slines, regression imutation, weighting. Introduction Missing values arise in emirical studies for many reasons. For examle, in longitudinal studies, data are missing because of attrition, when subjects dro out rior to the end of the study. In most surveys, some individuals rovide no information because of non-contact or refusal to resond (unit nonresonse). Other individuals are contacted and rovide some information, but fail to answer some of the questions (item nonresonse). Often indices are constructed by summing values of articular items. For examle, in economic studies, total net worth is a combination of values of individual assets or liabilities, some of which may be missing. If any of the items that form the index are missing, some rocedure is needed to deal with the missing data. htt://biostats.beress.com/umichbiostat/aer5

5 The missing data attern simly indicates which values in the data set are observed and which are missing. Secifically, let = ( ) denote an ( n ) rectangular dataset without y ij missing values, with ith row yi = ( yi,..., yi) where y ij is the value of variable j for subject i. With missing values, the attern of missing data is defined by the missing-data indicator matrix M = ( ) with ith row mi = ( mi,..., mi), such that m ij = if y ij is missing and m ij = 0 if y ij is m ij resent. We assume throughout that ( y, m ) are indeendent over i. i i Some methods for handling missing data aly to any attern of missing data, whereas other methods assume a secial attern. For simlicity we consider methods for the simle attern of univariate nonresonse, where missingness is confined to a single variable, say and,..., are fully observed. In Section 7 we discuss extensions of our methods to more general atterns, such as monotone missing data, where the variables can be arranged so that,,..., j+ is missing for all cases where j is missing, for all j =,..., -. This attern arises commonly in longitudinal data subject to attrition. The erformance of alternative missing-data methods deends strongly on the missingdata mechanism, which concerns the reasons why values are missing, and in articular whether missingness deends on the values of variables in the data set. For examle, subjects in a longitudinal intervention may more likely to dro out of a study because they feel the treatment was ineffective, which might be related to a oor value of an outcome measure. Rubin (976) treated M as a random matrix, and characterized the missing-data mechanism by the conditional distribution of M given, say f ( M, φ ), where φ denotes unknown arameters. When missingness does not deend on the values of the data, missing or observed, that is: f ( M, φ) = f ( M φ) for all, φ, Hosted by The Berkeley Electronic Press

6 the data are called missing comletely at random (MCAR). With the excetion of lanned missing-data designs, MCAR is a strong assumtion, and missingness often does deend on recorded variables. Let obs denote the observed values of and mis the missing values. A less restrictive assumtion is that missingness deends only on values obs that are observed, and not on values mis that are missing. That is: f ( M, φ) = f ( M, φ) for all, φ. The missing data mechanism is then called missing at random (MAR). Many methods for handling missing data assume the mechanism is MCAR or MAR, and yield biased estimates when the data are not MAR (NMAR). The main ideas of this article can be summarized in the following roositions: (a) When the missing data mechanism is unknown and NMAR, methodological otions are limited and not very aealing to the ractitioner. Thus, in studies where missing data are likely to arise, efforts should be made to render the MAR assumtion lausible, by measuring covariates that characterize nonresondents (Little and Rubin, 999). (b) The most useful covariates for nonresonse adjustment are (i) redictive of the missing values mis and (ii) redictive of the missing data indicator M. Of the two, criterion (i) is the most imortant, since conditioning on a covariate that is redictive of M but not of mis leads to a loss of efficiency without a comensating reduction in bias. Section 3 resents an analysis in suort of these statements. (c) All missing-data adjustments require modeling assumtions relating the missing data to observed covariates. Sensitivity to assumtions is a articularly serious issue for analysis involving covariates that are useful for missing-data adjustments, as described in (b). (d) Given (a) - (c), missing-data methods based on MAR and models that make relatively weak assumtions relating the covariates to the missing data are useful. Methods of this kind based on obs mis 3 htt://biostats.beress.com/umichbiostat/aer5

7 roensity slines are roosed in Sections 4 and 5 below, for the secial case of univariate nonresonse. These methods are assessed by simulation in Section 6. Some extensions of these methods to more general missing data roblems are outlined in Section 7, and Section 8 resents concluding remarks.. Limitations of NMAR analyses when the missing data mechanism is unknown. There is an extensive literature of methods for NMAR missing-data mechanisms; early examles include Heckman s (976) roosals for handling selectivity bias, and Rubin s (977) Bayesian analysis. See also Little and Rubin (00, chater 5). The difficulty of the roblem can be seen by considering the simlest situation of a single variable (that is, = ), observed for r cases and missing for n - r cases, with no covariate information. Suose the resondent values of are indeendently distributed with mean µ R and variance σ, and the nonresondent values are indeendently distributed with mean µ NR and variance σ. If the observations are indeendent, then MCAR=MAR, and µ R = µ NR. In that case, the samle mean y based on the r comlete cases is unbiased, and in many cases otimal for the mean. If, on the other hand, the data are NMAR, the bias of y for inference about the overall mean is easily seen to be f λσ, where f = ( n r)/ n is the fraction of missing values and / λ = ( µ µ )/ σ is / R NR the standardized difference in resondent and nonresondent means. Assuming asymtotic normality and ignoring t corrections, the noncoverage rate of the usual 95% confidence interval y ±.96 s / r based on the comlete cases is Φ (.96 + rfλ) +Φ (.96 rfλ), 4 Hosted by The Berkeley Electronic Press

8 Table. Coverage of 95% confidence interval for oulation mean when the resondent mean has a bias f λ = 0.. Resondent samle size Coverage rate (%) where Φ denotes the normal cumulative density function. Table tabulates this noncoverage rate as a function of the resondent samle size r, for a fixed bias of f λ = 0.. Clearly bias has an increasing distorting effect on the noncoverage as the samle size increases. Analysis otions are clearly limited in the absence of information about the nonresondents. Other than assuming the bias away, the only alternative is to widen the interval to allow for otential bias. Three aroaches to this are: (a) To develo bounds for the quantity of interest that include all ossible values of the missing data. For examle, for a binary outcome, one might calculate the samle roortion with all missing values imuted as one, and all missing values imuted as zero (Horowitz and Manski, 000). This aroach tends to be very conservative, and is limited to variables that have finite suort. (b) Conduct a sensitivity analysis for alternative models for nonignorable nonresonse (Rubin, 977; Little and Wang, 996; Scharfstein, Rotnitsky and Robins, 999). (c) Add a rior distribution for the nonresondent values and aly the Bayesian aradigm. For examle, Rubin (977) considers the model: µ R ~ const. ; µ NR µ R ~ N( µ R, λσ ) An alternative aroach is to attemt to measure covariates that cature differences between resondents and nonresondents, so that the missing-data mechanism can be considered MAR. For the remainder of this aer we consider models under the assumtion that the missing 5 htt://biostats.beress.com/umichbiostat/aer5

9 data are MAR, while recognizing that residual deendence of the missing data indicators on missing values of the data may require one of the aroaches (a) - (c) delineated above. 3. Covariates to the rescue? Suose now that fully observed covariates are available, and let,... denote the variables observed for all n cases, and r cases. The mean of can be written as the variable with missing values, observed for the first µ = E[( M ) ] + EME [ ( X)], and E[( M) ] can be estimated from the comlete cases. To estimate the second term EME [ ( X )], note that under MAR, E ( X) = E ( X, M = 0) = E ( X, M = ). Hence for incomlete cases ( M = ) one can estimate E ( X ) from the comlete cases and redict the for each incomlete case by substituting the X for that case into the regression formula. If the regression is linear, this leads to the regression estimate: µ ˆ r n ˆ = n yi + yi i= i= r+ where { yi, i =,..., r} are the observed values of, and () yˆ = βˆ + βˆ y is the rediction i 0 j= j ij from the regression of on (,..., ), comuted on the r comlete cases. Eq. () is the maximum likelihood (ML) estimate of µ for a variety of models, including multivariate normality for (,..., ) (e.g., see Little and Rubin, 00). The imact of regressing on covariates for inference about µ can be assessed by comaring the mean squared error of µ ˆ relative to the estimate based on the comlete cases, 6 Hosted by The Berkeley Electronic Press

10 y = r yi / r. Cons ider this comarison for a single covariate ( = ), where and are i= bivariate normal, and the missing data are MAR. The regression estimate () is then unbiased for µ with mean squared error (e.g. Little and Rubin, 00) ignoring ˆ ( ) mse( µ ) = ( σ / r) ( ρ ) + ( r / n) ρ + ( ρ )( r/ n), O(/ r ) terms, where ρ is the correlation between resondent values of and, and is the difference in the nonresondent and resondent mean of, divided by the resondent variance of. Note the ρ measures the association between and and measures the association between and M. The mean squared error of y is mse( y ) = ( σ / r) + ( r/ n) ρσ, where the first term on the right side is the variance and the second term is the bias. Subtracting and simlifying yields mse( y ) mse( µ ˆ ) = ( r/ n) σ ( r/ n) ρ + ρ / r ( r/ n)( ρ ) / r. () The first term in the square arenthesis in Eq. () is O () and is the bias that has been eliminated by the regression of on (more generally under NMAR one exects the regression to reduce bias, although it could increase). Both ρ and must be large for this term to be substantial. The second and third terms in the square arenthesis reresent variance reduction from the regression on. This variance reduction is substantial when ρ is large. In fact, if ρ is small and is large, as when is redictive of M but not redictive of, the net value of these terms may be negative, reflecting an increase in variance from the regression on. These results are summarized in Table. Eq. () generalizes to a multivariate set of redictors, with the 7 htt://biostats.beress.com/umichbiostat/aer5

11 Table. Effect on bias and variance of the estimated mean of of regression on a fullyobserved covariate, for combinations of the association between and ( ρ ) and the association between and M ( ). ρ Low Low bias change: 0 variance change: 0 High bias change: 0 variance change: ρ High bias change : 0 variance change: bias change : variance change: obvious generalizations of ρ and. Clearly, the key for both bias and variance reduction is that is a good redictor of. 4. Robust MAR inference with a single covariate 4.. Robust Prediction In the revious section we noted that the key to reduce mean squared error for inference about the mean of is to find redictors that are redictive of and the missing data indicator M. These are the circumstances under which inference are most sensitive to missecification of the regression of on,...,, since the bias reduction is deendent on an aroriate secification of the model relating to,...,. Thus we now consider robust alternatives to the linear additive model (). We first consider the case of a single covariate, =. Extensions to more than one covariate are discussed in Sections 5 and 7. Standard regression modeling methods, such as adding olynomial terms and interactions to the regression in (), are useful strategies. Perhas the simlest way to weaken assumtions 8 Hosted by The Berkeley Electronic Press

12 about the relationshi between and a continuous covariate is to grou the covariate into categories and regress on dummy variables for the categories. The resulting regression estimate, µ ˆ C = y, (3) c c c= is the average of the resondent mean y c in each category weighted by the samle roortion = n / n in that category. c c An attractive alternative to categorization of continuous covariates is to fit a smooth but relatively nonarametric relationshi between and the covariate (Cheng, 994). For examle, one might model the regression of on via a enalized sline (Eilers and Marx, 996, Ruert and Carroll, 000) with a ower-truncated sline basis: ( φ σ ) (, φ)~ N s (, ), q j ( φ) = φ + φ + φ ( τ ) s, y, 0 j i q+ k k + j= k= K q, (4) q q where q is the degree of olynomial, ( x) x I( x 0) =, τ + < < τk are selected fixed knots, and K is the total number of knots. Then, the enalized least-squares estimator φˆ = ( φ ) T 0,, φ q + K can be obtained by minimizing the enalized sum of squared errors n q K K j q yi φ0 φjyi φq+ k( yi τ k) + λ ζ ( φq+ k), + i= j= k= k= where ζ is a suitable nonnegative function, and λ is a smoothing arameter. The smoothing arameter can be estimated by generalized cross validation or by ML for a linear mixed model, treating ( φ,, 0 φq ) T as a fixed arameter vector and ( φ φ ) T,, as a random vector. Cheng q+ q+ K (994) achieves nonarametric smoothing by another method, kernel regression; an attractive feature of the ML version of enalized slines is that they are easily imlemented with widely 9 htt://biostats.beress.com/umichbiostat/aer5

13 available software such as PROC MIXED in SAS (SAS, 99) and lme( ) in S-lus (Pinheiro and Bates, 000). 4.. Weighting the comlete cases. An alternative to rediction, commonly used for unit nonresonse adjustments in samle surveys, is to weight the comlete cases by the inverse of an estimate of the robability of resonse (e.g. Little and Rubin, 003, Section 3.3). The mean of can be written as: µ ( M ) M, = E / E π( ) π( ) where π ( ) = Pr( M = 0 ) is the robability that is observed given. The denominator in this equation can be ignored under correct secification of the π ( ), since it then equals one. Relacing oulation quantities by samle estimates yields the weighted comlete-case estimate: or where the weight r r µ ˆ yw = wy i i / wi i= i=, (5) r µ ˆ yw = wy i i / n, (5A) i= w i for resondents is a recirocal of an estimate of y i π ( ). If is groued into categories, and resondents in category c are weighted by the inverse of the estimated resonse rate r / n in category c, then the resulting estimator (5) or (5A) is identical to the c c regression estimate (3). Note that if the true resonse rate is the same for all the categories c, as when the data are MCAR, then weighting by the true resonse rate yields the unweighted samle mean y based on the comlete cases, which is less efficient if the categorized covariate is redictive of resonse. This is a simle and instructive illustration of increased efficiency when weights are estimated from the samle rather than from oulation arameters (e.g. Robins, Rotnitsky and Zhao, 994). 0 Hosted by The Berkeley Electronic Press

14 In section 4. we used slines to smooth the redictions from a regression model. A different use of smoothing is to smooth the weights in the weighted estimator (5). That is, the weights are relaced by the inverse of the estimated roensity to resond, comuted by fitting a sline to the logistic regression of the missing-data indicator M on. w = / π ( φˆ ), π ( φ) = Pr( M = y, φ), i i i i i ( π φ ) logit ( ) = s ( y ; φ), i M i (6) where sm( yi; φ ) is a sline for the binary outcome M i analogous to (5), with unknown arameters φ, and ˆ φ is an estimate of φ. The latter can be obtained by fitting a generalized linear mixed model for the sline regression of M on (Breslow and Clayton, 993). The utility of slines for rediction, as in Eq. (4), and for weighting, as in Eq. (6) is comared in the simulations in Section Calibration estimators weighting: The mean of can be written in a way that combines the features of rediction and µ ( M ) = E E + E E π( ) ( ( )) ( ( )) Estimating quantities in this exression leads to a calibration estimator of the form r n µ ˆ ˆ ˆ = n wi( yi yi) + n yi i= i=. (7) where the redictions y ˆi from the model are calibrated by adding a term consisting of weighted residuals from the model. The estimator (7) has roerties of semi-arametric efficiency and double-robustness (Robins, Rotnitsky and Zhao, 994; Robins and Rotnitsky, 00), in the sense that the estimate is consistent if just one of the models for rediction and weighting are correctly secified. However, since the calibration of the redictions is to correct effects of. htt://biostats.beress.com/umichbiostat/aer5

15 model missecification, we believe that the calibration of the redictions in Eq. (7) is unnecessary if the rediction model does not make strong arametric assumtions, as in (4). This conjecture is suorted by the simulation studies in Section Robust MAR Inferences with More than One Covariate With sufficient samle size, a enalized sline rovides a useful model for redictions based on a single covariate. With several covariates an additive model might be fitted with slines on the continuous covariates. In articular, Scharfstein and Izzary (003) consider a flexible class of estimators that includes (7) as a secial case where the roensity score model and mean model follow generalized additive regressions. We roose here a rediction model that addresses the curse of dimensionality by focusing the sline on a articular function of the covariates most sensitive to model missecification, namely the roensity score. Suose that is subject to missing values and,..., are fully observed covariates, and 3 so that there are at least covariates. We first define the logit of the roensity score for to be observed, given the covariates,..., : ( ) = logit Pr( M = 0,..., ). (8) The key roerty of the roensity score is that conditional on the roensity score and assuming MAR, missingness of does not deend on,..., (Rosenbaum and Rubin, 983). Thus the mean of can be written as: µ = E[( M ) ] + EM [ E ( )]. This motivates the following method: Hosted by The Berkeley Electronic Press

16 (a) Estimate by a logistic regression of M on (,..., ), yielding estimated roensity ˆ ; this regression can include nonlinear terms and interactions in (,..., ), and with sufficient data could be modeled by a sline as in (6). (b) Predict the missing values of by a sline regression of on ˆ. Since variables other than ˆ may be good redictors of, the other covariates are entered in the regression arametrically, for examle as linear additive terms. More secifically, we relace one of the redictor variables, say, by multicollinearity; we then redict the missing values of, to avoid using the following model for the distribution of,... given : (( ) Σ) ( + ) (,..., )~ N s ( ),..., s ( ), (,,...,, β)~ N s ( ) g (,,...,, β), σ, (9) where s ( ) = E ( ) is a sline for the regression of j on j j, = s ( ), j j j j =,...,, and g is a arametric function indexed by unknown arameters β, with the roerty that g (,0,...,0, β ) = 0 for all β. (0) Examles of functions g() that satisfy (0) include a linear additive model for (,..., ), g (,,...,, ) =, () β βj j j= and a model that includes first order interactions between and (,..., ): β = βj j + β j+ j j= j=. g (,,...,, ) 3 htt://biostats.beress.com/umichbiostat/aer5

17 We call Eq. (9) a roensity sline rediction model. The idea of exlicitly including the roensity score as a covariate in the rediction model was reviously roosed by David et al. (983) and in a more general context in Robins (999). The use of a sline on one regressor variable is an alication of u and Ruert s (00) artially linear single-index model in the missing-data setting. The following theorem defines a double robustness roerty of rediction estimates of the mean of based on the model (9). Theorem. Let µ ˆ be the rediction estimator () based on the model (9), and assume MAR. Then µ ˆ is a consistent estimator of µ if either (a) the mean of given (,,..., ) in the model (9) is correctly secified, or (b) the roensity is correctly secified, and (b) E ( ) = s ( ) for j=,...,, that is, the slines j j s ( ) correctly model the regressions of j j on for j =,...,. The robustness feature of condition (b) is that the regression function g does not have to be correctly secified. Outline roof of Theorem. Consistency under (a) follows from the usual roerties of rediction under a well-secified regression model. For consistency under (b) and (b), we need to show that n ( i ) yˆ /( n r) E ( M = ) as ( n r). () i= r+ Let ˆi y be a rediction for a nonresondent ( i = r+,..., n). Note that yˆ = s ( ˆy ) + g( yˆ, yˆ,..., yˆ ; β ), i i i i i 4 Hosted by The Berkeley Electronic Press

18 where yˆ ˆ ( ˆ ij = yij sj yi) and s ˆ j denotes the samle estimate of the sline s j. Since by assumtion the roensity model and the slines are correctly secified, yˆ y = s ( y ) + gy (, y,..., y ; β ), i i i i i i where β is the limiting value of ˆβ, and the estimates y ˆi and s ˆ j have been relaced by limiting values y i and s j. Now ( β ) ( β ) ( β ) E( y y ) = s ( y ) + E g( y, y,..., y ; ) y i i i i i i i s ( y ) + g y, E( y y ),... E( y y ); i i i i i i = s ( y ) + g y,0,...0; = s ( y ) = E( y y ). i i i i i Hence the conditional exectation of y ˆi given y i converges to E y y, which equals ( i i) E y y m = by the balancing roerty of roensity scores. Hence ( i i, i ) n i= r+ y ˆ /( ) i n r converges to E ( M = ), as required for consistency. The double robustness roerty in this theorem is more restrictive than the double robustness roerty for the calibration estimator (7), which requires only (a) or (b) and does not require (b) in addition to (b). On the other hand, (b) is weak, since the univariate regressions on are modeled nonarametrically in (9) with slines { s ( )}. The inclusion of g in (9) does not affect consistency even if it is missecified, and it has the otential of imroving efficiency, as for examle when the variables (,..., ) are redictive of but the roensity score is not. The roerties of roensity sline rediction are further exlored in the simulation study in the next section. j 5 htt://biostats.beress.com/umichbiostat/aer5

19 6. Simulations We conducted three simulation studies to examine the erformance of the estimators of the mean of with missing data under MAR. The first simulation study involved a single fullyobserved covariate, generated from a uniform distribution between and, and one variable with missing values, generated from a normal distribution with one of four mean structures: (I) constant : ( 6, ) N, (II) linear : N( 6 0, ) +, ( ), (III) cubic: N ( ) + +, and ( 6 5sin, ) (IV) sine : N ( π ) +. The exected value of is 6 for four mean structures, and (II) (IV) model a strong redictive relationshi between and. We created missing values from four different models for the roensity to resond: ( ) (I) constant (MCAR) : r( M ) (II) linear : logit r( M 0 ) logit = 0 = 0.5, ( ) = 3 (III) cubic : ( ) =, 3 ( r M ) logit = 0 = 3, and (IV) sine: ( ) ( r M ) ( π ) logit = 0 =.5sin. The resonse rate for all these roensity structures is 0.5, and (II) (IV) model a strong redictive relationshi between and M. The non-mcar simulations are thus focused on the fourth cell of Table, when a well-secified regression adjustment has strong gains. For each combination of mean and roensity structure, 500 simulated data sets with samle size n=00 6 Hosted by The Berkeley Electronic Press

20 were generated. Then, six estimators of the mean of are comared with the mean of before deletion (BD), namely: (CC) The comlete-case estimate, deleting the incomlete cases. (LP) The rediction estimator () based on linear regression. (SP) The rediction estimator () based on a enalized sline regression model (4). (LW) The weighted estimator (5) with weights comuted as the inverse of the resonse roensity estimated by linear logistic regression of the missing-data indicator on. (SW) The weighted estimator (5) with weights comuted as the inverse of the resonse roensity estimated by sline logistic regression of the missing-data indicator on, as in (6). (SPW) The calibration estimator (7) with redictions comuted as for SP and weights comuted as for SW. For the enalized slines methods, we chose 0 equally saced fixed knots over and a truncated linear basis. The results from this simulation study are summarized in Table 3 and Figure. For each combination of mean structure and resonse roensity structure and for each estimator, the standardized bias STDBIAS = 00 (bias/emirical standard error), is tabulated, where bias is the deviation of the average estimate over the 500 simulated data sets from the true arameter value, and the emirical standard error is standard deviation of the estimates over the 500 simulated data sets. Also the relative root mean squared error comared with the BD estimator RRMSE = 00 (RMSE(estimator) RMSE(BD))/RMSE(BD) is tabulated, where RMSE is the square root of the average squared deviation of the estimate from the true value over the 500 simulated data sets. The RRMSE values for methods other than 7 htt://biostats.beress.com/umichbiostat/aer5

21 CC are dislayed in Figure, with means and medians as summaries of overall erformance. We conclude the following from Table 3 and Figure : () In terms of median or mean RRMSE over roblems, SP and SPW are the best methods, SW, LW and LP are intermediate, and CC is much worse than the other methods. In articular, methods that include redictions based a sline (SP or SPW) do better than the method that uses weights based on a sline (SW). () The bias of SP is minor and comarable to that of SPW, and the biases and RRMSE s of SP and SPW are very similar. This suggests that calibration is not needed for SP, although it does not hurt the method. (3) For the constant mean model, and are indeendent; in this case the missing data mechanism is always MCAR since missingness deends on a variable that is indeendent of. None of the methods dislay bias in this situation, as theory would redict. For data from the constant roensity model, the RRMSE s of all the methods are very similar; for data from the other roensity models, CC analysis is best. For non-constant mean models and missing-data mechanisms other than MCAR, CC analysis has a large bias and is not cometitive with other methods. (4) When and are linearly related, LP is the best method, as redicted by theory, but SP is nearly as good, showing little loss in efficiency. LW and SW are noticeably inferior in this case. (5) When and are not linearly related and data are not MCAR, LP redictably suffers from bias from model missecification; SP does much better in these cases since it is not based on a linearity assumtion. (6) When the model for the roensity is not linear, there is some evidence that SW is better than LW, consistent with the fact that SW does not make a linearity assumtion for the logit of the roensity. However gains are less dramatic than for SP over LP. 8 Hosted by The Berkeley Electronic Press

22 The second simulation study involved two fully observed covariates and, and one variable 3 with missing values. We generated and as indeendent uniform deviates between and, and 3 from a normal distribution with one of four mean structures: (I) constant: ( 0, ) N, ( 0 3, ) (II) linear : N ( ) + +, ( , ) (III) additive : N ( ) ( ) + +, ( 0 4, ) (IV) non-additive : N ( ) The exected value of 3 for all these mean structures is 0. We simulated four resonse roensity structures, all of which yield an exected resonse rate of 0.5: ( ) (I) constant : r( M ) logit = 0, = 0.5, ( ) (II) linear : logit ( 0, ) r M = = +, 3 3 ( r M ) (III) additive : logit ( 0, ) = = +, ( ) logit r M = 0, = (IV) non-additive : ( ) For each combination of mean and resonse roensity structures, 500 simulated data sets of size n=00 were generated. Then, we comared the mean before deletion (BD) with the following eight estimators of the mean of 3 from the incomlete data: (CC) The comlete-case mean. (LP) Regression rediction from a linear regression of 3 on and. (ASP) Additive Sline Prediction, namely the rediction estimator () based on an additive regression model of 3 on enalized slines for and. 9 htt://biostats.beress.com/umichbiostat/aer5

23 (LW) The weighted estimator (5) with weights comuted as the inverse of the resonse roensity estimated by linear logistic regression of the missing-data indicator on and. (ASW) Additive Sline Weighting, namely the weighted estimator (5) with weights comuted as the inverse of the resonse roensity estimated by an additive sline logistic regression of the missing-data indicator on and. (ASPW) The calibration estimator (7) with redictions comuted as for ASP and weights comuted as for ASW. (SPP) Penalized sline roensity rediction based on a regression of 3 on the sline of the linear redictor of the estimated roensity to resond from a linear logistic regression of M on, ; that is, Eq. (9) with g = 0. (SPPL) Penalized sline roensity rediction based on Eq. (9) with a linear arametric term for. That is, Eq. (9) with g given by Eq. (). We chose 5 equally saced knots over and resectively and a truncated linear basis for the ASW and ASPW. We also chose 0 equally saced knots over the estimated resonse roensity and a linear truncated basis for the SPP and SPPL. Results in Table 4 and Figure can be summarized as follows: () The roensity sline rediction methods (SPP, SPPL) and the rediction methods based on additive slines (ASP, ASPW) do best overall, followed by the linear rediction methods LP and the weighting methods, LW and ASW; CC is much worse than the other methods since it is very biased excet for simulations with a constant mean model or an MCAR resonse roensity. () The methods based on additive slines (ASP, ASPW) do slightly better than roensity sline methods (SPP, SPPL) when the rediction model is additive, and considerably worse when the rediction model is non-additive. In that case the additive models are missecified, and 3, 0 Hosted by The Berkeley Electronic Press

24 the calibration estimator is also biased when the roensity model is not additive. In fact all the methods were biased for this rather demanding roblem, but the roensity sline methods have less bias than the others. (3) Little gain was seen from adding to the roensity sline, since SPP and SPPL erformed very similarly. Greater gains might be exected in roblems with more useful covariates. (4) Results for ASPW are very similar to those of ASP, suggesting that calibration of ASP has little effect for this articular set of roblems. (5) As in the first simulation, the weighting methods LW and ASW are less efficient than the rediction methods, and smoothing the weight by a sline does not aear to hel much. This simulation study does not dislay the otential for SPPL to yield gains in recision over SPP when the covariates other than the roensity are redictive of the outcome. We thus simulated two additional cases where the logit of resonse was linear in +, but the mean was more strongly correlated with. In the first case the mean of 3 is a nonlinear additive function of and, namely 3, ~ N( (3 3) 3 -(3/7)(3 3) 3, ). In the second case the mean of 3 is a nonlinear non-additive function of and, namely 3, ~ N( 0 ( ), ): The results of these simulations are shown in Table 5. Note that in these cases both SPP and SPPL dislay minimal bias, but SPPL shows the exected gain in recision, reflected in lower RRMSE s. As might be exected, the methods that fit additive slines, ASP and ASPW, have the lowest RRMSE s in the additive case, but are inferior to SPP and SPPL in the non-additive case. The results of any simulation study are limited by the choice of oulations simulated, and should be interreted with caution. Our conclusion from these simulations is that the redictions from sline models can yield relatively robust estimates of the oulation mean. htt://biostats.beress.com/umichbiostat/aer5

25 With several covariates, additive slines work well when the effects of the variables are additive, but the roensity sline method rovides an attractive alternative way of addressing the curse of dimensionality. 7. Extensions to Monotone and General Patterns The roensity sline model (9) of the revious section can be extended to a monotone attern by alying it sequentially to each block of missing variables. Missing values of covariates are relaced by their redictions in this sequence of regressions. Multile imutation versions of this aroach, where draws from the redictive distribution are imuted rather than means, and extensions to general atterns of missing data based on the sequential imutation methods of Raghunathan et al. (00), will be examined in future research. 8. Conclusions Desite the large literature devoted to nonignorable missing data adjustments, we believe that the key to successful treatment of missing data is to measure covariates that are redictive of the missing values, and to model carefully the relationshis between the missing variables and these covariates. Likelihood-based methods based on multivariate models for the data are useful tools for making efficient use of the available data, but standard models such as the multivariate normal imly linear additive relationshis between the variables that may be too simlistic in certain settings. We roose easily-fitted sline models that yield regression redictions that are more robust to nonlinearity in the relationshi between the missing variables and the covariates, under the MAR assumtion. A key idea is to single out the roensity score for this robust form of modeling. Hosted by The Berkeley Electronic Press

26 A limitation of the work on roensity sline methods described here is that it is focuses on oint estimation. Inferences for the roensity sline rediction model require valid estimates of standard errors, and ideally Student t corrections for small samles. Possible aroaches to estimating standard errors include: (a) comuting the estimate on a set of bootstra samles, and calculating a bootstra standard error from the samle variance over the bootstra samles, or from ercentiles of the bootstra distribution; (b) ignoring samling error in estimating the roensity π (,..., ), and using asymtotic standard errors for the model (9) based on standard linear mixed model formulae. (c) using the roensity sline rediction model to multily-imute draws from the redictive distribution of the missing values, and then using multile imutation methods for estimating the variance (e.g. Rubin, 987; Little and Rubin, 00, chater 0). Simulations comaring these aroaches are currently underway. Future work will also consider extensions to general atterns of missing data and non-normal outcomes, based on extensions of the sequential imutation method of Raghunathan et al. (00). Acknowledgments This research was suorted by National Science Foundation Grant DMS We greatly areciate the helful comments of Bin Nan, Trivellore Raghunathan, an associate editor and a referee. 3 htt://biostats.beress.com/umichbiostat/aer5

27 References Breslow, N.E. and Clayton, D.G. (993). Aroximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9-5. Cheng, P.E. (994). Nonarametric Estimation of Mean Functionals with Data Missing at Random. Journal of the American Statistical Association, 89, David, M., Little, R.J.A., Samuhel, M.E. and Triest, R.K. (983). Imutation models based on the roensity to resond. Proceedings of the Business and Economics Section, American Statistical Association 983, Eilers, P.H.C. and Marx B.D. (996). Flexible smoothing with B-slines and enalties (with discussion), Statistical Science, 89-. Heckman, J. (976). The common structure of statistical models of truncation, samle selection and limited deendent variables, and a simle estimator for such models, Annals of Economic and Social Measurement 5, Horowitz, J.L. and Manski, C.F. (000). Nonarametric analysis of randomized exeriments with missing covariate and outcome data, Journal of the American Statistical Association, 95, (with discussion). Little, R.J. and Rubin, D.B. (999). Comment on Adjusting for Non-Ignorable Dro-out Using Semiarametric Models by D. O. Scharfstein, A. Rotnitsky and J.M. Robins. Journal of the American Statistical Association, 94, Little R.J.A. and Rubin, D.B. (00). Statistical Analysis with Missing Data, Wiley, New ork. Little, R.J., and Wang, -X (996). Pattern-mixture models for multivariate incomlete data with covariates, Biometrics 5, Hosted by The Berkeley Electronic Press

28 Pinheiro, J.C. and Bates, D.M. (000). Mixed-Effects Models in S and S-PLUS, New ork: Sringer-Verlag. Raghunathan, T. E., Lekowski, J.M., VanHoewyk, J. and Solenberger, P. (00). A multivariate technique for multily imuting missing values using a sequence of regression models. Survey Methodology, 7, Robins, J.M. and Rotnitsky (00). Comment on Inference for semiarametric models: some questions and an answer by P. Bickel and J. Kwon. Statistica Sinica,, Robins, J.M., Rotnitsky, A. and Zhao, L.P. (994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, Rosenbaum, P.R. and Rubin, D.B. (983). The central role of the roensity score in observational studies for causal effects, Biometrika 70, Rubin, D.B. (976). Inference and missing data, Biometrika 63, Rubin, D.B. (977). Formalizing subjective notions about the effect of nonresondents in samle surveys, Journal of the American Statistical Association 7, Ruert, D. and Carroll R.J. (000). Satially adative enalties for sline fitting, Australia and New Zealand Journal of Statistics 4, SAS (99). The Mixed Procedure, Chater 6 in SAS/STAT Software: Changes and Enhancements, Release 6.07, Technical Reort P-9, SAS Institute, Inc., Cary: NC. Scharfstein, D. and Irizarry, R. Generalized additive selection models for the analysis of nonignorable missing data. Biometrics, in ress. 5 htt://biostats.beress.com/umichbiostat/aer5

29 Scharfstein, D., Rotnitsky, A. and Robins, J. (999). Adjusting for nonignorable droout using semiarametric models, J. Am. Statist. Assoc. 94, (with discussion). u,. and Ruert, D. (00). Penalized sline estimation for artially linear single-index models, Journal of the American Statistical Association 97, Hosted by The Berkeley Electronic Press

30 Table 3. Simulation Study Comaring Estimators with a Single Covariate. Proensity Model Constant (MCAR)(I) Linear (II) Cubic (III) SINE (IV) Mean Model STDBIAS RRMSE STDBIAS RRMSE STDBIAS RRMSE STDBIAS RRMSE Constant (I) BD CC LP SP LW SW SPW BD Linear (II) CC LP SP LW SW SPW BD CC Cubic (III) LP SP LW SW SPW BD CC SINE (IV) LP SP LW SW SPW htt://biostats.beress.com/umichbiostat/aer5

31 Table 4. Simulation Study Comaring Estimators with Two Covariates. Proensity Model Constant (MCAR) (I) Linear (II) Additive (III) Non-Additive (IV) Mean Model STDBIAS RRMSE STDBIAS RRMSE STDBIAS RRMSE STDBIAS RRMSE Constant (I) Linear (II) Additive (III) Non- Additive (IV) BD CC LP ASP LW ASW SPP SPPL ASPW BD CC LP ASP LW ASW SPP SPPL ASPW BD CC LP ASP LW ASW SPP SPPL ASPW BD CC LP ASP LW ASW SPP SPPL ASPW Hosted by The Berkeley Electronic Press

32 Table 5. Sulemental Simulations where Covariates Other than the Proensity Score are Predictive of Outcome. Additive Mean Non-Additive Mean METHOD STDBIAS RRMSE STDBIAS RRMSE BD CC LP ASP LW ASW SPP SPPL -3 ASPW htt://biostats.beress.com/umichbiostat/aer5

33 Figure. Histograms of RRMMSE for Methods other than CC Analysis, for Simulation with a Single Covariate SW SPW mean = 34.6 mean = median = 9.0 median = Percent of Total 60 LP mean = 47.4 SP mean =.7 0 LW mean = 4.0 median = 35.0 median = 0.0 median = Histograms of RRMSE for Simulation 30 Hosted by The Berkeley Electronic Press

34 Figure. Histograms of RRMMSE for Methods other than CC Analysis, for Simulation with Two Covariates SPP SPPL ASPW mean = 8. mean = 7.9 mean = median = 8.5 median = 9.5 median = Percent of Total 60 LP mean = 43.0 ASP mean = 43.0 LW mean = ASW mean = 54.8 median = 6.5 median = 5.5 median = 34.5 median = Histograms of RRMSE for Simulation 3 htt://biostats.beress.com/umichbiostat/aer5

Extensions of the Penalized Spline Propensity Prediction Method of Imputation

Extensions of the Penalized Spline Propensity Prediction Method of Imputation Extensions of the Penalized Sline Proensity Prediction Method of Imutation by Guangyu Zhang A dissertation submitted in artial fulfillment of the requirements for the degree of Doctor of Philosohy (Biostatistics)