Extensions of the Penalized Spline Propensity Prediction Method of Imputation


by
Guangyu Zhang

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Biostatistics) in The University of Michigan, 2007

Doctoral Committee:
Professor Roderick J. Little, Chair
Professor Susan A. Murphy
Professor Trivellore E. Raghunathan
Associate Professor Bin Nan

© Guangyu Zhang 2007

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Roderick Little, for his advice, support and intellectual stimulation. I also want to thank my husband, Huanyuan Sheng, for his love.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT

CHAPTER

I. INTRODUCTION

II. EXTENSIONS OF THE PENALIZED SPLINE PROPENSITY PREDICTION (PSPP) METHOD OF IMPUTATION
    Abstract
    2.1 Introduction
    2.2 Penalized Spline of Propensity Prediction (PSPP)
    2.3 PSPP is not doubly robust for subgroup means
    2.4 Stratified Penalized Spline Propensity Prediction for subgroup means
    2.5 A Bivariate PSPP Method for estimating the conditional mean of Y given a continuous covariate
    2.6 An Example: Online Weight Loss Study
    2.7 Discussion
    Appendix

III. A COMPARATIVE STUDY OF THE PENALIZED SPLINE PROPENSITY PREDICTION METHOD WITH ALTERNATIVE DOUBLY ROBUST ESTIMATORS
    Abstract
    3.1 Introduction
    3.2 Doubly robust estimators
    3.3 Simulation studies
    3.4 An Example: Online Weight Loss Study
    3.5 Conclusion

IV. THE PSPP METHOD FOR THE MONOTONE PATTERN MISSING DATA
    4.1 Introduction
    4.2 PSPP for the monotone pattern of missing data
    4.3 Simulation study
    4.4 Example
    4.5 Conclusion

V. CONCLUSION AND THE FUTURE WORK

REFERENCES

LIST OF FIGURES

Figure
1 % Increase of RMSE of Simulation 1
2 % Increase of Width of CI of Simulation 1
3 Non-coverage rate of Simulation 1
4 % Increase of RMSE of Simulation 2
5 % Increase of Width of CI of Simulation 2
6 Non-coverage rate of Simulation 2
7 % Increase of RMSE of Simulation 3
8 % Increase of Width of CI of Simulation 3
9 Non-coverage rate of Simulation 3
10 Example of monotone missing data structure

LIST OF TABLES

Table
2.1 Example 1: Empirical Bias, Standard Deviation (SD) and Root Mean Squared Error (RMSE) for (A) Marginal mean of Y, and (B) Conditional Mean of Y given $X_1$
2.2 Example 2: Empirical Bias, Root Mean Squared Error (RMSE) and Coverage rate (Cov) for (A) Marginal mean of Y, (B) Conditional Mean of Y given $X_1$, and (C) Intercept and Slopes for Regression of Y on $X_1$, $X_2$
2.3 BMI reduction within groups
3.1 Simulation 1 classified by degree of misspecification in the mean function and the degree of diversity of the propensity function
3.2 Simulation 2 classified by degree of misspecification in the mean function and the degree of diversity of the propensity score
3.3 Simulation 3 classified by degree of misspecification in the mean function and the degree of diversity of the propensity score
3.4 Empirical bias, empirical standard error and RMSE when the propensity function is wrongly specified
3.5 BMI reduction within groups
4.1 Bias, STD and RMSE for the marginal and conditional means
4.2 Covariates in the propensity model
4.3 The baseline covariates in the g function of model b, d and f
4.4 BMI reduction within groups

ABSTRACT

Little and An (2004) proposed a penalized spline propensity prediction (PSPP) method of imputation of missing values that yields robust model-based inference under the missing at random assumption. The propensity score for a missing variable is estimated and a regression model is fit with the spline of the propensity score as a covariate. The predicted marginal mean of the missing variable is doubly robust (DR) under misspecification of the imputation model. In the first part of the thesis, we study properties of a simplified version of the PSPP that does not center the regressors prior to including them in the prediction model. We then extend the PSPP to multivariate data with both continuous and categorical variables so as to yield consistent estimates of both marginal and conditional means. The extended PSPP method is compared with the PSPP method and simple alternatives in a simulation study.

For the second part of the thesis, we compare the PSPP method with several other DR estimators. The PSPP method uses a spline of the propensity score to impute the missing values and the resulting estimates have a double robustness property. The DR property can also be achieved by modeling the relationship parametrically, as in the linear in the weight method and the calibration method (Firth and Bennett, 1998; Scharfstein, Rotnitzky and Robins, 1999; Robins, Rotnitzky and Zhao, 1994). We compare root mean square error (RMSE), width of confidence interval and non-coverage rate of the PSPP method and these alternatives under different mean functions and propensity score functions. We also study the effects of sample size and misspecification of the propensity scores. The PSPP method yields estimates with smaller RMSE and narrower confidence intervals than the other methods in most situations. It yields estimates with non-coverage rates close to the 5% nominal level.

For the third part of the thesis, we extend the PSPP method to monotone missing data. We propose to impute the missing values based on a stepwise PSPP procedure, and simulation studies show that the stepwise PSPP method yields consistent marginal and conditional mean estimates. We illustrate the proposed method by applying it to an online weight loss study conducted by Kaiser Permanente. We finish the thesis with a short discussion and future work.

CHAPTER I

INTRODUCTION

Missing data arise in scientific research for many reasons. For example, in a two-stage clinical study, a subset of subjects is selected for expensive medical tests and those who have not been selected will have missing values. On the other hand, subjects who have been selected may drop out of the study so that it is impossible to collect test results. Whatever the reasons, it is important to include the information in the incomplete cases in the analysis to yield efficient estimators and better inferences.

A dataset with missing values can be described by the missing-data pattern, which indicates which observations are present and which are missing. Let $Y = (y_{ij})$ be an $(n \times p)$ rectangular data set with the $i$th row $y_i = (y_{i1}, \ldots, y_{ip})$, where $y_{ij}$ is the $j$th observation for subject $i$, $j = 1, \ldots, p$. Let $M = (m_{ij})$ be a missing-data indicator matrix with the $i$th row $m_i = (m_{i1}, \ldots, m_{ip})$, such that $m_{ij} = 1$ if $y_{ij}$ is missing and $m_{ij} = 0$ if $y_{ij}$ is present. The matrix $M$ then represents the missing-data pattern of the dataset $Y$. We assume that $(y_i, m_i)$ are independent over $i$ throughout the dissertation.

We first focus on univariate missing data, where missingness is confined to a single variable. Let $(X_1, \ldots, X_p, Y)$ denote a $(p+1)$-dimensional vector of variables with $Y$ subject to missing values and $X_1, \ldots, X_p$ fully observed covariates. We consider the problem of estimating the mean of $Y$, the conditional means of $Y$ in subclasses defined by a categorical variable, and the regression coefficient of $Y$ on a continuous variable.
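As a small illustration of the notation above, the following sketch builds a missing-data indicator matrix from a toy data matrix and tabulates the distinct missing-data patterns. The data values and the use of NumPy with `np.nan` as the missing-value marker are assumptions made for this example only; they are not taken from the dissertation.

```python
import numpy as np

# Toy data matrix Y (n = 4 subjects, p = 3 variables); np.nan marks a missing entry.
# The values are hypothetical and serve only to illustrate the notation.
Y = np.array([
    [1.2, 0.7, 3.1],
    [0.4, np.nan, np.nan],
    [2.2, 1.9, np.nan],
    [1.0, 0.3, 2.8],
])

# Missing-data indicator matrix: m_ij = 1 if y_ij is missing, 0 if it is present.
M = np.isnan(Y).astype(int)

# The distinct rows of M are the missing-data patterns; counts gives their frequencies.
patterns, counts = np.unique(M, axis=0, return_counts=True)

# Complete cases are the rows of M that contain no 1.
complete_case = M.sum(axis=1) == 0
```

Here `patterns` has one row per distinct pattern, so a univariate missing-data problem would show exactly two patterns that differ in a single column.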

Estimation of the marginal and conditional means of $Y$ requires assumptions on the missing-data mechanism, which concerns the relationship between the missingness and the values of the variables in the data matrix. Rubin (1976) treated $M$ as a random matrix and described the missing-data mechanism by the conditional distribution of $M$ given $Y$, $f(M \mid Y, \phi)$, where $\phi$ denotes unknown parameters. When missingness does not depend on $Y$, missing or observed, that is, $f(M \mid Y, \phi) = f(M \mid \phi)$ for all $Y, \phi$, the data are called missing completely at random (MCAR). If the missingness depends only on $Y_{obs}$, the observed part of $Y$, but not on the missing part $Y_{mis}$, that is, $f(M \mid Y, \phi) = f(M \mid Y_{obs}, \phi)$ for all $Y_{mis}, \phi$, then the missing-data mechanism is called missing at random (MAR). If the missingness depends on the missing part of the variables, that is, $f(M \mid Y, \phi) = f(M \mid Y_{obs}, Y_{mis}, \phi)$ for all $\phi$, then the data are called not missing at random (NMAR). MAR is a less restrictive assumption than MCAR. In applications researchers are encouraged to make efforts to render the MAR assumption plausible by measuring covariates that characterize the nonrespondents (Little and Rubin, 1999). We assume the missing data are MAR throughout the dissertation.

Many methods have been proposed to deal with missing information. A simple approach is complete-case analysis (CC), which deletes units with $Y$ missing, so information contained in the deleted cases is lost. In the context of our problem, CC analysis yields consistent estimates of the marginal mean of $Y$ if the missing-data mechanism is missing completely at random. But it yields biased estimates if the missingness of $Y$ depends on the observed covariates $X_1, \ldots, X_p$ or depends on $Y$.

Weighted complete-case analysis is an alternative to CC analysis (Little, 1986; Horvitz and Thompson, 1952; Cochran, 1968). Let $r$ be the number of complete cases; the marginal mean of $Y$ can be estimated as

$$\hat{\mu} = \Big(\sum_{i=1}^{r} w_i y_i\Big) \Big/ \Big(\sum_{i=1}^{r} w_i\Big) \quad \text{or} \quad \hat{\mu} = \Big(\sum_{i=1}^{r} w_i y_i\Big) \Big/ n.$$

In sample surveys without nonresponse, the weight $w_i$ is the design weight, that is, the inverse of the probability of being selected. For missing data due to nonresponse the weight is the inverse of the probability of being observed. When the weight is unknown, we can estimate it based on a set of observed variables that characterizes the respondents and nonrespondents. One way is to group subjects into subclasses based on a small set of observed covariates. Within each subgroup the respondents are treated as a random sample of the subpopulation, and the weight is the inverse of the proportion of respondents. When the number of covariates increases, subgrouping leads to a large number of subclasses, and in this case propensity weighting is an alternative (Cochran, 1965; Rosenbaum and Rubin, 1983, 1984; Little, 1986). The propensity score is a scalar function of the observed covariates. One can estimate the propensity score by a logistic or probit regression of $M$ on the observed covariates, and the weight can be derived as the inverse of the propensity score. The potential drawback of propensity weighting is that it may yield estimates with large variances, because respondents with very small propensity scores are assigned huge weights, which may lead to out-of-range estimates for the means (Little and Rubin, 2002).

Complete-case analysis and weighted complete-case analysis delete subjects with missing values, so the information contained in the covariates of the incomplete cases is lost. This loss of information may lead to less efficient estimators. To make full use of the observed information we can use a parametric approach to deal with missing data. For example, we can derive the marginal mean of $Y$ based on a linear regression model

$$Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \varepsilon_i,$$

where $\varepsilon_i$ is the error term with $\varepsilon_i \sim N(0, \sigma^2)$. We can fit this model by the maximum likelihood (ML) approach (Little and Rubin, 2002; Anderson, 1957; Rubin, 1974). The marginal mean of $Y$ can be estimated as

$$\hat{\bar{Y}} = n^{-1}\Big(\sum_{i=1}^{r} y_i + \sum_{i=r+1}^{n} \hat{y}_i\Big), \quad \text{with } \hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j X_{ij},$$

where $\hat{\beta}_0, \ldots, \hat{\beta}_p$ are the maximum likelihood estimators based on the complete cases.

An alternative to the ML estimators is to add a prior distribution for the parameters $\beta_0, \ldots, \beta_p$ and $\sigma^2$ and derive the posterior distribution of $Y$ given the covariates and the unknown parameters, $(Y \mid X_1, \ldots, X_p, \beta_0, \ldots, \beta_p, \sigma^2)$ (Gelman, Carlin, Stern and Rubin, 1995). Missing values of $Y$ and the unknown parameters $\beta = (\beta_0, \ldots, \beta_p)$ and $\sigma^2$ are drawn iteratively by the Gibbs sampler or by another Markov chain Monte Carlo (MCMC) method (Casella and George, 1992; Geman and Geman, 1984). Once the chain has reached its stationary distribution after the $N$th iteration, $M$ datasets are created such that within each dataset every missing value is replaced by an independent draw from the posterior distribution. For each dataset a posterior mean of $Y$, $\bar{Y}^{(l)}$, $l = 1, \ldots, M$, is derived as the average of the observed values and the posterior draws. The marginal mean of $Y$ is the average of the posterior means over the $M$ datasets. Usually $M$ needs to be a large number. However, if we can assume approximate normality for the posterior distribution of $\beta$ and $\sigma^2$ given the observed data, $(\beta_0, \ldots, \beta_p, \sigma^2 \mid X_1, \ldots, X_p)$, we only need to create a small number of datasets to estimate the marginal mean of $Y$, which is the idea of multiple imputation (Little and Rubin, 2002; Rubin, 1978). For each dataset the missing values are replaced by independent posterior draws and the complete-data analysis technique is applied to each imputed dataset. The marginal mean of $Y$ can be derived using Rubin's combining rules (Rubin 1978, 1987, 1996; Rubin and Schenker, 1986; Barnard and Rubin, 1999). Let $\hat{\mu}_d$ be the estimated marginal mean of $Y$ from the $d$th dataset, $d = 1, \ldots, D$, where $D$ is the total number of imputed datasets; the marginal mean of $Y$ is estimated as $\hat{\mu} = \sum_{d=1}^{D} \hat{\mu}_d / D$.

The parametric approach described above is very efficient and yields consistent estimates if the model assumptions are correct. Its drawback is that it is very sensitive to model misspecification. In reality we can never guarantee that the model assumptions are correct, so robust estimators have been gaining attention. Robins, Rotnitzky and Zhao (1994) and Rotnitzky, Robins and Scharfstein (1998) proposed a class of augmented orthogonal inverse probability-weighted estimators, which combine the features of parametric prediction with weighted estimating equations (WEE). The marginal mean of $Y$ can be derived by calibrating the predictions from a parametric model by adding the mean of the weighted residuals,

$$\mu_y = E\big(E(Y \mid X_1, \ldots, X_p)\big) + E\Big[\frac{1 - M}{\pi(Y)}\big(Y - E(Y \mid X_1, \ldots, X_p)\big)\Big],$$

where $\pi(Y)$ denotes the probability that $Y$ is observed. This leads to a calibration estimator of the form

$$\hat{\mu} = n^{-1}\sum_{i=1}^{n} \hat{y}_i + n^{-1}\sum_{i=1}^{r} \hat{w}_i (y_i - \hat{y}_i),$$

where $\hat{w}_i = 1/\widehat{\Pr}(M_i = 0 \mid X_{i1}, \ldots, X_{ip})$ is the estimated weight for the $i$th subject, and $\hat{y}_i$ is the prediction from a parametric model for the $i$th subject. There are three steps in the calibration method. First, a parametric model is fit to the complete cases and predictions are derived for all the subjects by substituting the covariates into the regression model. Second, the propensity score is estimated by a logistic regression or a probit model of $M$ on $X_1, \ldots, X_p$. Third, the marginal mean of $Y$ is estimated by combining the mean of the predictions with the mean of the weighted residuals, where the residuals are the differences between the observed values and the predicted values for the complete cases. This method has a double robustness property, meaning that if either the prediction model is correctly specified or the weight is correctly estimated, the estimate of the marginal mean of $Y$ is consistent. This class of estimators is further extended in Robins and Rotnitzky (2001), Lunceford and Davidian (2004), and Yu and Nan (2006).

An alternative way to achieve robustness is to weaken the model assumptions; for example, we can fit models with robust mean functions. One such method is linear in the weight prediction (LWP). It includes the weight as a linear term in the imputation model (Scharfstein, Rotnitzky and Robins, 1999; Bang and Robins, 2005) as follows:

$$(Y \mid X_1, \ldots, X_p; \beta) \sim N\big(g(X_1, \ldots, X_p; \beta) + \alpha \hat{W}, \sigma^2\big),$$

where $\hat{W} = 1/\widehat{\Pr}(R = 1 \mid X_1, \ldots, X_p)$ is the inverse of the estimated propensity score of respondents, with $R = 1$ indicating a respondent. A similar approach has been applied in the sample survey setting, where the weights are due to sampling rather than nonresponse (Sarndal, Swensson and Wretman, 2003; Firth and Bennett, 1998). The linear in the weight prediction method has a double robustness property similar to that of the calibration estimators: if either the mean function of $Y$ given the covariates is correctly specified or the weight is correctly estimated, then the estimate of the marginal mean of the missing variable $Y$ is consistent. As in the calibration method, the first step of fitting a linear in the weight model estimates the propensity score, for example by a logistic regression model or a probit model of $M$ on $X_1, \ldots, X_p$; in the second step, a regression of $Y$ on the weight and the other covariates is fit parametrically.

Semiparametric and nonparametric methods are another approach to obtaining robust mean functions, by capturing the nonlinear relationship between the variables. In particular, with $p = 1$ and a single covariate $X$, one version of this approach is to base imputations on the penalized spline model $y_i = s(x_i) + \varepsilon_i$ with truncated polynomial basis

$$s(x) = \beta_0 + \beta_1 x + \cdots + \beta_q x^q + \sum_{k=1}^{K} \beta_{qk}(x - \kappa_k)_+^q,$$

where $1, x, \ldots, x^q, (x - \kappa_1)_+^q, \ldots, (x - \kappa_K)_+^q$ is known as the truncated power basis of degree $q$; $\kappa_1 < \cdots < \kappa_K$ are selected fixed knots and $K$ is the total number of knots (Eilers and Marx, 1996; Ruppert, Wand and Carroll, 2003; Ngo and Wand, 2004). The penalized least squares estimator $\hat{\beta} = (\hat{\beta}_0, \ldots, \hat{\beta}_q, \hat{\beta}_{q1}, \ldots, \hat{\beta}_{qK})^T$ is obtained by minimizing

$$\sum_{i=1}^{n}\Big\{y_i - \beta_0 - \sum_{j=1}^{q}\beta_j x_i^j - \sum_{k=1}^{K}\beta_{qk}(x_i - \kappa_k)_+^q\Big\}^2 + \lambda^{2q}\beta^T D \beta,$$

where $\lambda$ is a smoothing parameter and $D = \mathrm{diag}(0_{q+1}, 1_K)$. The fitted values are $\hat{y} = X(X^T X + \lambda^{2q} D)^{-1} X^T y$. This model can be fitted using a number of existing software packages, such as PROC MIXED in SAS (SAS, 1992; Ngo and Wand, 2004; Littell, Milliken, Stroup and Wolfinger, 1996; Ruppert, 2002) and lme() in S-plus (Pinheiro and Bates, 2000).

With more than one covariate, one might extend this approach by fitting a multivariate spline. However, such models are subject to the curse of dimensionality when $p$ is large, which relates to the difficulty of fitting nonparametric regression functions when the regressor space has high dimension. The Penalized Spline of Propensity Prediction (PSPP; Little and An 2004) method addresses this problem by restricting the spline to a particular function of the covariates most sensitive to model misspecification, namely the propensity score. Little and An show that the PSPP method yields an estimate of the marginal mean of the missing variable with a double robustness (DR) property, which means that the predicted marginal mean of $Y$ is consistent when either the mean function of $Y$ given the covariates is correctly specified or the propensity score function is correctly specified. The robustness feature lies in the fact that the parametric function does not have to be correctly specified. A related approach is given by Zeng (2001), who reduces the dimension of the covariates to two, the propensity and a linear predictor, and then models the relationship between the outcome and these two variables by a bivariate nonparametric model.

For the first part of the dissertation, we simplify and extend the PSPP method. Little and An's method requires centering of the covariates before adding them to the model parametrically. We show that this centering is not necessary, which simplifies the PSPP method considerably. We prove that this simplified version has the same DR property as the model proposed by Little and An (2004). The simplified PSPP method is much easier for practitioners to apply. We then extend the simplified PSPP method to derive the conditional mean(s) of a missing variable given a covariate. A stratified PSPP method is proposed to derive the subgroup means given a categorical covariate. For a continuous covariate, we propose a bivariate PSPP method. Both of these extensions consider the interaction of the propensity score and the covariate. Simulations show that these extensions yield consistent conditional means under different mean and propensity structures. We apply the stratified PSPP method to an online weight loss study conducted by Kaiser Permanente (Couper, Peytchev, Little, Strecher and Rothert, 2005).

For the second part of the dissertation, we compare the PSPP method with several alternative doubly robust estimators. The PSPP method is based on the balancing property of the propensity score, which means that, conditioning on the propensity score and assuming MAR, missingness of $Y$ does not depend on the covariates $X_1, \ldots, X_p$ (Rosenbaum and Rubin, 1983). Since we do not know the true relationship between $Y$ and the propensity score, we use a spline of the propensity score to impute the missing values, and the resulting estimates have a double robustness property. The DR property can also be achieved by modeling the relationship parametrically, as in the linear in the weight prediction method and the calibration estimators. However, the emphasis in previous research has been on asymptotic properties of the estimates, namely consistency and achieving the semiparametric efficiency bound (Firth and Bennett, 1998; Scharfstein, Rotnitzky and Robins, 1999; Bang and Robins, 2005; Robins, Rotnitzky and Zhao, 1994). Consistency is a relatively weak property, and does not guarantee good confidence coverage of inferences in small or moderate-sized samples. Semiparametric efficiency is also a fairly weak property since it is asymptotic and does not necessarily guarantee efficiency in finite samples. For the second part of the dissertation we compare root mean square error (RMSE), width of confidence interval and non-coverage rate of the above approaches for a range of sample sizes, when the regression model is misspecified. We also compare these methods when the propensity score is wrongly specified.

In the third part of the dissertation, we extend the PSPP method to the monotone pattern of missing data, where variables can be arranged in such a way that if $Y_j$ is missing in a unit then $Y_{j+1}, Y_{j+2}, \ldots, Y_p$ are missing as well. A monotone pattern of missing data is common in longitudinal studies when some subjects drop out of the study and do not return. We propose to impute the missing values in a stepwise procedure. The marginal propensity score is derived for each part of the missing variables. For the part where the marginal propensity score is zero due to the missingness of the preceding variable(s), we cannot apply the PSPP method directly since there are no observed data with this propensity score. In this case we propose to borrow the propensity scores from the previous stages. Imputation of missing values is done in several steps according to the pattern of missing data. The part with the least missing information is imputed first, and the imputed data are then used to predict the missing values for the next part of the data. Simulation studies show that the stepwise procedure yields satisfactory results. We illustrate our method by applying it to an online weight loss study conducted by Kaiser Permanente. We conclude the dissertation with a short discussion and future work in Chapter V.
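The weighted complete-case and calibration estimators reviewed in this chapter can be sketched numerically. The example below is a minimal illustration, not the dissertation's implementation: the simulated data, the coefficient values, the sample size and the helper `fit_logistic` (a plain Newton-Raphson logistic fit) are all assumptions introduced here. It follows the three calibration steps: fit a prediction model to the complete cases, estimate the response propensity, then add the mean of the weighted residuals to the mean of the predictions.

```python
import numpy as np

def fit_logistic(X, z, n_iter=25):
    """Newton-Raphson fit of a logistic regression of z (0/1) on X (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (z - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated MAR data (all values are assumptions for illustration).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)            # true marginal mean of Y is 1
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + x)))          # Pr(M = 0 | x) depends on x only
observed = rng.uniform(size=n) < p_obs            # True where M = 0
X = np.column_stack([np.ones(n), x])

# Step 1: prediction model fitted to the complete cases, predictions for all subjects.
beta_hat, *_ = np.linalg.lstsq(X[observed], y[observed], rcond=None)
y_hat = X @ beta_hat

# Step 2: estimated weight = inverse of the estimated probability of being observed.
gamma_hat = fit_logistic(X, observed.astype(float))
w_hat = 1.0 + np.exp(-X @ gamma_hat)              # equals 1 / Pr_hat(M = 0 | x)

# Step 3: mean of predictions plus mean of weighted residuals of the complete cases.
resid = np.where(observed, y - y_hat, 0.0)
mu_cal = y_hat.mean() + (w_hat * resid).sum() / n

# For comparison: complete-case mean and weighted complete-case mean.
mu_cc = y[observed].mean()
mu_ipw = (w_hat * np.where(observed, y, 0.0)).sum() / w_hat[observed].sum()
```

In this setup the complete-case mean is pulled upward because subjects with large `x` are more likely to respond, while the weighted and calibration estimates should land near the true mean of 1.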

CHAPTER II

EXTENSIONS OF THE PENALIZED SPLINE PROPENSITY PREDICTION (PSPP) METHOD OF IMPUTATION

Abstract

Little and An (2004) proposed a penalized spline of propensity prediction (PSPP) method of imputation of missing values that yields robust model-based inference under the missing at random assumption. The propensity score for a missing variable is estimated and a regression model is fit that includes the spline of the estimated propensity score as a covariate. The predicted unconditional mean of the missing variable has a double robustness (DR) property under misspecification of the imputation model. We show that a simplified version of PSPP, which does not center other regressors prior to including them in the prediction model, also has the DR property. We also propose two extensions of PSPP, namely stratified PSPP and bivariate PSPP, that extend the DR property to inferences about conditional means. These extended PSPP methods are compared with the PSPP method and simple alternatives in a simulation study and applied to an online weight loss study conducted by Kaiser Permanente.

Keywords: missing at random, propensity, penalized spline.

2.1 Introduction

Missing data problems are common in many applications of statistics. In this paper, we consider univariate nonresponse, where the missingness is confined to a single variable. Let $(Y, X_1, \ldots, X_p)$ denote a $(p+1)$-dimensional vector of variables with $Y$ subject to missing values and $X_1, \ldots, X_p$ fully observed covariates. We consider here the problem of estimating the mean of $Y$, the conditional mean of $Y$ in subclasses defined by a categorical $X$-variable, and the regression coefficient of $Y$ on a continuous $X$-variable.

Many statistical methods have been proposed for these problems. A simple approach is complete-case analysis (CC), which deletes units with $Y$ missing, so information contained in the deleted cases is lost. In the context of our problem, CC analysis yields a consistent estimate of the overall mean of $Y$ if missingness does not depend on any of the variables, and a consistent estimate of the conditional mean of $Y$ given a covariate $X_1$ if the missing-data mechanism depends on $X_1$ but does not depend on $Y$ or $X_2, \ldots, X_p$. Another approach is to impute missing values based on a parametric model, for example a linear regression model

$$Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \varepsilon_i,$$

where $\varepsilon_i$ is the error term with $\varepsilon_i \sim N(0, \sigma^2)$. One can estimate $(\beta_0, \ldots, \beta_p)$ based on the complete cases and predict the missing values of $Y$ by substituting the covariate values for that case into the regression equation. This approach is effective when the data are missing at random (Rubin 1976; Little and Rubin, 2002) and the regression model assumptions are correct, but can yield biased results when the model is misspecified.

Semiparametric and nonparametric methods weaken the model assumptions and capture the nonlinear relationships between the variables. In particular, with $p = 1$ and a single covariate $X$, one version of this approach is to base imputations on the penalized spline model $y_i = s(x_i) + \varepsilon_i$ with truncated polynomial basis

$$s(x) = \beta_0 + \beta_1 x + \cdots + \beta_q x^q + \sum_{k=1}^{K} \beta_{qk}(x - \kappa_k)_+^q \qquad (1)$$

where $1, x, \ldots, x^q, (x - \kappa_1)_+^q, \ldots, (x - \kappa_K)_+^q$ is known as the truncated power basis of degree $q$; $\kappa_1 < \cdots < \kappa_K$ are selected fixed knots and $K$ is the total number of knots (Eilers and Marx, 1996; Ruppert, Wand and Carroll, 2003; Ngo and Wand, 2004). The penalized least squares estimator $\hat{\beta} = (\hat{\beta}_0, \ldots, \hat{\beta}_q, \hat{\beta}_{q1}, \ldots, \hat{\beta}_{qK})^T$ is obtained by minimizing

$$\sum_{i=1}^{n}\Big\{y_i - \beta_0 - \sum_{j=1}^{q}\beta_j x_i^j - \sum_{k=1}^{K}\beta_{qk}(x_i - \kappa_k)_+^q\Big\}^2 + \lambda^{2q}\beta^T D \beta,$$

where $\lambda$ is a smoothing parameter and $D = \mathrm{diag}(0_{q+1}, 1_K)$. The fitted values are $\hat{y} = X(X^T X + \lambda^{2q} D)^{-1} X^T y$. This model can be fitted using a number of existing software packages, such as PROC MIXED in SAS (SAS, 1992; Ngo and Wand, 2004) and lme() in S-plus (Pinheiro and Bates, 2000). This imputation model is strictly speaking parametric, but mimics a nonparametric method when $K$ is large, since the form of the relationship between $Y$ and $X$ is very flexible.

With more than one covariate, one might extend this approach by fitting a multivariate spline. However, such models are subject to the curse of dimensionality when $p$ is large, which relates to the difficulty of fitting nonparametric regression functions when the regressor space has high dimension. Penalized Spline of Propensity Prediction (PSPP; Little and An 2004) addresses this problem by restricting the spline to a particular function of the covariates most sensitive to model misspecification, namely the propensity score. Little and An show that the PSPP method yields an estimate of the marginal mean of the missing variable with a double robustness (DR) property, described below in section 2.2. We propose a simplification of PSPP that does not center the regressors prior to including them in the prediction model.

Little and An (2004) did not consider whether the PSPP yields robust estimates for other parameters, such as conditional means or regression coefficients. In section 2.3 we provide examples to show that the PSPP method does not in general yield estimates of these parameters with the DR property. This motivates robust extensions of the PSPP method for estimating subgroup means and regression coefficients, which are described in sections 2.4 and 2.5. We apply the proposed methods to an online weight loss study in section 2.6, and section 2.7 presents concluding remarks.

2.2 Penalized Spline of Propensity Prediction (PSPP)

Let $(Y, X_1, \ldots, X_p)$ denote a vector of variables with $Y$ subject to missing values and $X_1, \ldots, X_p$ fully observed covariates. The missingness of $Y$ depends only on $X_1, \ldots, X_p$, so the missing-data mechanism is missing at random (Rubin, 1976). Let $M$ be an indicator variable with $M = 1$ when $Y$ is missing and $M = 0$ when $Y$ is observed. Define the logit of the propensity for $Y$ to be observed as:

$$P^* = \mathrm{logit}\big(\Pr(M = 0 \mid X_1, \ldots, X_p)\big) \qquad (2)$$
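The two ingredients defined so far, the logit propensity in (2) and a penalized spline of the form (1), can be combined in a short numerical sketch. This is only an illustration under stated assumptions: the simulated data, the fixed penalty multiplier `lam`, the choice of 10 equally spaced knots with a truncated linear basis (q = 1), and the helper names `fit_logistic`, `truncated_linear_basis` and `fit_penalized_spline` are all introduced here and are not the dissertation's code, which fits the spline with mixed-model software.

```python
import numpy as np

def fit_logistic(X, z, n_iter=25):
    """Newton-Raphson fit of a logistic regression of z (0/1) on X (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (z - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def truncated_linear_basis(u, knots):
    """Columns 1, u, (u - kappa_1)_+, ..., (u - kappa_K)_+, i.e. degree q = 1."""
    return np.column_stack([np.ones_like(u), u] + [np.maximum(u - k, 0.0) for k in knots])

def fit_penalized_spline(u, y, knots, lam):
    """Penalized least squares; only the K knot coefficients are penalized."""
    B = truncated_linear_basis(u, knots)
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))   # diag(0_{q+1}, 1_K) with q = 1
    return np.linalg.solve(B.T @ B + lam * D, B.T @ y)

# Simulated MAR data (all values are assumptions for illustration).
rng = np.random.default_rng(1)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + x1 + x2 + rng.normal(size=n)             # true marginal mean of Y is 1
observed = rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(0.5 + x1 + x2)))  # M = 0

# Step 1: estimate the logit propensity P* by logistic regression on the covariates.
X = np.column_stack([np.ones(n), x1, x2])
p_star = X @ fit_logistic(X, observed.astype(float))

# Step 2: penalized spline of Y on P*, fitted to the complete cases only.
knots = np.linspace(p_star.min(), p_star.max(), 12)[1:-1]   # 10 interior knots
coef = fit_penalized_spline(p_star[observed], y[observed], knots, lam=1.0)
y_pred = truncated_linear_basis(p_star, knots) @ coef

# Step 3: average the observed values and the predictions for the missing cases.
mu_hat = np.where(observed, y, y_pred).mean()
```

The penalty multiplier is held fixed here for simplicity; in practice it would be chosen from the data, for example through the mixed-model representation that the software cited in the text relies on.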

The key property of the propensity score is that, conditioning on the propensity score and assuming MAR, missingness of $Y$ does not depend on $X_1, \ldots, X_p$ (Rosenbaum and Rubin, 1983). Thus, the mean of $Y$ can be written as

$$\mu_y = E[(1 - M)Y] + E[M \times E(Y \mid P^*)] \qquad (3)$$

Since the true relationship between $Y$ and the propensity score is unknown, Little and An (2004) proposed to include the propensity score in the imputation model nonparametrically. This motivates the Penalized Spline of Propensity Prediction method (PSPP), which is based on the following model:

$$(X_2, \ldots, X_p \mid P^*) \sim N_{p-1}\big((s_2(P^*), \ldots, s_p(P^*)), \Sigma\big)$$
$$(Y \mid P^*, X_1, \ldots, X_p; \beta) \sim N\big(s(P^*) + g(P^*, X_2^*, \ldots, X_p^*; \beta), \sigma^2\big) \qquad (4)$$

where $N_k(\mu, \Sigma)$ denotes the $k$-variate normal distribution with mean $\mu$ and covariance matrix $\Sigma$; $s_j(P^*) = E(X_j \mid P^*)$, $j = 2, \ldots, p$, is a spline for the regression of $X_j$ on $P^*$ of the form (1); $X_j^* = X_j - s_j(P^*)$ is the residual of the spline model and represents the part of $X_j$ not explained by the propensity score; $s(P^*)$ is a spline of $Y$ on $P^*$ of the form (1); and $g$ is a parametric function indexed by unknown parameter $\beta$ with $g(P^*, 0, \ldots, 0; \beta) = 0$ for all $\beta$. One of the predictors, here $X_1$, is omitted from the $g$-function to avoid multicollinearity. The first step of fitting a PSPP model estimates the propensity score, for example by a logistic regression model of $M$ on $X_1, \ldots, X_p$; in the second step, the regression of $Y$ on $P^*$ is fit as a spline model with the other covariates included in the model parametrically in the $g$-function. The predicted mean of $Y$ from model (4) has the following DR property:

Theorem 1. Let $\hat{\mu}_y$ be the prediction estimator for (3) based on model (4), and assume MAR. Then $\hat{\mu}_y$ is a consistent estimator of $\mu_y$ if either (a) the mean of $Y$ given $(P^*, X_1, \ldots, X_p)$ in model (4) is correctly specified, or (b1) the propensity $P^*$ is correctly specified, and (b2) $E(X_j \mid P^*) = s_j(P^*)$ for $j = 2, \ldots, p$ and $E(Y \mid P^*) = s(P^*)$.

The robustness feature derives from the fact that the regression function $g$ does not have to be correctly specified (Little and An, 2004). The covariates $X_2, \ldots, X_p$ in this theorem are centered by regressing $X_2, \ldots, X_p$ on splines of $P^*$ and taking residuals. A simpler method adds $X_2, \ldots, X_p$ directly to the regression, without centering. We now show that this method also has the DR property:

Theorem 2. The PSPP method based on model (4) can be simplified as follows:

$$(Y \mid P^*, X_1, \ldots, X_p; \beta) \sim N\big(s(P^*) + g(P^*, X_2, \ldots, X_p; \beta), \sigma^2\big) \qquad (5)$$

that is, the covariates $X_2, \ldots, X_p$ enter the parametric function $g$ without centering. Let $\hat{\mu}_y$ be the prediction estimator for (3) based on model (5), and assume MAR; then $\hat{\mu}_y$ has the same DR property as that derived from model (4) (see appendix for proof).

For this reason, we focus on the uncentered version of the PSPP method for the remainder of the paper.

2.3 PSPP is not doubly robust for subgroup means

The DR property of PSPP for estimating the marginal mean of $Y$ does not extend to estimates of conditional means, such as means in subgroups defined by a categorical covariate. The next two examples illustrate this statement. The first example illustrates the intuitively obvious fact that for estimating the conditional mean of $Y$ given $X_1$, the PSPP method needs to include $X_1$ as a predictor in the model for $Y$. The second example illustrates that inclusion of $X_1$ as a predictor in the model for $Y$ is not sufficient to avoid bias with the PSPP method. This limitation is then addressed with the extended versions of the method.

Example 1. PSPP for estimating a conditional mean: including the subgroup variable in the model for Y is necessary.

We simulate 500 datasets with 500 subjects, with categorical covariate $X_1$, continuous covariate $X_2$ and continuous response variable $Y$, where $X_1$, $X_2$ are independent with $X_1 \sim \text{multinomial}(0.5, 0.3, 0.2)$, $X_2 \sim N(0, 1)$, and

23 ( μ ) 4 Y, ~ N (, ),, μ (, ) = I[ = ] + 3 I[ = ] + 5 I[ = 3] + 0 where I[] denotes an indicator for the event in the arenthesis. We create missing values of Y from the resonse roensity model: logit ( PM ( = 0, )) = I [ = ] 0.5 I [ = ] We imute the missing values of Y using redicted means from the following methods: (a) A correctly-secified ANCOVA model of Y given,, which we denote [ + ]. (b) An incorrectly secified regression model for Y that omits, namely [ ]. (c) The PSPP Method with null g function, which we denote [( )]. The roensity sp correct score is modeled as an additive function of Pcorrect and and hence is correctly secified and conditions on. (d) Model (c) with included, namely [( sp ) + ]. This model correctly secifies correct the mean of Y given the covariates, since it includes the main effects of and. (e) The PSPP Method with null g function and incorrectly secified roensity score, modeled as a linear function of alone, which we denote [( sp wrong )]. (f) Model (e) with included, namely [( sp ) + ]. This model correctly secifies wrong the mean of Y given the covariates, since it includes the main effects of and. For all the enalized sline methods in this aer, we choose 0 equally saced fixed knots and a truncated linear basis. We estimate the marginal mean of Y and the conditional means of Y given as the average of observed and imuted values from these methods. For comarison uroses, we also show estimates from the data before deletion (BD) and estimates based on the comlete cases (CC). Emirical bias (Bias), emirical standard deviation (STD) and root mean square error (RMSE) over the 500 relications are summarized in Tables.A and.b. CC analysis yields estimates with large biases and RMSEs. The correctly secified ANCOVA model (a) yields unbiased estimates close to the BD estimates. The wrongly secified ANCOVA model (b) yields

biased parameter estimates, with large biases and RMSEs. For the PSPP method, inclusion of $X_1$ in the model is important for subgroup mean estimation. Without $X_1$ in the model, the PSPP method (c) yields small empirical bias for the marginal mean estimate and a large empirical bias for the conditional means of Y given $X_1$, even though the propensity score model is correct and conditions on both $X_1$ and $X_2$; including $X_1$ in the PSPP method (d) yields estimates of the marginal mean of Y and conditional means of Y given $X_1$ with small empirical biases, and STDs and RMSEs very close to those of BD. When neither the propensity score nor the mean function is correctly specified, the PSPP method (e) yields biased results; but the bias is removed in model (f) by including $X_1$, since then the regression is correctly specified.

Example 2. PSPP for estimating a conditional mean: including the subgroup variable in the model for Y is not sufficient. We now generate $X_1$ and $X_2$ as in Example 1; but the mean of Y given $X_1$ and $X_2$ is simulated to include both a quadratic term in $X_2$ and interactions between $X_1$ and $X_2$:

$$(Y \mid X_1, X_2) \sim N\big(\mu(X_1, X_2),\ \cdot\big), \qquad \mu(X_1, X_2) = I[X_1 = 1] + 3\,I[X_1 = 2] + 5\,I[X_1 = 3] + \ldots$$

The logistic regression of M is additive in $X_1$ and a quadratic function of $X_2$:

$$\text{logit}\big(P(M = 0 \mid X_1, X_2)\big) = \ldots\; 0.5\, I[X_1 = 1] \;\ldots\; 0.5\, I[X_1 = 2] \;\ldots$$

We simulate 500 datasets with sample size of 000 each. We impute the missing Y as predicted means from the following methods:

(a) A correctly-specified regression model for Y, namely $[\ldots]$.
(b) The PSPP model with null g-function, namely $[s(P)]$. The propensity score $P$ is modeled as an additive function of $X_1$, $X_2$ and $X_2^2$, and hence is correctly specified.
(c) PSPP with $X_1$ included, that is, $[s(P) + X_1]$.
(d) PSPP with $X_1$ and $X_2$ included, namely $[s(P) + X_1 + X_2]$.
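The imputation step shared by the PSPP variants in Examples 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the simulation coefficients, the fixed smoothing parameter `lam`, the number of knots, the use of the estimated propensity on the logit scale, and the helper names `fit_logistic`, `pspp_impute`, `simulate` and `demo` are all assumptions made for the sketch. It fits a truncated linear spline of the estimated propensity, with a ridge penalty on the knot coefficients only, and averages observed and imputed values.

```python
import numpy as np

def fit_logistic(X, r, n_iter=25):
    """Newton-Raphson fit of a logistic regression; returns the coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        w = mu * (1.0 - mu)
        hessian = X.T @ (w[:, None] * X) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hessian, X.T @ (r - mu))
    return beta

def pspp_impute(y, observed, p, g_covariates=None, n_knots=20, lam=1.0):
    """Fill in missing y with predictions from s(p) + g, fitted on observed cases."""
    knots = np.linspace(p.min(), p.max(), n_knots + 2)[1:-1]  # interior knots
    fixed = [np.ones_like(p), p]
    if g_covariates is not None:
        fixed.append(g_covariates)          # parametric g terms, unpenalized
    X = np.column_stack(fixed)
    Z = np.clip(p[:, None] - knots[None, :], 0.0, None)       # (p - knot)_+ terms
    C = np.hstack([X, Z])
    # Ridge penalty on the truncated-line coefficients only (lam is assumed fixed).
    D = np.diag(np.r_[np.zeros(X.shape[1]), np.ones(Z.shape[1])])
    Co, yo = C[observed], y[observed]
    theta = np.linalg.solve(Co.T @ Co + lam * D, Co.T @ yo)
    return np.where(observed, y, C @ theta)

def simulate(n=2000, seed=0):
    """Toy data with hypothetical coefficients (not the values used in the text)."""
    rng = np.random.default_rng(seed)
    x1 = rng.choice([1, 2, 3], size=n, p=[0.5, 0.3, 0.2])
    x2 = rng.normal(size=n)
    y = 1.0 * (x1 == 1) + 3.0 * (x1 == 2) + 5.0 * (x1 == 3) + 2.0 * x2 + rng.normal(size=n)
    logit = -0.5 + 1.0 * x2 + 0.5 * (x1 == 3)
    observed = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))
    return x1, x2, y, observed

def demo(seed=0):
    x1, x2, y_full, observed = simulate(seed=seed)
    d2, d3 = (x1 == 2).astype(float), (x1 == 3).astype(float)
    X_prop = np.column_stack([np.ones(len(x2)), d2, d3, x2])
    p = X_prop @ fit_logistic(X_prop, observed.astype(float))  # estimated logit propensity
    y = np.where(observed, y_full, np.nan)                     # delete unobserved outcomes
    completed = pspp_impute(y, observed, p, g_covariates=np.column_stack([d2, d3]))
    return {"BD": y_full.mean(), "CC": y_full[observed].mean(), "PSPP": completed.mean()}

if __name__ == "__main__":
    print(demo())
```

Passing the dummy variables for $X_1$ as `g_covariates` corresponds to a model of the form $[s(P) + X_1]$; passing `None` gives the null g-function. In practice the smoothing parameter would be estimated (for example through a mixed-model formulation) rather than fixed as it is here.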

The correctly-specified ANCOVA model yields estimates with small empirical bias and RMSE close to BD (Table 2.2). CC analysis and the wrongly specified ANCOVA model yield biased estimates. The PSPP methods (b) to (d) yield estimates for the marginal mean of Y with small empirical bias, but are clearly biased for the conditional means of Y given $X_1$ and Y given $X_2$. In particular, unlike Example 1, adding $X_1$ to the g-function does not correct the misspecification of the mean of Y given $X_1$, since the estimates of the conditional means are still biased.

In the second example, the PSPP method $[s(P) + X_1]$ assumes that for different levels of $X_1$, the splines of Y on $P$ have the same shape; since the true model includes the interaction between $X_1$ and $X_2$, this assumption is violated, and it is this fact that leads to bias for the conditional means. One solution is to include the interaction of the propensity score and $X_1$ in the model, yielding a stratified PSPP method discussed in the next section.

2.4 Stratified Penalized Spline Propensity Prediction for subgroup means

Let $I_c = 1$ if $X_1 = c$; $I_c = 0$ if $X_1 \neq c$, $c = 1, \ldots, C$, where C is the total number of categories of $X_1$. The stratified PSPP method is based on the following model:

$$(Y \mid P, X_1, \ldots, X_p; \beta) \sim N\Big(\sum_{c=1}^{C} I_c\, s_c(P) + g(P, X_1, \ldots, X_p; \beta),\ \sigma^2\Big) \qquad (6)$$

where g is a parametric function indexed by unknown parameter $\beta$ as before, with one covariate dropped to avoid multicollinearity;

$$I_c\, s_c(P) = I_c\Big(\gamma_{0c} + \sum_{j=1}^{q} \gamma_{jc} P^j + \sum_{k=1}^{K} \gamma_{qkc} (P - \kappa_k)_+^q\Big)$$

is the fitted curve for the cth level of $X_1$. Within each level of $X_1$,

$$E(Y \mid P, X_1 = c, \ldots, X_p; \beta) = s_c(P) + g(P, X_1 = c, \ldots, X_p; \beta).$$

Note that this method is not the same as applying PSPP within strata defined by $X_1$, since the g-function does not necessarily include the interactions of $X_1$ with the other covariates. This method yields consistent estimates for the conditional means of Y given

$X_1$. The marginal mean of Y is a weighted average of conditional means, which again has the double robustness property (see appendix for proof).

Example 2 continued. Row (e) in Table 2.2 shows the results of applying stratified PSPP to the data in Example 2. The empirical bias is small for the marginal mean of Y and the subgroup means of Y given $X_1$, and the RMSE for these parameters is only slightly larger than for BD. Thus stratified PSPP has fixed the bias for the subgroup means in the PSPP methods. On the other hand the empirical bias remains large for the coefficients of the regression of Y on $X_2$. For those parameters we need another extension of PSPP, which we now describe.

2.5 A Bivariate PSPP Method for estimating the conditional mean of Y given a continuous covariate

In this section we consider estimating the conditional mean of Y given a continuous variable $X_2$, based on a regression model for Y given $X_2$. To estimate the regression coefficients in this case we need to assume that the regression of Y on $X_2$ is correctly specified; for concreteness we assume it is linear in the parameters with mean $E(Y \mid X_2) = \beta_0 + \beta_1 X_2 + \beta_2 X_2^2$. To yield consistent parameter estimates for the regression coefficients, we now include the interaction of the propensity score and $X_2$ in the model for predicting the missing values of Y. Specifically, we propose the following bivariate PSPP method, based on the model:

$$(Y \mid P, X_1, \ldots, X_p; \beta) \sim N\big(s(P, X_2) + g(P, X_1, \ldots, X_p; \beta),\ \sigma^2\big) \qquad (7)$$

where g is a parametric function; $s(P, X_2)$ is a bivariate P-spline of $P$ and $X_2$. Estimation of the bivariate smoothing function $s(P, X_2)$ requires bivariate basis functions, which can be derived in several different ways. A natural extension of the truncated linear basis for one dimension is to form all the pairwise products of the basis functions. The resulting bivariate basis is called the tensor product basis (Ruppert, Wand and Carroll, 2003). With this basis, the bivariate function $s(P, X_2)$ can be written as

$$
\begin{aligned}
s(P, X_2) ={}& \alpha_0 + \alpha_1 P + \sum_{k=1}^{K_1} \gamma_{1k} (P - \kappa_{1k})_+ + \alpha_2 X_2 + \sum_{k'=1}^{K_2} \gamma_{2k'} (X_2 - \kappa_{2k'})_+ + \alpha_3 P X_2 \\
&+ \sum_{k=1}^{K_1} \gamma_{3k}\, X_2\, (P - \kappa_{1k})_+ + \sum_{k'=1}^{K_2} \gamma_{4k'}\, P\, (X_2 - \kappa_{2k'})_+ + \sum_{k=1}^{K_1} \sum_{k'=1}^{K_2} \gamma_{5kk'} (P - \kappa_{1k})_+ (X_2 - \kappa_{2k'})_+
\end{aligned}
$$

where $\kappa_{11} < \ldots < \kappa_{1K_1}$ and $\kappa_{21} < \ldots < \kappa_{2K_2}$ are selected fixed knots for the propensity score and $X_2$ respectively. In this paper we choose 5 equally spaced knots for each variable when fitting the bivariate splines using a tensor product basis.

Example 2 continued. Row (f) in Table 2.2 shows estimates of the parameters when missing values are imputed using the bivariate PSPP method. This method yields estimates of the coefficients of the regression of Y on $X_2$ with small empirical biases and RMSEs only slightly higher than those of BD analysis. The conditional means of Y given $X_1$ from bivariate PSPP are biased. To get consistent estimates of both the conditional means of Y given $X_1$ and the conditional mean of Y given $X_2$, a model is needed that includes the interaction between the propensity score and $X_1$ and the interaction between the propensity score and $X_2$. This motivates the following combination of the stratified PSPP and bivariate PSPP models:

$$(Y \mid P, X_1, \ldots, X_p; \beta) \sim N\Big(\sum_{c=1}^{C} I_c\, s_c(P) + s(P, X_2) + g(P, X_1, \ldots, X_p; \beta),\ \sigma^2\Big)$$

where $I_c\, s_c(P)$ and $s(P, X_2)$ are defined as in Sections 2.4 and 2.5 respectively. When we applied this method to the second simulation, a small number (8) of the 500 samples failed to converge, but results for the other samples indicate that empirical bias from this model is small for both the conditional mean of Y given $X_1$ and the conditional mean of Y given $X_2$ (Table 2.2, row (g)).
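The tensor product basis for $s(P, X_2)$ can be assembled mechanically from the two univariate truncated linear bases. The sketch below is illustrative only: the function names `truncated_linear` and `tensor_product_basis` are hypothetical, and the split into an unpenalized block (the polynomial terms) and a penalized block (every term involving a truncated line) is an assumption following the usual mixed-model treatment of P-splines, which the text does not spell out.

```python
import numpy as np

def truncated_linear(u, knots):
    """Matrix with columns (u - knot)_+ for each knot."""
    u = np.asarray(u, dtype=float)
    knots = np.asarray(knots, dtype=float)
    return np.clip(u[:, None] - knots[None, :], 0.0, None)

def tensor_product_basis(p, x, knots_p, knots_x):
    """Return (fixed, penalized) design blocks for a bivariate spline s(p, x).

    fixed:     columns 1, p, x, p*x
    penalized: (p-k)_+, (x-k')_+, x*(p-k)_+, p*(x-k')_+, (p-k)_+ * (x-k')_+
    """
    p = np.asarray(p, dtype=float)
    x = np.asarray(x, dtype=float)
    Tp = truncated_linear(p, knots_p)          # n x K1
    Tx = truncated_linear(x, knots_x)          # n x K2
    fixed = np.column_stack([np.ones_like(p), p, x, p * x])
    cross_p = x[:, None] * Tp                  # x * (p - k)_+
    cross_x = p[:, None] * Tx                  # p * (x - k')_+
    # All pairwise products of truncated lines, flattened to n x (K1*K2).
    prod = (Tp[:, :, None] * Tx[:, None, :]).reshape(len(p), -1)
    penalized = np.hstack([Tp, Tx, cross_p, cross_x, prod])
    return fixed, penalized

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, x = rng.normal(size=50), rng.normal(size=50)
    kp = np.linspace(p.min(), p.max(), 7)[1:-1]   # 5 interior knots
    kx = np.linspace(x.min(), x.max(), 7)[1:-1]
    fixed, penalized = tensor_product_basis(p, x, kp, kx)
    print(fixed.shape, penalized.shape)
```

With 5 knots per variable the two blocks together have $(2 + 5)(2 + 5) = 49$ columns, one for each pairwise product of the univariate basis functions $\{1, P, (P - \kappa_{1k})_+\}$ and $\{1, X_2, (X_2 - \kappa_{2k'})_+\}$.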

2.6 An Example: Online Weight Loss Study

To illustrate our proposed approach, we consider data from an online weight loss study conducted by Kaiser Permanente (Couper et al., 2005). The study randomized approximately 4,000 subjects to the treatment or the control group. For the treatment group, the weight loss information provided online was tailored to the subjects based on their answers to an initial survey, which contained baseline measurements such as baseline weight, motivation for weight loss, etc.; for the control group, the information provided online was the same for all the subjects. At 3 months, a second survey was sent to all of the participants, which collected follow-up measurements such as current weight. Our goal is to compare the short-term treatment effects; in particular, we compare the reduction of the body mass index (BMI), defined as the difference between 3-month BMI and baseline BMI.

There were 059 subjects in the treatment group and 956 subjects in the control group at the baseline. At 3 months, 63 subjects in the treatment group and 6 subjects in the control group responded to the second survey. We assume the data are missing at random. Subjects in the treatment group who remained in the study have much lower baseline BMI than those who dropped out (P<0.00), but this difference is not seen in the control group (P=0.47); on the other hand, control group subjects who remained in the study have better baseline health, as measured by the number of previous diseases, than those who dropped out of the study (P<0.0), and this difference was not seen in the treatment group (P=0.56). These differences suggest that interactions between treatment and baseline covariates need to be included when estimating the propensity scores.

We estimate the propensity scores by a logistic regression, with the inclusive criterion of retaining all variables with P-values less than 0.0.
The final model includes the following covariates: baseline BMI; number of previous diseases; baseline self care; which is harder, eating less or being active; baseline exercise support; baseline activity level; baseline eating topology; education; ethnic identity; treatment; interaction of treatment and baseline BMI; interaction of treatment and baseline eating topology; interaction of treatment and baseline activity level; interaction of treatment and number of previous diseases; interaction of treatment and which is harder, eating less or being active.

We apply the PSPP method and the stratified PSPP method to the data as follows:

(a) PSPP method with null g-function, denoted as $[s(P)]$, where $P$ is the propensity score defined earlier.
(b) Model (a) with treatment as a covariate, denoted as $[s(P) + \text{treatment}]$.
(c) Model (b) with baseline covariates, denoted as $[s(P) + \text{treatment} + g(\text{baseline vars})]$.
(d) Stratified PSPP method with null g-function, denoted as $[\sum_{c} I_c\, s_c(P)]$.
(e) Model (d) with baseline covariates, denoted as $[\sum_{c} I_c\, s_c(P) + g(\text{baseline vars})]$.

The baseline covariates in the g-function of models (c) and (e) include: ethnic identity; baseline medical advice; baseline eating topology; baseline cardio exercise; baseline activity level; baseline BMI; number of previous diseases; number of weight loss methods tried; motivation for weight loss; which is harder, eating less or being active.

Results are summarized in Table 2.3. Empirical standard errors (SE) and the corresponding confidence intervals are obtained from 00 bootstrap samples. The treatment group has a larger reduction of BMI after 3 months (-0.9 (0.09)) compared to the control group (-0.45 (0.0)) based on the complete case analysis. The stratified PSPP method (models d and e) and the PSPP method with the treatment as a covariate (models b and c) yield similar results, with the reduction of BMI ranging from ... to -.0 for the treatment group and from ... to ... in the control group. The 95% confidence intervals for the treatment group do not overlap with those for the control group, suggesting a treatment effect on weight loss (models b, c, d, e). On the other hand, the PSPP method without treatment as a covariate does not show the treatment effect (95% CI (-0.96, -0.65) for the treatment group; 95% CI (-0.76, -0.47) for the control group). Adding the g function to the model does not affect bias but improves efficiency (models c and e).
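The bootstrap standard errors and confidence intervals described above can be sketched as follows. This is an illustration under stated assumptions: the text does not say which type of bootstrap interval was used, so the percentile interval here is an assumption, and `bootstrap_se_ci` and its arguments are hypothetical names. The `estimator` callable is meant to rerun the whole procedure (propensity model, spline fit, imputation, group mean) on each resampled dataset.

```python
import numpy as np

def bootstrap_se_ci(data, estimator, n_boot=200, alpha=0.05, seed=0):
    """Nonparametric bootstrap SE and percentile CI for estimator(data).

    data:      2-D array, one row per subject
    estimator: function mapping a resampled data array to a scalar estimate
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)   # resample subjects with replacement
        estimates[b] = estimator(data[rows])
    se = estimates.std(ddof=1)
    lo, hi = np.quantile(estimates, [alpha / 2.0, 1.0 - alpha / 2.0])
    return se, (lo, hi)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    toy = rng.normal(size=(400, 1))         # toy data, not the study data
    print(bootstrap_se_ci(toy, lambda d: d[:, 0].mean()))
```

Resampling whole subjects keeps each person's covariates, response indicator and outcome together, so the variability from estimating the propensity score is carried into the standard error.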

2.7 Discussion

We have shown that the PSPP method yields an estimate of the marginal mean of Y with a double robustness property, without the need to center the covariates in the g function. However, the PSPP method lacks this property for conditional mean estimation. We have proposed two extensions of PSPP that extend the double robustness property to conditional means, namely stratified PSPP for a categorical predictor, and bivariate PSPP for a continuous predictor. The key property of these extensions is that they include in the prediction model the interaction of the propensity score and the conditioning variable that defines the estimand of interest. Simulations are presented as empirical evidence of the robustness of these extensions.

We estimate the bivariate function $s(P, X_2)$ using a P-spline with a tensor product basis, but other spline fitting methods could also be applied. One choice is to use a thin plate spline (Green and Silverman, 1994; Wood, 1999). To estimate $s(P, X_2)$, we need to find the function $g = g(t) = g(t_1, t_2)$ minimizing

$$\sum_{i=1}^{n} \big(y_i - g(t_{1i}, t_{2i})\big)^2 + \lambda \iint \Big[\Big(\frac{\partial^2 g}{\partial t_1^2}\Big)^2 + 2\Big(\frac{\partial^2 g}{\partial t_1 \partial t_2}\Big)^2 + \Big(\frac{\partial^2 g}{\partial t_2^2}\Big)^2\Big]\, dt_1\, dt_2$$

where the g function has the form

$$g(t) = \theta_0 + \sum_{j=1}^{M} \theta_j \phi_j(t) + \sum_{j=1}^{n} \delta_j E(\lVert t - t_j \rVert)$$

with $t \in R^2$; $E(s) = \frac{1}{2^3 \pi} s^2 \ln(s)$; $\phi_j(t)$ are linearly independent functions of $t$; and $\lambda$ is the smoothing parameter. This model can be fit using the tpspline procedure from SAS (SAS, 99; Ngo and Wand 2004; Wand 2003). We also fitted thin plate splines for the simulation study in Section 2.5 but found some samples failed to yield estimates due to negative variance estimates. For the other samples the results from the tpspline procedure are comparable to those from a P-spline with a tensor product basis.

More generally, a PSPP method that yields doubly robust estimates of the conditional mean of Y given a subset of the covariates $(X_1, \ldots, X_s)$, $s < p$, requires

inclusion of the interactions between the propensity score and $(X_1, \ldots, X_s)$; clearly the curse of dimensionality comes increasingly into play as the size of s increases. A natural question is whether these propensity score methods can be extended to yield robust estimates for the regression given the complete set of covariates, that is, $(X_1, \ldots, X_p)$. We note that in our setting the cases with Y missing contribute no information to this regression, so there is no gain in developing an imputation model. If it is the covariates rather than the outcome that have missing values, however, then the incomplete cases do include information, and it remains an open question whether propensity methods can be used to increase the robustness of inference in such situations. This question deserves future study.

We use a smoothing spline function to model the relationship between Y and the propensity score, and our method has a DR property. The DR property can also be achieved by modeling the relationship parametrically. One method is to include the inverse of the propensity score as a linear term in the imputation model (Firth and Bennett, 1998; Bang and Robins, 2005). Another approach is to calibrate the predictions from a parametric model by adding means of the weighted residuals, with weights equal to the inverse of the propensity scores (Robins, Rotnitzky and Zhao, 1994; Scharfstein, Rotnitzky and Robins, 1999). We are currently conducting simulations to compare the performance of these methods with the PSPP method, and results will be reported in a future paper.

Acknowledgements: this research is supported by CECCR Center grant P50 CA045. We thank Trivellore Raghunathan for assistance with Theorem .

Table 2.1. Example 1: Empirical Bias, Standard Deviation (STD) and Root Mean Squared Error (RMSE) for (A) Marginal mean of Y, and (B) Conditional Mean of Y given $X_1$. Entries are multiplied by 100.

(A) Marginal Mean of Y
Columns: Bias, STD, RMSE

(B) Conditional Mean of Y given $X_1$
Columns: Bias, STD, RMSE for each of $X_1 = 1$, $X_1 = 2$, $X_1 = 3$

Methods (rows of both panels):
BD
CC
(a) Correct ANCOVA $[X_1 + X_2]$
(b) Wrong ANCOVA $[X_2]$
(c) PSPP $[s(P_{\text{correct}})]$
(d) PSPP $[s(P_{\text{correct}}) + X_1]$
(e) PSPP $[s(P_{\text{wrong}})]$
(f) PSPP $[s(P_{\text{wrong}}) + X_1]$

Table 2.2. Example 2: Empirical Bias, Root Mean Squared Error (RMSE) and Coverage rate (Cov) for (A) Marginal mean of Y, (B) Conditional Mean of Y given $X_1$, and (C) Intercept and Slopes for Regression of Y on $X_2$, $X_2^2$. Entries are multiplied by 100.

Column groups (each reporting Bias, RMSE, Cov):
Overall mean
Conditional mean given $X_1$: $X_1 = 1$, $X_1 = 2$, $X_1 = 3$
Coefficients of conditional mean given $X_2$: intercept and the two slopes

Methods (rows):
BD
CC
(a) Correct Model $[\ldots]$
(b) PSPP $[s(P)]$
(c) PSPP $[s(P) + X_1]$
(d) PSPP $[s(P) + X_1 + X_2]$
(e) Stratified PSPP ($Y = \sum_c I_c\, s_c(P)$)
(f) Bivariate PSPP ($Y = s(P, X_2)$)
(g) Stratified_Bivariate PSPP ($Y = \sum_c I_c\, s_c(P) + s(P, X_2)$)
