Semiparametric Approach for Non-Monotone Missing Covariates in a Parametric Regression Model

Size: px

Start display at page:

Download "Semiparametric Approach for Non-Monotone Missing Covariates in a Parametric Regression Model"

Egbert Potter
5 years ago
Views:

1 Biometrics DOI: /biom Semiparametric Approach for Non-Monotone Missing Covariates in a Parametric Regression Model Samiran Sinha, 1 Krishna K. Saha, 2 and Suojin Wang 1, * 1 Department of Statistics, Texas A&M University, College Station, Texas 77843, U.S.A. 2 Department of Mathematical Sciences, Central Connecticut State University, New Britain, Connecticut 06050, U.S.A. sjwang@stat.tamu.edu Summary. Missing covariate data often arise in biomedical studies, and analysis of such data that ignores subjects with incomplete information may lead to inefficient and possibly biased estimates. A great deal of attention has been paid to handling a single missing covariate or a monotone pattern of missing data when the missingness mechanism is missing at random. In this article, we propose a semiparametric method for handling non-monotone patterns of missing data. The proposed method relies on the assumption that the missingness mechanism of a variable does not depend on the missing variable itself but may depend on the other missing variables. This mechanism is somewhat less general than the completely non-ignorable mechanism but is sometimes more flexible than the missing at random mechanism where the missingness mechansim is allowed to depend only on the completely observed variables. The proposed approach is robust to misspecification of the distribution of the missing covariates, and the proposed mechanism helps to nullify (or reduce) the problems due to nonidentifiability that result from the non-ignorable missingness mechanism. The asymptotic properties of the proposed estimator are derived. Finite sample performance is assessed through simulation studies. Finally, for the purpose of illustration we analyze an endometrial cancer dataset and a hip fracture dataset. Key words: Dimension reduction; Estimating equations; Missing at random; Non-ignorable missing data; Robust method. 1. Introduction In observational biomedical studies or survey designs, we often encounter partially missing variables, or variables that are not observed or recorded for all the subjects in a study. For example, in the first motivating dataset from the Los Angeles Endometrial Cancer Study, three of the five covariates are partially missing, and in the second dataset on hip fractures, all nine covariates are partially missing. The simplest method, commonly known as complete case () analysis, ignores the subjects with partially missing entries and takes into account the subjects only with complete information on all covariates. This approach may lose efficiency. It may also lead to biased estimates depending on the missingness mechanism. The problem can be severe especially when there are multiple covariates with missing values, even if each of them contains a low to moderate percentage of missing entries. The aims of this article are to introduce a new missingness mechanism for non-monotone missing data and to propose a semiparametric method to estimate the model parameters. Suppose that our interest is in a finite dimensional parameter β in the parametric regression model f (Y X, Z; β) where Y is the outcome variable and X = (X 1,...,X p ) T and Z = (Z 1,...,Z q ) T are the explanatory variables. We assume that Y and Z are observed for all subjects, whereas all the components of X are partially observed. Let (X i,y i,z i ), i = 1,...,n, be independent and identically distributed (iid) copies of (X, Y, Z), where n is the sample size. Define the missingness indicator variables 1 if Xij is observed, R ij 0 if X ij is not observed, j = 1,...,p and X i( r) (X i1,...,x i(r 1),X i(r1),...,x ip ) T. There are two possibilities for multiple missing covariates: (1) monotone missing data and (2) non-monotone missing data. For monotone missing data, if X ij is missing, then X ik, k>jare all missing, that is, pr(r i(j1) = =R ip = 0 R ij = 0) = 1 for any j = 1,...,(p 1). In this article, we focus on non-monotone missing covariate data. Regarding the missing data mechanism, the missing at random (MAR) assumption is widely used, under which the probability that R ij = 1 depends only on the observed quantities (Rubin, 1976). Under the MAR mechanism and parameter distinctness assumption, valid likelihood based inferences may be carried out without using a model for the missingness mechanism. For handling MAR covariate data, multiple imputation is a commonly used method. Within the broader context of imputation approach, Reilly and Pepe (1995) proposed the mean-score approach and Chatterjee, Chen, and Breslow (2003) proposed a pseudo-score approach which is usually more efficient than the mean-score approach if the assumed model for the missingness mechanism is correct. Robins, Rotnitzky, and Zhao (1994) proposed an efficient method within the class of inverse probability weighted (IPW) estimating 2014, The International Biometric Society 1

2 2 Biometrics equations for handling the MAR data which is similar in spirit to the Horvitz Thompson estimator. Lipsitz, Ibrahim, and Zhao (1999) proposed a doubly robust method that blends the likelihood based approach and the weighted estimating equations. Ibrahim, Chen, and Lipstiz (1999) proposed a full likelihood based method using the EM algorithm. They assumed a parametric model for the partially missing covariate. Under the MAR assumption, Chen (2004) proposed a semiparametric method for handling multiple missing covariates for any arbitrary pattern of missing data. He assumed a parametric model for the odds ratio between any two missing covariates, which is then used in the likelihood formation. Non-ignorable (NI) mechanism happens if the probability that R ij = 1 depends on values of the completely observed and partially observed variables, meaning that the missingness mechanism of X ij may depend on all components of X i = (X i1,...,x ip ) T along with Y i and Z i. For NI missing data, the model parameters are estimated by jointly maximizing the likelihood of the data generation process and the missingness process. This requires a parametric model specification of the distribution of the missing covariates and the missingness process. Consistent parameter estimates are obtained under the assumption that both the models are correctly specified. Ibrahim, Lipsitz, and Chen (1999) wrote a comprehensive review of NI missing covariate data scenario, and they considered arbitrary missing data patterns. One disadvantage of the NI missingness mechanism is that data provide little or no information regarding the parameters in the missingness models. Hence, model parameters are weakly identifiable (Little, 1995). Non-monotone missing data are usually handled by assuming the MAR mechanism (Robins and Gill, 1997) or the NI mechanism (Robins, 1997). In the former case, analysis can be done without assuming any model for the missingness mechanism and any parametric model for the partially missing covariates. In the latter case, analysis is usually done by positing a parametric model for the missingness mechanism and partially missing covariates. Except for some special cases (e.g., Lei and Wang, 2001), the MAR and NI assumptions are generally not testable. To fill the substantial gap between the MAR and NI mechanisms, we introduce another mechanism which is somewhat less general than the NI mechanism but facilitates our calculations as it can be used to tackle the identifiability issue. To handle the non-monotone pattern of missing data, we define missingness mechanism NI- when the probability that R ij = 1 may depend on all variables except the jth covariate X ij, that is, pr(r ij = 1 Y i,x i,z i ) = pr(r ij = 1 Y i,x i( j),z i ). Thus, we assume that R ij may depend on the other variables which are completely observed or partially missing, but does not depend on the values of the jth covariate X ij. Further, we assume that under the NI-mechanism, (1) R i1,...,r ip are independent conditional on X i,y i,z i, and (2) for every j, pr(r ij = 1 Y i,x i,z i ) > 0 with probability one for every combination of (Y i,x i,z i ). Under the independence assumption (1), the NImechanism helps to reduce the identifiability issue encountered by the non-ignorable missing data. Consequently, we do not need to make a parametric model assumption regarding the distribution of the missing covariates. Our NI-mechanism is partly non-ignorable because the probability model for R j may depend on (X 1,...,X j 1,X j1,...,x p ) which are partially observed. In summary, the proposed NI-mechanism is sometimes more flexible than the MAR mechanism where the missingness mechanism is allowed to depend only on the completely observed variables. Further, the NI-mechanism allows us to estimate the model parameters without specifying a parametric model for the missing covariate that is needed for the NI missingness mechanism. When the MAR assumption is in doubt, it may be preferable to assume the NI-mechanism and use the proposed method of estimation. The remainder of this article is outlined as follows: Sections 2 and 3 contain the detailed methodology and asymptotic properties for the case of two partially missing covariates. An extension of the method for more than two partially missing covariates is discussed in Section 4. Sections 5 and 6 contain simulation studies and two real data examples, respectively. Some concluding remarks are given in Section 7. Technical material is provided in the Appendix. 2. Estimation Methodology 2.1. Background The goal is to estimate β. Without any missing entries in X, and under certain regularity conditions, the MLE of β is obtained by solving the score function S nm = n S β(y i, X i,z i )=0, where S β (Y,X,Z)= logf (Y X,Z;β)}/ β for f (Y X,Z; β) satisfying some standard regularity conditions. For the analysis, β is estimated by solving S cc = n ( p j=1 R ij)s β (Y i, X i,z i ) = 0. Note that Reilly and Pepe (1995) and Chatterjee et al. (2003) both considered the case where all R ij, j =1,...,p are equal for a given i, that is, either R i1 = =R ip = 1or R i1 = =R ip = 0 and their missingness mechanism depends only on the observable quantities Proposed Method for NI-Missing Data In this Section, we consider the case of p = 2. We will use the following notation π ij (Y i,x i( j),z i ) = 1 π ij (Y i,x i( j),z i ) pr(r ij = 1 Y i,x i( j),z i ). We first assume that both the covariates X and Z are discrete with fixed numbers of categories while Y can be either discrete or continuous. In the following derivation, any integration with respect to a discrete covariate implies a summation. The likelihood of the data is L = i:r i1 =R i2 =1 f (X i1,x i2 Z i ) π i1 (Y i,x i( 1),Z i )π i2 (Y i,x i( 2),Z i )f (Y i X i,z i ) i:r i1 =1,R i2 =0 π i1 (Y i,x i( 1),Z i ) f (Y i X i,z i )f (X i1,x i2 Z i )dx i2 } π i2 (Y i,x i( 2),Z i ) i:r i1 =0,R i2 =1 π i1 (Y i,x i( 1),Z i ) f (Y i X i,z i )f (X i1,x i2 Z i )dx i1 } π i2 (Y i,x i( 2),Z i )

3 Semiparametric Approach for Non-Monotone Missing Covariates 3 i:r i1 =0,R i2 =0 π i1 (Y i,x i( 1),Z i ) π i2 (Y i,x i( 2),Z i ) f (Y i X i,z i )f (X i1,x i2 Z i )dx i1 dx i2 }. The score function for estimating β is S fx,β = i:r i1 =R i2 =1 i:r i1 =1,R i2 =0 i:r i1 =0,R i2 =1 i:r i1 =0,R i2 =0 S β (Y i,x i,z i ) S β (Y i,x i,z i )q i,2 (π i1 )dx i2 S β (Y i,x i,z i )q i,1 (π i2 )dx i1 S β (Y i,x i,z i )q i,12 ( π i1 π i2 )dx i, (1) where for r = 1, 2, 12, q i,r (ω) = f (Y i X i,z i )ω(y i,x i,z i )f (X ir X i( r),z i )/ f (Y i X i,z i )ω(y i,x i,z i ) f (X ir X i( r),z i )dx ir is a function of the given parametric model f (Y i X i,z i ), the distribution of the missing covariate given the completely observed covariates, and ω. Here r = 12 means both X i1 and X i2 are missing. For obtaining q i,2 (π i1 ), q i,1 (π i2 ), and q i,12 ( π i1 π i2 ), we set ω = π i1,π i2, and π i1 π i2, respectively. For estimating q i,r (ω) we face two issues: estimation of the missing covariate distribution and of the missingness probabilities (selection probabilities). One obvious approach is to model the missingness probability and the covariate distribution parametrically. However, modeling the distribution of the covariates parametrically is not an easy task, and model misspecification can lead to biased estimates. Therefore, we use an empirical approach to estimate f (X ir X i( r),z i ) and use a parametric model for the missingness probabilities. Suppose that the missingness probabilities are identified by a finite dimensional parameter α = (α T 1,αT 2 )T, where α k is for X k for k = 1, 2. We will denote ω(y i,x i,z i ) by ω(y i,x i,z i,α). Adopting the technique of Chatterjee et al. (2003) to our setting, we write the conditional density of X ir given X i( r) and Z i in terms of the conditional density of X ir given X i( r) and Z i in the completely observed data V =i : R i1 = R i2 = 1} as f (X ir X i( r),z i ) = f (X ir X i( r),z i,r i1 =R i2 =1)pr(R i1 =R i2 =1 X i( r),z i ). pr(r i1 = R i2 = 1 X i,z i ) (2) Plugging this into the expression of q i,r we obtain q i,r (ω) = h(y i,x i,z i ; ω)f (X ir X i( r),z i,r i1 = R i2 = 1) / h(y i,x i,z i ; ω)f (X ir X i( r),z i,r i1 = R i2 = 1) dx ir, where h(x i,y i,z i ; ω) = f (Y i X i,z i )ω(x i,y i, Z i )/pr(r i1 = R i2 = 1 X i,z i ). Following a strategy similar to that of Chatterjee et al. (2003) we obtain pr(r i1 = R i2 = 1 X i,z i ) = pr(r i1 = R i2 = 1 X i,y i, Z i )f (Y X i,z i ; β)dy. Therefore, q i,1 (π i2 ), q i,2 (π i1 ), and q i,12 ( π i1 π i2 ) involve f (X i1 X i( 1),Z i,r i1 = R i2 = 1), f (X i2 X i( 2), Z i,r i1 = R i2 = 1), and f (X i Z i,r i1 = R i2 = 1), respectively. Note that these functions must be estimated empirically based on the completely observed data V. Observe that while pr(r i1 = R i2 = 1 X i( r),z i ) appears in the numerator of (2), it does not appear in q i,r (ω). An estimator of q i,r (ω) denoted by q i,r (ω) is obtained after replacing f (X ir X i( r),z i,r i1 = R i2 = 1) by its empirical estimates f (Xir X i( r),z i,r i1 = R i2 = 1) = d F(Xir X i( r),z i,r i1 = R i2 = 1), where F(x X i( r),z i,r i1 = R i2 = 1) n j=1 = I(X jr x, X j( r) =X i( r),z j =Z i,r j1 =R j2 =1) n I(X j=1 j( r) = X i( r),z j = Z i,r j1 = R j2 = 1) and in particular when r = 12, F(x 1,x 2 Z i,r i1 = R i2 = 1) n j=1 = I(X j1 x 1,X j2 x 2,Z j = Z i,r j1 = R j2 = 1) n I(Z. j=1 j = Z i,r j1 = R j2 = 1) After replacing q i,r by q i,r (r = 1, 2, 12) in (1) we obtain the estimated score functions S fx,β = i:r i1 =R i2 =1 i:r i1 =1,R i2 =0 i:r i1 =0,R i2 =1 i:r i1 =0,R i2 =0 S β (Y i,x i,z i ) S β (Y i,x i,z i ) q i,2 (π i1 )dx i2 S β (Y i,x i,z i ) q i,1 (π i2 )dx i1 S β (Y i,x i,z i ) q i,12 ( π i1 π i2 )dx i. For estimating β we also need to estimate α 1 and α 2. Define S αk (R ik,y i,x i,z i ) = R ik logπ ik (Y i,x i( k),z i )}/ α k (1 R ik ) log π ik (Y i,x i( k),z i )}/ α k for k = 1, 2. Then we can estimate α 1 and α 2 by solving the estimating equations S f x,α 1 = i:r i1 =R i2 =1 i:r i1 =1,R i2 =0 i:r i1 =0,R i2 =1 i:r i1 =0,R i2 =0 S α1 (R i1,y i,x i,z i ) S α1 (R i1,y i,x i,z i ) q i,2 (π i1 )dx i2 S α1 (R i1,y i,x i,z i ) S α1 (R i1,y i,x i,z i ) q i,12 ( π i1 π i2 )dx i = 0

4 4 Biometrics and S fx,α 2 = i:r i1 =R i2 =1 i:r i1 =0,R i2 =1 i:r i1 =1,R i2 =0 i:r i1 =0,R i2 =0 S α2 (R i2,y i,x i,z i ) S α2 (R i2,y i,x i,z i ) q i,1 (π i2 )dx i1 S α2 (R i2,y i,x i,z i ) S α2 (R i2,y i,x i,z i ) q i,12 ( π i1 π i2 )dx i = 0. If the missingness model of R r does not depend on X ( r) then the missingness model becomes a model for the MAR data. Therefore, the resulting method will produce consistent estimates when the data are MAR, and the method will reduce to the pseudo-score method of Chatterjee et al. (2003). While there is no direct test to verify this assumption, the NImechanism would be worth considering if R 1 (R 2 ) depends on X 2 (X 1 ) in the subset i : R i2 = 1} (i : R i1 = 1}). When any component of X or Z is continuous, a similar method can be developed where conditional probability mass function (pmf) must be replaced by a conditional density function estimated by, say, a kernel method. This approach requires a full scale technical and numerical investigation and is a problem for future research. Alternatively, in this work we approximate the conditional density by the empirical conditional pmf after discretizing the continuous components. Although this method is an approximation, our finite sample numerical studies suggest that it works well. Note that in principle discretization should be refined with sample size so that the conditional pmf converges to the conditional density as n grows to infinity. How to optimize the rate of refinement as a function of the sample size is an interesting problem for future work The Issue of Identifiability The likelihood identifiability is a sophisticated issue in the general missing data context. Through the following simple example we illustrate how the identifiability issue under the NI data is eased under the NI-missing data assumption when the models for the missingness indicators and the covariate distribution are nonparametric. Suppose that potentially observable data are n iid copies of (Y, X 1,X 2,R 1,R 2 ) without Z, where Y is a binary response variable, X 1 and X 2 both are scalar binary variables, and R k is a missingness indicator for X k for k = 1, 2. Let n x1,x 2,y be the number of observations with X 1 = x 1, X 2 = x 2, Y = y, and for R 1 = R 2 = 1; m x1,,y be the number of observations with X 1 = x 1, missing X 2, and Y = y (i.e., where R 1 =1 and R 2 = 0); m,x2,y be the number of observations with missing X 1, X 2 = x 2, and Y = y (i.e., where R 1 = 0 and R 2 = 1); and m,,y be the number of observations with missing X 1 and X 2, and Y = y (i.e., where R 1 = 0 and R 2 = 0). Thus, n 0,0,0 n 0,0,1 n 0,1,0 n 0,1,1 n 1,0,0 n 1,0,1 n 1,1,0 n 1,1,1 m,0,0 m,0,1 m,1,0 m,1,1 m 0,,0 m 0,,1 m 1,,0 m 1,,1 m,,0 m,,1 = n. Let u x1,x 2 = pr(x 1 = x 1,X 2 = x 2 ), v x1,x 2 = pr(y = 1 X 1 = x 2,X 2 = x 2 ), π x (1) 1,x 2,y = pr(r 1 = 1 X 1 = x 1,X 2 = x 2,Y = y), and π x (2) 1,x 2,y = pr(r 2 = 1 X 1 = x 1,X 2 = x 2,Y = y). The likelihood of the observed data involves 23 parameters (4v-parameters, 3u-parameters, 8π (1) -parameters, and 8π (2) -parameters) whereas the data contain 18 1 = 17 independent patterns; therefore the parameters are not identifiable from the data. However, this issue may be eased by imposing some model restrictions such as our NI-mechanism. In the NImechanism, there are at most 4π (1) -parameters (no X 1 involved) and 4π (2) -parameters (no X 2 involved), resulting in a total of 15 model parameters, which is fewer than the number of observed patterns of the data. Let L be the likelihood of the observed data and θ = (u 00,u 01,u 10,v 00, v 01,v 10,v 11,π (1) 00,π (1) 01,π (1) 10,π (1) 11,π (2) 00,π (2) 01,π (2) 10,π (2) 11 ) T be the set of parameters under the NI-mechanism. In the Supplementary Materials, we show that E log(l)/ θ log(l)/ θ} T ] is non-singular, a sufficient condition for identifiability (Rothenberg, 1971). The issue in a more general setting is interesting and worth investigating. Additionally, partial information regarding α 1 (generally a non-saturated equivalent of π (1) ) involved in pr(r 1 = 1 X 2,Y; α 1 ) can be obtained from the data (R 1,X 2,Y) among the subjects with R 2 = 1. Note that these subjects contri- bution to the likelihood for α 1 is i:r i2 =1 prr i1(r i1 = 1 R i2,x i1,x i2,y i )pr 1 R i1(r i1 = 0 R i2,x i1,x i2,y i ) = i:r i2 =1 pr R i1(r i1 = 1 X i2,y i )pr 1 R i1(r i1 = 0 X i2,y i ) due to our assumption that R 1 R 2 X 1,X 2,Y and R 1 X 1 X 2,Y. To clarify the notation, the R i1 inside the probability statements above is a random variable while the R i1 on the exponents is the observed value of the random variable. Similarly, information regarding α 2 can be obtained from the data (R 2,X 1,Y) among the subjects with R 1 = Computational Steps Let θ = (α T,β T ) T. Then θ is estimated by solving the estimating equations = 0 iteratively, where S S f,θ Ṱ = f x,θ (S Ṱ,S Ṱ,S Ṱ ). Let α(t) 1, α (t) 2, and β (t) be the parameter f x,α 1 f x,α 2 f x,β estimates at the tth iteration. Step 0. Initialize α 1 = α (0) 1, α 2 = α (0) 2, and β = β (0). Step 1. For the (t 1)th iteration first compute q i,r for r = 1, 2, 12 using α 1 = α (t) 1, α 2 = α (t) 2, and β = β (t). Then insert these quantities into the estimating equations S fx,θ = 0. Step 2. Solve for α 1,α 2, and β from = 0 using the S fx,θ Newton Raphson method by assuming that q i,r is fixed. Observe that conditional on q i,r, r = 1, 2, 12, the Newton Raphson method can be carried out separately for each parameter vector, β, α 1, and α 2, thereby keeping the dimension low. Step 3. Repeat Steps 1 and 2 until the estimates converge. We denote the resulting estimates by α and β. For the Newton Raphson method of Step 2, we use θ = θ H 1 θ of S f,θ S f,θ, where H θ is the matrix of the partial derivatives for fixed q i,r, r = 1, 2, 12. The computational step is similar to the EM algorithm as in every iteration, β, α 1, and

5 Semiparametric Approach for Non-Monotone Missing Covariates 5 α 2 are determined assuming the conditional distributions q i,r, r = 1, 2, 12 are known. 3. Asymptotic Properties We start with introducing some additional notation. Let = (S S f,α Ṱ,S Ṱ ) T. For convenience, from now on we f,α 1 f,α 2 will omit the arguments of π ij (Y i,x i( j),z i ), S β (Y i,x i,z i ), and h(y i,x i,z i,ω) and simply use π ij, S i,β, and h i (ω), respectively. Define D = (D ll ) ll, where D 11 = A αα lim n n 1 / α, D S f,α 12 = B lim n n 1 / β, D S f,α 21 = C lim n n 1 / α, and D S f,β 22 = A ββ lim n n 1 / β, and S f,β S adj R i1 R i2 S α1 (R i1,y i,x i,z i ) R i1 (1 R i2 )ES α1 (R i1,y i, i, f,α 1 X i,z i ) Y i, X i1,z i,r i1 = 1,R i2 = 0}ϒ i,α1,10 (1 R i1 )R i2 S α1 (R i1,y i,x i,z i ) (1 R i1 )(1 R i2 )ES α1 (R i1,y i,x i,z i ) Y i, Z i,r i1 = R i2 = 0}ϒ i,α1,00,s adj R i1 R i2 S α2 (R i2,y i,x i,z i ) i, f,α 2 R i1 (1 R i2 )S α2 (R i2,y i,x i,z i ) (1 R i1 )R i2 ES α2 (R i2,y i,x i, Z i ) Y i,x i2,z i,r i1 = 0,R i2 = 1}ϒ i,α2,01 (1 R i1 )(1 R i2 ) ES α2 (R i2,y i,x i,z i ) Y i,z i,r i1 = R i2 = 0}ϒ i,α2,00,s adj i, f,β R i1 R i2 S β (Y i,x i,z i )R i1 (1 R i2 )ES β (Y i,x i,z i ) Y i,x i1,z i,r i1 = 1,R i2 = 0}ϒ i,β,10 (1 R i1 )R i2 ES β (Y i,x i,z i ) Y i,x i2,z i,r i1 = 0,R i2 = 1}ϒ i,β,01 (1 R i1 )(1 R i2 )ES β (Y i,x i,z i ) Y i,z i, R i1 = R i2 = 0}ϒ i,β,00, where the expressions for the adjustment terms ϒ i,α1,10,ϒ i,α1,00,ϒ i,α2,01,ϒ i,α2,00,ϒ i,β,10,ϒ i,β,01, and ϒ i,β,00 are given in the Appendix. Now we have the following main results. Theorem 1. Under regularity conditions C1 C8 listed in Web Appendix A3, when n, (a) the influence function representation of the estimators is ( ) α α n = D 1 1 β β n n Sadj i, f,α S adj i, f,β op (1); (b) asymptotically n( β β) Normal(0, β ), where β = H 1 cov(s adj )H 1 2H 1 cov(s adj 1, f,β 1, f,β,sadj ) F T C T A T ββ 1, f,α A 1 ββ CF 1 cov(s adj )F 1 C T A 1 ββ, with H = A ββ CA 1 αα B and 1, f,α F = A αα BA 1 ββ C. The exact expression for D and a sketch proof of the above theorem are given in the Supplementary Materials. Corollary 1. (a) If we replace A αα, A ββ, B, and C in β by n 1 S fx,α / α, n 1 S fx,β / β, and n 1 S fx,α / β, n 1 S fx,β / α, respectively, then the resulting estimate β β ; (b) n 1 n Sadj n 1 n Sadj i, f,α SadjT i, f,α cov(s adj 1, f,β,sadj 1, f,α i, f,β SadjT i, f,β, n 1 n Sadj is consistent for i, f,β SadjT i, f,α, are consistent estimates for cov(sadj ), and cov(sadj 1, f,α ), respectively. and 1, f,β ), The proof of the corollary is straightforward and thus omitted. Note that the adjustment terms are zero for subjects i/ V =j : R j1 = R j2 = 1}. Fori V, ϒ i,α1,10 can be written as ϒ i,α1,10 = π1 π 2 h(y, X, Z, π 1 ) S α1 (1,Y,X,Z) a(y, X ( 2),Z,π 1 ) b } α 1 (1,Y,X ( 2),Z,π 1 ) I(X 1 = X i1,z = Z i ) a(y, X ( 2),Z,π 1 ) f (Y X i1,x 2,Z i )f (X 2 X i1,z i ) dy dx i2 pr(r 1 = R 2 = 1 X i1,z i ) π1 π 2 h(y, X, Z, π 1 ) = S α1 (1,Y,X,Z) a(y, X ( 2),Z,π 1 ) b } α 1 (1,Y,X ( 2),Z,π 1 ) I(X 1 = X i1,z = Z i ) a(y, X ( 2),Z,π 1 ) f (Y X i1,x 2,Z i )f (X 2 X i1,z i,r 1 =R 2 =1) pr(r 1 = R 2 = 1 X i1,x 2,Z i ) dy dx 2. Thus, ϒ i,α1,10 can be estimated by replacing f (X 2 X i1, Z i,r 1 = R 2 = 1) using the empirical density f (X2 X i1, Z i,r 1 = R 2 = 1) obtained from the completely observed data. Likewise, for i V, ϒ i,α1,00 can be estimated by π1 π 2 h(y, X, Z, π 1 π 2 ) S α1 (0,Y,X,Z) b } α 1 (0,Y,Z, π 1 π 2 ) a(y, Z, π 1 π 2 ) a(y, Z, π 1 π 2 ) f (Y X, Z i) f (X Zi,R 1 = R 2 = 1) pr(r 1 = R 2 = 1 X, Z i ) dy dx I(Z = Z i ) f (Y X, Z i) f (X Zi,R 1 = R 2 = 1) pr(r 1 = R 2 = 1 X, Z i ) dy dx. When Y is continuous, the integration can be evaluated via a quadrature formula. For discrete Y, the integration is replaced by a summation. 4. Extension to More Than Two Covariates The proposed method with p = 2 may be extended to the p>2 scenario by incorporating the missingness mechanism for each of the p missing variables. To estimate conditional distributions, the data can be stratified based on the conditioning variables that include Z and may include a part of X, with the strata (cell) frequencies used for estimating the conditional distributions. However, when the number of strata is large, some cell frequencies could be zero or very small, and consequently the estimates could be unstable. To solve this practical problem with a finite sample size and when the cross classification of the conditioning variables is large, some cells can be merged in a meaningful way, or the following more objective strategy can be adopted: first we seek to identify a set of important variables (they could be a linear combination of the conditioning variables) from the set of the conditioning variables using principal component analysis (PCA). In the second step, based on these identified variables in the first step, we stratify the data. This two-step technique will generally be effective if the dimension of the important variables is less than the dimension of the conditioning variables.

6 6 Biometrics For each partially missing variable X r (r = 1,...,p), we model pr(r r = 1 Y, X ( r),z,α r ) parametrically. Suppose that M i denotes the set of indices of the missing variables among X 1,...,X p for the ith observation. Then M i M c i =1,...,p}. Let the cardinality of M i be M i =m i, and m i can take values 0, 1,...,p. For m i > 0, define X i(mi ) (X ij1,...,x ijmi ):j 1,...,j mi M i }, and X i( Mi ) as X i1,...,x ip without X i(mi ). Then the score function corresponding to the ith observation for which m i > 0 involves an integral with respect to the conditional distribution of X i(mi ) given Z i,x i( Mi ) in the completely observed data V =i : R ij = 1,j = 1,...,p}. For empirical estimation of this conditional distribution, we replace f (Xi(Mi ) Z i,x i( Mi ),R 1 = =R p = 1) by f (Xi(Mi ) U (M i) 1,...,U (M i) G i,r 1 = =R p =1), where U (M i) 1,...,U (M i) G i are the first G i principal components (PCs) of (Z, X ( Mi )) based on the completely observed data. The number of PCs G i is chosen so that the G i components capture a high proportion of the total variability of (Z, X ( Mi )). Our numerical experience suggests that at least 80% for the proportion would be sufficient. Since PCA is used for continuous variables, to apply PCA on a discrete variable, we borrow Kolenikov and Angeles (2009) idea to first transform the discrete variable into a pseudo-continuous variable as follows. Let X r be an ordered categorical variable with C r categories, 0, 1,...,C r 1. Assume that there is a latent variable L Xr Normal(0, 1) such that X r takes on 0, c, and C r 1 when L Xr (,κ r1 ), L Xr κ rc,κ rc1 ), and L Xr κ Cr 1, ), respectively, for c = 1,...,C r 2. The unknown parameters κ rc,c= 1,...,C r 1, are estimated by the maximum likelihood method based on the data where R r = 1. Define the pseudo-continuous variable to be E(L Xr X r = c) = κ c uφ(u)du when X κ r = c, where φ( ) c 1 denotes a standard normal density. This technique is one way to reduce the dimension in order to produce a good approximation for our methodology. While it is only an approximate approach and the asymptotic theory does not strictly account for it due to these approximation errors, it is still a practically feasible numerical method to identify important variables. Indeed this dimension reduction technique works well in our simulation scenarios. The estimating equation for estimating β can now be written compactly S fx,β = where n S β (Y i,x i,z i ) q i,mi ( π i(mi )π i( Mi ))dx i(mi ) = 0, q i,mi ( π i(mi )π i( Mi )) = f (Y i X i,z i ) π i(mi )π i( Mi ) f } / (X i(mi ) U M i 1,...,U M i G i,r i1 = =R ip = 1) pr(r 1 = =R p = 1 X i,z i ) f (Y i X i,z i ) π i(mi )π i( Mi ) f } (X i(mi ) U M i 1,...,U M i G i,r i1 = =R ip = 1) dx i(mi ). pr(r 1 = =R p = 1 X i,z i ) Note that the integral in the denominator of q i,mi ( π i(mi )π i( Mi )) is simply a sum over all distinct observed values of X Mi in the completely observed data. Also, the estimating equation for α k, k = 1,...,p,is S fx,α k = n S αk (R ik,y i,x i,z i ) q i,mi ( π i(mi )π i( Mi ))dx i(mi ) = 0, where π i(mi ) = j M i pr(r ij = 0 Y i,x i,z i ) and π i( Mi ) = j M c pr(r ij = 1 Y i,x i,z i ). Extending the asymptotic arguments given previously, we obtain the influence function rep- i resentation of θ = (α T 1,...,αT p,βt ) T with n S adj = S β (Y i,x i,z i )q i,mi ( π i(mi )π i( Mi ))dx i(mi ) f x,β n K k=1 p j=1 R ij pr(r i1 = =R ip = 1 Z i,x i( Mk )) πi(mk ) π i( Mk )h(y i,x i,z i, π i(mk ) π i( Mk )) E a(y i,x i( Mk ),Z i, π i(mk ) π i( Mk )) } S β (Y, X, Z) S β Mk (Y, Z, X ( Mk )) ] Z = Z i,x ( Mk ) = X i( Mk ), where M k,k = 1,...,K, are K distinct patterns of missing data realized in the dataset and S β Mk (Y, Z, X ( Mk )) = S β (Y, X, Z)h(X, Y, Z, π Mk π ( Mk ))f (X Mk X ( Mk ),Z,R 1 = =R p = 1) dx Mk ]/a(y, X (Mk ),Z, π Mk π ( Mk )). Similarly, n S adj = S αl (R il,y i,x i,z i )q i,mi ( π i(mi )π i( Mi ))dx i(mi ) f x,α l n K k=1 p j=1 R ij pr(r i1 = =R ip = 1 Z i,x i( Mk )) πi(mk ) π i( Mk )h(y i,x i,z i, π i(mk ) π i( Mk )) E a(y i,x i( Mk ),Z i, π i(mk ) π i( Mk )) } S αl (R lmk,y,x,z) S αl M k (R lmk,y,z,x ( Mk )) ] Z = Z i,x ( Mk ) = X i( Mk ), for l = 1,...,p, where S αl M k (Y, Z, X ( Mk )) = S αl (R lmk, Y, X, Z)h(X, Y, Z, π Mk π ( Mk ))f (X Mk X ( Mk ),Z,R 1 = =R p = 1) dx Mk ]/a(y, X (Mk ),Z, π Mk π ( Mk )). 5. Simulation Studies 5.1. Simulation Designs We considered two scenarios. For scenario I we have two partially missing covariates and one completely observed covariate. We simulated cohort datasets each of size n = 1000 by

7 Semiparametric Approach for Non-Monotone Missing Covariates 7 simulating Z Bernoulli(0.5), X 1 Bernoulli(0.5), and X 2 Bernoulli(0.5). Then the response variable Y was generated from a Bernoulli distribution with logitpr(y = 1 X 1,X 2,Z)} = Z 0.3X 1 0.4X 2 0.2X 1 X 2. Next we generated the missing values in X 1 and X 2 by randomly simulating binary indicators R 1 and R 2 with logitpr(r 1 = 1 Y, Z, X 2 )}=Y 0.5Z X 2 and logitpr(r 2 = 1 Y, Z, X 1 )}= Y 0.5Z X 1, respectively. This NI-missingness mechanism yielded approximately 24% missing values in X 1 as well as in X 2. We also considered the MAR mechanism to generate missing values in X 1 and X 2 by simulating R 1 and R 2 from the Bernoulli distribution with success probabilities logitpr(r 1 = 1 Y, Z)} =0.4 Y Z and logitpr(r 2 = 1 Y, Z)} =0.4 Y Z, respectively. In addition, the Supplementary Materials contain some results for NI missing data. For scenario II, we considered 9 partially missing covariates and one completely observed covariate. We simulated a cohort dataset of size n = 200 by simulating Z Bernoulli(0.5), X r Bernoulli(0.5) for r = 1, 2, 3, 4, X r Gamma(2, 2) with E(X r ) = 1 for r=5, 6, X 7 Normal(0, 1), X 8 =I(X 1 =0)Normal( 0.25, 1) I(X 1 = 1)Normal(0.25, 1), and X 9 Normal( 0.2X 1 0.2X 2, 1). The response variable Y was taken to be binary with conditional success probability logitpr(y=1 Z, X 1,...,X 9 )}= 2 0.1Z 0.2X 1 0.3X 2 0.4X 3 0.5X 4 0.5X 5 0.5X 6 0.4X 7 0.3X 8 0.2X 9. The NI-data were generated by simulating the missingness indicators with success probabilities logitpr(r r = 1 Y, Z, X ( r) )}= Y 0.25Z X l=1,l r l, r = 1,...,9, which resulted in approximately 10 17% missing values for each of X 1 through X Method of Analysis The simulated datasets were analyzed by four approaches. First, we analyzed the data without any missing values, where the estimates were obtained by maximizing the full data likelihood n f (Y i X i,z i ). This approach was referred to as the full data (FD) method. We then analyzed the data using, where the subjects containing any missing values in X k were discarded. The third approach considered here was the extension of the mean-score method (only for scenario I). For this method the underlying assumption was that the data are MAR. Finally, we analyzed the datasets using the proposed semiparametric approach which we referred to as. In we modeled the missingness mechanism using a linear logistic regression (i.e., the logit of the selection probability is a linear function of the conditioning variables) Results Table 1 contains the bias, empirical and estimated standard errors, 95% coverage probabilities based on the Wald-type confidence intervals, and MSE for the five regression parameters involved in the model f (Y X 1,X 2,Z) for scenario I. The results in Table 1 indicate that when the data are MAR, the mean-score and methods performed equally well. Note that both of the approaches have much smaller variances compared with. When data are NI-, the proposed approach is much better than the mean-score approach in terms of bias. For the NI-data the mean-score method produces biased results. In both the MAR and NI-scenarios, also produced biased results. Moreover, Table 1 shows that the empirical and estimated standard errors are reasonably close for all methods, and the empirical coverage probabilities for are close to the nominal level but not always so for the other methods. Additionally, under some NI missing data (presented in the Supplementary Materials), shows the least bias among all methods. For scenario II (Table 2), we only present the results based on and the proposed approach. We noticed that faced some convergence problems (1 2% of the data). The results for are based on the datasets where estimates converged and indicate that the proposed approach performs much better than in terms of bias and standard errors. While shows bias in the parameter estimates, the coverage probabilities for the analysis are usually close to the nominal level due to large standard errors. The bias in depends on how strongly the missingness mechanism depends on the response variable (results not presented here) Robustness Study for Model Misspecification The proposed approach depends on a parametric model assumption of the missing probabilities. Therefore, in principle, a model misspecification can cause bias in the parameter estimates. In order to assess such possible bias, we conducted the following additional simulation with scenario II where NI-missing data were created by simulating the missingness indicators with success probabilities logitpr(r r = 1 Y, Z, X ( r) )}= Y 0.25Z l r X2 l. The resulting missing percentages for each of the nine variables vary between 9% and 12%. However, in the analysis missingness was modeled as logitpr(r r = 1 Y, Z, X ( r) )}=α 0r α 1r Y α 2r Z 9 α l r (2l),rX l resulting in a model violation. The results (Table 3) indicate the method is hardly affected by this model misspecification. Table 4 contains the results when the actual missingness mechanism was logitpr(r 1 = 1 Y, Z)} = 1 0.5Y 0.5Z, logitpr(r j = 1 Y, Z, R (j 1) )}= Y 0.5Z 0.8R (j 1), j = 2,...,9. This mechanism violates our assumption that the missingness mechanism of each variable is independent conditional on the variables. In this scenario missing percentage for each variable varies between 18% and 21%. In the analysis, missingness was modeled as logitpr(r r = 1 Y, Z, X ( r) )}=α 0r α 1r Y α 2r Z 9 α l r (2l),rX l. The analysis had numerical convergence problems for 68% of the datasets out of 500 replications, and here we present the results based only on the converged datasets. There was no convergence issue for, which continued to outperform. 6. Data Examples 6.1. Analysis of the Los Angeles Endometrial Cancer Data This is a 1:4 matched case-control study of endometrial cancer (Breslow and Day, 1980), where three controls were matched with every case subject based on neighborhood of residence and age in an affluent retirement community in Los Angeles. All subjects in the study were post menopausal women. The dataset contains several risk factors, including presence

8 8 Biometrics Table 1 Results of the simulation study for scenario 1 based on 500 replications Method β 0 β 1 β 2 β 3 β 4 FD Bias EMP.SE EST.SE CP MSE MAR mechanism Bias EMP.SE EST.SE CP MSE Mean-score extension Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE NI- mechanism Bias EMP.SE EST.SE CP MSE Mean-score extension Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE Here FD,,, EMP.SE, EST.SE, and CP stand for the full data analysis, complete case, the proposed semiparametric method, empirical standard error, estimated standard error, and the 95% coverage probability, respectively. All entries are multiplied by 10. of gall bladder disease (Z 1 ), presence of hypertension (Z 2 ), presence of obesity (X 1 ), dose of conjugated estrogen used (X 2, denoted as Dose), and duration of conjugated estrogen used (X 3, denoted as Duration). The last three variables had 17%, 3%, and 6% missing values, respectively, while the first two were completely observed. The non-monotone pattern of missing data is given in the top panel of Table 5, and the counts do not show any evidence of dependence between R 1 and R 2, and between R 1 and R 3. Although these counts exhibit strong evidence of dependence between R 2 and R 3, this does not rule out the possibility of independence of these missingness indicators conditional on the covariates. Furthermore, among the subjects with R 2 R 3 = 1, the missingness of obesity is strongly associated with Duration (pvalue= 0.004). Also, among the subjects with R 1 R 2 = 1, the missingness of Duration is strongly associated with Dose

9 Semiparametric Approach for Non-Monotone Missing Covariates 9 Table 2 Results of the simulation study based on 500 replications for scenario 2 with NI-mechanism Method β 0 β 1 β 2 β 3 β 4 β 5 β 6 β 7 β 8 β 9 β 10 FD Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE Here FD,,, EMP.SE, EST.SE, and CP stand for the full data analysis, complete case, the proposed semiparametric method, empirical standard error, estimated standard error, and the 95% coverage probability, respectively. All entries are multiplied by 10. (p-value= 0.021). These facts motivated us to assume the NImechanism and estimate the parameters using the proposed method. Define Y = 1 if a subject has endometrial cancer and Y = 0 otherwise. We fit a logistic regression model logitpr(y = 1 Z 1,Z 2,X 1,X 2,X 3 )}=β 0 β 1 Z 1 β 2 Z 2 β 3 X 1 β 4 X 2 β 5 X 3 as if the data were unmatched and cohort. We estimated the model parameters using and. For the missingness model in we considered a logistic regression model, that is, logitpr(r r = 1 X ( r),z,y)} =α r0 α r1 Y 2 α j=1 r(1j)z j 3 α j=1,j r r(3j)x j. The estimates, standard errors, and p values for β from both methods are presented Table 3 Results of the simulation study based on 500 replications for scenario 2 with NI-mechanism with logitpr(r r = 1 Y, Z, X ( r) )}= Y 0.25Z l r X2 l Method β 0 β 1 β 2 β 3 β 4 β 5 β 6 β 7 β 8 β 9 β 10 FD Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE For the method the missingness mechanism is incorrectly modeled. Here FD,,, EMP.SE, EST.SE, and CP stand for the full data analysis, complete case, the proposed semiparametric method, empirical standard error, estimated standard error, and the 95% coverage probability, respectively. All entries are multiplied by 10.

10 10 Biometrics Table 4 Results of the simulation study based on 500 replications for scenario 2 where the missingness mechanism is logitpr(r 1 = 1 Y, Z)} =1 0.5Y 0.5Z, logitpr(r j = 1 Y, Z, R (j 1) )}= Y 0.5Z 0.8R (j 1), j = 2,...,9 Method β 0 β 1 β 2 β 3 β 4 β 5 β 6 β 7 β 8 β 9 β 10 FD Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE For the method the missingness mechanism is incorrectly modeled. Here FD,,, EMP.SE, EST.SE, and CP stand for the full data analysis, complete case, the proposed semiparametric method, empirical standard error, estimated standard error, and the 95% coverage probability, respectively. All entries are multiplied by 10. in Table 5. has lower standard errors for all the coefficient estimates. Except for X 1 and X 3, the parameter estimates for both the methods are somewhat similar. Further investigation revealed that the missingness mechanism of X 1 strongly depends on X 3 (p-value = 0.005) and the missingness mechanism of X 3 strongly depends on X 1 (p-value = 0.034). Based on the results of, the presence of gall bladder disease and duration of conjugated estrogen used are statistically significant (at the 5% level) as risk factors for endometrial cancer. Table 5 The top panel shows the pattern of missing data in the Los Angeles Endometrial Cancer data with 1 and 0 representing the observed and missing variable, respectively, and the bottom panel contains the results of the data analysis Dose of conjugated Duration of conjugated Pattern Obesity estrogen estrogen used Count Method Gall bladder Hypertension Obesity Dose log (duration) Estimate SE p-value Estimate SE p-value Here and stand for the complete case and the proposed semiparametric method, respectively.

11 Semiparametric Approach for Non-Monotone Missing Covariates 11 Table 6 Results of the hip fracture data analysis Method Estimate SE p-value Estimate SE p-value EtOH < <0.001 Smoke Dementia < <0.001 Antiseiz <0.001 LevoT AntiChol Albumin BMI log(hgb) Here and stand for the complete case and the proposed semiparametric method, respectively Analysis of the Hip Fracture Data The second data example is a 1:1 age and race matched casecontrol study of possible risk factors of hip fracture among male veterans with 218 strata (Chen, 2004). The study was conducted at the University of Illinois at Chicago College of Medicine (Barengolts et al., 2001). Many potential risk factors (25 altogether) in addition to age and race were recorded. Preliminary exploratory analysis suggested that nine of these risk factors are potentially important. Among them EtOH(X 1 ), Smoke (X 2 ), Dementia (X 3 ), Antiseizure (X 4 ), LevoT4 (X 5 ), and anti-cholesterol (X 6 ) are dichotomous variables, whereas the remaining three Albumin (X 7 ), BMI (X 8 ), and log(hgb) (X 9 ) are continuous in nature. All of these predictors contain some missing values, with the percentage of missing values being approximately 10%, 12%, 3%, 5%, 9%, 10%, 24%, 10%, and 12%, respectively. Although there could be at most = 511 missing patterns, the observed number of patterns (non-monotone) is 37, and these patterns are presented in Table 2 of Chen (2004). The presence (Y = 1) or absence (Y = 0) of hip fracture was considered as the binary outcome. We aimed to find the estimate of the logodds ratio parameters by fitting a logistic regression model to the data: logitpr(y = 1 X)} =β 0 9 r=1 X rβ r. Note here that β 1,...,β 9 are the main parameters of interest. Following Chen (2004), in the model we ignored age and race as these variable were used as matching variables. First we carried out the analysis and then we analyzed the data using. In the latter, we assumed that the missingness mechanisms follow a logistic model: logitpr(r r = 1 X ( r),y)} =α r0 α r1 Y 9 j=1,j r α r(j1)x j for r = 1,...,9. We applied the dimension reduction technique for calculating the conditional distributions based on the completely observed data. In Table 6 we present the estimates, standard errors, and p values for β-parameters. The estimates differ somewhat between the methods. However, the standard errors of are substantially smaller than that of. The results indicate that all the covariates have statistically significant associations (at the 5% level) with the occurrence of hip fracture. Our results are consistent with the results of Chen (2004) who analyzed this data under the MAR assumption. 7. Discussion This article deals with non-monotone patterns of missing covariate data in a parametric regression model. Here we address two important issues simultaneously: extending the missingness mechanism beyond MAR and handling possible association among the partially missing covariates. The method relies on a parametric model for the missingness mechanism, and possible associations among the covariates are handled empirically without using any parametric structure. The NI-mechanism may be used when the MAR assumption is likely to be violated (Potthof et al., 2006) or in the scenario when the NI missing assumption is plausible. If we assume that the missingness mechanism is completely non-ignorable, then we will require a parametric model for the partially missing covariate. This parametric modeling of the missing covariate distribution is avoided by imposing the NI-mechanism assumption. The simulation results indicate that the proposed approach works remarkably well in terms of bias and standard error. Furthermore, it appears robust when some of the underlying assumptions about the missingness mechanism are violated. We have also derived the asymptotic normality of the proposed estimator and an asymptotically consistent formula to compute the standard errors. Our proposed method is an estimated-score approach to estimate the parameters of the model. Alternatively, a maximum likelihood estimator (MLE) or a nonparametric maximum likelihood estimator (NPMLE) is also possible and likely to be more efficient. However, in both cases of MLE and NPMLE the computation could be challenging for a large dimension of the parameter vector as all parameters (β, α 1,...,α p, and any additional parameters) need to be estimated simultaneously in each step of the EM algorithm. In contrast, in each iteration of our EM algorithm, β, α 1,...,α p are estimated separately. Thus, a further theoretical and numerical investigation of the MLE-type estimator would be interesting. Finally, in principle the proposed idea can be generalized to handle both missing covariates and the missing response variable, especially in the context of longitudinal data analyses.

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Yi-Hau Chen Institute of Statistical Science, Academia Sinica Joint with Nilanjan