Semiparametric Approach for Non-Monotone Missing Covariates in a Parametric Regression Model

Size: px
Start display at page:

Download "Semiparametric Approach for Non-Monotone Missing Covariates in a Parametric Regression Model"

Transcription

1 Biometrics DOI: /biom Semiparametric Approach for Non-Monotone Missing Covariates in a Parametric Regression Model Samiran Sinha, 1 Krishna K. Saha, 2 and Suojin Wang 1, * 1 Department of Statistics, Texas A&M University, College Station, Texas 77843, U.S.A. 2 Department of Mathematical Sciences, Central Connecticut State University, New Britain, Connecticut 06050, U.S.A. sjwang@stat.tamu.edu Summary. Missing covariate data often arise in biomedical studies, and analysis of such data that ignores subjects with incomplete information may lead to inefficient and possibly biased estimates. A great deal of attention has been paid to handling a single missing covariate or a monotone pattern of missing data when the missingness mechanism is missing at random. In this article, we propose a semiparametric method for handling non-monotone patterns of missing data. The proposed method relies on the assumption that the missingness mechanism of a variable does not depend on the missing variable itself but may depend on the other missing variables. This mechanism is somewhat less general than the completely non-ignorable mechanism but is sometimes more flexible than the missing at random mechanism where the missingness mechansim is allowed to depend only on the completely observed variables. The proposed approach is robust to misspecification of the distribution of the missing covariates, and the proposed mechanism helps to nullify (or reduce) the problems due to nonidentifiability that result from the non-ignorable missingness mechanism. The asymptotic properties of the proposed estimator are derived. Finite sample performance is assessed through simulation studies. Finally, for the purpose of illustration we analyze an endometrial cancer dataset and a hip fracture dataset. Key words: Dimension reduction; Estimating equations; Missing at random; Non-ignorable missing data; Robust method. 1. Introduction In observational biomedical studies or survey designs, we often encounter partially missing variables, or variables that are not observed or recorded for all the subjects in a study. For example, in the first motivating dataset from the Los Angeles Endometrial Cancer Study, three of the five covariates are partially missing, and in the second dataset on hip fractures, all nine covariates are partially missing. The simplest method, commonly known as complete case () analysis, ignores the subjects with partially missing entries and takes into account the subjects only with complete information on all covariates. This approach may lose efficiency. It may also lead to biased estimates depending on the missingness mechanism. The problem can be severe especially when there are multiple covariates with missing values, even if each of them contains a low to moderate percentage of missing entries. The aims of this article are to introduce a new missingness mechanism for non-monotone missing data and to propose a semiparametric method to estimate the model parameters. Suppose that our interest is in a finite dimensional parameter β in the parametric regression model f (Y X, Z; β) where Y is the outcome variable and X = (X 1,...,X p ) T and Z = (Z 1,...,Z q ) T are the explanatory variables. We assume that Y and Z are observed for all subjects, whereas all the components of X are partially observed. Let (X i,y i,z i ), i = 1,...,n, be independent and identically distributed (iid) copies of (X, Y, Z), where n is the sample size. Define the missingness indicator variables 1 if Xij is observed, R ij 0 if X ij is not observed, j = 1,...,p and X i( r) (X i1,...,x i(r 1),X i(r1),...,x ip ) T. There are two possibilities for multiple missing covariates: (1) monotone missing data and (2) non-monotone missing data. For monotone missing data, if X ij is missing, then X ik, k>jare all missing, that is, pr(r i(j1) = =R ip = 0 R ij = 0) = 1 for any j = 1,...,(p 1). In this article, we focus on non-monotone missing covariate data. Regarding the missing data mechanism, the missing at random (MAR) assumption is widely used, under which the probability that R ij = 1 depends only on the observed quantities (Rubin, 1976). Under the MAR mechanism and parameter distinctness assumption, valid likelihood based inferences may be carried out without using a model for the missingness mechanism. For handling MAR covariate data, multiple imputation is a commonly used method. Within the broader context of imputation approach, Reilly and Pepe (1995) proposed the mean-score approach and Chatterjee, Chen, and Breslow (2003) proposed a pseudo-score approach which is usually more efficient than the mean-score approach if the assumed model for the missingness mechanism is correct. Robins, Rotnitzky, and Zhao (1994) proposed an efficient method within the class of inverse probability weighted (IPW) estimating 2014, The International Biometric Society 1

2 2 Biometrics equations for handling the MAR data which is similar in spirit to the Horvitz Thompson estimator. Lipsitz, Ibrahim, and Zhao (1999) proposed a doubly robust method that blends the likelihood based approach and the weighted estimating equations. Ibrahim, Chen, and Lipstiz (1999) proposed a full likelihood based method using the EM algorithm. They assumed a parametric model for the partially missing covariate. Under the MAR assumption, Chen (2004) proposed a semiparametric method for handling multiple missing covariates for any arbitrary pattern of missing data. He assumed a parametric model for the odds ratio between any two missing covariates, which is then used in the likelihood formation. Non-ignorable (NI) mechanism happens if the probability that R ij = 1 depends on values of the completely observed and partially observed variables, meaning that the missingness mechanism of X ij may depend on all components of X i = (X i1,...,x ip ) T along with Y i and Z i. For NI missing data, the model parameters are estimated by jointly maximizing the likelihood of the data generation process and the missingness process. This requires a parametric model specification of the distribution of the missing covariates and the missingness process. Consistent parameter estimates are obtained under the assumption that both the models are correctly specified. Ibrahim, Lipsitz, and Chen (1999) wrote a comprehensive review of NI missing covariate data scenario, and they considered arbitrary missing data patterns. One disadvantage of the NI missingness mechanism is that data provide little or no information regarding the parameters in the missingness models. Hence, model parameters are weakly identifiable (Little, 1995). Non-monotone missing data are usually handled by assuming the MAR mechanism (Robins and Gill, 1997) or the NI mechanism (Robins, 1997). In the former case, analysis can be done without assuming any model for the missingness mechanism and any parametric model for the partially missing covariates. In the latter case, analysis is usually done by positing a parametric model for the missingness mechanism and partially missing covariates. Except for some special cases (e.g., Lei and Wang, 2001), the MAR and NI assumptions are generally not testable. To fill the substantial gap between the MAR and NI mechanisms, we introduce another mechanism which is somewhat less general than the NI mechanism but facilitates our calculations as it can be used to tackle the identifiability issue. To handle the non-monotone pattern of missing data, we define missingness mechanism NI- when the probability that R ij = 1 may depend on all variables except the jth covariate X ij, that is, pr(r ij = 1 Y i,x i,z i ) = pr(r ij = 1 Y i,x i( j),z i ). Thus, we assume that R ij may depend on the other variables which are completely observed or partially missing, but does not depend on the values of the jth covariate X ij. Further, we assume that under the NI-mechanism, (1) R i1,...,r ip are independent conditional on X i,y i,z i, and (2) for every j, pr(r ij = 1 Y i,x i,z i ) > 0 with probability one for every combination of (Y i,x i,z i ). Under the independence assumption (1), the NImechanism helps to reduce the identifiability issue encountered by the non-ignorable missing data. Consequently, we do not need to make a parametric model assumption regarding the distribution of the missing covariates. Our NI-mechanism is partly non-ignorable because the probability model for R j may depend on (X 1,...,X j 1,X j1,...,x p ) which are partially observed. In summary, the proposed NI-mechanism is sometimes more flexible than the MAR mechanism where the missingness mechanism is allowed to depend only on the completely observed variables. Further, the NI-mechanism allows us to estimate the model parameters without specifying a parametric model for the missing covariate that is needed for the NI missingness mechanism. When the MAR assumption is in doubt, it may be preferable to assume the NI-mechanism and use the proposed method of estimation. The remainder of this article is outlined as follows: Sections 2 and 3 contain the detailed methodology and asymptotic properties for the case of two partially missing covariates. An extension of the method for more than two partially missing covariates is discussed in Section 4. Sections 5 and 6 contain simulation studies and two real data examples, respectively. Some concluding remarks are given in Section 7. Technical material is provided in the Appendix. 2. Estimation Methodology 2.1. Background The goal is to estimate β. Without any missing entries in X, and under certain regularity conditions, the MLE of β is obtained by solving the score function S nm = n S β(y i, X i,z i )=0, where S β (Y,X,Z)= logf (Y X,Z;β)}/ β for f (Y X,Z; β) satisfying some standard regularity conditions. For the analysis, β is estimated by solving S cc = n ( p j=1 R ij)s β (Y i, X i,z i ) = 0. Note that Reilly and Pepe (1995) and Chatterjee et al. (2003) both considered the case where all R ij, j =1,...,p are equal for a given i, that is, either R i1 = =R ip = 1or R i1 = =R ip = 0 and their missingness mechanism depends only on the observable quantities Proposed Method for NI-Missing Data In this Section, we consider the case of p = 2. We will use the following notation π ij (Y i,x i( j),z i ) = 1 π ij (Y i,x i( j),z i ) pr(r ij = 1 Y i,x i( j),z i ). We first assume that both the covariates X and Z are discrete with fixed numbers of categories while Y can be either discrete or continuous. In the following derivation, any integration with respect to a discrete covariate implies a summation. The likelihood of the data is L = i:r i1 =R i2 =1 f (X i1,x i2 Z i ) π i1 (Y i,x i( 1),Z i )π i2 (Y i,x i( 2),Z i )f (Y i X i,z i ) i:r i1 =1,R i2 =0 π i1 (Y i,x i( 1),Z i ) f (Y i X i,z i )f (X i1,x i2 Z i )dx i2 } π i2 (Y i,x i( 2),Z i ) i:r i1 =0,R i2 =1 π i1 (Y i,x i( 1),Z i ) f (Y i X i,z i )f (X i1,x i2 Z i )dx i1 } π i2 (Y i,x i( 2),Z i )

3 Semiparametric Approach for Non-Monotone Missing Covariates 3 i:r i1 =0,R i2 =0 π i1 (Y i,x i( 1),Z i ) π i2 (Y i,x i( 2),Z i ) f (Y i X i,z i )f (X i1,x i2 Z i )dx i1 dx i2 }. The score function for estimating β is S fx,β = i:r i1 =R i2 =1 i:r i1 =1,R i2 =0 i:r i1 =0,R i2 =1 i:r i1 =0,R i2 =0 S β (Y i,x i,z i ) S β (Y i,x i,z i )q i,2 (π i1 )dx i2 S β (Y i,x i,z i )q i,1 (π i2 )dx i1 S β (Y i,x i,z i )q i,12 ( π i1 π i2 )dx i, (1) where for r = 1, 2, 12, q i,r (ω) = f (Y i X i,z i )ω(y i,x i,z i )f (X ir X i( r),z i )/ f (Y i X i,z i )ω(y i,x i,z i ) f (X ir X i( r),z i )dx ir is a function of the given parametric model f (Y i X i,z i ), the distribution of the missing covariate given the completely observed covariates, and ω. Here r = 12 means both X i1 and X i2 are missing. For obtaining q i,2 (π i1 ), q i,1 (π i2 ), and q i,12 ( π i1 π i2 ), we set ω = π i1,π i2, and π i1 π i2, respectively. For estimating q i,r (ω) we face two issues: estimation of the missing covariate distribution and of the missingness probabilities (selection probabilities). One obvious approach is to model the missingness probability and the covariate distribution parametrically. However, modeling the distribution of the covariates parametrically is not an easy task, and model misspecification can lead to biased estimates. Therefore, we use an empirical approach to estimate f (X ir X i( r),z i ) and use a parametric model for the missingness probabilities. Suppose that the missingness probabilities are identified by a finite dimensional parameter α = (α T 1,αT 2 )T, where α k is for X k for k = 1, 2. We will denote ω(y i,x i,z i ) by ω(y i,x i,z i,α). Adopting the technique of Chatterjee et al. (2003) to our setting, we write the conditional density of X ir given X i( r) and Z i in terms of the conditional density of X ir given X i( r) and Z i in the completely observed data V =i : R i1 = R i2 = 1} as f (X ir X i( r),z i ) = f (X ir X i( r),z i,r i1 =R i2 =1)pr(R i1 =R i2 =1 X i( r),z i ). pr(r i1 = R i2 = 1 X i,z i ) (2) Plugging this into the expression of q i,r we obtain q i,r (ω) = h(y i,x i,z i ; ω)f (X ir X i( r),z i,r i1 = R i2 = 1) / h(y i,x i,z i ; ω)f (X ir X i( r),z i,r i1 = R i2 = 1) dx ir, where h(x i,y i,z i ; ω) = f (Y i X i,z i )ω(x i,y i, Z i )/pr(r i1 = R i2 = 1 X i,z i ). Following a strategy similar to that of Chatterjee et al. (2003) we obtain pr(r i1 = R i2 = 1 X i,z i ) = pr(r i1 = R i2 = 1 X i,y i, Z i )f (Y X i,z i ; β)dy. Therefore, q i,1 (π i2 ), q i,2 (π i1 ), and q i,12 ( π i1 π i2 ) involve f (X i1 X i( 1),Z i,r i1 = R i2 = 1), f (X i2 X i( 2), Z i,r i1 = R i2 = 1), and f (X i Z i,r i1 = R i2 = 1), respectively. Note that these functions must be estimated empirically based on the completely observed data V. Observe that while pr(r i1 = R i2 = 1 X i( r),z i ) appears in the numerator of (2), it does not appear in q i,r (ω). An estimator of q i,r (ω) denoted by q i,r (ω) is obtained after replacing f (X ir X i( r),z i,r i1 = R i2 = 1) by its empirical estimates f (Xir X i( r),z i,r i1 = R i2 = 1) = d F(Xir X i( r),z i,r i1 = R i2 = 1), where F(x X i( r),z i,r i1 = R i2 = 1) n j=1 = I(X jr x, X j( r) =X i( r),z j =Z i,r j1 =R j2 =1) n I(X j=1 j( r) = X i( r),z j = Z i,r j1 = R j2 = 1) and in particular when r = 12, F(x 1,x 2 Z i,r i1 = R i2 = 1) n j=1 = I(X j1 x 1,X j2 x 2,Z j = Z i,r j1 = R j2 = 1) n I(Z. j=1 j = Z i,r j1 = R j2 = 1) After replacing q i,r by q i,r (r = 1, 2, 12) in (1) we obtain the estimated score functions S fx,β = i:r i1 =R i2 =1 i:r i1 =1,R i2 =0 i:r i1 =0,R i2 =1 i:r i1 =0,R i2 =0 S β (Y i,x i,z i ) S β (Y i,x i,z i ) q i,2 (π i1 )dx i2 S β (Y i,x i,z i ) q i,1 (π i2 )dx i1 S β (Y i,x i,z i ) q i,12 ( π i1 π i2 )dx i. For estimating β we also need to estimate α 1 and α 2. Define S αk (R ik,y i,x i,z i ) = R ik logπ ik (Y i,x i( k),z i )}/ α k (1 R ik ) log π ik (Y i,x i( k),z i )}/ α k for k = 1, 2. Then we can estimate α 1 and α 2 by solving the estimating equations S f x,α 1 = i:r i1 =R i2 =1 i:r i1 =1,R i2 =0 i:r i1 =0,R i2 =1 i:r i1 =0,R i2 =0 S α1 (R i1,y i,x i,z i ) S α1 (R i1,y i,x i,z i ) q i,2 (π i1 )dx i2 S α1 (R i1,y i,x i,z i ) S α1 (R i1,y i,x i,z i ) q i,12 ( π i1 π i2 )dx i = 0

4 4 Biometrics and S fx,α 2 = i:r i1 =R i2 =1 i:r i1 =0,R i2 =1 i:r i1 =1,R i2 =0 i:r i1 =0,R i2 =0 S α2 (R i2,y i,x i,z i ) S α2 (R i2,y i,x i,z i ) q i,1 (π i2 )dx i1 S α2 (R i2,y i,x i,z i ) S α2 (R i2,y i,x i,z i ) q i,12 ( π i1 π i2 )dx i = 0. If the missingness model of R r does not depend on X ( r) then the missingness model becomes a model for the MAR data. Therefore, the resulting method will produce consistent estimates when the data are MAR, and the method will reduce to the pseudo-score method of Chatterjee et al. (2003). While there is no direct test to verify this assumption, the NImechanism would be worth considering if R 1 (R 2 ) depends on X 2 (X 1 ) in the subset i : R i2 = 1} (i : R i1 = 1}). When any component of X or Z is continuous, a similar method can be developed where conditional probability mass function (pmf) must be replaced by a conditional density function estimated by, say, a kernel method. This approach requires a full scale technical and numerical investigation and is a problem for future research. Alternatively, in this work we approximate the conditional density by the empirical conditional pmf after discretizing the continuous components. Although this method is an approximation, our finite sample numerical studies suggest that it works well. Note that in principle discretization should be refined with sample size so that the conditional pmf converges to the conditional density as n grows to infinity. How to optimize the rate of refinement as a function of the sample size is an interesting problem for future work The Issue of Identifiability The likelihood identifiability is a sophisticated issue in the general missing data context. Through the following simple example we illustrate how the identifiability issue under the NI data is eased under the NI-missing data assumption when the models for the missingness indicators and the covariate distribution are nonparametric. Suppose that potentially observable data are n iid copies of (Y, X 1,X 2,R 1,R 2 ) without Z, where Y is a binary response variable, X 1 and X 2 both are scalar binary variables, and R k is a missingness indicator for X k for k = 1, 2. Let n x1,x 2,y be the number of observations with X 1 = x 1, X 2 = x 2, Y = y, and for R 1 = R 2 = 1; m x1,,y be the number of observations with X 1 = x 1, missing X 2, and Y = y (i.e., where R 1 =1 and R 2 = 0); m,x2,y be the number of observations with missing X 1, X 2 = x 2, and Y = y (i.e., where R 1 = 0 and R 2 = 1); and m,,y be the number of observations with missing X 1 and X 2, and Y = y (i.e., where R 1 = 0 and R 2 = 0). Thus, n 0,0,0 n 0,0,1 n 0,1,0 n 0,1,1 n 1,0,0 n 1,0,1 n 1,1,0 n 1,1,1 m,0,0 m,0,1 m,1,0 m,1,1 m 0,,0 m 0,,1 m 1,,0 m 1,,1 m,,0 m,,1 = n. Let u x1,x 2 = pr(x 1 = x 1,X 2 = x 2 ), v x1,x 2 = pr(y = 1 X 1 = x 2,X 2 = x 2 ), π x (1) 1,x 2,y = pr(r 1 = 1 X 1 = x 1,X 2 = x 2,Y = y), and π x (2) 1,x 2,y = pr(r 2 = 1 X 1 = x 1,X 2 = x 2,Y = y). The likelihood of the observed data involves 23 parameters (4v-parameters, 3u-parameters, 8π (1) -parameters, and 8π (2) -parameters) whereas the data contain 18 1 = 17 independent patterns; therefore the parameters are not identifiable from the data. However, this issue may be eased by imposing some model restrictions such as our NI-mechanism. In the NImechanism, there are at most 4π (1) -parameters (no X 1 involved) and 4π (2) -parameters (no X 2 involved), resulting in a total of 15 model parameters, which is fewer than the number of observed patterns of the data. Let L be the likelihood of the observed data and θ = (u 00,u 01,u 10,v 00, v 01,v 10,v 11,π (1) 00,π (1) 01,π (1) 10,π (1) 11,π (2) 00,π (2) 01,π (2) 10,π (2) 11 ) T be the set of parameters under the NI-mechanism. In the Supplementary Materials, we show that E log(l)/ θ log(l)/ θ} T ] is non-singular, a sufficient condition for identifiability (Rothenberg, 1971). The issue in a more general setting is interesting and worth investigating. Additionally, partial information regarding α 1 (generally a non-saturated equivalent of π (1) ) involved in pr(r 1 = 1 X 2,Y; α 1 ) can be obtained from the data (R 1,X 2,Y) among the subjects with R 2 = 1. Note that these subjects contri- bution to the likelihood for α 1 is i:r i2 =1 prr i1(r i1 = 1 R i2,x i1,x i2,y i )pr 1 R i1(r i1 = 0 R i2,x i1,x i2,y i ) = i:r i2 =1 pr R i1(r i1 = 1 X i2,y i )pr 1 R i1(r i1 = 0 X i2,y i ) due to our assumption that R 1 R 2 X 1,X 2,Y and R 1 X 1 X 2,Y. To clarify the notation, the R i1 inside the probability statements above is a random variable while the R i1 on the exponents is the observed value of the random variable. Similarly, information regarding α 2 can be obtained from the data (R 2,X 1,Y) among the subjects with R 1 = Computational Steps Let θ = (α T,β T ) T. Then θ is estimated by solving the estimating equations = 0 iteratively, where S S f,θ Ṱ = f x,θ (S Ṱ,S Ṱ,S Ṱ ). Let α(t) 1, α (t) 2, and β (t) be the parameter f x,α 1 f x,α 2 f x,β estimates at the tth iteration. Step 0. Initialize α 1 = α (0) 1, α 2 = α (0) 2, and β = β (0). Step 1. For the (t 1)th iteration first compute q i,r for r = 1, 2, 12 using α 1 = α (t) 1, α 2 = α (t) 2, and β = β (t). Then insert these quantities into the estimating equations S fx,θ = 0. Step 2. Solve for α 1,α 2, and β from = 0 using the S fx,θ Newton Raphson method by assuming that q i,r is fixed. Observe that conditional on q i,r, r = 1, 2, 12, the Newton Raphson method can be carried out separately for each parameter vector, β, α 1, and α 2, thereby keeping the dimension low. Step 3. Repeat Steps 1 and 2 until the estimates converge. We denote the resulting estimates by α and β. For the Newton Raphson method of Step 2, we use θ = θ H 1 θ of S f,θ S f,θ, where H θ is the matrix of the partial derivatives for fixed q i,r, r = 1, 2, 12. The computational step is similar to the EM algorithm as in every iteration, β, α 1, and

5 Semiparametric Approach for Non-Monotone Missing Covariates 5 α 2 are determined assuming the conditional distributions q i,r, r = 1, 2, 12 are known. 3. Asymptotic Properties We start with introducing some additional notation. Let = (S S f,α Ṱ,S Ṱ ) T. For convenience, from now on we f,α 1 f,α 2 will omit the arguments of π ij (Y i,x i( j),z i ), S β (Y i,x i,z i ), and h(y i,x i,z i,ω) and simply use π ij, S i,β, and h i (ω), respectively. Define D = (D ll ) ll, where D 11 = A αα lim n n 1 / α, D S f,α 12 = B lim n n 1 / β, D S f,α 21 = C lim n n 1 / α, and D S f,β 22 = A ββ lim n n 1 / β, and S f,β S adj R i1 R i2 S α1 (R i1,y i,x i,z i ) R i1 (1 R i2 )ES α1 (R i1,y i, i, f,α 1 X i,z i ) Y i, X i1,z i,r i1 = 1,R i2 = 0}ϒ i,α1,10 (1 R i1 )R i2 S α1 (R i1,y i,x i,z i ) (1 R i1 )(1 R i2 )ES α1 (R i1,y i,x i,z i ) Y i, Z i,r i1 = R i2 = 0}ϒ i,α1,00,s adj R i1 R i2 S α2 (R i2,y i,x i,z i ) i, f,α 2 R i1 (1 R i2 )S α2 (R i2,y i,x i,z i ) (1 R i1 )R i2 ES α2 (R i2,y i,x i, Z i ) Y i,x i2,z i,r i1 = 0,R i2 = 1}ϒ i,α2,01 (1 R i1 )(1 R i2 ) ES α2 (R i2,y i,x i,z i ) Y i,z i,r i1 = R i2 = 0}ϒ i,α2,00,s adj i, f,β R i1 R i2 S β (Y i,x i,z i )R i1 (1 R i2 )ES β (Y i,x i,z i ) Y i,x i1,z i,r i1 = 1,R i2 = 0}ϒ i,β,10 (1 R i1 )R i2 ES β (Y i,x i,z i ) Y i,x i2,z i,r i1 = 0,R i2 = 1}ϒ i,β,01 (1 R i1 )(1 R i2 )ES β (Y i,x i,z i ) Y i,z i, R i1 = R i2 = 0}ϒ i,β,00, where the expressions for the adjustment terms ϒ i,α1,10,ϒ i,α1,00,ϒ i,α2,01,ϒ i,α2,00,ϒ i,β,10,ϒ i,β,01, and ϒ i,β,00 are given in the Appendix. Now we have the following main results. Theorem 1. Under regularity conditions C1 C8 listed in Web Appendix A3, when n, (a) the influence function representation of the estimators is ( ) α α n = D 1 1 β β n n Sadj i, f,α S adj i, f,β op (1); (b) asymptotically n( β β) Normal(0, β ), where β = H 1 cov(s adj )H 1 2H 1 cov(s adj 1, f,β 1, f,β,sadj ) F T C T A T ββ 1, f,α A 1 ββ CF 1 cov(s adj )F 1 C T A 1 ββ, with H = A ββ CA 1 αα B and 1, f,α F = A αα BA 1 ββ C. The exact expression for D and a sketch proof of the above theorem are given in the Supplementary Materials. Corollary 1. (a) If we replace A αα, A ββ, B, and C in β by n 1 S fx,α / α, n 1 S fx,β / β, and n 1 S fx,α / β, n 1 S fx,β / α, respectively, then the resulting estimate β β ; (b) n 1 n Sadj n 1 n Sadj i, f,α SadjT i, f,α cov(s adj 1, f,β,sadj 1, f,α i, f,β SadjT i, f,β, n 1 n Sadj is consistent for i, f,β SadjT i, f,α, are consistent estimates for cov(sadj ), and cov(sadj 1, f,α ), respectively. and 1, f,β ), The proof of the corollary is straightforward and thus omitted. Note that the adjustment terms are zero for subjects i/ V =j : R j1 = R j2 = 1}. Fori V, ϒ i,α1,10 can be written as ϒ i,α1,10 = π1 π 2 h(y, X, Z, π 1 ) S α1 (1,Y,X,Z) a(y, X ( 2),Z,π 1 ) b } α 1 (1,Y,X ( 2),Z,π 1 ) I(X 1 = X i1,z = Z i ) a(y, X ( 2),Z,π 1 ) f (Y X i1,x 2,Z i )f (X 2 X i1,z i ) dy dx i2 pr(r 1 = R 2 = 1 X i1,z i ) π1 π 2 h(y, X, Z, π 1 ) = S α1 (1,Y,X,Z) a(y, X ( 2),Z,π 1 ) b } α 1 (1,Y,X ( 2),Z,π 1 ) I(X 1 = X i1,z = Z i ) a(y, X ( 2),Z,π 1 ) f (Y X i1,x 2,Z i )f (X 2 X i1,z i,r 1 =R 2 =1) pr(r 1 = R 2 = 1 X i1,x 2,Z i ) dy dx 2. Thus, ϒ i,α1,10 can be estimated by replacing f (X 2 X i1, Z i,r 1 = R 2 = 1) using the empirical density f (X2 X i1, Z i,r 1 = R 2 = 1) obtained from the completely observed data. Likewise, for i V, ϒ i,α1,00 can be estimated by π1 π 2 h(y, X, Z, π 1 π 2 ) S α1 (0,Y,X,Z) b } α 1 (0,Y,Z, π 1 π 2 ) a(y, Z, π 1 π 2 ) a(y, Z, π 1 π 2 ) f (Y X, Z i) f (X Zi,R 1 = R 2 = 1) pr(r 1 = R 2 = 1 X, Z i ) dy dx I(Z = Z i ) f (Y X, Z i) f (X Zi,R 1 = R 2 = 1) pr(r 1 = R 2 = 1 X, Z i ) dy dx. When Y is continuous, the integration can be evaluated via a quadrature formula. For discrete Y, the integration is replaced by a summation. 4. Extension to More Than Two Covariates The proposed method with p = 2 may be extended to the p>2 scenario by incorporating the missingness mechanism for each of the p missing variables. To estimate conditional distributions, the data can be stratified based on the conditioning variables that include Z and may include a part of X, with the strata (cell) frequencies used for estimating the conditional distributions. However, when the number of strata is large, some cell frequencies could be zero or very small, and consequently the estimates could be unstable. To solve this practical problem with a finite sample size and when the cross classification of the conditioning variables is large, some cells can be merged in a meaningful way, or the following more objective strategy can be adopted: first we seek to identify a set of important variables (they could be a linear combination of the conditioning variables) from the set of the conditioning variables using principal component analysis (PCA). In the second step, based on these identified variables in the first step, we stratify the data. This two-step technique will generally be effective if the dimension of the important variables is less than the dimension of the conditioning variables.

6 6 Biometrics For each partially missing variable X r (r = 1,...,p), we model pr(r r = 1 Y, X ( r),z,α r ) parametrically. Suppose that M i denotes the set of indices of the missing variables among X 1,...,X p for the ith observation. Then M i M c i =1,...,p}. Let the cardinality of M i be M i =m i, and m i can take values 0, 1,...,p. For m i > 0, define X i(mi ) (X ij1,...,x ijmi ):j 1,...,j mi M i }, and X i( Mi ) as X i1,...,x ip without X i(mi ). Then the score function corresponding to the ith observation for which m i > 0 involves an integral with respect to the conditional distribution of X i(mi ) given Z i,x i( Mi ) in the completely observed data V =i : R ij = 1,j = 1,...,p}. For empirical estimation of this conditional distribution, we replace f (Xi(Mi ) Z i,x i( Mi ),R 1 = =R p = 1) by f (Xi(Mi ) U (M i) 1,...,U (M i) G i,r 1 = =R p =1), where U (M i) 1,...,U (M i) G i are the first G i principal components (PCs) of (Z, X ( Mi )) based on the completely observed data. The number of PCs G i is chosen so that the G i components capture a high proportion of the total variability of (Z, X ( Mi )). Our numerical experience suggests that at least 80% for the proportion would be sufficient. Since PCA is used for continuous variables, to apply PCA on a discrete variable, we borrow Kolenikov and Angeles (2009) idea to first transform the discrete variable into a pseudo-continuous variable as follows. Let X r be an ordered categorical variable with C r categories, 0, 1,...,C r 1. Assume that there is a latent variable L Xr Normal(0, 1) such that X r takes on 0, c, and C r 1 when L Xr (,κ r1 ), L Xr κ rc,κ rc1 ), and L Xr κ Cr 1, ), respectively, for c = 1,...,C r 2. The unknown parameters κ rc,c= 1,...,C r 1, are estimated by the maximum likelihood method based on the data where R r = 1. Define the pseudo-continuous variable to be E(L Xr X r = c) = κ c uφ(u)du when X κ r = c, where φ( ) c 1 denotes a standard normal density. This technique is one way to reduce the dimension in order to produce a good approximation for our methodology. While it is only an approximate approach and the asymptotic theory does not strictly account for it due to these approximation errors, it is still a practically feasible numerical method to identify important variables. Indeed this dimension reduction technique works well in our simulation scenarios. The estimating equation for estimating β can now be written compactly S fx,β = where n S β (Y i,x i,z i ) q i,mi ( π i(mi )π i( Mi ))dx i(mi ) = 0, q i,mi ( π i(mi )π i( Mi )) = f (Y i X i,z i ) π i(mi )π i( Mi ) f } / (X i(mi ) U M i 1,...,U M i G i,r i1 = =R ip = 1) pr(r 1 = =R p = 1 X i,z i ) f (Y i X i,z i ) π i(mi )π i( Mi ) f } (X i(mi ) U M i 1,...,U M i G i,r i1 = =R ip = 1) dx i(mi ). pr(r 1 = =R p = 1 X i,z i ) Note that the integral in the denominator of q i,mi ( π i(mi )π i( Mi )) is simply a sum over all distinct observed values of X Mi in the completely observed data. Also, the estimating equation for α k, k = 1,...,p,is S fx,α k = n S αk (R ik,y i,x i,z i ) q i,mi ( π i(mi )π i( Mi ))dx i(mi ) = 0, where π i(mi ) = j M i pr(r ij = 0 Y i,x i,z i ) and π i( Mi ) = j M c pr(r ij = 1 Y i,x i,z i ). Extending the asymptotic arguments given previously, we obtain the influence function rep- i resentation of θ = (α T 1,...,αT p,βt ) T with n S adj = S β (Y i,x i,z i )q i,mi ( π i(mi )π i( Mi ))dx i(mi ) f x,β n K k=1 p j=1 R ij pr(r i1 = =R ip = 1 Z i,x i( Mk )) πi(mk ) π i( Mk )h(y i,x i,z i, π i(mk ) π i( Mk )) E a(y i,x i( Mk ),Z i, π i(mk ) π i( Mk )) } S β (Y, X, Z) S β Mk (Y, Z, X ( Mk )) ] Z = Z i,x ( Mk ) = X i( Mk ), where M k,k = 1,...,K, are K distinct patterns of missing data realized in the dataset and S β Mk (Y, Z, X ( Mk )) = S β (Y, X, Z)h(X, Y, Z, π Mk π ( Mk ))f (X Mk X ( Mk ),Z,R 1 = =R p = 1) dx Mk ]/a(y, X (Mk ),Z, π Mk π ( Mk )). Similarly, n S adj = S αl (R il,y i,x i,z i )q i,mi ( π i(mi )π i( Mi ))dx i(mi ) f x,α l n K k=1 p j=1 R ij pr(r i1 = =R ip = 1 Z i,x i( Mk )) πi(mk ) π i( Mk )h(y i,x i,z i, π i(mk ) π i( Mk )) E a(y i,x i( Mk ),Z i, π i(mk ) π i( Mk )) } S αl (R lmk,y,x,z) S αl M k (R lmk,y,z,x ( Mk )) ] Z = Z i,x ( Mk ) = X i( Mk ), for l = 1,...,p, where S αl M k (Y, Z, X ( Mk )) = S αl (R lmk, Y, X, Z)h(X, Y, Z, π Mk π ( Mk ))f (X Mk X ( Mk ),Z,R 1 = =R p = 1) dx Mk ]/a(y, X (Mk ),Z, π Mk π ( Mk )). 5. Simulation Studies 5.1. Simulation Designs We considered two scenarios. For scenario I we have two partially missing covariates and one completely observed covariate. We simulated cohort datasets each of size n = 1000 by

7 Semiparametric Approach for Non-Monotone Missing Covariates 7 simulating Z Bernoulli(0.5), X 1 Bernoulli(0.5), and X 2 Bernoulli(0.5). Then the response variable Y was generated from a Bernoulli distribution with logitpr(y = 1 X 1,X 2,Z)} = Z 0.3X 1 0.4X 2 0.2X 1 X 2. Next we generated the missing values in X 1 and X 2 by randomly simulating binary indicators R 1 and R 2 with logitpr(r 1 = 1 Y, Z, X 2 )}=Y 0.5Z X 2 and logitpr(r 2 = 1 Y, Z, X 1 )}= Y 0.5Z X 1, respectively. This NI-missingness mechanism yielded approximately 24% missing values in X 1 as well as in X 2. We also considered the MAR mechanism to generate missing values in X 1 and X 2 by simulating R 1 and R 2 from the Bernoulli distribution with success probabilities logitpr(r 1 = 1 Y, Z)} =0.4 Y Z and logitpr(r 2 = 1 Y, Z)} =0.4 Y Z, respectively. In addition, the Supplementary Materials contain some results for NI missing data. For scenario II, we considered 9 partially missing covariates and one completely observed covariate. We simulated a cohort dataset of size n = 200 by simulating Z Bernoulli(0.5), X r Bernoulli(0.5) for r = 1, 2, 3, 4, X r Gamma(2, 2) with E(X r ) = 1 for r=5, 6, X 7 Normal(0, 1), X 8 =I(X 1 =0)Normal( 0.25, 1) I(X 1 = 1)Normal(0.25, 1), and X 9 Normal( 0.2X 1 0.2X 2, 1). The response variable Y was taken to be binary with conditional success probability logitpr(y=1 Z, X 1,...,X 9 )}= 2 0.1Z 0.2X 1 0.3X 2 0.4X 3 0.5X 4 0.5X 5 0.5X 6 0.4X 7 0.3X 8 0.2X 9. The NI-data were generated by simulating the missingness indicators with success probabilities logitpr(r r = 1 Y, Z, X ( r) )}= Y 0.25Z X l=1,l r l, r = 1,...,9, which resulted in approximately 10 17% missing values for each of X 1 through X Method of Analysis The simulated datasets were analyzed by four approaches. First, we analyzed the data without any missing values, where the estimates were obtained by maximizing the full data likelihood n f (Y i X i,z i ). This approach was referred to as the full data (FD) method. We then analyzed the data using, where the subjects containing any missing values in X k were discarded. The third approach considered here was the extension of the mean-score method (only for scenario I). For this method the underlying assumption was that the data are MAR. Finally, we analyzed the datasets using the proposed semiparametric approach which we referred to as. In we modeled the missingness mechanism using a linear logistic regression (i.e., the logit of the selection probability is a linear function of the conditioning variables) Results Table 1 contains the bias, empirical and estimated standard errors, 95% coverage probabilities based on the Wald-type confidence intervals, and MSE for the five regression parameters involved in the model f (Y X 1,X 2,Z) for scenario I. The results in Table 1 indicate that when the data are MAR, the mean-score and methods performed equally well. Note that both of the approaches have much smaller variances compared with. When data are NI-, the proposed approach is much better than the mean-score approach in terms of bias. For the NI-data the mean-score method produces biased results. In both the MAR and NI-scenarios, also produced biased results. Moreover, Table 1 shows that the empirical and estimated standard errors are reasonably close for all methods, and the empirical coverage probabilities for are close to the nominal level but not always so for the other methods. Additionally, under some NI missing data (presented in the Supplementary Materials), shows the least bias among all methods. For scenario II (Table 2), we only present the results based on and the proposed approach. We noticed that faced some convergence problems (1 2% of the data). The results for are based on the datasets where estimates converged and indicate that the proposed approach performs much better than in terms of bias and standard errors. While shows bias in the parameter estimates, the coverage probabilities for the analysis are usually close to the nominal level due to large standard errors. The bias in depends on how strongly the missingness mechanism depends on the response variable (results not presented here) Robustness Study for Model Misspecification The proposed approach depends on a parametric model assumption of the missing probabilities. Therefore, in principle, a model misspecification can cause bias in the parameter estimates. In order to assess such possible bias, we conducted the following additional simulation with scenario II where NI-missing data were created by simulating the missingness indicators with success probabilities logitpr(r r = 1 Y, Z, X ( r) )}= Y 0.25Z l r X2 l. The resulting missing percentages for each of the nine variables vary between 9% and 12%. However, in the analysis missingness was modeled as logitpr(r r = 1 Y, Z, X ( r) )}=α 0r α 1r Y α 2r Z 9 α l r (2l),rX l resulting in a model violation. The results (Table 3) indicate the method is hardly affected by this model misspecification. Table 4 contains the results when the actual missingness mechanism was logitpr(r 1 = 1 Y, Z)} = 1 0.5Y 0.5Z, logitpr(r j = 1 Y, Z, R (j 1) )}= Y 0.5Z 0.8R (j 1), j = 2,...,9. This mechanism violates our assumption that the missingness mechanism of each variable is independent conditional on the variables. In this scenario missing percentage for each variable varies between 18% and 21%. In the analysis, missingness was modeled as logitpr(r r = 1 Y, Z, X ( r) )}=α 0r α 1r Y α 2r Z 9 α l r (2l),rX l. The analysis had numerical convergence problems for 68% of the datasets out of 500 replications, and here we present the results based only on the converged datasets. There was no convergence issue for, which continued to outperform. 6. Data Examples 6.1. Analysis of the Los Angeles Endometrial Cancer Data This is a 1:4 matched case-control study of endometrial cancer (Breslow and Day, 1980), where three controls were matched with every case subject based on neighborhood of residence and age in an affluent retirement community in Los Angeles. All subjects in the study were post menopausal women. The dataset contains several risk factors, including presence

8 8 Biometrics Table 1 Results of the simulation study for scenario 1 based on 500 replications Method β 0 β 1 β 2 β 3 β 4 FD Bias EMP.SE EST.SE CP MSE MAR mechanism Bias EMP.SE EST.SE CP MSE Mean-score extension Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE NI- mechanism Bias EMP.SE EST.SE CP MSE Mean-score extension Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE Here FD,,, EMP.SE, EST.SE, and CP stand for the full data analysis, complete case, the proposed semiparametric method, empirical standard error, estimated standard error, and the 95% coverage probability, respectively. All entries are multiplied by 10. of gall bladder disease (Z 1 ), presence of hypertension (Z 2 ), presence of obesity (X 1 ), dose of conjugated estrogen used (X 2, denoted as Dose), and duration of conjugated estrogen used (X 3, denoted as Duration). The last three variables had 17%, 3%, and 6% missing values, respectively, while the first two were completely observed. The non-monotone pattern of missing data is given in the top panel of Table 5, and the counts do not show any evidence of dependence between R 1 and R 2, and between R 1 and R 3. Although these counts exhibit strong evidence of dependence between R 2 and R 3, this does not rule out the possibility of independence of these missingness indicators conditional on the covariates. Furthermore, among the subjects with R 2 R 3 = 1, the missingness of obesity is strongly associated with Duration (pvalue= 0.004). Also, among the subjects with R 1 R 2 = 1, the missingness of Duration is strongly associated with Dose

9 Semiparametric Approach for Non-Monotone Missing Covariates 9 Table 2 Results of the simulation study based on 500 replications for scenario 2 with NI-mechanism Method β 0 β 1 β 2 β 3 β 4 β 5 β 6 β 7 β 8 β 9 β 10 FD Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE Here FD,,, EMP.SE, EST.SE, and CP stand for the full data analysis, complete case, the proposed semiparametric method, empirical standard error, estimated standard error, and the 95% coverage probability, respectively. All entries are multiplied by 10. (p-value= 0.021). These facts motivated us to assume the NImechanism and estimate the parameters using the proposed method. Define Y = 1 if a subject has endometrial cancer and Y = 0 otherwise. We fit a logistic regression model logitpr(y = 1 Z 1,Z 2,X 1,X 2,X 3 )}=β 0 β 1 Z 1 β 2 Z 2 β 3 X 1 β 4 X 2 β 5 X 3 as if the data were unmatched and cohort. We estimated the model parameters using and. For the missingness model in we considered a logistic regression model, that is, logitpr(r r = 1 X ( r),z,y)} =α r0 α r1 Y 2 α j=1 r(1j)z j 3 α j=1,j r r(3j)x j. The estimates, standard errors, and p values for β from both methods are presented Table 3 Results of the simulation study based on 500 replications for scenario 2 with NI-mechanism with logitpr(r r = 1 Y, Z, X ( r) )}= Y 0.25Z l r X2 l Method β 0 β 1 β 2 β 3 β 4 β 5 β 6 β 7 β 8 β 9 β 10 FD Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE For the method the missingness mechanism is incorrectly modeled. Here FD,,, EMP.SE, EST.SE, and CP stand for the full data analysis, complete case, the proposed semiparametric method, empirical standard error, estimated standard error, and the 95% coverage probability, respectively. All entries are multiplied by 10.

10 10 Biometrics Table 4 Results of the simulation study based on 500 replications for scenario 2 where the missingness mechanism is logitpr(r 1 = 1 Y, Z)} =1 0.5Y 0.5Z, logitpr(r j = 1 Y, Z, R (j 1) )}= Y 0.5Z 0.8R (j 1), j = 2,...,9 Method β 0 β 1 β 2 β 3 β 4 β 5 β 6 β 7 β 8 β 9 β 10 FD Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE Bias EMP.SE EST.SE CP MSE For the method the missingness mechanism is incorrectly modeled. Here FD,,, EMP.SE, EST.SE, and CP stand for the full data analysis, complete case, the proposed semiparametric method, empirical standard error, estimated standard error, and the 95% coverage probability, respectively. All entries are multiplied by 10. in Table 5. has lower standard errors for all the coefficient estimates. Except for X 1 and X 3, the parameter estimates for both the methods are somewhat similar. Further investigation revealed that the missingness mechanism of X 1 strongly depends on X 3 (p-value = 0.005) and the missingness mechanism of X 3 strongly depends on X 1 (p-value = 0.034). Based on the results of, the presence of gall bladder disease and duration of conjugated estrogen used are statistically significant (at the 5% level) as risk factors for endometrial cancer. Table 5 The top panel shows the pattern of missing data in the Los Angeles Endometrial Cancer data with 1 and 0 representing the observed and missing variable, respectively, and the bottom panel contains the results of the data analysis Dose of conjugated Duration of conjugated Pattern Obesity estrogen estrogen used Count Method Gall bladder Hypertension Obesity Dose log (duration) Estimate SE p-value Estimate SE p-value Here and stand for the complete case and the proposed semiparametric method, respectively.

11 Semiparametric Approach for Non-Monotone Missing Covariates 11 Table 6 Results of the hip fracture data analysis Method Estimate SE p-value Estimate SE p-value EtOH < <0.001 Smoke Dementia < <0.001 Antiseiz <0.001 LevoT AntiChol Albumin BMI log(hgb) Here and stand for the complete case and the proposed semiparametric method, respectively Analysis of the Hip Fracture Data The second data example is a 1:1 age and race matched casecontrol study of possible risk factors of hip fracture among male veterans with 218 strata (Chen, 2004). The study was conducted at the University of Illinois at Chicago College of Medicine (Barengolts et al., 2001). Many potential risk factors (25 altogether) in addition to age and race were recorded. Preliminary exploratory analysis suggested that nine of these risk factors are potentially important. Among them EtOH(X 1 ), Smoke (X 2 ), Dementia (X 3 ), Antiseizure (X 4 ), LevoT4 (X 5 ), and anti-cholesterol (X 6 ) are dichotomous variables, whereas the remaining three Albumin (X 7 ), BMI (X 8 ), and log(hgb) (X 9 ) are continuous in nature. All of these predictors contain some missing values, with the percentage of missing values being approximately 10%, 12%, 3%, 5%, 9%, 10%, 24%, 10%, and 12%, respectively. Although there could be at most = 511 missing patterns, the observed number of patterns (non-monotone) is 37, and these patterns are presented in Table 2 of Chen (2004). The presence (Y = 1) or absence (Y = 0) of hip fracture was considered as the binary outcome. We aimed to find the estimate of the logodds ratio parameters by fitting a logistic regression model to the data: logitpr(y = 1 X)} =β 0 9 r=1 X rβ r. Note here that β 1,...,β 9 are the main parameters of interest. Following Chen (2004), in the model we ignored age and race as these variable were used as matching variables. First we carried out the analysis and then we analyzed the data using. In the latter, we assumed that the missingness mechanisms follow a logistic model: logitpr(r r = 1 X ( r),y)} =α r0 α r1 Y 9 j=1,j r α r(j1)x j for r = 1,...,9. We applied the dimension reduction technique for calculating the conditional distributions based on the completely observed data. In Table 6 we present the estimates, standard errors, and p values for β-parameters. The estimates differ somewhat between the methods. However, the standard errors of are substantially smaller than that of. The results indicate that all the covariates have statistically significant associations (at the 5% level) with the occurrence of hip fracture. Our results are consistent with the results of Chen (2004) who analyzed this data under the MAR assumption. 7. Discussion This article deals with non-monotone patterns of missing covariate data in a parametric regression model. Here we address two important issues simultaneously: extending the missingness mechanism beyond MAR and handling possible association among the partially missing covariates. The method relies on a parametric model for the missingness mechanism, and possible associations among the covariates are handled empirically without using any parametric structure. The NI-mechanism may be used when the MAR assumption is likely to be violated (Potthof et al., 2006) or in the scenario when the NI missing assumption is plausible. If we assume that the missingness mechanism is completely non-ignorable, then we will require a parametric model for the partially missing covariate. This parametric modeling of the missing covariate distribution is avoided by imposing the NI-mechanism assumption. The simulation results indicate that the proposed approach works remarkably well in terms of bias and standard error. Furthermore, it appears robust when some of the underlying assumptions about the missingness mechanism are violated. We have also derived the asymptotic normality of the proposed estimator and an asymptotically consistent formula to compute the standard errors. Our proposed method is an estimated-score approach to estimate the parameters of the model. Alternatively, a maximum likelihood estimator (MLE) or a nonparametric maximum likelihood estimator (NPMLE) is also possible and likely to be more efficient. However, in both cases of MLE and NPMLE the computation could be challenging for a large dimension of the parameter vector as all parameters (β, α 1,...,α p, and any additional parameters) need to be estimated simultaneously in each step of the EM algorithm. In contrast, in each iteration of our EM algorithm, β, α 1,...,α p are estimated separately. Thus, a further theoretical and numerical investigation of the MLE-type estimator would be interesting. Finally, in principle the proposed idea can be generalized to handle both missing covariates and the missing response variable, especially in the context of longitudinal data analyses.

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Yi-Hau Chen Institute of Statistical Science, Academia Sinica Joint with Nilanjan

More information

Analysis of Matched Case Control Data in Presence of Nonignorable Missing Exposure

Analysis of Matched Case Control Data in Presence of Nonignorable Missing Exposure Biometrics DOI: 101111/j1541-0420200700828x Analysis of Matched Case Control Data in Presence of Nonignorable Missing Exposure Samiran Sinha 1, and Tapabrata Maiti 2, 1 Department of Statistics, Texas

More information

Missing Covariate Data in Matched Case-Control Studies

Missing Covariate Data in Matched Case-Control Studies Missing Covariate Data in Matched Case-Control Studies Department of Statistics North Carolina State University Paul Rathouz Dept. of Health Studies U. of Chicago prathouz@health.bsd.uchicago.edu with

More information

7 Sensitivity Analysis

7 Sensitivity Analysis 7 Sensitivity Analysis A recurrent theme underlying methodology for analysis in the presence of missing data is the need to make assumptions that cannot be verified based on the observed data. If the assumption

More information

1 Introduction A common problem in categorical data analysis is to determine the effect of explanatory variables V on a binary outcome D of interest.

1 Introduction A common problem in categorical data analysis is to determine the effect of explanatory variables V on a binary outcome D of interest. Conditional and Unconditional Categorical Regression Models with Missing Covariates Glen A. Satten and Raymond J. Carroll Λ December 4, 1999 Abstract We consider methods for analyzing categorical regression

More information

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs STAT 5500/6500 Conditional Logistic Regression for Matched Pairs Motivating Example: The data we will be using comes from a subset of data taken from the Los Angeles Study of the Endometrial Cancer Data

More information

Columbia University. Columbia University Biostatistics Technical Report Series. Handling Missing Data by Deleting Completely Observed Records

Columbia University. Columbia University Biostatistics Technical Report Series. Handling Missing Data by Deleting Completely Observed Records Columbia University Columbia University Biostatistics Technical Report Series Year 2006 Paper 13 Handling Missing Data by Deleting Completely Observed Records Cuiling Wang Myunghee Cho Paik Columbia University,

More information

Semiparametric Generalized Linear Models

Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

Multiple Imputation for Missing Values Through Conditional Semiparametric Odds Ratio Models

Multiple Imputation for Missing Values Through Conditional Semiparametric Odds Ratio Models Multiple Imputation for Missing Values Through Conditional Semiparametric Odds Ratio Models Hui Xie Assistant Professor Division of Epidemiology & Biostatistics UIC This is a joint work with Drs. Hua Yun

More information

Modification and Improvement of Empirical Likelihood for Missing Response Problem

Modification and Improvement of Empirical Likelihood for Missing Response Problem UW Biostatistics Working Paper Series 12-30-2010 Modification and Improvement of Empirical Likelihood for Missing Response Problem Kwun Chuen Gary Chan University of Washington - Seattle Campus, kcgchan@u.washington.edu

More information

Discussion of Missing Data Methods in Longitudinal Studies: A Review by Ibrahim and Molenberghs

Discussion of Missing Data Methods in Longitudinal Studies: A Review by Ibrahim and Molenberghs Discussion of Missing Data Methods in Longitudinal Studies: A Review by Ibrahim and Molenberghs Michael J. Daniels and Chenguang Wang Jan. 18, 2009 First, we would like to thank Joe and Geert for a carefully

More information

Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing

Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing Alessandra Mattei Dipartimento di Statistica G. Parenti Università

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation

Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation Biometrika Advance Access published October 24, 202 Biometrika (202), pp. 8 C 202 Biometrika rust Printed in Great Britain doi: 0.093/biomet/ass056 Nuisance parameter elimination for proportional likelihood

More information

TECHNICAL REPORT Fixed effects models for longitudinal binary data with drop-outs missing at random

TECHNICAL REPORT Fixed effects models for longitudinal binary data with drop-outs missing at random TECHNICAL REPORT Fixed effects models for longitudinal binary data with drop-outs missing at random Paul J. Rathouz University of Chicago Abstract. We consider the problem of attrition under a logistic

More information

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs STAT 5500/6500 Conditional Logistic Regression for Matched Pairs The data for the tutorial came from support.sas.com, The LOGISTIC Procedure: Conditional Logistic Regression for Matched Pairs Data :: SAS/STAT(R)

More information

Some methods for handling missing values in outcome variables. Roderick J. Little

Some methods for handling missing values in outcome variables. Roderick J. Little Some methods for handling missing values in outcome variables Roderick J. Little Missing data principles Likelihood methods Outline ML, Bayes, Multiple Imputation (MI) Robust MAR methods Predictive mean

More information

Topics and Papers for Spring 14 RIT

Topics and Papers for Spring 14 RIT Eric Slud Feb. 3, 204 Topics and Papers for Spring 4 RIT The general topic of the RIT is inference for parameters of interest, such as population means or nonlinearregression coefficients, in the presence

More information

LOCAL LINEAR REGRESSION FOR GENERALIZED LINEAR MODELS WITH MISSING DATA

LOCAL LINEAR REGRESSION FOR GENERALIZED LINEAR MODELS WITH MISSING DATA The Annals of Statistics 1998, Vol. 26, No. 3, 1028 1050 LOCAL LINEAR REGRESSION FOR GENERALIZED LINEAR MODELS WITH MISSING DATA By C. Y. Wang, 1 Suojin Wang, 2 Roberto G. Gutierrez and R. J. Carroll 3

More information

A Sampling of IMPACT Research:

A Sampling of IMPACT Research: A Sampling of IMPACT Research: Methods for Analysis with Dropout and Identifying Optimal Treatment Regimes Marie Davidian Department of Statistics North Carolina State University http://www.stat.ncsu.edu/

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score Causal Inference with General Treatment Regimes: Generalizing the Propensity Score David van Dyk Department of Statistics, University of California, Irvine vandyk@stat.harvard.edu Joint work with Kosuke

More information

Chapter 4. Parametric Approach. 4.1 Introduction

Chapter 4. Parametric Approach. 4.1 Introduction Chapter 4 Parametric Approach 4.1 Introduction The missing data problem is already a classical problem that has not been yet solved satisfactorily. This problem includes those situations where the dependent

More information

5 Methods Based on Inverse Probability Weighting Under MAR

5 Methods Based on Inverse Probability Weighting Under MAR 5 Methods Based on Inverse Probability Weighting Under MAR The likelihood-based and multiple imputation methods we considered for inference under MAR in Chapters 3 and 4 are based, either directly or indirectly,

More information

Covariate Balancing Propensity Score for General Treatment Regimes

Covariate Balancing Propensity Score for General Treatment Regimes Covariate Balancing Propensity Score for General Treatment Regimes Kosuke Imai Princeton University October 14, 2014 Talk at the Department of Psychiatry, Columbia University Joint work with Christian

More information

DOUBLY ROBUST NONPARAMETRIC MULTIPLE IMPUTATION FOR IGNORABLE MISSING DATA

DOUBLY ROBUST NONPARAMETRIC MULTIPLE IMPUTATION FOR IGNORABLE MISSING DATA Statistica Sinica 22 (2012), 149-172 doi:http://dx.doi.org/10.5705/ss.2010.069 DOUBLY ROBUST NONPARAMETRIC MULTIPLE IMPUTATION FOR IGNORABLE MISSING DATA Qi Long, Chiu-Hsieh Hsu and Yisheng Li Emory University,

More information

Propensity Score Weighting with Multilevel Data

Propensity Score Weighting with Multilevel Data Propensity Score Weighting with Multilevel Data Fan Li Department of Statistical Science Duke University October 25, 2012 Joint work with Alan Zaslavsky and Mary Beth Landrum Introduction In comparative

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Survival Analysis for Case-Cohort Studies

Survival Analysis for Case-Cohort Studies Survival Analysis for ase-ohort Studies Petr Klášterecký Dept. of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, harles University, Prague, zech Republic e-mail: petr.klasterecky@matfyz.cz

More information

Web-based Supplementary Materials for A Robust Method for Estimating. Optimal Treatment Regimes

Web-based Supplementary Materials for A Robust Method for Estimating. Optimal Treatment Regimes Biometrics 000, 000 000 DOI: 000 000 0000 Web-based Supplementary Materials for A Robust Method for Estimating Optimal Treatment Regimes Baqun Zhang, Anastasios A. Tsiatis, Eric B. Laber, and Marie Davidian

More information

Shu Yang and Jae Kwang Kim. Harvard University and Iowa State University

Shu Yang and Jae Kwang Kim. Harvard University and Iowa State University Statistica Sinica 27 (2017), 000-000 doi:https://doi.org/10.5705/ss.202016.0155 DISCUSSION: DISSECTING MULTIPLE IMPUTATION FROM A MULTI-PHASE INFERENCE PERSPECTIVE: WHAT HAPPENS WHEN GOD S, IMPUTER S AND

More information

ANALYSIS OF ORDINAL SURVEY RESPONSES WITH DON T KNOW

ANALYSIS OF ORDINAL SURVEY RESPONSES WITH DON T KNOW SSC Annual Meeting, June 2015 Proceedings of the Survey Methods Section ANALYSIS OF ORDINAL SURVEY RESPONSES WITH DON T KNOW Xichen She and Changbao Wu 1 ABSTRACT Ordinal responses are frequently involved

More information

Longitudinal analysis of ordinal data

Longitudinal analysis of ordinal data Longitudinal analysis of ordinal data A report on the external research project with ULg Anne-Françoise Donneau, Murielle Mauer June 30 th 2009 Generalized Estimating Equations (Liang and Zeger, 1986)

More information

A weighted simulation-based estimator for incomplete longitudinal data models

A weighted simulation-based estimator for incomplete longitudinal data models To appear in Statistics and Probability Letters, 113 (2016), 16-22. doi 10.1016/j.spl.2016.02.004 A weighted simulation-based estimator for incomplete longitudinal data models Daniel H. Li 1 and Liqun

More information

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design 1 / 32 Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design Changbao Wu Department of Statistics and Actuarial Science University of Waterloo (Joint work with Min Chen and Mary

More information

6. Fractional Imputation in Survey Sampling

6. Fractional Imputation in Survey Sampling 6. Fractional Imputation in Survey Sampling 1 Introduction Consider a finite population of N units identified by a set of indices U = {1, 2,, N} with N known. Associated with each unit i in the population

More information

A simulation study for comparing testing statistics in response-adaptive randomization

A simulation study for comparing testing statistics in response-adaptive randomization RESEARCH ARTICLE Open Access A simulation study for comparing testing statistics in response-adaptive randomization Xuemin Gu 1, J Jack Lee 2* Abstract Background: Response-adaptive randomizations are

More information

Introduction An approximated EM algorithm Simulation studies Discussion

Introduction An approximated EM algorithm Simulation studies Discussion 1 / 33 An Approximated Expectation-Maximization Algorithm for Analysis of Data with Missing Values Gong Tang Department of Biostatistics, GSPH University of Pittsburgh NISS Workshop on Nonignorable Nonresponse

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Discussion of Identifiability and Estimation of Causal Effects in Randomized. Trials with Noncompliance and Completely Non-ignorable Missing Data

Discussion of Identifiability and Estimation of Causal Effects in Randomized. Trials with Noncompliance and Completely Non-ignorable Missing Data Biometrics 000, 000 000 DOI: 000 000 0000 Discussion of Identifiability and Estimation of Causal Effects in Randomized Trials with Noncompliance and Completely Non-ignorable Missing Data Dylan S. Small

More information

IP WEIGHTING AND MARGINAL STRUCTURAL MODELS (CHAPTER 12) BIOS IPW and MSM

IP WEIGHTING AND MARGINAL STRUCTURAL MODELS (CHAPTER 12) BIOS IPW and MSM IP WEIGHTING AND MARGINAL STRUCTURAL MODELS (CHAPTER 12) BIOS 776 1 12 IPW and MSM IP weighting and marginal structural models ( 12) Outline 12.1 The causal question 12.2 Estimating IP weights via modeling

More information

Chapter 11. Regression with a Binary Dependent Variable

Chapter 11. Regression with a Binary Dependent Variable Chapter 11 Regression with a Binary Dependent Variable 2 Regression with a Binary Dependent Variable (SW Chapter 11) So far the dependent variable (Y) has been continuous: district-wide average test score

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

Confidence Intervals for the Odds Ratio in Logistic Regression with Two Binary X s

Confidence Intervals for the Odds Ratio in Logistic Regression with Two Binary X s Chapter 866 Confidence Intervals for the Odds Ratio in Logistic Regression with Two Binary X s Introduction Logistic regression expresses the relationship between a binary response variable and one or

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

A note on multiple imputation for general purpose estimation

A note on multiple imputation for general purpose estimation A note on multiple imputation for general purpose estimation Shu Yang Jae Kwang Kim SSC meeting June 16, 2015 Shu Yang, Jae Kwang Kim Multiple Imputation June 16, 2015 1 / 32 Introduction Basic Setup Assume

More information

On Fitting Generalized Linear Mixed Effects Models for Longitudinal Binary Data Using Different Correlation

On Fitting Generalized Linear Mixed Effects Models for Longitudinal Binary Data Using Different Correlation On Fitting Generalized Linear Mixed Effects Models for Longitudinal Binary Data Using Different Correlation Structures Authors: M. Salomé Cabral CEAUL and Departamento de Estatística e Investigação Operacional,

More information

Statistical Methods for Alzheimer s Disease Studies

Statistical Methods for Alzheimer s Disease Studies Statistical Methods for Alzheimer s Disease Studies Rebecca A. Betensky, Ph.D. Department of Biostatistics, Harvard T.H. Chan School of Public Health July 19, 2016 1/37 OUTLINE 1 Statistical collaborations

More information

Lecture 5 Models and methods for recurrent event data

Lecture 5 Models and methods for recurrent event data Lecture 5 Models and methods for recurrent event data Recurrent and multiple events are commonly encountered in longitudinal studies. In this chapter we consider ordered recurrent and multiple events.

More information

Missing covariate data in matched case-control studies: Do the usual paradigms apply?

Missing covariate data in matched case-control studies: Do the usual paradigms apply? Missing covariate data in matched case-control studies: Do the usual paradigms apply? Bryan Langholz USC Department of Preventive Medicine Joint work with Mulugeta Gebregziabher Larry Goldstein Mark Huberman

More information

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-Level. Information From External Big Data Sources

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-Level. Information From External Big Data Sources Journal of the American Statistical Association ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: http://www.tandfonline.com/loi/uasa20 Constrained Maximum Likelihood Estimation for Model Calibration

More information

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i, A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Modelling Association Among Bivariate Exposures In Matched Case-Control Studies

Modelling Association Among Bivariate Exposures In Matched Case-Control Studies Sankhyā : The Indian Journal of Statistics Special Issue on Statistics in Biology and Health Sciences 2007, Volume 69, Part 3, pp. 379-404 c 2007, Indian Statistical Institute Modelling Association Among

More information

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Noname manuscript No. (will be inserted by the editor) A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Mai Zhou Yifan Yang Received: date / Accepted: date Abstract In this note

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

Biomarkers for Disease Progression in Rheumatology: A Review and Empirical Study of Two-Phase Designs

Biomarkers for Disease Progression in Rheumatology: A Review and Empirical Study of Two-Phase Designs Department of Statistics and Actuarial Science Mathematics 3, 200 University Avenue West, Waterloo, Ontario, Canada, N2L 3G1 519-888-4567, ext. 33550 Fax: 519-746-1875 math.uwaterloo.ca/statistics-and-actuarial-science

More information

Unbiased estimation of exposure odds ratios in complete records logistic regression

Unbiased estimation of exposure odds ratios in complete records logistic regression Unbiased estimation of exposure odds ratios in complete records logistic regression Jonathan Bartlett London School of Hygiene and Tropical Medicine www.missingdata.org.uk Centre for Statistical Methodology

More information

6 Pattern Mixture Models

6 Pattern Mixture Models 6 Pattern Mixture Models A common theme underlying the methods we have discussed so far is that interest focuses on making inference on parameters in a parametric or semiparametric model for the full data

More information

SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES

SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES Statistica Sinica 19 (2009), 71-81 SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES Song Xi Chen 1,2 and Chiu Min Wong 3 1 Iowa State University, 2 Peking University and

More information

The Accelerated Failure Time Model Under Biased. Sampling

The Accelerated Failure Time Model Under Biased. Sampling The Accelerated Failure Time Model Under Biased Sampling Micha Mandel and Ya akov Ritov Department of Statistics, The Hebrew University of Jerusalem, Israel July 13, 2009 Abstract Chen (2009, Biometrics)

More information

Survival Analysis Math 434 Fall 2011

Survival Analysis Math 434 Fall 2011 Survival Analysis Math 434 Fall 2011 Part IV: Chap. 8,9.2,9.3,11: Semiparametric Proportional Hazards Regression Jimin Ding Math Dept. www.math.wustl.edu/ jmding/math434/fall09/index.html Basic Model Setup

More information

What s New in Econometrics. Lecture 1

What s New in Econometrics. Lecture 1 What s New in Econometrics Lecture 1 Estimation of Average Treatment Effects Under Unconfoundedness Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Potential Outcomes 3. Estimands and

More information

Likelihood-based inference with missing data under missing-at-random

Likelihood-based inference with missing data under missing-at-random Likelihood-based inference with missing data under missing-at-random Jae-kwang Kim Joint work with Shu Yang Department of Statistics, Iowa State University May 4, 014 Outline 1. Introduction. Parametric

More information

Combining multiple observational data sources to estimate causal eects

Combining multiple observational data sources to estimate causal eects Department of Statistics, North Carolina State University Combining multiple observational data sources to estimate causal eects Shu Yang* syang24@ncsuedu Joint work with Peng Ding UC Berkeley May 23,

More information

Principal Component Analysis Applied to Polytomous Quadratic Logistic

Principal Component Analysis Applied to Polytomous Quadratic Logistic Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS024) p.4410 Principal Component Analysis Applied to Polytomous Quadratic Logistic Regression Andruski-Guimarães,

More information

Sanjay Chaudhuri Department of Statistics and Applied Probability, National University of Singapore

Sanjay Chaudhuri Department of Statistics and Applied Probability, National University of Singapore AN EMPIRICAL LIKELIHOOD BASED ESTIMATOR FOR RESPONDENT DRIVEN SAMPLED DATA Sanjay Chaudhuri Department of Statistics and Applied Probability, National University of Singapore Mark Handcock, Department

More information

Accounting for Complex Sample Designs via Mixture Models

Accounting for Complex Sample Designs via Mixture Models Accounting for Complex Sample Designs via Finite Normal Mixture Models 1 1 University of Michigan School of Public Health August 2009 Talk Outline 1 2 Accommodating Sampling Weights in Mixture Models 3

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London Bayesian methods for missing data: part 1 Key Concepts Nicky Best and Alexina Mason Imperial College London BAYES 2013, May 21-23, Erasmus University Rotterdam Missing Data: Part 1 BAYES2013 1 / 68 Outline

More information

Power and Sample Size Calculations with the Additive Hazards Model

Power and Sample Size Calculations with the Additive Hazards Model Journal of Data Science 10(2012), 143-155 Power and Sample Size Calculations with the Additive Hazards Model Ling Chen, Chengjie Xiong, J. Philip Miller and Feng Gao Washington University School of Medicine

More information

Harvard University. Harvard University Biostatistics Working Paper Series

Harvard University. Harvard University Biostatistics Working Paper Series Harvard University Harvard University Biostatistics Working Paper Series Year 2015 Paper 197 On Varieties of Doubly Robust Estimators Under Missing Not at Random With an Ancillary Variable Wang Miao Eric

More information

Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses

Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses Outline Marginal model Examples of marginal model GEE1 Augmented GEE GEE1.5 GEE2 Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association

More information

Robustness to Parametric Assumptions in Missing Data Models

Robustness to Parametric Assumptions in Missing Data Models Robustness to Parametric Assumptions in Missing Data Models Bryan Graham NYU Keisuke Hirano University of Arizona April 2011 Motivation Motivation We consider the classic missing data problem. In practice

More information

Deductive Derivation and Computerization of Semiparametric Efficient Estimation

Deductive Derivation and Computerization of Semiparametric Efficient Estimation Deductive Derivation and Computerization of Semiparametric Efficient Estimation Constantine Frangakis, Tianchen Qian, Zhenke Wu, and Ivan Diaz Department of Biostatistics Johns Hopkins Bloomberg School

More information

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Glenn Heller and Jing Qin Department of Epidemiology and Biostatistics Memorial

More information

Missing Data: Theory & Methods. Introduction

Missing Data: Theory & Methods. Introduction STAT 992/BMI 826 University of Wisconsin-Madison Missing Data: Theory & Methods Chapter 1 Introduction Lu Mao lmao@biostat.wisc.edu 1-1 Introduction Statistical analysis with missing data is a rich and

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Causal Inference Basics

Causal Inference Basics Causal Inference Basics Sam Lendle October 09, 2013 Observed data, question, counterfactuals Observed data: n i.i.d copies of baseline covariates W, treatment A {0, 1}, and outcome Y. O i = (W i, A i,

More information

Optimal Designs for 2 k Experiments with Binary Response

Optimal Designs for 2 k Experiments with Binary Response 1 / 57 Optimal Designs for 2 k Experiments with Binary Response Dibyen Majumdar Mathematics, Statistics, and Computer Science College of Liberal Arts and Sciences University of Illinois at Chicago Joint

More information

Cox s proportional hazards model and Cox s partial likelihood

Cox s proportional hazards model and Cox s partial likelihood Cox s proportional hazards model and Cox s partial likelihood Rasmus Waagepetersen October 12, 2018 1 / 27 Non-parametric vs. parametric Suppose we want to estimate unknown function, e.g. survival function.

More information

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk Causal Inference in Observational Studies with Non-Binary reatments Statistics Section, Imperial College London Joint work with Shandong Zhao and Kosuke Imai Cass Business School, October 2013 Outline

More information

Latent Variable Model for Weight Gain Prevention Data with Informative Intermittent Missingness

Latent Variable Model for Weight Gain Prevention Data with Informative Intermittent Missingness Journal of Modern Applied Statistical Methods Volume 15 Issue 2 Article 36 11-1-2016 Latent Variable Model for Weight Gain Prevention Data with Informative Intermittent Missingness Li Qin Yale University,

More information

Fractional Imputation in Survey Sampling: A Comparative Review

Fractional Imputation in Survey Sampling: A Comparative Review Fractional Imputation in Survey Sampling: A Comparative Review Shu Yang Jae-Kwang Kim Iowa State University Joint Statistical Meetings, August 2015 Outline Introduction Fractional imputation Features Numerical

More information

Semiparametric Regression

Semiparametric Regression Semiparametric Regression Patrick Breheny October 22 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/23 Introduction Over the past few weeks, we ve introduced a variety of regression models under

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Confidence Intervals for the Interaction Odds Ratio in Logistic Regression with Two Binary X s

Confidence Intervals for the Interaction Odds Ratio in Logistic Regression with Two Binary X s Chapter 867 Confidence Intervals for the Interaction Odds Ratio in Logistic Regression with Two Binary X s Introduction Logistic regression expresses the relationship between a binary response variable

More information

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Jae-Kwang Kim 1 Iowa State University June 26, 2013 1 Joint work with Shu Yang Introduction 1 Introduction

More information

Fitting Multidimensional Latent Variable Models using an Efficient Laplace Approximation

Fitting Multidimensional Latent Variable Models using an Efficient Laplace Approximation Fitting Multidimensional Latent Variable Models using an Efficient Laplace Approximation Dimitris Rizopoulos Department of Biostatistics, Erasmus University Medical Center, the Netherlands d.rizopoulos@erasmusmc.nl

More information

T E C H N I C A L R E P O R T KERNEL WEIGHTED INFLUENCE MEASURES. HENS, N., AERTS, M., MOLENBERGHS, G., THIJS, H. and G. VERBEKE

T E C H N I C A L R E P O R T KERNEL WEIGHTED INFLUENCE MEASURES. HENS, N., AERTS, M., MOLENBERGHS, G., THIJS, H. and G. VERBEKE T E C H N I C A L R E P O R T 0465 KERNEL WEIGHTED INFLUENCE MEASURES HENS, N., AERTS, M., MOLENBERGHS, G., THIJS, H. and G. VERBEKE * I A P S T A T I S T I C S N E T W O R K INTERUNIVERSITY ATTRACTION

More information

Can a Pseudo Panel be a Substitute for a Genuine Panel?

Can a Pseudo Panel be a Substitute for a Genuine Panel? Can a Pseudo Panel be a Substitute for a Genuine Panel? Min Hee Seo Washington University in St. Louis minheeseo@wustl.edu February 16th 1 / 20 Outline Motivation: gauging mechanism of changes Introduce

More information

Propensity Score Analysis with Hierarchical Data

Propensity Score Analysis with Hierarchical Data Propensity Score Analysis with Hierarchical Data Fan Li Alan Zaslavsky Mary Beth Landrum Department of Health Care Policy Harvard Medical School May 19, 2008 Introduction Population-based observational

More information

Semiparametric Mixed Effects Models with Flexible Random Effects Distribution

Semiparametric Mixed Effects Models with Flexible Random Effects Distribution Semiparametric Mixed Effects Models with Flexible Random Effects Distribution Marie Davidian North Carolina State University davidian@stat.ncsu.edu www.stat.ncsu.edu/ davidian Joint work with A. Tsiatis,

More information

Confidence intervals for the variance component of random-effects linear models

Confidence intervals for the variance component of random-effects linear models The Stata Journal (2004) 4, Number 4, pp. 429 435 Confidence intervals for the variance component of random-effects linear models Matteo Bottai Arnold School of Public Health University of South Carolina

More information

Recent Advances in the analysis of missing data with non-ignorable missingness

Recent Advances in the analysis of missing data with non-ignorable missingness Recent Advances in the analysis of missing data with non-ignorable missingness Jae-Kwang Kim Department of Statistics, Iowa State University July 4th, 2014 1 Introduction 2 Full likelihood-based ML estimation

More information

GENERALIZED SCORE TESTS FOR MISSING COVARIATE DATA. A Dissertation LEI JIN

GENERALIZED SCORE TESTS FOR MISSING COVARIATE DATA. A Dissertation LEI JIN GENERALIZED SCORE TESTS FOR MISSING COVARIATE DATA A Dissertation by LEI JIN Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree

More information

Plausible Values for Latent Variables Using Mplus

Plausible Values for Latent Variables Using Mplus Plausible Values for Latent Variables Using Mplus Tihomir Asparouhov and Bengt Muthén August 21, 2010 1 1 Introduction Plausible values are imputed values for latent variables. All latent variables can

More information

STA6938-Logistic Regression Model

STA6938-Logistic Regression Model Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

More information

Data Integration for Big Data Analysis for finite population inference

Data Integration for Big Data Analysis for finite population inference for Big Data Analysis for finite population inference Jae-kwang Kim ISU January 23, 2018 1 / 36 What is big data? 2 / 36 Data do not speak for themselves Knowledge Reproducibility Information Intepretation

More information