Biostat 2065 Analysis of Incomplete Data

Size: px

Start display at page:

Download "Biostat 2065 Analysis of Incomplete Data"

Vincent Goodwin
5 years ago
Views:

1 Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh September 13 & 15, 2005

2 1. Complete-case analysis (I) Complete-case analysis refers to analysis based on the cases with all variables observed. Advantages: (a) Simplicity; (b) Comparability. Drawbacks: (a) Bias; (b) Loss of precision. Suppose θ is a scalar parameter of interest. Denote θ CC, θ NM and θ EFF the completecase estimator, estimator based on the hypothetical complete data and the efficient estimator on the observed data, respectively. From V ar( θ CC ) = V ar( θ NM )(1 + CC ), V ar( θ CC ) = V ar( θ EFF )(1 + CC ), CC and CC can be used to evaluate the relative efficiency of θ CC. Example 3.1. Bivariate normal monotone data {x i, y i } n i=1, where y is missing for i = r + 1,..., n and data are MCAR. Then 1. For the mean of X, CC = CC = n r r ; 2. For the mean of Y, CC = n r r, but CC (n r)ρ2 n(1 ρ 2 )+rρ 2, where ρ is the correlation coefficient between X and Y. 3. For the regression coefficient of Y on X, CC = n r r and CC = 0. Biostat 2065: Analysis of Incomplete Data, 2005

3 2. Complete-case analysis (II) Example 3.2. In general, bias on mean estimator: µ CC µ = (1 π CC )(µ CC µ), where π CC = pr(m = 0). Example 3.3. Estimation of regression parameters. When the probability of being a complete case only depends on X, whether observed or missing values of X, then the complete cases are a random sample for regression (or conditional distribution) of Y on X, noted as [Y X]. Then the the complete-case estimator is consistent, though may not be efficient. Example 3.4. Inference on odds ratio. Suppose count data (X, Y ) form a 2 2 contingency table, the complete-case analysis estimator is unbiased if the logarithm of the probability of response is an additive function of X and Y.

4 3. Survey sampling & inference Example 3.5. Randomization inference in surveys with complete response. Suppose in a size-n population, the variable of interest is Y (survey variable) and the variables for design are Z (design variables). For each subject i, denote π i = pr(i i = 1 Y i, Z i ). Sampling mechanisms: 1. Simple random sampling. π i = c, where c is a constant. 2. Stratified or unconfounded random sampling. π i = pr(i i = 1 Z i ). Let T denote a population quantity of interest. For example, T is the population mean of Y, then an estimator is stratified mean: t = ȳ st = 1 J N j ȳ j, N where ȳ j is the sample mean in stratum J. The estimator ȳ st has variance V ar(ȳ st ) = 1 N 2 J Nj 2 ( 1 1 )S n j N yj, 2 j where S 2 yj is the population variance of Y in stratum j. In variance estimation, S 2 yj is replaced by the corresponding sample variance, i.e., s 2 yj.

5 4. Weighted estimates Horvitz-Thompson estimator Suppose in the sample, unit i has probability π i being selected from the population, i = 1, 2,..., n. Then the H-T estimator of the population mean is y HT = 1 N n i=1 π 1 i y i. The stratified mean estimator can be written as ȳ st = ȳ w = 1 n w i y i, n i=1 where the weights, w i = nπ 1 i n k=1 π 1 k, are scaled to sum to the sample size n.

6 5. Weighted estimates for data with nonresponse Suppose some of the sampled units do not respond, then for a unit i the probability of response is pr(selection and response) = pr(selection) pr(response selection) = π i φ i. Then a weighted estimator for the population mean is ȳ w = 1 r w i y i, r (π i φ i ) 1 r k=1 (π kφ k. ) 1 i=1 where w i = r Note: in practice, π i is determined by the sampling mechanism; φ i is usually unknown and needs to be estimated. Example 3.6. In stratified random sampling, suppose within each stratum (or weighting class) j the number of respondents is r j and the probability of response is uniform. Then a natural estimate of φ i is φ i = r j /n j, for unit i in stratum j. If the sampling weights are uniform or constant, then the estimator can be written as: J ȳ wc = n 1 n j ȳ jr, where ȳ jr is the mean value of the respondents in stratum j.

7 6. Propensity weighting Let X denote the set of variables observed for all subjects; variable Y is only partially observed and we are interested in certain characteristics of Y, for example, the mean of Y. When the number of X is limited, the weighting class estimator can be applied if data are MAR: pr(m X, Y ) = pr(m X). The weighting class is X and the reponse probability (or propensity) as p(x i ) = pr(m i = 0 x i ), for unit i. From pr{m i = 0 y i, p(x i )} = E[pr{m i = 0 y i, x i } y i, p(x i )] = E[pr{m i = 0 x i } y i, p(x i )] (by MAR) = E{p(x i ) y i, p(x i )} = p(x i ), in fact the weighting class can be defined by levels of p(x) and within each level the respondents are a random subsample.

8 In practice, p(x) is unknown, 1. we may estimate p(x) by logistic regression of M on X. 2. Then coarsen the estimates p(x) into five or six values (or intervals). 3. Regard the above classification as weighting class and apply the weighting class estimate with assuming that units in each class has the same value of reponse propensity; or simply weight each respondent i by the p(x i ) Sensitive to model specification. Biostat 2065: Analysis of Incomplete Data, 2005

9 7. Weighted Generalized Estimating Equations Let Y = (Y 1,..., Y K ) be a vector of variables and the observed data of Y, {y i } n i=1, are subject to missingness. Assume the covariates x i = (x i1,..., x ip ) are fully observed for all subjects, i = 1,..., n. The regression of Y on X has mean structure g(x, β) and β is the set of paramters of interest, with size d. 1. When there are no missing values, the inference on β can be done by sloving the following GEE: n D i (x i, β){y i g(x i, β)} = 0, i=1 where D i (x i, β) is a d K matrix and chosen according to the format or distribution of y i 2. When there are missing values, for example, the first r cases are complete cases. The complete-case estimation is based on the following GEE: r D i (x i, β){y i g(x i, β)} = 0, i=1

10 3. When there are missing values and the missing-data mechanism is determined by pr(m i = 0 x i, y i, z i ) = pr(m i = 0 x i, z i ; α) = p(x i, z i ; α), where z i is a set of fully observed auxiliary variables. Then the inference on θ can be done by solving the following weighted GEE: r p(x i, z i ; α) 1 D i (x i, β){y i g(x i, β)} = 0, i=1 where α is usually estimated by regression of M on X and Z. (a) The weighted GEE can correct bias of unweighted GEE when the mechanism depends on z. (b) Estimating α leads to more efficient estimates than simply using the true value of α. (c) This method can be extended to more general missing-data mechanisms with care.

11 8. Post-stratification and raking to known margins Example 3.9. Suppose in the stratified sampling data, the proportion N j /N is known from external sources. The post-stratified mean is ȳ ps = 1 J N j ȳ jr. N Example For a J L contigency table classified by variables X 1 and X 2, in each cell (j, l) suppose Y is recorded for r jl units out of n jl sampled units among a N jl sub-population. The post-stratified and weighting class estimates are: ȳ ps = 1 J L N jl ȳ jlr, ȳ ps = 1 J L n jl ȳ jlr. N n l=1 When only the marginal distribution of N jl is available, a modified estimate can be obtained by a raking method, i.e., finding estimates {N jl } of {N jl} satisfying: N j+ = L Njl = N j+, N+l = l=1 l=1 J Njl = N +l, j = 1,..., J; l = 1,..., L, and keep the same relationship between X 1 and X 2 from the sample: Biostat 2065: Analysis of Incomplete Data,2005 N jl = a j b l n jl.

12 9. Available-case analysis 1. When there are many missing-data patterns and the data are MCAR, the availablecase analysis use all cases where the variable of interest is present and usually more efficient than the complete-case analysis. 2. Some pitfalls. Homework 2 (due Sep. 22, 2005) Problems 2.13, 3.2, 3.3, 3.4, 3.5, 3.7, 3.13.

Data Integration for Big Data Analysis for finite population inference

Data Integration for Big Data Analysis for finite population inference for Big Data Analysis for finite population inference Jae-kwang Kim ISU January 23, 2018 1 / 36 What is big data? 2 / 36 Data do not speak for themselves Knowledge Reproducibility Information Intepretation