TWO-WAY CONTINGENCY TABLES WITH MARGINALLY AND CONDITIONALLY IMPUTED NONRESPONDENTS


TWO-WAY CONTINGENCY TABLES WITH MARGINALLY AND CONDITIONALLY IMPUTED NONRESPONDENTS

By Hansheng Wang

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Statistics) at the UNIVERSITY OF WISCONSIN-MADISON, 2006

Abstract

We consider estimating the cell probabilities and testing hypotheses in a two-way contingency table where two-dimensional categorical data have nonrespondents imputed using either conditional imputation or marginal imputation. Under simple random sampling, we derive asymptotic distributions for the cell probability estimators based on the imputed data. Under conditional imputation, we also show that these estimators are more efficient than those obtained by ignoring nonrespondents when the proportion of nonrespondents is large. A Wald type test and a Rao and Scott type corrected chi-square test for goodness-of-fit are derived. We show that the naive chi-square test for independence, which treats imputed values as observed data, is still asymptotically valid under marginal imputation. With a simple adjustment, multiplication by an appropriate factor, the naive chi-square test for independence is also valid under conditional imputation. We present simulation results that examine the size and compare the power of these tests. Some of the results are extended to stratified sampling with imputation within each stratum or across strata. Asymptotics are studied under two types of stratified sampling: (1) the number of strata is fixed and the stratum sizes are large, and (2) the number of strata is large and the stratum sizes are small.

Acknowledgements

First, I want to express my deepest gratitude to my Ph.D. adviser, Prof. Jun Shao. It was he who suggested my thesis topic and led me into the field of sample surveys and imputation. For the first time in my life, I found the world of statistics so exciting! My curiosity, enthusiasm, and ambition were always encouraged and appreciated there. It is his endless help and encouragement that made my academic life in Madison so challenging and productive. Prof. Shao also helped me build up my own research style, which emphasizes both theory and application. It was also Prof. Shao who introduced me to Dr. Shein-Chung Chow, another respected researcher to whom I am very grateful. Although Dr. Chow did not help me with sample surveys and imputation, it was he who led me into the field of pharmaceutical statistics, where I believe I will build my career. The most important thing I learned from Dr. Chow is practical sense, which gives me a unique understanding of what statistics is and what statistics should do. Statistics is neither science nor mathematics; instead, it is a way of reasoning and a philosophy of understanding when unexplained variation exists in the data. I believe this understanding will play an important role in guiding my future career and research. Next, I want to thank all my friends for their help and support. I want to thank Landon Sego, David Dahl, Emmily Chow, and JoAnne Pinto for their

careful proofreading of my thesis. Without their help, I could not have finished my thesis writing in such a short time. I also want to thank Bing Chen, Quan Hong, and Yuefeng Lu for their help and support when I was defending my thesis in Madison. I also want to thank my college classmates Xuan Liu and Xiaohuang Hong for their long-time support and encouragement whenever I encountered difficulty. Furthermore, I want to thank my Ph.D. defense committee, which includes Prof. Richard Johnson, Prof. Kam-Wah Tsui, Prof. Yi Lin, and Prof. Jun Zhu, for their careful reading and constructive comments. Finally, I want to give special thanks to my parents. As the only child of the family, I was given all the love, blessings, and wishes they could give. They are my support and motivation whenever I want to give up. During the past three years of study in the USA, I missed them very much. I hope my Ph.D. degree brings them happiness and pride!

Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Background
  1.2 An Outline
2 Imputation Under Simple Random Sampling
  2.1 Introduction
    2.1.1 Statistical Model for Nonresponse
  2.2 Marginal and Conditional Imputation
  2.3 Asymptotic Distribution
    2.3.1 The Case Where A and B Are Independent
    2.3.2 The Case Where A and B Are Dependent
  2.4 Weighted Mean Squared Error
  2.5 Testing for Goodness-of-Fit
  2.6 Testing for Independence
3 Simulation Study Under Simple Random Sampling
  3.1 Introduction
  3.2 Asymptotic Normality
  3.3 Weighted Mean Squared Error
  3.4 Testing for Goodness-of-Fit
  3.5 Testing for Independence
    3.5.1 Marginal Imputation
    3.5.2 Conditional Imputation
    3.5.3 Relative Efficiency
  3.6 Conclusion
4 Imputation Under Stratified Sampling
  4.1 Introduction
  4.2 Imputation Within Each Stratum
    4.2.1 Asymptotic Distribution
    4.2.2 Rao's Test for Goodness-of-Fit
  4.3 Imputation Across Strata with Small H
    4.3.1 Asymptotic Distribution
    4.3.2 Rao's Test for Goodness-of-Fit
  4.4 Imputation Across Strata with Large H
    4.4.1 Asymptotic Distribution
    4.4.2 Asymptotic Covariance and Estimation
5 Simulation Study Under Stratified Sampling
  5.1 Introduction
  5.2 Imputation Within Each Stratum
    5.2.1 Wald's Test for Goodness-of-Fit
    5.2.2 Rao's Test for Goodness-of-Fit
  5.3 Imputation Across Strata with Small H
    5.3.1 Wald's Test for Goodness-of-Fit
    5.3.2 Rao's Test for Goodness-of-Fit
  5.4 Imputation Across Strata with Large H
  5.5 Conclusion
6 Real Data Study
  6.1 The Beaver Dam Eye Study
  6.2 Victimization Incidents Study
Bibliography

Chapter 1

Introduction

1.1 Background

Two-way contingency tables are widely used for summarizing two-dimensional categorical data. Each cell in a two-way contingency table is a category defined by the two categorical variables. Sample cell frequencies are often computed from the observed responses (of the two-dimensional categorical variable) in a sample of units (subjects). Statistical inferences, including estimating cell probabilities, testing the hypothesis of independence, and testing goodness-of-fit, are then carried out. In sample surveys or medical studies, it is not uncommon for one or both of the categorical responses to be missing. Sampled units for which both components are missing (unit nonrespondents) can be handled by a suitable adjustment of the sampling weights. In practice, however, many sampled units may have exactly one missing component in their responses (item nonrespondents). Ignoring the data from sampled units with exactly one missing component is not acceptable, because throwing away observed data may result in a serious loss of efficiency in the analysis.

A popular method for handling item nonresponse is imputation, which inserts values for the unobserved items. Justification for the use of imputation, with practical considerations, can be found in Kalton and Kasprzyk (1986). After imputation, statistical inferences can be made by treating the imputed values as observed data and using formulas designed for the case of no nonresponse. Various imputation methods have been proposed and studied by different authors (Little and Rubin, 1987; Schafer, 1997). Imputation methods can be roughly divided into two categories: model-based imputation methods and nonparametric imputation methods. Model-based imputation methods assume a parametric or semiparametric model for the responses and the missingness. The most typical example is regression imputation, which assumes a linear model between the response and the observed covariates. The situation where the random errors in the linear model are normally distributed was studied by Srivastava and Carter (1986); Shao and Wang (2001) extended the results to the case in which no parametric assumption is made on the random error. Nonparametric imputation methods make no parametric assumption on the distribution of the responses and the missingness. Typical approaches in this category include hot deck imputation, cold deck imputation, and nearest neighbor imputation (Chen and Shao, 2000, 2001). However, all the above methods address either continuous data or one-dimensional categorical data. Imputation methods for multi-dimensional

categorical data are not well studied. For example, for a two-way contingency table, which is essentially a multi-dimensional categorical data problem, the statistical problems of interest include the following: How does one impute the data? How does the relative efficiency of imputation compare with that of other methods (e.g., the re-weighting method)? How can tests be performed in a valid way? Another important problem for imputation is variance/covariance estimation. It is well known that the variance/covariance of estimators based on imputed data may differ from that of estimators computed from complete data sets. As a result, variance and covariance estimators designed for complete data sets may not be valid for estimators generated by imputation. There are three commonly used approaches to estimating the variance and covariance of estimators based on imputed data. One is linearization, which uses Taylor's expansion to obtain an explicit theoretical formula for the covariance structure of the estimators and then replaces all unknown quantities by consistent point estimators. The merit of the linearization method is that it requires less computation than, e.g., resampling methods; however, it is not uncommon for the theoretical formula to be too complex to use. As an alternative, resampling methods such as the jackknife and the bootstrap (Rao and Shao, 1992) are commonly used to obtain variance/covariance estimators. The third approach is multiple imputation (Rubin, 1987), which imputes the same
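The multiple-imputation combining rule mentioned above can be sketched generically; this is an illustration of Rubin's (1987) rule for a scalar parameter, not code from this thesis, and the m point estimates and their within-imputation variances are assumed to be already computed:

```python
import numpy as np

def rubin_combine(estimates, within_vars):
    """Combine m point estimates from m imputed data sets with Rubin's
    (1987) rule: the total variance is the average within-imputation
    variance W plus (1 + 1/m) times the between-imputation variance B."""
    estimates = np.asarray(estimates, dtype=float)
    within_vars = np.asarray(within_vars, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()          # combined point estimate
    W = within_vars.mean()            # within-imputation variability
    B = estimates.var(ddof=1)         # between-imputation variability
    return q_bar, W + (1.0 + 1.0 / m) * B

q_bar, total_var = rubin_combine([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

The (1 + 1/m) factor inflates the between-imputation component to account for using a finite number m of imputations.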

data set more than once and then obtains the variance estimator by combining the between- and within-imputation variability in an appropriate way. The main purpose of this thesis is to investigate the statistical properties of a conditional imputation method, which imputes nonrespondents using estimated conditional probabilities. More specifically, we study (i) the consistency of estimators of cell probabilities based on imputed data; (ii) the asymptotic variances and covariances of estimators of cell probabilities, which lead to consistent variance and covariance estimators; and (iii) the validity of chi-square type tests for goodness-of-fit or independence. For testing independence of the two components of the categorical variable, we also study a marginal imputation method, which imputes nonrespondents using estimated marginal probabilities.

1.2 An Outline

The rest of this thesis is organized as follows. In Chapter 2, we study both conditional and marginal imputation under simple random sampling. In Chapter 3, extensive simulations are performed to evaluate the finite-sample performance of the procedures described in Chapter 2. In Chapter 4, we study conditional imputation for stratified sampling, including imputation within each stratum and imputation across strata. For imputation across strata, two different types of asymptotics are considered. One deals with a small number of strata with large stratum sizes. The other deals with a large number of strata

with small stratum sizes. Extensive simulations are carried out in Chapter 5 to evaluate the finite-sample performance of the procedures obtained in Chapter 4. Finally, in Chapter 6, several real data sets are presented to illustrate the proposed imputation methods.

Chapter 2

Imputation Under Simple Random Sampling

2.1 Introduction

In this chapter, we introduce two imputation methods under simple random sampling: marginal imputation and conditional imputation. Our results show that the point estimators obtained by conditional imputation are consistent, while those obtained by marginal imputation usually are not, unless the two components of the categorical variable are independent of each other. The asymptotic distributions of the point estimators under both imputation methods are derived where appropriate. In order to evaluate the statistical performance of the point estimators, we propose a measure called the weighted mean squared error (WMSE). The estimators given by conditional imputation and by the re-weighting method are then compared in terms of WMSE. The results show that conditional imputation can improve efficiency when the proportion of complete units is small. Testing goodness-of-fit is also considered. We first propose a

Wald type statistic, which is asymptotically valid. We then show that the naive method of treating the imputed values as observed and applying Pearson's chi-square test is not valid, and we propose a Rao type correction to the naive method. Finally, the performances of the Wald type and Rao type statistics are compared; the results show that they are comparable in terms of Type I error. Testing independence is also considered. We show that the naive method of applying Pearson's chi-square statistic directly is still asymptotically valid under marginal imputation but not under conditional imputation. After division by a simple constant, the naive method is also asymptotically correct under conditional imputation.

2.1.1 Statistical Model for Nonresponse

Consider a two-dimensional response vector $(A, B)$, where $A$ and $B$ are categorical responses taking values in $\{1, \dots, a\}$ and $\{1, \dots, b\}$, respectively. In practice, imputation is carried out by first creating imputation classes and then imputing nonrespondents within each imputation class. Imputation classes are sub-populations of the original population and are usually formed using an auxiliary variable observed without nonresponse. For example, in many business surveys, imputation classes are strata or unions of strata. In medical studies, if data are obtained under several different treatments, the treatment groups are imputation classes. Throughout this chapter, we make the following assumption:

Assumption A. For each sampled unit within an imputation class, $\pi_A$ denotes the probability of observing $A$ and missing $B$, $\pi_B$ denotes the probability of observing $B$ and missing $A$, and $\pi_C$ denotes the probability of observing both $A$ and $B$.

As discussed in the previous chapter, units with both responses missing (unit nonrespondents) can be ignored after suitably adjusting the sampling weights. As a result, we assume there is no unit nonresponse, i.e., $\pi_A + \pi_B + \pi_C = 1$. Note that the probabilities $\pi_A$, $\pi_B$, and $\pi_C$ may differ across imputation classes. For simplicity and without loss of generality, we consider only simple random sampling with replacement. In practice, the data may be obtained by sampling without replacement; our results remain valid if the sampling fraction is negligible. For convenience, we assume that there is only one imputation class, since the extension to multiple imputation classes is straightforward.

2.2 Marginal and Conditional Imputation

Suppose there are $n$ sampled units, indexed by $k$ (i.e., $(A_k, B_k)$, $k = 1, \dots, n$). To simplify notation, $(A_k, B_k)$ may also be referred to as $(A, B)$. Let $C_A$, $C_B$, and $C_C$ be the collections of indices of the units with $B$ missing, with $A$ missing, and with neither $A$ nor $B$ missing, respectively. Let $n_A = |C_A|$, $n_B = |C_B|$,
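To make the missingness mechanism of Assumption A concrete, the following minimal Python sketch simulates a sample under simple random sampling with replacement; the cell probabilities and the values of $\pi_A$ and $\pi_B$ are hypothetical choices for illustration, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sample(n, p, pi_A, pi_B):
    """Draw n units (A, B) from an a x b table of cell probabilities p,
    then mask components per Assumption A: with probability pi_A only A
    is observed, with probability pi_B only B is observed, and with
    probability pi_C = 1 - pi_A - pi_B both are observed.
    A value of -1 marks a missing item."""
    a, b = p.shape
    cells = rng.choice(a * b, size=n, p=p.ravel())
    A, B = np.divmod(cells, b)
    u = rng.random(n)
    A_obs = np.where(u < pi_B, -1, A)                         # A missing
    B_obs = np.where((u >= pi_B) & (u < pi_A + pi_B), -1, B)  # B missing
    return A_obs, B_obs

# hypothetical 2 x 2 table with independent margins; 30% item nonresponse
p = np.outer([0.6, 0.4], [0.5, 0.5])
A_obs, B_obs = simulate_sample(10_000, p, pi_A=0.15, pi_B=0.15)
```

Under this mechanism no unit has both components missing, matching the assumption $\pi_A + \pi_B + \pi_C = 1$.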

and $n_C = |C_C|$, where $|S|$ denotes the number of elements in a finite set $S$. In other words, $n_A$, $n_B$, and $n_C$ are the numbers of units with $B$ missing, with $A$ missing, and with neither $A$ nor $B$ missing, respectively. Therefore, the total sample size is $n = n_A + n_B + n_C$. Let $n^C_{ij}$ denote the total number of completers with $(A, B) = (i, j)$. Let $p_{ij} = P((A, B) = (i, j))$, $p_{i\cdot} = P(A = i)$, and $p_{\cdot j} = P(B = j)$. Typical estimators of $p_{i\cdot}$ and $p_{\cdot j}$ based on the completers are
\[
\hat p^C_{i\cdot} = \frac{\sum_{j=1}^{b} n^C_{ij}}{\sum_{ij} n^C_{ij}} = \frac{n^C_{i\cdot}}{n^C}
\qquad \text{and} \qquad
\hat p^C_{\cdot j} = \frac{\sum_{i=1}^{a} n^C_{ij}}{\sum_{ij} n^C_{ij}} = \frac{n^C_{\cdot j}}{n^C},
\]
where $n^C_{i\cdot} = \sum_{j=1}^{b} n^C_{ij}$, $n^C_{\cdot j} = \sum_{i=1}^{a} n^C_{ij}$, and $n^C = \sum_{ij} n^C_{ij}$. Given an incompleter $(A, B) = (i, *)$, where $*$ denotes the missing value, marginal imputation imputes the missing value $j$ ($1 \le j \le b$) with probability $\hat p^C_{\cdot j}$. That is, the missing value is imputed according to its estimated marginal distribution, without conditioning on the observed item ($A = i$). Missing values of different incompleters are imputed independently. Intuitively, parameters such as $p_{\cdot j}$ and $p_{i\cdot}$ can be estimated consistently from marginally imputed data, but parameters such as $p_{ij}$ cannot, since the relationship between $A$ and $B$ is not preserved by marginal imputation. The conditional probability $P(B = j \mid A = i)$ is denoted by $p_{ij|A} = p_{ij}/p_{i\cdot}$.

Thus, an estimator of $p_{ij|A}$ based on the completers is
\[
\hat p^C_{ij|A} = \frac{\hat p^C_{ij}}{\hat p^C_{i\cdot}} = \frac{n^C_{ij}/n^C}{n^C_{i\cdot}/n^C} = \frac{n^C_{ij}}{n^C_{i\cdot}}.
\]
Conditional imputation imputes the missing value $j$ ($1 \le j \le b$) with probability $\hat p^C_{ij|A}$. In other words, given the completers, conditional imputation imputes a missing value according to its estimated conditional distribution given the observed component. Imputation for an incompleter with $A$ missing is similar, and incompleters are imputed independently, conditioning on the completers and their observed components. After imputation, estimators of $p_{ij}$ are obtained using the standard formulas for a two-way contingency table, treating the imputed values as observed data. We denote these estimators (based on either marginal or conditional imputation) by $\hat p^I_{ij}$. Recall that $C_A$ is the collection of indices of the units with $A$ observed and $B$ missing. Let
\[
\hat p^A_{ij} = \frac{1}{n_A} \sum_{k \in C_A} I\{(A_k, B_k) = (i, j)\},
\]
where $B_k$ is the value obtained by imputation (either marginal or conditional) for $k \in C_A$. $\hat p^B_{ij}$ and $\hat p^C_{ij}$ are defined similarly. The relationship between $\hat p^I_{ij}$ and $\hat p^A_{ij}$, $\hat p^B_{ij}$, $\hat p^C_{ij}$ is
\[
\hat p^I_{ij} = \frac{n_A \hat p^A_{ij} + n_B \hat p^B_{ij} + n_C \hat p^C_{ij}}{n}.
\]
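The two imputation schemes and the resulting estimator $\hat p^I_{ij}$ can be sketched as follows. This is an illustrative Python implementation under the setup above (with -1 encoding a missing component), not code from the thesis:

```python
import numpy as np

rng = np.random.default_rng(1)

def impute(A, B, a, b, method="conditional"):
    """Impute item nonrespondents (-1 = missing) and return the table of
    estimates p^I. Marginal imputation draws a missing item from its
    estimated marginal distribution among completers; conditional
    imputation draws it from the estimated conditional distribution
    given the observed item."""
    A, B = A.copy(), B.copy()
    comp = (A >= 0) & (B >= 0)
    nC = np.zeros((a, b))
    np.add.at(nC, (A[comp], B[comp]), 1)          # completer counts n^C_ij
    pC = nC / nC.sum()                            # estimated p^C_ij
    for k in np.flatnonzero((A >= 0) & (B < 0)):  # B missing, A observed
        w = pC[A[k]] / pC[A[k]].sum() if method == "conditional" \
            else pC.sum(axis=0)                   # conditional vs marginal
        B[k] = rng.choice(b, p=w)
    for k in np.flatnonzero((B >= 0) & (A < 0)):  # A missing, B observed
        w = pC[:, B[k]] / pC[:, B[k]].sum() if method == "conditional" \
            else pC.sum(axis=1)
        A[k] = rng.choice(a, p=w)
    counts = np.zeros((a, b))
    np.add.at(counts, (A, B), 1)
    return counts / len(A)                        # estimated p^I_ij

# tiny hypothetical data set: the last two units are item nonrespondents
A = np.array([0, 0, 1, 1, 0, -1])
B = np.array([0, 1, 0, 1, -1, 1])
p_I = impute(A, B, a=2, b=2, method="conditional")
```

Each incompleter is imputed independently given the completers, mirroring the description above.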

For convenience, we define
\[
p = (p_{11}, \dots, p_{1b}, \dots, p_{a1}, \dots, p_{ab})', \qquad
p_A = (p_{1\cdot}, \dots, p_{a\cdot})', \qquad
p_B = (p_{\cdot 1}, \dots, p_{\cdot b})',
\]
\[
\hat p^I = (\hat p^I_{11}, \dots, \hat p^I_{1b}, \dots, \hat p^I_{a1}, \dots, \hat p^I_{ab})', \tag{2.1}
\]
\[
\hat p^A = (\hat p^A_{11}, \dots, \hat p^A_{ab})', \qquad
\hat p^B = (\hat p^B_{11}, \dots, \hat p^B_{ab})', \qquad
\hat p^C = (\hat p^C_{11}, \dots, \hat p^C_{ab})'.
\]

2.3 Asymptotic Distribution

In order to obtain the limiting distribution of $\hat p^I_{ij}$, Lemma 1 is given here without proof. Lemma 1 is also used when we study stratified sampling in Chapter 4. A more general form of the lemma can be found in Schenker and Welsh (1988).

Lemma 1 Let $X_n$ be a sample, and let $U_n(X_n)$ and $W_n$ be two random vectors such that $U_n \to_d N(0, \Sigma_1)$ and $W_n \mid X_n \to_d N(0, \Sigma_2)$. Then $U_n + W_n \to_d N(0, \Sigma_1 + \Sigma_2)$.

2.3.1 The Case Where A and B Are Independent

When $A$ and $B$ are independent, the estimators produced by both marginal and conditional imputation are consistent. However, their variances and covariances differ from those of the standard estimator of $p_{ij}$ when there is no nonresponse. The following theorem establishes the asymptotic distributions and covariance matrices of $\hat p^I$, defined in (2.1), under both conditional and marginal imputation.

Theorem 1 Assume that $A$ and $B$ are independent. If $\pi_C > 0$, then, as $n \to \infty$, $\sqrt n(\hat p^I - p) \to_d N(0, \Sigma)$, where

(a) under marginal imputation,
\[
\Sigma = P_A \otimes P_B + \frac{\pi_C + 2\pi_C\pi_A + \pi_A^2}{\pi_C}\,(p_Ap_A') \otimes P_B + \frac{\pi_C + 2\pi_C\pi_B + \pi_B^2}{\pi_C}\,P_A \otimes (p_Bp_B');
\]

(b) under conditional imputation,
\[
\Sigma = \Big(\frac{1}{\pi_C} + 1 - \pi_C\Big) P_A \otimes P_B + \frac{\pi_C + 2\pi_C\pi_A + \pi_A^2}{\pi_C}\,(p_Ap_A') \otimes P_B + \frac{\pi_C + 2\pi_C\pi_B + \pi_B^2}{\pi_C}\,P_A \otimes (p_Bp_B').
\]
Here $\otimes$ is the Kronecker product; $p_A$ and $p_B$ are given in (2.1); $P_A = \mathrm{diag}(p_A) - p_Ap_A'$, where $\mathrm{diag}(p_A)$ denotes the diagonal matrix of the same dimension as $p_A$ whose $i$th ($1 \le i \le a$) diagonal element is the $i$th component of $p_A$; and

$P_B = \mathrm{diag}(p_B) - p_Bp_B'$.

Proof: After imputation, each unit is complete. For a given unit $(A_k, B_k)$, define $I_k^A$ to be the $a$-dimensional vector whose $i$th element is 1 and whose other elements are 0 when $A_k = i$; similarly, define $I_k^B$ to be the $b$-dimensional vector whose $j$th element is 1 and whose other elements are 0 when $B_k = j$. Under the hypothesis that $A$ and $B$ are independent, $I_k^A$ and $I_k^B$ are independent. According to (2.1), note that
\[
\hat p^I = \frac{1}{n}\sum_{k=1}^{n} I_k^A \otimes I_k^B, \qquad
\hat p^A = \frac{1}{n_A}\sum_{k \in C_A} I_k^A \otimes I_k^B, \qquad
\hat p^B = \frac{1}{n_B}\sum_{k \in C_B} I_k^A \otimes I_k^B, \qquad
\hat p^C = \frac{1}{n_C}\sum_{k \in C_C} I_k^A \otimes I_k^B.
\]
It follows that
\[
\sqrt n(\hat p^I - p)
= \frac{n_A}{\sqrt n}(\hat p^A - p) + \frac{n_B}{\sqrt n}(\hat p^B - p) + \frac{n_C}{\sqrt n}(\hat p^C - p)
\]
\[
= \Big[\frac{n_A}{\sqrt n}\big(E(\hat p^A \mid \sigma(C)) - p\big) + \frac{n_B}{\sqrt n}\big(E(\hat p^B \mid \sigma(C)) - p\big) + \frac{n_C}{\sqrt n}(\hat p^C - p)\Big]
\]

\[
{} + \Big[\frac{n_A}{\sqrt n}\big(\hat p^A - E(\hat p^A \mid \sigma(C))\big) + \frac{n_B}{\sqrt n}\big(\hat p^B - E(\hat p^B \mid \sigma(C))\big)\Big],
\]
where $E(\cdot \mid \sigma(C))$ denotes $E(\cdot \mid n_A, n_B, n_C, (A_k, B_k), k \in C_C)$; in other words, $E(\cdot \mid \sigma(C))$ denotes the expectation conditional on the completers and the numbers of incompleters. Let
\[
U_n = \frac{n_A}{\sqrt n}\big(E(\hat p^A \mid \sigma(C)) - p\big) + \frac{n_B}{\sqrt n}\big(E(\hat p^B \mid \sigma(C)) - p\big) + \frac{n_C}{\sqrt n}(\hat p^C - p), \tag{2.2}
\]
and
\[
W_n = \frac{n_A}{\sqrt n}\big(\hat p^A - E(\hat p^A \mid \sigma(C))\big) + \frac{n_B}{\sqrt n}\big(\hat p^B - E(\hat p^B \mid \sigma(C))\big). \tag{2.3}
\]

(a) Marginal imputation. Given $\sigma(C)$ (i.e., $n_A$, $n_B$, $n_C$, and the completers), $\{I_k^A \otimes I_k^B\}_{k \in C_A}$ are i.i.d. random vectors with mean $E(\hat p^A \mid \sigma(C))$. By the Central Limit Theorem,
\[
\sqrt{n_A}\big(\hat p^A - E(\hat p^A \mid \sigma(C))\big) \,\big|\, \sigma(C) \to_d N(0, \Sigma_W),
\qquad
\Sigma_W = \mathrm{diag}\{E(\hat p^A \mid \sigma(C))\} - E(\hat p^A \mid \sigma(C))\,E(\hat p^A \mid \sigma(C))'.
\]
On the other hand, $E(\hat p^A_{ij} \mid \sigma(C)) = p_{i\cdot}\,\hat p^C_{\cdot j} \to_{a.s.} p_{i\cdot}\,p_{\cdot j} = p_{ij}$ as $n_C \to \infty$. Therefore,
\[
\Sigma_W \to_{a.s.} \mathrm{diag}\{p\} - pp' \stackrel{.}{=} P, \tag{2.4}
\]

where $\stackrel{.}{=}$ denotes "defined to be". Consequently,
\[
W_n = \sqrt{\tfrac{n_A}{n}}\,\sqrt{n_A}\big(\hat p^A - E(\hat p^A \mid \sigma(C))\big) + \sqrt{\tfrac{n_B}{n}}\,\sqrt{n_B}\big(\hat p^B - E(\hat p^B \mid \sigma(C))\big)
\]
\[
= \sqrt{\pi_A}\,\sqrt{n_A}\big(\hat p^A - E(\hat p^A \mid \sigma(C))\big) + \sqrt{\pi_B}\,\sqrt{n_B}\big(\hat p^B - E(\hat p^B \mid \sigma(C))\big) + o_p(1)
\to_d \sqrt{\pi_A}\,N(0, P) + \sqrt{\pi_B}\,N(0, P) = N(0, (1 - \pi_C)P).
\]
Under the assumption that $A$ and $B$ are independent, the $(i,j)$th component of $E(\hat p^A \mid \sigma(C)) - p$ is $p_{i\cdot}\hat p^C_{\cdot j} - p_{ij} = p_{i\cdot}(\hat p^C_{\cdot j} - p_{\cdot j})$, so that
\[
E(\hat p^A \mid \sigma(C)) - p = p_A \otimes \Big[\frac{1}{n_C}\sum_{k \in C_C}(I_k^B - p_B)\Big] = \frac{1}{n_C}\sum_{k \in C_C} p_A \otimes (I_k^B - p_B).
\]

Similarly,
\[
E(\hat p^B \mid \sigma(C)) - p = \frac{1}{n_C}\sum_{k \in C_C} (I_k^A - p_A) \otimes p_B.
\]
Thus, we have
\[
U_n = \frac{n_A}{\sqrt n\,n_C}\sum_{k \in C_C} p_A \otimes (I_k^B - p_B) + \frac{n_B}{\sqrt n\,n_C}\sum_{k \in C_C} (I_k^A - p_A) \otimes p_B + \frac{n_C}{\sqrt n}(\hat p^C - p)
\]
\[
= \frac{1}{\sqrt{n_C}}\sum_{k \in C_C}\Big[\sqrt{\pi_C}\,(I_k^A - p_A) \otimes (I_k^B - p_B) + \frac{\pi_C + \pi_A}{\sqrt{\pi_C}}\,p_A \otimes (I_k^B - p_B) + \frac{\pi_C + \pi_B}{\sqrt{\pi_C}}\,(I_k^A - p_A) \otimes p_B\Big] + o_p(1)
\to_d N(0, \Sigma_U),
\]
where
\[
\Sigma_U = \mathrm{var}\Big(\sqrt{\pi_C}\,(I_k^A - p_A) \otimes (I_k^B - p_B) + \frac{\pi_C + \pi_A}{\sqrt{\pi_C}}\,p_A \otimes (I_k^B - p_B) + \frac{\pi_C + \pi_B}{\sqrt{\pi_C}}\,(I_k^A - p_A) \otimes p_B\Big).
\]

Let $P_A = \mathrm{diag}\{p_A\} - p_Ap_A'$ and $P_B = \mathrm{diag}\{p_B\} - p_Bp_B'$. Then
\[
\mathrm{var}\big[(I_k^A - p_A) \otimes (I_k^B - p_B)\big]
= E\big\{[(I_k^A - p_A)(I_k^A - p_A)'] \otimes [(I_k^B - p_B)(I_k^B - p_B)']\big\}
= P_A \otimes P_B,
\]
where the last equality holds because $I_k^A$ and $I_k^B$ are independent. Similarly,
\[
\mathrm{var}\big[p_A \otimes (I_k^B - p_B)\big] = (p_Ap_A') \otimes P_B, \qquad
\mathrm{var}\big[(I_k^A - p_A) \otimes p_B\big] = P_A \otimes (p_Bp_B'),
\]
\[
\mathrm{cov}\big[(I_k^A - p_A) \otimes (I_k^B - p_B),\; p_A \otimes (I_k^B - p_B)\big]
= E\big\{[(I_k^A - p_A)p_A'] \otimes [(I_k^B - p_B)(I_k^B - p_B)']\big\} = 0 \otimes P_B = 0,
\]
\[
\mathrm{cov}\big[(I_k^A - p_A) \otimes (I_k^B - p_B),\; (I_k^A - p_A) \otimes p_B\big]
= E\big\{[(I_k^A - p_A)(I_k^A - p_A)'] \otimes [(I_k^B - p_B)p_B']\big\} = P_A \otimes 0 = 0,
\]

\[
\mathrm{cov}\big[p_A \otimes (I_k^B - p_B),\; (I_k^A - p_A) \otimes p_B\big]
= E\big\{[p_A(I_k^A - p_A)'] \otimes [(I_k^B - p_B)p_B']\big\} = 0.
\]
As a result, $\Sigma_U$ is given by
\[
\Sigma_U = \pi_C\, P_A \otimes P_B + \frac{(\pi_C + \pi_A)^2}{\pi_C}\,(p_Ap_A') \otimes P_B + \frac{(\pi_C + \pi_B)^2}{\pi_C}\,P_A \otimes (p_Bp_B').
\]
Therefore, $U_n \to_d N(0, \Sigma_U)$. Since $W_n \to_d N(0, (1 - \pi_C)P)$ and
\[
P = \mathrm{diag}\{p\} - pp' = \mathrm{diag}\{p_A \otimes p_B\} - (p_Ap_A') \otimes (p_Bp_B')
= P_A \otimes P_B + (p_Ap_A') \otimes P_B + P_A \otimes (p_Bp_B'),
\]
we have
\[
\sqrt n(\hat p^I - p) = W_n + U_n \to_d N\big(0, (1 - \pi_C)P + \Sigma_U\big) = N(0, \Sigma),
\]

where
\[
\Sigma = P_A \otimes P_B + \frac{\pi_C + 2\pi_C\pi_A + \pi_A^2}{\pi_C}\,(p_Ap_A') \otimes P_B + \frac{\pi_C + 2\pi_C\pi_B + \pi_B^2}{\pi_C}\,P_A \otimes (p_Bp_B').
\]

(b) Conditional imputation. Suppose the total sample size is large and $P(n_C = 0)$ is negligible. As in part (a), under conditional imputation we have $W_n \mid \sigma(C) \to_d N(0, (1 - \pi_C)P)$. On the other hand, since $E(\hat p^A_{ij} \mid \sigma(C)) = p_{i\cdot}\,\hat p^C_{ij}/\hat p^C_{i\cdot}$ and $p_{ij} = p_{i\cdot}p_{\cdot j}$ under independence, Taylor's expansion gives
\[
E(\hat p^A_{ij} \mid \sigma(C)) - p_{ij} = (\hat p^C_{ij} - p_{ij}) - p_{\cdot j}(\hat p^C_{i\cdot} - p_{i\cdot}) + o_p(n_C^{-1/2}),
\]
so that
\[
E(\hat p^A \mid \sigma(C)) - p
= \frac{1}{n_C}\sum_{k \in C_C}\big[(I_k^A \otimes I_k^B - p_A \otimes p_B) - (I_k^A - p_A) \otimes p_B\big] + o_p(n_C^{-1/2})
\]
\[
= \frac{1}{n_C}\sum_{k \in C_C}\big[(I_k^A - p_A) \otimes (I_k^B - p_B) + p_A \otimes (I_k^B - p_B)\big] + o_p(n_C^{-1/2}).
\]

Similarly,
\[
E(\hat p^B \mid \sigma(C)) - p = \frac{1}{n_C}\sum_{k \in C_C}\big[(I_k^A - p_A) \otimes (I_k^B - p_B) + (I_k^A - p_A) \otimes p_B\big] + o_p(n_C^{-1/2}).
\]
As a result,
\[
U_n = \frac{1}{\sqrt{n_C}}\sum_{k \in C_C}\Big[\sqrt{\pi_C}\,(I_k^A \otimes I_k^B - p_A \otimes p_B)
+ \frac{\pi_A}{\sqrt{\pi_C}}\big((I_k^A - p_A) \otimes (I_k^B - p_B) + p_A \otimes (I_k^B - p_B)\big)
+ \frac{\pi_B}{\sqrt{\pi_C}}\big((I_k^A - p_A) \otimes (I_k^B - p_B) + (I_k^A - p_A) \otimes p_B\big)\Big] + o_p(1)
\]
\[
= \frac{1}{\sqrt{n_C}}\sum_{k \in C_C}\Big[\frac{1}{\sqrt{\pi_C}}\,(I_k^A - p_A) \otimes (I_k^B - p_B)
+ \frac{\pi_C + \pi_A}{\sqrt{\pi_C}}\,p_A \otimes (I_k^B - p_B)
+ \frac{\pi_C + \pi_B}{\sqrt{\pi_C}}\,(I_k^A - p_A) \otimes p_B\Big] + o_p(1)
\]
\[
\to_d N\Big(0,\; \frac{1}{\pi_C}\,P_A \otimes P_B + \frac{(\pi_C + \pi_A)^2}{\pi_C}\,(p_Ap_A') \otimes P_B + \frac{(\pi_C + \pi_B)^2}{\pi_C}\,P_A \otimes (p_Bp_B')\Big).
\]
Consequently, $W_n + U_n \to_d N(0, \Sigma)$, where
\[
\Sigma = \Big(\frac{1}{\pi_C} + 1 - \pi_C\Big) P_A \otimes P_B + \frac{\pi_C + 2\pi_C\pi_A + \pi_A^2}{\pi_C}\,(p_Ap_A') \otimes P_B + \frac{\pi_C + 2\pi_C\pi_B + \pi_B^2}{\pi_C}\,P_A \otimes (p_Bp_B'). \;\square
\]
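Theorem 1's covariance matrices are straightforward to assemble numerically from Kronecker products. The following sketch uses hypothetical margins and response probabilities; note that the two matrices differ only in the coefficient of $P_A \otimes P_B$, which is 1 under marginal imputation and $1/\pi_C + 1 - \pi_C$ under conditional imputation:

```python
import numpy as np

def sigma_theorem1(pA, pB, pi_A, pi_B, conditional):
    """Assemble the asymptotic covariance matrix of Theorem 1
    (A and B independent) from Kronecker products."""
    pA, pB = np.asarray(pA, float), np.asarray(pB, float)
    pi_C = 1.0 - pi_A - pi_B
    PA = np.diag(pA) - np.outer(pA, pA)
    PB = np.diag(pB) - np.outer(pB, pB)
    kappa = 1.0 / pi_C + 1.0 - pi_C if conditional else 1.0
    cA = (pi_C + 2 * pi_C * pi_A + pi_A ** 2) / pi_C
    cB = (pi_C + 2 * pi_C * pi_B + pi_B ** 2) / pi_C
    return (kappa * np.kron(PA, PB)
            + cA * np.kron(np.outer(pA, pA), PB)
            + cB * np.kron(PA, np.outer(pB, pB)))

# hypothetical margins and response probabilities
S_marg = sigma_theorem1([0.6, 0.4], [0.5, 0.5], 0.15, 0.15, conditional=False)
S_cond = sigma_theorem1([0.6, 0.4], [0.5, 0.5], 0.15, 0.15, conditional=True)
```

Since the components of $\hat p^I$ sum to 1, every row of either covariance matrix sums to zero, which gives a quick sanity check on the construction.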

2.3.2 The Case Where A and B Are Dependent

When $A$ and $B$ are dependent, the point estimators obtained by marginal imputation are not consistent. Therefore, marginal imputation is usually not considered in this case, and the asymptotic results are established for conditional imputation only.

Theorem 2 Assume that $\pi_C > 0$. Under conditional imputation, $\sqrt n(\hat p^I - p) \to_d N(0, \Sigma)$, where $\Sigma = MPM' + (1 - \pi_C)P$,
\[
M = \frac{1}{\sqrt{\pi_C}}\,I_{ab} - \frac{\pi_A}{\sqrt{\pi_C}}\,\mathrm{diag}\{p_{B|A}\}\,(I_a \otimes U_b) - \frac{\pi_B}{\sqrt{\pi_C}}\,\mathrm{diag}\{p_{A|B}\}\,(U_a \otimes I_b), \tag{2.5}
\]
and
\[
p_{A|B} = (p_{11}/p_{\cdot 1}, \dots, p_{1b}/p_{\cdot b}, \dots, p_{a1}/p_{\cdot 1}, \dots, p_{ab}/p_{\cdot b})',
\qquad
p_{B|A} = (p_{11}/p_{1\cdot}, \dots, p_{1b}/p_{1\cdot}, \dots, p_{a1}/p_{a\cdot}, \dots, p_{ab}/p_{a\cdot})'. \tag{2.6}
\]
$I_d$ ($d = ab$, $a$, or $b$) is the $d$-dimensional identity matrix, and $U_d$ is the $d$-dimensional square matrix with all elements equal to 1.

Proof: $U_n$ and $W_n$ are defined in (2.2) and (2.3). Under conditional imputation, we have $W_n \mid \sigma(C) \to_d N(0, (1 - \pi_C)P)$,
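Theorem 2's covariance matrix can be assembled directly from (2.5). A numeric sketch under a hypothetical dependent table (the probabilities are illustrative choices, not from the thesis):

```python
import numpy as np

def sigma_theorem2(p, pi_A, pi_B):
    """Covariance matrix Sigma = M P M' + (1 - pi_C) P of Theorem 2,
    with M built from (2.5); p is the a x b cell-probability table."""
    a, b = p.shape
    pi_C = 1.0 - pi_A - pi_B
    pv = p.ravel()
    P = np.diag(pv) - np.outer(pv, pv)
    p_BgA = (p / p.sum(axis=1, keepdims=True)).ravel()  # p_ij / p_i.
    p_AgB = (p / p.sum(axis=0, keepdims=True)).ravel()  # p_ij / p_.j
    G_A = np.kron(np.eye(a), np.ones((b, b)))           # I_a (x) U_b
    G_B = np.kron(np.ones((a, a)), np.eye(b))           # U_a (x) I_b
    M = (np.eye(a * b)
         - pi_A * np.diag(p_BgA) @ G_A
         - pi_B * np.diag(p_AgB) @ G_B) / np.sqrt(pi_C)
    return M @ P @ M.T + (1.0 - pi_C) * P

# hypothetical dependent 2 x 2 table
Sigma = sigma_theorem2(np.array([[0.3, 0.1], [0.2, 0.4]]), 0.15, 0.15)
```

As with Theorem 1, the rows of the resulting matrix sum to zero because the cell probability estimates sum to 1.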

where $P$ is given in (2.4). On the other hand, since $E(\hat p^A_{ij} \mid \sigma(C)) = p_{i\cdot}\,\hat p^C_{ij}/\hat p^C_{i\cdot}$, Taylor's expansion gives
\[
\sqrt{n_C}\,\big[E(\hat p^A_{ij} \mid \sigma(C)) - p_{ij}\big]
= \sqrt{n_C}\,\Big[(\hat p^C_{ij} - p_{ij}) - \frac{p_{ij}}{p_{i\cdot}}(\hat p^C_{i\cdot} - p_{i\cdot})\Big] + o_p(1).
\]
As a result,
\[
\sqrt{n_C}\,\big[E(\hat p^A \mid \sigma(C)) - p\big]
= \big[I_{ab} - \mathrm{diag}\{p_{B|A}\}(I_a \otimes U_b)\big]\,\big[\sqrt{n_C}(\hat p^C - p)\big] + o_p(1),
\]
where $p_{B|A}$ is defined in (2.6). Similarly, it can be shown that
\[
\sqrt{n_C}\,\big[E(\hat p^B \mid \sigma(C)) - p\big]
= \big[I_{ab} - \mathrm{diag}\{p_{A|B}\}(U_a \otimes I_b)\big]\,\big[\sqrt{n_C}(\hat p^C - p)\big] + o_p(1).
\]

Hence,
\[
U_n = \sqrt{n_C}\Big[\frac{\pi_A}{\sqrt{\pi_C}}\big(E(\hat p^A \mid \sigma(C)) - p\big) + \frac{\pi_B}{\sqrt{\pi_C}}\big(E(\hat p^B \mid \sigma(C)) - p\big) + \sqrt{\pi_C}\,(\hat p^C - p)\Big] + o_p(1)
= M\,\sqrt{n_C}(\hat p^C - p) + o_p(1) \to_d N(0, MPM'),
\]
where $M$ is given in (2.5). Consequently,
\[
\sqrt n(\hat p^I - p) = W_n + U_n \to_d N\big(0, MPM' + (1 - \pi_C)P\big) = N(0, \Sigma). \;\square
\]
The asymptotic covariance matrix can be estimated by replacing $p_{ij}$, $\pi_A$, $\pi_B$, and $\pi_C$ in $\Sigma$ by $\hat p^I_{ij}$, $n_A/n$, $n_B/n$, and $n_C/n$, respectively. The resulting estimator is denoted by $\hat\Sigma$.

2.4 Weighted Mean Squared Error

Let $\hat p$ be an arbitrary estimator of the cell probability vector $p$. To evaluate its performance, we propose a measure called the weighted mean squared error (WMSE), which is defined by
\[
\mathrm{WMSE}(\hat p) = \sum_{ij} \frac{E(\hat p_{ij} - p_{ij})^2}{p_{ij}}.
\]

Theorem 3 Under conditional imputation,
\[
\mathrm{WMSE}(\hat p^I) = \frac{1}{n}\Big[\frac{1}{\pi_C}\big(ab + \pi_A^2 a + \pi_B^2 b - 2\pi_A a - 2\pi_B b + 2\pi_A\pi_B + 2\pi_A\pi_B\delta\big) - \pi_C\,ab + (ab - 1)\Big] + o\Big(\frac{1}{n}\Big),
\]

where $\delta$ is a non-centrality parameter given by
\[
\delta = \sum_{ij} \frac{(p_{ij} - p_{i\cdot}p_{\cdot j})^2}{p_{i\cdot}p_{\cdot j}}.
\]
Intuitively, $\delta$ can be interpreted as a measure of the dependence between $A$ and $B$. When $A$ and $B$ are independent, $\delta$ attains its smallest possible value, 0.

Proof: It has been shown that $\sqrt n(\hat p^I - p) \to_d N(0, \Sigma)$, where $\Sigma$ is given in Theorem 2. For convenience, define
\[
1/\sqrt{p} = \big(1/\sqrt{p_{11}}, \dots, 1/\sqrt{p_{1b}}, \dots, 1/\sqrt{p_{a1}}, \dots, 1/\sqrt{p_{ab}}\big)',
\]
\[
p^2/(p_A \otimes p_B) = \big(p_{11}^2/(p_{1\cdot}p_{\cdot 1}), \dots, p_{1b}^2/(p_{1\cdot}p_{\cdot b}), \dots, p_{a1}^2/(p_{a\cdot}p_{\cdot 1}), \dots, p_{ab}^2/(p_{a\cdot}p_{\cdot b})\big)'.
\]
It follows that $\sqrt n\,\mathrm{diag}\{1/\sqrt p\}\,(\hat p^I - p) \to_d N(0, \Sigma^*)$, where $\Sigma^* = \mathrm{diag}\{1/\sqrt p\}\,\Sigma\,\mathrm{diag}\{1/\sqrt p\}$. On the other hand, $\Sigma = M\,P\,M' + (1 - \pi_C)P$ with $P = \mathrm{diag}\{p\} - pp'$, and
\[
\mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,\mathrm{diag}\{p\}\,\mathrm{diag}\{1/\sqrt p\}\big] = ab,
\qquad
\mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,pp'\,\mathrm{diag}\{1/\sqrt p\}\big] = 1.
\]
Moreover, $Mp = \sqrt{\pi_C}\,p$, so $\mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,Mpp'M'\,\mathrm{diag}\{1/\sqrt p\}\big] = \pi_C$.

As a result, we only need to evaluate $\mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,M\,\mathrm{diag}\{p\}\,M'\,\mathrm{diag}\{1/\sqrt p\}\big]$. It can be verified that
\[
\mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,\mathrm{diag}\{p_{B|A}\}(I_a \otimes U_b)\,\mathrm{diag}\{p\}\,(I_a \otimes U_b)'\,\mathrm{diag}\{p_{B|A}\}\,\mathrm{diag}\{1/\sqrt p\}\big] = a,
\]
\[
\mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,\mathrm{diag}\{p_{A|B}\}(U_a \otimes I_b)\,\mathrm{diag}\{p\}\,(U_a \otimes I_b)'\,\mathrm{diag}\{p_{A|B}\}\,\mathrm{diag}\{1/\sqrt p\}\big] = b,
\]
\[
\mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,\mathrm{diag}\{p\}\,(I_a \otimes U_b)'\,\mathrm{diag}\{p_{B|A}\}\,\mathrm{diag}\{1/\sqrt p\}\big]
= \mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,\mathrm{diag}\{p_{B|A}\}(I_a \otimes U_b)\,\mathrm{diag}\{p\}\,\mathrm{diag}\{1/\sqrt p\}\big] = a,
\]
\[
\mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,\mathrm{diag}\{p_{A|B}\}(U_a \otimes I_b)\,\mathrm{diag}\{p\}\,\mathrm{diag}\{1/\sqrt p\}\big]
= \mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,\mathrm{diag}\{p\}\,(U_a \otimes I_b)'\,\mathrm{diag}\{p_{A|B}\}\,\mathrm{diag}\{1/\sqrt p\}\big] = b.
\]
Noting that
\[
\delta = \sum_{ij}\frac{(p_{ij} - p_{i\cdot}p_{\cdot j})^2}{p_{i\cdot}p_{\cdot j}}
= \sum_{ij}\frac{p_{ij}^2}{p_{i\cdot}p_{\cdot j}} - 2\sum_{ij}p_{ij} + \sum_{ij}p_{i\cdot}p_{\cdot j}
= \mathrm{tr}\big(\mathrm{diag}\{p^2/(p_A \otimes p_B)\}\big) - 1,
\]

it follows that
\[
\mathrm{tr}\big[\mathrm{diag}\{1/\sqrt p\}\,\mathrm{diag}\{p_{A|B}\}(U_a \otimes I_b)\,\mathrm{diag}\{p\}\,(I_a \otimes U_b)'\,\mathrm{diag}\{p_{B|A}\}\,\mathrm{diag}\{1/\sqrt p\}\big]
= \mathrm{tr}\big(\mathrm{diag}\{p^2/(p_A \otimes p_B)\}\big) = \delta + 1,
\]
and the same value is obtained with the two factors interchanged. Thus,
\[
\mathrm{tr}(\Sigma^*) = \frac{1}{\pi_C}\big(ab + \pi_A^2 a + \pi_B^2 b - 2\pi_A a - 2\pi_B b + 2\pi_A\pi_B + 2\pi_A\pi_B\delta\big) - \pi_C\,ab + (ab - 1).
\]
The proof is completed by noting that
\[
n\,\mathrm{WMSE}(\hat p^I) = E\big\|\sqrt n\,\mathrm{diag}\{1/\sqrt p\}\,(\hat p^I - p)\big\|^2 = \mathrm{tr}(\Sigma^*) + o(1). \;\square
\]
According to Theorem 3, $\mathrm{WMSE}(\hat p^I)$ depends on the probabilities $\pi_A$, $\pi_B$, and $\pi_C$, and on the cell probabilities through the non-centrality parameter $\delta$; moreover, $\mathrm{WMSE}(\hat p^I)$ is an increasing function of $\delta$. Under Assumption A, $\hat p^C$ (the estimator using the complete units only) is also consistent. The relative efficiency of $\hat p^I$ and $\hat p^C$ can be assessed by the difference between $\mathrm{WMSE}(\hat p^I)$ and $\mathrm{WMSE}(\hat p^C)$. Our simulation results in Chapter 3 show that the estimators given by conditional imputation can increase efficiency when the proportion of completers is small.
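The leading term of Theorem 3 is easy to evaluate numerically. A sketch with hypothetical inputs, which also illustrates that the expression grows with δ when $\pi_A, \pi_B > 0$:

```python
import numpy as np

def wmse_leading_term(p, pi_A, pi_B, n):
    """Leading term of WMSE(p^I) under conditional imputation
    (Theorem 3); p is the a x b table of cell probabilities."""
    a, b = p.shape
    ab = a * b
    pi_C = 1.0 - pi_A - pi_B
    pA, pB = p.sum(axis=1), p.sum(axis=0)
    # non-centrality parameter delta
    delta = (((p - np.outer(pA, pB)) ** 2) / np.outer(pA, pB)).sum()
    core = (ab + pi_A ** 2 * a + pi_B ** 2 * b - 2 * pi_A * a - 2 * pi_B * b
            + 2 * pi_A * pi_B + 2 * pi_A * pi_B * delta)
    return (core / pi_C - pi_C * ab + (ab - 1)) / n

p_dep = np.array([[0.3, 0.1], [0.2, 0.4]])              # delta > 0
p_ind = np.outer(p_dep.sum(axis=1), p_dep.sum(axis=0))  # same margins, delta = 0
w_complete = wmse_leading_term(p_ind, 0.0, 0.0, 100)    # no nonresponse
w_dep = wmse_leading_term(p_dep, 0.15, 0.15, 100)
w_ind = wmse_leading_term(p_ind, 0.15, 0.15, 100)
```

With no nonresponse ($\pi_A = \pi_B = 0$) the expression collapses to $(ab - 1)/n$, the complete-data value.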

2.5 Testing for Goodness-of-Fit

A direct application of Theorem 2 is a Wald type test for goodness-of-fit. Consider the null hypothesis $H_0: p = p_0$, where $p_0$ is a known vector. Under $H_0$,
\[
X_W^2 \stackrel{.}{=} n(\hat p - p_0)'\,\hat\Sigma^{-1}(\hat p - p_0) \to_d \chi^2_{ab-1},
\]
where $\chi^2_v$ denotes a chi-square random variable with $v$ degrees of freedom; here $\hat p$ ($p_0$) is obtained by dropping the last component of $\hat p^I$ ($p_0$), and $\hat\Sigma$ is the estimated asymptotic covariance matrix of this reduced $\hat p$, obtained by dropping the last row and column of the estimated asymptotic covariance matrix of $\hat p^I$. Although $X_W^2$ provides an asymptotically correct chi-square test, the computation of $\hat\Sigma^{-1}$ is complicated. Instead of looking for an asymptotically exact test, we consider a simple correction of the standard Pearson chi-square statistic obtained by matching the first-order moment (Rao and Scott, 1981). When there is no nonresponse, under $H_0$ the Pearson statistic is asymptotically distributed as a chi-square random variable with $ab - 1$ degrees of freedom:
\[
X_P^2 = n\sum_{ij}\frac{(\hat p_{ij} - p_{ij})^2}{p_{ij}} \to_d \chi^2_{ab-1}. \tag{2.7}
\]
Therefore, $E(X_P^2) \approx ab - 1$. Under conditional imputation, however, it follows from Theorem 3 that
\[
E(X_P^2) \approx \frac{1}{\pi_C}\big(ab + \pi_A^2 a + \pi_B^2 b - 2\pi_A a - 2\pi_B b + 2\pi_A\pi_B + 2\pi_A\pi_B\delta\big) - \pi_C\,ab + (ab - 1).
\]

If we let
\[
\lambda = \frac{ab + \pi_A^2 a + \pi_B^2 b - 2\pi_A a - 2\pi_B b + 2\pi_A\pi_B + 2\pi_A\pi_B\delta}{\pi_C(ab - 1)} - \frac{\pi_C\,ab}{ab - 1} + 1, \tag{2.8}
\]
it follows that $E(X_P^2/\lambda) \approx ab - 1$. Thus, we propose the corrected Pearson statistic $X_C^2 = X_P^2/\lambda$. The performance of this corrected chi-square test, the naive chi-square test, and Wald's test is evaluated by a simulation study in Chapter 3.

2.6 Testing for Independence

An application of Theorem 1 is testing the independence of $A$ and $B$. When $\pi_C = 1$ (i.e., there is no nonresponse), the chi-square statistic is given by
\[
X_I^2 \stackrel{.}{=} n\sum_{ij}\frac{(\hat p_{ij} - \hat p_{i\cdot}\hat p_{\cdot j})^2}{\hat p_{i\cdot}\hat p_{\cdot j}} \to_d \chi^2_{(a-1)(b-1)}.
\]
The following theorem establishes the asymptotic behavior of $X_I^2$ under marginal and conditional imputation when $\pi_C > 0$.

Theorem 4 Assume that $\pi_C > 0$ and that $A$ and $B$ are independent.

(a) When marginal imputation is applied to impute nonrespondents,
\[
X_I^2 \to_d \chi^2_{(a-1)(b-1)}.
\]

(b) When conditional imputation is applied to impute nonrespondents,

$$X_I^2/(\pi_C^{-1} + 1 - \pi_C) \to_d \chi^2_{(a-1)(b-1)}.$$

Proof: (a) After marginal imputation, the test statistic is given by

$$X_I^2 = n\sum_{i,j}\frac{(\hat p^I_{ij} - \hat p^I_{i\cdot}\hat p^I_{\cdot j})^2}{\hat p^I_{i\cdot}\hat p^I_{\cdot j}} = \|V_n\|^2,$$

where $V_n$ is an $ab$-dimensional vector with

$$\frac{\sqrt n(\hat p^I_{ij} - \hat p^I_{i\cdot}\hat p^I_{\cdot j})}{\sqrt{\hat p^I_{i\cdot}\hat p^I_{\cdot j}}}$$

as its $d$th component, where $d = (i-1)b + j$. By Taylor expansion and Slutsky's theorem,

$$\frac{\sqrt n(\hat p^I_{ij} - \hat p^I_{i\cdot}\hat p^I_{\cdot j})}{\sqrt{\hat p^I_{i\cdot}\hat p^I_{\cdot j}}} = \frac{\sqrt n(\hat p^I_{ij} - \hat p^I_{i\cdot}p_{\cdot j} - p_{i\cdot}\hat p^I_{\cdot j} + p_{i\cdot}p_{\cdot j})}{\sqrt{p_{i\cdot}p_{\cdot j}}} + o_p(1).$$

Define $U = I_{ab} - (p_A 1_a')\otimes I_b - I_a\otimes(p_B 1_b')$, where $1_a$ and $1_b$ are vectors with all elements equal to 1 and dimension $a$ and $b$, respectively. Let $1/\sqrt{p_A}$ denote the vector $(1/\sqrt{p_{1\cdot}}, \ldots, 1/\sqrt{p_{a\cdot}})'$, and let $1/\sqrt{p_B}$ be defined similarly. Define $S = \mathrm{diag}\{(1/\sqrt{p_A})\otimes(1/\sqrt{p_B})\}$. Note that

$$V_n = SU\left(\sqrt n(\hat p_I - p)\right) + o_p(1) \to_d N(0, SU\Sigma U'S'),$$

where $\Sigma$ is given in Theorem 1 and is of the form

$$\Sigma = \kappa\, P_A\otimes P_B + x\,(p_A p_A')\otimes P_B + y\, P_A\otimes(p_B p_B'),$$

with $\kappa = 1/\pi_C + 1 - \pi_C$ under conditional imputation and $\kappa = 1$ under marginal imputation. Note that

$$P_A(1_a p_A') = (p_A 1_a')P_A = 0, \qquad P_B(1_b p_B') = (p_B 1_b')P_B = 0,$$
$$(p_A p_A')(1_a p_A') = (p_A 1_a')(p_A p_A') = p_A p_A', \qquad (p_B p_B')(1_b p_B') = (p_B 1_b')(p_B p_B') = p_B p_B'.$$

As a result, $U\Sigma U' = \kappa\, P_A\otimes P_B$. Since $S = \mathrm{diag}\{1/\sqrt{p_{i\cdot}}\}\otimes\mathrm{diag}\{1/\sqrt{p_{\cdot j}}\}$, it follows that $SU\Sigma U'S' = \kappa\,\tilde P_A\otimes\tilde P_B$, where

$$\tilde P_A = \mathrm{diag}(1/\sqrt{p_A})\, P_A\, \mathrm{diag}(1/\sqrt{p_A}) = \mathrm{diag}(1/\sqrt{p_A})\left(\mathrm{diag}\{p_A\} - p_A p_A'\right)\mathrm{diag}(1/\sqrt{p_A}) = I_a - \sqrt{p_A}\sqrt{p_A}'.$$

Similarly,

$$\tilde P_B = \mathrm{diag}(1/\sqrt{p_B})\, P_B\, \mathrm{diag}(1/\sqrt{p_B}) = I_b - \sqrt{p_B}\sqrt{p_B}'.$$

Because $\tilde P_A$ and $\tilde P_B$ are projection matrices with rank $a-1$ and $b-1$, respectively, $\tilde P_A\otimes\tilde P_B$ is also a projection matrix, but with rank $(a-1)(b-1)$. Since

$$V_n \to_d N(0, SU\Sigma U'S') = N(0, \kappa\,\tilde P_A\otimes\tilde P_B),$$

it follows that

$$\frac{1}{\kappa}X_I^2 = \frac{1}{\kappa}\|V_n\|^2 \to_d \chi^2_{(a-1)(b-1)}.$$

The proof is completed by recalling that $\kappa = 1$ under marginal imputation and $\kappa = 1/\pi_C + 1 - \pi_C$ under conditional imputation.
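To illustrate Theorem 4 in practice: the naive Pearson statistic is computed from the imputed table and, under conditional imputation, divided by an estimate of $\kappa = 1/\pi_C + 1 - \pi_C$, such as $\hat\kappa = n/n_C + 1 - n_C/n$ with $n_C$ the number of completers. A minimal sketch with hypothetical counts:

```python
import numpy as np

def pearson_independence(counts):
    """Naive Pearson chi-square statistic for independence on an a x b table."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p_hat = counts / n
    row = p_hat.sum(axis=1, keepdims=True)   # p_i.
    col = p_hat.sum(axis=0, keepdims=True)   # p_.j
    expected = row * col                     # independence fit
    return n * float(((p_hat - expected) ** 2 / expected).sum())

def corrected_statistic(counts, n_complete):
    """Divide the naive statistic by kappa_hat = n/n_C + 1 - n_C/n
    (the sample analogue of kappa in Theorem 4, conditional imputation)."""
    n = float(np.asarray(counts).sum())
    kappa_hat = n / n_complete + 1.0 - n_complete / n
    return pearson_independence(counts) / kappa_hat

# Hypothetical conditionally imputed 2x2 table: n = 1000, n_C = 600 completers.
table = np.array([[280, 220],
                  [220, 280]])
x2_naive = pearson_independence(table)
x2_corrected = corrected_statistic(table, n_complete=600)
```

Both statistics are referred to the chi-square distribution with $(a-1)(b-1)$ degrees of freedom; the correction only shrinks the naive statistic, since $\hat\kappa \ge 1$.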

Chapter 3

Simulation Study Under Simple Random Sampling

3.1 Introduction

All the results obtained in the previous chapter are based on large-sample theory. In this chapter, extensive simulations are carried out to evaluate the finite-sample performance of the proposed estimators and tests. In Section 3.2, we study the asymptotic normality through chi-square scores, a tool used by Johnson and Wichern (1998) to assess multivariate normality. In Section 3.3, the relative efficiencies of $\hat p_I$ and $\hat p_C$ are compared using the WMSE. In Section 3.4, the two test statistics (Wald type and Rao type) for goodness-of-fit are compared in terms of size (type I error probability). In Section 3.5, testing independence under marginal imputation, conditional imputation, and the re-weighting method is studied, and the relative efficiencies of these tests are compared in terms of power.
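The chi-square scores used below as a normality diagnostic can be sketched as follows. Here multivariate normal draws stand in for the (suitably centered and scaled) estimators, and the dimension and covariance matrix are hypothetical; if the normality holds, the scores behave like chi-square variables with $d$ degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(0)

def chi_square_scores(X, sigma):
    """Chi-square scores s = x' Sigma^{-1} x for each row x of X."""
    sol = np.linalg.solve(sigma, X.T)        # Sigma^{-1} x for all rows at once
    return np.einsum("ij,ji->i", X, sol)

# Hypothetical d-dimensional normal sample with a non-trivial covariance.
d = 3
A = rng.standard_normal((d, d))
sigma = A @ A.T + d * np.eye(d)              # positive definite by construction
X = rng.multivariate_normal(np.zeros(d), sigma, size=20000)

scores = chi_square_scores(X, sigma)
# If X ~ N(0, Sigma), the scores are chi-square with d degrees of freedom,
# so their sample mean should be close to d.
mean_score = scores.mean()
```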

3.2 Asymptotic Normality

Let $X$ be a random vector and $\Sigma$ a positive definite matrix. The chi-square score of $X$ with respect to $\Sigma$ is given by

$$s = X'\Sigma^{-1}X. \qquad (3.1)$$

Under the assumption that $X$ is normally distributed with mean 0 and covariance matrix $\Sigma$, $s$ is a chi-square random variable with $d$ degrees of freedom, where $d$ is the dimension of $X$. Therefore, chi-square scores can be used as a tool to evaluate the normality of a multivariate random vector. Since $\hat p_I$ has a degenerate covariance matrix, instead of studying $\hat p_I$ we study $\hat p$, which is obtained by dropping the last component of $\hat p_I$.

In each simulation, the total sample size $n$ is fixed at 1,000. Two sizes of contingency tables (2 × 2 and 5 × 5) are considered. For a given table size, it is assumed that $p_{ij} \equiv 1/(ab)$. Thirty-two different missing patterns (i.e., $(\pi_A, \pi_B, \pi_C)$) are considered. For each missing pattern, 10,000 data sets are generated based on the given parameters (i.e., $n, a, b, p_{ij}, \pi_A, \pi_B, \pi_C$). For each data set, conditional imputation is performed and $\hat p_I$ is calculated. The asymptotic covariance matrix is also calculated according to Theorem 2, and the chi-square score is obtained according to (3.1) based on $\hat p$ and $\hat\Sigma$. Asymptotically, the scores are distributed as chi-square random variables with $ab - 1$ degrees of freedom. Therefore, each of the 10,000 chi-square scores is compared with $\chi^2_{0.05}$ and $\chi^2_{0.95}$, where $\chi^2_p$ is the $p$th upper quantile of a chi-square random variable

with the appropriate degrees of freedom, i.e., $P(\chi^2 > \chi^2_p) = p$. The empirical upper tail probabilities, i.e., $P(s > \chi^2_{0.05})$ and $P(s > \chi^2_{0.95})$, are estimated by the proportions of chi-square scores larger than $\chi^2_{0.05}$ and $\chi^2_{0.95}$, respectively. The results are summarized in Table 1.

In order to provide a better understanding of the true distribution of the chi-square scores after conditional imputation, the density of the chi-square scores is estimated in R for selected cases and compared with the chi-square densities with the appropriate degrees of freedom. The results are given in Figure 1 and Figure 2, and show that the true distributions of the chi-square scores are well approximated by their asymptotic distribution.

3.3 Weighted Mean Squared Error

In this section, a simulation study is performed to compare the efficiency of $\hat p_I$ and $\hat p_C$ in terms of the WMSE defined in Section 2.4. Two distributions for 2 × 2 contingency tables are considered: (0.25, 0.25; 0.25, 0.25) and (0.01, 0.49; 0.49, 0.01). The noncentrality parameters of these two distributions are 0 and a positive value, respectively. Based on the same simulation scheme as described in Section 3.2, 10,000 data sets are generated for each parameter setting (i.e., $n = 1{,}000$, $p_{ij}$, $\pi_A$, $\pi_B$, $\pi_C$). For each data set, the WMSEs of $\hat p_I$ and $\hat p_C$ are calculated. In order to compare the efficiency of the imputation and re-weighting methods, the difference of the WMSEs is considered and magnified by the total sample size $n$.

This difference is estimated by its average over the 10,000 data sets and is denoted by $\hat\Delta$. The results are summarized in Table 2, with negative values indicating better performance of $\hat p_I$. It can be seen that when the missing probability is large, conditional imputation can improve the efficiency of the point estimator.

3.4 Testing for Goodness-of-Fit

Two methods are proposed in Chapter 2 for testing goodness-of-fit: a Wald type statistic and a Rao type statistic. Under the null hypothesis, the Wald type statistic is essentially the chi-square score constructed in Section 3.2; therefore, in this section we only study the Rao type statistic. Based on the same simulation scheme as described in Section 3.2, 10,000 data sets are generated for each parameter setting. For each data set, the Rao type statistic is calculated and compared with the appropriate chi-square quantiles. The empirical upper tail probabilities are estimated by the proportions of the Rao type statistics that exceed the quantiles. The results are summarized in Table 3. In addition, the density of the Rao type statistic is estimated in R and compared with the standard chi-square density; the results are given in Figures 3 and 4. All the results show that the performance of the Rao type test is comparable to that of the Wald type test in terms of type I error.
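The Rao type correction studied here rescales Pearson's statistic by the factor λ of (2.8). A minimal sketch, assuming the reading of (2.8) in which the leading factor is $1/\{\pi_C(ab-1)\}$, and with the π's and δ supplied directly (in practice they would be replaced by estimates):

```python
def rao_correction_factor(a, b, pi_a, pi_b, pi_c, delta):
    """Correction factor lambda of (2.8) for Pearson's goodness-of-fit
    statistic under conditional imputation (formula as reconstructed)."""
    ab = a * b
    core = (ab + pi_a**2 * a + pi_b**2 * b - 2 * pi_a * a - 2 * pi_b * b
            + 2 * pi_a * pi_b + 2 * pi_a * pi_b * delta)
    return core / (pi_c * (ab - 1)) - pi_c * ab / (ab - 1) + 1.0

def corrected_pearson(x2_pearson, a, b, pi_a, pi_b, pi_c, delta):
    """X_C^2 = X_P^2 / lambda, referred to the chi-square(ab-1) distribution."""
    return x2_pearson / rao_correction_factor(a, b, pi_a, pi_b, pi_c, delta)

# Sanity check: with full response (pi_c = 1, pi_a = pi_b = 0), lambda = 1
# and no correction is applied.
lam_full = rao_correction_factor(2, 2, 0.0, 0.0, 1.0, 0.0)
```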

3.5 Testing for Independence

Marginal Imputation

Based on the same simulation scheme described in Section 3.2, 10,000 data sets are generated for each parameter setting. For each data set, marginal imputation is applied and the naive Pearson chi-square statistic is calculated. According to our result, the naive Pearson statistic should be approximately chi-square distributed with $(a-1)(b-1)$ degrees of freedom. Therefore, it is compared with the 5% and 95% upper quantiles of the chi-square distribution with $(a-1)(b-1)$ degrees of freedom, and the empirical upper tail probabilities are estimated. The results are summarized in Table 4. In addition, the densities of the naive Pearson statistics are estimated in R and compared with the densities of chi-square random variables with the appropriate degrees of freedom. The results are given in Figures 5 and 6.

Conditional Imputation

Based on the same simulation scheme described in Section 3.2, 10,000 data sets are generated for each parameter setting. For each data set, conditional imputation is applied. The naive Pearson chi-square statistic is calculated and corrected by the appropriate constant given in Theorem 4. According to

our result, the corrected Pearson statistic should be approximately chi-square distributed with $(a-1)(b-1)$ degrees of freedom. Therefore, the corrected Pearson statistics are compared with the 5% and 95% upper quantiles of the chi-square distribution with the appropriate degrees of freedom, and the empirical upper tail probabilities are estimated. The results are reported in Table 5. In addition, the density of the corrected Pearson statistic is estimated in R and compared with the densities of chi-square random variables with the appropriate degrees of freedom. The results are given in Figures 7 and 8.

Relative Efficiency

Let $X_I^{2,r}$, $X_I^{2,c}$, and $X_I^{2,m}$ be the chi-square statistics for testing the independence of A and B based on the completers (re-weighting), conditional imputation, and marginal imputation, respectively. According to our asymptotic theory, they define three asymptotically correct tests with rejection regions given by

$$I\{X_I^{2,r} > \chi^2_{1-\alpha,(a-1)(b-1)}\}, \quad I\{X_I^{2,c}/\kappa > \chi^2_{1-\alpha,(a-1)(b-1)}\}, \quad I\{X_I^{2,m} > \chi^2_{1-\alpha,(a-1)(b-1)}\},$$

respectively, where $\kappa = n n_C^{-1} + 1 - n_C n^{-1}$. Under the null hypothesis that A and B are independent, all three tests have asymptotic size α. Therefore, the

relative efficiency of the three tests becomes a problem of interest. In this section, a simulation study is performed to compare the three tests in terms of power. The simulation is based on a 2 × 2 contingency table with distribution (0.28, 0.22; 0.22, 0.28). Thirty-two different missing patterns are considered. For each parameter setting, 10,000 data sets are generated and three chi-square statistics are calculated: one based on the re-weighting method, one from Pearson's test after marginal imputation, and one from the corrected Pearson test after conditional imputation. The power is estimated by the proportion of the statistics that correctly reject the null hypothesis. The results are summarized in Table 6.

In order to better understand how the power of the three tests changes as a function of $\pi_C$, we perform a simulation based on the same 2 × 2 contingency table with distribution (0.28, 0.22; 0.22, 0.28). For a given probability of completeness $\pi_C$, we set $\pi_A = \pi_B = (1 - \pi_C)/2$. Fifty values of $\pi_C$, evenly spaced between 0.1 and 1.0, are studied. For each parameter setting, 10,000 iterations are carried out. The estimated empirical power is plotted against $\pi_C$; the results are given in Figure 9.

We also study the power of the three tests as a function of the noncentrality parameter δ. In this case, we fix the missing pattern at $(\pi_C, \pi_A, \pi_B) = (0.5, 0.3, 0.2)$.
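As an illustration of the power computations in this section, the sketch below estimates the power of the completers-only (re-weighting) chi-square test at the 2 × 2 alternative (0.28, 0.22; 0.22, 0.28); the missingness mechanism, replication count, and hard-coded critical value are simplifications relative to the actual study:

```python
import numpy as np

rng = np.random.default_rng(2006)

def pearson_2x2(counts):
    """Pearson chi-square statistic for independence in a 2x2 table of counts."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row * col / n
    return float(((counts - expected) ** 2 / expected).sum())

def estimate_power(p, n=1000, pi_c=0.5, reps=500, crit=3.841):
    """Monte Carlo power of the completers-only test: each unit is a completer
    with probability pi_c (MCAR), and only completers enter the statistic.
    crit = upper 5% point of chi-square with 1 degree of freedom."""
    p = np.asarray(p).ravel()
    rejections = 0
    for _ in range(reps):
        cells = rng.multinomial(n, p)            # full-sample cell counts
        completers = rng.binomial(cells, pi_c)   # thin each cell by pi_c
        table = completers.reshape(2, 2)
        if table.sum() > 0 and pearson_2x2(table) > crit:
            rejections += 1
    return rejections / reps

power = estimate_power([0.28, 0.22, 0.22, 0.28])
```

Binomial thinning of the multinomial cell counts is equivalent to marking each unit a completer independently with probability $\pi_C$, which is the missing-completely-at-random setting assumed here.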

Note that the 2 × 2 contingency tables $w_\delta$ considered here have noncentrality parameter equal to δ/16, which is proportional to δ. Therefore, simulations are performed based on this family of 2 × 2 contingency tables. Fifty equally spaced δ values from 0.01 to 0.50 are studied. The estimated empirical power is plotted against δ in Figure 10.

The results in this section suggest that, when testing independence, the greatest power is achieved by using the complete units only, while the chi-square test under marginal imputation has the smallest power. An intuitive explanation is that marginal imputation makes the two categorical responses of the incompleters independent of each other conditional on the completers, which weakens the dependence between the two components; the effect is even more pronounced when the proportion of incompleters is large. As a result, the power of marginal imputation for testing independence is the lowest. On the other hand, since conditional imputation successfully captures the dependence structure of the two responses, its power is significantly higher than that of marginal imputation and comparable to, though not as good as, that of the re-weighting method, because imputation creates additional noise.

However, the merit of marginal imputation is that the naive Pearson test statistic remains valid, which means that the marginally imputed data set can be processed by standard software without modification. This is useful

when the proportion of nonrespondents is not too large. If the proportion of incompleters is relatively large, conditional imputation is recommended; in that case, the naive Pearson statistic should be corrected by a constant that depends only on the proportion of complete units.

3.6 Conclusion

For the selected sample sizes and parameters, our simulation results show that the empirical distributions of all the Wald type statistics are well approximated by the derived asymptotic distributions. In addition, the simulations demonstrate that the empirical distributions of all the Rao type statistics are well approximated by chi-square distributions with the appropriate degrees of freedom. With regard to testing independence by the different methods (re-weighting, marginal imputation, or conditional imputation), the simulation results indicate that the greatest power is achieved by using the complete units only, whereas the chi-square test under marginal imputation has the smallest power.

Table 1: Empirical upper tail probabilities of the Wald type statistic, by missing pattern ($\pi_C$, $\pi_A$, $\pi_B$), with columns $p_{0.05}$ and $p_{0.95}$ for each table size. Number of iterations: 10,000; sample size: 1,000; $p_{ij} = 1/(ab)$; $p_{0.05}$ and $p_{0.95}$: 5% and 95% empirical upper tail probabilities, respectively. [Table entries not recoverable from this transcription.]

Table 2: Efficiency comparison by WMSE: $\hat\Delta$ for each missing pattern ($\pi_C$, $\pi_A$, $\pi_B$) under the two values of δ. Number of iterations: 10,000; sample size: 1,000; δ = 0 corresponds to the distribution p = (0.25, 0.25; 0.25, 0.25) and the positive δ to p = (0.01, 0.49; 0.49, 0.01). [Table entries not recoverable from this transcription.]

Table 3: Empirical upper tail probabilities of the Rao type statistic, by missing pattern ($\pi_C$, $\pi_A$, $\pi_B$), with columns $p_{0.05}$ and $p_{0.95}$ for each table size. Number of iterations: 10,000; sample size: 1,000; $p_{ij} = 1/(ab)$; $p_{0.05}$ and $p_{0.95}$: 5% and 95% empirical upper tail probabilities, respectively. [Table entries not recoverable from this transcription.]


More information

2 Naïve Methods. 2.1 Complete or available case analysis

2 Naïve Methods. 2.1 Complete or available case analysis 2 Naïve Methods Before discussing methods for taking account of missingness when the missingness pattern can be assumed to be MAR in the next three chapters, we review some simple methods for handling

More information

New Developments in Nonresponse Adjustment Methods

New Developments in Nonresponse Adjustment Methods New Developments in Nonresponse Adjustment Methods Fannie Cobben January 23, 2009 1 Introduction In this paper, we describe two relatively new techniques to adjust for (unit) nonresponse bias: The sample

More information

A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data

A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data Yujun Wu, Marc G. Genton, 1 and Leonard A. Stefanski 2 Department of Biostatistics, School of Public Health, University of Medicine

More information

Testing Homogeneity Of A Large Data Set By Bootstrapping

Testing Homogeneity Of A Large Data Set By Bootstrapping Testing Homogeneity Of A Large Data Set By Bootstrapping 1 Morimune, K and 2 Hoshino, Y 1 Graduate School of Economics, Kyoto University Yoshida Honcho Sakyo Kyoto 606-8501, Japan. E-Mail: morimune@econ.kyoto-u.ac.jp

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

Asymptotic Normality under Two-Phase Sampling Designs

Asymptotic Normality under Two-Phase Sampling Designs Asymptotic Normality under Two-Phase Sampling Designs Jiahua Chen and J. N. K. Rao University of Waterloo and University of Carleton Abstract Large sample properties of statistical inferences in the context

More information

Multivariate Time Series: VAR(p) Processes and Models

Multivariate Time Series: VAR(p) Processes and Models Multivariate Time Series: VAR(p) Processes and Models A VAR(p) model, for p > 0 is X t = φ 0 + Φ 1 X t 1 + + Φ p X t p + A t, where X t, φ 0, and X t i are k-vectors, Φ 1,..., Φ p are k k matrices, with

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Part 1.) We know that the probability of any specific x only given p ij = p i p j is just multinomial(n, p) where p k1 k 2

Part 1.) We know that the probability of any specific x only given p ij = p i p j is just multinomial(n, p) where p k1 k 2 Problem.) I will break this into two parts: () Proving w (m) = p( x (m) X i = x i, X j = x j, p ij = p i p j ). In other words, the probability of a specific table in T x given the row and column counts

More information

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design 1 / 32 Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design Changbao Wu Department of Statistics and Actuarial Science University of Waterloo (Joint work with Min Chen and Mary

More information

Negative Multinomial Model and Cancer. Incidence

Negative Multinomial Model and Cancer. Incidence Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence S. Lahiri & Sunil K. Dhar Department of Mathematical Sciences, CAMS New Jersey Institute of Technology, Newar,

More information

Chapter 9. Hotelling s T 2 Test. 9.1 One Sample. The one sample Hotelling s T 2 test is used to test H 0 : µ = µ 0 versus

Chapter 9. Hotelling s T 2 Test. 9.1 One Sample. The one sample Hotelling s T 2 test is used to test H 0 : µ = µ 0 versus Chapter 9 Hotelling s T 2 Test 9.1 One Sample The one sample Hotelling s T 2 test is used to test H 0 : µ = µ 0 versus H A : µ µ 0. The test rejects H 0 if T 2 H = n(x µ 0 ) T S 1 (x µ 0 ) > n p F p,n

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

Applied Multivariate and Longitudinal Data Analysis

Applied Multivariate and Longitudinal Data Analysis Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 In this chapter we will discuss inference

More information

Lecture 3. Inference about multivariate normal distribution

Lecture 3. Inference about multivariate normal distribution Lecture 3. Inference about multivariate normal distribution 3.1 Point and Interval Estimation Let X 1,..., X n be i.i.d. N p (µ, Σ). We are interested in evaluation of the maximum likelihood estimates

More information

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2 Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2 Fall, 2013 Page 1 Random Variable and Probability Distribution Discrete random variable Y : Finite possible values {y

More information

5.3 LINEARIZATION METHOD. Linearization Method for a Nonlinear Estimator

5.3 LINEARIZATION METHOD. Linearization Method for a Nonlinear Estimator Linearization Method 141 properties that cover the most common types of complex sampling designs nonlinear estimators Approximative variance estimators can be used for variance estimation of a nonlinear

More information

PIRLS 2016 Achievement Scaling Methodology 1

PIRLS 2016 Achievement Scaling Methodology 1 CHAPTER 11 PIRLS 2016 Achievement Scaling Methodology 1 The PIRLS approach to scaling the achievement data, based on item response theory (IRT) scaling with marginal estimation, was developed originally

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

STT 843 Key to Homework 1 Spring 2018

STT 843 Key to Homework 1 Spring 2018 STT 843 Key to Homework Spring 208 Due date: Feb 4, 208 42 (a Because σ = 2, σ 22 = and ρ 2 = 05, we have σ 2 = ρ 2 σ σ22 = 2/2 Then, the mean and covariance of the bivariate normal is µ = ( 0 2 and Σ

More information

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:

More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

Testing Goodness Of Fit Of The Geometric Distribution: An Application To Human Fecundability Data

Testing Goodness Of Fit Of The Geometric Distribution: An Application To Human Fecundability Data Journal of Modern Applied Statistical Methods Volume 4 Issue Article 8 --5 Testing Goodness Of Fit Of The Geometric Distribution: An Application To Human Fecundability Data Sudhir R. Paul University of

More information

Lecture Slides. Elementary Statistics. by Mario F. Triola. and the Triola Statistics Series

Lecture Slides. Elementary Statistics. by Mario F. Triola. and the Triola Statistics Series Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 Chapter 13 Nonparametric Statistics 13-1 Overview 13-2 Sign Test 13-3 Wilcoxon Signed-Ranks

More information

Review of One-way Tables and SAS

Review of One-way Tables and SAS Stat 504, Lecture 7 1 Review of One-way Tables and SAS In-class exercises: Ex1, Ex2, and Ex3 from http://v8doc.sas.com/sashtml/proc/z0146708.htm To calculate p-value for a X 2 or G 2 in SAS: http://v8doc.sas.com/sashtml/lgref/z0245929.htmz0845409

More information

Lecture Slides. Section 13-1 Overview. Elementary Statistics Tenth Edition. Chapter 13 Nonparametric Statistics. by Mario F.

Lecture Slides. Section 13-1 Overview. Elementary Statistics Tenth Edition. Chapter 13 Nonparametric Statistics. by Mario F. Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 Chapter 13 Nonparametric Statistics 13-1 Overview 13-2 Sign Test 13-3 Wilcoxon Signed-Ranks

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Empirical Likelihood Methods for Pretest-Posttest Studies

Empirical Likelihood Methods for Pretest-Posttest Studies Empirical Likelihood Methods for Pretest-Posttest Studies by Min Chen A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in

More information

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution James V. Lambers Department of Mathematics The University of Southern Mississippi James V. Lambers Statistical Data Analysis

More information

Empirical Likelihood Methods for Sample Survey Data: An Overview

Empirical Likelihood Methods for Sample Survey Data: An Overview AUSTRIAN JOURNAL OF STATISTICS Volume 35 (2006), Number 2&3, 191 196 Empirical Likelihood Methods for Sample Survey Data: An Overview J. N. K. Rao Carleton University, Ottawa, Canada Abstract: The use

More information

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances Advances in Decision Sciences Volume 211, Article ID 74858, 8 pages doi:1.1155/211/74858 Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances David Allingham 1 andj.c.w.rayner

More information

BOOK REVIEW Sampling: Design and Analysis. Sharon L. Lohr. 2nd Edition, International Publication,

BOOK REVIEW Sampling: Design and Analysis. Sharon L. Lohr. 2nd Edition, International Publication, STATISTICS IN TRANSITION-new series, August 2011 223 STATISTICS IN TRANSITION-new series, August 2011 Vol. 12, No. 1, pp. 223 230 BOOK REVIEW Sampling: Design and Analysis. Sharon L. Lohr. 2nd Edition,

More information

Uniformly Most Powerful Bayesian Tests and Standards for Statistical Evidence

Uniformly Most Powerful Bayesian Tests and Standards for Statistical Evidence Uniformly Most Powerful Bayesian Tests and Standards for Statistical Evidence Valen E. Johnson Texas A&M University February 27, 2014 Valen E. Johnson Texas A&M University Uniformly most powerful Bayes

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 20: Epistasis and Alternative Tests in GWAS Jason Mezey jgm45@cornell.edu April 16, 2016 (Th) 8:40-9:55 None Announcements Summary

More information

Probability and Statistics Notes

Probability and Statistics Notes Probability and Statistics Notes Chapter Seven Jesse Crawford Department of Mathematics Tarleton State University Spring 2011 (Tarleton State University) Chapter Seven Notes Spring 2011 1 / 42 Outline

More information

An Approximate Test for Homogeneity of Correlated Correlation Coefficients

An Approximate Test for Homogeneity of Correlated Correlation Coefficients Quality & Quantity 37: 99 110, 2003. 2003 Kluwer Academic Publishers. Printed in the Netherlands. 99 Research Note An Approximate Test for Homogeneity of Correlated Correlation Coefficients TRIVELLORE

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

The outline for Unit 3

The outline for Unit 3 The outline for Unit 3 Unit 1. Introduction: The regression model. Unit 2. Estimation principles. Unit 3: Hypothesis testing principles. 3.1 Wald test. 3.2 Lagrange Multiplier. 3.3 Likelihood Ratio Test.

More information

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance CESIS Electronic Working Paper Series Paper No. 223 A Bootstrap Test for Causality with Endogenous Lag Length Choice - theory and application in finance R. Scott Hacker and Abdulnasser Hatemi-J April 200

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE THE ROYAL STATISTICAL SOCIETY 004 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE PAPER II STATISTICAL METHODS The Society provides these solutions to assist candidates preparing for the examinations in future

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

ML Testing (Likelihood Ratio Testing) for non-gaussian models

ML Testing (Likelihood Ratio Testing) for non-gaussian models ML Testing (Likelihood Ratio Testing) for non-gaussian models Surya Tokdar ML test in a slightly different form Model X f (x θ), θ Θ. Hypothesist H 0 : θ Θ 0 Good set: B c (x) = {θ : l x (θ) max θ Θ l

More information

ADJUSTED POWER ESTIMATES IN. Ji Zhang. Biostatistics and Research Data Systems. Merck Research Laboratories. Rahway, NJ

ADJUSTED POWER ESTIMATES IN. Ji Zhang. Biostatistics and Research Data Systems. Merck Research Laboratories. Rahway, NJ ADJUSTED POWER ESTIMATES IN MONTE CARLO EXPERIMENTS Ji Zhang Biostatistics and Research Data Systems Merck Research Laboratories Rahway, NJ 07065-0914 and Dennis D. Boos Department of Statistics, North

More information

Likelihood-based inference with missing data under missing-at-random

Likelihood-based inference with missing data under missing-at-random Likelihood-based inference with missing data under missing-at-random Jae-kwang Kim Joint work with Shu Yang Department of Statistics, Iowa State University May 4, 014 Outline 1. Introduction. Parametric

More information

Hypothesis Testing For Multilayer Network Data

Hypothesis Testing For Multilayer Network Data Hypothesis Testing For Multilayer Network Data Jun Li Dept of Mathematics and Statistics, Boston University Joint work with Eric Kolaczyk Outline Background and Motivation Geometric structure of multilayer

More information