Two-way contingency tables for complex sampling schemes

Biometrika (1976), 63, 2, pp. 271-6
Printed in Great Britain

BY J. J. SHUSTER
Department of Statistics, University of Florida, Gainesville

AND D. J. DOWNING
Department of Statistics, Marquette University, Milwaukee

SUMMARY

Methods for testing independence, quasi-independence, and marginal symmetry in contingency tables are derived for a wide variety of sampling schemes including stratified multistage cluster sampling. The null hypothesis is a vector of linear and quadratic contrasts involving the probabilities. The asymptotic null distribution of the test statistic is chi-squared. The theory can also be used to test equality of failure distributions in a stratified prospective multiclinic trial, jointly for all strata or merely population-wide.

Some key words: Censored survival data; Clinical trial; Cluster design; Contingency table; Independence; Marginal symmetry; Mover-stayer model; Quasi-independence.

1. INTRODUCTION

In many sample survey problems, we wish to test hypotheses in a two-way contingency table. The data are usually collected by complex sampling schemes (Kish & Frankel, 1974), rather than by simple random sampling. The usual multinomial approximations of the classical contingency table analyses may lead to misleading values of the chi-squared statistics. Such methods are insensitive to dependence among sampling units. Although many techniques are available for data analysis when simple random sampling is employed (Cox, 1970; Gart, 1972; Goodman, 1970; Ku & Kullback, 1974; Zelen, 1971), very little beyond the 2 x 2 table has been published on extensions of these techniques to complex sampling schemes. Gart (1971), Kish & Hess (1959) and Miettinen (1969, 1970) consider the 2 x 2 table in some nonstandard sampling situations.

From time to time, applied scientists use simple random sampling analyses, but note that clustering affects the validity of their conclusions (Kessner, Snow & Singer, 1974; Tietze & Lewit, 1974).

We propose statistical methods which can be used to test vectors of linear and quadratic hypotheses involving a vector of probabilities. Section 2 presents the general asymptotic theory of our test statistics. Section 3 is devoted to examples of the types of statistical hypotheses that can be tested. These include tests of independence, quasi-independence (Goodman, 1968), marginal symmetry (Stuart, 1955), mover-stayer model (Goodman, 1961), equality of survival distributions under various treatments, and simultaneous independence of row and column effects for all strata. In § 4, we derive estimators, which can be used to implement the methods described in §§ 2 and 3, for complex sampling situations, including stratified multistage cluster sampling and stratified multiclinic censored survival experiments.

2. GENERAL ASYMPTOTIC THEORY FOR TESTING LINEAR AND QUADRATIC HYPOTHESES IN COMPLEX SAMPLING SITUATIONS

In this section we shall develop the general asymptotic theory required for a wide variety of sampling situations. Some typical applications of these methods are presented in §§ 3 and 4.

Let $\pi = (P_1, \ldots, P_L)'$ be an arbitrary vector of probabilities, not necessarily summing to any specified value, and let $\{\hat\pi_n\}$ be a sequence of estimators of $\pi$ such that $n^{1/2}(\hat\pi_n - \pi)$ tends in law to $N(0, V)$ as $n \to \infty$. The covariance matrix $V$ need not be of full rank, but we shall assume that all linear constraints in $\pi$ are satisfied by $\hat\pi_n$. Let

$$U(\pi) = A\pi - (\pi' B_1 \pi, \ldots, \pi' B_m \pi)', \tag{2.1}$$

where $A$ is a specified $m \times L$ matrix, and the $B_i$ are specified $L \times L$ matrices. The following theorem provides a test for the hypothesis

$$H_0: U(\pi) = 0. \tag{2.2}$$

THEOREM 2.1. Let

$$G(\pi) = A - Q, \tag{2.3}$$

where the $i$th row of $Q$ is $Q_i = \pi'(B_i + B_i')$ $(i = 1, \ldots, m)$. Furthermore, let $\hat\Sigma_n$ converge in probability to $V$. Then, as $n \to \infty$, $n^{1/2}\{U(\hat\pi_n) - U(\pi)\}$ converges to $N\{0, G(\pi) V G(\pi)'\}$; and if $G(\pi) V G(\pi)'$ is invertible,

$$n\{U(\hat\pi_n) - U(\pi)\}' \{G(\hat\pi_n)\hat\Sigma_n G(\hat\pi_n)'\}^{-1} \{U(\hat\pi_n) - U(\pi)\} \tag{2.4}$$

converges to $\chi^2_m$. The quantities $U(\cdot)$ and $G(\cdot)$ are given in (2.1) and (2.3) respectively.

Under the null hypothesis (2.2), the statistic (2.4) has no unknown parameters. The proof follows methods of Wald (1943) and is, therefore, omitted. Note that by setting the $B$ matrices equal to zero, $U(\pi)$ is linear.

3. EXAMPLES OF LINEAR AND QUADRATIC HYPOTHESES IN TWO-WAY TABLES

We shall write $\pi$ for the $R \times C$ table as

$$\pi = (P_1, \ldots, P_{RC})' = (\pi_{11}, \ldots, \pi_{1C}, \pi_{21}, \ldots, \pi_{2C}, \ldots, \pi_{R1}, \ldots, \pi_{RC})'. \tag{3.1}$$

Example 3.1. The classical tests of complete independence in a two-way table. For the case of testing homogeneity of several multinomial distributions, $U(\pi)$ is linear, while for the case of complete independence in a two-way table, $U(\pi)$ is quadratic. In either case, $U(\pi)$ is an $(R-1)(C-1)$ vector of contrasts, leading to $(R-1)(C-1)$ degrees of freedom.

Example 3.2. Test for marginal symmetry, i.e. matched control studies. Here $\pi$ is as in (3.1) with $R = C$. The hypothesis of interest is

$$H_0: \sum_j (\pi_{ij} - \pi_{ji}) = 0 \quad (1 \le i \le R-1), \tag{3.2}$$

when $\pi$ is a probability distribution. This is linear in the $P_i$'s. The number of degrees of freedom, $m$, is $(R - 1)$.
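The statistic of Theorem 2.1 is straightforward to compute once an estimate of $\pi$ and a consistent estimate of $V$ are available. The following is a minimal sketch, not the authors' program: the function names are hypothetical, $\pi$ is stored row by row as in (3.1), and the multinomial covariance used in the illustration assumes simple random sampling purely to obtain a concrete matrix. Under a complex design that matrix would be replaced by a design-based estimate such as those of § 4.

```python
import numpy as np

def u_and_g(pi, A, Bs):
    """U(pi) = A pi - (pi' B_i pi)_i and G(pi) = A - Q, with Q_i = pi'(B_i + B_i')."""
    quad = np.array([pi @ B @ pi for B in Bs])
    Q = np.array([pi @ (B + B.T) for B in Bs])
    return A @ pi - quad, A - Q

def wald_statistic(pi_hat, sigma_hat, n, A, Bs):
    """n U' (G Sigma G')^{-1} U at the estimates, taking U(pi) = 0 under H0."""
    U, G = u_and_g(pi_hat, A, Bs)
    middle = G @ sigma_hat @ G.T          # m x m, assumed invertible
    return float(n * U @ np.linalg.solve(middle, U))

def marginal_symmetry_design(R):
    """A and zero B_i for H0: sum_j (pi_ij - pi_ji) = 0, i = 1, ..., R-1."""
    A = np.zeros((R - 1, R * R))
    for i in range(R - 1):
        for j in range(R):
            A[i, i * R + j] += 1.0        # + pi_ij
            A[i, j * R + i] -= 1.0        # - pi_ji (the diagonal term cancels)
    Bs = [np.zeros((R * R, R * R)) for _ in range(R - 1)]
    return A, Bs

def multinomial_cov(p):
    """Limiting covariance of n^{1/2}(p_hat - p) under simple random sampling.
    Assumption for illustration only; not valid for clustered designs."""
    return np.diag(p) - np.outer(p, p)

# Hypothetical 3 x 3 table of counts.
counts = np.array([[20.0, 14.0, 5.0],
                   [8.0, 30.0, 9.0],
                   [6.0, 7.0, 25.0]])
n = counts.sum()
pi_hat = counts.ravel() / n
A, Bs = marginal_symmetry_design(3)
stat = wald_statistic(pi_hat, multinomial_cov(pi_hat), n, A, Bs)
m = A.shape[0]                            # degrees of freedom, here R - 1 = 2
```

The value of `stat` would be referred to a chi-squared distribution with `m` degrees of freedom. A quadratic hypothesis is handled by passing non-zero `Bs`; for instance, independence in a 2 x 2 table corresponds to a zero `A` and a single `B` chosen so that $-\pi' B \pi = \pi_{11}\pi_{22} - \pi_{12}\pi_{21}$.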

Example 3.3. Test for quasi-independence, $\pi$ as in (3.1) over the set $S$ of $(i, j)$ values. The hypothesis of interest is

[display (3.3) is illegible in the source]

for all $i$, $i'$ and $j \in S_i \cap S_{i'}$, where $S_i = \{j : (i, j) \in S\}$, when $\pi$ is a probability distribution (Goodman, 1968, equations (1.2), (1.3), (1.5)). While this is quadratic in the $P_i$'s, $H_0$ must be reduced so that $G(\pi) V G(\pi)'$ is invertible. Although the mechanics of such reductions are routine for any given situation, the general notation is awkward. Hence we do not include it here.

Example 3.4. Test for a mover-stayer model (Goodman, 1961) in the $R \times R$ situation. The hypothesis of interest is that, given that a move is made from any 'parent classification', $i$, are the probabilities of the 'offspring classification' the same as that of the population-wide probability of being in that category, given one is not in classification $i$? If $\pi$ is the probability distribution of all moves, not the transition matrix, then the hypothesis of interest is

[display (3.4) is illegible in the source]

for all $(i, j)$ such that $i \ne j$. It can be shown that (3.4) can be described by $R(R - 2)$ functionally independent equations. Hence $m = R(R - 2)$.

Example 3.5. Simultaneous test of independence of row and column effects for all strata. Here there are probabilities $\theta_{ijk}$ $(1 \le i \le R,\ 1 \le j \le C,\ 1 \le k \le L)$, with $\sum_i \sum_j \sum_k \theta_{ijk} = 1$. The hypothesis of interest is

[display (3.5) is illegible in the source]

for all $i$, $j$ and $k$. Then $m = L(R - 1)(C - 1)$.

Example 3.6. Test for equality of survival distributions under various treatments. Each $\pi_{ij}$ in (3.1) represents the probability that an individual using treatment $i$ survives $j$ periods after initiation of treatment. The hypothesis of interest is

$$H_0: \pi_{ij} = \pi_{i+1,j} \quad (1 \le i \le R-1,\ 1 \le j \le C). \tag{3.6}$$

This is linear in the $P_i$'s. Here $m = (R - 1)C$.

4. DERIVATION OF $\hat\pi_n$ AND $\hat V_n$ OR $\hat\Sigma_n$ IN COMPLEX SAMPLING SITUATIONS

Example 4.1. Two-stage cluster sampling.

(a) First stage. The population is subdivided into $N$ primary clusters. A simple random sample of $n$ clusters is drawn with replacement.

(b) Second stage. Any sampling scheme that yields unbiased estimates of the $L$ cluster totals may be used. The method must be invariant under changes in the order in which primary units are drawn. Hence, once a cluster is sampled, we shall restore it to its original form. Typical methods of sampling within primary clusters include: simple random sampling, stratified sampling, single- or two-stage cluster sampling, etc. Sample sizes are arbitrary and may be taken with or without replacement.
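As a small illustration of the second-stage requirement, the sketch below forms unbiased estimates of the $L$ category totals of one sampled cluster when the second stage is a simple random sample without replacement. The function name, the input layout (one category label per element) and the tiny cluster are hypothetical and serve only to make the idea concrete.

```python
import numpy as np

def estimate_cluster_totals(labels, L, m, rng):
    """Unbiased estimates of the L category totals of one sampled cluster.

    labels: category (0..L-1) of every element in the cluster (hypothetical input).
    m: second-stage sample size, drawn by simple random sampling without replacement.
    """
    M = len(labels)                                   # cluster size
    sample = rng.choice(labels, size=m, replace=False)
    fractions = np.bincount(sample, minlength=L) / m  # sample fraction per category
    return M * fractions                              # expand to cluster totals

rng = np.random.default_rng(0)
cluster = np.array([0, 0, 1, 2, 2, 2, 1, 0])          # true totals are 3, 2, 3
t_hat = estimate_cluster_totals(cluster, L=3, m=4, rng=rng)
```

Because each sample fraction is unbiased for the corresponding cluster fraction, multiplying by the cluster size gives an unbiased estimate of each total, and the estimates always add up to the cluster size.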

Let $\hat T_{ij}$ be the estimated total in the $j$th category of the $i$th primary cluster. Under our sampling conditions, the vectors $(\hat T_{i1}, \ldots, \hat T_{iL})$ $(i = 1, \ldots, n)$ are independent and identically distributed. The vector $\pi = (P_1, \ldots, P_L)'$ can be estimated by

[display is illegible in the source]

where '.' refers to summing over the variable of summation. By the law of large numbers and Taylor's Theorem, we have

[display (4.1) is only partly legible in the source; its right-hand side involves the vector $(\hat T_{\cdot 1} - P_1 \hat T_{\cdot\cdot}, \ldots, \hat T_{\cdot L} - P_L \hat T_{\cdot\cdot})$]

where $\sim$ symbolizes the fact that the ratio of any linear function of the left-hand side divided by the corresponding linear function of the right-hand side converges to one in probability, whenever such linear functions are not trivially zero. By the Central Limit Theorem, the $\hat T_{\cdot}$ vector has an asymptotic multivariate normal distribution, and hence so has $n^{1/2}(\hat\pi_n - \pi)$ by the relation (4.1). Let $\hat V_n$ be the $L \times L$ matrix whose entries are given by

[display (4.2) and the definition that follows it are illegible in the source]

By the law of large numbers $\hat V_n$ converges in probability to $V$, the covariance matrix in the limiting distribution of $n^{1/2}(\hat\pi_n - \pi)$.

For stratified random sampling within clusters, with sample size functionally related to strata sizes within the clusters, we use

$$\hat T_{ij} = \sum_k M_{ik} \bar Y_{ijk} \quad (1 \le i \le n,\ 1 \le j \le L),$$

where $M_{ik}$ is the number of elements in the entire $i$th cluster and $k$th stratum, and $\bar Y_{ijk}$ is the fraction of elements sampled in the $k$th stratum of the $i$th cluster that fall in the $j$th category. If $M_{ik} = 0$, we take $\bar Y_{ijk} = 0$. If $M_{ik} \ne 0$, at least one element is assumed to be sampled from stratum $k$.

Example 4.2. Extensions of Example 4.1.

(a) Example 4.1 can be extended to sampling of primary clusters without replacement, provided that our large sample size of primary clusters is a small fraction of the population size.

(b) The primary clusters themselves may be drawn from a stratified random sample, rather than a simple random sample. If $W_t$ is the $t$th stratum weight, $n_t$ is the number of clusters sampled in stratum $t$, $\hat\pi_n(t)$ is the $\hat\pi_n$ for stratum $t$, $\hat V_n(t)$ is the $\hat V_n$ for stratum $t$, and $n_t/n \to \lambda_t$ as $n \to \infty$, with $n = \sum_t n_t$, then $\hat\pi_n = \sum_t W_t \hat\pi_n(t)$ and $\hat\Sigma_n = \sum_t W_t^2 \hat V_n(t)/\lambda_t$.

(c) In single-stage cluster sampling, where the primary clusters chosen are exhaustively investigated, then (Madow, 1948) the results of Example 4.1 can be applied to sampling without replacement using the same $\hat\pi_n$ and $\hat\Sigma_n = (1 - f)\hat V_n$, where $\hat V_n$ has entries given in (4.2) and $f = \lim(n/N)$.

Example 4.3. Multiclinic prospective trials with censored data. Suppose that $H$ clinics combine to run a prospective clinical trial. Conditional on the fact that each clinic has surviving patients in all treatment groups in all time periods, we can make an overall inference as to the survival distributions.

Let $W_h$ be the fraction of patients who would be treated in the $h$th clinic, given that they would be treated in one of the $H$ clinics. Using the life table methods for survival data of Cutler & Ederer (1958), each clinic obtains independent estimates of their set $\theta_{ijh}$, the probabilities of surviving $j$ periods under treatment $i$, at clinic $h$. The variance-covariance structure of each clinic's estimator is obtained by Greenwood's formula. These estimates are respectively denoted by $\hat\theta_{ijh}$ and $\hat V_{n_h}(h)$, where $n_h$ is the number of patients in clinic $h$ who entered the trial. Let $n = \sum_h n_h \to \infty$ in such a way that $n_h/n \to \lambda_h$. Then

[display is illegible in the source]

gives the components of $\hat\pi_n$, and

[display is illegible in the source]

The framework of such inferences applies only to the $H$ clinics, since the above is a fixed effect analysis.

Example 4.4. Extension of Example 4.3. Should the trial be run as a stratified trial within clinics, Example 4.3 extends in an obvious way. We can test equality of survival distributions, or equality of survival distributions within all strata. The latter is a much more stringent condition. An account of stratified trials is given by Zelen (1974). Again, the inference applies only to the $H$ clinics.

5. CONCLUDING DISCUSSION

Often, in the social science and medical literature, Pearson chi-squared has been used, rather than a technique of the type described in this article. In the following example the significance level achieved by Pearson's chi-squared is larger than the nominal level of our test.

A hypothetical medical experiment is run as follows. A simple random sample of 40 patients is drawn from a target population. Each patient provides 4 muscle specimens. Each muscle specimen is cut into three pieces. Each of the three pieces is randomly assigned to one of treatments A, B and C such that each treatment is used once on each specimen. The measured response is quantal, i.e. all or none. We treat subjects as primary clusters, specimens as a simple random sample of secondary clusters, and treatment assignment as a sample of size one of the six possible assignments. The hypothesis of interest is that the treatments are equivalent. The test is given in Example 3.1 and the distribution theory in Example 4.1.

Note that, so long as the subject-treatment interaction is not too large, and substantial variability between subjects' positive response probabilities exists, then the estimates of positive response probabilities for each treatment would be highly positively correlated. The naive assumption that we have three independent binomial samples of 160 assumes these correlations to be zero. The clustering tends to keep the relative frequencies closer together than would the three independent binomials. Hence, Pearson's chi-squared is conservative in this example.

The methods described in this paper are easy to use. The only computational requirement is an accurate matrix inversion subroutine.

The authors wish to acknowledge the help of the referees and editor. In addition, we thank Professor Marvin Zelen for his helpful discussion of the comparison of survival distributions.
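The hypothetical muscle-specimen experiment can be imitated numerically. The sketch below is an illustration under stated assumptions, not the authors' computation: subject-level response probabilities are drawn from a Beta(2, 2) distribution (an arbitrary choice), responses to the three treatments are independent given the subject, and the covariance estimate is the usual linearization for a ratio of cluster totals with clusters drawn with replacement, which is assumed here rather than taken from the displays of § 4. The six cells are ordered (A, negative), (A, positive), (B, negative), (B, positive), (C, negative), (C, positive).

```python
import numpy as np

def cluster_estimates(T):
    """pi_hat and a linearization estimate of V from an n x L array of cluster totals."""
    n = T.shape[0]
    grand = T.sum()
    pi_hat = T.sum(axis=0) / grand
    # Residual of each cluster: T_ij - pi_hat_j * (cluster total T_i.)
    resid = T - np.outer(T.sum(axis=1), pi_hat)
    V_hat = n * resid.T @ resid / grand**2
    return pi_hat, V_hat

rng = np.random.default_rng(1)
n_subjects, n_specimens = 40, 4
p_subject = rng.beta(2.0, 2.0, size=n_subjects)       # assumed subject heterogeneity
# Positive responses per subject and treatment; H0 (equivalent treatments) holds.
positives = rng.binomial(n_specimens, p_subject[:, None], size=(n_subjects, 3))

T = np.empty((n_subjects, 6))
T[:, 1::2] = positives                                # positive cells
T[:, 0::2] = n_specimens - positives                  # negative cells

pi_hat, V_hat = cluster_estimates(T)

# Linear contrasts: pi_A+ = pi_B+ and pi_B+ = pi_C+ (treatment shares are all 1/3).
A = np.array([[0, 1, 0, -1, 0, 0],
              [0, 0, 0, 1, 0, -1]], dtype=float)
U = A @ pi_hat
wald = float(n_subjects * U @ np.linalg.solve(A @ V_hat @ A.T, U))

# Naive Pearson chi-squared on the pooled 3 x 2 table of 160 pieces per treatment.
table = T.sum(axis=0).reshape(3, 2)
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
pearson = float(((table - expected) ** 2 / expected).sum())
```

Both `wald` and `pearson` would be referred to a chi-squared distribution with 2 degrees of freedom, whose upper tail probability at $x$ is $\exp(-x/2)$. Repeating the simulation over many seeds is one way to examine how the rejection rates of the two statistics compare under this assumed data-generating model.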

REFERENCES

COX, D. R. (1970). The Analysis of Binary Data. London: Methuen.
CUTLER, S. J. & EDERER, F. (1958). Maximum utilization of the life table method in analyzing survival. J. Chron. Dis. 8, 699-712.
GART, J. J. (1971). The comparison of proportions: A review of significance tests, confidence intervals, and adjustments for stratification. Rev. Inst. Int. Statist. 39, 148-69.
GART, J. J. (1972). Interaction tests for 2 x s x t contingency tables. Biometrika 59, 309-16.
GOODMAN, L. A. (1961). Statistical methods for the mover-stayer model. J. Am. Statist. Assoc. 56, 841-68.
GOODMAN, L. A. (1968). The analysis of cross-classified data: independence, quasi-independence, and interactions in contingency tables with or without missing entries. J. Am. Statist. Assoc. 63, 1091-131.
GOODMAN, L. A. (1970). The multivariate analysis of qualitative data: Interactions among multiple classifications. J. Am. Statist. Assoc. 65, 226-66.
KESSNER, D., SNOW, C. & SINGER, J. (1974). Assessment of Medical Care for Children. Washington: National Academy of Sciences.
KISH, L. & FRANKEL, M. R. (1974). Inference from complex samples. J. R. Statist. Soc. B 36, 1-38.
KISH, L. & HESS, I. (1959). On variances of ratios and their differences in multistage sampling. J. Am. Statist. Assoc. 54, 416-46. Correction (1963), J. Am. Statist. Assoc. 58, 1162.
KU, H. H. & KULLBACK, S. (1974). Loglinear models in contingency table analysis. Am. Statistician 28, 116-25.
MADOW, W. G. (1948). On the limiting distribution of estimates based on samples from finite universes. Ann. Math. Statist. 19, 636-46.
MIETTINEN, O. S. (1969). Individual matching with multiple controls in the case of all or none responses. Biometrics 25, 339-55.
MIETTINEN, O. S. (1970). Estimation of relative risk from individually matched series. Biometrics 26, 75-86.
STUART, A. (1955). Test for homogeneity of the marginal distributions in a two-way classification. Biometrika 42, 412-6.
TIETZE, C. & LEWIT, S. (1974). Comparison of the Copper-T and Loop-D: A research report. Stud. Fam. Plann. 5, 277-8.
WALD, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Am. Math. Soc. 54, 426-82.
ZELEN, M. (1971). The analysis of several 2 x 2 contingency tables. Biometrika 58, 129-37.
ZELEN, M. (1974). The randomization and stratification of patients to clinical trials. J. Chron. Dis. 27, 366-75.

[Received August 1974. Revised December 1975]