Correcting a Significance Test for Clustering in Designs With Two Levels of Nesting

Institute for Policy Research, Northwestern University
Working Paper Series, WP-07-4

Correcting a Significance Test for Clustering in Designs With Two Levels of Nesting

Larry V. Hedges
Faculty Fellow, Institute for Policy Research
Board of Trustees Professor of Statistics and Social Policy, Northwestern University

DRAFT. Please do not quote or distribute without permission.

2040 Sheridan Rd., Evanston, IL 60208-4100. Tel: 847-491-3395. Fax: 847-491-9916. www.northwestern.edu/ipr, ipr@northwestern.edu

Abstract

A common mistake in analysis of cluster randomized experiments is to ignore the effect of clustering and analyze the data as if each treatment group were a simple random sample. This typically leads to an overstatement of the precision of results and anticonservative conclusions about precision and statistical significance of treatment effects. This paper gives a simple correction to the t-statistic that would be computed if clustering were (incorrectly) ignored in an experiment with two levels of nesting (e.g., classrooms and schools). The correction is a multiplicative factor depending on the number of clusters and subclusters, the subcluster sample size, the subcluster size, and the cluster and subcluster intraclass correlations $\rho_S$ and $\rho_C$. The corrected t-statistic has Student's t-distribution with reduced degrees of freedom. The corrected statistic reduces to the t-statistic computed by ignoring clustering when $\rho_S = \rho_C = 0$. It reduces to the t-statistic computed using cluster means when $\rho_S = 1$. If $\rho_S$ and $\rho_C$ are between 0 and 1, the adjusted t-statistic lies between these two and the degrees of freedom are in between those corresponding to these two extremes.

Note: This material is based upon work supported by the National Science Foundation under Grant No. 0129365 and IES under Grant No. R305U040003.

Experiments in educational research often assign entire intact groups (such as schools or classrooms) to the same treatment group, with different intact groups assigned to different treatments. Because these intact groups correspond to statistical clusters, this design is often called a group randomized or cluster randomized design. Several analysis strategies for cluster randomized trials are possible, but the simplest is to use the cluster as the unit of analysis. This analysis involves computing mean scores on the outcome (and all other variables that may be involved in the analysis) and carrying out the statistical analysis as if the cluster means were the data. If all cluster sample sizes are equal, this approach provides exact tests for the treatment effect, but more flexible and informative analyses are also available, including analyses of variance using clusters as a nested factor (see, e.g., Hopkins, 1982) and analyses involving hierarchical linear models (see, e.g., Raudenbush and Bryk, 2002). For general discussions of the design and analysis of cluster randomized experiments see Raudenbush and Bryk (2002), Donner and Klar (2000), Klar and Donner (2001), Murray (1998), or Murray, Varnell, & Blitstein (2004).

A common mistake in analysis of cluster randomized experiments in education is to analyze the data as if it were based on a simple random sample and assignment was carried out at the level of individuals. This typically leads to an overstatement of the precision of results and consequently to anti-conservative conclusions about precision and statistical significance of treatment effects (see, e.g., Murray, Hannan, and Baker, 1996). This analysis can also yield misleading estimates of effect sizes and incorrect estimates of their sampling uncertainty. If the raw data are available, then reanalysis using more appropriate analytic methods is usually desirable.
In some cases, however, the raw data are not available but one wants to be able to interpret the findings of a research report that improperly ignored clustering in the analysis. This problem often arises in reviewing the findings of studies carried out by other investigators. In particular, this problem has arisen in the work of the What Works Clearinghouse, a US Institute of Education Sciences funded project whose mission is to evaluate, compare, and synthesize evidence of effectiveness of educational programs, products, practices, and policies. The What Works Clearinghouse reviewers found that the majority of the high quality studies they were examining involved assignment of treatment by schools, which led to clustering that needed to be taken into account in assessing the uncertainty of the treatment effect (e.g., by computing confidence intervals) or in testing its statistical significance. While some of these studies sampled students directly within schools (at least roughly approximating a simple random sample within schools), most studies sampled students by first sampling classrooms within schools, and thus there is a second level of clustering (nesting) that may need to be taken into account. Moreover, most of the statistical analyses in these studies did not attempt to take clustering into account. In this context, it would be desirable to know how the conclusions about treatment effects might change if both levels of clustering were taken into account.

Another way to conceive the issue is in terms of survey sampling theory. In experiments that assign schools to treatments, treatment effects are just differences between independent treatment group means. The variance of the treatment group means depends on the sampling design. If students are sampled by first selecting schools, then selecting classrooms within schools, and then students within classrooms, the sampling design is a three-stage cluster sample with schools as clusters and classrooms as subclusters. Each stage of cluster sampling adds to the design effect (inflates the variance) of the treatment group mean. Ignoring these design effects (which is equivalent to assuming that the sampling design is a simple random sample of students from the total population) leads to an underestimate of the variance of the treatment group means and therefore an underestimate of the variance of the treatment effect.

Designs involving two levels of clustering are widespread in education (e.g., designs that assign schools with multiple classrooms within schools to treatments). While methods are available to adjust for the effects of one level of clustering on simple tests of significance (e.g., Hedges, in press), less is known about methods for taking two levels of clustering into account. Such methods are likely to have wide application in education for two reasons. The first reason is the increasing prevalence of educational experiments that assign treatments to schools in order to avoid cross contamination of different treatments within the same school. The second reason is the practical fact that, since students are nested within classrooms and classrooms are nested within schools, it is easier to sample students by using a multistage cluster sampling plan that first samples schools and then classrooms. Such designs are therefore widely used in quasi-experiments as well as experiments. Although we use the terms schools and classrooms to characterize the stages of clustering, this is merely a matter of convenience and readily understandable terminology.
The results of this paper apply equally well to any situation in which there is a three-stage sampling design [where individual units are sampled by first sampling clusters (e.g., schools), then sampling subclusters (e.g., classrooms) within the clusters, and finally sampling individual units (e.g., students) within the subclusters] and treatments are assigned to clusters (e.g., schools).

The purpose of this paper is to provide an analysis of the effects of two levels of nesting (clustering) on significance tests and confidence intervals for treatment effects. First we derive the sampling distribution of the t-statistic under a clustered sampling model with equal cluster sample sizes. Then we provide and evaluate some simpler approximate methods for adjusting significance tests for the effects of clustering. Next we consider whether acceptable corrections may be obtained by adjusting for only one of the levels of nesting. Then we provide a generalization for unequal cluster (and subcluster) sample sizes. This research provides a simple correction that may be applied to a statistical test that was computed (incorrectly) ignoring the clustering of individuals within groups. The correction requires that a bound on the amount of clustering (in the form of an upper bound on the intraclass correlation parameters) is known or that the intraclass correlation parameters can be imputed for sensitivity analysis. We then derive confidence intervals for the mean difference based on the corrected test statistic. Finally we consider the power of the corrected test.

Model and Notation

Let $Y_{ijk}^T$ ($i = 1, \dots, m^T$; $j = 1, \dots, p_i^T$; $k = 1, \dots, n_{ij}^T$) and $Y_{ijk}^C$ ($i = 1, \dots, m^C$; $j = 1, \dots, p_i^C$; $k = 1, \dots, n_{ij}^C$) be the $k$th observation in the $j$th classroom in the $i$th school in the treatment and control groups, respectively. Thus, in the treatment group, there are $m^T$ schools, the $i$th school has $p_i^T$ classrooms, and the $j$th classroom in the $i$th school has $n_{ij}^T$ observations. Similarly, in the control group, there are $m^C$ schools, the $i$th school has $p_i^C$ classrooms, and the $j$th classroom in the $i$th school has $n_{ij}^C$ observations. Thus there is a total of $M = m^T + m^C$ schools, a total of

$$P = \sum_{i=1}^{m^T} p_i^T + \sum_{i=1}^{m^C} p_i^C$$

classrooms, and a total of

$$N = N^T + N^C = \sum_{i=1}^{m^T} \sum_{j=1}^{p_i^T} n_{ij}^T + \sum_{i=1}^{m^C} \sum_{j=1}^{p_i^C} n_{ij}^C$$

observations overall.

Let $\bar{Y}_{i\bullet\bullet}^T$ ($i = 1, \dots, m^T$) and $\bar{Y}_{i\bullet\bullet}^C$ ($i = 1, \dots, m^C$) be the means of the $i$th school in the treatment and control groups, respectively, let $\bar{Y}_{ij\bullet}^T$ ($i = 1, \dots, m^T$; $j = 1, \dots, p_i^T$) and $\bar{Y}_{ij\bullet}^C$ ($i = 1, \dots, m^C$; $j = 1, \dots, p_i^C$) be the means of the $j$th class in the $i$th school in the treatment and control groups, respectively, and let $\bar{Y}_{\bullet\bullet\bullet}^T$ and $\bar{Y}_{\bullet\bullet\bullet}^C$ be the overall means in the treatment and control groups, respectively. Define the (pooled) within treatment groups variance $S^2$ via

$$S^2 = \frac{\sum_{i=1}^{m^T} \sum_{j=1}^{p_i^T} \sum_{k=1}^{n_{ij}^T} \left(Y_{ijk}^T - \bar{Y}_{\bullet\bullet\bullet}^T\right)^2 + \sum_{i=1}^{m^C} \sum_{j=1}^{p_i^C} \sum_{k=1}^{n_{ij}^C} \left(Y_{ijk}^C - \bar{Y}_{\bullet\bullet\bullet}^C\right)^2}{N - 2}. \tag{1}$$

Suppose that observations within the $j$th subcluster (classroom) in the $i$th cluster (school) within the treatment and control groups are normally distributed about subcluster (classroom) means $\mu_{ij}^T$ and $\mu_{ij}^C$ with a common within-subcluster variance $\sigma_W^2$. That is,

$$Y_{ijk}^T \sim N(\mu_{ij}^T, \sigma_W^2), \quad i = 1, \dots, m^T;\ j = 1, \dots, p_i^T;\ k = 1, \dots, n_{ij}^T$$

and

$$Y_{ijk}^C \sim N(\mu_{ij}^C, \sigma_W^2), \quad i = 1, \dots, m^C;\ j = 1, \dots, p_i^C;\ k = 1, \dots, n_{ij}^C. \tag{2}$$

Suppose further that the subcluster (classroom) means are random effects (for example, they are considered a sample from a population of means) so that the class means themselves have a normal distribution about the school means $\mu_{i\bullet}^T$ and $\mu_{i\bullet}^C$ with common variance $\sigma_{BC}^2$. That is,

$$\mu_{ij}^T \sim N(\mu_{i\bullet}^T, \sigma_{BC}^2), \quad i = 1, \dots, m^T;\ j = 1, \dots, p_i^T$$

and

$$\mu_{ij}^C \sim N(\mu_{i\bullet}^C, \sigma_{BC}^2), \quad i = 1, \dots, m^C;\ j = 1, \dots, p_i^C. \tag{3}$$

Finally, suppose that the cluster (school) means $\mu_{i\bullet}^T$ and $\mu_{i\bullet}^C$ are also normally distributed about the treatment and control group means $\mu_{\bullet\bullet}^T$ and $\mu_{\bullet\bullet}^C$ with common variance $\sigma_{BS}^2$. That is,

$$\mu_{i\bullet}^T \sim N(\mu_{\bullet\bullet}^T, \sigma_{BS}^2), \quad i = 1, \dots, m^T$$

and

$$\mu_{i\bullet}^C \sim N(\mu_{\bullet\bullet}^C, \sigma_{BS}^2), \quad i = 1, \dots, m^C. \tag{4}$$

Note that in this formulation, $\sigma_{BC}^2$ represents true variation of the population means of classrooms over and above the variation in sample means that would be expected from variation in the sampling of observations into classrooms. Similarly, $\sigma_{BS}^2$ represents the true variation in school means, over and above the variation in sample means that would be expected from variation due to the sampling of observations into schools.

These assumptions correspond to the usual assumptions that would be made in the analysis of a multi-site trial by a three-level hierarchical linear model analysis, an analysis of variance (with treatment as a fixed effect and schools and classrooms as nested random effects), or a t-test using the school means in the treatment and control groups as the unit of analysis.

Intraclass Correlations

In principle there are several different within-treatment group variances in a design with two levels of nesting (a three-level design). We have already defined the within-classroom, between-classroom, and between-school variances $\sigma_W^2$, $\sigma_{BC}^2$, and $\sigma_{BS}^2$. There is also the total variance within treatment groups $\sigma_T^2$, defined via

$$\sigma_T^2 = \sigma_{BS}^2 + \sigma_{BC}^2 + \sigma_W^2. \tag{5}$$

In most educational achievement data when clusters are schools and subclusters are classrooms, $\sigma_{BS}^2$ and $\sigma_{BC}^2$ are considerably smaller than $\sigma_W^2$. Obviously, if the between-school and between-classroom variances $\sigma_{BS}^2$ and $\sigma_{BC}^2$ are small, then $\sigma_T^2$ will be very similar to $\sigma_W^2$.

In two-level models (e.g., those with schools and students as levels), the relation between variances associated with the two levels is characterized by an index called the intraclass correlation. In three-level models, two indices are necessary to characterize the relationship between these variances, and they are generalizations of the intraclass correlation. Define the school-level intraclass correlation $\rho_S$ by

$$\rho_S = \frac{\sigma_{BS}^2}{\sigma_{BS}^2 + \sigma_{BC}^2 + \sigma_W^2} = \frac{\sigma_{BS}^2}{\sigma_T^2}. \tag{6}$$

Similarly, define the classroom-level intraclass correlation $\rho_C$ by

$$\rho_C = \frac{\sigma_{BC}^2}{\sigma_{BS}^2 + \sigma_{BC}^2 + \sigma_W^2} = \frac{\sigma_{BC}^2}{\sigma_T^2}. \tag{7}$$

These intraclass correlations can be used to obtain one of these variances from any of the others, since $\sigma_{BS}^2 = \rho_S \sigma_T^2$, $\sigma_{BC}^2 = \rho_C \sigma_T^2$, and $\sigma_W^2 = (1 - \rho_S - \rho_C)\sigma_T^2$.

Hypothesis Testing

The object of the statistical analysis may be to test the statistical significance of the intervention effect, that is, to test the hypothesis of no treatment effect

$$H_0: \mu_{\bullet\bullet}^T = \mu_{\bullet\bullet}^C.$$

The Test Statistic Ignoring Clustering

Suppose that the researcher wishes to test the hypothesis and carries out the usual t- or F-test. The t-test involves computing the test statistic

$$t = \frac{\sqrt{\tilde{N}}\left(\bar{Y}_{\bullet\bullet\bullet}^T - \bar{Y}_{\bullet\bullet\bullet}^C\right)}{S}, \tag{8}$$

where $S$ is the usual pooled within treatment group standard deviation defined in (1) and

$$\tilde{N} = \frac{N^T N^C}{N^T + N^C}.$$

The F-test statistic from a one-way analysis of variance ignoring clustering is of course $F = t^2$. If there is no clustering (that is, if $\rho_S = \rho_C = 0$), the test statistic $t$ has Student's t-distribution with $N - 2$ degrees of freedom when the null hypothesis is true. If there is clustering (that is, if either $\rho_S \neq 0$ or $\rho_C \neq 0$), the test statistic has a different sampling distribution, one that depends on $\rho_S$ and $\rho_C$.

Note that this t-test (or the corresponding F-test) would not be computed if the analyst were properly addressing the clustered nature of the sample. As we noted above, other analyses that would be appropriate include analyses that include the clusters and subclusters as factors nested within treatments, analyses that use a hierarchical linear model including subclusters and clusters as level 2 and level 3 units, or analyses that use cluster means as the units of analysis. However, the objective of this paper is not to examine these analyses but to examine the effects of using (8) as a test statistic when the sample is a clustered sample.

When there is no clustering (that is, when $\rho_S = \rho_C = 0$), the numerator of (8) has a normal distribution with standard deviation $\sigma_T$. In other words, when the null hypothesis is true, $\sqrt{\tilde{N}}\left(\bar{Y}_{\bullet\bullet\bullet}^T - \bar{Y}_{\bullet\bullet\bullet}^C\right)/\sigma_T$ has the standard normal distribution. Similarly, when there is no clustering, $(N - 2)S^2/\sigma_T^2$ is distributed as a chi-square with $(N - 2)$ degrees of freedom, so that $S^2$ is distributed as $\sigma_T^2/(N - 2)$ times a chi-square with $(N - 2)$ degrees of freedom. In other words, $S/\sigma_T$ is distributed as the square root of a chi-square with $(N - 2)$ degrees of freedom divided by its degrees of freedom. Note that the scale factor $\sigma_T$, which occurs in both the numerator and the denominator, cancels so that the ratio, $t$, is scale free.
Because the numerator has the standard normal distribution and the denominator is the square root of the ratio of a chi-square with $(N - 2)$ degrees of freedom to its degrees of freedom that is independent of the numerator, the ratio in (8) has (by definition) Student's t-distribution with $(N - 2)$ degrees of freedom.

The Impact of Clustering

When there is clustering (either $\rho_S \neq 0$ or $\rho_C \neq 0$), neither the numerator nor the denominator of the t-statistic given in (8) has the same distribution as it does when $\rho_S = \rho_C = 0$. We now indicate how the distributions of the numerator and denominator differ when $\rho_S \neq 0$ or $\rho_C \neq 0$ in the balanced design where the cluster sample sizes $p_i^T$ and $p_i^C$ are all equal to $p$ and the subcluster sample sizes $n_{ij}^T$ and $n_{ij}^C$ are all equal to $n$.

Assuming that the design is balanced, the numerator has a normal distribution with mean 0, but with a generally larger variance: $\sigma_T^2[1 + (pn - 1)\rho_S + (n - 1)\rho_C]$. The factor $[1 + (pn - 1)\rho_S + (n - 1)\rho_C]$ is a generalization of Kish's (1965) design effect for two levels of nesting. In other words, when $\rho_S \neq 0$ or $\rho_C \neq 0$ and the null hypothesis is true,

$$\frac{\sqrt{\tilde{N}}\left(\bar{Y}_{\bullet\bullet\bullet}^T - \bar{Y}_{\bullet\bullet\bullet}^C\right)}{\sigma_T \sqrt{1 + (pn - 1)\rho_S + (n - 1)\rho_C}}$$

has the standard normal distribution.

Assuming a balanced design, the expected value of $S^2$ is no longer $\sigma_T^2$, but instead

$$E\{S^2\} = \sigma_W^2 + \frac{N - 2pn}{N - 2}\,\sigma_{BS}^2 + \frac{N - 2n}{N - 2}\,\sigma_{BC}^2 = \sigma_T^2 \left[1 - \frac{2(pn - 1)\rho_S + 2(n - 1)\rho_C}{N - 2}\right].$$

Thus the scale factor necessary to standardize $S$ is not $\sigma_T$. We show in the Appendix that

$$\frac{h S^2}{\sigma_T^2 \left[1 - \dfrac{2(pn - 1)\rho_S + 2(n - 1)\rho_C}{N - 2}\right]}$$

has, to an excellent approximation, the chi-square distribution with $h$ degrees of freedom, where

$$h = \frac{\left[(N - 2) - 2(pn - 1)\rho_S - 2(n - 1)\rho_C\right]^2}{pn\hat{N}\rho_S^2 + n\breve{N}\rho_C^2 + (N - 2)\bar{\rho}^2 + 2n\hat{N}\rho_S\rho_C + 2\hat{N}\rho_S\bar{\rho} + 2\breve{N}\rho_C\bar{\rho}}, \tag{9}$$

where $\hat{N} = (N - 2pn)$, $\breve{N} = (N - 2n)$, and $\bar{\rho} = 1 - \rho_S - \rho_C$.

Taking the partial derivative of $h$ with respect to $\rho_S$ or $\rho_C$, we see that $h$ is a decreasing function of $\rho_S$ and $\rho_C$. If $\rho_S = \rho_C = 0$ and there is no clustering, $h = (N - 2)$ and $S^2$ has the nominal degrees of freedom, as expected. If $\rho_S = 1$ (so that $\rho_C = 0$) and there is complete clustering by school (no variability within clusters), then $h = (M - 2)$, as expected (because the only variability is that between the $M$ clusters). If $\rho_C = 1$ (so that $\rho_S = 0$) and there is complete clustering by classroom (no variability within subclusters or between clusters), then $h = (Mp - 2)$, as expected (because the only variability is that between the $Mp$ subclusters). If $0 < \rho_S < 1$ and $0 < \rho_C < 1$, then $h$ is between $(M - 2)$ and $(N - 2)$ and its value reflects the effective degrees of freedom in $S^2$.

These results imply that when either $\rho_S \neq 0$ or $\rho_C \neq 0$, $S/\sigma_T$ is no longer distributed as the square root of a chi-square with $(N - 2)$ degrees of freedom divided by its degrees of freedom, but

$$\frac{S}{\sigma_T \sqrt{1 - \dfrac{2(pn - 1)\rho_S + 2(n - 1)\rho_C}{N - 2}}}$$

is distributed as the square root of a chi-square with $h$ degrees of freedom divided by its degrees of freedom.
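The design effect and the approximate degrees of freedom in (9) are straightforward to evaluate numerically. The following is a minimal sketch for the balanced case only; the function names `design_effect` and `effective_df` are illustrative choices and not part of the paper, and `m_total` is assumed to be the total number of clusters $M$ across both treatment groups.

```python
def design_effect(p, n, rho_s, rho_c):
    """Generalized design effect 1 + (pn - 1)*rho_S + (n - 1)*rho_C."""
    return 1.0 + (p * n - 1) * rho_s + (n - 1) * rho_c


def effective_df(m_total, p, n, rho_s, rho_c):
    """Approximate degrees of freedom h of equation (9), balanced design.

    m_total is M, the total number of clusters in both groups combined,
    p is the number of subclusters per cluster, n the subcluster size.
    """
    N = m_total * p * n                 # total number of observations
    n_hat = N - 2 * p * n               # N - 2pn
    n_breve = N - 2 * n                 # N - 2n
    rho_bar = 1.0 - rho_s - rho_c       # share of variance within subclusters
    numerator = ((N - 2) - 2 * (p * n - 1) * rho_s - 2 * (n - 1) * rho_c) ** 2
    denominator = (
        p * n * n_hat * rho_s ** 2
        + n * n_breve * rho_c ** 2
        + (N - 2) * rho_bar ** 2
        + 2 * n * n_hat * rho_s * rho_c
        + 2 * n_hat * rho_s * rho_bar
        + 2 * n_breve * rho_c * rho_bar
    )
    return numerator / denominator
```

Evaluating `effective_df` at the three boundary cases discussed above (no clustering, complete clustering by school, complete clustering by classroom) gives a quick check that an implementation reproduces $N - 2$, $M - 2$, and $Mp - 2$.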
The Sampling Distribution of the t-statistic When Either $\rho_S \neq 0$ or $\rho_C \neq 0$

The results in the previous section imply that when either $\rho_S \neq 0$ or $\rho_C \neq 0$, the statistic

$$\frac{\sqrt{\tilde{N}}\left(\bar{Y}_{\bullet\bullet\bullet}^T - \bar{Y}_{\bullet\bullet\bullet}^C\right) \Big/ \left(\sigma_T \sqrt{1 + (pn - 1)\rho_S + (n - 1)\rho_C}\right)}{S \Big/ \left(\sigma_T \sqrt{1 - \dfrac{2(pn - 1)\rho_S + 2(n - 1)\rho_C}{N - 2}}\right)} = c\,\frac{\sqrt{\tilde{N}}\left(\bar{Y}_{\bullet\bullet\bullet}^T - \bar{Y}_{\bullet\bullet\bullet}^C\right)}{S} = ct$$

has the t-distribution with $h$ degrees of freedom, where $c$ is a constant depending on $N$, $p$, $n$, $\rho_S$, and $\rho_C$ that absorbs the ratios of the scale factors in the numerator and denominator, and which is given by

$$c = \sqrt{\frac{(N - 2) - 2(pn - 1)\rho_S - 2(n - 1)\rho_C}{(N - 2)\left[1 + (pn - 1)\rho_S + (n - 1)\rho_C\right]}}. \tag{10}$$

Thus the statistic

$$t_A = ct \tag{11}$$

has the t-distribution with $h$ degrees of freedom and can be thought of as a t-statistic adjusted for clustering effects on both the mean difference and the standard deviation. Thus a two-sided test of the null hypothesis of equal group means consists of rejecting $H_0$ if $|t_A|$ exceeds the $100\alpha$ percent two-tailed critical value of the t-distribution with $h$ degrees of freedom. The one-sided test rejects $H_0$ on the positive side if $t_A$ exceeds the $100\alpha$ percent one-tailed critical value of the t-distribution with $h$ degrees of freedom.

Note that if $\rho_S = 0$ and $\rho_C = 0$, so that there is no clustering, then $c = 1$ and $h = N - 2$. That is, when $\rho_S = 0$ and $\rho_C = 0$, the test based on $t_A$ reduces to the usual t-test ignoring clustering. When $\rho_S = 1$ and $\rho_C = 0$ and there is complete clustering by school, then $c = \sqrt{(M - 2)/(N - 2)}$ and $h = M - 2$. That is, when $\rho_S = 1$ and $\rho_C = 0$, the test based on $t_A$ reduces to a t-test computed using the cluster (school) means. Note that when $\rho_S = 0$ and $\rho_C = 1$, $c = \sqrt{(Mp - 2)/(N - 2)}$ and $h = Mp - 2$, so that the test based on $t_A$ reduces to a t-test computed using the subcluster (classroom) means.

The sampling distribution of $t_A$ is not exact, but it is based on theory that yields a very good approximation (see, e.g., Welch, 1949; Welch, 1956; Gaylor and Hopper, 1969) and is widely used in other settings to construct tests in complex analyses of variance, such as unbalanced between-subjects designs and repeated measures designs (see, e.g., Geisser and Greenhouse, 1958). Extensive simulation experiments in connection with two-level designs found the rejection rates of the corresponding test to be indistinguishable from nominal (see Hedges, in press). Our simulation results in three-level designs (not reported here) also confirm that rejection rates do not appear to differ from nominal.

One immediate application of the results in this paper is to study the rejection rate of the unadjusted t-test.
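The inflation of the unadjusted test's rejection rate can also be seen in a small simulation. The sketch below is not the simulation design used in the paper: it assumes a balanced design with `m` schools per treatment group, total variance fixed at 1, a true null hypothesis, and a normal-approximation critical value of 1.96 in place of the exact t critical value (reasonable only when $N$ is large). All function names are hypothetical.

```python
import math
import random


def simulate_group(m, p, n, sd_bs, sd_bc, sd_w, rng):
    """Draw one treatment arm: m schools, p classrooms each, n students each."""
    ys = []
    for _ in range(m):
        school = rng.gauss(0.0, sd_bs)                # school random effect
        for _ in range(p):
            classroom = school + rng.gauss(0.0, sd_bc)  # classroom mean
            ys.extend(classroom + rng.gauss(0.0, sd_w) for _ in range(n))
    return ys


def naive_t(y_t, y_c):
    """Unadjusted two-sample t-statistic of equation (8)."""
    n_t, n_c = len(y_t), len(y_c)
    mean_t, mean_c = sum(y_t) / n_t, sum(y_c) / n_c
    ss = sum((y - mean_t) ** 2 for y in y_t) + sum((y - mean_c) ** 2 for y in y_c)
    s = math.sqrt(ss / (n_t + n_c - 2))               # pooled SD, equation (1)
    n_tilde = n_t * n_c / (n_t + n_c)
    return math.sqrt(n_tilde) * (mean_t - mean_c) / s


def naive_rejection_rate(m, p, n, rho_s, rho_c, reps=400, crit=1.96, seed=1):
    """Share of null replications in which |t| exceeds crit (assumed cutoff)."""
    rng = random.Random(seed)
    sd_bs, sd_bc = math.sqrt(rho_s), math.sqrt(rho_c)
    sd_w = math.sqrt(1.0 - rho_s - rho_c)             # total variance is 1
    hits = 0
    for _ in range(reps):
        y_t = simulate_group(m, p, n, sd_bs, sd_bc, sd_w, rng)
        y_c = simulate_group(m, p, n, sd_bs, sd_bc, sd_w, rng)
        hits += abs(naive_t(y_t, y_c)) > crit
    return hits / reps
```

For example, `naive_rejection_rate(5, 2, 10, 0.2, 0.1)` can be compared with `naive_rejection_rate(5, 2, 10, 0.0, 0.0)`; with clustering present the first should be well above the nominal 5 percent, while the second should be close to it.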
While it is well known that the unadjusted t-test has a rejection rate that is often much higher than nominal (see, e.g., Murray, Hannan, and Baker, 1996), previous studies have relied on simulation to study this test. The sampling distribution of $t_A$ provides an analytic expression for the rejection rates of the unadjusted t-test under the cluster sampling model. Let $t(\nu, \alpha)$ be the level $\alpha$ two-sided critical value for the t-distribution with $\nu$ degrees of freedom. Then the usual unadjusted t-test rejects if $|t| > t(N - 2, \alpha)$. Because $t_A = ct$ has the t-distribution with $h$ degrees of freedom under the null hypothesis, the rejection rate of the unadjusted test is

$$2\left\{1 - F\left[c\,t(N - 2, \alpha),\, h\right]\right\}, \tag{12}$$

where $F[x, \nu]$ is the cumulative distribution function of the t-distribution with $\nu$ degrees of freedom. Computations with this expression (not reported in this paper) are very consistent with the empirical rejection rates obtained in our simulations.

Relation to Previous Work

The properties of significance tests in designs with two levels of nesting were discussed by Murray, Hannan, and Baker (1996). In one part of their paper, they provided results of Monte Carlo studies of rejection rates of the naïve test that ignored clustering (the test based on the statistic $F_{ind}$ with degrees of freedom $ddf_{ind}$ in their notation). The rejection rates computed using the methods in this paper agree well with their results. Table 1 gives the values computed using the methods in this paper and the results tabulated by Murray, Hannan, and Baker (1996) for $F_{ind}$ with degrees of freedom $ddf_{ind}$. All of the results based on this paper are within two standard errors of the empirical proportion obtained in the simulation, and all but one are within one standard error.

The sampling distribution of $t_A$ derived in this paper provides some insight about other approaches to testing mean differences in clustered samples. For designs with a single level of clustering, Kish (1965) suggested multiplying $S$ (or, equivalently, dividing the t-statistic) by the square root of the design effect to remove the effect of clustering on the numerator of the t-statistic. The generalization of that suggestion would be to divide the t-statistic by the square root of $[1 + (pn - 1)\rho_S + (n - 1)\rho_C]$, yielding the statistic

$$t_K = \frac{\sqrt{\tilde{N}}\left(\bar{Y}_{\bullet\bullet\bullet}^T - \bar{Y}_{\bullet\bullet\bullet}^C\right)}{S\sqrt{1 + (pn - 1)\rho_S + (n - 1)\rho_C}}.$$

However, because this statistic does not correct for the fact that the scale factor necessary to standardize $S$ is not $\sigma_T$, the sampling distribution of $t_K$ is not a t-distribution but a constant times a t-distribution with $h$ degrees of freedom, namely

$$t_K = \frac{t_A}{\sqrt{1 - \dfrac{2(pn - 1)\rho_S + 2(n - 1)\rho_C}{N - 2}}}. \tag{13}$$

If $\rho_S \neq 0$ or $\rho_C \neq 0$ the denominator of (13) is less than one, so $|t_K| > |t_A|$. However, note that the denominator of (13) will be quite close to 1 unless $m$ is small and $\rho_S$ is large. For example, if $\rho_S$ = 0.5, $\rho_C$ = 0.5, $n$ = 30, $p$ = 3 and $m$ = [illegible], the denominator of (13) is about 0.95, but if $n$ = 30, $p$ = 3, and $m$ = [illegible], the denominator is 0.986. Therefore the sampling distribution of $t_K$ is approximately a t-distribution with $h$ degrees of freedom.

One might wish to avoid the computation of $h$ by using a simpler approximation for the degrees of freedom that is used to obtain a critical value for the test using $t_K$. Obvious possibilities for degrees of freedom include the degrees of freedom based on the number of individuals, namely $(N - 2)$; degrees of freedom based on the number of schools, namely $(M - 2)$; and the effective degrees of freedom reduced by the design effect, namely $(N - 2)/[1 + (pn - 1)\rho_S + (n - 1)\rho_C]$.
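To make the relationship among $t$, $t_A$, and $t_K$ concrete, the following sketch applies (10), (11), and (13) to a reported unadjusted t-statistic in a balanced design and lists the three simple candidates for degrees of freedom. The function names are illustrative only, and the intraclass correlations passed in are assumed or imputed values, as in the sensitivity-analysis use described earlier.

```python
import math


def adjustment_constant(N, p, n, rho_s, rho_c):
    """Constant c of equation (10) for a balanced design with N observations."""
    deff = 1.0 + (p * n - 1) * rho_s + (n - 1) * rho_c
    shrink = (N - 2) - 2 * (p * n - 1) * rho_s - 2 * (n - 1) * rho_c
    return math.sqrt(shrink / ((N - 2) * deff))


def adjusted_statistics(t, N, p, n, rho_s, rho_c):
    """Return (t_A, t_K) for a reported unadjusted t-statistic.

    t_A = c * t follows equation (11); t_K divides t by the square root
    of the design effect only, as in the generalization of Kish's proposal.
    """
    deff = 1.0 + (p * n - 1) * rho_s + (n - 1) * rho_c
    t_a = adjustment_constant(N, p, n, rho_s, rho_c) * t
    t_k = t / math.sqrt(deff)
    return t_a, t_k


def df_candidates(N, M, p, n, rho_s, rho_c):
    """Simple degrees-of-freedom choices discussed for use with t_K."""
    deff = 1.0 + (p * n - 1) * rho_s + (n - 1) * rho_c
    return {
        "individuals": N - 2,
        "schools": M - 2,
        "design_effect": (N - 2) / deff,
    }
```

For instance, `adjusted_statistics(2.5, 1800, 3, 30, 0.2, 0.1)` returns a pair whose ratio `t_A / t_K` equals the denominator of (13), which is one way to check an implementation against the text.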
able shows the actual rejection rates for two-sided tests at the α = 0.05 significance level for the naïve test that ignores clustering and for tests using the statistic t K with critical values based on (N ), (M ), and (N )/ [ + (pn )ρs + (n )ρc ] degrees of freedo for plausible situations. he eighth colun of the table, which gives the results of the naïve test ignoring clustering, shows that the effects of two levels of clustering can be profound. It shows that the actual rejection rates for the 5 percent test under the null hypothesis are as large as 70 percent. Note that the test based on statistic t K using (N ) degrees of freedo is liberal, rejecting ore often than its noinal rate of 5 percent, particularly when the nuber M of clusters is sall. he test based on statistic t K using (M ) degrees of freedo is conservative, rejecting less often than its noinal rate of 5 percent, and is very conservative when the nuber M of clusters is sall. In contrast, the test based on statistic t K using (N )/ [ + (pn )ρs + (n )ρc ] degrees of freedo is soeties slightly liberal, soeties slightly conservative, but generally has a level very close to the noinal 5 percent. Unequal luster Saple Sizes When cluster saple sizes are unequal, the expression for the sapling distribution of the t-test statistic fro clustered saples and is considerably ore coplex. In this section we give the sapling distribution of the usual t-statistic and a

statistic that is adjusted for the effects of clustering when cluster sample sizes are not equal. These expressions may be of use when cluster sample sizes are unequal and are reported explicitly. They also give some insight about what single compromise value of $p$ or $n$ might give the most accurate results when substituted into the equal sample size formulas for rough approximations. The expressions are quite complex when subcluster sample sizes are unequal. Consequently we first provide expressions for the adjusted t-statistic and its degrees of freedom when the subcluster sizes are equal but the cluster sizes are unequal. Then we give expressions for the case in which the subcluster sample sizes are unequal.

Unequal Cluster (School) Sample Sizes but Equal Subcluster (Classroom) Sizes

In this section we consider the case in which the subcluster (classroom) sample sizes are equal or nearly so, but clusters differ in the number of subclusters (e.g., schools have different numbers of classrooms). That is, we assume that the subcluster sample sizes $n_{ij}^T$ and $n_{ij}^C$ are all equal to $n$, but the numbers of treatment and control group clusters ($m^T$ and $m^C$) may differ, and the numbers of subclusters within each treatment and control group cluster ($p_i^T$ and $p_i^C$) may also differ.

This situation is of interest for several reasons. First, as a practical matter, schools that are sampled in research studies have different numbers of classrooms, but the classroom sample sizes are equal or approximately equal (see, e.g., Ridgway et al., 2000). Second, the adjustments to the t-statistic and the degrees of freedom depend much more on cluster (school) sample sizes than on subcluster (classroom) sample sizes. Adjustment for unequal classroom sample sizes is therefore a second-order correction to both the test statistic and the degrees of freedom, so treating the subcluster sample sizes as equal when they are not quite equal has relatively little effect.
Third, the subcluster sample sizes are much less likely to be reported than the cluster sample sizes, so these expressions are more likely to be of practical use. Finally, the expressions for the adjustment and the degrees of freedom are much simpler when subcluster sample sizes are equal.

When the number of clusters is unequal, the adjusted t-statistic that generalizes $t_A$ becomes

$$t_{AU} = c_U\, t, \tag{14}$$

where the adjustment constant $c_U$ is given by

$$c_U = \sqrt{\frac{(N-2) - 2(p_U n - 1)\rho_S - 2(n-1)\rho_C}{(N-2)\left[1 + (\tilde p_U n - 1)\rho_S + (n-1)\rho_C\right]}}, \tag{15}$$

where

$$p_U = \frac{n\sum_{i=1}^{m^T}(p_i^T)^2}{2N^T} + \frac{n\sum_{i=1}^{m^C}(p_i^C)^2}{2N^C} \tag{16}$$

and

$$\tilde p_U = \frac{N^C\, n\sum_{i=1}^{m^T}(p_i^T)^2}{N\,N^T} + \frac{N^T\, n\sum_{i=1}^{m^C}(p_i^C)^2}{N\,N^C}. \tag{17}$$

Note that if all the $p_i^T$ and $p_i^C$ are equal to $p$, then $p_U = p$, $\tilde p_U = p$, and expression (15) for $c_U$ reduces to expression (10) for $c$.
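Expressions (15) to (17) translate directly into a few helper functions. The sketch below is illustrative only: the function names and the unequal classroom counts in the usage lines are hypothetical, not taken from the paper. It also checks the reduction stated above, namely that $p_U$ and $\tilde p_U$ both equal $p$ when every school has $p$ classrooms.

```python
import math

def p_u(n, p_t, p_c):
    """p_U from (16); p_t and p_c list the classrooms per school in each group."""
    N_t, N_c = n * sum(p_t), n * sum(p_c)
    return (n * sum(p * p for p in p_t) / (2 * N_t)
            + n * sum(p * p for p in p_c) / (2 * N_c))

def p_tilde_u(n, p_t, p_c):
    """p~_U from (17)."""
    N_t, N_c = n * sum(p_t), n * sum(p_c)
    N = N_t + N_c
    return (N_c * n * sum(p * p for p in p_t) / (N * N_t)
            + N_t * n * sum(p * p for p in p_c) / (N * N_c))

def c_u(n, p_t, p_c, rho_s, rho_c):
    """Adjustment constant c_U from (15), equal classroom size n."""
    N = n * (sum(p_t) + sum(p_c))
    num = ((N - 2) - 2 * (p_u(n, p_t, p_c) * n - 1) * rho_s
           - 2 * (n - 1) * rho_c)
    den = (N - 2) * (1 + (p_tilde_u(n, p_t, p_c) * n - 1) * rho_s
                     + (n - 1) * rho_c)
    return math.sqrt(num / den)

# Hypothetical unequal design: classrooms per school in each group.
unequal_t, unequal_c = [1, 2, 4], [2, 3]
print(p_u(20, unequal_t, unequal_c), p_tilde_u(20, unequal_t, unequal_c))
print(c_u(20, unequal_t, unequal_c, rho_s=0.2, rho_c=0.1))

# Balanced check: every school has p = 3 classrooms.
print(p_u(20, [3] * 4, [3] * 5), p_tilde_u(20, [3] * 4, [3] * 5))
```

Passing the per-school classroom counts as plain lists keeps the functions usable when only the number of classrooms per school is reported, which is the situation this section targets.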

The statistic $t_{AU}$ has Student's t-distribution with $h_U$ degrees of freedom, where $h_U$ is given by

$$h_U = \frac{\left[(N-2) - 2(p_U n - 1)\rho_S - 2(n-1)\rho_C\right]^2}{A\rho_S^2 + n\hat N\rho_C^2 + (N-2)\bar\rho^2 + 2n\breve N_U\rho_S\rho_C + 2\breve N_U\rho_S\bar\rho + 2\hat N\rho_C\bar\rho}, \tag{18}$$

where $\breve N_U = (N - 2p_U n)$, $\hat N = (N - 2n)$, $\bar\rho = 1 - \rho_S - \rho_C$, and the auxiliary constant $A$ is defined via $A = A^T + A^C$, with

$$A^T = \frac{n^2 (N^T)^2\sum_{i=1}^{m^T}(p_i^T)^2 + n^4\left(\sum_{i=1}^{m^T}(p_i^T)^2\right)^2 - 2n^3 N^T\sum_{i=1}^{m^T}(p_i^T)^3}{(N^T)^2},$$

$$A^C = \frac{n^2 (N^C)^2\sum_{i=1}^{m^C}(p_i^C)^2 + n^4\left(\sum_{i=1}^{m^C}(p_i^C)^2\right)^2 - 2n^3 N^C\sum_{i=1}^{m^C}(p_i^C)^3}{(N^C)^2}. \tag{19}$$

Note that when the $p_i^T$ and $p_i^C$ are all equal to $p$, then $p_U = p$, $A = pn(N - 2pn)$, and expression (18) for $h_U$ reduces to expression (9) for $h$.

Unequal Subcluster (Classroom) Sample Sizes

The exact expression for the degrees of freedom $h$ is quite complex when subcluster (classroom) sample sizes are unequal. The complexity of the expression is not unexpected. The denominator of $h$ is the variance of a linear combination of three correlated variance component estimates, and the variances and covariances of these variance component estimates are themselves quite complex in unbalanced designs with two nested factors (see, e.g., Searle, 1971, pp. 475-477). To obtain reasonably compact expressions, it is useful to define several auxiliary constants, which are given in Table 3.

When the sample sizes in the subclusters are unequal, the adjusted t-statistic that generalizes $t_A$ becomes $t_{AU} = c_U\, t$, where the adjustment constant $c_U$ is given by

$$c_U = \sqrt{\frac{(N-2) - (k_1 - 2)\rho_S - (k_3 - 2)\rho_C}{(N-2)\left[1 + (\tilde k_1 - 1)\rho_S + (\tilde k_3 - 1)\rho_C\right]}}, \tag{20}$$

where $k_1 = k_1^T + k_1^C$, $k_3 = k_3^T + k_3^C$,

$$\tilde k_1 = \frac{N^C k_1^T + N^T k_1^C}{N^T + N^C}, \qquad \tilde k_3 = \frac{N^C k_3^T + N^T k_3^C}{N^T + N^C},$$

and the auxiliary constants $k_1^T$, $k_1^C$, $k_3^T$, and $k_3^C$ are defined in Table 3. Note that if all the $p_i^T$ and $p_i^C$ are equal to $p$ and all the $n_{ij}^T$ and $n_{ij}^C$ are equal to $n$, then $k_1^T = k_1^C = pn$ and $k_3^T = k_3^C = n$, and expression (20) for $c_U$ reduces to expression (10) for $c$.

When the null hypothesis is true, the statistic $t_{AU}$ has Student's t-distribution with $h_U$ degrees of freedom, where $h_U$ is given by

$$h_U = \frac{\left[(N-2) - (k_1 - 2)\rho_S - (k_3 - 2)\rho_C\right]^2}{(N-2)\bar\rho^2 + B\rho_S^2 + C\rho_C^2 + D\rho_S\rho_C + E\rho_S\bar\rho + F\rho_C\bar\rho}, \tag{21}$$

where $\bar\rho = 1 - \rho_S - \rho_C$, and $B = B^T + B^C$, $C = C^T + C^C$, $D = D^T + D^C$, $E = E^T + E^C$, and $F = F^T + F^C$ are defined below. In the definitions below, the $T$ and $C$ superscripts denoting the treatment and control groups are omitted for simplicity. Thus the definitions give the values of the constants $B$, $C$, $D$, $E$, and $F$ within each treatment group ($B^T$, $C^T$, etc.) in terms of the auxiliary constants $k_1$ to $k_9$ given in Table 3:

$$B = k_1(N + k_1) - 2k_9/N,$$

C = {k 3 [N(k k 3 ) + k 3 (N k ) ] + (N k 3 ) (k 7 + Nk 3 k 5 ) 4(N k 3 )(k k 3 )(k 7 + Nk 3 k 5 ) + 4(N k 3 )(N k )(k 5 k 7 k 4 /N) + 4(N k )(k k 3 )k 4 /N}/{(N k ) },

$$D = 2\left[k_3(N + k_1) - 2k_8/N\right], \qquad E = 2\left[N - k_1\right], \qquad F = 2\left[N - k_3\right].$$

Note that when the $p_i^T$ and $p_i^C$ are all equal to $p$, and all the $n_{ij}^T$ and $n_{ij}^C$ are equal to $n$, then expression (21) for $h_U$ reduces to expression (9) for $h$.

Confidence Intervals

Confidence intervals based on the standard error of the mean difference and using the critical values used in the test based on $t$ assuming simple random sampling will not be accurate when either $\rho_S \neq 0$ and $p > 1$ or $\rho_C \neq 0$ and $n > 1$. That is, the actual probability content of these confidence intervals will usually be smaller than nominal (the confidence intervals will be too short). The corrected t-statistic $t_A$ can be used to obtain confidence intervals that have the correct probability content.
A $100(1-\alpha)$ percent confidence interval for the treatment effect $\mu^T - \mu^C$ is given by

$$(\bar Y^T - \bar Y^C) - \frac{t(\alpha, h)\,S}{c\sqrt{\tilde N}} \;\le\; \mu^T - \mu^C \;\le\; (\bar Y^T - \bar Y^C) + \frac{t(\alpha, h)\,S}{c\sqrt{\tilde N}}, \tag{22}$$

where $c$ is the constant defined in (10) if the cluster and subcluster sample sizes, respectively, are equal, or the constant $c_U$ defined in (15) or (20) if they are unequal, and $t(\alpha, \nu)$ is the $100\alpha$ percent two-sided critical value of the t-distribution with $\nu$ degrees of freedom (e.g., if $\alpha = 0.05$ and $\nu = 120$, then $t(\alpha, \nu) = 1.98$).

Example

An evaluation of the Connected Mathematics curriculum reported by Ridgway et al. (2002) compared the achievement of $p^T = 2$ classrooms of 6th grade students who used Connected Mathematics in each of $m^T = 9$ schools with that of $p^C = 1$ classroom in each of

$m^C = 9$ schools in a comparison group that did not use Connected Mathematics. In this quasi-experimental design the clusters were schools and the subclusters were classrooms. The class sizes were not identical, but the average class size was $N^T/(p^T m^T) = 338/18 = 18.8$ in the treatment group and $N^C/(p^C m^C) = 162/9 = 18$ in the control group. The exact sizes of all the classes were not reported, but here we treat the subcluster sizes as if they were equal and choose $n = 18$ as a slightly conservative sample size. The mean difference between treatment and control groups is $\bar Y^T - \bar Y^C = -1.5$, and the pooled within-groups standard deviation is $S = 2.436$.

This evaluation involved sites in all regions of the country and was intended to be nationally representative. Ridgway et al. did not give an estimate of the intraclass correlation based on their sample. Hedges and Hedberg (2007) provide an estimate of the school-level grade 6 intraclass correlation in mathematics achievement for the nation as a whole (based on a national probability sample) of 0.264. For this example we therefore assume that the intraclass correlation at the school level is $\rho_S = 0.264$ and that the classroom-level intraclass correlation is about two thirds as large, namely $\rho_C = 0.176$.

The analysis carried out by the investigators ignored clustering. Comparing the mean of all of the students in the treatment group with the mean of all of the students in the control group using a conventional t-test leads to an unadjusted value of $t = 6.399$, which is highly statistically significant compared with a critical value based on $(N-2) = 500 - 2 = 498$ degrees of freedom, or $486 - 2 = 484$ degrees of freedom under our slightly conservative assumption that classrooms had an equal sample size of $n = 18$.

To determine what impact clustering may have had on the statistical significance of these findings, we compute the adjusted t-test. We start by computing $p_U$ from (16) and $\tilde p_U$ from (17), obtaining $p_U = 1.5$ and $\tilde p_U = 1.33$.
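The computation in this example can be reproduced with a short script. This is a sketch under the assumptions stated in the example (equal classroom size $n = 18$, $\rho_S = 0.264$, $\rho_C = 0.176$); the variable names are illustrative, and the critical value 1.985 for about 96 degrees of freedom is an approximate table lookup supplied here, not a figure given in the text.

```python
import math

# Design and summary statistics from the example.
n = 18                      # assumed common classroom size
p_t = [2] * 9               # classrooms per treatment school
p_c = [1] * 9               # classrooms per control school
rho_s, rho_c = 0.264, 0.176
t_unadjusted = 6.399
mean_diff, S = -1.5, 2.436

N_t, N_c = n * sum(p_t), n * sum(p_c)
N = N_t + N_c

def power_sum(ps, k):
    """Sum of the k-th powers of the classroom counts."""
    return sum(p ** k for p in ps)

# Expressions (16) and (17).
p_u = n * power_sum(p_t, 2) / (2 * N_t) + n * power_sum(p_c, 2) / (2 * N_c)
p_tilde_u = (N_c * n * power_sum(p_t, 2) / (N * N_t)
             + N_t * n * power_sum(p_c, 2) / (N * N_c))

# Expression (15) and the adjusted statistic (14).
numerator = (N - 2) - 2 * (p_u * n - 1) * rho_s - 2 * (n - 1) * rho_c
c_u = math.sqrt(numerator / ((N - 2) * (1 + (p_tilde_u * n - 1) * rho_s
                                        + (n - 1) * rho_c)))
t_au = c_u * t_unadjusted

def a_group(ps, N_g):
    """Within-group part of the auxiliary constant A in (19)."""
    s2, s3 = power_sum(ps, 2), power_sum(ps, 3)
    return (n**2 * N_g**2 * s2 + n**4 * s2**2 - 2 * n**3 * N_g * s3) / N_g**2

# Expression (18) for the degrees of freedom.
A = a_group(p_t, N_t) + a_group(p_c, N_c)
N_breve_u = N - 2 * p_u * n
N_hat = N - 2 * n
rho_bar = 1 - rho_s - rho_c
h_u = numerator**2 / (A * rho_s**2 + n * N_hat * rho_c**2
                      + (N - 2) * rho_bar**2
                      + 2 * n * N_breve_u * rho_s * rho_c
                      + 2 * N_breve_u * rho_s * rho_bar
                      + 2 * N_hat * rho_c * rho_bar)

# Confidence interval (22); t_crit is an assumed approximate lookup value.
t_crit = 1.985
half_width = t_crit * S / (c_u * math.sqrt(N_t * N_c / N))
ci_low, ci_high = mean_diff - half_width, mean_diff + half_width

print(f"p_U = {p_u:.2f}, p~_U = {p_tilde_u:.2f}, c_U = {c_u:.3f}")
print(f"t_AU = {t_au:.3f}, A = {A:.0f}, h_U = {h_u:.2f}")
print(f"95% CI: {ci_low:.3f} to {ci_high:.3f}")
```

Setting `rho_c = 0.0` in the same script gives the school-level-only adjustment, which makes it easy to see how much the classroom level contributes.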
Inserting these values into expression (15) for $c_U$ yields $c_U = 0.309$ and a t-statistic adjusted for clustering of $t_{AU} = 1.976$, which is much smaller than the unadjusted t-statistic. To compute the degrees of freedom for the adjusted test, we first compute the auxiliary constant $A$ using (19) and obtain $A = 12{,}960$; we then insert this value of $A$ along with $\breve N_U = 432$ and $\hat N = 450$ into (18) to obtain $h_U = 96.02$. Comparing the value of the adjusted statistic, $t_{AU} = 1.976$, with Student's t-distribution with $h_U = 96.02$ degrees of freedom, we see that the two-tailed p-value is $p = 0.051$. Thus a conventional interpretation would be that the result is not quite statistically significant at the 5 percent level.

A 95 percent confidence interval for $\mu^T - \mu^C$ computed from (22) is given by $-3.007 \le \mu^T - \mu^C \le 0.007$, which has width 3.014 and, as expected from the outcome of the significance test, contains zero. Comparing this with the confidence interval that would be computed ignoring clustering ($-1.96$ to $-1.04$), which has width 0.92, we see that the confidence interval that ignores clustering is considerably (and erroneously) narrower than the one using $t_A$, which takes clustering into account.

This example illustrates that a finding implying treatment effects that seem very reliably different from zero when the analysis ignores clustering may be equivocal when clustering is taken into account. The adjustment used in this example involves assumptions about intraclass correlations that may not be exactly correct. It should be viewed more as a sensitivity analysis than as a sharp estimate of actual significance values. (For example, if the value of $\rho_S$ were decreased modestly, say to 0.25, the

adjusted t-test would yield a p-value less than 0.05.) However, the assumptions made in this example are likely to be more plausible than the assumption that $\rho_S = \rho_C = 0$, which corresponds to the idea that clustering can be safely ignored.

This example also illustrates that when the sampling design in an experiment involves a three-stage sample with two levels of clustering (nesting), such as sampling students by first selecting schools, then classrooms within schools, then students within classrooms, it is important to include all of the levels of nesting in adjustments for clustering. If we had ignored the clustering at the classroom level (or, equivalently, assumed that $\rho_C = 0$) and continued to assume that $\rho_S = 0.264$, then we would have calculated a value of $c_U = 0.371$ and an adjusted t-statistic of $t_{AU} = 2.372$ with $h = 165.87$ degrees of freedom and a p-value of $p = 0.019$. Thus we would have concluded that the treatment effect was still reliably different from zero, even after adjusting for clustering at the school level.

Power Considerations

In evaluating any statistical test, it is useful to know its power relative to alternative tests that might be used. The corrected t-test presented in this paper is likely to be used in situations where there is no obvious alternative (that is, in situations where only a data summary such as a t-statistic computed ignoring clustering is available). Yet it is still useful to know something about the power of this test compared with that of the alternatives that could be used if more data were available. Two alternatives that require more information than the test given here, but which may be computed without complete reanalysis of the data, are a t-test performed on cluster (school) means (that is, using the school as the unit of analysis) and a generalized least squares (GLS) analysis computed using known values of $\rho_S$ and $\rho_C$ to parameterize the error covariance matrix.
Blair and Higgins (1986) give the two-level version of the test based on GLS, but its extension to three levels is straightforward. These two tests provide useful standards of comparison because the test based on cluster (school) means is the most powerful exact test when both $\rho_S$ and $\rho_C$ are unknown, while the test based on generalized least squares is the most powerful exact test when both $\rho_S$ and $\rho_C$ are known.

When the null hypothesis is false (and the design is balanced), the test statistics used in all three analyses (the one based on the results in this paper, and the two alternatives requiring more data) have noncentral t-distributions with the same noncentrality parameter,

$$\lambda = \frac{\sqrt{\tilde N}\,(\mu^T - \mu^C)}{\sigma\sqrt{1 + (pn-1)\rho_S + (n-1)\rho_C}}, \tag{23}$$

but different degrees of freedom: $h$ for the test in this paper, $(M-2)$ for the test on cluster means, and $(N-2)$ for the GLS test. Because power is an increasing function of degrees of freedom for a fixed noncentrality parameter, the relative power of these three tests is determined by the degrees of freedom. Because the analysis based on generalized least squares has $(N-2)$ degrees of freedom and $(N-2) \ge h \ge (M-2)$, it will provide the most powerful test if $\rho_S$ and $\rho_C$ are known and the raw data are available. Because the analysis based on school means has $(M-2)$ degrees of freedom and $(M-2) \le h \le (N-2)$, it should always provide the least powerful of the three tests. Because the test based on $t_A$ has $h$ degrees of freedom, it should have power in between that of the other two tests. However, because the dependence of the power function on degrees of freedom (for a fixed noncentrality parameter) is slight when degrees of

freedom are 30 or more, the difference in the power of these three tests need not be substantial.

Table 4 gives the power of each of the three tests in some illustrative situations for a fixed treatment effect $\mu^T - \mu^C$ expressed in units of $\sigma$; the last column is the ratio of the power of the test proposed here to that of the test based on generalized least squares. This table illustrates that when the number of clusters is small, the adjusted t-test is considerably more powerful than the test using cluster means as the unit of analysis, but the power advantage decreases as the number of clusters increases. However, it is important to remember that the test based on cluster means is the most powerful test if $\rho_S$ and $\rho_C$ are unknown. That is, the power advantage of the GLS test and the adjusted t-test depends on having known values of $\rho_S$ and $\rho_C$. While the adjusted t-test is slightly less powerful than the GLS test, it is very nearly as powerful.

Conclusions

Cluster randomized trials are important in education and the social and policy sciences, but these trials are often improperly analyzed by ignoring the effects of clustering on significance tests. It is obviously desirable that these trials be analyzed using more appropriate statistical methods (such as multilevel statistical methods). However, when conclusions must be drawn from published reports (using t- or F-tests that ignore clustering), corrected significance levels and confidence intervals can be obtained if the intraclass correlations are known or plausible values can be imputed. Such procedures provide reasonably accurate significance levels and are suitable for obtaining bounds on the results.

The theory given in this paper can also be used to study alternative suggestions for adjusting t-tests for clustering. Such analyses show that a test based on Kish's statistic $t_K$ gives quite conservative results when critical values are obtained using degrees of freedom based strictly on the number of clusters.
A test based on $t_K$ has rejection rates that are generally close to nominal (but not always strictly conservative) when critical values are obtained using degrees of freedom adjusted for the design effect involving both levels of clustering.

When using the adjustments to test statistics given in this paper, it is important to adjust for both levels of clustering. Ignoring one of the levels of nesting (clustering) in computing the adjusted t-statistic or $t_K$ can result in substantial inflation of significance levels.

This paper considered only the simplest analyses for treatment effects under a sampling model with two levels of nesting. Educational experiments sometimes involve the use of covariates at one or more levels of the design to increase precision. The generalization of the methods used in this paper to more complex designs and more complex analyses would be desirable to provide methods for dealing with such cases.

Appendix

Derivations with Equal Cluster and Subcluster Sample Sizes

Under the model, the sampling distribution of the numerator of (8) is normal with mean $\sqrt{\tilde N}(\mu^T - \mu^C)$ and variance $\sigma_W^2 + pn\sigma_{BS}^2 + n\sigma_{BC}^2 = \sigma^2[1 + (pn-1)\rho_S + (n-1)\rho_C]$. The square of the denominator of (8) can be written as

$$S^2 = \frac{SSBS + SSBC + SSW}{N-2}, \tag{24}$$

where $SSBS$ is the pooled sum of squares between cluster (school) means within treatment groups, $SSBC$ is the pooled sum of squares between subcluster (classroom) means within schools and treatment groups, and $SSW$ is the pooled sum of squares within subclusters (classrooms). Therefore $SSW/\sigma_W^2$ has the chi-squared distribution with $(N - Mp)$ degrees of freedom, where $M = m^T + m^C$. Similarly

$$\frac{SSBC}{\sigma_W^2 + n\sigma_{BC}^2} \tag{25}$$

has the chi-squared distribution with $(Mp - M)$ degrees of freedom and

$$\frac{SSBS}{\sigma_W^2 + n\sigma_{BC}^2 + pn\sigma_{BS}^2} \tag{26}$$

has the chi-squared distribution with $(M-2)$ degrees of freedom. Thus $S^2$ is a linear combination of independent chi-squares.

To obtain the sampling distribution of $S^2$, we use a result of Box (1954), which gives the sampling distribution of quadratic forms in normal variables in terms of the first two cumulants of the quadratic form. Theorem 3.1 in Box (1954) implies that $S^2$ is distributed, to an excellent approximation, as a constant $g$ times a chi-square with $h$ degrees of freedom, where $g$ and $h$ are given by

$$g = \frac{V\{S^2\}}{2E\{S^2\}} \tag{27}$$

and

$$h = \frac{2\left(E\{S^2\}\right)^2}{V\{S^2\}}, \tag{28}$$

where $E\{X\}$ and $V\{X\}$ are the expected value and the variance of $X$. Therefore $S^2/gh = S^2/E\{S^2\}$ is distributed as a chi-square with $h$ degrees of freedom divided by $h$. By the definition of the noncentral t-distribution (see, e.g., Johnson and Kotz, 1970), it follows that

$$\frac{\sqrt{\tilde N}\,(\bar Y^T - \bar Y^C)\big/\sigma\sqrt{1 + (pn-1)\rho_S + (n-1)\rho_C}}{\sqrt{S^2/E\{S^2\}}} = ct$$

has the noncentral t-distribution with $h$ degrees of freedom and noncentrality parameter

$$\lambda = \frac{\sqrt{\tilde N}\,(\mu^T - \mu^C)}{\sigma\sqrt{1 + (pn-1)\rho_S + (n-1)\rho_C}},$$

where $c$ is given by

$$c = \sqrt{\frac{E\{S^2\}}{\sigma^2\left[1 + (pn-1)\rho_S + (n-1)\rho_C\right]}} \tag{29}$$

and $h$ is given by (28). When $\mu^T - \mu^C = 0$ (and therefore $\lambda = 0$), the distribution is a central t-distribution with $h$ degrees of freedom.

It follows from (24) and standard theory for expected mean squares in hierarchical designs (see, e.g., Kirk, 1995) that

$$E\{S^2\} = \sigma_W^2 + \frac{N-2n}{N-2}\,\sigma_{BC}^2 + \frac{N-2pn}{N-2}\,\sigma_{BS}^2$$

and

$$V\{S^2\} = \frac{2\left[pn\breve N\sigma_{BS}^4 + n\hat N\sigma_{BC}^4 + (N-2)\sigma_W^4 + 2n\breve N\sigma_{BS}^2\sigma_{BC}^2 + 2\breve N\sigma_{BS}^2\sigma_W^2 + 2\hat N\sigma_{BC}^2\sigma_W^2\right]}{(N-2)^2},$$

where $\breve N = (N - 2pn)$, $\hat N = (N - 2n)$, and $\bar\rho = 1 - \rho_S - \rho_C$. Inserting these values for the mean and variance of $S^2$ into (27) and (28), using the fact that $\rho_S\sigma^2 = \sigma_{BS}^2$, $\rho_C\sigma^2 = \sigma_{BC}^2$, and $(1 - \rho_S - \rho_C)\sigma^2 = \sigma_W^2$, and simplifying gives the values we obtain for $c$ given in (10) and $h$ given in (9).

Unequal Cluster Sample Sizes

When cluster sample sizes are unequal but sample sizes in subclusters are equal, the expressions for the constant $c$ and the degrees of freedom $h$ are more complex. A direct argument leads to

$$V\{\bar Y^T - \bar Y^C\} = \frac{N^T + N^C}{N^T N^C}\left(\sigma_W^2 + n\sigma_{BC}^2 + \tilde p_U n\sigma_{BS}^2\right), \tag{30}$$

where $\tilde p_U$ is defined in (17). Therefore the sampling distribution of the numerator of (8) is normal with mean $\sqrt{\tilde N}(\mu^T - \mu^C)$ and variance $\sigma_W^2 + n\sigma_{BC}^2 + \tilde p_U n\sigma_{BS}^2 = \sigma^2[1 + (\tilde p_U n - 1)\rho_S + (n-1)\rho_C]$.

The expected value and variance of $S^2$ can be calculated from the analysis of variance between clusters, between subclusters, and within subclusters within the treatment groups. When cluster sample sizes are unequal, the sums of squares are still independent, and the within-cluster sum of squares has a chi-square distribution, but if $\rho_S \neq 0$, the between-cluster sum of squares does not have a chi-square distribution. However, because $S^2$ is a quadratic form, Box's theorem can be used to obtain the distribution of $S^2$.

To obtain the expected value and variance of $S^2$, use the fact that

$$S^2 = \frac{SSBS^T + SSBS^C + SSBC^T + SSBC^C + SSW^T + SSW^C}{N-2},$$

where $SSBS^T$, $SSBC^T$, and $SSW^T$, and $SSBS^C$, $SSBC^C$, and $SSW^C$ are the sums of squares between schools, between classes, and within classes in the treatment and control groups, respectively.
When subcluster sample sizes are equal, it is easiest to do this in two steps. Start by computing the sums of squares within schools in the treatment and control groups via $SSWS^T = SSBC^T + SSW^T$ and $SSWS^C = SSBC^C + SSW^C$. Because the classroom sample sizes are equal, this computation is straightforward and follows exactly from results for the two-level model given in Hedges (2007). Then $S^2$ can be written as

$$S^2 = \frac{SSBS^T + SSBS^C + SSWS^T + SSWS^C}{N-2}.$$

Because $SSBS^T$ and $SSBS^C$ are functions of the school means in the treatment and control groups, and they are independent of $SSWS^T$ and $SSWS^C$, the mean and variance of $S^2$ follow exactly from the results for the unequal sample size case for the two-level model given in Hedges (2007) with clusters of size $np_i^T$ or $np_i^C$, respectively.

When the subcluster sample sizes are unequal, we compute $S^2$ as

$$S^2 = \frac{SS^T + SS^C}{N-2},$$

where $SS^T$ and $SS^C$ are the sums of squares about the treatment and control group means, respectively. Each treatment group can be viewed as a design with two nested factors. The means and variances of $SS^T$ and $SS^C$ are calculated separately from results on the estimation of variance components in unbalanced designs with two nested factors (see, e.g., Searle, 1971, pp. 474-477). Specifically, for either group,

$$SS = (N-1)\hat\sigma_W^2 + (N-k_3)\hat\sigma_{BC}^2 + (N-k_1)\hat\sigma_{BS}^2.$$

Using results on the variances and covariances of $\hat\sigma_W^2$, $\hat\sigma_{BC}^2$, and $\hat\sigma_{BS}^2$ (see, e.g., Searle, 1971, pp. 474-477), the mean and variance of $S^2$ are obtained from the means and variances of $SS^T$ and $SS^C$. Inserting these values for the mean and variance of $S^2$ into (29) and (28), and simplifying gives the values we obtain for $c_U$ given in (20) and $h_U$ given in (21).
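The balanced-design moment calculations in this appendix can be checked numerically. The sketch below builds $E\{S^2\}$ and $V\{S^2\}$ directly from the three independent scaled chi-squares in (24) to (26), compares them with the closed forms stated above, and then forms $h$ from (28). The design values (20 schools in total, 3 classrooms per school, 30 students per classroom, and variance components 0.75, 0.10, and 0.15) are hypothetical, and the function names are illustrative.

```python
import math

def moments_from_chi_squares(M, p, n, var_w, var_bc, var_bs):
    """E{S^2} and V{S^2} from the scaled chi-square components of (24)."""
    N = M * p * n
    components = [
        (var_w + n * var_bc + p * n * var_bs, M - 2),   # SSBS, see (26)
        (var_w + n * var_bc, M * p - M),                # SSBC, see (25)
        (var_w, N - M * p),                             # SSW
    ]
    mean = sum(scale * df for scale, df in components) / (N - 2)
    var = 2 * sum(scale**2 * df for scale, df in components) / (N - 2) ** 2
    return mean, var

def moments_closed_form(M, p, n, var_w, var_bc, var_bs):
    """Closed-form E{S^2} and V{S^2} as stated in the appendix."""
    N = M * p * n
    N_breve, N_hat = N - 2 * p * n, N - 2 * n
    mean = (var_w + N_hat / (N - 2) * var_bc + N_breve / (N - 2) * var_bs)
    var = 2 * (p * n * N_breve * var_bs**2 + n * N_hat * var_bc**2
               + (N - 2) * var_w**2
               + 2 * n * N_breve * var_bs * var_bc
               + 2 * N_breve * var_bs * var_w
               + 2 * N_hat * var_bc * var_w) / (N - 2) ** 2
    return mean, var

# Hypothetical balanced design with total variance 1.
M, p, n = 20, 3, 30
var_w, var_bc, var_bs = 0.75, 0.10, 0.15
N = M * p * n

mean_direct, var_direct = moments_from_chi_squares(M, p, n, var_w, var_bc, var_bs)
mean_closed, var_closed = moments_closed_form(M, p, n, var_w, var_bc, var_bs)

g = var_direct / (2 * mean_direct)      # scale constant, expression (27)
h = 2 * mean_direct**2 / var_direct     # degrees of freedom, expression (28)

print(f"E(S^2): {mean_direct:.6f} vs {mean_closed:.6f}")
print(f"V(S^2): {var_direct:.8f} vs {var_closed:.8f}")
print(f"g = {g:.6f}, h = {h:.1f} (between M - 2 = {M - 2} and N - 2 = {N - 2})")
```

Agreement between the two routes confirms the algebra behind the closed forms, and the resulting $h$ falls between $(M-2)$ and $(N-2)$ as the power discussion assumes.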

References

Barcikowski, R. S. (1981). Statistical power with group mean as the unit of analysis. Journal of Educational Statistics, 6, 267-285.
Blair, R. C. & Higgins, J. J. (1986). Comment on "Statistical power with group mean as the unit of analysis." Journal of Educational Statistics, 11, 161-169.
Blitstein, J. L., Hannan, P. J., Murray, D. M., & Shadish, W. R. (2005). Increasing degrees of freedom in existing group randomized trials through the use of external estimates of intraclass correlation: The df* approach. Evaluation Review, 29, 241-267.
Blitstein, J. L., Murray, D. M., Hannan, P. J., & Shadish, W. R. (2005). Increasing degrees of freedom in future group randomized trials through the use of external estimates of intraclass correlation: The df* approach. Evaluation Review, 29, 268-286.
Box, G. E. P. (1954). Some theorems on quadratic forms applied to the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. Annals of Mathematical Statistics, 25, 290-302.
Donner, A. & Klar, N. (2000). Design and analysis of cluster randomization trials in health research. London: Arnold.
Donner, A. & Koval, J. J. (1982). Design considerations in the estimation of intraclass correlations. Annals of Human Genetics, 46, 271-277.
Gaylor, D. W. & Hopper, F. N. (1969). Estimating degrees of freedom for linear combinations of mean squares by Satterthwaite's formula. Technometrics, 11, 691-706.
Geisser, S. & Greenhouse, S. W. (1958). An extension of Box's results on the use of the F distribution in multivariate analysis. Annals of Mathematical Statistics, 29, 885-891.
Gulliford, M. C., Ukoumunne, O. C., & Chinn, S. (1999). Components of variance and intraclass correlations for the design of community-based surveys and intervention studies: Data from the Health Survey for England 1994. American Journal of Epidemiology, 149, 876-883.
Hannan, P. J., Murray, D. M., Jacobs, D. R., & McGovern, P. G. (1994).
Parameters to aid in the design and analysis of community trials: Intraclass correlations from the Minnesota Heart Health Program. Epidemiology, 5, 88-95.
Hedges, L. V. & Hedberg, E. C. (2007). Intraclass correlation values for planning group randomized experiments in education. Educational Evaluation and Policy Analysis, 29, 60-87.
Hedges, L. V. (2007). Correcting a significance test for clustering. Journal of Educational and Behavioral Statistics, 32, 151-179.
Hopkins, K. D. (1982). The unit of analysis: Group means versus individual observations. American Educational Research Journal, 19, 5-18.
Johnson, N. L. & Kotz, S. (1970). Distributions in statistics: Continuous univariate distributions-2. New York: John Wiley.
Kirk, R. (1995). Experimental design. Belmont, CA: Brooks Cole.
Klar, N. & Donner, A. (2001). Current and future challenges in the design and analysis of cluster randomization trials. Statistics in Medicine, 20, 3729-3740.
Kish, L. (1965). Survey sampling. New York: John Wiley.