Issues of Cost and Efficiency in the Design of Reliability Studies


Biometrics 59, December 2003

M. M. Shoukri,1,2 M. H. Asyali,1 and S. D. Walter3

1 Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Center, P.O. Box 3354, Riyadh, Saudi Arabia
2 Department of Epidemiology and Biostatistics, University of Western Ontario, London, Ontario, Canada
3 Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
shoukri@kfshrc.edu.sa

Summary. Reliability of continuous and dichotomous responses is usually assessed by means of the intraclass correlation coefficient (ICC). We derive the optimal allocation of the number of subjects k and the number of repeated measurements n that minimize the variance of the estimated ICC. Cost constraints are discussed for the case of normally distributed responses. Tables showing optimal choices of k and n are given, along with guidelines for the design of reliability studies in light of our results and those reported by others.

Key words: Cost function; Lagrange multiplier; Reliability index; Sample size.

1. Introduction

Interobserver reliability studies are conducted to investigate the reproducibility and level of agreement on assessments made by several raters. Typically, several raters score each of a series of subjects and then their assessments are compared. For general reviews of interobserver agreement, for both continuous and binary assessments, we refer the reader to Dunn (1992), Shoukri (1999, 2000), Shoukri and Asyali (2002), and the references therein. An important aspect of the design of an interobserver reliability study is the determination of sample size.
Following the notation of Walter, Eliasziw, and Donner (1998), we suppose that n observations are made on each of k subjects, and that the jth observation $Y_{ij}$ for subject i ($i = 1, 2, \ldots, k$; $j = 1, 2, \ldots, n$) is

$$Y_{ij} = \mu + s_i + e_{ij}, \qquad (1)$$

where the random subject effects $\{s_i\}$ are normally distributed with mean 0 and variance $\sigma_s^2$, or $N(0, \sigma_s^2)$, the measurement errors $\{e_{ij}\}$ are $N(0, \sigma_e^2)$, and the $\{s_i\}$ and $\{e_{ij}\}$ terms are independent. We assume the subjects are randomly drawn from some population of interest. Regardless of whether the assessments are binary or continuous scale measurements, an index of reliability should distinguish the within-subject variation from the between-subjects variation. A widely recognized index of reliability, which possesses this property, is the intraclass correlation coefficient (ICC), defined as $\rho = \sigma_s^2/(\sigma_s^2 + \sigma_e^2)$. Therefore, ρ is the proportion of the total variation that is associated with between-subject variation.

A frequently adopted design (for interrater reliability) is when the k subjects are each rated by the same n raters. However, a similar approach can also be adopted for test-retest reliability when a single subject is assessed repeatedly on each of several occasions, or when replicates are taken from different subjects by a single judge on different occasions (Haggard, 1958). In each of these cases, for both continuous and binary assessments, ρ can be estimated from an appropriate one-way ANOVA (Fisher, 1925; Elston, 1977). The question to be addressed here concerns the optimal combination (n, k) for the design that permits the most accurate estimation of ρ.

For a fixed number of replicates, Donner and Eliasziw (1987) have provided contours of exact power for selected values of k and n. Eliasziw and Donner (1987) used these power results to identify optimal designs that minimize the study costs. Walter et al.
(1998) developed an approximation that allows the calculation of the required number of subjects, k, when the number of replicates n is fixed. The approximation avoids the intensive numerical work entailed in the exact method. We note, however, that reliability studies are often designed primarily to estimate the level of observer agreement, and their results are then reported in terms of estimates of agreement, rather than using hypothesis testing. Power considerations in the design of reliability studies require specifications of the hypotheses to be tested; rejection of the null value in a study does not provide useful information, because the investigator needs to know more than the fact that the observed level of reliability is unlikely to be due to chance (Giraudeau and Mary, 2001). Given that inter-rater reliability studies emphasize estimation, it is natural to base their sample size calculations on the attainment of a specified level of precision in the estimation of ρ. For example, Bonett (2002) calculated the sample

size required to achieve a prescribed expected width for the confidence interval on ρ. This is similar to Donner's approach (1999), which focused on estimating the number of subjects needed to construct a confidence interval with fixed width on the intraclass kappa, for the case of a dichotomous outcome measure.

In the following development, we assume that the investigator is interested in the number of replicates, n, per subject, so that the variance of the estimator for ρ is minimized, given that the total number of measurements is constrained to be N = nk a priori. In Section 2, we provide background information on the estimation of the ICC and a variance expression for its moment estimator. In Section 3, we use calculus of optimization to find the optimal combination (n, k) that minimizes the variance of the estimate of the ICC when the response variable is continuously distributed. We also examine the situation where both k and n are determined in such a way that the variance of the estimator for ρ is minimized subject to cost constraints. We devote Section 4 to the issue of optimal design when the assessments are binary. Further discussion is presented in Section 5.

2. Background

Under the random effects model given by (1), the variance component estimator for ρ is given as

$$r = (\mathrm{MSB} - \mathrm{MSW})/\{\mathrm{MSB} + (n-1)\mathrm{MSW}\},$$

where MSB and MSW are, respectively, the between-subject and within-subject mean squares, obtained from the familiar one-way ANOVA. The derivation of r requires no assumption concerning the normality of the $Y_{ij}$. However, if we do assume that the $Y_{ij}$ are normally distributed, we can use an approximate expression for the large sample variance of r (Fisher, 1925; Swiger et al., 1964), as follows:

$$\operatorname{var}(r) = V(k, n, \rho) = \frac{2(1-\rho)^2\{1+(n-1)\rho\}^2}{kn(n-1)}. \qquad (2)$$

Note that an approximate $100(1-\alpha)\%$ confidence interval on ρ can be given as $r \pm z_{1-\alpha/2}\sqrt{V(k, n, \rho)}$, where $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$ percentile of the standard normal distribution.
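As a minimal sketch of how equation (2) and the approximate interval might be evaluated in practice, the Python fragment below plugs a point estimate into V(k, n, ρ). The function names are illustrative assumptions and do not come from the paper; substituting r for ρ in the variance is the usual plug-in step and is likewise an assumption of this sketch.

```python
import math
from statistics import NormalDist


def icc_variance(k, n, rho):
    """Large-sample variance of r from equation (2); requires n >= 2."""
    return 2.0 * (1.0 - rho) ** 2 * (1.0 + (n - 1) * rho) ** 2 / (k * n * (n - 1))


def icc_confidence_interval(r, k, n, alpha=0.05):
    """Approximate 100(1 - alpha)% interval r +/- z * sqrt(V(k, n, r)).

    The unknown rho in V is replaced by the estimate r (plug-in assumption).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # 100(1 - alpha/2) percentile
    half_width = z * math.sqrt(icc_variance(k, n, r))
    return r - half_width, r + half_width


if __name__ == "__main__":
    # Hypothetical design: 30 subjects, 2 replicates each, estimate r = 0.8.
    print(icc_variance(30, 2, 0.8))
    print(icc_confidence_interval(0.8, 30, 2))
```

Because the interval is symmetric and unbounded, it can extend beyond 1 when r is close to 1 and k is small; the sketch does not truncate it.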
This interval depends on (i) the number of subjects k, (ii) the number of replicates n, and (iii) the point estimate of the ICC. Giraudeau and Mary (2001) suggested that a reliability study should be planned with regard to the width of the confidence interval of the ICC, in the same way as is usually done in a descriptive prevalence study (see Sukhatme et al., 1984, p. 45), or as in Freedman, Parmar, and Baker (1993) in estimating the probability of observer agreement. In planning a reliability study, we propose an approach similar to that adopted in survey samples aimed at estimating the population mean of some outcome variable (Sukhatme et al., 1984, p. 284). We provide an explicit expression for the required number of replicates, so that researchers can manipulate the value of ρ and cost, to evaluate their effects on the optimal allocation scheme defined by (n, k).

3. Optimal Allocation for Continuous Response Variable

3.1. The Normal Case

We will assume that, because of resource limitations, a reliability study is planned with a total of N = nk observations. Formally, we therefore need to decide on the optimal allocation of the observations to minimize (2), subject to N = nk being fixed a priori; ρ is assumed known. Here, we apply the basic idea in the method of constrained variation, using direct substitution. Substitution of N = nk gives

$$\operatorname{var}(r) = f(n, \rho) = \frac{2(1-\rho)^2\{1+(n-1)\rho\}^2}{N(n-1)}. \qquad (3)$$

Necessary and sufficient conditions for $f(n, \rho)$ to have a unique minimum are given by Rao (1984, p. 53). Differentiating f with respect to n, equating to zero, and solving for n, we obtain

$$n_0 = (1+\rho)/\rho. \qquad (4)$$

Note that we restrict our investigation to the values of ρ that are strictly positive, since within the framework of reliability studies, negative values of ρ are meaningless.
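The allocation rule in (4) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, and instead of rounding n to the nearest integer the sketch compares the two integers on either side of the continuous optimum and keeps the one with the smaller variance from equation (2).

```python
import math


def icc_variance(k, n, rho):
    """Large-sample variance of r from equation (2); requires n >= 2."""
    return 2.0 * (1.0 - rho) ** 2 * (1.0 + (n - 1) * rho) ** 2 / (k * n * (n - 1))


def optimal_replicates(rho):
    """Continuous optimum n0 = (1 + rho) / rho from equation (4)."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    return (1.0 + rho) / rho


def integer_allocation(total, rho):
    """Return (n, k, var) for a budget of roughly `total` = n * k measurements.

    Assumption of this sketch: try floor(n0) and ceil(n0), never fewer than
    2 replicates, and keep whichever gives the smaller variance.
    """
    n0 = optimal_replicates(rho)
    candidates = {max(2, math.floor(n0)), max(2, math.ceil(n0))}
    best = None
    for n in sorted(candidates):
        # Python's round() sends exact halves to the nearest even integer.
        k = max(1, round(total / n))
        v = icc_variance(k, n, rho)
        if best is None or v < best[2]:
            best = (n, k, v)
    return best


if __name__ == "__main__":
    for rho in (0.5, 0.6, 0.8):
        print(rho, optimal_replicates(rho), integer_allocation(60, rho))
```

For N = 60 and ρ = 0.8 the continuous optimum is 2.25 replicates, and the sketch selects n = 2 with k = 30.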
In practice, only integer values of (n, k) are used, and because N = nk is fixed a priori, optimum values of n were first rounded to the nearest integer; then k = N/n was rounded to the nearest integer as well. The values of var(r) at the optimal and appropriately rounded allocations for different values of N and ρ are listed in Table 1. We note that the net loss or gain in precision due to rounding is negligible.

We observe from Table 1 that a higher number of replicates (n) leads to a smaller number of subjects, which may reduce the generalizability of the study. In addition, when ρ is expected to be larger than 0.6, which is the case in many reliability studies, the results in Table 1 suggest that the study be planned with no more than two or three replicates per subject. This guideline is quite similar to that proposed by Giraudeau and Mary (2001), based on the attainment of a specified width for the 95% confidence interval of the ICC. This is also consistent with the results reported in Table 3 of Walter et al. (1998).

3.2. The Nonnormal Case

As indicated above, the sampling distribution and formula for the variance of the reliability estimates rely on the normality assumptions, despite the fact that real data seldom satisfy these assumptions. We might expect that normality would be only approximately satisfied, at best. A similar problem exists for statistical inference in the one-way random effects ANOVA model, although it has been found that the F-distribution of the ratio of mean squares is quite robust with respect to nonnormality under certain conditions. Scheffé (1959) investigated the effects of nonnormality, concluding that it has little effect on inferences on mean values, but serious effects on inferences concerning variances of random effects whose kurtosis γ differs from zero (p. 345).
Although Scheffé's conclusions were based on inferences for the variance ratio $\phi = \sigma_s^2/\sigma_e^2$, they may have similar implications for the reliability parameter $\rho = \phi/(1+\phi)$. Tukey (1956) obtained the variance of the variance component estimates under various ANOVA models by employing polykays. For the one-way random effects model, together

with the delta method (Kendall and Stuart, 1986), it can be shown that, to a first-order approximation,

$$\operatorname{var}(r) = 2(1-\rho)^2\{1+(n-1)\rho\}^2\{kn(n-1)\}^{-1} + \rho^2(1-\rho)^2 k^{-1}(\gamma_s + \gamma_e n^{-1}), \qquad (5)$$

where $\gamma_s = E(s_i^4)/\sigma_s^4$ and $\gamma_e = E(e_{ij}^4)/\sigma_e^4$ (Hemmersley, 1949). Following the same optimization procedure as in Section 3.1, we find that the optimal value for n, say $n^*$, is

$$n^* = 1 + \{\rho(1+\gamma_s)^{1/2}\}^{-1}. \qquad (6)$$

Clearly, when $\gamma_s = 0$, then $n^* = n_0$ (equation [4]). Moreover, for large values of $\gamma_s$, i.e., increased departure from normality, a smaller number of replicates is needed, implying that a proportionally larger number of subjects (k) should be recruited to ensure precise estimation of ρ. We therefore recommend the same recruitment strategy as in the normal case.

Table 1. Optimal combinations of (n, k), their rounded values, and the corresponding minimized values of var(r), for ρ = 0.1 (0.1) 0.9 and fixed N = nk = 60, 90, 120. Column headings: N, ρ, and three groups of n, k, var(r).

3.3. Cost Implications

It has long been recognized that funding constraints determine the recruitment costs of a reliability study. The crucial decision in a typical study is to balance the cost of recruiting subjects with the need for a precise estimate of ρ. There have been some attempts to address the issue of power, rather than precision, in the presence of funding constraints. Eliasziw and Donner (1987) presented a method to determine the number of subjects, k, and number of replications, n, that minimize the overall cost of conducting a reliability study, while still providing acceptable power for tests of hypotheses concerning ρ. They also provided tables showing optimal choices of k and n under various cost constraints. In this section, we shall determine the combinations (n, k) that minimize the variance of r, as given by (2), subject to cost constraints.
In our attempt to construct a flexible cost function, we adhere to the general guidelines identified by Flynn, Whitley, and Peters (2002), and Eliasziw and Donner (1987). First, one has to identify the approximate sampling and overhead costs. The sampling cost depends primarily on the size of the sample, and includes costs for data collection, travel, management, and other staff. On the other hand, overhead costs (such as the cost of setting the data collection form) remain fixed, regardless of sample size. Following Sukhatme et al. (1984, p. 284), we assume that the overall cost function is given as

$$C = c_0 + k c_1 + n k c_2, \qquad (7)$$

where $c_0$ is the fixed cost, $c_1$ is the cost of recruiting a single subject, and $c_2$ is the cost of making one observation. Using the method of Lagrange multipliers (Rao, 1984), we form the objective function G as

$$G = \operatorname{var}(r) + \lambda(C - c_0 - k c_1 - n k c_2),$$

where var(r) is given by (2) and λ is the Lagrange multiplier. The necessary and sufficient conditions for var(r) to have a constrained relative minimum are given by a theorem of Rao (1984, p. 68). Differentiating G with respect to n, k, and λ, and equating to zero, we obtain

$$n^3 \rho c_2 - n^2 c_2 (1+\rho) - n c_1 (2-\rho) + (1-\rho) c_1 = 0, \qquad (8)$$

$$\lambda = \frac{2(1-\rho)^2\{1+(n-1)\rho\}\{1 - 2n + (n-1)\rho\}}{k^2 n^2 (n-1)^2 c_2},$$

and

$$k = (C - c_0)/(c_1 + n c_2). \qquad (9)$$

The third-degree polynomial in (8) has three roots. Using Descartes' rule of signs, we predict that there are two positive or two complex conjugate roots and exactly one negative root. Furthermore, since $c_1, c_2 > 0$ and $0 < \rho < 1$, we conclude that there are indeed two (real) positive roots, one of which is

always between 0 and 1, because the left-hand side of (8) equals $(1-\rho)c_1 > 0$ at n = 0 and $-(c_1 + c_2) < 0$ at n = 1. This conveniently leaves us with only one relevant solution for the optimal value of n. The explicit expression for this optimal solution is

$$n_{\mathrm{opt}} = \{A^{1/3}/\rho - B + (1+\rho)/\rho\}/3,$$

where

$$A = 9R(\rho^3 - \rho^2 + \rho) + (\rho+1)^3 + 3\rho\,[3R\{(R+1)^2\rho^4 - (6R^2+4R-2)\rho^3 + 12R(R+1)\rho^2 - (8R^2+10R+2)\rho - R - 1\}]^{1/2},$$

$$B = \{3R\rho(\rho-2) - (\rho+1)^2\}/(\rho A^{1/3}),$$

and $R = c_1/c_2$. Clearly, $n_{\mathrm{opt}}$ depends on ρ and R (the cost of recruiting a subject relative to the cost of making one observation). Once the value of $n_{\mathrm{opt}}$ is determined, then from (9), the optimal k is

$$k_{\mathrm{opt}} = \{(C - c_0)/c_1\}/(1 + n_{\mathrm{opt}}/R). \qquad (10)$$

The numerator of $k_{\mathrm{opt}}$ defines the resource available for total recruitment and measurement relative to the recruitment cost per subject. We note, from (10), that $n_{\mathrm{opt}}$ and $k_{\mathrm{opt}}$ are inversely related.

The results of the optimization procedure appear in Table 2 for ρ = 0.4 (0.1) 0.9 and R = 0.01, 0.05, 0.25, 1, 5, 25, 50, and 100. It is apparent from Table 2 that, as R increases (decreases), the number of measurements per subject $n_{\mathrm{opt}}$ increases (decreases), while the number of subjects $k_{\mathrm{opt}}$ decreases (increases). On the other hand, when R is fixed, an increase in the value of ρ results in a decrease in the number of replicates and an increase in the number of subjects. This trend reflects two intuitive facts. The first is that it is sensible to decrease the number of items associated with a higher cost, and increase those with a lower cost. The second is that when ρ is large (high reproducibility), fewer replicates per subject are needed, while a larger number of subjects should be recruited, ensuring that ρ is estimated with appreciable precision. This remark is similar to the conclusion reached in the previous section, when costs were not explicitly considered.

Table 2. Optimal values of n that minimize var(r) for ρ = 0.4 (0.1) 0.9 and R = 0.01, 0.05, 0.25, 1, 5, 25, 50, and 100.
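The cost-constrained optimum can also be found numerically rather than through the closed-form root. The sketch below is an illustration under assumptions: it divides (8) by $c_2$ to work with R = c_1/c_2, then uses plain bisection on the interval n > 1, where the cubic changes sign exactly once. The function names and the bisection approach are choices made for this example and are not taken from the paper.

```python
def cost_cubic(n, rho, R):
    """Left-hand side of equation (8) divided by c2, with R = c1 / c2."""
    return rho * n ** 3 - (1.0 + rho) * n ** 2 - R * (2.0 - rho) * n + R * (1.0 - rho)


def optimal_replicates_with_cost(rho, R, tol=1e-10):
    """Root of (8) that exceeds 1, located by bisection (sketch, not closed form)."""
    if not 0.0 < rho < 1.0 or R < 0.0:
        raise ValueError("need 0 < rho < 1 and R >= 0")
    lo, hi = 1.0, 2.0          # the cubic equals -(1 + R) < 0 at n = 1
    while cost_cubic(hi, rho, R) < 0.0:
        hi *= 2.0              # expand until the cubic turns positive
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cost_cubic(mid, rho, R) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def optimal_subjects(n, total_cost, c1, c2, c0=0.0):
    """Equation (9): k = (C - c0) / (c1 + n * c2), before rounding."""
    return (total_cost - c0) / (c1 + n * c2)


if __name__ == "__main__":
    n_opt = optimal_replicates_with_cost(rho=0.9, R=1.0)
    print(round(n_opt, 2))                       # about 2.57
    print(optimal_subjects(3, 1600.0, 15.0, 15.0))
```

With R = 0 the cubic reduces to $n^2\{\rho n - (1+\rho)\}$, so the same routine recovers $n_0 = (1+\rho)/\rho$ from equation (4).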
Finally, we note also that by setting c_1 = 0 in (8), i.e., R = 0, we obtain n_opt = (1 + ρ)/ρ, as in (4). This means that a special cost structure is implied in the optimal allocation discussed in Section 3.1.

Example. To assess the accuracy of Doppler echocardiography (DE) in determining aortic valve area (AVA) in a prospective evaluation of patients with aortic stenosis, an investigator wishes to demonstrate a high degree of reliability (ρ = 90%) in estimating AVA using the velocity integral method. Suppose that the total cost of making the study is fixed at $1600. We assume that the travel cost for a patient in going from the health center to the tertiary hospital (where the procedure is done) is $15. The administrative cost of the procedure and the cost of using the DE is $15 per visit, so that R = c_1/c_2 = 1. It is assumed that c_0, the overhead cost, is absorbed by the hospital. From Table 2, n_opt for R = 1 and ρ = 0.9 is 2.57, which should be rounded up to 3. From (10), k_opt = (1600/15)/(1 + 3) = 26.7, which we round to 27; that is, we need 27 patients, with 3 measurements each. The minimized value of var(r) follows from (2) with k = 27, n = 3, and ρ = 0.9.

4. Optimal Allocation for Dichotomous Assessments

When assessing interrater reliability, a choice must be made on how to measure the condition under investigation. One of the practical aspects of this decision concerns the relative advantages of measuring the trait on a continuous scale, as discussed in the previous sections, or on a dichotomous scale. In many medical screening programs, and in social sciences and psychology studies, it is often more feasible to record the subject's response on a dichotomous scale (such as presence/absence). If this approach is adopted, the issue of optimal allocation becomes very important because, as was demonstrated by Donner and Eliasziw (1994), the loss of power associated with measuring the trait on a dichotomous scale is quite severe, and frequently prohibitive.
Our primary focus in this section is the determination of the optimal allocation of fixed N = nk, so that the variance of the estimate of ρ is minimized when the response variable is dichotomous. Fleiss and Cuzick (1979) provided an example where the characteristic under investigation was the presence or absence of schizophrenia in hospitalized mental patients. Let $Y_{ij}$ be the jth rating made on the ith subject, where $Y_{ij} = 1$ if the condition is present, and 0 otherwise. Analogous to the continuous case, Landis and Koch (1977) employed the one-way random effects model (1), but without the normality assumption being imposed on either $e_{ij}$ or $s_i$. In this context, the standard assumption for the $Y_{ij}$ corresponding to the above ANOVA model is $E(Y_{ij}) = \pi = \Pr(Y_{ij} = 1)$, and $\sigma^2 = \operatorname{var}(Y_{ij}) = \pi(1-\pi)$. Moreover, let $\delta = \Pr(Y_{ij} = 1, Y_{il} = 1) = E(Y_{ij} Y_{il})$. Then, it follows for $j \neq l$ and $i = 1, 2, \ldots, k$, that

$$\delta = \operatorname{cov}(Y_{ij}, Y_{il}) + E(Y_{ij})E(Y_{il}) = \rho\pi(1-\pi) + \pi^2,$$

where ρ is the (within-subject) ICC. The probability that two given measurements from the same subject will have the same response is

$$P_o = \delta + (1-\pi)^2 + \rho\pi(1-\pi) = 1 - 2\pi(1-\pi)(1-\rho).$$

When ρ = 0, this probability (probability of agreement by chance) reduces to $P_e = 1 - 2\pi(1-\pi)$. Therefore, the parameter ρ has a kappa-type probabilistic interpretation (Mak, 1988), i.e.,

$$\kappa = \frac{P_o - P_e}{1 - P_e} = \frac{\{1 - 2\pi(1-\pi)(1-\rho)\} - \{1 - 2\pi(1-\pi)\}}{1 - \{1 - 2\pi(1-\pi)\}} = \rho.$$

The ANOVA estimate for ρ is given by $\hat{\rho} = (\mathrm{MSB} - \mathrm{MSW})/\{\mathrm{MSB} + (n-1)\mathrm{MSW}\}$, where MSB and MSW are functions of the ith subject's total $Y_i = \sum_{j=1}^{n} Y_{ij}$ (Fleiss, 1981, p. 226). Crowder (1978) demonstrated the equivalence of the ANOVA model and the well-known common-correlation

model that occurs when, conditional on the subject effect $\mu_i$, the subject's total $Y_i$ has a binomial distribution, with conditional mean and variance given by $E(Y_i \mid \mu_i) = n\mu_i$ and $\operatorname{var}(Y_i \mid \mu_i) = n\mu_i(1-\mu_i)$, with $\mu_i$ assumed to follow a beta distribution with density function

$$f(\mu_i) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\mu_i^{\alpha-1}(1-\mu_i)^{\beta-1},$$

with the appropriate parameterization $\alpha = \pi(1-\rho)/\rho$ and $\beta = (1-\pi)(1-\rho)/\rho$. Therefore, the ANOVA model and the beta-binomial model are virtually indistinguishable (Cox and Snell, 1989).

For the nonnormal case, the optimal number of replicates under the ANOVA model was found to be $n^* = 1 + \{\rho(1+\gamma_s)^{1/2}\}^{-1}$ (equation [6]). Since $\gamma_s$ is the kurtosis of the subject effect distribution, one may use the kurtosis of the beta distribution (the subject random-effect distribution for binary data) to determine the optimal number of replications in the case of a dichotomous response. One can derive $\gamma_s$ for the beta distribution from Kendall and Stuart (1986, p. 73), from which $\gamma_s = m_4/m_2^2$, where $m_4$ and $m_2$ are, respectively, the fourth and the second central moments of the beta distribution. Substituting $\gamma_s$ into (6), we obtain

$$n^* = 1 + \pi(1-\pi)\{(1+\rho)(1+2\rho)/\psi(\pi, \rho)\}^{1/2}, \qquad (11)$$

where

$$\psi(\pi, \rho) = \pi[\rho + \pi(1-\rho)][2\rho + \pi(1-\rho)][3\rho + \pi(1-\rho) - 4\pi(1+2\rho)] + (1+\rho)(1+2\rho)[6\pi^3(1-\pi)\rho + 3\pi^4 + \pi^2(1-\pi)^2\rho^2].$$

In contrast to the continuous measurement model, the optimal allocations in the case of dichotomous assessments depend on π, the mean of the binary response variable. Table 3 shows the optimal number of replicates $n^*$, the corresponding optimal number of subjects, $k = N/n^*$, and var(r) at the optimal and appropriately rounded allocations, for N = 60, ρ = 0.3 (0.1) 0.9.

Table 3. Optimal allocation and the minimized values of var(r) for N = 60 and ρ = 0.3 (0.1) 0.9, for a dichotomous response. Column headings: π, ρ, and three groups of n, k, var(r).
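Equation (11) is tedious to evaluate by hand, so a short Python sketch may help. The function names are illustrative assumptions; the code simply transcribes ψ(π, ρ) and (11) as written above and then cross-checks one value against equation (6) using the known kurtosis of a Beta(1/2, 1/2) distribution.

```python
import math


def psi(pi, rho):
    """psi(pi, rho) as defined alongside equation (11)."""
    a = rho + pi * (1.0 - rho)
    b = 2.0 * rho + pi * (1.0 - rho)
    c = 3.0 * rho + pi * (1.0 - rho) - 4.0 * pi * (1.0 + 2.0 * rho)
    tail = (6.0 * pi ** 3 * (1.0 - pi) * rho
            + 3.0 * pi ** 4
            + pi ** 2 * (1.0 - pi) ** 2 * rho ** 2)
    return pi * a * b * c + (1.0 + rho) * (1.0 + 2.0 * rho) * tail


def optimal_replicates_binary(pi, rho):
    """Continuous optimum n* from equation (11) for a dichotomous response."""
    if not (0.0 < pi < 1.0 and 0.0 < rho < 1.0):
        raise ValueError("pi and rho must lie strictly between 0 and 1")
    return 1.0 + pi * (1.0 - pi) * math.sqrt(
        (1.0 + rho) * (1.0 + 2.0 * rho) / psi(pi, rho))


if __name__ == "__main__":
    # pi = rho = 0.5 corresponds to alpha = beta = 0.5, whose kurtosis
    # m4 / m2^2 is 1.5, so equation (6) gives 1 + 1 / (0.5 * sqrt(2.5)).
    print(optimal_replicates_binary(0.5, 0.5))
    print(1.0 + 1.0 / (0.5 * math.sqrt(2.5)))
    # Values for pi and 1 - pi should agree.
    print(optimal_replicates_binary(0.3, 0.6), optimal_replicates_binary(0.7, 0.6))
```

Integer allocations would then follow by rounding n* and setting k = N/n*, as in the continuous case.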
For simplicity, we assigned a value of 0 to $\gamma_e$ while computing var(r) using (5). We note that, for fixed N, the allocations are equivalent for π and 1 − π, and therefore we have restricted the values of π to 0.1, 0.3, and 0.5. We also note that as π approaches 0.5, the number of replicates increases while the number of subjects decreases. On the other hand, as ρ increases, the number of replicates decreases and the number of subjects increases. Moreover, similar to Table 1, the rounding has a negligible effect on the efficiency of the ICC estimate.

5. Discussion

A crucial decision that a researcher faces in the design stage of a reliability study is the determination of the number of subjects k and the number of measurements per subject n. When we have prior knowledge of what constitutes an acceptable level of reliability, a hypothesis testing approach may be used, and the sample size calculations can then be performed using the methods of Donner and Eliasziw (1987) and Walter et al. (1998). However, in most cases, values of the reliability coefficient under the null and alternative hypotheses may be difficult to specify. For instance, the estimated value of the ICC depends on the degree of heterogeneity among the sampled subjects: the greater the heterogeneity, the higher the value of the ICC. Since most reliability studies focus on the estimation of the ICC with sufficient precision, the guidelines provided in this article, which we based on principles of mathematical optimization, allow an investigator to select the pair (n, k) that maximizes the precision of the estimated reliability index. Our proposed approach is quite simple and produces estimates of (n, k) that are in close agreement with results based on considerations of power.

An interesting finding from our results is that, regardless of whether the assessments are continuous or binary, the variance is minimized with a small number of replicates, as long as the true index of reliability remains reasonably high.
In many clinical investigations, reliability of at least 60% is required, to provide a method of measurement that has practical utility. Under such circumstances, one can safely recommend making only two or three observations per subject. We note

that cost implications for dichotomous assessments are quite important but, because of lack of space, we intend to report on this issue in a future article.

Acknowledgement

We thank the associate editor and two anonymous referees for their constructive comments. We appreciate the support for this research by the Research Center Administration and NSERC Canada.

References

Bonett, D. G. (2002). Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine 21.
Cox, D. R. and Snell, E. J. (1989). Analysis of Binary Data, 2nd edition. London: Chapman and Hall.
Crowder, M. (1978). Beta-binomial ANOVA for proportions. Applied Statistics 27.
Donner, A. (1999). Sample size requirements for interval estimation of the intraclass kappa statistic. Communications in Statistics Simulation 28(2).
Donner, A. and Eliasziw, M. (1987). Sample size requirements for reliability studies. Statistics in Medicine 6.
Donner, A. and Eliasziw, M. (1994). Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics 50.
Dunn, G. (1992). Design and analysis of reliability studies. Statistical Methods in Medical Research 1.
Eliasziw, M. and Donner, A. (1987). A cost-function approach to the design of reliability studies. Statistics in Medicine 6.
Elston, R. (1977).
Response to query: Estimating heritability of a continuous trait. Biometrics 33.
Fisher, R. A. (1925). Statistical Methods for Research Workers. London: Oliver and Boyd.
Fleiss, J. (1981). Statistical Methods for Rates and Proportions, 2nd edition. New York: Wiley.
Fleiss, J. and Cuzick, J. (1979). The reliability of dichotomous judgments: Unequal number of judgments per subject. Applied Psychological Measurement 3.
Flynn, N. T., Whitley, E., and Peters, T. (2002). Recruitment strategy in a cluster randomized trial: Cost implications. Statistics in Medicine 21.
Freedman, L., Parmar, M., and Baker, S. (1993). The design of observer agreement studies with binary assessments. Statistics in Medicine 12.
Giraudeau, B. and Mary, J. Y. (2001). Planning a reproducibility study: How many subjects and how many replicates per subject for an expected width of the 95 per cent confidence interval of the intraclass correlation coefficient. Statistics in Medicine 20.
Haggard, E. R. (1958). Intraclass Correlation and the Analysis of Variance. New York: Dryden Press.
Hemmersley, I. M. (1949). The unbiased estimate and standard error of the intraclass variance. Metron 15.
Kendall, M. and Stuart, A. (1986). The Advanced Theory of Statistics, Volume I. London: Griffin.
Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33.
Mak, T. K. (1988). Analyzing intraclass correlation for dichotomous variables. Applied Statistics 20.
Rao, S. S. (1984). Optimization: Theory and Applications, 2nd edition. New Delhi: Wiley Eastern.
Scheffé, H. (1959). The Analysis of Variance. New York: Wiley.
Shoukri, M. M. (1999). Agreement. In Encyclopedia of Biostatistics, P. Armitage and T. Colton (eds). New York: Wiley.
Shoukri, M. M. (2000). Agreement. In Encyclopedia of Epidemiology, M. Gail (ed). New York: Wiley.
Shoukri, M. M. and Asyali, M. H. (2002). Issues of cost and power in the design of agreement studies.
Technical Report, Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia.
Sukhatme, P. V., Sukhatme, B. V., Sukhatme, S., and Asok, C. (1984). Sampling Theory of Surveys with Applications. Ames: Iowa State University Press.
Swiger, L. A., Harvey, W. R., Everson, D. O., and Gregory, K. E. (1964). The variance of the intraclass correlation involving groups with one observation. Biometrics 20.
Tukey, J. W. (1956). Variance of variance components: I. Balanced designs. Annals of Mathematical Statistics 27.
Walter, S. D., Eliasziw, M., and Donner, A. (1998). Sample size and optimal design for reliability studies. Statistics in Medicine 17.

Received October Revised May Accepted May 2003.
