A Bayesian Approach for Sample Size Determination in Method Comparison Studies
Kunshan Yin (a), Pankaj K. Choudhary (a,1), Diana Varghese (b) and Steven R. Goodman (b)

(a) Department of Mathematical Sciences, (b) Department of Molecular and Cell Biology, University of Texas at Dallas

Abstract

Studies involving two methods for measuring a continuous response are regularly conducted in health sciences to evaluate agreement of a method with itself and agreement between methods. However, to our knowledge, no statistical literature is available on how to determine sample sizes for planning such studies when the goal is simultaneous evaluation of intra- and inter-method agreement. We develop a simulation-based Bayesian methodology for this problem in a hierarchical model framework. Unlike a frequentist approach, it takes into account the uncertainty in parameter estimates. The proposed methodology can be used with any scalar measure of agreement currently available in the literature. Its application is illustrated with two real examples: one from proteomics and the other involving blood pressure measurement.

Key Words: Agreement; Concordance correlation; Fitting and sampling priors; Hierarchical model; Sample size determination; Tolerance interval; Total deviation index.

1 Corresponding author. Address: EC 35, PO Box ; University of Texas at Dallas; Richardson, TX ; USA. pankaj@utdallas.edu; Tel: (972) ; Fax: (972)
1 Introduction

Comparison of two methods for measuring a continuous response variable is a topic of considerable interest in health sciences research. In practice, a method may be an assay, a medical device, a measurement technique or a clinical observer. The variable of interest, e.g., blood pressure, heart rate, or cardiac stroke volume, is typically an indicator of the health of the individual being measured. Over 11,000 method comparison studies have been published in the biomedical literature in the past three decades. Such a study has two possible goals. The first is to determine the extent of agreement between two methods. If this inter-method agreement is sufficiently good, the methods may be used interchangeably, or the cheaper or more convenient method may be preferred. The second goal is to evaluate the agreement of a method with itself. The extent of this intra-method agreement gives an idea of the repeatability of a method and serves as a baseline for evaluating inter-method agreement. The topic of how to assess agreement between two methods has received quite a bit of attention in the statistical literature; see Barnhart, Haber and Lin (2007) for a recent review. There are several measures of agreement, such as limits of agreement (Bland and Altman, 1986), concordance correlation (Lin, 1989), mean squared deviation (Lin, 2000), coverage probability (Lin et al., 2002), and total deviation index or tolerance interval (Lin, 2000; Choudhary and Nagaraja, 2007), among others. On the other hand, the topic of how to plan a method comparison study, in particular sample size determination (SSD), has not received the same level of attention. We are not aware of any literature on SSD when the interest lies in evaluating both intra- and inter-method agreement. This involves determining the number of replicate measurements in addition to the number of individuals for the study.
Lin (1992), Lin, Whipple and Ho (1998), Lin et al. (2002), and Choudhary and Nagaraja (2007) do provide frequentist sample size formulas associated with the agreement measures of their interest, but these are restricted to the evaluation of inter-method agreement. However, a simultaneous evaluation of both intra- and inter-method agreement is crucial because the amount of agreement possible between two methods is limited by how much the methods agree with themselves. A new method, even one that is perfect and worthy of adoption, will not agree well with an established standard if the latter has poor agreement with itself (Bland and Altman, 1999). Our aim in this article is to develop a methodology for determining the number of individuals and the number of replicate measurements needed for a method comparison study to achieve a given level of precision of simultaneous inference regarding intra- and inter-method agreement. We take a Bayesian route to this SSD problem as it allows us to overcome an important limitation of the frequentist procedure. In the frequentist paradigm, one generally casts the problem of agreement evaluation in a hypothesis testing framework, and then derives an approximate sample size formula by performing an asymptotic power analysis on the lines of Lehmann (1998, Section 3.3). But this formula involves the asymptotic standard error of the estimated agreement measure, which is a function of the unknown parameters in the model. So typically one obtains their estimates, either from the literature or through a pilot study, and substitutes them in the SSD formula. However, by merely inserting the estimates in the formula, no account is taken of the variability in these estimates. This issue can be addressed in a Bayesian paradigm; see, e.g., the review articles of Adcock (1997) and Wang and Gelfand (2002), and the references therein. The proposed procedure can be used in conjunction with any scalar measure of agreement currently available in the literature. This article is organized as follows.
In Section 2, we describe the SSD methodology. We investigate its properties in Section 3, and illustrate its application in Section 4 with two real
examples. Section 5 concludes with a discussion. All the computations reported here are performed using the software packages R (R Development Core Team, 2006) and WinBUGS (Spiegelhalter et al., 2003). We now introduce the two examples. In both cases, we treat the available data as pilot data and use the estimates obtained from them to determine sample sizes for a future study. These examples represent two qualitatively different method comparison studies. As shown in Section 3, in the first example the measurements on the same individual, whether from the same method or from two different methods, appear only weakly correlated. This scenario is somewhat unusual. In contrast, the measurements are highly correlated in the second example, which is the more typical scenario. These distinct scenarios allow us to get a better insight into the workings of the proposed methodology.

Example 1 (Protein ratios): In this case, we are interested in SSD for a study to compare two software packages, XPRESS (Han et al., 2001) and ASAPRatio (Li et al., 2003), for computing abundance ratios of proteins in a pair of blood samples. The former is labor intensive and requires much user input, whereas the latter is automated to a large extent. These packages are used in proteomics for comparing protein profiles under two different conditions to discover proteins that may be differentially expressed. See, e.g., Liebler (2002) for a gentle introduction to proteomics. The response of specific interest to us is the protein abundance ratio in blood samples of a pair of healthy individuals. These ratios are expected to be near one since both samples come from healthy people. We have ratios of 8 proteins in blood samples of 5 pairs of individuals, computed by the two packages. Only a total of 60 ratios are available, as not all proteins are observed in every sample.
These data were collected in the Goodman lab (Department of Molecular and Cell Biology) at our institution to study the variation in the measurement technique.
Example 2 (Blood pressure): In this case, we consider the blood pressure data of Carrasco and Jover (2003). Here systolic blood pressure (in mmHg) is measured twice with each of two methods, a manual mercury sphygmomanometer and an automatic device (OMRON 711), on a sample of 384 individuals. The manual device serves as the reference method and the automatic device is the test method.

2 SSD for method comparison studies

Let y_{ijk}, k = 1, ..., n, j = 1, 2, i = 1, ..., m, represent the k-th replicate measurement from the j-th method on the i-th individual. Here m is the number of individuals and n is the common number of replicate measurements from each method on every individual. A common number of replicates is assumed to simplify the SSD problem. Replicate measurements refer to multiple measurements taken under identical conditions. In particular, the underlying true measurement does not change during replication. Any SSD procedure depends on the model assumed for the data, the inference of interest, and a measure of precision of inference. So we first describe them briefly.

2.1 Modelling the data

We assume that the data follow a hierarchical model

y_{ijk} | β_j, b_i, b_{ij}, σ_j² ~ independent N(β_j + b_i + b_{ij}, σ_j²),
b_i | ψ² ~ independent N(0, ψ²),
b_{ij} | ψ_I² ~ independent N(0, ψ_I²),   (1)

where β_j is the effect of the j-th method, b_i is the effect of the i-th individual, and b_{ij} is the individual x method interaction. Oftentimes, there is no need for the interaction term b_{ij} in (1). Moreover, when n = 1, this term is not identifiable and we drop it from the model.
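To make the data-generating mechanism concrete, model (1) is straightforward to simulate from. The following sketch is in Python (the paper's own computations use R and WinBUGS), and all parameter values shown are hypothetical:

```python
import numpy as np

def simulate_model1(m, n, beta, psi2, psi2_I, sigma2, seed=None):
    """Simulate y[i, j, k] from the hierarchical model (1):
    y_ijk = beta_j + b_i + b_ij + e_ijk, where b_i ~ N(0, psi2),
    b_ij ~ N(0, psi2_I), and e_ijk ~ N(0, sigma2[j])."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, np.sqrt(psi2), size=m)             # individual effects
    b_int = rng.normal(0.0, np.sqrt(psi2_I), size=(m, 2))  # individual x method
    y = np.empty((m, 2, n))
    for j in range(2):
        e = rng.normal(0.0, np.sqrt(sigma2[j]), size=(m, n))
        y[:, j, :] = beta[j] + b[:, None] + b_int[:, j][:, None] + e
    return y

# hypothetical parameter values for a small pilot-sized data set
y = simulate_model1(m=5, n=2, beta=[1.0, 1.1], psi2=0.04, psi2_I=0.01,
                    sigma2=[0.25, 0.30], seed=1)
```

The returned array has one row per individual, one column per method, and one slice per replicate, mirroring the index structure of (1).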
To complete the Bayesian specification of this model, we assume the following mutually independent prior distributions:

β_j ~ N(0, V_j²), ψ² ~ IG(A_ψ, B_ψ), ψ_I² ~ IG(A_I, B_I), σ_j² ~ IG(A_j, B_j), j = 1, 2;   (2)

where IG(A, B) represents an inverse gamma distribution: its reciprocal follows a gamma distribution with mean A/B and variance A/B². All the hyperparameters are positive and need to be specified. Choosing large values for V_j² and small values for the A's and B's typically leads to noninformative priors. For β_j, an alternative noninformative choice is an improper uniform distribution over the real line. It can also be thought of as taking V_j² = ∞ in (2). The resulting posterior is proper for fixed hyperparameter values. These prior choices are quite standard in the Bayesian literature (Gelman et al., 2003, ch 15). See also Browne and Draper (2006) and Gelman (2006) for some recent suggestions for priors on variance parameters. The joint posterior of the parameters under (2) is not available in a closed form. So we use the Gibbs sampler, a Markov chain Monte Carlo (MCMC) algorithm described in the Appendix, to simulate draws from the posterior distribution.

2.2 Evaluation of agreement

In what follows, we use γ as a generic notation for the vector of all relevant parameters without identifying them explicitly; they will be clear from the context. Let the random vectors (y_1, y_2) and (y_{j1}, y_{j2}) respectively denote the bivariate populations of measurements from the two methods on the same individual, and two replicate measurements from the j-th
method on the same individual. The model (1) postulates that

(y_1, y_2)' | γ ~ N_2( (β_1, β_2)', [ ψ² + ψ_I² + σ_1²,  ψ² ;  ψ²,  ψ² + ψ_I² + σ_2² ] ),

(y_{j1}, y_{j2})' | γ ~ N_2( (β_j, β_j)', [ ψ² + ψ_I² + σ_j²,  ψ² + ψ_I² ;  ψ² + ψ_I²,  ψ² + ψ_I² + σ_j² ] ).   (3)

The data in (1) need not be paired for the first joint distribution to be defined. Next, let d_12 = y_1 − y_2 and d_jj = y_{j1} − y_{j2} respectively denote the populations of inter-method differences and intra-method differences for the j-th method. It follows that

d_12 | γ ~ N(μ_12 = β_1 − β_2, τ_12² = 2ψ_I² + σ_1² + σ_2²),  d_jj | γ ~ N(0, τ_jj² = 2σ_j²).   (4)

Let θ_12 be a measure of agreement between the two methods and θ_jj be the corresponding measure of intra-method agreement for the j-th method. The former is a function of the parameters of the distribution of (y_1, y_2), whereas the latter is the same function of the parameters of the distribution of (y_{j1}, y_{j2}). We assume that they are scalar and non-negative, possibly after a simple monotonic transformation, and that their low values indicate good agreement. All the agreement measures mentioned in the introduction satisfy these criteria except the limits of agreement, which quantify agreement using a lower limit and an upper limit. Let U_12 and U_jj be upper credible bounds for θ_12 and θ_jj, respectively, with (1 − α) posterior probability. The agreement is considered adequate when these bounds are small. To compute them, we first fit the model (1) to the observed data with priors (2) and obtain a large number of draws from the joint posterior of the model parameters. These draws are then used to simulate from the posteriors of θ_12 and θ_jj. The (1 − α)-th sample quantiles of the draws of θ_12 and θ_jj are taken as the bounds U_12 and U_jj, respectively.
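As a sketch of this last step, suppose posterior draws of (μ_12, τ_12²) from (4) are already in hand, and take the total deviation index as the agreement measure; the bound U_12 is then just a sample quantile of the induced draws of θ_12. This is an illustrative Python fragment with simulated "posterior" draws standing in for actual MCMC output:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def tdi(mu, tau2, p0=0.90):
    """Total deviation index: the p0-th quantile of |d|, d ~ N(mu, tau2),
    obtained by solving Pr(|d| <= t) = p0 for t."""
    tau = np.sqrt(tau2)
    f = lambda t: norm.cdf((t - mu) / tau) - norm.cdf((-t - mu) / tau) - p0
    return brentq(f, 0.0, abs(mu) + 10.0 * tau)

def upper_credible_bound(theta_draws, alpha=0.05):
    """U: the (1 - alpha)-th sample quantile of posterior draws of theta."""
    return np.quantile(theta_draws, 1.0 - alpha)

# toy stand-in for thinned MCMC draws of (mu12, tau12^2)
rng = np.random.default_rng(2)
mu_draws = rng.normal(0.1, 0.02, 500)
tau2_draws = rng.uniform(0.8, 1.2, 500)
theta_draws = [tdi(mu, t2, p0=0.90) for mu, t2 in zip(mu_draws, tau2_draws)]
U12 = upper_credible_bound(theta_draws, alpha=0.05)
```

The same recipe applies to θ_jj, with (0, τ_jj²) from (4) in place of (μ_12, τ_12²).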
2.3 A measure of precision of inference

Consider for now only the evaluation of inter-method agreement with the credible bound U_12 for θ_12. It satisfies Pr(θ_12 ≤ U_12 | y) = 1 − α, where y represents the observed data. Although this condition ensures {θ_12 ≤ U_12} with a high posterior probability, it says nothing about how the posterior density of θ_12 is distributed near U_12. In particular, if a large portion of this density is concentrated in a region far below U_12, then the inference is not very precise. Various suggestions have been made in the literature for measuring the precision of a credible interval (see Wang and Gelfand, 2002 for a summary). However, they are relevant only for a two-sided interval, not a one-sided bound such as U_12. So below we develop a new measure of precision that is appropriate for upper bounds. Similar arguments can be used to develop a precision measure for lower bounds. Under certain regularity conditions, the posterior distribution of a population parameter converges to a point mass as m → ∞ (Gelman et al., 2003, ch 4). This suggests that one can focus on the posterior probability Pr(δU_12 ≤ θ_12 ≤ U_12 | y), for a specified δ ∈ (0, 1), in the search for a measure of precision. Further, considering that SSD takes place prior to data collection, our proposal is to take

E{ Pr(δU_12 ≤ θ_12 ≤ U_12 | y) } = 1 − α − E{ F_{12|y}(δU_12) },

or equivalently

T_12(m, n, δ) = E{ F_{12|y}(δU_12) },   (5)

as the measure of precision of U_12. The expectation here is taken over the marginal distribution of y, and F_{12|y} is the posterior cumulative distribution function (cdf) of θ_12. A small value of T_12 indicates a high precision of inference since it represents the probability mass below δU_12. Also, for a fixed (m, n), T_12 is an increasing function of δ, increasing from zero when δ = 0 to (1 − α) when δ = 1. For the bound U_jj, one can similarly take

T_jj(m, n, δ) = E{ F_{jj|y}(δU_jj) },  j = 1, 2,   (6)
as the measure of precision. We have T_11 = T_22 when identical prior distributions are used for the parameters of the two methods. For T_12 and T_jj to be useful in SSD, they should be decreasing functions of m and n, and should tend to zero as m → ∞, keeping everything else fixed. Analytically verifying the first property is hard, but simulation can be used to clarify the approximate monotonicity. The second property follows from a straightforward application of the asymptotic properties of a posterior distribution.

2.4 Computing the precision measure

The expectations T_12 and T_jj are not available in closed form under model (1). So we use the simulation-based approach of Wang and Gelfand (2002) to compute them. Prior to their work, Bayesian SSD focused mostly on normal and binomial one- and two-sample problems, partly because of the lack of a general approach for computing precision measures under hierarchical models, where they are typically not available in closed form. Their key idea is to distinguish between a fitting prior and a sampling prior. A prior that is used to fit a model is what they call a fitting prior. Frequently, this prior is noninformative and sometimes even improper, provided the resulting posterior is proper. On the other hand, for SSD in practice, we are usually interested in achieving a particular level of precision when γ is concentrated in a relatively small portion of the parameter space. This uncertainty in γ is captured using what they call a sampling prior. This prior is necessarily proper and informative. One choice is a uniform distribution over a bounded interval, but other distributions can also be used. In our case with model (1), the fitting prior is given by (2). Examples of sampling priors appear in Section 3. Let γ* denote a draw of γ from its sampling prior and y* be a draw of y from [y | γ*], where
[·] denotes a probability distribution. This y* alone is a draw from [y*], the marginal distribution of y under the sampling prior. By repeating this process we can obtain a large number of draws from [y*], say, y*_l, l = 1, ..., L. Following Wang and Gelfand (2002), we compute the expectations T_12 in (5) and T_jj in (6) with respect to [y*] and approximate them using Monte Carlo integration as

T*_12(m, n, δ) = E{ F_{12|y*}(δU*_12) } ≈ (1/L) Σ_{l=1}^{L} F_{12|y*_l}(δU*_12),
T*_jj(m, n, δ) = E{ F_{jj|y*}(δU*_jj) } ≈ (1/L) Σ_{l=1}^{L} F_{jj|y*_l}(δU*_jj),  j = 1, 2,   (7)

where U*_12 and U*_jj are the counterparts of U_12 and U_jj when y* is used in place of y. Notice here that the role of the sampling prior for γ is limited to generating y* for averaging in (7). The posterior cdfs in these expressions are computed using the fitting prior for γ. Since they too do not have simple forms, we use the posterior draws obtained via MCMC to approximate them. For a given y*_l, F_{12|y*_l}(δU*_12) is approximated by the proportion of posterior draws of θ_12 | y*_l that are less than or equal to δU*_12, where U*_12 is the (1 − α)-th sample quantile of all the draws. A similar strategy is used to approximate F_{jj|y*_l}(δU*_jj). As in the case of T_12 and T_jj, the approximate monotonicity of T*_12 and T*_jj with respect to m and n can be clarified using simulation. We investigate this in Section 3.

2.5 The SSD procedure

We would like to find sample sizes (m, n) that make T*_12 and T*_jj, defined in (7), sufficiently small. There may be several combinations of (m, n) that give the same precision of inference. So the cost of sampling must be taken into account to find the optimal combination. Let C_I and C_R respectively represent the fixed costs associated with sampling an individual and taking a replicate from the two methods. Thus the total cost of sampling n replicates from
both methods on each of m individuals is

C_T(m, n) = m(C_I + n C_R),  C_I, C_R > 0.   (8)

We now propose the following SSD strategy: Specify the sampling priors, a large δ ∈ (0, 1) and a small (1 − β) ∈ (0, 1 − α). Find (m, n) such that

max{ T*_12(m, n, δ), T*_11(m, n, δ), T*_22(m, n, δ) } ≤ 1 − β,   (9)

and C_T(m, n) is minimized. The condition in (9) ensures that the posterior probabilities of the intervals [δU*_12, U*_12] and [δU*_jj, U*_jj], j = 1, 2, are all at least (β − α), providing sufficiently precise inference. In practice, one may take δ ≈ 0.75 and a small value of (1 − β).

3 Simulation study

We now use simulation to verify the monotonicity of the averages in (7) for four agreement measures: concordance correlation (Lin, 1989), coverage probability (Lin et al., 2002), mean squared deviation (Lin, 2000), and total deviation index (Lin, 2000; Choudhary and Nagaraja, 2007). They are defined as follows for the evaluation of inter-method agreement.

Concordance correlation: 2 cov(y_1, y_2) / [ (E(y_1) − E(y_2))² + var(y_1) + var(y_2) ].
Coverage probability: Pr(|y_1 − y_2| ≤ q_0) for a specified small q_0.
Mean squared deviation: E((y_1 − y_2)²).
Total deviation index: the p_0-th quantile of |y_1 − y_2| for a specified large p_0.

The first one is based on the joint distribution of (y_1, y_2), whereas the others are based on the distribution of y_1 − y_2. For the evaluation of intra-method agreement of the j-th method, these measures are defined simply by substituting (y_{j1}, y_{j2}) in place of (y_1, y_2). Table 1 gives
their expressions assuming the distributions (3) and (4). The first two measures range over (0, 1) and their large values indicate good agreement. For the purpose of SSD, we transform them as θ_12 = −log φ_12 and θ_jj = −log φ_jj, where the φ's are defined in Table 1, to satisfy the assumptions on the θ's that they lie in (0, ∞) and their small values indicate good agreement. On the other hand, the last two measures range over (0, ∞) and their small values indicate good agreement. No transformation is needed in this case, i.e., θ_12 = φ_12 and θ_jj = φ_jj. It may be noted that the usual range of the concordance correlation is (−1, 1), but here it is restricted to be positive because cov(y_1, y_2) and cov(y_{j1}, y_{j2}) under model (1) are positive. Moreover, the intra-method versions of the last three measures depend on the model parameters only through σ_j². For the simulation study, we focus on the two examples introduced in Section 1. First, we need to choose sampling priors for the parameters in model (1). We now describe how we construct them by fitting appropriate models to the pilot data in the two examples. In the first example, we take the ASAPRatio software as method 1 and the XPRESS software as method 2. After a preliminary analysis, we model the available data as

y_{ijr} | β_j, b_i, protein_r, σ_j² ~ independent N(β_j + b_i + protein_r, σ_j²),
b_i ~ independent N(0, ψ²);  i = 1, ..., m = 5, j = 1, 2, r = 1, ..., 8;

where y_{ijr} represents the ratio of the r-th protein from the j-th method on the i-th blood sample pair. The interaction term b_{ij} is not included here as the data do not have enough information to estimate it. To fit this model, we assume mutually independent (fitting) priors: uniform(−10³, 10³) for the β's and protein effects, and IG(10⁻³, 10⁻³) for the variance parameters. The model is fitted using the WinBUGS package of Spiegelhalter et al. (2003) by calling it from R through the R2WinBUGS package of Sturtz, Ligges and Gelman (2005).
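Under the distributions (3) and (4), the inter-method versions of the four measures reduce to closed forms in the model parameters. The following Python sketch mirrors those expressions (parameter values are hypothetical; the function name is ours, not the paper's):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def inter_method_measures(beta, psi2, psi2_I, sigma2, q0=np.log(1.5), p0=0.80):
    """Inter-method agreement measures under model (1), using
    d12 ~ N(mu12, tau12^2), mu12 = beta1 - beta2,
    tau12^2 = 2*psi2_I + sigma1^2 + sigma2^2, as in (4)."""
    mu = beta[0] - beta[1]
    tau2 = 2.0 * psi2_I + sigma2[0] + sigma2[1]
    tau = np.sqrt(tau2)
    ccc = 2.0 * psi2 / (mu**2 + 2.0 * psi2 + tau2)               # concordance corr.
    cp = norm.cdf((q0 - mu) / tau) - norm.cdf((-q0 - mu) / tau)  # coverage prob.
    msd = mu**2 + tau2                                           # mean sq. deviation
    # total deviation index: solve Pr(|d12| <= t) = p0 for t
    tdi = brentq(lambda t: norm.cdf((t - mu) / tau)
                 - norm.cdf((-t - mu) / tau) - p0, 0.0, abs(mu) + 10.0 * tau)
    return {"ccc": ccc, "cp": cp, "msd": msd, "tdi": tdi}

measures = inter_method_measures(beta=[0.0, 0.0], psi2=1.0, psi2_I=0.0,
                                 sigma2=[1.0, 1.0])
```

The intra-method versions follow by replacing (μ_12, τ_12²) with (0, 2σ_j²) from (4).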
Table 2 presents posterior summaries for (β_1, β_2, log σ_1², log σ_2², log ψ²). Further, the 95% central credible intervals
for corr(y_1, y_2), corr(y_11, y_12) and corr(y_21, y_22), irrespective of the protein, are (0.01, 0.44), (0.01, 0.43) and (0.01, 0.51), respectively. These correlations are computed using (3). They are relatively small for a method comparison study because the between-individual variation ψ² here is small in comparison with the within-individual variations σ_j², j = 1, 2. Under this model, the agreement measures listed in Table 1 do not depend on the proteins. So for the purpose of SSD for a future study, we ignore their effects and assume model (1). Further, to examine the effect of sampling priors on SSD, we consider two sets of sampling priors for (β_1, β_2, log σ_1², log σ_2², log ψ²): independent normal distributions with means and variances equal to their posterior means and variances, and independent uniform distributions over their central 95% credible intervals. Moreover, since ψ_I² could not be estimated in this small pilot study but might be present in a bigger study, we take the sampling prior for log ψ_I² to be the same as that of log ψ². In the second example, we take the manual device as method 1 and the automatic device as method 2. These data are modelled as (1) with (m, n) = (384, 2) but without the interaction term b_{ij}. This term is dropped from the model on the basis of an exploratory analysis of the data and the deviance information criterion (Spiegelhalter et al., 2003). This model is fitted on the lines of the previous model with the same (fitting) priors for the β's and variances. Posterior summaries for (β_1, β_2, log σ_1², log σ_2², log ψ²) are given in Table 2. The 95% central credible intervals for corr(y_1, y_2), corr(y_11, y_12) and corr(y_21, y_22) are (0.86, 0.90), (0.86, 0.90) and (0.85, 0.90), respectively. Such high correlations are typical in method comparison studies.
In this case also, we consider two sampling priors for (β_1, β_2, log σ_1², log σ_2², log ψ²); they are constructed from the posterior summaries in exactly the same way as in the previous example, except that the b_{ij} term is not included in (1). This is because the present study, which itself is quite large, does not seem to suggest its need.
Our investigation of the properties of the averages in (7) focuses on the following settings: (α, δ) = (0.05, 0.80); fitting priors for the β_j's in (1) as independent (improper) uniform distributions on the real line, and independent IG(10⁻³, 10⁻³) distributions for the variances; m ∈ {15, 20, ..., 150}; and n ∈ {1, 2, 3, 4}. We restrict attention to n ≤ 4 as it is rare to find studies in the literature with more than four replicates. We perform the following steps for each combination of example, sampling prior, agreement measure and the settings above:

(i) Draw γ* from the sampling prior. Then draw y* from [y | γ*] under model (1).

(ii) Fit model (1) to the data y* with the above fitting priors, and simulate draws from the posterior of γ. The Gibbs sampler described in the Appendix is used for posterior simulation. The Markov chain is run for 2,500 iterations and the first 500 iterations are discarded as burn-in. These numbers were determined through a convergence analysis of the simulated chain.

(iii) Use the posterior draws to compute the credible bounds U*_12 and U*_jj, j = 1, 2, for the agreement measure as described in Section 2.4. Also, approximate F_{12|y*}(δU*_12) and F_{jj|y*}(δU*_jj), j = 1, 2, as described in that section.

(iv) Repeat the above steps 2,000 times and compute the averages T*_12 and T*_jj, j = 1, 2, in (7).

It took about 45 minutes on a Linux computer with 4 GB RAM to compute one set of the three averages. The values of T*_11 and T*_12 in the case of normal sampling priors are plotted in Figure 1 for the concordance correlation (both examples), in Figure 2 for the total deviation index with p_0 = 0.8 (both examples), and in Figure 3 for the mean squared deviation and coverage probability with q_0 = log(1.5) (example 1 only). The omitted scenarios have qualitatively similar results.
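Steps (i)-(iv) can be sketched generically. In this outline the three callables are hypothetical stand-ins for the sampling prior, the data simulator of model (1), and the MCMC posterior sampler (the Gibbs sampler of the Appendix in the paper's actual computations):

```python
import numpy as np

def precision_average(draw_gamma, draw_data, posterior_theta,
                      L=2000, alpha=0.05, delta=0.8, seed=None):
    """Monte Carlo approximation (7) of T*(m, n, delta) = E{F_{.|y*}(delta U*)}.
    draw_gamma(rng) draws gamma* from the sampling prior; draw_data(gamma*, rng)
    draws y* from [y | gamma*]; posterior_theta(y*) returns posterior draws of
    the agreement measure theta under the fitting prior."""
    rng = np.random.default_rng(seed)
    mass = np.empty(L)
    for l in range(L):
        gamma_star = draw_gamma(rng)
        y_star = draw_data(gamma_star, rng)
        theta = np.asarray(posterior_theta(y_star))
        U = np.quantile(theta, 1.0 - alpha)    # (1 - alpha)-th sample quantile
        mass[l] = np.mean(theta <= delta * U)  # F_{.|y*_l}(delta * U*)
    return mass.mean()
```

For SSD, this average would be computed over the grid of (m, n) and compared with 1 − β as in (9).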
The graphs confirm that the averages T*_11 and T*_12 can be considered decreasing functions of m and n. This is also true for T*_22, for which the results are not shown. Some of the curves do show minor departures from monotonicity, particularly in the case of n = 1 when the curves drop rather slowly with m. But these are due to the Monte Carlo error involved in averaging and can be ignored. A slow decline also means that a precise evaluation of agreement is difficult even with a large m. For all agreement measures except the concordance correlation, we have max{T*_12, T*_11, T*_22} = max{T*_11, T*_22} for all (m, n), in both examples. It implies that to achieve the same precision of inference, as measured by (δ, β), these measures require a larger sample size for intra-method evaluation than for inter-method evaluation. This property also holds for the concordance correlation in example 2, but it holds in example 1 only when n = 1. Figures 1-3 also demonstrate that different sample sizes are needed for different agreement measures to attain the same precision of inference. All these conclusions for normal sampling priors also hold for uniform sampling priors (results not shown). A further comparison of the two priors reveals that they lead to virtually identical results in the case of example 2. Even in the case of example 1, their differences are generally small, about 2-3% at most. This suggests that for SSD, it does not matter much whether normal or uniform sampling priors are used, as they lead to roughly the same results.

4 Illustration

The results in the previous section demonstrate, in particular, that the proposed SSD procedure of Section 2.5 can be used in conjunction with any of the four agreement measures listed in Table 1. In this section, we apply this procedure to determine sample sizes for the two examples introduced in Section 1. For simplicity, we focus only on the total deviation index
as the measure of agreement. We also investigate the impact of the various choices one has to make for SSD based upon this measure. Other agreement measures can be handled similarly. The method for computing the T*'s, and the sampling and fitting priors, remain the same as in the previous section. We also take (α, β, δ) = (0.05, 0.85, 0.8). Our choice of δ = 0.80 is motivated by bioequivalence studies (Berger and Hsu, 1996; Wang, 1999), where a difference of about 20% is considered practically equivalent. Table 3 presents the combinations of (m, n), 1 ≤ n ≤ 4, that individually satisfy T*_12 ≤ 1 − β and T*_jj ≤ 1 − β, j = 1, 2. In other words, we have at least β − α = 0.80 individual probabilities in the intervals [0.8U*_12, U*_12] and [0.8U*_jj, U*_jj], j = 1, 2, where the upper bounds have 0.95 credible probabilities. The results are presented for both normal and uniform sampling priors, and for p_0 ∈ {0.80, 0.85, 0.90, 0.95}. They are computed by performing a grid search, and some of them can be determined from Figure 2. As we expect from the simulation studies, larger sample sizes are needed for intra-method evaluation than for inter-method evaluation. Moreover, substantially larger values of m are needed for intra-method evaluation when n = 1. The values of m with uniform sampling priors tend to be a bit higher than those with normal sampling priors. They differ by at most two, except in the case of intra-method evaluation with n = 1. The sample sizes for inter-method evaluation also appear quite robust to the choice of p_0; sometimes the values of m do increase by one as p_0 increases to the next level, but their maximum difference over the four settings of p_0 is three, and most of the time the differences are two or less. The sample sizes for intra-method evaluation do not depend on p_0, since the term involving p_0 in the total deviation index appears as a multiplicative constant (see Table 1).
It is also worth pointing out that, with the exception of the n = 1 case, the sample sizes for our two very different examples show remarkable similarity. In the case of n = 1, the value of m for intra-method evaluation is substantially higher for the second example than for the first. This is because the sampling priors in the second example induce high correlations between the measurements, whereas they induce only weak correlations in the first example. It follows from the results in Table 3 and (9) that the desired precision of simultaneous evaluation of intra- and inter-method agreement can be achieved by any of the following (m, n) combinations: (121, 1), (58, 2), (34, 3) and (23, 4) for example 1; and (655, 1), (60, 2), (33, 3) and (23, 4) for example 2. These values are based on uniform sampling priors and do not depend on p_0. To find the optimal combination, recall that the cost of sampling is given by (8), where we expect the relative cost (C_R/C_I) ∈ (0, 1]. Under this cost function, a combination (m_2, n_2) is better than (m_1, n_1) provided (m_2 − m_1) < (C_R/C_I)(m_1 n_1 − m_2 n_2). In particular, if (m_1 n_1 − m_2 n_2) > 0 and (m_2 − m_1) < 0, then the (m_2, n_2) combination is better than (m_1, n_1) irrespective of the cost of sampling. A naive application of this criterion to the two examples, without any subject-matter knowledge, suggests that a higher n is better, and hence (23, 4) is the optimal sample size choice in both cases. In the case of example 1, the experimental protocol does allow us to draw enough blood from the individuals to have four replications. So we decide to use (m, n) = (23, 4) as the sample size for a future study. In the case of example 2, however, it is unlikely that four replicate measurements of blood pressure can be obtained from each method without a change in the blood pressure. On the other hand, we do need replications, particularly for evaluating intra-method agreement. So, as a compromise, we take (m, n) = (60, 2) for a future study. There is no change in these sample sizes if normal sampling priors are used instead of uniform sampling priors.
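The cost comparison above amounts to minimizing C_T(m, n) = m(C_I + nC_R) over the qualifying combinations. A small sketch, using the example 1 combinations from Table 3 and a hypothetical cost ratio r = C_R/C_I:

```python
def total_cost(m, n, cost_ratio):
    """C_T(m, n) / C_I = m * (1 + n * r), with r = C_R / C_I, from (8)."""
    return m * (1.0 + n * cost_ratio)

def optimal_combination(combos, cost_ratio):
    """Among (m, n) pairs meeting the precision requirement (9),
    return the one minimizing the sampling cost."""
    return min(combos, key=lambda mn: total_cost(mn[0], mn[1], cost_ratio))

combos = [(121, 1), (58, 2), (34, 3), (23, 4)]  # example 1, Table 3
best = optimal_combination(combos, cost_ratio=0.5)
```

For these particular combinations, (23, 4) minimizes the cost for every r in (0, 1], matching the conclusion in the text; in general, the optimum depends on r.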
Further, to examine the impact of the hyperparameters of the variances in the fitting priors (2), we repeat the determination of the optimal (m, n) combination with 10⁻¹ and 10⁻² as the common hyperparameter value. We find that m changes by at most one, indicating
that the sample sizes are not sensitive to this choice, provided the hyperparameters are small.

5 Discussion

In this article, we described a Bayesian SSD approach for planning a method comparison study. It is conceptually straightforward and can be used with any scalar agreement measure. However, a disadvantage of this approach is that it is simulation-based and hence computationally intensive. Nevertheless, the computations are easy to program in the popular software packages R and WinBUGS. An R program that can be used for SSD is publicly available on the website. This approach involves specifying sampling priors for the parameters. Information regarding them can be obtained from the literature or through a pilot study, as we do here. Simulation results show that one does not need good approximations for these distributions, since the resulting sample sizes are robust to their choice. The sample sizes are also robust to the choice of the variance hyperparameters in the fitting priors if they are small. The proposed approach also involves making rather subjective choices for (δ, β) that control how much precision of inference is desired. But, in a rough sense, their roles here are similar to the roles of the effect size and desired power, which also need to be specified in the usual frequentist SSD approach. In both examples, among the (m, n) combinations that yield the same precision, the combination with higher n happened to be more cost efficient, at least in the range of n investigated. This was true irrespective of the cost of taking a replicate relative to the cost of sampling an individual. This conclusion is not true in general; it is easy to construct examples where a higher n is more cost efficient only when the relative cost of sampling a replicate is small.
References

Adcock, C. J. (1997). Sample size determination: a review. The Statistician 46.
Barnhart, H. X., Haber, M. J. and Lin, L. I. (2007). An overview on assessing agreement with continuous measurement. Journal of Biopharmaceutical Statistics (to appear).
Berger, R. L. and Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science 11.
Bland, J. M. and Altman, D. G. (1999). Measuring agreement in method comparison studies. Statistical Methods in Medical Research 8.
Browne, W. J. and Draper, D. (2006). A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Analysis 1.
Carrasco, J. L. and Jover, L. (2003). Estimating the generalized concordance correlation coefficient through variance components. Biometrics 59.
Choudhary, P. K. and Nagaraja, H. N. (2007). Tests for assessment of agreement using probability criteria. Journal of Statistical Planning and Inference 137.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1.
Han, D. K., Eng, J., Zhou, H. and Aebersold, R. (2001). Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry. Nature Biotechnology 19.
Lehmann, E. L. (1998). Elements of Large Sample Theory. Springer-Verlag, New York.
Li, X.-J., Zhang, H., Ranish, J. R. and Aebersold, R. (2003). Automated statistical analysis of protein abundance ratios from data generated by stable isotope dilution and tandem mass spectrometry. Analytical Chemistry 75.
Liebler, D. C. (2002). Introduction to Proteomics: Tools for the New Biology. Humana Press, Totowa, NJ.
Lin, L. I. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics 45. Corrections: 2000, Biometrics 56.
Lin, L. I. (1992). Assay validation using the concordance correlation coefficient. Biometrics 48.
Lin, L. I. (2000). Total deviation index for measuring individual agreement with applications in laboratory performance and bioequivalence. Statistics in Medicine 19.
Lin, L. I., Hedayat, A. S., Sinha, B. and Yang, M. (2002). Statistical methods in assessing agreement: Models, issues, and tools. Journal of the American Statistical Association 97.
Lin, S. C., Whipple, D. M. and Ho, C. S. (1998). Evaluation of statistical equivalence using limits of agreement and associated sample size calculation. Communications in Statistics, Part A: Theory and Methods 27.
R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press, New York.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and van der Linde, A. (2003). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B 64.
Spiegelhalter, D. J., Thomas, A., Best, N. G. and Lunn, D. (2003). WinBUGS Version 1.4 User Manual.
Sturtz, S., Ligges, U. and Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS from R. Journal of Statistical Software 12.
Wang, F. and Gelfand, A. E. (2002). A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Statistical Science 17.
Wang, W. (1999). On equivalence of two variances of a bivariate normal vector. Journal of Statistical Planning and Inference 81.

Appendix: Gibbs sampler for posterior simulation

The hierarchical model (1) can be written as

y | (β, b, R) ~ N(Xβ + Zb, R),   b | G ~ N(0, G),

where X and Z are appropriately chosen design matrices consisting of zeros and ones; y = (y_1', ..., y_m')' is the data vector; β = (β_1, β_2)'; b = (b_1, ..., b_m, b_11, b_12, ..., b_m1, b_m2)'; R = diag{R_1, ..., R_m} is the conditional covariance matrix of y; and G = diag{ψ² I_m, ψ²_I I_2m} is the covariance matrix of b. Here I_m denotes an m × m identity matrix, and y_i and R_i are respectively a column vector and a covariance matrix of order 2n, given as

y_i = (y_i11, ..., y_i1n, y_i21, ..., y_i2n)',   R_i = diag{σ²_1 I_n, σ²_2 I_n}.

The parameters in this case are (β, b, ψ², ψ²_I, σ²_1, σ²_2). It is well known that this model is conditionally conjugate under the priors (2) (Ruppert, Wand and Carroll, 2003, ch. 16). In particular, the full conditional distribution, i.e., the conditional distribution of a parameter given the remaining parameters and y, of the column vector (β', b')' is

N( {C'R⁻¹C + D}⁻¹ C'R⁻¹ y, {C'R⁻¹C + D}⁻¹ ),

where C = [X, Z], D = diag{B⁻¹, G⁻¹}, and B = diag{V²_1, V²_2}. The matrices involved in this distribution can be computed using a QR decomposition. The full conditionals of the variance parameters are the following independent inverse gamma distributions:

ψ² ~ IG( A_ψ + m/2, B_ψ + Σ_i b²_i / 2 );
ψ²_I ~ IG( A_I + m, B_I + Σ_i Σ_j b²_ij / 2 );
σ²_j ~ IG( A_j + mn/2, B_j + Σ_i Σ_k (y_ijk - β_j - b_i - b_ij)² / 2 ),   j = 1, 2.

The Gibbs sampler algorithm simulates a Markov chain whose limiting distribution is the desired joint posterior of the parameters (Gelman et al., 2003, ch. 11). It iterates the following two steps until convergence: first, use the current draw of (ψ², ψ²_I, σ²_1, σ²_2) to sample from the normal full conditional of (β, b); then, use the draw of (β, b) from the previous step to sample from the independent inverse gamma full conditionals of (ψ², ψ²_I, σ²_1, σ²_2). The estimates of the variance parameters from a frequentist mixed model fit can be used as starting values.
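As a concrete illustration of the two-step iteration above, the following sketch implements the sampler for model (1) in Python with numpy (the paper itself uses R and WinBUGS). The function interface, the use of a direct matrix inverse in place of the QR decomposition mentioned above, and the common inverse gamma hyperparameter pair (A, B) for all variance components are assumptions made for the example.

```python
import numpy as np

def gibbs_sampler(y, m, n, n_iter=2000, seed=0, V2=(1e2, 1e2), A=1e-2, B=1e-2):
    """Two-step Gibbs sampler for model (1): y[i, j, k] = beta_j + b_i + b_ij + e_ijk.

    y has shape (m, 2, n). V2 holds the prior variances (V_1^2, V_2^2) of
    (beta_1, beta_2), and a single pair (A, B) serves as the common inverse
    gamma hyperparameter value for all variance components; these defaults
    are assumed values for the illustration, not the paper's settings."""
    rng = np.random.default_rng(seed)
    N, p = 2 * m * n, 2 + 3 * m                 # data length; no. of (beta, b) terms
    C = np.zeros((N, p))                        # C = [X, Z], zeros and ones
    yv = np.zeros(N)
    meth = np.zeros(N, dtype=int)               # method index of each row
    r = 0
    for i in range(m):
        for j in range(2):
            for k in range(n):
                C[r, j] = 1.0                   # fixed effect beta_j
                C[r, 2 + i] = 1.0               # individual effect b_i
                C[r, 2 + m + 2 * i + j] = 1.0   # individual-method effect b_ij
                yv[r], meth[r] = y[i, j, k], j
                r += 1
    psi2, psi2I, sig2 = 1.0, 1.0, np.array([1.0, 1.0])   # generic starting values
    draws = []
    for _ in range(n_iter):
        # Step 1: (beta, b) | variances, y  ~  multivariate normal
        Rinv = 1.0 / sig2[meth]                 # diagonal of R^{-1}
        D = np.concatenate([1.0 / np.asarray(V2),
                            np.full(m, 1.0 / psi2),
                            np.full(2 * m, 1.0 / psi2I)])
        cov = np.linalg.inv(C.T @ (C * Rinv[:, None]) + np.diag(D))
        theta = rng.multivariate_normal(cov @ (C.T @ (Rinv * yv)), cov)
        b_i, b_ij = theta[2:2 + m], theta[2 + m:]
        # Step 2: variances | (beta, b), y  ~  independent inverse gammas,
        # drawn as reciprocals of gamma variates with the matching rates
        psi2 = 1.0 / rng.gamma(A + m / 2, 1.0 / (B + b_i @ b_i / 2))
        psi2I = 1.0 / rng.gamma(A + m, 1.0 / (B + b_ij @ b_ij / 2))
        resid = yv - C @ theta
        for j in range(2):
            rss = np.sum(resid[meth == j] ** 2)
            sig2[j] = 1.0 / rng.gamma(A + m * n / 2, 1.0 / (B + rss / 2))
        draws.append((theta[:2].copy(), psi2, psi2I, sig2.copy()))
    return draws
```

A frequentist mixed model fit would supply better starting values, as noted above; generic values suffice for illustration.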
List of Tables and Figures

Table 1: Expressions for various agreement measures for inter-method evaluation and intra-method evaluation of the j-th method, j = 1, 2. They are derived using (3) and (4) under model (1). Here Φ(·) represents the N(0, 1) cdf, and χ²_1(p_0, Δ) represents the p_0-th quantile of a chi-squared distribution with one degree of freedom and non-centrality parameter Δ.

Table 2: Posterior summaries for selected parameters in the protein ratios data (example 1) and the blood pressure data (example 2).

Table 3: Values of (m, n) that give the desired precision of agreement evaluation. Here 1-1, 2-2 and 1-2 respectively indicate intra-method agreement of method 1, intra-method agreement of method 2, and inter-method agreement. Results are presented assuming the total deviation index is used as the measure of agreement with various common choices for p_0. The sample sizes for intra-method evaluation do not depend on p_0.

Figure 1: Average probabilities T*_11 and T*_12 when agreement is measured using concordance correlation. These probabilities measure precisions of inter-method agreement and intra-method agreement of method 1, respectively. Here n represents the number of replicates, and results are presented for both examples.

Figure 2: Average probabilities T*_11 and T*_12 when agreement is measured using total deviation index with p_0 = . Here n represents the number of replicates, and results are presented for both examples.

Figure 3: Average probabilities T*_11 and T*_12 when agreement is measured using mean squared deviation (top) and coverage probability (bottom) with q_0 = log(1.5). Here n represents the number of replicates, and results are presented only for example 1.
measure                   inter-method (φ_12)                              intra-method (φ_jj)
concordance correlation   2ψ² / (μ²_12 + 2ψ² + 2ψ²_I + σ²_1 + σ²_2)        (ψ² + ψ²_I) / (ψ² + ψ²_I + σ²_j)
coverage probability      Φ((q_0 - μ_12)/τ_12) - Φ((-q_0 - μ_12)/τ_12)     Φ(q_0/τ_jj) - Φ(-q_0/τ_jj)
mean squared deviation    μ²_12 + τ²_12                                    τ²_jj
total deviation index     τ_12 {χ²_1(p_0, μ²_12/τ²_12)}^{1/2}              τ_jj {χ²_1(p_0, 0)}^{1/2}

Table 1: Expressions for various agreement measures for inter-method evaluation and intra-method evaluation of the j-th method, j = 1, 2. They are derived using (3) and (4) under model (1). Here Φ(·) represents the N(0, 1) cdf, and χ²_1(p_0, Δ) represents the p_0-th quantile of a chi-squared distribution with one degree of freedom and non-centrality parameter Δ.
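The Table 1 expressions are straightforward to evaluate once the model parameters are in hand. The sketch below (Python with scipy, in place of the paper's R) is an illustration: the function names are hypothetical, and mu and tau are taken directly as inputs, assumed to denote the mean and standard deviation of the relevant difference (μ_12 and τ_12 for the inter-method case, 0 and τ_jj for the intra-method case) as defined via (3) and (4), which precede this excerpt.

```python
from math import sqrt
from scipy.stats import chi2, ncx2, norm

def ccc_inter(mu12, psi2, psi2I, sig2_1, sig2_2):
    """Inter-method concordance correlation from the Table 1 expression."""
    return 2.0 * psi2 / (mu12 ** 2 + 2.0 * psi2 + 2.0 * psi2I + sig2_1 + sig2_2)

def ccc_intra(psi2, psi2I, sig2_j):
    """Intra-method concordance correlation for method j."""
    return (psi2 + psi2I) / (psi2 + psi2I + sig2_j)

def coverage_prob(mu, tau, q0):
    """P(|D| < q0) for a difference D ~ N(mu, tau^2); take mu = 0 and
    tau = tau_jj for the intra-method version."""
    return norm.cdf((q0 - mu) / tau) - norm.cdf((-q0 - mu) / tau)

def msd(mu, tau):
    """Mean squared deviation of the difference."""
    return mu ** 2 + tau ** 2

def tdi(mu, tau, p0):
    """Total deviation index: tau times the square root of the p0-th quantile
    of a chi-squared distribution with 1 df and non-centrality (mu/tau)^2."""
    nc = (mu / tau) ** 2
    q = chi2.ppf(p0, df=1) if nc == 0 else ncx2.ppf(p0, df=1, nc=nc)
    return tau * sqrt(q)
```

With mu = 0 the total deviation index reduces to tau times the (1 + p_0)/2 standard normal quantile, which gives a quick sanity check on the non-central chi-squared computation.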
Table 2: Posterior summaries (mean, sd and selected percentiles) for the parameters β_1, β_2, log σ_1, log σ_2 and log ψ in the protein ratios data (example 1) and the blood pressure data (example 2).
Table 3: Values of (m, n) that give the desired precision of agreement evaluation under normal and uniform sampling priors. Here 1-1, 2-2 and 1-2 respectively indicate intra-method agreement of method 1, intra-method agreement of method 2, and inter-method agreement. Results are presented assuming the total deviation index is used as the measure of agreement with various common choices for p_0. The sample sizes for intra-method evaluation do not depend on p_0.
Figure 1: Average probabilities T*_11 and T*_12, plotted against the number of individuals m with separate curves for n = 1, 2, 3, 4 replicates, when agreement is measured using concordance correlation. These probabilities measure precisions of inter-method agreement and intra-method agreement of method 1, respectively. Results are presented for both examples.
Figure 2: Average probabilities T*_11 and T*_12, plotted against the number of individuals m with separate curves for n = 1, 2, 3, 4 replicates, when agreement is measured using total deviation index with p_0 = . Results are presented for both examples.
Figure 3: Average probabilities T*_11 and T*_12, plotted against the number of individuals m with separate curves for n = 1, 2, 3, 4 replicates, when agreement is measured using mean squared deviation (top) and coverage probability (bottom) with q_0 = log(1.5). Results are presented only for example 1.
More informationTechnical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models
Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Christopher Paciorek, Department of Statistics, University
More informationBAYESIAN MODEL CRITICISM
Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes
More informationDIC: Deviance Information Criterion
(((( Welcome Page Latest News DIC: Deviance Information Criterion Contact us/bugs list WinBUGS New WinBUGS examples FAQs DIC GeoBUGS DIC (Deviance Information Criterion) is a Bayesian method for model
More informationCHAPTER 10. Bayesian methods. Geir Storvik
CHAPTER 10 Bayesian methods Geir Storvik 10.1 Introduction Statistical inference concerns about learning from data, either parameters (estimation) or some, typically future, variables (prediction). In
More informationBayesian Model Diagnostics and Checking
Earvin Balderama Quantitative Ecology Lab Department of Forestry and Environmental Resources North Carolina State University April 12, 2013 1 / 34 Introduction MCMCMC 2 / 34 Introduction MCMCMC Steps in
More informationA note on Reversible Jump Markov Chain Monte Carlo
A note on Reversible Jump Markov Chain Monte Carlo Hedibert Freitas Lopes Graduate School of Business The University of Chicago 5807 South Woodlawn Avenue Chicago, Illinois 60637 February, 1st 2006 1 Introduction
More informationWhy Bayesian approaches? The average height of a rare plant
Why Bayesian approaches? The average height of a rare plant Estimation and comparison of averages is an important step in many ecological analyses and demographic models. In this demonstration you will
More informationA Note on Bayesian Inference After Multiple Imputation
A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in
More informationBayesian Estimation for the Generalized Logistic Distribution Type-II Censored Accelerated Life Testing
Int. J. Contemp. Math. Sciences, Vol. 8, 2013, no. 20, 969-986 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ijcms.2013.39111 Bayesian Estimation for the Generalized Logistic Distribution Type-II
More informationA noninformative Bayesian approach to domain estimation
A noninformative Bayesian approach to domain estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu August 2002 Revised July 2003 To appear in Journal
More informationContents. Part I: Fundamentals of Bayesian Inference 1
Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationPARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.
PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.. Beta Distribution We ll start by learning about the Beta distribution, since we end up using
More informationAppendix: Modeling Approach
AFFECTIVE PRIMACY IN INTRAORGANIZATIONAL TASK NETWORKS Appendix: Modeling Approach There is now a significant and developing literature on Bayesian methods in social network analysis. See, for instance,
More informationHierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31
Hierarchical models Dr. Jarad Niemi Iowa State University August 31, 2017 Jarad Niemi (Iowa State) Hierarchical models August 31, 2017 1 / 31 Normal hierarchical model Let Y ig N(θ g, σ 2 ) for i = 1,...,
More informationBiol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016
Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016 By Philip J. Bergmann 0. Laboratory Objectives 1. Learn what Bayes Theorem and Bayesian Inference are 2. Reinforce the properties
More informationBayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang
Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationA Heteroscedastic Measurement Error Model for Method Comparison Data With Replicate Measurements
A Heteroscedastic Measurement Error Model for Method Comparison Data With Replicate Measurements Lakshika S. Nawarathna Department of Statistics and Computer Science University of Peradeniya Peradeniya
More information