A Heteroscedastic Measurement Error Model for Method Comparison Data With Replicate Measurements


Lakshika S. Nawarathna
Department of Statistics and Computer Science, University of Peradeniya, Peradeniya 20400, Sri Lanka

Pankaj K. Choudhary^1
Department of Mathematical Sciences, FO 35, University of Texas at Dallas, Richardson, TX, USA

Abstract

Measurement error models offer a flexible framework for modeling data collected in studies comparing methods of quantitative measurement. These models generally make two simplifying assumptions: (a) the measurements are homoscedastic; and (b) the unobservable true values of the methods are linearly related. One or both of these assumptions may be violated in practice. In particular, error variabilities of the methods may depend on the magnitude of measurement, or the true values may be nonlinearly related. Data with these features call for a heteroscedastic measurement error model that allows nonlinear relationships in the true values. We present such a model for the case when the measurements are replicated, discuss its fitting, and explain how to evaluate similarity of measurement methods and agreement between them, which are two common goals of data analysis, under this model. Model fitting involves dealing with lack of a closed form for the likelihood function. We consider estimation

^1 Corresponding author. pankaj@utdallas.edu, Tel: (972), Fax: (972)

methods that approximate either the likelihood or the model to yield approximate maximum likelihood estimates. The fitting methods are evaluated in a simulation study. The proposed methodology is used to analyze a cholesterol dataset.

Keywords: Agreement, calibration, mixed-effects model, nonlinear model, repeated measures, total deviation index.

1 Introduction

Method comparison studies are concerned with comparing a new, cheaper or easier test method for measuring a quantitative variable with an established reference method. Such studies are routinely conducted in biomedical disciplines. The variable being measured often has some clinical interest, e.g., cholesterol level or fat content. The methods may be medical devices, assays, clinical observers, or measurement protocols. None of the methods is considered error-free. The data consist of at least one measurement from each method on every subject in the study. Our focus is on the study design wherein measurements from both methods are replicated. The primary goal of the comparison, especially if the methods measure in the same nominal unit, is to evaluate agreement between the methods to see if they can be used interchangeably. A number of articles have developed statistical methodologies for this purpose, including [1-6]. See [7, 8] for overviews. Another common goal of the comparison, irrespective of whether the methods measure in the same nominal unit or not, is to evaluate similarity of the methods by comparing their accuracies and precisions, and to recalibrate one method with respect to the other. Statistical methodologies for accomplishing this goal are reviewed in [9].

Regardless of the goal of a method comparison study, modeling of data is a key step in the data analysis. Two modeling frameworks, namely, a mixed-effects model [10] and a measurement error model [11], are especially popular. A mixed-effects model is employed when the methods can be assumed to have the same measurement scale, meaning that the true (i.e., error-free) values of the methods may differ only by a constant [12-16]. An example of methods with the same scales is two thermometers, one measuring in Celsius (°C) and the other in Kelvin (K), because K = °C + 273.15. The assumption of a common scale is not needed in a measurement error model because it allows the true values to be linearly related rather than just differ by a constant [17-20]. An example of methods with linearly related true values is two thermometers, one measuring in °C and the other in Fahrenheit (°F), because °F = 32 + (9/5)°C. Note that for methods to have the same scale it is neither necessary nor sufficient that they have the same unit of measurement. While thermometers measuring in °C and K are an example of methods with different units but the same scale, an example of methods with the same unit but different scales is two thermometers, both measuring in °C, but one consistently giving 10% higher measurements than the other due to miscalibration. Of course, in these temperature-related examples we know the relationships between the various thermometers. But in most method comparison studies in practice these relationships need to be estimated from the data. Since a constant difference in true values is a special case of a linear relationship in which the slope is one, a measurement error model offers a more flexible framework for modeling method comparison data than a mixed-effects model.

Measurement error models have been advocated in [17-21]. But these models generally make the simplifying assumption that the measurements are homoscedastic, i.e., that the variability of measurements remains constant over the entire measurement range. In practice, however, it frequently happens that the variability of a measurement changes with its magnitude [1, 17, 22].
The cholesterol data of [12], which motivated this work and is analyzed later in this article, provides a specific example of this phenomenon. In the presence of such heteroscedasticity, the precisions of the methods

as well as the extent of agreement between them change with the magnitude of measurement. But these quantities would be treated as constants if the heteroscedasticity is ignored, leading to potentially misleading conclusions. A variance stabilizing transformation of the data may be used to remove the heteroscedasticity, but the difference of transformed measurements may be difficult to interpret. This is a problem because the measurement differences need to be interpretable to evaluate agreement between the methods [22]. Therefore, models that explicitly incorporate heteroscedasticity are of considerable interest to practitioners. Recently, a heteroscedastic mixed-effects model was proposed in [23] for replicated method comparison data. As for the measurement error model framework, heteroscedastic models have been considered, e.g., in [9, 24, 25], but none is specifically designed for replicated method comparison data. This brings us to the main goals of this article, which are to present such a model, discuss computational algorithms for fitting it, and illustrate its application. The novelty of our approach also lies in that we allow the true values of the methods to be nonlinearly related, thereby obtaining the standard model with a linear relationship as an important special case. Heteroscedastic models allowing nonlinear relationships have hitherto not been studied in the method comparison literature.

The rest of this article is organized as follows. Section 2 presents the proposed model. Section 3 discusses computational methods for fitting it. Section 4 describes a simulation study to evaluate the model fitting methods. Section 5 shows how to use the model to evaluate similarity of measurement methods and agreement between them. Section 6 illustrates an application by analyzing a cholesterol dataset. Section 7 concludes with a discussion. All the computations in this article have been performed using the statistical software R [26].

2 The proposed heteroscedastic model

Consider a method comparison study involving two measurement methods and m subjects. Let Y_ijk be the kth replicate measurement by the jth method on the ith subject. The data in the study consist of Y_ijk, k = 1, ..., n_ij, j = 1, 2, i = 1, ..., m. Here method 1 represents the reference method and method 2 represents the test method. The multiple measurements from a method on a subject are exchangeable in that they are replications of the same underlying measurement. The replicate measurements from the two methods on a subject are dependent but they are not paired. In fact, the methods may not even have the same number of replications on a subject. Let n_i = n_i1 + n_i2 be the total number of measurements on the ith subject. The n_i need not be equal.

In what follows, we will use bold-face letters to denote vectors and matrices. By default, a vector is a column vector unless specified otherwise. The transpose of a vector or matrix A is denoted as A^T. Let Y_ij = (Y_ij1, ..., Y_ijn_ij)^T denote the n_ij-vector of measurements on subject i from method j. The n_i-vector Y_i = (Y_i1^T, Y_i2^T)^T denotes all measurements on subject i. Let Ỹ = (Ỹ_1, Ỹ_2)^T denote paired measurements from the two methods on a randomly selected subject from the population. We think of Ỹ as a typical measurement pair in that the observed (Y_i1k, Y_i2l) pairs, even though the replications are not paired by design, are identically distributed as Ỹ. Let θ be the vector of all unknown model parameters. We use h_θ(y_1, y_2) for the joint probability density function of (Ỹ_1, Ỹ_2) and h_θ(y_1 | y_2) for the conditional density of Ỹ_1 given Ỹ_2 = y_2. We also use N_d(µ, Σ) to denote a d-variate normal distribution with mean vector µ and covariance matrix Σ.

2.1 The model for Ỹ

To prepare the groundwork for presenting a heteroscedastic model for the observed data, we first present it for Ỹ and then adapt it for the observed data. Let b denote the true unobservable measurement underlying Ỹ, and e_1 and e_2 denote the random errors of the two methods. The basic measurement error model for Ỹ is written as [18]

Ỹ_1 = b + e_1,  Ỹ_2 = β_0 + β_1 b + e_2;
b ~ N_1(µ, τ^2),  e_1 ~ N_1(0, σ_1^2),  e_2 ~ N_1(0, σ_2^2),   (1)

where β_0 and β_1 are regression coefficients respectively known as the fixed and proportional biases of the test method, and (b, e_1, e_2) are mutually independent. For reasons of model identifiability, the true measurement b is also the true value of the reference method. This model postulates a linear relationship between the true values b and β_0 + β_1 b of the two methods. The methods have different scales when the slope β_1 ≠ 1. The model (1) is called a measurement error model because the covariate b of Ỹ_2 is measured with error as Ỹ_1.

We now make three changes to this basic model. First, we replace the linear calibration function β_0 + β_1 b relating the true values of the two methods by a more general function f(b, β). The function f has a known parametric form which depends on a fixed unknown parameter vector β. Moreover, f is differentiable and may be nonlinear in b as well as in β. Specific examples of f include β_0 + β_1 b (linear model), β_0 + β_1 b + β_2 b^2 (quadratic model), and β_0 exp(β_1 b) (exponential model). Second, we add independent method × subject interaction effects b_1 ~ N_1(0, ψ^2) and b_2 ~ N_1(0, ψ^2) to the respective expressions for Ỹ_1 and Ỹ_2. These effects are essentially subject-specific biases of the methods. They are also known as equation errors in the measurement error literature and as matrix effects in analytical chemistry [9]. They appear additively in the model and are mutually independent of (b, e_1, e_2).
See also the discussion in Section 7 for a note regarding how the equal variance assumption for the two interaction

effects may be relaxed. It may be noted that interaction effects for both methods are almost always included in the model when a mixed-effects model is used for replicated method comparison data. However, when a measurement error model is used, they are often included only for the test method but not for the reference method; see, e.g., [9]. Finally, we replace the constant error variance σ_j^2 by var[e_j | b] = σ_j^2 g_j^2(b, δ_j), j = 1, 2, where g_j is a variance function that models how the error variance of the jth method depends on the true value b. This g_j is also differentiable and has a known parametric form depending on an unknown heteroscedasticity parameter vector δ_j, which is such that g_j(b, δ_j) = 1 when δ_j = 0. The two methods may have different variance functions. Examples of a variance function include b^δ (power model), δ_0 + b^δ_1 (constant plus power model), and exp(δb) (exponential model) [10]. The model becomes homoscedastic when δ_1 = 0 = δ_2. After these changes, the basic model (1) for Ỹ becomes a heteroscedastic measurement error model,

Ỹ_1 = b + b_1 + e_1,  Ỹ_2 = f(b, β) + b_2 + e_2;
b_j ~ N_1(0, ψ^2),  e_j | b ~ N_1(0, σ_j^2 g_j^2(b, δ_j)),  b ~ N_1(µ, τ^2),  j = 1, 2.   (2)

Here e_1 and e_2 are conditionally independent given b. Marginally, they are uncorrelated but dependent. Further, b is independent of (b_1, b_2), and (b_1, b_2, e_1, e_2) are mutually independent. This model is nonlinear in b unless f is linear in b and the g_j are free of b. By marginalizing over (b_1, b_2), we get a hierarchical representation of this model as

Ỹ_1 | b ~ N_1(b, ψ^2 + σ_1^2 g_1^2(b, δ_1)),  Ỹ_2 | b ~ N_1(f(b, β), ψ^2 + σ_2^2 g_2^2(b, δ_2)),  b ~ N_1(µ, τ^2),   (3)

where Ỹ_1 and Ỹ_2 are conditionally independent given b. In general, further marginalization over b does not yield a closed-form marginal distribution for Ỹ, albeit its marginal mean

vector and covariance matrix can be written as

E[Ỹ] = (µ, E[f(b, β)])^T,
var[Ỹ] = diag{ψ^2 + σ_1^2 E[g_1^2(b, δ_1)], ψ^2 + σ_2^2 E[g_2^2(b, δ_2)]} + Γ,   (4)

where Γ is the covariance matrix of (b, f(b, β))^T,

Γ = [ τ^2               cov[b, f(b, β)]
      cov[b, f(b, β)]   var[f(b, β)]   ].   (5)

2.2 The model for observed data

We get a model for the observed data Y_i from that of Ỹ in (2) by simply replacing Ỹ_j with Y_ijk and (b, b_j, e_j) with its independent copies (b_i, b_ij, e_ijk) for k = 1, ..., n_ij, j = 1, 2, i = 1, ..., m. This gives

Y_i1k = b_i + b_i1 + e_i1k,  Y_i2k = f(b_i, β) + b_i2 + e_i2k;
b_ij ~ N_1(0, ψ^2),  e_ijk | b_i ~ N_1(0, σ_j^2 g_j^2(b_i, δ_j)),  b_i ~ N_1(µ, τ^2).   (6)

In this model, the multiple measurements from method j on subject i are dependent because they share the same b_i and b_ij. Furthermore, the measurements from different methods on subject i are also dependent because they share the same b_i. The measurements on different subjects are independent. To write this model in matrix form, let 1_n and 0_n be n-vectors of ones and zeros, and define

e_ij = (e_ij1, ..., e_ijn_ij)^T,  e_i = (e_i1^T, e_i2^T)^T,  b_i = (b_i1, b_i2)^T,
U_i = (1_n_i1^T, 0_n_i2^T)^T,  V_i = (0_n_i1^T, 1_n_i2^T)^T,  Z_i = [U_i, V_i].   (7)

Also define Σ_ij(b) as an n_ij × n_ij diagonal matrix and Σ_i(b) as an n_i × n_i block-diagonal matrix,

Σ_ij(b) = diag{σ_j^2 g_j^2(b, δ_j), ..., σ_j^2 g_j^2(b, δ_j)},  Σ_i(b) = diag{Σ_i1(b), Σ_i2(b)}.
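As a concrete illustration, data can be simulated from model (6) directly. The sketch below uses a linear calibration function f(b, β) = β_0 + β_1 b and power variance functions g_j(b, δ_j) = b^δ_j; all parameter values and the replication counts are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100                       # subjects
n1, n2 = 2, 3                 # replicates per method (may differ)
mu, tau2, psi2 = 200.0, 1600.0, 4.0
sigma1, sigma2 = 0.15, 0.20
beta0, beta1 = 2.0, 1.05      # f(b, beta) = beta0 + beta1 * b
delta1, delta2 = 0.5, 0.5     # g_j(b, delta_j) = b**delta_j

b = np.abs(rng.normal(mu, np.sqrt(tau2), m))  # true values b_i (kept positive
                                              # so b**delta is well defined)
bi1 = rng.normal(0.0, np.sqrt(psi2), m)       # method 1 x subject interactions
bi2 = rng.normal(0.0, np.sqrt(psi2), m)       # method 2 x subject interactions

# Y_i1k = b_i + b_i1 + e_i1k;  Y_i2k = f(b_i, beta) + b_i2 + e_i2k,
# with heteroscedastic error SDs sigma_j * g_j(b_i, delta_j).
Y1 = b[:, None] + bi1[:, None] + rng.normal(
    0.0, sigma1 * b[:, None] ** delta1, (m, n1))
Y2 = (beta0 + beta1 * b)[:, None] + bi2[:, None] + rng.normal(
    0.0, sigma2 * b[:, None] ** delta2, (m, n2))
```

Rows of Y1 and Y2 share the same b_i, which induces the within-subject and between-method dependence described above.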

Now the model (6) can be written in matrix form as

Y_i = U_i b_i + V_i f(b_i, β) + Z_i b_i + e_i;
e_i | b_i ~ N_n_i(0, Σ_i(b_i)),  b_i ~ N_2(0, ψ^2 diag{1, 1}),  b_i ~ N_1(µ, τ^2),  i = 1, ..., m.   (8)

Proceeding as in (3), we can represent the model in a hierarchical manner as

Y_i | b_i ~ N_n_i(U_i b_i + V_i f(b_i, β), ψ^2 Z_i Z_i^T + Σ_i(b_i)),  b_i ~ N_1(µ, τ^2).   (9)

Here Y_i1 and Y_i2 are conditionally independent given b_i because the matrix

ψ^2 Z_i Z_i^T + Σ_i(b_i) = diag{ψ^2 1_n_i1 1_n_i1^T + Σ_i1(b_i), ψ^2 1_n_i2 1_n_i2^T + Σ_i2(b_i)}

has a block-diagonal structure. This is expected since the dependence between Y_i1 and Y_i2 is induced only through the common b_i. Moreover, analogous to (4), the marginal mean vector and covariance matrix of Y_i are

E[Y_i] = U_i µ + V_i E[f(b, β)],  var[Y_i] = Z_i Γ Z_i^T + ψ^2 Z_i Z_i^T + E[Σ_i(b)],   (10)

with Γ given by (5). In principle, the marginal probability density function h_θ(y_i) of Y_i can be obtained as

h_θ(y_i) = ∫ h_θ(y_i, b_i) db_i.   (11)

Here h_θ(y_i, b_i) = h_θ(y_i1 | b_i) h_θ(y_i2 | b_i) h_θ(b_i) from conditional independence. The densities involved in this expression are normal densities obtained from (9). However, the integral (11) does not have a closed form in general.

2.3 The case of linear f

The case of a linear calibration function f(b, β) = β_0 + β_1 b is of special interest in practice. In this case, the moments of Ỹ in (4) and Y_i in (10) simplify because we explicitly have

E[f(b, β)] = β_0 + β_1 µ,  Γ = [ τ^2      β_1 τ^2
                                  β_1 τ^2  β_1^2 τ^2 ].
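For linear f, the Γ just displayed is the covariance matrix of (b, β_0 + β_1 b)^T. Since the second coordinate is a deterministic linear function of the first, the two are perfectly correlated and Γ is singular with rank one. A quick numerical check, using hypothetical values of τ^2 and β_1:

```python
import numpy as np

tau2, beta1 = 1600.0, 1.05    # hypothetical values, for illustration only

# Gamma = cov[(b, beta0 + beta1*b)^T] for linear f, as displayed above.
Gamma = np.array([[tau2, beta1 * tau2],
                  [beta1 * tau2, beta1 ** 2 * tau2]])

# f(b) = beta0 + beta1*b is a deterministic function of b, so the two
# coordinates are perfectly correlated and Gamma has rank one.
det = np.linalg.det(Gamma)
corr = Gamma[0, 1] / np.sqrt(Gamma[0, 0] * Gamma[1, 1])
```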

In addition, if the model is homoscedastic then, from (9), Y_i ~ N_n_i(E[Y_i], var[Y_i]), where

E[Y_i] = V_i β_0 + (U_i + V_i β_1) µ,
var[Y_i] = τ^2 (U_i + V_i β_1)(U_i + V_i β_1)^T + ψ^2 Z_i Z_i^T + diag{σ_1^2 I_n_i1, σ_2^2 I_n_i2}.   (12)

Thus, in this case the marginal density h_θ(y_i) is a normal density. This exception occurs because b_i appears in the model linearly, allowing it to be explicitly integrated out in (11).

3 Model fitting by maximum likelihood

3.1 Likelihood computation

Let L(θ) denote the likelihood function of the parameter vector θ = (µ, τ^2, β^T, ψ^2, σ_1^2, σ_2^2, δ_1^T, δ_2^T)^T under model (6). By definition,

L(θ) = ∏_{i=1}^m h_θ(y_i),

but h_θ(y_i), given by (11), does not have an explicit expression in general. Therefore, we now describe two approaches for computing it. The first numerically approximates the integral (11), whereas the second approximates the original model (6) so that the resulting density has a closed form.

3.1.1 Approach 1: Numerical integration

Let ℓ_θ(y_i, b_i) = −log h_θ(y_i, b_i) be the negative logarithm of the integrand in (11); let b_i,min be its minimizer with respect to b_i; and let ℓ''_θ(y_i, b_i,min) = (∂^2/∂b_i^2) ℓ_θ(y_i, b_i)|_{b_i = b_i,min} be the corresponding Hessian at the minimum. A simple approximation of the integral (11) is the Laplace approximation (LA) [27],

h_θ(y_i) ≈ (2π)^{1/2} {ℓ''_θ(y_i, b_i,min)}^{−1/2} h_θ(y_i, b_i,min).
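The LA formula can be sanity-checked on a toy integrand whose integral is known in closed form. The sketch below, with made-up numbers, does this and also evaluates the centered-and-scaled Gauss-Hermite (GH) rule taken up next; both recover the exact answer here because the toy integrand is Gaussian in b.

```python
import numpy as np

def norm_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

# Toy integrand h(y, b) = N(y | b, s2) * N(b | mu, tau2); its integral over b
# has the closed form N(y | mu, tau2 + s2), which lets us check the quadrature.
mu, tau2, s2, y = 200.0, 1600.0, 9.0, 230.0
h = lambda b: norm_pdf(y, b, s2) * norm_pdf(b, mu, tau2)

# Minimizer b_min and Hessian l''(b_min) of -log h, available in closed form
# for this conjugate toy case (in general they are found numerically).
prec = 1.0 / s2 + 1.0 / tau2
b_min = (y / s2 + mu / tau2) / prec

# Laplace approximation (LA): GH with a single node at zero.
la = np.sqrt(2 * np.pi) * prec ** -0.5 * h(b_min)

# Adaptive Gauss-Hermite (GH) with M nodes, centered and scaled at b_min.
M = 10
z, w = np.polynomial.hermite.hermgauss(M)   # nodes/weights, kernel exp(-z^2)
c = b_min + np.sqrt(2.0) * prec ** -0.5 * z
gh = np.sqrt(2.0) * prec ** -0.5 * np.sum(h(c) * w * np.exp(z ** 2))

exact = norm_pdf(y, mu, tau2 + s2)
```

With a non-Gaussian integrand (e.g., nonlinear f or heteroscedastic errors), LA and GH would differ, and GH would improve as M grows.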

Another approximation is given by the Gauss-Hermite quadrature (GH) [27]. To describe this, let z_1, ..., z_M be the nodes and w_1, ..., w_M be the associated quadrature weights with kernel exp(−z^2). The nodes are centered and scaled to achieve greater accuracy as [28]

c_ir = b_i,min + 2^{1/2} {ℓ''_θ(y_i, b_i,min)}^{−1/2} z_r,  r = 1, ..., M.

The approximated integral in this case is

h_θ(y_i) ≈ 2^{1/2} {ℓ''_θ(y_i, b_i,min)}^{−1/2} ∑_{r=1}^M h_θ(y_i, c_ir) w_r exp(z_r^2).

This method reduces to LA when M = 1 because the sole node in this case is zero with weight π^{1/2} [28]. This makes it clear that GH is not only more accurate but also more computationally demanding than LA. In practice, a moderate number of nodes tends to provide reasonably good accuracy for the GH method.

3.1.2 Approach 2: Model approximation by linearization

This approach approximates the model (6) by linearizing the f and g_j functions in b_i (an unobservable random quantity) around an observable non-random quantity b*_i that is close to b_i but is held fixed in model fitting. This kind of linearization is a standard strategy in the fitting of nonlinear mixed-effects and measurement error models [10, 29] and even generalized linear mixed-effects models [30]. Expanding the f and g_j functions around b_i = b*_i using a Taylor series and keeping the first two terms for f and only the first term for g_j, we get

f(b_i, β) ≈ f(b*_i, β) + (b_i − b*_i) f'(b*_i, β),  g_j(b_i, δ_j) ≈ g_j(b*_i, δ_j),  j = 1, 2,   (13)

where f'(b*_i, β) = (∂/∂b_i) f(b_i, β)|_{b_i = b*_i}. The approximation for f is exact when f is linear in b_i, whereas the approximation for g_j is exact when the model is homoscedastic. Replacing

f and g_j in (6) by their approximations in (13) gives the linearized version of (6) as

Y_i1k = b_i + b_i1 + e_i1k,  Y_i2k ≈ f(b*_i, β) + (b_i − b*_i) f'(b*_i, β) + b_i2 + e_i2k;
b_ij ~ N_1(0, ψ^2),  e_ijk ~ N_1(0, σ_j^2 g_j^2(b*_i, δ_j)),  b_i ~ N_1(µ, τ^2),
k = 1, ..., n_ij,  j = 1, 2,  i = 1, ..., m.   (14)

Here the notation ≈ means "is approximately distributed as", and the approximation is caused by the linearization. Letting

d(b*_i, β) = f(b*_i, β) − b*_i f'(b*_i, β),   (15)

the approximate model (14) can be written in matrix form as

Y_i ≈ V_i d(b*_i, β) + (U_i + V_i f'(b*_i, β)) b_i + Z_i b_i + e_i;
e_i ~ N_n_i(0, Σ_i(b*_i)),  b_i ~ N_2(0, ψ^2 diag{1, 1}),  b_i ~ N_1(µ, τ^2),  i = 1, ..., m.   (16)

It follows from marginalizing over b_i and b_i that Y_i is approximately distributed as N_n_i(E[Y_i], var[Y_i]), where

E[Y_i] ≈ V_i d(b*_i, β) + (U_i + V_i f'(b*_i, β)) µ,
var[Y_i] ≈ τ^2 (U_i + V_i f'(b*_i, β))(U_i + V_i f'(b*_i, β))^T + ψ^2 Z_i Z_i^T + Σ_i(b*_i).   (17)

Thus, h_θ(y_i) can be approximated by the density of this normal distribution. We refer to this model approximation method as MA. This closed-form approximation is made possible by the way the f and g_j functions are linearized, which ensures that b_i appears in the model linearly. This also explains why only the first term in the Taylor expansion was kept for g_j. One may think of this approximation method as a pseudo-likelihood approach because the true model is approximated by a model that leads to a normal marginal likelihood.

To implement this method it remains to choose b*_i. A natural choice is b*_i = ȳ_i1, the mean of the measurements from the reference method on subject i. The resulting model (14) with b*_i held fixed can be fit via maximum likelihood (ML). An alternative choice for b*_i is the best linear unbiased predictor

of b_i. The model in this case needs to be fit by an iterative scheme because the predictor itself depends on unknown model parameters [10, 29]. Empirical results in [23] show that this additional complexity in model fitting is not worthwhile, at least for method comparison studies, because the differences in parameter estimates are negligible. Therefore, we only work with b*_i = ȳ_i1 in this article.

3.2 Inference on model parameters

The likelihood function approximated using any of the three methods LA, GH, and MA can be maximized by an optimization routine, e.g., the optim function in R, to compute the approximate ML estimate θ̂ of θ. Subsequent inference on θ ignores the approximation in θ̂ and employs the standard large-sample theory of ML estimators [31]. In particular, when m is large, the standard errors (SEs) of estimates and confidence intervals for parameters are obtained by approximating the distribution of θ̂ by a normal distribution with mean θ and the inverse of the observed information matrix I = −(∂^2/∂θ ∂θ^T) log L(θ)|_{θ=θ̂} as the covariance matrix. Here L(θ) represents the likelihood function under the model actually fit to the data. Moreover, the null hypothesis of homoscedasticity (δ_1 = 0 = δ_2) is tested by performing a likelihood ratio test wherein the null distribution of the test statistic is approximated by a chi-square distribution with degrees of freedom equal to the number of parameters set to zero under the null hypothesis. This strategy of ignoring the approximation in θ̂ for further statistical inference is common in nonlinear mixed-effects and measurement error models [10, 29] and generalized linear mixed-effects models [30].
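As a small worked example of the homoscedasticity test, suppose δ_1 and δ_2 are scalars, so the null hypothesis δ_1 = 0 = δ_2 fixes two parameters and the reference distribution is chi-square with two degrees of freedom. The log-likelihood values below are made up for illustration; they are not output from a real fit.

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods of the heteroscedastic fit and the
# null (homoscedastic) fit; the null sets delta_1 = delta_2 = 0, i.e., two
# parameters, so the LRT statistic is referred to a chi^2_2 distribution.
ll_het, ll_hom = -512.4, -540.1
lrt = 2.0 * (ll_het - ll_hom)          # likelihood ratio statistic
df = 2                                 # number of parameters set to zero
p_value = chi2.sf(lrt, df)             # upper-tail p-value
reject = p_value < 0.05                # reject homoscedasticity at 5% level
```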

3.3 Fitted values and residuals

Let Ŷ_i denote the fitted value of Y_i, i = 1, ..., m. Under the linearized model (16),

Ŷ_i ≈ V_i d(b*_i, β̂) + (U_i + V_i f'(b*_i, β̂)) b̂_i + Z_i b̂_i,

where (b̂_i, b̂_i) is the estimated best predictor of (b_i, b_i) given Y_i, obtained by substituting θ = θ̂ in

E[(b_i, b_i^T)^T | Y_i] ≈ (µ, 0, 0)^T + diag{τ^2, ψ^2, ψ^2} [U_i + V_i f'(b*_i, β), Z_i]^T (var[Y_i])^{−1} (Y_i − E[Y_i]),

with E[Y_i] and var[Y_i] given by (17). The residuals can be computed as ê_i = Y_i − Ŷ_i, i = 1, ..., m. These residuals and their standardized counterparts, computed by dividing the residuals by the estimated error standard deviations (SDs), are used for model checking.

3.4 Specifying f and g_j functions

Specifying the calibration function f and the variance functions g_j is a part of the model-building exercise, which is no different from what we ordinarily do in regression modeling. Therefore, we proceed just the way we proceed to build a parametric regression model. In particular, this involves relying on graphical techniques, such as a scatterplot of measurements from the two methods and the residual plot, to come up with preliminary forms for these functions. As we are dealing with unpaired replicate measurements here, we can plot either randomly formed measurement pairs (Y_i1k, Y_i2l) [14] or the paired averages (ȳ_i1, ȳ_i2) on the scatterplot.

4 A simulation study

Our next task is to use Monte Carlo simulation to evaluate the finite sample performance of the three model fitting methods LA, GH, and MA on four performance measures:

biases of parameter estimators, their mean squared errors (MSEs), coverage probabilities of 95% confidence intervals, and the type I error probability for the 5% level likelihood ratio test of homoscedasticity. We focus on f(b, β) = β_0 + β_1 b and g_j(b, δ_j) = b^δ_j, j = 1, 2, because these are the functions we adopt later for the analysis of the cholesterol data. In addition, we assume a balanced design with n_ij ∈ {2, 3} replications per method; let δ_1 = δ_2 = δ ∈ {0, 0.9, 1, 1.1}; and take m = 50. Table 1 summarizes the actual parameter settings used. These are also motivated by the cholesterol data. The simulation study involves simulating data from the true model (6); computing point and interval estimates and performing the test of homoscedasticity using each of the three fitting methods; repeating the entire process 500 times; and obtaining the desired estimates of the performance measures. For greater accuracy, inference on the variance components (τ^2, ψ^2, σ_1^2, and σ_2^2) is performed on the log scale. The GH method uses M = 30 nodes.

Table 2 presents the estimated biases of the point estimators in the case of n_ij = 2. The biases for β_1 and µ are negligible relative to their true values for all settings. The biases are small and negative for log τ^2 and log ψ^2. The situation is less clear for the other parameters, as the biases of their estimators may be positive or negative, albeit their magnitudes are relatively small. Moreover, there is no method that produces the smallest bias for all parameters. The same qualitative conclusions hold in the case of n_ij = 3 (results not presented).

Tables 3 and 4 present estimated efficiencies of LA and MA relative to GH, defined as MSE_LA/MSE_GH and MSE_MA/MSE_GH, respectively. We see that the efficiency depends on the parameter, the level of heteroscedasticity, and the number of replications. The entries in Table 3 lie between 0.98 and 1.29, implying that GH is more accurate than LA.
This finding is not unexpected because LA is a special case of GH with one node. Although there is no difference between the two methods in the homoscedastic case, GH's gain in accuracy can be substantial when the extent of heteroscedasticity is high and there are three replications.

The conclusion is less clear-cut when we look at Table 4, but 90% of the entries lie between 0.77 and 1.10, implying that in a vast majority of cases MA produces either a nearly as accurate or a more accurate estimate than GH. There are a few entries greater than 1.10, but it is hard to see a simple pattern among them except that most occur when δ > 0.

Table 5 presents estimated coverage probabilities of 95% confidence intervals in the case of n_ij = 2. All entries are quite close to 95% when δ = 0. But the performance of the LA and GH methods degrades as δ increases, and it is not acceptable when δ ≥ 1, especially in settings 2 and 3. On the other hand, the MA method maintains its coverage probability reasonably close to 95% in all cases. The results for n_ij = 3 are omitted as they lead to the same conclusion.

Table 6 presents estimated type I error probabilities for the 5% level likelihood ratio test for homoscedasticity. All entries are reasonably close to 5%, implying that there is little to distinguish between the three estimation methods on this criterion.

Taken together, these findings allow us to conclude that MA is the best choice among the three model fitting methods. Not only is it the simplest to implement, but it also generally produces the most accurate point and interval estimates. Besides, the test of homoscedasticity based on it has type I error rates close to the nominal level. We also see that m = 50 subjects is large enough for acceptable accuracy of this method. The two numerical approximation methods LA and GH do not perform as well with m = 50.

5 Evaluation of similarity and agreement

Evaluation of similarity of measurement methods and agreement between them are two key goals in a method comparison study. This evaluation is conducted by performing inference on measures of similarity and agreement, which are functions of the model parameters. Now we take up the task of obtaining these measures and performing inference on them under

(6) as the data model. The task includes examining biases and precisions of the methods and also the marginal and joint distributions of their measurements. These entities are easy to define and interpret when the model is linear in b. Therefore, instead of the original model (6) we work with its approximation (14), wherein b appears linearly. Further, to make the exposition simpler, we use the companion model of (14) for Ỹ. It can be written as

Ỹ_1 = b + b_1 + e_1,  Ỹ_2 ≈ d(b*, β) + f'(b*, β) b + b_2 + e_2;
b_j ~ N_1(0, ψ^2),  e_j ~ N_1(0, σ_j^2 g_j^2(b*, δ_j)),  b ~ N_1(µ, τ^2),  j = 1, 2,   (18)

where d is defined in (15) and b*, a fixed quantity close to b, serves as a proxy for the magnitude of measurement. As before, by marginalizing over (b, b_1, b_2) we see that Ỹ is approximately distributed as N_2(E[Ỹ], var[Ỹ]) with

E[Ỹ] ≈ (µ, d(b*, β) + f'(b*, β) µ)^T   (19)

and

var[Ỹ] ≈ [ τ^2 + ψ^2 + σ_1^2 g_1^2(b*, δ_1)    τ^2 f'(b*, β)
            τ^2 f'(b*, β)                        τ^2 {f'(b*, β)}^2 + ψ^2 + σ_2^2 g_2^2(b*, δ_2) ].   (20)

For the difference D = Ỹ_1 − Ỹ_2, it follows that D is approximately distributed as N_1(E[D], var[D]), where

E[D] ≈ −d(b*, β) + (1 − f'(b*, β)) µ,
var[D] ≈ τ^2 (1 − f'(b*, β))^2 + 2ψ^2 + σ_1^2 g_1^2(b*, δ_1) + σ_2^2 g_2^2(b*, δ_2).   (21)

Both distributions depend on b* ∈ B, which we take as the observed range of the data.

5.1 Measures of similarity

Measures of similarity compare features of the marginal distributions of the methods, such as biases and precisions. From models (1) and (18) for Ỹ_2 we see that the intercept d(b*, β) and the

slope f'(b*, β) can be respectively interpreted as the fixed bias and the proportional bias of the test method. These biases depend on b* unless f(b, β) = β_0 + β_1 b, in which case d(b*, β) = β_0 and f'(b*, β) = β_1. If the slope is one, the methods have the same scale. If, in addition, the intercept is also zero, the methods have the same true values.

Precisions of the methods can be compared via their ratio, but this requires the methods to have the same scale [9]. The scale of the test method can be made the same as the reference method by dividing Ỹ_2 by the slope f'(b*, β). The precision ratio then becomes

λ(b*) = {f'(b*, β)}^2 σ_1^2 g_1^2(b*, δ_1) / {σ_2^2 g_2^2(b*, δ_2)},  b* ∈ B.   (22)

This ratio depends on b* unless the model is homoscedastic and f is linear in b. If λ(b*) < 1, the reference method is more precise at b* than the test method, and vice versa. While this ratio compares var[e_j], often a comparison of var[b_j + e_j] is of interest [9, p. 115]. This can be done by replacing σ_j^2 g_j^2(b*, δ_j) in (22) with ψ^2 + σ_j^2 g_j^2(b*, δ_j), j = 1, 2.

5.2 Measures of agreement

In contrast to the measures of similarity that compare marginal distributions of the methods, the measures of agreement essentially look at their joint distribution to quantify how close the individual measurements are. The methods agree perfectly well if their measurements are identical. Potentially one can directly evaluate agreement between (Ỹ_1, Ỹ_2). In this case, the effect of any fixed or proportional bias that may exist in the test method manifests in the agreement measures, which are functions of parameters of the bivariate distribution of (Ỹ_1, Ỹ_2). However, this approach is appropriate only if the similarity evaluation does not show any proportional bias in the test method, because in this case the methods are on the same scale, and hence are comparable. Otherwise, it is more appropriate to recalibrate the test method to make its scale the same as the reference method prior to evaluating agreement.
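For instance, with the power variance functions g_j(b, δ_j) = b^δ_j and hypothetical parameter values (including a linear f, so that f'(b*, β) = β_1), the precision ratio (22) can be traced over a grid of magnitudes:

```python
import numpy as np

# Hypothetical parameter values, for illustration only.
sig2_1, sig2_2 = 2.25, 4.0
delta1, delta2 = 0.5, 0.6
beta1 = 1.05                                   # slope f'(b*, beta) for linear f

def precision_ratio(bstar):
    """Precision ratio lambda(b*) in (22) with power variance functions.

    lambda(b*) < 1 means the reference method is the more precise one at b*.
    """
    num = beta1 ** 2 * sig2_1 * bstar ** (2 * delta1)
    den = sig2_2 * bstar ** (2 * delta2)
    return num / den

grid = np.linspace(50.0, 370.0, 5)             # proxy for the range B
lam = precision_ratio(grid)
```

Because δ_2 > δ_1 in this hypothetical setting, λ(b*) decreases in b*: the test method loses precision faster than the reference method as the magnitude grows.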

This is just like the rescaling done in (22) before comparing precisions. Taking the rescaling a step further, one can additionally remove any fixed bias that may be present in the test method besides the proportional bias by transforming its measurements as

Ỹ*_2 = {Ỹ_2 − d(b*, β)} / f'(b*, β).   (23)

The recalibrated test method has the same true value as the reference method. It follows from (18)-(20) that

(Ỹ_1, Ỹ*_2)^T ≈ N_2( (µ, µ)^T, [ τ^2 + ψ^2 + σ_1^2 g_1^2(b*, δ_1)    τ^2
                                   τ^2    τ^2 + {ψ^2 + σ_2^2 g_2^2(b*, δ_2)}/{f'(b*, β)}^2 ] ).   (24)

Moreover, the difference D* = Ỹ_1 − Ỹ*_2 ≈ N_1(0, var[D*]), where

var[D*] ≈ ψ^2 + σ_1^2 g_1^2(b*, δ_1) + {ψ^2 + σ_2^2 g_2^2(b*, δ_2)}/{f'(b*, β)}^2.   (25)

The measures of agreement in this case are functions of parameters of the bivariate distribution of (Ỹ_1, Ỹ*_2). While the recalibration is likely to make the test method agree more with the reference method, measuring their agreement is appropriate in the first place only if the two methods are on the same scale.

The expression for any measure of agreement between either (Ỹ_1, Ỹ_2) or (Ỹ_1, Ỹ*_2) can be obtained by simply taking the definition of the measure and plugging in the relevant parameters from their respective bivariate distributions. This approach works for any measure of agreement available in the literature. For example, the two versions of the agreement measure concordance correlation coefficient (CCC) [2] are

CCC(b*) ≈ 2 cov[Ỹ_1, Ỹ_2] / ({E[D]}^2 + var[Ỹ_1] + var[Ỹ_2]),
CCC*(b*) ≈ 2 cov[Ỹ_1, Ỹ*_2] / (var[Ỹ_1] + var[Ỹ*_2]),  b* ∈ B,

where the moments are from (20), (21), and (24). The CCC lies in [−1, 1], and the larger its positive value, the better the agreement. The starred version of the measure is for the recalibrated data.
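Plugging the moments (20)-(21) into the CCC expression is mechanical; the sketch below does so for linear f and power variance functions, with hypothetical parameter values chosen only for illustration:

```python
import numpy as np

# Hypothetical parameter values, for illustration only.
mu, tau2, psi2 = 200.0, 1600.0, 4.0
sig2_1, sig2_2 = 2.25, 4.0
delta1, delta2 = 0.5, 0.5
beta0, beta1 = 2.0, 1.05

def ccc(bstar):
    """CCC(b*) from the moments in (20)-(21), linear f, power variance."""
    d, fp = beta0, beta1                    # d(b*, beta), f'(b*, beta)
    g21 = bstar ** (2 * delta1)             # g_1^2(b*, delta_1)
    g22 = bstar ** (2 * delta2)             # g_2^2(b*, delta_2)
    var1 = tau2 + psi2 + sig2_1 * g21       # var[Y~_1]
    var2 = tau2 * fp ** 2 + psi2 + sig2_2 * g22   # var[Y~_2]
    cov12 = tau2 * fp                       # cov[Y~_1, Y~_2]
    mean_diff = -d + (1 - fp) * mu          # E[D] in (21)
    return 2 * cov12 / (mean_diff ** 2 + var1 + var2)

vals = np.array([ccc(b) for b in (100.0, 200.0, 300.0)])
```

With these hypothetical values the error variances grow with b*, so the agreement, as measured by CCC, deteriorates over the measurement range.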

The total deviation index (TDI) [3, 4] is another agreement measure. It is defined as the 100p-th percentile of the absolute difference in measurements, where $p$ is a specified large probability, typically 0.80 or more. The two versions of TDI can be written as

$$\mathrm{TDI}(b, p) = 100p\text{-th percentile of } |D| = \mathrm{sd}[D] \left\{ \chi_1^2\!\left(p, \left(E[D]/\mathrm{sd}[D]\right)^2\right) \right\}^{1/2},$$
$$\mathrm{TDI}^*(b, p) = 100p\text{-th percentile of } |D^*| = \mathrm{sd}[D^*] \left\{ \chi_1^2(p, 0) \right\}^{1/2}, \quad b \in B, \qquad (26)$$

where the moments are from (21) and (25), and $\chi_1^2(p, \Delta)$ denotes the 100p-th percentile of a noncentral chi-squared distribution with one degree of freedom and noncentrality parameter $\Delta$. When the noncentrality parameter is zero, $\{\chi_1^2(p, 0)\}^{1/2} = z\big((1 + p)/2\big)$, the $100(1 + p)/2$-th percentile of a standard normal distribution. The TDI is a non-negative measure, and the smaller its value, the better the agreement. We focus on TDI with $p = 0.90$ for the illustration here.

5.3 Inference on measures of similarity and agreement

All measures of similarity and agreement are functions of $\theta$ and $b$. Let $\varphi$ denote any such measure and $\varphi(b)$ be its value at $b \in B$. The measure is assumed to be a scalar quantity. Replacing $\theta$ by $\hat{\theta}$ in its expression gives its ML estimator $\hat{\varphi}(b)$. By the delta method [31], when $m$ is large, $\hat{\varphi}(b)$ is approximately distributed as

$$\mathcal{N}_1\!\left( \varphi(b),\; G'(b)\, I^{-1} G(b) \right),$$

where $G(b) = (\partial/\partial\theta)\varphi(b)\big|_{\theta = \hat{\theta}}$ can be computed numerically. Thus, an approximate $100(1-\alpha)\%$ two-sided pointwise confidence interval for $\varphi(b)$ on a grid of values of $b \in B$ can be computed as $\hat{\varphi}(b) \pm z_{1-\alpha/2}\, \{G'(b) I^{-1} G(b)\}^{1/2}$. One-sided pointwise bands for $\varphi(b)$ can be computed by replacing $z_{1-\alpha/2}$ with $z_{1-\alpha}$ and using only the lower limit or the upper limit of this interval. If $\varphi$ is a measure of similarity, a two-sided confidence interval for $\varphi$ is of interest. On the other hand, if $\varphi$ is a measure of agreement, an appropriate one-sided confidence bound is of interest. In particular, if small values of $\varphi$ imply good agreement (e.g., TDI), we need an upper bound.
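The TDI computation in (26) can be sketched numerically with SciPy's noncentral chi-square quantile function; the function and its inputs below are illustrative helpers, not part of the paper's software.

```python
from scipy.stats import chi2, ncx2, norm

def tdi(p, mean_d, sd_d):
    """100p-th percentile of |D| for D ~ N(mean_d, sd_d**2), per (26)."""
    nc = (mean_d / sd_d) ** 2
    # Zero noncentrality reduces to the central chi-square case, i.e.,
    # sd_d * z((1 + p)/2), as noted in the text.
    q = chi2.ppf(p, df=1) if nc == 0 else ncx2.ppf(p, df=1, nc=nc)
    return sd_d * q ** 0.5

t_star = tdi(0.90, mean_d=0.0, sd_d=7.15)   # = 7.15 * norm.ppf(0.95), about 11.76
t_bias = tdi(0.90, mean_d=2.0, sd_d=7.15)   # a nonzero mean difference inflates TDI
```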

Whereas, if large values of $\varphi$ imply good agreement (e.g., CCC), we need a lower bound. These bounds and intervals can be made more accurate by computing them after applying a suitable normalizing transformation to the measure (e.g., the log transformation for TDI or Fisher's z-transformation for CCC) and back-transforming the results to the original scale. These confidence bounds and intervals are used to evaluate similarity and agreement over the measurement range $B$.

6 Application to cholesterol data

The cholesterol data come from a trial conducted at Virginia Commonwealth University to compare two methods for assaying serum cholesterol (mg/dL) [12]. One is Cobas Bio, an assay standardized by the Centers for Disease Control, which serves as the reference method (method 1). The other is Ektachem 700, a routine laboratory analyzer, which serves as the test method (method 2). There are 100 subjects in this study. Measurements from each assay are replicated ten times on every subject. The replications from an assay are exchangeable, and the replications across the assays are unpaired. The measurements range from 45 to 372 mg/dL. Figure 1 shows a trellis plot of the data from [23]. We see that Ektachem's measurements tend to be larger and have higher within-subject variation than Cobas Bio's. This variation increases with cholesterol level for both assays. There is also evidence of assay-subject interaction. In our notation, $Y_{ijk}$ represents the $k$-th replicate measurement of serum cholesterol obtained by the $j$-th assay from the blood sample of the $i$-th subject, $i = 1, \ldots, 100$, $j = 1, 2$, and $k = 1, \ldots, 10$.

The first step in the analysis is modeling of the data. This involves specifying parametric forms for the calibration function $f$ and the variance functions $g_j$, $j = 1, 2$, in (6). Figure 2 shows a scatter plot of $(y_{i1}, y_{i2})$ pairs and also assay-specific plots of the logarithm of the SD of

a subject's ten measurements against the logarithm of their mean. The points in each plot cluster around a straight line, suggesting $f(b, \beta) = \beta_0 + \beta_1 b$ and $g_j(b, \delta_j) = b^{\delta_j}$, $j = 1, 2$, as plausible choices. The same choice for $g_j$ is suggested when residuals from a homoscedastic fit are analyzed. With these $f$ and $g_j$, the resulting model (6) has nine parameters. Table 7 summarizes the estimates of these parameters and their SEs obtained using the three model fitting methods described in Section 3. Although the three methods produce practically the same results, there is a slight difference between the estimates of $\beta_0$ obtained by the numerical approximation methods (LA and GH) and the model approximation method (MA). Nevertheless, the difference is not large enough to be of concern. Besides, the simulations in Section 4 show that the MA method is generally more accurate than the other two methods. Therefore, only the results from the MA method are presented hereafter.

Figure 2(d) shows a plot of standardized residuals against the fitted values. It has no discernible pattern. This, together with additional model diagnostics suggested in [10] (not presented here), allows us to conclude that the fit of the assumed model is adequate. The p-value for the likelihood ratio test of the null hypothesis of homoscedasticity is practically zero, confirming nonconstant error variances. Plugging the parameter estimates into (18)-(20) gives the fitted distribution of $(\tilde{Y}_1, \tilde{Y}_2)$ as a bivariate normal whose variances are of the form $c_1 + c_2\, b^{1.98}$, with constants determined by the estimates. This distribution depends on the cholesterol level $b \in B = [45, 372]$ mg/dL because of heteroscedasticity. Notice that the contribution of the error variation to the total variation in the response is swamped by the other components of variation. In particular, this makes the estimated correlation between $\tilde{Y}_1$ and $\tilde{Y}_2$ very high throughout $B$. The second step in the analysis is evaluation of similarity.
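The pointwise interval estimates reported in this section follow the delta-method recipe of Section 5.3. A minimal sketch, with the gradient $G$ obtained by central differences and purely hypothetical inputs:

```python
import numpy as np
from scipy.stats import norm

def delta_method_ci(phi, theta_hat, inv_info, alpha=0.05, eps=1e-6):
    """Approximate 100(1 - alpha)% two-sided CI for a scalar measure
    phi(theta), using a numerically differentiated gradient G and the
    inverse information matrix inv_info, as in Section 5.3."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    G = np.empty_like(theta_hat)
    for k in range(theta_hat.size):
        step = np.zeros_like(theta_hat)
        step[k] = eps
        G[k] = (phi(theta_hat + step) - phi(theta_hat - step)) / (2 * eps)
    se = float(np.sqrt(G @ inv_info @ G))
    z = norm.ppf(1 - alpha / 2)
    est = float(phi(theta_hat))
    return est - z * se, est + z * se

# Toy check: phi(theta) = theta[0]**2 at theta_hat = 2 with var(theta_hat) = 0.25,
# giving a CI centered at 4 with delta-method standard error 2.
lo, hi = delta_method_ci(lambda t: t[0] ** 2, [2.0], np.array([[0.25]]))
```

A one-sided bound follows by substituting $z_{1-\alpha}$ for $z_{1-\alpha/2}$ and keeping one limit, exactly as the text describes.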
The estimate of the proportional bias $\beta_1$ is 1.02 (SE = 0.01), and its 95% confidence interval is [1.00, 1.04]. Thus, there

is evidence of a slight upward proportional bias of up to 4% in the Ektachem assay, but the evidence is borderline. Further, the estimate of the fixed bias $\beta_0$ is 2.17 (SE = 2.20), and its 95% confidence interval is $[-2.14, 6.48]$. Although this interval covers zero, it also provides evidence of a small fixed bias. These findings are consistent with the observation that Ektachem's measurements tend to be larger than Cobas Bio's.

Figure 3 presents the estimate and the 95% two-sided pointwise confidence band for the precision ratio $\lambda$, defined in (22), as a function of the cholesterol level $b$. The entire band lies below one. Notwithstanding the fact that this band is pointwise rather than simultaneous, it does indicate that Cobas Bio is more precise than Ektachem. The former is estimated to be about 40% more precise than the latter. To summarize, we find that the two assays cannot be regarded as similar. Not only do they not have the same true values, but Cobas Bio is also more precise than Ektachem.

The third step in the analysis is evaluation of agreement. Since there is evidence of a slight bias in Ektachem, we use (23) to recalibrate its measurement $\tilde{Y}_2$ as $\tilde{Y}_2^*$ to make its true value the same as Cobas Bio's. The estimated transformation is $\tilde{Y}_2^* = (\tilde{Y}_2 - 2.17)/1.02$. Using (25), the fitted distribution of the difference $D^*$ after the transformation is normal with mean zero and a variance of the form $c_1 + c_2\, b^{1.98}$. The SD of this distribution ranges upward from 7.15 over $B$. Next, we perform inference on the agreement measure TDI (with $p = 0.90$), given by (26), as described in Section 5. Figure 3 shows its 95% pointwise upper confidence bound as a function of the cholesterol level $b$. This bound increases from 13.6 to 16.7 as the cholesterol level increases from 45 to 372 mg/dL. The leftmost bound of 13.6 shows that 90% of the differences in measurements from the assays when the true value is 45 fall within $\pm 13.6$. Relative to the true value, this difference is too large to be deemed acceptable.
On the other hand, the rightmost bound of 16.7 shows that 90% of the differences in measurements from the assays when the true value is 372 fall within $\pm 16.7$. This difference may be considered acceptable. Thus, we may conclude that the

assays, after the recalibration, have satisfactory agreement for large cholesterol values but not for small ones. Obviously, this means that Cobas Bio and the recalibrated Ektachem do not agree well enough to be considered interchangeable. In fact, we know from the evaluation of similarity that Cobas Bio is superior to Ektachem by virtue of being more precise.

It may be noted that if Ektachem is not recalibrated prior to the agreement evaluation, then the 95% pointwise upper confidence bound for TDI ranges from 17.2 to 19.8 over $B$. These bounds are a bit larger than before because of Ektachem's bias, and hence imply a somewhat worse level of agreement between the two assays.

To see the effect of ignoring heteroscedasticity, we repeat the analysis assuming constant error variances, i.e., setting the heteroscedasticity parameters in (6) to zero. The estimate of TDI and its 95% confidence bound come out to be 12.9 and 14.5, respectively. Although these quantities do not depend on the cholesterol value, they are not too far from their heteroscedastic counterparts, which range from 11.8 to 15.3 and from 13.6 to 16.7, respectively. This happens because the error variation, albeit nonconstant, is swamped by the other variance components, which do not change with the cholesterol value. Nevertheless, it is apparent that the homoscedastic model underestimates the extent of agreement for small cholesterol values and overestimates it for large cholesterol values.

7 Discussion

This article presents a measurement error model for replicated method comparison data that can incorporate heteroscedasticity of errors as well as nonlinear relationships between the true values of the measurement methods. It also shows how the model can be used to evaluate similarity and agreement between the methods. A key advantage of the model is that it allows one method to be recalibrated against the other, either linearly or nonlinearly, to ensure that

their true values are identical. Here we focused on the comparison of two methods and did not include covariates, but the model can be extended to accommodate more than two methods as well as covariates. We also assumed normality for the random effects and error distributions. The model can deal with skewness and heavy-tailedness in the data by replacing the normality assumption with generalizations of the normal, such as skew-normal and skew-$t$ distributions. We, however, require the measurements to be replicated to avoid identifiability issues. The model also requires the practitioner to specify parametric forms for the calibration and variance functions, as one ordinarily does in regression modeling. Further research is needed to allow these functions to be specified semiparametrically or even nonparametrically.

A potential limitation of our model (6), or its linearized version (14), is that the interaction effects of the two methods have the same variance even though the methods may have different scales. Without the equal-variance assumption, the model is not identifiable in the linear calibration case. If this assumption is a concern, it can be addressed to some extent by replacing the interaction effect $b_{i2}$ of the test method in (14) with $f'(b_i, \beta)\, b_{i2}$, making the new effect's variance different from that of the reference method. In the linear calibration case, this means $b_{i2}$ is replaced with $\beta_1 b_{i2}$. The change in (14) can be easily propagated through the subsequent steps of the analysis to obtain the analysis based on the new model.

Acknowledgements

The authors thank Professor Subharup Guha for asking questions that spurred this work. They also thank the reviewers for their constructive comments. Thanks are also due to Professor Vernon Chinchilli for providing the cholesterol data, and to the Texas Advanced Computing Center at The University of Texas at Austin for providing HPC resources for the simulation studies.

References

1. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; i.
2. Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989; 45. Corrections: 2000; 56.
3. Lin LI. Total deviation index for measuring individual agreement with applications in laboratory performance and bioequivalence. Statistics in Medicine 2000; 19.
4. Choudhary PK, Nagaraja HN. Tests for assessment of agreement using probability criteria. Journal of Statistical Planning and Inference 2007; 137.
5. Haber MJ, Barnhart HX. A general approach to evaluating agreement between two observers or methods of measurement from quantitative data with replicated measurements. Statistical Methods in Medical Research 2008; 17.
6. Pan Y, Haber M, Gao J, Barnhart HX. A new permutation-based method for assessing agreement between observers making replicated quantitative readings. Statistics in Medicine 2012; 31.
7. Barnhart HX, Haber MJ, Lin LI. An overview on assessing agreement with continuous measurement. Journal of Biopharmaceutical Statistics 2007; 17.
8. Lin LI, Hedayat AS, Wu W. Statistical Tools for Measuring Agreement. Springer: New York.
9. Dunn G. Statistical Evaluation of Measurement Errors. 2nd edn. John Wiley: Chichester, UK.
10. Pinheiro JC, Bates DM. Mixed-Effects Models in S and S-PLUS. Springer: New York.
11. Cheng C, Van Ness JW. Statistical Regression with Measurement Error. John Wiley: Chichester, UK.
12. Chinchilli VM, Martel JK, Kumanyika S, Lloyd T. A weighted concordance correlation coefficient for repeated measurement designs. Biometrics 1996; 52.
13. Choudhary PK. A tolerance interval approach for assessment of agreement in method comparison studies with repeated measurements. Journal of Statistical Planning and Inference 2008; 138.
14. Carstensen B, Simpson J, Gurrin LC. Statistical models for assessing agreement in method comparison studies with replicate measurements. The International Journal of Biostatistics 2008; 4.
15. Roy A. An application of linear mixed effects model to assess the agreement between two methods with replicated observations. Journal of Biopharmaceutical Statistics 2009; 19.
16. Carrasco JL, King TS, Chinchilli VM. The concordance correlation coefficient for repeated measures estimated by variance components. Journal of Biopharmaceutical Statistics 2009; 19.
17. Hawkins DM. Diagnostics for conformity of paired quantitative measurements. Statistics in Medicine 2002; 21.
18. Dunn G, Roberts C. Modelling method comparison data. Statistical Methods in Medical Research 1999; 8.
19. Carstensen B. Comparing methods of measurement: Extending the LoA by regression. Statistics in Medicine 2010b; 29.
20. Alanen E. Everything all right in method comparison studies? Statistical Methods in Medical Research 2012; 21.
21. Kelly GE. Use of structural equations model in assessing the reliability of a new measurement technique. Applied Statistics 1985; 34.
22. Bland JM, Altman DG. Measuring agreement in method comparison studies. Statistical Methods in Medical Research 1999; 8.
23. Nawarathna LS, Choudhary PK. Measuring agreement in method comparison studies with heteroscedastic measurements. Statistics in Medicine 2013; 32.
24. Linnet K. Estimation of the linear relationship between the measurements of two methods with proportional errors. Statistics in Medicine 1990; 9.
25. Meijer E, Mooijaart A. Factor analysis with heteroscedastic errors. British Journal of Mathematical and Statistical Psychology 1996; 49.
26. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org
27. Lange K. Numerical Analysis for Statisticians. 2nd edn. Springer: New York.
28. Liu Q, Pierce DA. A note on Gauss-Hermite quadrature. Biometrika 1994; 81.
29. Davidian M, Giltinan DM. Nonlinear Models for Repeated Measurement Data. Chapman & Hall/CRC Press: Boca Raton, FL.
30. Stroup WW. Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. CRC Press: Boca Raton, FL.
31. Lehmann EL. Elements of Large-Sample Theory. Springer: New York.

Table 1: Sets of parameter values used for the simulation study.

    θ                                              Set 1          Set 2          Set 3
    (β0, β1)                                       (10, 1.2)      (5, 1.1)       (0, 1)
    (µ, log(τ²), log(ψ²))                          (185, 8, 3)    (185, 8, 3)    (185, 8, 3)
    (log(σ1²), log(σ2²)):
      homoscedastic model, δ = 0                   (1, 2)         (1, 1.25)      (1, 1)
      heteroscedastic model, δ = (0.9, 1, 1.1)     (−9, −8)       (−9, −8.75)    (−9, −9)


Interaction effects for continuous predictors in regression modeling Interaction effects for continuous predictors in regression modeling Testing for interactions The linear regression model is undoubtedly the most commonly-used statistical model, and has the advantage

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

PQL Estimation Biases in Generalized Linear Mixed Models

PQL Estimation Biases in Generalized Linear Mixed Models PQL Estimation Biases in Generalized Linear Mixed Models Woncheol Jang Johan Lim March 18, 2006 Abstract The penalized quasi-likelihood (PQL) approach is the most common estimation procedure for the generalized

More information

 ± σ A ± t A ˆB ± σ B ± t B. (1)

 ± σ A ± t A ˆB ± σ B ± t B. (1) Version 05035 Introduction Combining measurements which have theoretical uncertainties is a delicate matter. Indeed, we are often in the position in which sufficient information is not provided to form

More information

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Noname manuscript No. (will be inserted by the editor) A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Mai Zhou Yifan Yang Received: date / Accepted: date Abstract In this note

More information

AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY

AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY Econometrics Working Paper EWP0401 ISSN 1485-6441 Department of Economics AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY Lauren Bin Dong & David E. A. Giles Department of Economics, University of Victoria

More information

On Fitting Generalized Linear Mixed Effects Models for Longitudinal Binary Data Using Different Correlation

On Fitting Generalized Linear Mixed Effects Models for Longitudinal Binary Data Using Different Correlation On Fitting Generalized Linear Mixed Effects Models for Longitudinal Binary Data Using Different Correlation Structures Authors: M. Salomé Cabral CEAUL and Departamento de Estatística e Investigação Operacional,

More information

Statistical View of Least Squares

Statistical View of Least Squares Basic Ideas Some Examples Least Squares May 22, 2007 Basic Ideas Simple Linear Regression Basic Ideas Some Examples Least Squares Suppose we have two variables x and y Basic Ideas Simple Linear Regression

More information

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis Lecture 3 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Estimating terminal half life by non-compartmental methods with some data below the limit of quantification

Estimating terminal half life by non-compartmental methods with some data below the limit of quantification Paper SP08 Estimating terminal half life by non-compartmental methods with some data below the limit of quantification Jochen Müller-Cohrs, CSL Behring, Marburg, Germany ABSTRACT In pharmacokinetic studies

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)

More information

Dynamic System Identification using HDMR-Bayesian Technique

Dynamic System Identification using HDMR-Bayesian Technique Dynamic System Identification using HDMR-Bayesian Technique *Shereena O A 1) and Dr. B N Rao 2) 1), 2) Department of Civil Engineering, IIT Madras, Chennai 600036, Tamil Nadu, India 1) ce14d020@smail.iitm.ac.in

More information

On Modifications to Linking Variance Estimators in the Fay-Herriot Model that Induce Robustness

On Modifications to Linking Variance Estimators in the Fay-Herriot Model that Induce Robustness Statistics and Applications {ISSN 2452-7395 (online)} Volume 16 No. 1, 2018 (New Series), pp 289-303 On Modifications to Linking Variance Estimators in the Fay-Herriot Model that Induce Robustness Snigdhansu

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

LECTURE 2 LINEAR REGRESSION MODEL AND OLS

LECTURE 2 LINEAR REGRESSION MODEL AND OLS SEPTEMBER 29, 2014 LECTURE 2 LINEAR REGRESSION MODEL AND OLS Definitions A common question in econometrics is to study the effect of one group of variables X i, usually called the regressors, on another

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Statistical Data Analysis Stat 3: p-values, parameter estimation

Statistical Data Analysis Stat 3: p-values, parameter estimation Statistical Data Analysis Stat 3: p-values, parameter estimation London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515 Glen Cowan Physics Department Royal Holloway,

More information

Bias Variance Trade-off

Bias Variance Trade-off Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]

More information

POLI 8501 Introduction to Maximum Likelihood Estimation

POLI 8501 Introduction to Maximum Likelihood Estimation POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,

More information

Robust covariance estimator for small-sample adjustment in the generalized estimating equations: A simulation study

Robust covariance estimator for small-sample adjustment in the generalized estimating equations: A simulation study Science Journal of Applied Mathematics and Statistics 2014; 2(1): 20-25 Published online February 20, 2014 (http://www.sciencepublishinggroup.com/j/sjams) doi: 10.11648/j.sjams.20140201.13 Robust covariance

More information

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course.

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course. Name of the course Statistical methods and data analysis Audience The course is intended for students of the first or second year of the Graduate School in Materials Engineering. The aim of the course

More information

Regression Models - Introduction

Regression Models - Introduction Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent

More information

On Estimating Residual Heterogeneity in Random-Effects Meta-Regression: A Comparative Study

On Estimating Residual Heterogeneity in Random-Effects Meta-Regression: A Comparative Study Journal of Statistical Theory and Applications, Vol. 12, No. 3 September 2013), 253-265 On Estimating Residual Heterogeneity in Random-Effects Meta-Regression: A Comparative Study Thammarat Panityaul 1,

More information

Bayesian inference for factor scores

Bayesian inference for factor scores Bayesian inference for factor scores Murray Aitkin and Irit Aitkin School of Mathematics and Statistics University of Newcastle UK October, 3 Abstract Bayesian inference for the parameters of the factor

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

arxiv: v1 [physics.data-an] 3 Jun 2008

arxiv: v1 [physics.data-an] 3 Jun 2008 arxiv:0806.0530v [physics.data-an] 3 Jun 008 Averaging Results with Theory Uncertainties F. C. Porter Lauritsen Laboratory for High Energy Physics California Institute of Technology Pasadena, California

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields

Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields 1 Introduction Jo Eidsvik Department of Mathematical Sciences, NTNU, Norway. (joeid@math.ntnu.no) February

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

1 One-way analysis of variance

1 One-way analysis of variance LIST OF FORMULAS (Version from 21. November 2014) STK2120 1 One-way analysis of variance Assume X ij = µ+α i +ɛ ij ; j = 1, 2,..., J i ; i = 1, 2,..., I ; where ɛ ij -s are independent and N(0, σ 2 ) distributed.

More information

Generalized Linear Models for Non-Normal Data

Generalized Linear Models for Non-Normal Data Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

Don t be Fancy. Impute Your Dependent Variables!

Don t be Fancy. Impute Your Dependent Variables! Don t be Fancy. Impute Your Dependent Variables! Kyle M. Lang, Todd D. Little Institute for Measurement, Methodology, Analysis & Policy Texas Tech University Lubbock, TX May 24, 2016 Presented at the 6th

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 12: Frequentist properties of estimators (v4) Ramesh Johari ramesh.johari@stanford.edu 1 / 39 Frequentist inference 2 / 39 Thinking like a frequentist Suppose that for some

More information

The Simple Regression Model. Part II. The Simple Regression Model

The Simple Regression Model. Part II. The Simple Regression Model Part II The Simple Regression Model As of Sep 22, 2015 Definition 1 The Simple Regression Model Definition Estimation of the model, OLS OLS Statistics Algebraic properties Goodness-of-Fit, the R-square

More information

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8 Contents 1 Linear model 1 2 GLS for multivariate regression 5 3 Covariance estimation for the GLM 8 4 Testing the GLH 11 A reference for some of this material can be found somewhere. 1 Linear model Recall

More information

Inferences for Regression

Inferences for Regression Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In

More information

An Approximate Test for Homogeneity of Correlated Correlation Coefficients

An Approximate Test for Homogeneity of Correlated Correlation Coefficients Quality & Quantity 37: 99 110, 2003. 2003 Kluwer Academic Publishers. Printed in the Netherlands. 99 Research Note An Approximate Test for Homogeneity of Correlated Correlation Coefficients TRIVELLORE

More information

BIOS 6649: Handout Exercise Solution

BIOS 6649: Handout Exercise Solution BIOS 6649: Handout Exercise Solution NOTE: I encourage you to work together, but the work you submit must be your own. Any plagiarism will result in loss of all marks. This assignment is based on weight-loss

More information

Sample size determination for logistic regression: A simulation study

Sample size determination for logistic regression: A simulation study Sample size determination for logistic regression: A simulation study Stephen Bush School of Mathematical Sciences, University of Technology Sydney, PO Box 123 Broadway NSW 2007, Australia Abstract This

More information

Inference in Normal Regression Model. Dr. Frank Wood

Inference in Normal Regression Model. Dr. Frank Wood Inference in Normal Regression Model Dr. Frank Wood Remember We know that the point estimator of b 1 is b 1 = (Xi X )(Y i Ȳ ) (Xi X ) 2 Last class we derived the sampling distribution of b 1, it being

More information

Online publication date: 22 March 2010

Online publication date: 22 March 2010 This article was downloaded by: [South Dakota State University] On: 25 March 2010 Access details: Access Details: [subscription number 919556249] Publisher Taylor & Francis Informa Ltd Registered in England

More information

Parametric Empirical Bayes Methods for Microarrays

Parametric Empirical Bayes Methods for Microarrays Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions

More information

Information in a Two-Stage Adaptive Optimal Design

Information in a Two-Stage Adaptive Optimal Design Information in a Two-Stage Adaptive Optimal Design Department of Statistics, University of Missouri Designed Experiments: Recent Advances in Methods and Applications DEMA 2011 Isaac Newton Institute for

More information

Diagnostics can identify two possible areas of failure of assumptions when fitting linear models.

Diagnostics can identify two possible areas of failure of assumptions when fitting linear models. 1 Transformations 1.1 Introduction Diagnostics can identify two possible areas of failure of assumptions when fitting linear models. (i) lack of Normality (ii) heterogeneity of variances It is important

More information

R-squared for Bayesian regression models

R-squared for Bayesian regression models R-squared for Bayesian regression models Andrew Gelman Ben Goodrich Jonah Gabry Imad Ali 8 Nov 2017 Abstract The usual definition of R 2 (variance of the predicted values divided by the variance of the

More information

Homoskedasticity. Var (u X) = σ 2. (23)

Homoskedasticity. Var (u X) = σ 2. (23) Homoskedasticity How big is the difference between the OLS estimator and the true parameter? To answer this question, we make an additional assumption called homoskedasticity: Var (u X) = σ 2. (23) This

More information

Using Estimating Equations for Spatially Correlated A

Using Estimating Equations for Spatially Correlated A Using Estimating Equations for Spatially Correlated Areal Data December 8, 2009 Introduction GEEs Spatial Estimating Equations Implementation Simulation Conclusion Typical Problem Assess the relationship

More information

Approximate Inference for the Multinomial Logit Model

Approximate Inference for the Multinomial Logit Model Approximate Inference for the Multinomial Logit Model M.Rekkas Abstract Higher order asymptotic theory is used to derive p-values that achieve superior accuracy compared to the p-values obtained from traditional

More information

Quasi-likelihood Scan Statistics for Detection of

Quasi-likelihood Scan Statistics for Detection of for Quasi-likelihood for Division of Biostatistics and Bioinformatics, National Health Research Institutes & Department of Mathematics, National Chung Cheng University 17 December 2011 1 / 25 Outline for

More information

Section 3: Simple Linear Regression

Section 3: Simple Linear Regression Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information