Statistical Comparison of ROC Curves from Multiple Readers

MARIJKE SWAVING, MSc, HANS VAN HOUWELINGEN, PhD, FENNO P. OTTES, PhD, TON STEERNEMAN, PhD

Receiver operating characteristic (ROC) analysis is the commonly accepted method for comparing diagnostic imaging systems. In general, ROC studies are designed so that multiple readers read the same images and each image is presented by means of two different imaging systems. Statistical methods for comparing the ROC curves from one reader have been developed, but extending these methods to multiple readers is not straightforward. A new method of analysis is presented for the comparison of ROC curves from multiple readers. This method includes a nonparametric estimation of the variances and covariances between the various areas under the curves. The method described is more appropriate than the paired t test, because it also takes the case-sample variation into account.

Key words: ROC curves; ROC analysis; nonparametric estimation; area under the curve; comparison of ROC curves; EM algorithm. (Med Decis Making 1996;16)

Received June 22, 1994 from BAZIS, Experimental Developments, Leiden, The Netherlands (MS, FPO); the Department of Medical Statistics, Leiden University, Leiden, The Netherlands (JCVH); and the Department of Econometrics, University of Groningen, Groningen, The Netherlands (AGMS). Revision accepted for publication September 19. Dr. Steerneman was supported by the Stichting Verzekeringswetenschap and the C. R. Rao Foundation. Address correspondence and reprint requests to Dr. Ottes: HISCOM Research and Development, Schipholweg 97, 2316 XA Leiden, The Netherlands.

Receiver operating characteristic (ROC) studies are frequently used to compare the diagnostic qualities of imaging systems [1]. ROC studies can also be used to evaluate the performance of systems that support the diagnosis. In a conventional ROC study, several observers read all the selected images by means of each of the imaging systems that are to be compared. The diagnosis the observer makes depends on his or her confidence that the particular image shows an abnormality or a normal state, and on the confidence threshold he or she adopts. The diagnosis the observer makes can be correct or incorrect (table 1).

Table 1. The Types of Correct and Incorrect Diagnoses

For a given confidence threshold, the fraction of abnormal images that are correctly identified as abnormal is called the true-positive fraction (TPF = sensitivity), and the fraction of normal images that are correctly identified as normal is called the true-negative fraction (TNF = specificity). The false-positive fraction (FPF) and the false-negative fraction (FNF) are defined in the same way. For the actually normal and the actually abnormal images, probability distributions can be derived for the various states of truth. Two such distributions are shown in figure 1. The horizontal axis represents the confidence the observer has that a particular image was taken from a patient who actually has an abnormal condition. The confidence threshold separates the abnormal diagnoses from the normal diagnoses. The inherent discriminatory capacity of a system depends on the extent to which the probability distributions of perceived evidence from the various states of truth are separated or overlap. The sensitivity and the specificity of a diagnostic imaging system depend on the particular confidence threshold that the observer uses. According to this model, an image is read as positive if the observer's confidence in a positive diagnosis exceeds his or her confidence threshold.
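As a minimal sketch of these definitions, the following Python function computes the four fractions at a single confidence threshold. The rating lists, the function name, and the convention that an image is read as positive when its rating is at or above the threshold category are assumptions made for illustration; they are not taken from the study.

```python
def fractions_at_threshold(normal_ratings, abnormal_ratings, threshold):
    """Return TPF, FNF, FPF and TNF for one confidence threshold.

    Assumption: ratings are integer confidence categories, and an image is
    read as positive when its rating is >= threshold.
    """
    true_pos = sum(r >= threshold for r in abnormal_ratings)
    false_pos = sum(r >= threshold for r in normal_ratings)
    tpf = true_pos / len(abnormal_ratings)   # sensitivity
    fpf = false_pos / len(normal_ratings)    # 1 - specificity
    return {"TPF": tpf, "FNF": 1 - tpf, "FPF": fpf, "TNF": 1 - fpf}


# Hypothetical ratings on a five-point scale (not data from the study).
normal = [1, 1, 2, 3, 4]
abnormal = [2, 3, 4, 5, 5]
result = fractions_at_threshold(normal, abnormal, threshold=3)
```

Moving the threshold changes TPF and FPF together, and each threshold therefore yields one (FPF, TPF) pair.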
For a range of confidence thresholds, the ROC curve represents the link between the fraction of abnormal images that are correctly diagnosed as abnormal (TPF) and the fraction of normal images that are diagnosed as abnormal (FPF). Therefore, in an ROC study, an observer is asked to adopt several confidence thresholds at the same time, as shown in figure 2, and to classify each image in one of the resulting categories denoting his or her confidence. The diagnostic quality of an imaging system can be represented by an ROC curve. The ROC curve can be estimated as a curve through the operating points, which are obtained from the classifications of the images by a reader. Usually it is assumed that the functional form of an ROC curve is given by two underlying normal distributions. The ROC curve denotes the discriminatory capacity of the imaging system between the normal and the abnormal images. A higher ROC curve indicates greater discriminatory capacity.

FIGURE 1. Probability densities of an observer's confidence in a positive diagnosis for a particular diagnostic task.

FIGURE 2. The confidence-rating model, sketched for a five-category scale.

To compare imaging systems, the area under the ROC curve is usually taken as an index of diagnostic quality. Two imaging systems yield the same diagnostic quality when the areas under their ROC curves are equal. If the same images are read and/or the same reader reads the images presented by both imaging systems, the ROC curves (and the areas beneath them) will be correlated. Metz et al. [3] and Hanley and McNeil [4] have developed several methods for the comparison of two correlated ROC curves. These methods are described below. They cannot be generalized to multiple readers. A nonparametric approach to the comparison of the areas under two or more correlated ROC curves has been reported by DeLong et al. [5].

Until now, conclusions about the diagnostic qualities of imaging systems have been drawn in one of three ways: by comparing ROC curves per reader; by comparing pooled ROC curves or the averaged areas under the curves and subsequently performing a test for two correlated ROC curves; or by using a paired t test to analyze the areas under the curves for all readers. Pooling is not a good alternative for handling correlated ROC curves, and a test for two correlated ROC curves is not designed for such pooled curves. The paired t test takes both between-reader variation and within-reader variation into account, but it does not take case-sample variation into account. A method more suitable than the paired t test was therefore needed. We present a method that takes case-sample variation into account.

We provide an example of eight correlated ROC curves derived from four readers' interpretations of mammograms, then discuss our proposed method.

Comparison of Correlated ROC Curves

Under the assumption of a bivariate normal ROC curve, the curve can be plotted as a straight line on double-normal-deviate paper. This straight line can be represented by means of two parameters, the intercept $\alpha$ and the slope $\beta$. The area under the ROC curve is equal to

$$A_z = \Phi\!\left(\frac{\alpha}{\sqrt{1+\beta^{2}}}\right), \qquad (1)$$

where $\Phi$ denotes the cumulative standard normal distribution function. For the estimation of the area under the curve, $A_z$, we can substitute the estimates $a$ and $b$ for $\alpha$ and $\beta$, respectively.

The statistic proposed by Metz [3] for testing the null hypothesis $H_0: A_{z1} = A_{z2}$ is

$$W = \hat A_{z1} - \hat A_{z2}. \qquad (2)$$

This statistic has variance

$$\operatorname{var}(W) = \operatorname{var}(\hat A_{z1}) + \operatorname{var}(\hat A_{z2}) - 2\operatorname{cov}(\hat A_{z1}, \hat A_{z2}),$$

which is estimated by $S_W^2$, obtained by substituting the estimated variances and covariance. This test is executed by the software package CORROC2 [8]. Under $H_0$, the statistic $W$ approximately follows a normal distribution with zero mean and variance $S_W^2$. This test can be used for the comparison of two ROC curves only. It is therefore not applicable to an ROC study in which multiple observers read the same images and each image is presented by means of two imaging systems.

Hanley and McNeil [4, 9] developed a method that estimates the correlation between the two areas to be compared. The variance of the difference in the areas under the curves is given by

$$\operatorname{var}(\hat A_1 - \hat A_2) = SD^2(\hat A_1) + SD^2(\hat A_2) - 2\rho\, SD(\hat A_1)\, SD(\hat A_2),$$

where SD denotes the standard deviation and $\rho$ is the correlation coefficient. If the areas are highly positively correlated, the gain from using a paired test is large, because of the reduction in the variance. The calculation of the correlation coefficient $\rho$ depends on two intermediate correlation coefficients: one for the ratings of the positive patients and one for the ratings of the negative patients imaged by the two modalities. Each of these can be estimated using Kendall's tau. The statistic $W$ (equation 2) can be used again. This test is not applicable to the comparison of more than two ROC curves.

DeLong and colleagues [5] present a nonparametric approach to compare two or more correlated ROC curves. They assume that the readings of the images of two different patients are independent, but dependence of the readings of the images of one patient is allowed. Within this framework, they have to deal with correlated ROC curves. If there are, e.g., two imaging systems and four readers, then we have eight readings for each patient. However, in this situation the assumption of independence is no longer realistic, because the readings by the same reader will be correlated, even when he or she interprets the images of two different patients. In the next section we deal with correlated ROC curves in this more general case, where correlation between readings of the images of different patients occurs. We apply a nonparametric approach.

Swets and Pickett [7] and Dorfman, Berbaum, and Metz [10] have discussed methods for the statistical comparison of ROC-curve estimates from multiple readers. We have to deal with a similar situation: a factorial experimental design in which the readers interpret the same images in all modalities, but do so only once. The method of Swets and Pickett cannot be applied in this situation, because it requires re-reading of images. Besides, it is difficult to obtain the appropriate variances and covariances. Swets and Pickett implicitly postulated an ANOVA type of model with a fixed effect for the modality; the other effects are random. Dorfman et al. also apply such a model, but of a more general type, in the sense that two additional second-order interaction terms of random type are included. An important difference is that Dorfman et al. do not postulate their model for the estimated areas under the ROC curves but for the pseudovalues derived for the jackknife method. The jackknife analysis can be suitable in the ANOVA context if the usual assumptions of normality of the error terms and the random effects are weakened. Both articles assume additivity of the various effects in the model. In our method we do not assume this. We follow a nonparametric approach and we do not require an ANOVA kind of specification for the various effects.
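The two-curve comparisons discussed in this section can be illustrated with a short Python sketch. It evaluates the area under a bivariate normal ROC curve from an intercept and a slope, and a z statistic for the difference between two correlated areas using the variance $SD_1^2 + SD_2^2 - 2\rho\,SD_1 SD_2$. The function names and all numerical inputs are hypothetical, and the correlation `rho` is simply taken as given rather than derived from Kendall's tau.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and SD 1


def binormal_area(a, b):
    """Area under a bivariate normal ROC curve with intercept a and slope b."""
    return STD_NORMAL.cdf(a / math.sqrt(1.0 + b * b))


def correlated_area_z(area1, sd1, area2, sd2, rho):
    """z statistic and two-sided p value for the difference of two correlated areas."""
    var_diff = sd1 ** 2 + sd2 ** 2 - 2.0 * rho * sd1 * sd2
    z = (area1 - area2) / math.sqrt(var_diff)
    p_two_sided = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
    return z, p_two_sided


# Hypothetical areas and standard deviations (not values from the study).
z_paired, p_paired = correlated_area_z(0.90, 0.03, 0.85, 0.04, rho=0.5)
z_unpaired, p_unpaired = correlated_area_z(0.90, 0.03, 0.85, 0.04, rho=0.0)
```

With the same areas and standard deviations, a positive `rho` shrinks the variance of the difference and so enlarges the z statistic, which is the gain from pairing mentioned above.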

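Comparison of ROC Curves from Multiple Readers with Case-Sample Variation

The method for comparing ROC curves from multiple readers (the number of readers is denoted by $n$) is based upon the fact that the same images are read under two conditions. If the areas under the ROC curves are equal for the two imaging systems, then the imaging systems yield the same diagnostic quality. We make only minimal assumptions about the form of the ROC curve. Therefore, we do not assume a bivariate normal ROC curve. The ROC curve is obtained by connecting the operating points by straight lines. Figure 3 represents such an ROC curve, in which the variable $X$ describes the confidence an observer has in a diagnosis of abnormality when the image is actually normal and the variable $Y$ describes the confidence in a diagnosis of abnormality when the image is abnormal. Here $h$ denotes an operating point, $h = 0, \ldots, c$, where $c$ denotes the number of confidence categories the observer can choose from.

FIGURE 3. An ROC curve obtained without the assumption of bivariate normality.

If $X$ and $Y$ are independent random variables, the area under this curve is given by

$$A_z = P(X < Y) + \tfrac{1}{2} P(X = Y).$$

In the following text we work with

$$\Delta = P(X < Y) - P(X > Y),$$

so that $A_z = \tfrac{1}{2} + \tfrac{1}{2}\Delta$. We can estimate $\Delta$ from the data by $d = \hat p_x' W \hat p_y$, where $\hat p_x$ and $\hat p_y$ are vectors denoting the estimated probabilities that a normal image and an abnormal image, respectively, are classified in category $h$; $h = 1, \ldots, c$. $W$ is a square matrix of dimension $c$, and the elements $w_{xy}$ are given by

$$w_{xy} = \begin{cases} 1 & \text{if } x < y, \\ 0 & \text{if } x = y, \\ -1 & \text{if } x > y. \end{cases}$$

For the ROC curve, the area under the curve can thus be estimated. For each normal image and for each abnormal image the reader can choose from $c$ categories to classify the image. This classification process is represented by two multinomial distributions with estimated probability vectors $\hat p_x$ and $\hat p_y$. The variance of $d$ is estimated by

$$\widehat{\operatorname{var}}(d) = \hat p_y' W' \operatorname{cov}(\hat p_x) W \hat p_y + \hat p_x' W \operatorname{cov}(\hat p_y) W' \hat p_x + \operatorname{tr}\!\big(W' \operatorname{cov}(\hat p_x)\, W \operatorname{cov}(\hat p_y)\big),$$

where tr stands for the trace of the matrix. For the estimation of the variance-covariance matrices $\operatorname{cov}(\hat p_x)$ and $\operatorname{cov}(\hat p_y)$, we make use of the underlying multinomial distribution:

$$\operatorname{cov}(\hat p_x) = \frac{1}{n_1}\big(\operatorname{diag}(\hat p_x) - \hat p_x \hat p_x'\big), \qquad \operatorname{cov}(\hat p_y) = \frac{1}{n_2}\big(\operatorname{diag}(\hat p_y) - \hat p_y \hat p_y'\big),$$

where $n_1$ and $n_2$ denote the numbers of normal and abnormal images, and $\operatorname{diag}(\hat p_x)$ and $\operatorname{diag}(\hat p_y)$ denote diagonal matrices with the elements of the vector $\hat p_x$ and the vector $\hat p_y$, respectively, on the main diagonal. Since $A_z$ is estimated by $\hat A_z = \tfrac{1}{2} + \tfrac{1}{2} d$, the variance of the area under the curve is estimated by

$$\widehat{\operatorname{var}}(\hat A_z) = \tfrac{1}{4}\,\widehat{\operatorname{var}}(d). \qquad (8)$$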
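The nonparametric estimate for a single reader and modality can be sketched in Python as follows. The sketch assumes the sign convention $w_{xy} = 1, 0, -1$ for $x < y$, $x = y$, $x > y$ and uses the exact variance of a bilinear form in two independent multinomial vectors with plug-in estimates; the displayed formulas of the paper may be arranged differently. All function names and rating counts are hypothetical.

```python
def sign_matrix(c):
    """W with w[x][y] = +1 if x < y, 0 if x == y, -1 if x > y."""
    return [[(y > x) - (y < x) for y in range(c)] for x in range(c)]


def multinomial_cov(p, n):
    """Covariance of estimated multinomial probabilities: (diag(p) - p p') / n."""
    c = len(p)
    return [[((p[i] if i == j else 0.0) - p[i] * p[j]) / n for j in range(c)]
            for i in range(c)]


def matmul(a, b):
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def quad_form(u, m, v):
    """u' m v for vectors u, v and matrix m."""
    return sum(u[i] * m[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def d_and_variance(normal_counts, abnormal_counts):
    """Return d = px' W py and its plug-in variance from category counts."""
    n1, n2 = sum(normal_counts), sum(abnormal_counts)
    px = [k / n1 for k in normal_counts]
    py = [k / n2 for k in abnormal_counts]
    c = len(px)
    w = sign_matrix(c)
    wt = [list(col) for col in zip(*w)]
    sx, sy = multinomial_cov(px, n1), multinomial_cov(py, n2)
    d = quad_form(px, w, py)
    w_py = [sum(w[i][j] * py[j] for j in range(c)) for i in range(c)]   # W py
    wt_px = [sum(w[i][j] * px[i] for i in range(c)) for j in range(c)]  # W' px
    m = matmul(matmul(wt, sx), matmul(w, sy))                           # W' Sx W Sy
    var_d = (quad_form(w_py, sx, w_py) + quad_form(wt_px, sy, wt_px)
             + sum(m[i][i] for i in range(c)))
    return d, var_d


def trapezoid_area(normal_counts, abnormal_counts):
    """Area under the ROC curve that joins the operating points by straight lines."""
    n1, n2 = sum(normal_counts), sum(abnormal_counts)
    fpf, tpf, area = 0.0, 0.0, 0.0
    # Lower the threshold one category at a time, starting from the strictest.
    for x_count, y_count in zip(reversed(normal_counts), reversed(abnormal_counts)):
        new_fpf, new_tpf = fpf + x_count / n1, tpf + y_count / n2
        area += (new_fpf - fpf) * (new_tpf + tpf) / 2.0
        fpf, tpf = new_fpf, new_tpf
    return area


# Hypothetical counts per category 1..5 for 50 normal and 50 abnormal images.
normal_counts = [20, 15, 10, 4, 1]
abnormal_counts = [2, 5, 8, 15, 20]
d, var_d = d_and_variance(normal_counts, abnormal_counts)
area = 0.5 + 0.5 * d       # estimated area under the curve
var_area = 0.25 * var_d    # its estimated variance
```

The value `0.5 + 0.5 * d` agrees with the trapezoidal area through the operating points, which is a convenient check on the sign convention chosen for `W`.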

In our ROC study, multiple readers interpreted the same images under two conditions. Given the selection of readers $i = 1, \ldots, n$ and the imaging systems $k = 1, 2$, we have to deal with the areas $A_{z,ik}$. Indeed, conditionally on the sample of readers, the areas $A_{z,ik}$ can differ between readers for a fixed imaging system $k$. The areas are estimated by $\hat A_{z,ik} = \tfrac{1}{2} + \tfrac{1}{2} d_{ik}$. Since the $d_{ik}$ are obtained from the same case sample, we also have to deal with the covariances between the $d_{ik}$, for $i = 1, \ldots, n$ and $k = 1, 2$. More formally, we have to estimate $\operatorname{cov}(d_{ik}, d_{jl})$; $i, j = 1, \ldots, n$; $k, l = 1, 2$.

The covariances between two areas under the curves can be obtained by looking at the similarities and dissimilarities between the classifications of the same images. This is done for the classifications of the same readers and for the classifications of different readers as well. We also need the cross-classifications by the readers. Therefore, we look at the joint multinomial distribution of the probability vectors. We obtain the following estimators:

$$\widehat{\operatorname{cov}}(\hat p_{x,ik}, \hat p_{x,jl}) = \frac{1}{n_1}\big(P_{x,ik,jl} - \hat p_{x,ik}\, \hat p_{x,jl}'\big), \qquad \widehat{\operatorname{cov}}(\hat p_{y,ik}, \hat p_{y,jl}) = \frac{1}{n_2}\big(P_{y,ik,jl} - \hat p_{y,ik}\, \hat p_{y,jl}'\big),$$

where $n_1$ and $n_2$, respectively, denote the number of normal images and the number of abnormal images. The square matrix $P_{x,ik,jl}$ of dimension $c$ is calculated as follows. Its $(c_1, c_2)$ element is equal to the fraction of normal images in the sample that are classified in category $c_1$ by reader $i$ under imaging system $k$ and in category $c_2$ by reader $j$ under imaging system $l$. So, the matrix $P_{x,ik,jl}$ gives the relative frequency table of the normal images in the sample, cross-classified according to the readings by reader $i$ of images from imaging system $k$ and by reader $j$ of images from imaging system $l$.
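The cross-classification table and the resulting covariance between two estimated probability vectors can be sketched in Python as follows. The sketch assumes that the two rating lists refer to the same images in the same order and that categories are coded 1 to c; the function names and ratings are hypothetical.

```python
def marginal(ratings, c):
    """Estimated probability of each category 1..c."""
    n = len(ratings)
    return [sum(r == h for r in ratings) / n for h in range(1, c + 1)]


def cross_table(ratings_a, ratings_b, c):
    """Relative frequency table: entry (c1, c2) is the fraction of images rated
    c1 in reading a and c2 in reading b (same images, same order)."""
    n = len(ratings_a)
    table = [[0.0] * c for _ in range(c)]
    for ra, rb in zip(ratings_a, ratings_b):
        table[ra - 1][rb - 1] += 1.0 / n
    return table


def cross_covariance(ratings_a, ratings_b, c):
    """Estimated covariance between the two probability vectors: (P - pa pb') / n."""
    n = len(ratings_a)
    pa, pb = marginal(ratings_a, c), marginal(ratings_b, c)
    table = cross_table(ratings_a, ratings_b, c)
    return [[(table[i][j] - pa[i] * pb[j]) / n for j in range(c)] for i in range(c)]


# Hypothetical ratings of the same five normal images in two readings.
reading_a = [1, 1, 2, 3, 5]
reading_b = [1, 2, 2, 4, 5]
P = cross_table(reading_a, reading_b, 5)
cov_ab = cross_covariance(reading_a, reading_b, 5)
```

When both arguments are the same reading, `cross_covariance` reduces to the ordinary multinomial covariance matrix of that reading.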
For, e.g., the first reader of the normal mammograms, from the example that is presented below, the joint multinomial distribution is estimated by means of the matrix $P_{x,11,12}$. Notice that $P_{x,11,12}$ is a square matrix of dimension $c$, and element (1, 2) is equal to the fraction of the 50 images that are placed in category 1 by reader 1 under system 1 and in category 2 by reader 1 under system 2. Similar probabilities can also be estimated for different readers. These covariances can be used for the estimation of the covariance between two areas (equation 13). By means of equations 8 and 13, the variances of the areas under the curves and the covariances between the various areas under the curves can be estimated.

We now define a difference vector $D$, with $D_i = \hat A_{z,i1} - \hat A_{z,i2} = \tfrac{1}{2}(d_{i1} - d_{i2})$. The dimension of the difference vector $D$ is equal to $n$, the number of observers in the study. The assumption is made that $D$ approximately follows a multivariate normal distribution with expectation $\mu\iota$, where $\iota$ denotes the vector of dimension $n$ filled with ones, and variance-covariance matrix $\Sigma$. This assumption allows the areas $A_{z,ik}$ to differ between readers for given $k = 1, 2$. It is, however, assumed that the differences $A_{z,i1} - A_{z,i2}$ are equal for all readers.

In the previous analysis, it was implicitly assumed that a reader is completely certain about his or her classifications. In practice, it is likely that the reader is uncertain about the category into which a given image should be classified. So, if the same samples of images are presented to the readers after some time, then another value $\tilde D_i$ will be obtained. It is realistic to assume that $\operatorname{var}(\tilde D_i - D_i) \neq 0$, whereas in the foregoing we have implied that $\operatorname{var}(\tilde D_i - D_i) = 0$. The variance of the measurement or, equivalently, the within-reader variance is denoted by $\sigma^2 \geq 0$ for all readers. If the hypothesis of no measurement error held, the analysis should yield a negligible estimate for $\sigma^2$.
Testing $H_0$: equal diagnostic quality can now be translated into testing $H_0: \mu = 0$, i.e., no difference in the diagnostic qualities of the two imaging systems per reader. The difference with respect to the paired t test is the calculation of $\Sigma$. For the paired t test it is assumed that each element of the vector $D$ approximately follows a normal distribution with expectation $\mu$ and variance $\sigma^2$ and that the elements of this vector are independent. The variance $\sigma^2$ is estimated by the sampling variance of the vector. We assume, however, that the vector $D$ approximately follows a multivariate normal distribution. The variance-covariance matrix $\Sigma$ of $D$ consists of two parts. One part of the variation in $D$ can be explained by the fact that the same images are read multiple times, the case-sample variation. The other part of the variation is caused by a within-readers effect. We denote the case-sample variance by the matrix $C$, and the within-readers variation by $\sigma^2$. Thus, the model can be written as

$$\Sigma = \sigma^2 I + C,$$

where $I$ is the unit matrix of dimension $n$. Since $D_i = \tfrac{1}{2}(d_{i1} - d_{i2})$, the elements of the variance-covariance matrix $C$ are given by

$$C_{ij} = \tfrac{1}{4}\big[\operatorname{cov}(d_{i1}, d_{j1}) - \operatorname{cov}(d_{i1}, d_{j2}) - \operatorname{cov}(d_{i2}, d_{j1}) + \operatorname{cov}(d_{i2}, d_{j2})\big]. \qquad (15)$$

We can substitute the variances (equation 8) and covariances (equation 13) in equation 15, the formula for $C$, to obtain an estimate of the case-sample variance.

For the estimation of the within-readers variation $\sigma^2$ we make use of the restricted maximum likelihood (REML) approach and the expectation-maximization (EM) algorithm [12]. The REML approach projects $D$ on a subspace orthogonal to $\iota$; the projected vector has zero expectation, and its variance-covariance matrix depends only on $\sigma^2$ and $C$. The EM algorithm is based upon a representation of the projected vector as the sum of two independent components $D_1$ and $D_2$, both having expectation zero, where $D_1$ carries the within-readers variation $\sigma^2$ and $D_2$ carries the case-sample variation $C$. If $D_1$ were observable, an estimator of $\sigma^2$ could be found directly from $D_1'D_1$. The EM approach starts with an initial estimate of $\sigma^2$. Subsequently, each iteration of the algorithm consists of two steps:

Expectation (E) step: compute the conditional expectation of $D_1'D_1$, given the data and the current estimate of $\sigma^2$.

Maximization (M) step: update the estimate of $\sigma^2$ from this conditional expectation.

The harder part is the E step; with some linear algebra, a closed-form expression for the conditional expectation can be derived. Repeated application of the E and M steps ultimately leads to an estimate of $\sigma^2$ that maximizes the corresponding likelihood function.

For the estimation of $\mu$, the maximum-likelihood method [13] is applied. We know that $D \sim N_n(\mu\iota, \sigma^2 I + C)$, with $\sigma^2$ and $C$ "known by estimation." The maximum-likelihood estimator for $\mu$ is

$$\hat\mu = \frac{\iota' \hat\Sigma^{-1} D}{\iota' \hat\Sigma^{-1} \iota}.$$

The estimated variance of the estimator of $\mu$ is

$$\widehat{\operatorname{var}}(\hat\mu) = \big(\iota' \hat\Sigma^{-1} \iota\big)^{-1}.$$

Finally, the comparison of ROC curves with multiple readers can be performed. The null hypothesis stating that the diagnostic qualities of the two modalities are equal is translated into $H_0: \mu = 0$. Thus, the statistic

$$T = \frac{\hat\mu}{\sqrt{\widehat{\operatorname{var}}(\hat\mu)}}$$

is defined. Under $H_0$, the distribution of $T$ depends on the magnitude of the within-readers variation $\sigma^2$. For small $\sigma^2$, $T$ approximately follows a standard normal distribution. For large $\sigma^2$, $T$ approximately follows a Student's t distribution with $n - 1$ degrees of freedom, and the test coincides with the paired t test.
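The estimation of $\sigma^2$, $\mu$ and $T$ described in this section can be sketched in Python as follows. This is one possible reading under stated assumptions, not the authors' implementation: the projection orthogonal to the vector of ones is carried out with Helmert contrasts, the E step uses the standard conditional expectation for a sum of two independent zero-mean normal vectors, and the difference vector `D` and matrix `C` below are hypothetical. All function names are invented for this sketch.

```python
import math


def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def add_to_diagonal(m, value):
    return [[m[i][j] + (value if i == j else 0.0) for j in range(len(m))]
            for i in range(len(m))]


def helmert_contrasts(n):
    """(n-1) x n matrix with orthonormal rows, each orthogonal to the ones vector."""
    rows = []
    for k in range(1, n):
        s = math.sqrt(k * (k + 1))
        rows.append([1.0 / s] * k + [-k / s] + [0.0] * (n - k - 1))
    return rows


def reml_em_sigma2(D, C, sigma2=1e-3, max_iter=5000, tol=1e-12):
    """EM iterations for sigma^2 in y = K D ~ N(0, sigma^2 I + K C K')."""
    m = len(D) - 1
    K = helmert_contrasts(len(D))
    y = [sum(k * d for k, d in zip(row, D)) for row in K]      # projected D
    V = matmul(matmul(K, C), [list(col) for col in zip(*K)])   # K C K'
    for _ in range(max_iter):
        s_inv = inverse(add_to_diagonal(V, sigma2))
        z = [sum(s * v for s, v in zip(row, y)) for row in s_inv]  # Sigma^-1 y
        trace = sum(s_inv[i][i] for i in range(m))
        # E step: conditional expectation of D1'D1 given y and current sigma2.
        expected = sigma2 ** 2 * sum(v * v for v in z) + sigma2 * m - sigma2 ** 2 * trace
        # M step: divide by the dimension of the projected vector.
        new_sigma2 = expected / m
        converged = abs(new_sigma2 - sigma2) <= tol * new_sigma2
        sigma2 = new_sigma2
        if converged:
            break
    return sigma2


def mean_and_t(D, C, sigma2):
    """Maximum-likelihood estimate of mu, its variance, and the statistic T."""
    s_inv = inverse(add_to_diagonal(C, sigma2))
    weights = [sum(row) for row in s_inv]    # Sigma^-1 times the ones vector
    precision = sum(weights)                 # ones' Sigma^-1 ones
    mu = sum(w * d for w, d in zip(weights, D)) / precision
    var_mu = 1.0 / precision
    return mu, var_mu, mu / math.sqrt(var_mu)


# Hypothetical differences in area for four readers and a hypothetical C.
D = [0.02, 0.05, 0.01, 0.04]
C = [[4e-4 if i == j else 1e-4 for j in range(4)] for i in range(4)]
sigma2_hat = reml_em_sigma2(D, C)
mu_hat, var_mu_hat, T = mean_and_t(D, C, sigma2_hat)
```

If `C` is set to a zero matrix, the first EM iteration already returns the sample variance of `D` and `T` reduces to the paired t statistic, which matches the limiting behavior described above for a dominant within-readers variation. EM can converge slowly when $\sigma^2$ is small relative to the elements of `C`, hence the generous iteration limit.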

The steps to be taken in comparing ROC curves from multiple readers may be summarized as follows:

- For each reader-modality combination, estimate the area under the curve in a nonparametric way.
- Compute the difference vector $D$.
- Estimate the variance-covariance matrix $C$ by estimating the variances of $d_{i1}$ and $d_{i2}$, $i = 1, \ldots, n$, and the covariances between $d_{i1}$ and $d_{j1}$, between $d_{i1}$ and $d_{i2}$, between $d_{i2}$ and $d_{j2}$, and between $d_{i1}$ and $d_{j2}$; $i, j = 1, \ldots, n$.
- Estimate the within-readers variation $\sigma^2$ with the EM algorithm.
- Estimate $\mu$ by means of the maximum-likelihood method.
- Estimate the variance of $\hat\mu$.
- Compute the value of the test statistic $T$ and compare this value with the values of the appropriate distribution.

Example

We illustrate the method using an ROC study of mammogram interpretations [14]. For this study, 50 films with small calcifications and 50 films without small calcifications were selected. All other abnormalities in the films were ignored. The diagnostic truth of the images was determined by two radiologists who specialized in interpreting mammograms. The images were digitized by a video scanner. Four readers were involved in this study. These observers read both the conventional film and the digitized images. They were asked to use a five-point rating scale to represent their confidence in identifying the abnormality sought. The categories of the certainty scale were:

1 Definitely no small calcifications present
2 Probably no small calcifications present
3 Unsure whether small calcifications are present
4 Probably small calcifications present
5 Definitely small calcifications present

The images per imaging system were all read in one session, and after some unspecified time, the images were read again, in a completely different order, by means of the other imaging system.

Table 2. Areas under the Curves (with Standard Deviations) for the Bivariate Normal ROC Curves of Four Readers

Table 3. P-values of Testing the Equality of the Diagnostic Qualities of Two Modalities (Results of CORROC2) for Four Readers

In table 2, the areas under the curves for the different readers are listed, assuming a bivariate normal form for the ROC curves. Although for all readers the areas under the curves are smaller for the readings of the digitized images than for the conventional films, it is hard to prove a difference in diagnostic quality statistically. The one-sided p value produced by a paired t test does not allow a conclusion about the hypothesis $H_0$: equal diagnostic quality, against $H_1$: the diagnostic quality of the digitized images is less than the diagnostic quality of the conventional film, because of the level of significance used (5%) and because the Student's t statistic only asymptotically follows a standard normal distribution. If we test the equality of the areas under the curves per reader (table 3) according to the method of Metz [3], the null hypothesis is rejected for readers 1, 3, and 4, but not for reader 2. Therefore, according to the method of Metz, no overall statement about the diagnostic qualities of the two systems can be made. In such situations, the paired t test can help, but even with the paired t test it was not possible to draw an overall conclusion.

In figures 4, 5, 6, and 7, the operating points and the estimated ROC curves are given, together with their 95% confidence bands obtained from ROCFIT (a program devised by C. E. Metz); the Working-Hotelling-type bands of Ma and Hall [15] could also be drawn. The behavior of the bounds in figures 4 and 5 is rather strange. It is not clear how ROCFIT produced such bounds. One would expect confidence intervals that are narrow at the lower left and upper right ends of the ROC curve.

Table 4 shows the nonparametric estimates of the areas under the curves, together with the standard deviations. The difference vector $D = \hat A_{z,1} - \hat A_{z,2}$ is then estimated from these areas.

FIGURE 4. The operating points from reader 1 with the estimated curve and a 95% confidence interval for the mammography study.

FIGURE 5. The operating points from reader 2 with the estimated curve and a 95% confidence interval for the mammography study.

FIGURE 6. The operating points from reader 3 with the estimated curve and a 95% confidence interval for the mammography study.

FIGURE 7. The operating points from reader 4 with the estimated curve and a 95% confidence interval for the mammography study.

Table 4. Areas under the Curves (with Standard Deviation) Estimated in a Nonparametric Way for Four Readers

Table 5. Estimates of $D_1'D_1$ and $\sigma^2$ by the EM Algorithm

A paired t test for this difference vector gives a p value of 0.08; thus, the null hypothesis is not rejected according to the paired t test.

The case-sample variation is represented by the variance-covariance matrix $C$, which is estimated from the variances of and covariances between the areas. Approximately, $D \sim N_4(\mu\iota, \sigma^2 I + C)$. We estimate $\sigma^2$ by means of REML and the EM algorithm. Table 5 shows the estimates of $D_1'D_1$ and $\sigma^2$ obtained by the EM algorithm. Substituting the EM estimate of the within-readers variation $\sigma^2$ in the assumed model for $D$, we get $D \sim N_4(\mu\iota, \Sigma)$, with $\Sigma = \sigma^2 I + C$. Obviously, the estimate for $\sigma^2$ is not negligible in comparison with the elements of $C$. This means that we indeed have to take account of within-reader variation, something like measurement error or classification uncertainty of the readers.

The next step is to obtain the maximum-likelihood estimator of $\mu$ and its estimated variance. Testing the equality of the two imaging systems can be translated into testing $H_0: \mu = 0$; therefore, we use the test statistic $T$.

The magnitude of $\sigma^2$ is large relative to the elements of $C$; thus, we compare the value of $T$ with a Student's t distribution with three degrees of freedom. The corresponding p value is less than 0.05; thus, the two imaging systems differ in diagnostic quality.

Discussion

The method described is designed to test the differences between ROC curves obtained from multiple readers in the presence of within-reader variation in a nonparametric context. The analysis does not make use of the assumption of bivariate normal ROC curves. The areas under the ROC curves are estimated in a nonparametric way. Asymptotically, the vector $D$, consisting of the differences in the areas under the curves per reader, follows a normal distribution with expectation $\mu\iota$ and variance-covariance matrix $\Sigma$. This variance-covariance matrix consists of a within-readers effect and a case-sample effect. The case-sample effect can be estimated directly from the variances of and the covariances between the various areas under the curves. For the estimation of the within-readers effect, we make use of the EM algorithm. The expectation of the components of the vector $D$ can be estimated by means of the maximum-likelihood method. Testing whether the two imaging systems yield equal diagnostic quality can be translated into testing whether this expectation is equal to zero. For this test, the assumption is made that the vector $D$ has a constant expectation for all readers, i.e., the gain from using another imaging system is the same for all readers. Therefore, this test will be most powerful for a homogeneous group of readers.

The advantage of the method is that it is more suitable than the paired t test, because it takes the case-sample variance into account. For large $\sigma^2$, however, $T$ approximately follows a Student's t distribution with $n - 1$ degrees of freedom ($n$ denoting the number of readers in the study). Hence the critical values (and the accuracy) of this test will depend heavily on the number of readers in the study. The statistical method used is currently being implemented in software.

The authors thank the two reviewers for their constructive remarks.

References

1. Metz CE. ROC methodology in radiologic imaging. Invest Radiol. 1986;21.
2. Metz CE. Basic principles of ROC analysis. Semin Nucl Med. 1978;8.
3. Metz CE, Wang PL, Kronman HB. A new approach for testing the significance of differences between ROC curves measured from correlated data. Proceedings of the VIIIth Conference on Information Processing in Medical Imaging.
4. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristics (ROC) curves derived from the same cases. Radiology. 1983;148.
5. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44.
6. Metz CE. Some practical issues of experimental design and data analysis in radiologic ROC studies. Invest Radiol. 1989;24.
7. Swets JA, Pickett RM. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. New York: Academic Press.
8. Metz CE, Kronman HB. "CORROC2" Program (IBM-PC Version).
9. Hanley JA, McNeil BJ. The meaning and the use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143.
10. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol. 1992;27.
11. Stewart J. Econometrics. Philip Allan.
12. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Statist Soc. 1977;B39.
13. Mood AM, Graybill FA, Boes DC. Introduction to the Theory of Statistics. New York: McGraw-Hill.
14. Klessens PM, Barneveld Binkhuysen FH, Ottes FP, Winter LHL, Willemse APP, de Valk JPJ. Diagnostic evaluation of a PACS subsystem using mammographs. An ROC analysis. J Med Informatics. 1988;13.
15. Ma G, Hall WJ. Confidence bands for receiver operating characteristic curves. Med Decis Making. 1993;13:191-7.
