A. James O'Malley, PhD, Kelly H. Zou, PhD, Julia R. Fielding, MD, Clare M. C. Tempany, MD. Acad Radiol 2001; 8:


Bayesian Regression Methodology for Estimating a Receiver Operating Characteristic Curve with Two Radiologic Applications: Prostate Biopsy and Spiral CT of Ureteral Stones

A. James O'Malley, PhD, Kelly H. Zou, PhD, Julia R. Fielding, MD, Clare M. C. Tempany, MD

Rationale and Objectives. The authors evaluated two Bayesian regression models for receiver operating characteristic (ROC) curve analysis of continuous diagnostic outcome data with covariates.

Materials and Methods. Full and partial Bayesian regression models were applied to data from two studies (n = 180 and 100, respectively): (a) The diagnostic value of prostate-specific antigen (PSA) levels (outcome variable) for predicting disease after radical prostatectomy (gold standard) was evaluated for three risk groups (covariates) based on Gleason scores. (b) Spiral computed tomography was performed on patients with proved obstructing ureteral stones. The predictive value of stone size (outcome) was evaluated along with two treatment options (gold standard), as well as stone location (in or not in the ureterovesical junction [UVJ]) and patient age (covariates). Summary ROC measures were reported, and various prior distributions of the regression coefficients were investigated.

Results. (a) In the PSA example, the ROC areas under the full model were 0.667, 0.769, and 0.703, respectively, for the low-, intermediate-, and high-risk groups. Under the partial model, the area beneath the ROC curve was [value missing]. (b) The ROC areas for patients with ureteral stones in the UVJ decreased dramatically with age but otherwise were close to that under the partial model (ie, 0.774). The prior distribution had greater influence in the second example.

Conclusion. The diagnostic tests were accurate in both examples. PSA levels were most accurate for staging prostate cancer among intermediate-risk patients. Stone size was predictive of treatment option for all patients other than those 40 years or older and with a stone in the UVJ.
Key Words. Receiver operating characteristic (ROC) curve; Bayesian regression analysis.

Acad Radiol 2001; 8:
Academic Radiology, Vol 8, No 8, August 2001

From the Department of Health Care Policy, Harvard Medical School, 180 Longwood Ave, Boston, MA (A.J.O., K.H.Z.); and the Departments of Medicine (A.J.O.) and Radiology (K.H.Z., J.R.F., C.M.C.T.), Brigham and Women's Hospital, Harvard Medical School, Boston, Mass. Received November 30, 2000; revision requested March 20, 2001; revision received and accepted April 2. Supported in part by grants U01 CA45256 and P01 CA41167 from the National Cancer Institute. Address correspondence to A.J.O.

© AUR, 2001

Traditionally, radiologic diagnostic tests have tended to yield categorical rating data. With more precise measurement tools, however, laboratory diagnostic systems yield a continuous measurement scale for the test outcome. The most important predictor variable for diagnostic accuracy is the gold standard, which indicates the true disease status of each individual. Additional covariates, such as age and other risk factors, are often collected. In this article, we propose Bayesian regression methods for evaluating the accuracy of such a diagnostic system with covariates incorporated.

These new methods are motivated by two radiologic studies. The first study concerns the use of prostate-specific antigen (PSA) levels for staging prostate cancer, with Gleason scores as a covariate (1). The second study concerns the use of ureteral stone size for predicting treatment option, with stone location in the ureterovesical junction (UVJ) and patient age as two covariates (2).

The accuracy of a diagnostic test can be summarized in terms of a receiver operating characteristic (ROC) curve, which is a plot of sensitivity (or true-positive rate) versus (1 − specificity) (or false-positive rate) at all possible decision threshold values (3–5). Notations and assumptions for an ROC curve are given in the Materials and Methods section.

In this article, Bayesian regression methodology is introduced for ROC analysis derived from continuous-outcome measurement data, with patient covariate information and prior knowledge about the regression coefficients incorporated into the proposed regression models. Our Bayesian regression approach provides a new inferential framework for continuous-outcome data. Note that our approach may be extended to ordinal data, because continuously distributed test outcome data have a natural categorization that preserves all information relevant to ROC-curve fitting (6).

Under the non-Bayesian framework, an ordinal regression model for categorical outcomes has been developed (7), regression analysis methods for general outcomes have recently been proposed (8–10), and hierarchical random-effects models for ordinal categorical data have been discussed (11,12). Bayesian methodology has previously been developed for ordinal rating data (13–15). Whereas previous studies focused on the implementation of Bayesian methods, we also consider the sensitivity of the results to the choice of prior.

Bayesian approaches are different from the more conventional frequentist approaches. In essence, given the observed data, once the prior distribution and the statistical model are specified, Bayes' theorem is applied to derive posterior distributions of the parameters in the model or functions of these parameters (16). The motivations behind adopting a Bayesian approach are threefold.
First, prior knowledge or pilot information can easily be incorporated into the models. In radiologic studies, prior information about diagnostic tests can often be extracted from pilot studies, meta-analysis of relevant literature, or scientific theory. The incorporation of prior information often results in inferences that are more precise than those obtained with frequentist methods if such prior information and experimental data cohere.

Second, the parameters in the model or complex functions of them, such as summary ROC measures, may be simulated directly via methods for the exploration of posterior distributions. The most frequently used method is the Markov-chain Monte Carlo method via the Gibbs sampler and the Metropolis-Hastings algorithm (16,17). Recent advances in numerical methods and the availability of free software programs, such as BUGS (18), allow for faster and more efficient Bayesian computation.

Finally, our Bayesian approaches for continuous-outcome data can also be extended to more complex designs and other scenarios. The benefit of Bayesian hierarchical approaches for ordinal data from multireader, multimodality studies has been illustrated previously (15). Note that the Bayesian treatment of the regression model in this article is a simple case of a hierarchical regression analysis.

MATERIALS AND METHODS

Clinical Example 1: PSA Level in Prostate Cancer

As part of a multicollaborative Radiologic Diagnostic Oncology Group trial, magnetic resonance imaging was performed in 213 patients with prostate cancer, before which PSA levels and Gleason scores were obtained in 180 cases. PSA level was treated as the outcome variable. Radical prostatectomy was performed in all cases to provide the gold standard, and patients were classified into two groups, those with local disease (stage A or B) and those with advanced disease (periprostatic invasion of tumor and spread of disease to the seminal vesicles and lymph nodes) (1).
Gleason score was treated as an additional covariate, with patients classified as low risk (Gleason score ≤ 6), intermediate risk (Gleason score 7), or high risk (Gleason score ≥ 8) (19).

Clinical Example 2: Spiral CT of Ureteral Stones

One hundred unenhanced spiral computed tomographic (CT) scans were obtained to evaluate flank pain in patients with obstructing ureteral stones documented by means of chart review (2,20). A standard protocol was used (280 mA; 12 kVp; pitch, [value missing]). The imaging thickness was 5 mm, with images reconstructed at 5-mm increments. Two radiologists initially reviewed the CT scans independently and blindly to detect several imaging findings. The size of the stone was treated as the outcome variable. Treatment option, the gold standard, was either spontaneous passage or required surgical intervention. The ROC analysis conducted previously suggested that in-plane stone size (in millimeters), followed by stone location (for simplicity, in or not in the UVJ), were the features most predictive of the need for intervention. Therefore, additional covariates were stone location in the UVJ and patient age.

Assumptions and Notations for an ROC Curve

We assume that subjects are drawn from one of two distinct populations (ie, nondiseased and diseased), classified according to the gold standard. Furthermore, at each distinct covariate value, the diagnostic outcome measurements, or their transformed versions via the same monotone transformation, have a normal distribution.

We now give mathematical notations for ROC analysis. Let $Y_i$ denote the test outcome for the $i$th subject ($i = 1, \ldots, n$), and let $Y$ denote a generic outcome variable. There are two categories of covariates or predictors. First, associated with each subject is the gold standard (or disease status) labeled $D$, where $D = 1$ if this subject is diseased and 0 otherwise. This is the most important predictor for discriminating between nondiseased and diseased populations. Second, each patient also has an additional set of covariates labeled $X_{ij}$ with $j = 1, \ldots, k$, possibly extracted from his or her medical record. These covariates, denoted generically as $X$, consist of all relevant information known about this subject.

Given any threshold (or cutoff value), $t$, let $F(t \mid D, X)$ be the probability that the outcome measurement is less than or equal to $t$, for a subject with gold standard $D$ and covariates $X$. The corresponding sensitivity (or true-positive rate, $q$) of the test at this threshold value is defined as

$$\bar{F}(t \mid D = 1, X) = 1 - F(t \mid D = 1, X).$$

Likewise, the corresponding (1 − specificity) (or false-positive rate, $p$) is defined as

$$\bar{F}(t \mid D = 0, X) = 1 - F(t \mid D = 0, X).$$

At any given $t$, the resulting point on the ROC curve, a function of both $p$ and $q$, is given by

$$\bigl(p(t),\, q(t)\bigr) = \bigl(\bar{F}(t \mid D = 0, X),\ \bar{F}(t \mid D = 1, X)\bigr), \quad \text{for } t \in \mathbb{R}.$$

Or equivalently, at any given $p$, with $q$ viewed as a function of $p$,

$$\bigl(p,\, q(p)\bigr) = \Bigl(p,\ \bar{F}\bigl(\bar{F}^{-1}(p \mid D = 0, X) \mid D = 1, X\bigr)\Bigr), \quad \text{for } p \in [0, 1].$$

Two Regression Models

The test outcome variable, $Y$, is assumed to have conditionally independent normal distributions.
Its underlying mean, $\mu_{D,X}$, depends on the gold standard ($D$) and additional covariates ($X$). For simplicity, the underlying variances, the $\sigma^2$'s, depend only on $D$. Two regression models are assumed to relate the expected value of the outcome to the gold standard and all other covariates. The full-regression model contains all interaction terms between disease status and covariates, as well as interactions among covariates. In comparison, the partial-regression model omits disease-covariate interaction terms. (These models are referred to hereafter as the full model and the partial model.)

In the PSA example, the regression equation for the partial model is

$$\mu_{D,X} = \beta_* + \beta_0 D + \beta_1 x_1 + \beta_2 x_2 \quad (1)$$

and the regression equation for the full model is

$$\mu_{D,X} = \beta_* + \beta_0 D + \beta_1 x_1 + \beta_2 x_2 + \beta_3 D x_1 + \beta_4 D x_2, \quad (2)$$

where the two covariates are as follows: $x_1 = 1$ if the Gleason score is 7 (intermediate risk) and 0 otherwise; $x_2 = 1$ if the Gleason score is 8 or higher (high risk) and 0 otherwise, as recommended by D'Amico et al (19). Note that $x_1 = 0$ and $x_2 = 0$ correspond to a Gleason score of 6 or lower (low risk). Also note that in the full model, an interaction term between $x_1$ and $x_2$ is absent because each of these covariates is a dichotomized indicator variable of Gleason score.

In the ureteral stone example, the regression equation for the partial model is

$$\mu_{D,X} = \beta_* + \beta_0 D + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \quad (3)$$

and the regression equation for the full model is

$$\mu_{D,X} = \beta_* + \beta_0 D + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 D x_1 + \beta_5 D x_2 + \beta_6 D x_1 x_2, \quad (4)$$

where the two covariates are as follows: $x_1 = 1$ if the kidney stone is located in the UVJ and 0 otherwise; $x_2$ is the age of the patient (a continuous covariate).

There is a difference between the above regression models and binary regression models. The former assume that the expected value of $Y$ depends on $(D, X)$, whereas the latter assume that the expected value of $D$ depends on $(Y, X)$. As these models serve different inferential purposes, numerical results should not be compared directly (14).

Choice of Prior Distributions

Diffuse prior distributions. We first consider the case in which little prior information about the parameters is available. This level of knowledge is introduced by specifying a prior distribution with a very large variance. The prior distributions for all regression coefficients, the $\beta$'s, are assumed to be independent normal $N(0, 10^6)$ distributions. The prior distributions for the variance terms, the $\sigma^2$'s, are assumed to be independent inverse gamma $IG(0.001, 0.001)$ distributions. For certain simple problems like those considered here, Bayesian inferences under a diffuse prior will be indistinguishable from those based on maximum-likelihood estimation. This will not be the case when the posterior distribution is asymmetric and a Bayesian estimator other than the posterior mode (eg, the posterior mean) is used.

Informative prior distributions. The more informative the prior knowledge, the sharper is its distribution, resulting in a possibly greater effect on the posterior distribution. In this article we consider different informative prior distributions for the main factor mean effect, $\beta_0$, of the gold standard. In our partial model (Eq [3]), $\beta_0 = \mu_{1,X} - \mu_{0,X}$, corresponding to the amount that the data distribution mean changes with disease status. The hyperparameters (mean and variance) of the prior distribution for $\beta_0$ are assigned realistic values. In the PSA example, the prior distributions for $\beta_0$ were $N(0.70, 0.05)$, $N(1.15, 0.05)$, $N(1.60, 0.05)$, and $N(1.60, 0.50)$. In the ureteral stone example, the prior distributions were $N(0.20, 0.01)$, $N(0.50, 0.01)$, $N(0.80, 0.01)$, and $N(0.80, 0.10)$. See the Results section for justifications for the choice of these prior distributions.

Summary ROC Measures

For each clinical example the posterior means and standard deviations (SDs) of all model parameters are computed.
Characteristics and summary ROC measures, which are functions of these parameters, are also computed. These include the following: area under the ROC curve ($A$); partial area ($A_{(p_1, p_2)}$) between 50% and 100% specificity; sensitivity ($q$) at 90% specificity; and the (1 − specificity) value $p_{MIS}$ corresponding to the maximum improvement of sensitivity (MIS, or $\Delta q_{MIS}$) over chance. See Appendix A, as well as the ROC literature (3–5) (area, partial area, and $q$ only), for a description of these ROC measures.

Curve Fitting via Markov-Chain Monte Carlo Methods

Consider the diagnostic data in the form of a triple $(Y, D, X)$. Let $f$ denote the joint probability distribution function of any random variables. With prespecified regression models and observed data, the posterior distribution of $(\beta, \sigma^2)$ is obtained from Bayes' theorem (16):

$$f(\beta, \sigma^2 \mid Y, D, X) = \frac{f(Y \mid D, X, \beta, \sigma^2)\, f(\beta, \sigma^2)}{f(Y \mid D, X)}, \quad (5)$$

where $\beta = (\beta_*, \beta_0, \beta_1, \ldots, \beta_k)$, $\sigma^2 = (\sigma_1^2, \sigma_2^2)$, and

$$f(Y \mid D, X) = \iint f(Y \mid D, X, \beta, \sigma^2)\, f(\beta, \sigma^2)\, d\beta\, d\sigma^2$$

is the marginal distribution of the data. The posterior mean of any function of the parameters $g(\beta, \sigma^2)$ is given by the following:

$$E\bigl[g(\beta, \sigma^2) \mid Y, D, X\bigr] = \iint g(\beta, \sigma^2)\, f(\beta, \sigma^2 \mid Y, D, X)\, d\beta\, d\sigma^2. \quad (6)$$

For example, by setting $g(\beta, \sigma^2) = \beta_0$, we obtain the posterior mean of the main factor diagnostic effect in the ROC regression models. The posterior mean sensitivity at a given specificity is obtained by simply setting $g(\beta, \sigma^2)$ equal to the corresponding ROC curve value (a function of the model parameters $\beta$ and $\sigma^2$). The posterior variance of the function $h(\beta, \sigma^2)$ may be evaluated by setting $g(\beta, \sigma^2) = \{h(\beta, \sigma^2) - E[h(\beta, \sigma^2) \mid Y, D, X]\}^2$, and the posterior SD is the square root of the posterior variance.

The integrals involved in Equations (5) and (6) cannot generally be evaluated explicitly. We wrote our own C programs to implement Markov-chain Monte Carlo methods via the Gibbs sampler (16,17). We used a burn-in of 2,000 iterations and a main simulation of 10,000 iterations. Furthermore, we verified our results by using the free standard software program BUGS run on the UNIX platform (18). Appendix B provides the BUGS code for fitting the partial model and estimating the area under the ROC curve in the ureteral stone example.

Table 1
PSA Example: Posterior Distributions of the Regression Parameters for Diffuse Prior Distributions
Rows: partial model; full model. [Table entries not preserved.]
Note. Values are means ± SDs. $\beta$ = regression coefficients, where $\beta_*$ = intercept, $\beta_0$ = main effect of the diagnostic standard, $\beta_1$ = effect of intermediate risk based on Gleason scores, $\beta_2$ = effect of high risk based on Gleason scores, $\beta_3$ = effect of (diagnostic standard) × (intermediate-risk category) interaction, $\beta_4$ = effect of (diagnostic standard) × (high-risk category) interaction, $\sigma_1^2$ = error variance for the nondiseased measurements, and $\sigma_2^2$ = error variance for diseased measurements. Independent $N(0, 10^6)$ priors were used for all regression coefficients. Independent $IG(0.001, 0.001)$ priors were used for the variance parameters.

Table 2
PSA Example: Posterior Distributions of Summary Measures of Diagnostic Accuracy for Diffuse Prior Distributions
Rows: partial model (overall); full model (low risk, intermediate risk, high risk). [Table entries not preserved.]
Note. Values are means ± SDs. $A$ = area under ROC curve, $A_{(p_1, p_2)}$ = partial area under ROC curve between specificities of 50% and 100%, $q$ = sensitivity at specificity = 90%, $p_{MIS}$ = 1 − specificity corresponding to maximum improvement of sensitivity, and $\Delta q_{MIS}$ = maximum improvement of sensitivity.

RESULTS

Clinical Example 1: PSA Level in Prostate Cancer

Overall results and summary statistics. Sixty-six patients (36.7%) had local disease, and 114 (63.3%) had advanced disease. Their PSA levels (outcome variable) ranged from 0.1 to 58.0 ng/mL (mean ± SD, [value missing] ng/mL ± 9.37). This variable was transformed to normality by using a Box-Cox transformation with a coefficient of 0.33; see reference 20 for justification for a similar clinical example.
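As an illustration of the transformation step, the following is a minimal sketch of the standard Box-Cox power transform, assuming the usual form $(y^\lambda - 1)/\lambda$ with the natural log as the limiting case $\lambda = 0$. The function names and the PSA readings are hypothetical and are not taken from the study data or the authors' programs.

```python
from math import exp, log


def box_cox(y, lam):
    """Box-Cox power transform of a positive measurement y with coefficient lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive measurement")
    if lam == 0:
        return log(y)  # limiting case of the power transform
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam):
    """Map a transformed value back to the original measurement scale."""
    if lam == 0:
        return exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


# Hypothetical PSA readings in ng/mL (illustrative only, not study data).
psa_levels = [0.1, 4.0, 9.5, 58.0]
transformed = [box_cox(y, 0.33) for y in psa_levels]
```

Because the transform is monotone increasing, it changes the shape of the outcome distribution without changing the ordering of subjects, so the empirical ROC curve is unaffected; only the normality assumption of the regression model is helped.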
On the basis of Gleason scores, 88 subjects (48.9%) were classified as low risk, 51 (28.3%) as intermediate risk, and 41 (22.8%) as high risk.

Results under full versus partial models. For both regression models (Eqs [1, 2]), Table 1 presents posterior means and SDs of all model parameters with diffuse priors. In both models the posterior mean of the gold standard main effect, $\beta_0$, is above 0, reflecting the predictive value of PSA levels for advanced disease stage. In the full model, the posterior mean of the (gold standard) × (intermediate-risk category) interaction term, $\beta_3$, is also positive, suggesting that the test is most predictive for the intermediate-risk group.

Table 2 and Figure 1 present the fitted ROC curves, along with characteristics and summary measures. The overall ROC curve under the partial model is displayed, along with three curves (for the low-, intermediate-, and high-risk groups) under the full model. PSA levels are most accurate for the intermediate-risk group ($A = 0.769$) and least accurate for the low-risk group ($A = 0.667$). All sensitivity values, the $q$'s, at $p = 10\%$ (90% specificity), are quite low. The maximum improvement of sensitivity over chance occurs at about $p_{MIS} = 40\%$ (60% specificity). For example, under the partial model, $p_{MIS}$ = [value missing] (56% specificity), $\Delta q_{MIS} = 0.314$, and the sum of these two quantities yields the corresponding sensitivity $q_{MIS}$ = [value missing]. In contrast, the sensitivity $q$ = [value missing] (an improvement of [value missing] over chance) at $p = 10\%$ (90% specificity). The highest maximum improvement of sensitivity occurs for the intermediate-risk group.

Results under different prior distributions for $\beta_0$. Table 3 provides posterior means and SDs under the partial model and five different prior distributions for $\beta_0$. Diffuse priors are still assumed for all other parameters. The posterior mean and variance of $\beta_0$ under the diffuse prior are [value missing] and [value missing], respectively. To investigate the robustness of the choice of prior, we first construct a realistic prior distribution $N(1.15, 0.05)$ for $\beta_0$, because it contains information consistent with the data. We then vary the mean and variance of this prior in order to obtain the following alternative prior distributions: $N(0.70, 0.05)$, $N(1.15, 0.05)$, $N(1.60, 0.05)$, and $N(1.60, 0.50)$.

Table 4 and Figure 2 present the fitted ROC curves, along with characteristics and summary measures. The simulated posterior density functions of the area $A$ under these prior distributions are plotted in Figure 3. These ROC curves do not vary much, with $A$ ranging from [value missing] to [value missing]. This suggests that the models are robust with respect to the prior distributions. When the mean of the normal prior distribution for $\beta_0$ is increased (eg, from 0.70 to 1.60) while the variance is fixed, the ROC curves move from conservative (eg, smaller area) to anticonservative (eg, greater area). When the variance of the normal prior distribution is increased while the mean is fixed (eg, from 0.05 to 0.50 to $10^6$), the results tend toward those under the diffuse prior.

Clinical Example 2: Spiral CT of Ureteral Stones

Overall results and summary statistics. Seventy-one patients passed stones spontaneously, and 29 required interventional therapy. The stone size (outcome variable) ranged from 1 to 16 mm (mean ± SD, 5.03 mm ± 2.69). This variable was transformed by using a log transformation (20). Thirty-nine subjects had stones located in the UVJ. The age range was [value missing] years (mean ± SD, [value missing] years ± 12.03) in the spontaneous passage group and [value missing] years (mean ± SD, [value missing] years ± 13.24) in the surgical intervention group.

Results under full versus partial models. For both regression models (Eqs [3, 4]), Table 5 presents posterior means and SDs of all model parameters with diffuse priors.
In both models the posterior mean of the gold standard main effect, $\beta_0$, is above 0, reflecting the predictive value of stone size for treatment option. In the full model, the posterior means of the (gold standard) × (UVJ) interaction term $\beta_4$ and the (gold standard) × (UVJ) × (age) term $\beta_6$ are different from 0, suggesting a complex structure.

Table 6 and Figure 4 present the fitted ROC curves, along with characteristics and summary measures. The overall ROC curve under the partial model is displayed, along with four curves (for the combinations of UVJ status × age 30 or 40 years) under the full model. Stone size is least predictive ($A = 0.577$) for stones in the UVJ of patients 40 years of age. The predictive values for all other groups ($A$, [values missing]) were similar to that for the partial model ($A = 0.774$). The maximum improvement of sensitivity occurs at approximately 75% specificity. The greatest improvement ($\Delta q_{MIS} = 0.410$) is for a 40-year-old subject with a stone not located in the UVJ.

Results under different prior distributions for $\beta_0$. Table 7 provides posterior means and SDs under the partial model and five different prior distributions for $\beta_0$. Diffuse priors are still assumed for all other parameters. The posterior mean and variance of $\beta_0$ under the diffuse prior are [value missing] and [value missing], respectively. A realistic prior distribution that could have been assumed is $N(0.50, 0.01)$. We then vary the mean and variance of this prior in order to obtain the following alternative prior distributions: $N(0.20, 0.01)$, $N(0.50, 0.01)$, $N(0.80, 0.01)$, and $N(0.80, 0.10)$.

Table 8 and Figure 5 present the fitted ROC curves, along with characteristics and summary measures. The simulated posterior density functions of $A$ under these prior distributions are plotted in Figure 6. These ROC curves vary, with $A$ ranging from [value missing] to 0.838, suggesting that the models are sensitive to the choice of prior distributions. The effect of mean and variance on the summary measures has similar tendencies as in the PSA example. Moreover, the posterior SDs vary depending on the precision of the prior distribution.

Figure 1. In the PSA example, the overall ROC curve (partial model) and the curves for three levels of risk based on the Gleason scores (full model).

Figure 2. In the PSA example and under the partial model, the ROC curves based on different prior distributions.

Figure 3. In the PSA example and under the partial model, the posterior distributions of the area under the ROC curve based on different prior distributions.

Table 3
PSA Example, Partial Model: Posterior Distributions of Regression Parameters Based on Different Prior Distributions
Rows (prior for $\beta_0$): $N(1.15, 10^6)$; $N(0.70, 0.05)$; $N(1.15, 0.05)$; $N(1.60, 0.05)$; $N(1.60, 0.50)$. [Table entries not preserved.]
Note. Values are means ± SDs. See Table 1 for explanation of regression parameters.

Table 4
PSA Example, Partial Model: Posterior Distributions of Summary Measures of Diagnostic Accuracy Based on Different Prior Distributions
Rows (prior for $\beta_0$): $N(1.15, 10^6)$; $N(0.70, 0.05)$; $N(1.15, 0.05)$; $N(1.60, 0.05)$; $N(1.60, 0.50)$. [Table entries not preserved.]
Note. Values are means ± SDs. See Table 2 for expanded abbreviations.

Remarks on Markov-chain Monte Carlo Computations

Convergence of the Markov-chain Monte Carlo algorithm was assessed by using CODA (21). For each model and data set the rate of convergence was rapid. The chains were monitored for convergence by using trace plots (14) and diagnostic measures such as Gelman and Rubin's (22) shrink factor. The estimated 50th percentiles of Gelman and Rubin's shrink factor were less than 1.05 in all cases, indicating that 10,000 iterations sufficed to achieve convergence.

To assess whether the models were appropriate, we examined the residuals associated with the fitted posterior means of the outcomes and found no evidence of lack of fit. In addition, the partial and full models were formally compared by using the deviance information criterion (DIC), a likelihood ratio test (LRT), and the pseudo-Bayes factor (PSBF) (23,24). In the PSA example, the models performed similarly (the difference between the DIC of the partial and full models was 2.64, −2 log-likelihood ratio = 1.29, and PSBF = 3.10). In the ureteral stone example, the partial model appeared more appropriate (difference of 5.64 for the DIC, −2 log-likelihood ratio = 0.60, PSBF = 72.24).

We used posterior means and SDs to summarize the location and spread of the posterior distributions because they have a unimodal and reasonably symmetric shape (Figs 3 and 6). Not surprisingly, our results under the diffuse prior were very similar to those based on maximum-likelihood estimation. If the posterior distributions were quite skewed, then the posterior medians or nonsymmetric credibility intervals, such as highest posterior density regions, would be preferable measures.
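The shrink factor used in the convergence check can be sketched as follows. This is a minimal version of the basic Gelman and Rubin potential scale reduction factor for one scalar quantity, assuming several parallel chains of equal length; it omits the sampling-variability correction that CODA applies, and the function name `shrink_factor` and the toy chains are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def shrink_factor(chains):
    """Basic Gelman-Rubin shrink factor for one scalar quantity.

    chains: list of at least two equal-length lists of post-burn-in draws.
    """
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])  # W: mean within-chain variance
    between = n * variance(chain_means)                   # B: between-chain variance
    pooled = (n - 1) / n * within + between / n           # pooled estimate of posterior variance
    return sqrt(pooled / within)


# Toy draws (illustrative only): two chains that agree, two that do not.
agreeing = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
disagreeing = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]
```

Values close to 1 indicate that the chains are sampling the same distribution, which is the sense in which a threshold such as 1.05 is used above; values well above 1 indicate that the chains have not yet mixed.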
DISCUSSION

We have considered two Bayesian regression models, namely, the full and partial models, for ROC analysis that uses continuous diagnostic outcome data and accounts for covariates. The outcome variable was first transformed via a suitable monotone transformation to ensure that the modeling assumptions were appropriate. The posterior means and SDs of the parameters and functions of these parameters were simulated via Markov-chain Monte Carlo methods. The effect of different prior distributions was also investigated.

We selected two real applications to appeal to a wide audience and to illustrate how ROC curves varied for subgroups of patients based on discrete covariates (Gleason score in the PSA example) and continuous covariates (age in the ureteral stone example). We conclude from our overall analysis (under the partial model) that both PSA levels and ureteral stone size had satisfactory accuracy. The full models yielded multiple ROC curves corresponding to different values of the covariates, a consequence of the interactions between the covariates and the gold standard. In particular, the accuracy was the highest for the intermediate-risk group in the PSA example. In the ureteral stone example, the accuracy decreased dramatically with age when a stone was located in the UVJ. In contrast, the accuracy was fairly constant over age when a stone was not located in the UVJ.

The robustness with respect to prior distributions was related to the sample size. The choice of prior distributions had greater influence on the results in the ureteral stone example (n = 100) than in the PSA example (n = 180). Radiologic data such as illustrated in our examples are often subject to variability and prior knowledge.

Table 5
Ureteral Stone Example: Posterior Distributions of Regression Parameters for Diffuse Prior Distributions
Rows: partial model; full model. [Table entries not preserved.]
Note. Values are means ± SDs. $\beta$ = regression coefficients, where $\beta_*$ = intercept, $\beta_0$ = main effect of the diagnostic standard, $\beta_1$ = effect of stone in UVJ, $\beta_2$ = effect of age, $\beta_3$ = effect of (UVJ) × (age) interaction, $\beta_4$ = effect of (diagnostic standard) × (UVJ) interaction, $\beta_5$ = effect of (diagnostic standard) × (age) interaction, $\beta_6$ = effect of (diagnostic standard) × (UVJ) × (age) interaction, $\sigma_1^2$ = error variance for the nondiseased measurements, and $\sigma_2^2$ = error variance for diseased measurements. All coefficients of terms involving age were multiplied by 100. Independent $N(0, 10^6)$ priors were used for all regression coefficients. Independent $IG(0.001, 0.001)$ priors were used for the variance parameters.

Table 6
Ureteral Stone Example: Posterior Distributions of Summary Measures of Diagnostic Accuracy for Diffuse Prior Distributions
Rows: partial model (overall); full model (stone not in UVJ, pt age 30 y; stone in UVJ, pt age 30 y; stone not in UVJ, pt age 40 y; stone in UVJ, pt age 40 y). [Table entries not preserved.]
Note. Values are means ± SDs. See Table 2 for expanded abbreviations; pt = patient.

Figure 4. In the ureteral stone example, the overall ROC curve (partial model) and the curves for the four combinations of UVJ status and patient age of 30 or 40 years (full model).
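The link between sample size and prior influence can be illustrated in a deliberately simplified setting: a single normal mean with known data variance and a conjugate normal prior. This is only a sketch of the general mechanism, not the regression models fitted in this article, and the function name and all numbers below are hypothetical.

```python
def normal_posterior(prior_mean, prior_var, sample_mean, data_var, n):
    """Conjugate update for a normal mean with known data variance.

    Returns the posterior mean, the posterior variance, and the weight
    that the posterior mean places on the prior mean.
    """
    prior_precision = 1.0 / prior_var
    data_precision = n / data_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    prior_weight = prior_precision * post_var
    return post_mean, post_var, prior_weight


# Hypothetical numbers: the same informative prior combined with
# samples of size 100 and 180 that have the same mean and variance.
weight_n100 = normal_posterior(0.5, 0.01, 0.7, 1.0, 100)[2]
weight_n180 = normal_posterior(0.5, 0.01, 0.7, 1.0, 180)[2]
```

In this simplified setting the posterior mean is a precision-weighted average of the prior mean and the sample mean, so the weight on the prior shrinks as n grows and grows as the prior variance shrinks. Both effects are consistent with the pattern reported above, where the smaller study with the tighter priors was the more sensitive one.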
Our approach has several advantages. First, pilot information and prior belief can be incorporated in our regression models. Such additional information allows for precise inferences if it is consistent with clinical data. When different prior distributions are considered, it appears that the analysis becomes more subjective with varied results. Objectivity can be achieved, however, if these distributions are derived from reliable sources, such as relevant pilot studies. Second, Bayesian inferences use explicit probability statements about the parameters or any other quantities of interest. The uncertainty of these quantities is also expressed in terms of probability distributions, for example the posterior distributions of the area under the ROC curve (Figs 3 and 6). Third, complex inferences can easily be made via direct simulations from the posterior distributions because under the Bayesian paradigm a formal procedure exists for obtaining a solution to all inference problems (25). Finally, computer codes using standard software are readily available.

The analysis we performed has several potential limitations. First, we require that the test outcome variable is normally distributed or explicitly transformed to normality before analysis. Alternatively, we can extend the model so that the transformation to normality is incorporated in our model (16,20) or so that only a latent decision variable is assumed to be normal (6,7). Second, for illustration purposes, we constructed prior distributions based mainly on observed data and then arbitrarily varied the hyperparameters. If clinical studies similar to ours are to be conducted, however, or if further clinical study is called for, our posterior distributions can then be used in the future. Third, when dealing with complex models, exact Bayesian inferences may be computationally intensive.

Special attention should be paid in applying our Bayesian approach, especially in constructing prior distributions. The choice of prior distributions should not be arbitrary but rather should depend on the pilot data or plausible prior belief. We recommend investigating the sensitivity of the results to the choice of prior distribution.

Table 7
Ureteral Stone Example, Partial Model: Posterior Distributions of Regression Parameters Based on Different Prior Distributions
Rows (prior for $\beta_0$): $N(0.50, 10^6)$; $N(0.20, 0.01)$; $N(0.50, 0.01)$; $N(0.80, 0.01)$; $N(0.80, 0.10)$. [Table entries not preserved.]
Note. Values are means ± SDs. See Table 5 for explanation of regression parameters.

Table 8
Ureteral Stone Example, Partial Model: Posterior Distributions of Summary Measures of Diagnostic Accuracy Based on Different Prior Distributions
Rows (prior for $\beta_0$): $N(0.50, 10^6)$; $N(0.20, 0.01)$; $N(0.50, 0.01)$; $N(0.80, 0.01)$; $N(0.80, 0.10)$. [Table entries not preserved.]
Note. Values are means ± SDs. See Table 2 for expanded abbreviations.

Several future research topics in this area are under way. In this article the transformation of the outcome variable was provided before analysis. We will incorporate such a transformation in a Bayesian hierarchical generalized linear models framework (15). In addition, we will elicit prior distributions by using formal statistical rules (26). We will also consider prior distributions with valid clinical basis that impose constraints on the parameter space.
For example, if it is known for certain that the accuracy of a diagnostic test is higher than chance, the parameters should be constrained accordingly. Another possible constraint is that the variance of the outcome data is smaller for the nondiseased group than that for the diseased group, or vice versa (11,27).

Figure 5. In the ureteral stone example and under the partial model, the ROC curves based on different prior distributions.
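The recommendation to check how sensitive the results are to the prior can be illustrated with a deliberately simplified sketch. It is not the authors' BUGS model: it swaps in a conjugate normal update for a single coefficient with known sampling variance, and it assumes that the second argument of each N(., .) prior is a variance. The data summary (`SAMPLE_MEAN`, `SAMPLE_VAR`, `N`) is hypothetical and chosen only for illustration.

```python
def normal_posterior(prior_mean, prior_var, sample_mean, sample_var, n):
    """Conjugate update for a normal mean when the sampling variance is known.

    Returns (posterior mean, posterior variance).
    """
    prior_precision = 1.0 / prior_var
    data_precision = n / sample_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return post_mean, post_var


# Priors listed in Tables 7 and 8, read here as (mean, variance): an assumption.
PRIORS = [(0.50, 1e6), (0.20, 0.01), (0.50, 0.01), (0.80, 0.01), (0.80, 0.10)]

# Hypothetical data summary, not taken from the ureteral stone study.
SAMPLE_MEAN, SAMPLE_VAR, N = 0.50, 1.0, 100


def sensitivity_table():
    """One row per prior: (prior mean, prior variance, post. mean, post. variance)."""
    return [(m, v, *normal_posterior(m, v, SAMPLE_MEAN, SAMPLE_VAR, N))
            for m, v in PRIORS]


if __name__ == "__main__":
    for prior_mean, prior_var, post_mean, post_var in sensitivity_table():
        print(f"N({prior_mean:.2f}, {prior_var:g}) -> "
              f"posterior mean {post_mean:.3f}, SD {post_var ** 0.5:.3f}")
```

In this toy setting the diffuse prior leaves the posterior essentially at the data summary, whereas the tight priors pull the posterior mean toward the prior mean. A full analysis would repeat the actual Gibbs sampling run under each prior.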

Increasing demands for speed and cost-effectiveness are placing restrictions on the size and breadth of studies. Therefore, it is important to use statistical methods that incorporate all available information in the analysis, allowing for fully informed decisions. The Bayesian approach provides a coherent methodology for updating an initial knowledge base with experimental data. It is therefore well suited to the demands of the modern study.

Figure 6. In the ureteral stone example and under the partial model, the posterior distributions of the area under the ROC curve based on different prior distributions.

APPENDIX A: SUMMARY ROC MEASURES FOR OUR REGRESSION MODELS

Let $\mu_{D,X}$ denote the mean of the diagnostic outcome Y at the gold standard disease status D and additional covariates X. For simplicity, the variances of the outcomes are $\sigma_1^2$ and $\sigma_2^2$ for the nondiseased and diseased groups, respectively. Furthermore, let the ROC parameters be $\alpha_X = (\mu_{1,X} - \mu_{0,X})/\sigma_1$ and $\beta = \sigma_2/\sigma_1$. Note that it can easily be shown in our partial model that $\alpha_X = \gamma_0/\sigma_1$, free of X, where $\gamma_0$ is the regression coefficient for D. Thus, there is a single ROC curve under this model.

1. Sensitivity at given specificity: For any given (1 − specificity), p, the underlying sensitivity is $q_p = \Phi(\alpha_X + \beta\,\Phi^{-1}(p))$, where $\Phi$ denotes the standard normal cumulative distribution function.

2. Area under the curve: $A = \Phi\left(\alpha_X/\sqrt{1 + \beta^2}\right)$. The area under the curve is equal to the probability that the outcome for a randomly drawn diseased subject is higher than for a randomly drawn nondiseased subject (28).

3. Partial area between $p_1$ and $p_2$: $A_{p_1,p_2} = \int_{p_1}^{p_2} q_p \, dp$.

Figure 7. A hypothetical ROC curve illustrating the maximum improvement of sensitivity over chance at corresponding (1 − specificity).

Partial area is often preferred to A, especially when only a particular range of specificity or sensitivity is of interest (29).

4. Maximum improvement of sensitivity over chance: This is the maximum difference between the observed sensitivity and the sensitivity at chance (lying on the 45° line in ROC space) over all values of specificity (Fig 7). The corresponding (1 − specificity), denoted $p_{MIS}$, is found to be $p_{MIS} = \Phi\left(\left\{-\alpha_X\beta + [\alpha_X^2 + 2(\beta^2 - 1)\log\beta]^{1/2}\right\}/(\beta^2 - 1)\right)$ when $\beta \neq 1$, $p_{MIS} = \Phi(-\alpha_X/2)$ when $(\alpha_X > 0, \beta = 1)$, and $p_{MIS} \in \{0, 1\}$ when $(\alpha_X \leq 0, \beta = 1)$. The sensitivity corresponding to the maximum improvement of sensitivity is given by $q_{MIS} = \Phi(\alpha_X + \beta\,\Phi^{-1}(p_{MIS}))$, and the maximum improvement of sensitivity itself by $\Delta q_{MIS} = q_{MIS} - p_{MIS}$. Note that due to local convexity of the ROC curve when $\beta \neq 1$, there is always at least one value of p at which sensitivity is strictly less than p. In the ureteral stone example, this is evident on the upper right portion of the ROC curve for a 40-year-old subject with a stone in the UVJ (Fig 4).

APPENDIX B

Figure A1 depicts the BUGS code for the ureteral stone example, partial model (18).

Figure A1. BUGS code for the ureteral stone example, partial model.

ACKNOWLEDGMENTS

Appreciation is extended to Barbara McNeil, MD, PhD, for reviewing the manuscript and to Daryl Caudry, MS, for providing us with the data for the prostate example. We also thank two anonymous referees for their careful review and comments that improved the quality of this manuscript.
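The summary measures of Appendix A can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: it assumes the binormal form in which sensitivity at (1 − specificity) = p equals Phi(alpha + beta * Phi^{-1}(p)), takes the ROC parameters `alpha` and `beta` as given numbers, and uses only the standard library. The function names are hypothetical, and the partial area is approximated with a simple midpoint rule.

```python
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal: PHI.cdf is Phi, PHI.inv_cdf is its inverse


def sensitivity(p, alpha, beta):
    """Sensitivity q_p at (1 - specificity) = p, for 0 < p < 1."""
    return PHI.cdf(alpha + beta * PHI.inv_cdf(p))


def auc(alpha, beta):
    """Area under the assumed binormal ROC curve."""
    return PHI.cdf(alpha / sqrt(1.0 + beta ** 2))


def partial_auc(p1, p2, alpha, beta, steps=2000):
    """Partial area between p1 and p2, approximated by the midpoint rule."""
    width = (p2 - p1) / steps
    return width * sum(
        sensitivity(p1 + (i + 0.5) * width, alpha, beta) for i in range(steps)
    )


def max_improvement(alpha, beta):
    """Return (p_MIS, q_MIS, q_MIS - p_MIS) at the interior maximizer.

    Setting the derivative of Phi(alpha + beta * z) - Phi(z) to zero gives a
    quadratic in z = Phi^{-1}(p); the root used below is the local maximum.
    For beta == 1 the result is meaningful only when alpha > 0.
    """
    if beta == 1.0:
        z = -alpha / 2.0
    else:
        root = sqrt(alpha ** 2 + 2.0 * (beta ** 2 - 1.0) * log(beta))
        z = (-alpha * beta + root) / (beta ** 2 - 1.0)
    p_mis = PHI.cdf(z)
    q_mis = PHI.cdf(alpha + beta * z)
    return p_mis, q_mis, q_mis - p_mis


if __name__ == "__main__":
    a, b = 1.0, 2.0  # hypothetical ROC parameters
    print("AUC:", auc(a, b))
    print("partial area on (0, 0.2):", partial_auc(0.0, 0.2, a, b))
    print("p_MIS, q_MIS, improvement:", max_improvement(a, b))
```

In a Bayesian analysis these functions would be applied to each posterior draw of the ROC parameters, which yields a posterior distribution for every summary measure. With `beta` different from 1, evaluating `sensitivity(p, alpha, beta)` at small or large `p` also shows the region where the curve dips below the chance line.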

REFERENCES

1. Tempany CM, Zhou X, Zerhouni EA, et al. Staging of prostate cancer: result of radiologic diagnostic oncology group project comparison of three MR imaging techniques. Radiology 1994; 192:
2. Fielding JR, Silverman SG, Samuel S, Zou KH, Loughlin KR. Unenhanced helical CT of ureteral stones: a replacement for excretory urography in planning treatment. AJR Am J Roentgenol 1998; 171:
3. Swets JA, Pickett RM. Evaluation of diagnostic systems. New York, NY: Academic Press,
4. Campbell G. General methodology. I. Advances in statistical methodology for the evaluation of diagnostic and laboratory tests. Stat Med 1994; 13:
5. Shapiro DE. The interpretation of diagnostic tests. Stat Methods Med Res 1999; 8:
6. Metz CE, Herman BA, Shen J. Maximum-likelihood estimation of receiver operating characteristic (ROC) curves from continuous distributed data. Stat Med 1998; 17:
7. Tosteson ANA, Begg CB. A general regression methodology for ROC curve estimation. Med Decis Making 1988; 8:
8. Pepe MS. A regression modeling framework for ROC curves in medical diagnostic testing. Biometrika 1997; 84:
9. Pepe MS. Three approaches to regression analysis of receiver operating characteristic curves for continuous test results. Biometrics 1998; 54:
10. Pepe MS. An interpretation for the ROC curve and inference using GLM procedures. Biometrics 2000; 56:
11. Beam CA. Random-effects models in the receiver operating characteristic curve-based assessment of the effectiveness of diagnostic imaging technology: concepts, approaches, and issues. Acad Radiol 1995; 2(suppl 1):S4 S
12. Gatsonis CA. Random-effects models for diagnostic accuracy data. Acad Radiol 1995; 2(suppl 1):S14 S
13. Peng F, Hall WJ. Bayesian analysis of ROC curves using Markov chain Monte Carlo methods. Med Decis Making 1996; 16:
14. Hellmich M, Abrams KR, Jones DR, Lambert PC. A Bayesian approach to a general regression model for ROC curves. Med Decis Making 1998; 18:
15. Ishwaran H, Gatsonis CA. A general class of hierarchical ordinal regression models with applications to correlated ROC analysis. Can J Stat 2000; 28:
16. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. London, England: Chapman & Hall,
17. Tanner MA. Tools for statistical inference: methods for the exploration of posterior distribution and likelihood function. 2nd ed. New York, NY: Springer-Verlag,
18. Spiegelhalter DJ, Thomas A, Best NG, Gilks WR. BUGS: Bayesian inference using Gibbs sampling, version Cambridge, England: MRC Biostatistics Unit,
19. D'Amico AV, Whittington R, Malkowicz SB, et al. Biochemical outcome after radical prostatectomy, external beam radiation therapy, or interstitial radiation therapy for clinically localized prostate cancer. JAMA 1998; 280:
20. Zou KH, Tempany CM, Fielding JR, Silverman SG. Original smooth receiver operating characteristic curve estimation from continuous data: statistical methods for analyzing the predictive value of spiral CT of ureteral stones. Acad Radiol 1998; 5:
21. Best NG, Cowles MK, Vines SK. CODA manual version Cambridge, England: MRC Biostatistics Unit,
22. Gelman A, Rubin D. Inference from iterative simulation using multiple sequences. Stat Sci 1992; 7:
23. Spiegelhalter DJ, Best NG, Carlin BP. Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models. Web site of the MRC Biostatistics Unit, Cambridge, England. Available at: preslid.shtml. Accessed May 21,
24. Gelfand AE, Dey DK, Chang H. Model determination using predictive distributions with implementation via sampling-based methods. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, eds. Bayesian statistics 4. New York, NY: Oxford University Press, 1992;
25. Lindley DV. Bayesian inference. In: Kotz S, Johnson NL, eds. International encyclopedia of statistics. Vol 1. New York, NY: Wiley, 1982;
26. Kass RE, Wasserman L. The selection of prior distribution by formal rules. JASA 1996; 91:
27. Tandberg D, Deely JJ, O'Malley AJ. Generalized likelihood ratios for quantitative diagnostic test scores. Am J Emerg Med 1997; 15:
28. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:
29. McClish DK. Analyzing a portion of the ROC curve. Med Decis Making 1989; 9:
