Statistical Methods in Medical Research

Size: px

Start display at page:

Download "Statistical Methods in Medical Research"

Imogene White
5 years ago
Views:

Statistical Methods in Medical Research Meta-analysis for the comparison of two diagnostic tests to a common gold standard: A generalized linear mixed model approach Journal: Statistical Methods in

diagnostic studies is still a rapidly developing area of biostatistical research.

1 Statistical Methods in Medical Research Meta-analysis for the comparison of two diagnostic tests to a common gold standard: A generalized linear mixed model approach Journal: Statistical Methods in Medical Research Manuscript ID SMM--00 Manuscript Type: Original Article Keywords: Meta-analysis, Sensitivity, Specificity, Comparison, Generalized linear mixed models Abstract: Meta-analysis of diagnostic studies is still a rapidly developing area of biostatistical research. Especially, there is an increasing interest in methods to compare different diagnostic tests to a common gold standard. Restricting to the case of two diagnostic tests, in these meta-analyses the parameters of interest are the differences of sensitivities and specificities (with their corresponding confidence intervals) between the two diagnostic tests while accounting for the various associations within single studies, between the two tests and within patients. We propose statistical models with a quadrivariate response (where sensitivity of test, specificity of test, sensitivity of test, and specificity of test are the four responses) as a sensible approach to this task. Using a quadrivariate generalized linear mixed model (GLMM) naturally generalizes the common standard bivariate model of meta-analysis for a single diagnostic test. If information on several thresholds of the tests are available, the quadrivariate model can be further generalized to yield a comparison of full ROC curves. We illustrate our model by an example where two screening methods for the diagnosis of type diabetes are compared.

2 Page of 0 Statistical Methods in Medical Research A. Hoyer, O. Kuss Meta-analysis for the comparison of two diagnostic tests to a common gold standard: A generalized linear mixed model approach Journal Title XX(X):?? c The Author(s) 0000 Reprints and permission: sagepub.co.uk/journalspermissions.nav DOI: 0./ToBeAssigned Annika Hoyer, Oliver Kuss Abstract Meta-analysis of diagnostic studies is still a rapidly developing area of biostatistical research. Especially, there is an increasing interest in methods to compare different diagnostic tests to a common gold standard. Restricting to the case of two diagnostic tests, in these meta-analyses the parameters of interest are the differences of sensitivities and specificities (with their corresponding confidence intervals) between the two diagnostic tests while accounting for the various associations within single studies, between the two tests and within patients. We propose statistical models with a quadrivariate response (where sensitivity of test, specificity of test, sensitivity of test, and specificity of test are the four responses) as a sensible approach to this task. Using a quadrivariate generalized linear mixed model (GLMM) naturally generalizes the common standard bivariate model of meta-analysis for a single diagnostic test. If information on several thresholds of the tests are available, the quadrivariate model can be further generalized to yield a comparison of full ROC curves. We illustrate our model by an example where two screening methods for the diagnosis of type diabetes are compared.

3 Statistical Methods in Medical Research Page of Keywords Meta-analysis; Sensitivity; Specificity; Comparison; Generalized linear mixed models Introduction Statistical methods for the meta-analysis of diagnostic studies has been a vivid research area in recent years. Although it is meanwhile accepted that the bivariate logistic regression model with random effects ; should be regarded as the standard approach for such analyses, this model has been extended in several directions. In response to numerical problems when using maximum likelihood methods for estimation, more robust methods have been proposed that are guaranteed to give always estimates with confidence intervals ;. We proposed to use a model with beta-binomial marginal distributions that are linked by a copula, which results in a closed likelihood function, thus better convergence, and offers additional flexibility for modelling the correlation between sensitivity and specificity. Moreover, it has been argued to additionally account for the disease prevalence to arrive at summary estimates for sensitivity and specificity by using trivariate models ;. It is surprising that there is yet no extension that allows meta-analysis for the comparison of two diagnostic tests to a common gold standard. These studies occur more often than expected as it was shown by Takwoingi et al. which found more than systematic reviews which compared the accuracy of two or more tests. In line with this, other medical researchers have called for meta-analytic methods to this task. For example, Tatsioni et al. wrote as early as in 00, that frequently, meta-analyses assess several diagnostic tests for the same condition. In such cases, we may wish not only to report the performance of each test but also to compare performance between tests. Leeflang et al. 0 emphasized that policymakers and guideline developers may be particularly interested in comparative accuracy of diagnostic tests. In our research area of diabetes there are two systematic reviews ; that compare HbA c and fasting plasma glucose for the population-based screening of type diabetes mellitus. Both reviews German Diabetes Center, Leibniz Institute for Diabetes Research at Heinrich Heine University Düsseldorf, Institute for Biometry and Epidemiology Corresponding author: Annika Hoyer, Deutsches Diabetes-Zentrum, Institut für Biometrie und Epidemiologie, Auf m Hennekamp, 0 Düsseldorf, Germany annika.hoyer@ddz.uni-duesseldorf.de [Version: 0/0/0 v.0]

4 Page of 0 Statistical Methods in Medical Research A. Hoyer, O. Kuss include more than 0 studies, but report results only qualitatively. Especially, they do not report differences of sensitivities or specificities which are probably the parameters of highest interest when comparing two diagnostic tests. Admittedly, methods have been proposed earlier for meta-analysis to compare two diagnostic tests. However, the method of Siadaty et al. ; confounds the information for sensitivity and specificity and their differences by combining them in a diagnostic odds ratio, a measure which is rarely used by practitioners. Another approach was given by Trikalinos et al.. This approach assumes the two tests to be independent, which would rarely be given in most clinical applications. Moreover, the method can only be applied if both, individual and aggregated proband data, are available. Finally, the Diagnostic Test Accuracy Working Group of the Cochrane Collaboration proposed an approach which allows comparisons of diagnostic tests. They suggest a meta-regression extension of the bivariate model including a binary covariate for the tests to compare. However, this model does not account for potential correlations between the two tests probably compromising the statistical properties of the method. Only recently, this model has been extended to the network meta-analysis situation, allowing comparison of more than two tests by Menten and Lesaffre. In the following, a new model is presented which compensates for the disadvantages of earlier approaches. It computes differences of sensitivities and specificities while fully accounting for all correlations within and heterogeneities between studies. The model is a natural quadrivariate extension of the standard bivariate model for meta-analyzing one diagnostic test. As such it inherits all the well-known and appreciated properties from this model. Additionally, it is possible to use information from multiple test thresholds if these are given in the single studies. In Section, we introduce the data set that motivated our research. Section introduces our model and in Section we report the results of a small simulation study that validates the proposed model in realistic situations. In Section, we come back to our data set. Finally, in Section, we summarize and discuss our findings, and point to advantages and drawbacks of the model. Data set We illustrate our method by two systematic reviews ; on population-based screening of type diabetes mellitus. In principle, three methods are available to diagnose diabetes: the oral glucose tolerance test (OGTT), measurement of HbA c and measurement of fasting plasma glucose (FPG). HbA c and FPG are less invasive than the OGTT, where

5 Statistical Methods in Medical Research Page of Journal Title XX(X) HbA c has the additional advantage that patients are not requested to refrain from eating and drinking any liquids other than water before the testing procedure, which is especially important in a screening setting. In the two reviews, the single studies use mainly the OGTT as reference standard and compare HbA c to FPG. Admittedly, the actual situation is a bit more complicated, and the study-specific reference standards sometimes also includes information from HbA c or FPG, potentially favouring one of the two tests over the other. However, we ignore these subtleties here for the sake of the presentation of our method. Just aside, differences between reference standards were also ignored in the original Kodama paper. Moreover, in both reviews no quantitative estimates were reported but results were given only narratively. For a first analysis we use data from Bennett et al. and Kodama et al. as given in Table. PLACE TABLE APPROXIMATELY HERE In a second analysis we use the same two systematic reviews, but additionally include all information on the reported thresholds of HbA c and FPG from the single studies. This was done because we noticed that a number of studies reported this additional information and we did not want it to be wasted. To this task, we re-run the search algorithm from Kodama et al., but found no additional studies. One of us (AH) then read all single studies in full text and reconstructed the four-fold tables for each reported threshold. As a result, we found that in studies pairs of sensitivity and specificity were given which used different thresholds for HbA c (ranging from. to.) and for FPG (ranging from.0 to.). That is, a standard analysis that uses only a single pair of sensitivity and specificity from each study, would use only % of the available observations. The full data set can be found in the Supporting Web Materials. Statistical methods Bivariate logistic regression model As our model is a straightforward extension of the bivariate standard model, we shortly reiterate this model. To this task, we use the following notation. We assume that each individual study (indexed by i =,..., I) in the meta-analysis reports a four-fold table with the number of true positives (T P i ), true negatives (T N i ), false positives (F P i ), and false negatives (F N i ). The sensitivity of the i-th study is defined as Se i = T P i /(T P i +

6 Page of 0 Statistical Methods in Medical Research A. Hoyer, O. Kuss F N i ) and the specificity as Sp i = T N i /(T N i + F P i ). The numbers of true positives and true negatives are assumed to be binomially distributed: T P i Se i Binomial(T P i + F N i, Se i ), () T N i Sp i Binomial(T N i + F P i, Sp i ). () To model potential across study correlation and heterogeneity of sensitivity and specificity, a generalized linear mixed model is used: logit(se i ) = µ + ϕ i, logit(sp i ) = ν + ψ i () with logit(p) = log(p/( p)) and random effects ϕ i and ψ i. The random effects are assumed to follow a bivariate normal distribution ( ) [ (0 ) ( )] ϕi σϕ ρσ ϕ σ ψ N,. () ψ i 0 ρσ ϕ σ ψ That is, σϕ and σ ψ model the heterogeneity (on the logit scale) in sensitivities and specificities across studies, and ρ the across study correlation. As noted in the introduction, the Cochrane model extends the bivariate model by a single binary covariate for the tests under comparison, resulting in logit(se i ) = µ + α + ϕ i, logit(sp i ) = ν + β + ψ i, () but again, this model assumes that the two diagnostic tests are independent. Quadrivariate logistic regression model As written before, the quadrivariate model for comparing two tests is an extension of the bivariate model. We now assume that each study reports two four-fold tables with the number of true positives (T P ij ), true negatives (T N ij ), false positives (F P ij ), and false negatives (F N ij ) for the i-th study and the j-th diagnostic test (j =, ). Note that we assume that the gold standard is the same for both tests, so that each individual contributes three binary pieces of information: its result for test, its result for test and its true disease status. Analogous to the bivariate approach, we assume that the true positives and the true negatives of the i-th study and the j-th test are binomially distributed, given the σ ψ

7 Statistical Methods in Medical Research Page of Journal Title XX(X) sensitivities and the specificities of test j and study i. T P ij Se ij Binomial(T P ij + F N ij, Se ij ), () T N ij Sp ij Binomial(T N ij + F P ij, Sp ij ), () The corresponding logit transformations are logit(se ij ) = µ j + ϕ ij, logit(sp ij ) = ν j + ψ ij where logit(p) = log(p/( p)). The random effects (ϕ i, ψ i, ϕ i, ψ i ) T are now assumed to follow a quadrivariate normal distribution ϕ i ψ i ϕ i ψ i 0 σϕ ρ ϕ ψ σ ϕ σ ψ ρ ϕ ϕ σ ϕ σ ϕ ρ ϕ ψ σ ϕ σ ψ N 0 0, σ ψ ρ ψϕ σ ψ σ ϕ ρ ψψ σ ψ σ ψ σϕ ρ ϕ ψ σ ϕ σ ψ. () 0 The four variance parameters σϕ, σψ, σϕ, σψ are used to describe possible between-study heterogeneity of sensitivities (Se, Se ) and specificities (Sp, Sp ). The parameters ρ ϕ ψ, ρ ϕ ϕ, ρ ϕ ψ, ρ ψ ϕ, ρ ψ ψ, ρ ϕ ψ capture the corresponding correlation among the random effects. Assuming the the four correlation parameters ρ ϕ ϕ, ρ ϕ ψ, ρ ψ ϕ and ρ ψ ψ to be zero is equivalent to fitting two independent bivariate models for both diagnostic tests separately. Finally, the differences of sensitivities and specificities as our main parameters of interest are estimated as follows: Se = exp(ˆµ ) + exp(ˆµ ) exp(ˆµ ) () + exp(ˆµ ) for the difference of sensitivities, and analogously for Sp, the difference of specificities, through replacing the ˆµ j by ˆν j. Accounting for multiple thresholds Results from diagnostic tests frequently originate from dichotomizing a continuous marker at certain thresholds. The single studies in a meta-analysis thus might report several four-fold tables, one for each threshold. These additional information is frequently ignored in meta-analyses and we saw this waste of information also in σ ψ

8 Page of 0 Statistical Methods in Medical Research A. Hoyer, O. Kuss our example meta-analysis. However, it is straightforward to include the threshold information as a covariate in our model (and of course, also in the bivariate standard model) by using logit(se ij ) = µ j + X ij α j + ϕ ij, logit(sp ij ) = ν j + X ij β j + ψ ij where µ j and ν j are intercepts for logit(se ij ) and logit(sp ij ) and X ij is a vector containing the threshold values from each study and each test. The threshold values themselves and also the number of them can differ for every study. To model the random effects (ϕ i, ψ i, ϕ i, ψ i ) T, a quadrivariate normal distribution is assumed as before. It should be noted that accounting for thresholds simply corresponds to a meta-analysis of full ROC curves from the single studies. As such we propose here also a method for the meta-analysis of differences of ROC curves. Simulation To assess the statistical properties of our model in realistic situations, a simulation study was conducted. The simulation program was written in SAS. (SAS Institute Inc., Cary, NC, USA). Setting Being inspired by our two example meta-analyses ; and another data set from cardiology, the following parameters were varied: True sensitivities and specificities: The true sensitivity and specificity of test was held constant with 0% and 0%, respectively. The true sensitivity and specificity of test were varied between (%, 0%, 0%) and (%, 0%, 0%), respectively. Following this, we achieved true differences in sensitivities of -0 percentage points (pp), 0 pp, and pp and true differences in specificities of -0 pp, 0 pp, and pp. The true association between sensitivities and specificities of both tests:

9 Statistical Methods in Medical Research Page of Journal Title XX(X) To this task, we assumed the following three random effect matrices (as in ()), here given as their corresponding correlation matrices: Γ none = , Γ neg = , Γ mix = Γ none assumes that sensitivities and specificities across studies and even between two tests, that is, within the same patient, are completely independent. For Γ neg we chose a negative correlation of -0. between sensitivity and specificity of each test because negative correlations are frequently observed and actually expected in reality. The correlation between the two sensitivities and the two specificities is assumed to be -0.. The matrix Γ mix denotes a mixed correlation structure. Based on Γ neg we now assume a positive correlation of 0. between sensitivities and specificities, because such positive values for the correlations were seen in our diabetes data set. We did not vary the true random effect variances σ ϕ, σ ψ, σ ϕ, σ ψ, but kept them constant at the value 0. on the logit scale. This value was inspired by our previous work and corresponds to a variance of sensitivity (and specificity) of 0.0 on the [0,]- scale. Data generation After combining the design parameters we got different simulation scenarios. For each of them,,000 meta-analyses were generated. The simulated number of studies within each meta-analysis was uniformly distributed and varied between 0 and 0. The study sizes were also generated from a uniform distribution and varied between 0 and 00. Finally, the number of diseased persons in each study and for a given study size was also sampled from a uniform distribution that varied between 0 and the sampled size. These choices were based on different meta-analyses reported in practice, for example

10 Page 0 of 0 Statistical Methods in Medical Research A. Hoyer, O. Kuss by Menke et al. or Kodama et al.. To generate the observed numbers of true positives and true negatives in the single studies, the VNORMAL call in SAS/IML was used to create quadrivariate normally distributed random vectors following the specifications for the respective Γ. These random numbers were used to calculate logit-transformed values for the two sensitivities and specificities with respect to their true values. After this, an expit-transformation led to the values for Se, Sp, Se, and Sp. These were multiplied by the number of diseased and non-diseased probands and rounded to the nearest integer to get the final numbers of true positives and true negatives. Estimation methods For each of the simulated meta-analyses, parameters have to be estimated for the quadrivariate model. These are the two sensitivities, the two specificities, and the 0 values in the random effects covariance matrix. Parameter estimation via the maximum likelihood principle in generalized linear mixed models is complicated by the fact that integrals which can not be solved analytically, appear in the likelihood function. Well-established methods that address this problem and yield exact maximum likelihood estimates are Gaussian quadrature or Markov Chain Monte Carlo (MCMC). Approximate methods like penalized quasi-likelihood (PQL) are also available. We restrict here to Gaussian quadrature and PQL estimation because both methods can be conveniently coded in SAS procedures NLMIXED and GLIMMIX. The GLIMMIX code is given in the Supporting Web Materials. Actually, with respect to estimation methods, we compared implementations: Penalized quasi-likelihood using the logit link (PROC GLIMMIX) Penalized quasi-likelihood using the identity link (PROC GLIMMIX) Gaussian quadrature using the logit link (PROC NLMIXED). We included a model with an identity link, because in this model the raw difference in sensitivities ˆµ ˆµ and specificities ˆν ˆν originate directly from the natural parameters. Opposed to this and as seen in equation (), for the standard logit link, differences in sensitivities and specificities are linear combinations of model parameters and their confidence intervals have to be computed, with some extra effort, by the multivariate delta method. All procedures were run with their default options to ensure a fair comparison between models. Starting values for the GLIMMIX procedures are automatically generated within the procedure. In case of NLMIXED, where starting values should be given, we computed them as raw proportions of sensitivities and

11 Statistical Methods in Medical Research Page of Journal Title XX(X) specificities. The starting values for the variances and correlations were also generated using the corresponding raw values and appropriate transformations of them. As a reference method we also included the model of the Diagnostic Test Accuracy Working Group of the Cochrane Collaboration, henceforth denoted as the Cochrane method. Parameters from this model were estimated by Gaussian quadrature via the SAS NLMIXED code from Macaskill et al. Additionally, we implemented SAS GLIMMIX code yielding PQL estimates as in the quadrivariate case. For comparison of the estimation methods, mean bias and empirical coverage (to the % level) were calculated. Confidence intervals were calculated assuming t- distributions where we used the default numbers of degrees of freedom from the respective SAS procedure. To address the problem of numerical robustness, we report the number of converged runs, too. Results Our parameters of interest are the differences between sensitivities and specificities of the two tests. Therefore, in reporting our results we restrict to them. Tables, and give the simulation results for the situation that is most similar to our diabetes example where the test with the higher sensitivity has a lower specificity as compared to the other test. However, the description of our outcomes is based on the complete simulation results which can be found in the Supporting Web Materials. Bias In terms of bias all estimation methods performed nearly similar, except in a few situations. Averaging the different correlation structures, the overall bias from the model using Gaussian quadrature was slightly higher compared to the other estimation methods. Referring explicitly to the underlying correlation matrices, the differences of sensitivities and specificities were overestimated in case of mixed as compared to the none correlation structures. Comparing the different estimation methods, the most difficult situations, resulting in a higher bias, were these with negative underlying correlations. This occurred especially for the model using Gaussian quadrature and the model with the identity link. The quadrivariate model using PQL and the logit link was the most robust in terms of bias without huge outliers and only small deviations from the true values. Both implementations of the Cochrane approach led to biased estimates in the same range. The magnitude was a bit higher as compared to the quadrivariate model using PQL and the logit link. That means our new proposed model seems to perform better than the proposed Cochrane model in terms of bias.

12 Page of 0 Statistical Methods in Medical Research A. Hoyer, O. Kuss PLACE TABLE APPROXIMATELY HERE Coverage In terms of coverage it is important to note that due to random error, values between.% and.% (%-Wald confidence interval for a binomial proportion of successes out of,000 trials) are still compatible with the hypothesis of a correct coverage. In case of our quadrivariate models, all estimation methods obtained results near the expected %. The best results were obtained in cases where no correlation is present. Thereby, Gaussian quadrature had a small advantage over the other quadrivariate models. In case of non-zero correlations, the models using PQL performed similar and better than the model using Gaussian quadrature. The simulation results showed obviously that the Cochrane models led to worse results compared to all implemented estimation methods of the quadrivariate model. That is, our proposed model performs frequently better than the Cochrane approach in terms of coverage. PLACE TABLE APPROXIMATELY HERE Convergence In terms of convergence none of the models reaches,000 converged runs and worst results were observed in cases with the negative underlying correlation structure. The methods using the the logit link were always superior and the methods using Gaussian quadrature were always inferior. The performance of the model using the identity link depended on the underlying simulation setting. It seemed to be fragile in the scenarios where the specificity of the first test is lower than the specificity of the second test. With respect to convergence, the Cochrane approach led in most cases to better results than the quadrivariate models. This was expected, as the Cochrane model is a simpler model including only two random effects. PLACE TABLE APPROXIMATELY HERE Examples In this section we return to our example on population-based screening of type diabetes mellitus. As noted previously, we report two analyses, the first one using the original data from the two systematic reviews, the second one using the full information from all reported thresholds of HbA c and FPG.

13 Statistical Methods in Medical Research Page of Journal Title XX(X) First analysis using a single threshold per study The estimated sensitivities, specificities and their corresponding differences are shown in Table. Using Gaussian quadrature we found a difference of about pp between the sensitivities of the two tests, favouring HbA c. The model using PQL and the logit link finds that FPG has a higher sensitivity than HbA c, but with a high uncertainty as can be seen from the wide confidence interval. Both models judge FPG to have a higher specificity than HbA c, but again, confidence intervals are wide. Although we have seen in our simulation that the non-canonical identity link is not necessarily inferior in terms of convergence, the model with the identity link did not converge for the example data set. PLACE TABLE APPROXIMATELY HERE Second analysis using multiple thresholds per study In the second analysis, we proceed to use the full information on all possible thresholds for comparing HbA c and FPG. This is equivalent to perform a meta-analysis on the differences of ROC curves. This comes with the technical difficulty that HbA c and FPG, in order to compare them, must be measured on the same scale. To this task, we fit a simple linear regression model with the observed HbA c threshold values as dependent and the observed FPG threshold values as independent variable from our data set. The resulting equation HbA c =. + 0.FPG is used to model the relationship between the thresholds. The results are illustrated in Figures and where observed and estimated differences between sensitivities and specificities are given. PLACE FIGURES AND APPROXIMATELY HERE We only use the GLMM with PQL estimation and the logit link, because the simulation has shown that this model performs best if one threshold is available. The biggest difference between HbA c and FPG can be seen in the range between. and.0, where (in case of sensitivity) differences up to pp, favouring FPG, can be found. On the other hand, the largest confidence intervals correspond to these differences. The estimated differences of sensitivities show that FPG is judged to perform better in higher ranges of thresholds. Only when lower thresholds were used, HbA c is preferred. At the threshold of. which is recommended by the American Diabetes Association 0 as well as by the WHO for diagnosing diabetes, FPG performs better in terms of sensitivity while the estimated specificity of both tests is nearly identical at this value.

14 Page of 0 Statistical Methods in Medical Research A. Hoyer, O. Kuss The example shows that it is highly beneficial from a clinical viewpoint to explicitly model the information from different thresholds: Only then sensitivities and specificities can be compared at specific thresholds. In a standard meta-analysis using only two pairs of sensitivity and specificity from each study and test, only one pair of overall differences between sensitivities and specificity would have been available, ignoring all information from the different thresholds. Discussion In this paper, we propose a new model for the meta-analysis of diagnostic studies that compare two diagnostic tests to a common gold standard, situations which are not that rare in medical research. Up to now it was not possible to summarize the results in a meta-analytic way, at least if one was interested in reporting differences in sensitivity and specificity between the two tests while accounting for all potential correlations between tests and populations. Our model constitutes a quadrivariate generalized linear mixed model and is thus just a straightforward extension of the current bivariate standard model as proposed by Reitsma et al. and Chu and Cole. As such, all the well established statistical theory and software implementations for generalized linear mixed model with a multivariate outcome can be used. In a small simulation study we showed that the standard logit link and the PQL principle for parameter estimation worked well and better than the Cochrane approach in a variety of realistic scenarios. By simply adding a covariate to the linear predictor we were able to meta-analyse studies with multiple thresholds corresponding to the meta-analysis of differences of ROC curves. This is a straightforward generalization of previous methods that proposed estimation of summary ROC curves while using information from several thresholds, however, for just one single diagnostic test. ; While introducing our model, we proposed to estimate the random effects covariance matrix () in its full unrestricted form. However, this might not always be necessary and restricting variances to the same value or covariances to zero might result in improved fits. Fits for different matrices could be compared by the BIC and by the - Log Likelihoods (-LogL) of nested models, however, only if exact maximum likelihood estimates (e.g., by Gaussian quadrature) are calculated. In case of our diabetes example we achieve the best results in terms of BIC indeed not for the full model with parameters, but for a smaller model with parameters (BIC=,.). It is of interest here that the quadrivariate model which closely approximates the bivariate

15 Statistical Methods in Medical Research Page of Journal Title XX(X) Cochrane model in terms of the random effects matrix is judged inferior with respect to the BIC (BIC=,.). On the other hand, some limitations of the model should be pointed out. First, though the PQL method for parameter estimation was more robust than Gaussian quadrature, there are still some problems concerning numerical robustness. This was expected, because the number of estimated parameters is large in the quadrivariate model, especially as compared to the number of observations, i.e. the numbers of sensitivities and specificities across studies and tests. Models without random effects like copulabased ones as proposed in our previous work ; could be an alternative. In any case, the Cochrane model which is simpler from a statistical viewpoint, behaved well in the simulation, especially concerning robustness, and is a good alternative when the quadrivariate model has converge problems. It should be noted that our model assumes only the two standard aggregated fourfold tables to be available from each single study. Especially it does not need individual proband data where the three binary outcomes for each individual (result for tests and, and the true disease status) would be explicitly given. We do not consider this a real limitation of our model, because in our experience individual-proband data are rarely accessible. On the other hand, if such information were actually available we could introduce an additional hierarchical (that is, proband) level to adequately adjust for within-proband correlation. The resulting, more complex model would still be a quadrivariate GLMM. Thinking further, methods for comparing more than two diagnostic tests while fully accounting for correlations between tests are definitely needed. For example, in a subsample of larger studies in Takwoingi et al., only one third of all studies compared two tests, but two thirds compared two or more tests. As such, network meta-analyses of diagnostic tests or multiple-test (not multiple-treatment) comparisons will be a fruitful area in future research. References. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. Journal of Clinical Epidemiology 00; ():-0. DOI: 0.0/j.jclinepi

16 Page of 0 Statistical Methods in Medical Research A. Hoyer, O. Kuss. Chu H, Cole SR. Bivariate meta-analysis of sensitivity and specificity with sparse data: a generalized linear mixed model approach. Journal of Clinical Epidemiology 00; ():-. DOI: 0.0/j.jclinepi Chen Y, Liu Y, Ning J, Nie L, Zhu H, Chu H. A composite likelihood method for bivariate meta-analysis in diagnostic systematic reviews. Statistical Methods in Medical Research 0 Dec.. Zapf A, Hoyer A, Kramer K, Kuss O. Nonparametric meta-analysis for diagnostic accuracy studies. Statistics in Medicine 0; ()-. DOI: 0.00/sim.. Kuss O, Hoyer A, Solms A. Meta-analysis for diagnostic accuracy studies: A new statistical model using beta-binomial distributions and bivariate copulas. Statistics in Medicine 0; ():-0. DOI: 0.00/sim.0. Chu H, Nie L, Cole SR, Poole C. Meta-analysis of diagnostic accuracy studies accounting for disease prevalence: alternative parameterizations and model selection. Statistics in Medicine 00; ():-. DOI: 0.00/sim.. Hoyer A, Kuss O. Statistical methods for meta-analysis of diagnostic tests accounting for prevalence - A new model using trivariate copulas. Statistics in Medicine 0; ():-. DOI: 0.00/sim.. Takwoingi Y, Leeflang MM, Deeks JJ. Empirical Evidence of the Importance of Comparative Studies of Diagnostic Test Accuracy. Annals of Internal Medicine 0; ():-. DOI: 0./ Tatsioni A, Zarin DA, Aronson N, Samson DJ, Flamm CR, Schmid C, Lau J. Challenges in Systematic Reviews of Diagnostic Technologies. Annals of Internal Medicine 00; (): Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM; Cochrane Diagnostic Test Accuracy Working Group. Systematic Reviews of Diagnostic Test Accuracy. Annals of Internal Medicine 00; ():-.. Bennett CM, Guo M, Dharmage SC. HbA(c) as a screening tool for detection of Type diabetes: a systematic review. Diabetic Medicine 00; ():-.. Kodama S, Horikawa C, Fujihara K, Hirasawa R, Yachi Y, Yoshizawa S, Tanaka S, Sone Y, Shimano H, Iida KT, Saito K, Sone H. Use of high-normal levels of haemoglobin A C and fasting plasma glucose for diabetes screening and for prediction: a meta-analysis. Diabetes/ Metabolism Research and Reviews 0; ():0-. DOI: 0.00/dmrr.. Siadaty MS, Philbrick JT, Heim SW, Schectman JM. Repeated-measures modeling improved comparison of diagnostic tests in meta-analysis of dependent studies. Journal of Clinical Epidemiology 00; ():-. DOI: 0.0/j.jclinepi

17 Statistical Methods in Medical Research Page of Journal Title XX(X). Siadaty MS, Shu J. Proportional odds ratio model for comparison of diagnostic tests in metaanalysis. BMC Medical Research Methodology 00; :.. Trikalinos TA, Hoaglin DC, Small KM, Terrin N, Schmid CH. Methods for the joint meta-analysis of multiple tests. Research Synthesis Methods 0; ():-. DOI: 0.00/jrsm.. Macaskill P, Gatsonis C, Deeks JJ, Harbord RM, Takwoingi Y. Chapter 0: Analysing and Presenting Results. In: Deeks JJ, Bossuyt PM, Gatsonis C (editors) Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Version.0 The Cochrane Collaboration, 00. Available from: Menten J, Lesaffre E. A general framework for comparative Bayesian meta-analysis of diagnostic studies. BMC Medical Research Methodology 0; :0. DOI: 0./s Picano E, Bedetti G, Varga A, Cseh E. The comparable diagnostic accuracies of dobutaminestress and dipyridamole-stress echocardiographies: a meta-analysis. Coronary Artery Disease 000. ():-.. Menke J. Bivariate random-effects meta-analysis of sensitivity and specificity with SAS PROC GLIMMIX. Methods of Information in Medicine 00; ():-, -. DOI: 0./ME American Diabetes Association. Classification and diagnosis of diabetes. Diabetes Care 0; (Suppl. ):S - S.. World Health Organization. Use of glycated haemoglobin (HbAc) in the diagnosis of diabetes mellitus. 0. (Available at diseases/reporthbac 0 edited.pdf), [Assessed 0 November 0]. Riley RD, Takwoingi Y, Trikalinos T, Guha A, Biswas A, Ensor J, Morris RK, Deeks JJ. Meta- Analysis of Test Accuracy Studies with Multiple and Missing Thresholds: A Multivariate- Normal Model. J Biomet Biostat 0; :.. Martínez-Camblor P. Fully non-parametric receiver operating characteristic curve estimation for random-effects meta-analysis. Statistical Methods in Medical Research 0; 0(0):-. (Epub ahead of print)

18 Page of 0 Statistical Methods in Medical Research A. Hoyer, O. Kuss Table. Type diabetes dataset from Bennett et al. and Kodama et al., first test: HbA c, second test: fasting plasma glucose Study TP FN FP TN TP FN FP TN Badings et al. 0 0 Choi et al. 0 Li et al. 0 Schöttker et al Tahrani et al. 0 0 Wang et al. Hu et al. 0 Zhang et al. 0 Zhou et al. 0 0 Kim et al. Nakagami et al. 0 0 Salmasi et al. 0 Glümer et al. Anand et al., South Asia 0 Anand et al., China Anand et al., Europe Jesudason et al. 0 Tavintharan et al. 0 Ko et al. 0 0 Papoz et al Choi et al. 0 0 Heianza et al. Law et al. 0 0 Mukai et al Soulimane et al., Denmark Soulimane et al., Australia 0 Soulimane et al., France Cederberg et al. Nakagami et al. Sato et al. 0 0 Inoue et al. 0 Inoue et al. Norberg et al. Takahashi et al. Ko et al. 0 Mannucci et al. Wiener et al. 0 0 Tanaka et al. 0

19 Statistical Methods in Medical Research Page of Journal Title XX(X) Table. Bias (multiplied by 00) for the differences of sensitivity and specificity on the [0, ]- scale. Abbreviations: Se=Difference of sensitivities, Sp=Difference of specificities, corr=correlation between Se, Sp, Se and Sp, SN=GLMM using GQ, SI=GLMM using PQL and the identity link, SL=GLMM using PQL and the logit link, CM=Cochrane model using GQ and the logit link, CA=Cochrane model using the PQL and the logit link Estimated model True Se True SN SI SL CM CA and Sp corr Se Sp Se Sp Se Sp Se Sp Se Sp -0%/% none negative mixed %/-0% none negative mixed Table. Empirical coverage (in %) for the % confidence intervals for the differences of sensitivity and specificity on the [0, ]- scale. Abbreviations: Se=Difference of sensitivities, Sp=Difference of specificities, corr=correlation between Se, Sp, Se and Sp, SN=GLMM using GQ, SI=GLMM using PQL and the identity link, SL=GLMM using PQL and the logit link, CM=Cochrane model using GQ and the logit link, CA=Cochrane model using the PQL and the logit link Estimated model True Se True SN SI SL CM CA and Sp corr Se Sp Se Sp Se Sp Se Sp Se Sp -0%/% none negative mixed %/-0% none negative mixed

20 Page 0 of 0 Statistical Methods in Medical Research A. Hoyer, O. Kuss Table. Number of converged runs from 000 simulation runs. Abbreviations: Se=Difference of sensitivities, Sp=Difference of specificities, corr=correlation between Se, Sp, Se and Sp, SN=GLMM using GQ, SI=GLMM using PQL and the identity link, SL=GLMM using PQL and the logit link, CM=Cochrane model using GQ and the logit link, CA=Cochrane model using the PQL and the logit link Estimated model True Se True SN SI SL CM CA and Sp corr Se Sp Se Sp Se Sp Se Sp Se Sp -0%/% none 0 0 negative 0 0 mixed %/-0% none negative 0 0 mixed Table. Results using the different GLMMs Model Sensitivity HbAc Specificity HbAc Sensitivity FPG Specificity FPG Difference of Difference of [% CI] (in %) [% CI] (in %) [% CI] (in %) [% CI] (in %) sensitivities specificities [% CI] (in pp) [% CI] (in pp) GLMM. [.;.]. [0.;.].0 [.;.]. [.;.]. [-0.;.] -. [-.; -.] Gaussian Quadrature (logit link) GLMM PQL - [-; -] - [-; -] - [-; -] - [-; -] - [-; -] - [-; -] (identity link) GLMM PQL. [.;.] 0. [.;.]. [.0;.].0 [.0;.0] -.0 [-.;.] -. [-.;.0] (logit link) Figure. Difference of sensitivities Threshold HbAc Threshold FPG Estimated difference of sensitivities with respect to different thresholds

21 Statistical Methods in Medical Research Page of Journal Title XX(X) Difference of specificities Threshold HbAc Threshold FPG Figure. Estimated difference of specificities with respect to different thresholds

22 Page of 0 Statistical Methods in Medical Research Supporting Web Materials for Hoyer and Kuss: Meta-analysis for the comparison of two diagnostic tests to a common gold standard: A generalized linear mixed model approach In this supplementary material the complete diabetes data set is given. The bold entries are the original included thresholds. Study Threshold HbA c TP FN FP TN Threshold FPG TP FN FP TN Badings et al... 0 Choi et al Li et al Schöttker et al Tahrani et al Wang et al..0. Hu et al Zhang et al Zhou et al Kim et al Nakagami et al Salmasi et al.. 0. Glümer et al Anand et al., South Asia.. 0 Anand et al., China.. Anand et al., Europe Jesudason et al Tavintharan et al

23 Statistical Methods in Medical Research Page of Ko et al Papoz et al Choi et al Heianza et al... Law et al Mukai et al Soulimane et al., Denmark Soulimane et al., Australia Soulimane et al., France Cederberg et al... Nakagami et al Sato et al Inoue et al... 0 Inoue et al Norberg et al Takahashi et al... Ko et al... 0 Mannucci et al...0 Wiener et al Tanaka et al

24 Page of 0 Statistical Methods in Medical Research Supporting Web Materials for Hoyer and Kuss: Meta-analysis for the comparison of two diagnostic tests to a common gold standard: A generalized linear mixed model approach 0. Bias Table 0.: Bias (multiplied by 00) for the differences of sensitivity and specificity on the [0, ]- scale. Abbreviations: Se=Difference of sensitivities, Sp=Difference of specificities, corr=correlation between Se, Sp, Se and Sp, SN=GLMM using GQ, SI=GLMM using PQL and the identity link, SL=GLMM using PQL and the logit link, CM=Cochrane model using GQ and the logit link, CA=Cochrane model using the PQL and the logit link Estimated model True Se True SN SI SL CM CA and Sp corr Se Sp Se Sp Se Sp Se Sp Se Sp 0%/0% none negative mixed %/-0% none negative mixed %/% none negative mixed %/0% none negative mixed %/-0% noen negative mixed %/% none negative mixed %/0% none negative mixed %/-0% none negative mixed %/% none negative mixed

25 Statistical Methods in Medical Research Page of Coverage Table 0.: Empirical coverage (in %) for the % confidence intervals for the differences of sensitivity and specificity on the [0, ]- scale. Abbreviations: Se=Difference of sensitivities, Sp=Difference of specificities, corr=correlation between Se, Sp, Se and Sp, SN=GLMM using GQ, SI=GLMM using PQL and the identity link, SL=GLMM using PQL and the logit link, CM=Cochrane model using GQ and the logit link, CA=Cochrane model using the PQL and the logit link Estimated model True Se True SN SI SL CM CA and Sp corr Se Sp Se Sp Se Sp Se Sp Se Sp 0%/0% none negative mixed %/-0% none negative mixed %/% none negative mixed %/0% none negative mixed %/-0% none negative mixed %/% none negative mixed %/0% none negative mixed %/-0% none negative mixed %/% none negative mixed

26 Page of 0 Statistical Methods in Medical Research Number of converged runs Table 0.: Number of converged runs from 000 simulation runs. Abbreviations: Se=Difference of sensitivities, Sp=Difference of specificities, corr=correlation between Se, Sp, Se and Sp, SN=GLMM using GQ, SI=GLMM using PQL and the identity link, SL=GLMM using PQL and the logit link, CM=Cochrane model using GQ and the logit link, CA=Cochrane model using the PQL and the logit link Estimated model True Se True SN SI SL CM CA and Sp corr Se Sp Se Sp Se Sp Se Sp Se Sp 0%/0% none 0 0 negative mixed 0%/-0% none 0 0 negative 0 0 mixed 0 0 0%/% none negative 0 0 mixed %/0% none negative mixed -0%/-0% none 0 0 negative mixed %/% none 0 0 negative 0 0 mixed %/0% none negative 0 0 mixed %/-0% none negative 0 0 mixed %/% none negative mixed 0 0

27 Statistical Methods in Medical Research Page of Supporting Web Materials for Hoyer and Kuss: Meta-analysis for the comparison of two diagnostic tests to a common gold standard: A generalized linear mixed model approach In this supplementary material the SAS code to fit the generalized linear mixed model for the diabetes data set is given. * HbAc and FPG data set; * First test: HbAc, second test: FPG; DATA diabetes; INPUT study tp fn fp tn tp fn fp tn; s=tp+fn; h=tn+fp; s=tp+fn; h=tn+fp; DATALINES;

28 Page of 0 Statistical Methods in Medical Research ;RUN; * Quadruplicate the data set; DATA glimmix; SET diabetes; DO temp= TO ; OUTPUT;END; RUN; * Assign the corresponding outcome; DATA glimmix; SET glimmix; IF temp= THEN DO; test=; outcome="sens"; outcomenum=0; outcomenum0=; outcomenum=0; outcomenum=0; outcomenum=0; num=tp; den=s; END; IF temp= THEN DO; test=; outcome="spec"; outcomenum=; outcomenum0=0; outcomenum=; outcomenum=0; outcomenum=0; num=tn; den=h; END; IF temp= THEN DO; test=; outcome="sens"; outcomenum=; outcomenum0=0; outcomenum=0; outcomenum=; outcomenum=0; num=tp; den=s; END; IF temp= THEN DO; test=; outcome="spec"; outcomenum=; outcomenum0=0; outcomenum=0; outcomenum=0; outcomenum=; num=tn; den=h; END; RUN; PROC GLIMMIX DATA=glimmix METHOD=rspl MAXOPT=000; CLASS study outcomenum outcome;

29 Statistical Methods in Medical Research Page of MODEL num/den=outcomenum / NOINT DIST=binomial LINK=logit SOLUTION; RANDOM outcomenum / SUBJECT=study TYPE=un; ESTIMATE "Sensitivity, HbAc" outcomenum 0 0 0/ ILINK CL DF=0000; ESTIMATE "Specificity, HbAc" outcomenum 0 0 0/ ILINK CL DF=0000; ESTIMATE "Sensitivity, FPG" outcomenum 0 0 0/ ILINK CL DF=0000; ESTIMATE "Specificity, FPG" outcomenum / ILINK CL DF=0000; ESTIMATE "Difference of Sensitivities" outcomenum 0-0/ ILINK CL; ESTIMATE "Difference of Specificities" outcomenum 0 0 -/ ILINK CL; ODS OUTPUT Estimates=MuEstimates(keep=LABEL Mu); ODS OUTPUT Estimates=StdErrMuEstimates(keep=LABEL StdErrMu); ODS OUTPUT Estimates=PDiff(keep=LABEL Probt); ODS OUTPUT Estimates=glimmixestimates(drop=Estimate Statement DF tvalue Probt StdErr Alpha Lower Upper rename=(mu=estimate LowerMu=KILower UpperMu=KIUpper StdErrMu=SE)); NLOPTIONS TECH=newrap MAXITER=000; RUN; * Calculate % confidence intervals for the estimated differences on the original [0,]-scale; PROC TRANSPOSE DATA=MuEstimates(where=(Label in ("Sensitivity, HbAc", "Specificity, HbAc", "Sensitivity, FPG", "Specificity, FPG"))) OUT=TransMuEstimates(rename=(COL=Sens COL=Spec COL=Sens COL=Spec) drop=_name LABEL_); RUN; PROC TRANSPOSE DATA=StdErrMuEstimates(where=(Label in ("Sensitivity, HbAc", "Specificity, HbAc", "Sensitivity, FPG", "Specificity, FPG"))) OUT=TransStdErrMuEstimates(rename=(COL=SE_Sens COL=SE_Spec COL=SE_Sens COL=SE_Spec) drop=_name LABEL_); RUN; PROC TRANSPOSE DATA=PDiff(where=(Label in ("Difference of Sensitivities", "Difference of Specificities"))) OUT=TransPDiff(rename=(COL=PValue_DiffSens COL=PValue_DiffSpec)

Nonparametric meta-analysis for diagnostic accuracy studies Antonia Zapf

Nonparametric meta-analysis for diagnostic accuracy studies Antonia Zapf joint work with A. Hoyer, K. Kramer, and O. Kuss Table of contents Motivation Nonparametric approach Simulation study Application