UNIVERSITY OF CALGARY. Measuring Observer Agreement on Categorical Data. Andrea Soo A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES


UNIVERSITY OF CALGARY

Measuring Observer Agreement on Categorical Data

by

Andrea Soo

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

GRADUATE PROGRAM IN COMMUNITY HEALTH SCIENCES

CALGARY, ALBERTA

April, 2015

© Andrea Soo 2015

Abstract

In order for a patient to receive proper and appropriate health care, one requires error-free assessment of clinical measurements. For example, a diagnostic test that classifies an individual as having or not having a disease needs to produce accurate and reliable results in order to ensure that an individual who needs treatment receives the correct therapy. Agreement and reliability studies aim to evaluate the accuracy and consistency of diagnostic tests or measurement tools. A model developed by Shoukri and Donner allows for the concurrent assessment of inter-rater (between-rater) agreement and intra-rater (within-rater) reliability by incorporating two measurements per rater per subject. The main purpose of this research was to develop methods for the maximum likelihood (ML) approach using the Shoukri-Donner model and to compare those methods to the method of moments (MM) approach using Monte Carlo computer simulation studies. Little difference between ML and MM was observed in point estimation. In general, the MM Wald test and MM confidence interval (CI) performed better than any of the other methods. In fact, the goodness of fit (GOF) test and GOF CI (for both ML and MM) were shown to have high empirical type I errors and low coverage levels, respectively, for the inter-rater agreement parameter in some parameter combinations for the 3 parameter case and in all considered parameter combinations for the 4 parameter case. The poor performance of the GOF approach requires further investigation before it could be recommended as a better alternative to the MM approach, and it does not appear that the ML approach is necessarily better than the MM approach. Lastly, extending this research to a more general 5 parameter model requires the resolution of several issues before it can be evaluated in point estimation, hypothesis testing, and CI construction.

Acknowledgements

I would like to thank my supervisors, Dr. Misha Eliasziw and Dr. Gordon Fick, for their guidance and mentorship during my studies. I'd also like to thank Dr. Fick for the many invaluable breakthrough sessions and discoveries we had leading up to the completion of this thesis. I would like to thank my committee members, Dr. Brenda Hemmelgarn and Dr. Alberto Nettel-Aguirre, for providing their invaluable feedback and advice. I would also like to thank Dr. Xuewen Lu and Dr. Julie Horrocks for participating in my examination and providing me with additional suggestions that improved this thesis. I am grateful for receiving an Alberta Innovates - Health Solutions (AIHS) PhD studentship which supported this research. Lastly, I would like to thank my family and friends for their support and encouragement throughout this process. I could not have done this without you. Especially to my dad, mom and brother, thank you for your patience these past few years and for supporting me in more ways than you can imagine. And to Nathan, thank you for your patience, encouragement, and believing in me.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures and Illustrations
List of Symbols, Abbreviations and Nomenclature

1 Introduction to Agreement and Reliability
   Association
   Reliability and Agreement
   Concurrent Assessment of Intra-rater Reliability and Inter-rater Agreement
   Example: Percentage of Diffusion-Perfusion Mismatch
   Research Objectives

2 Measuring Agreement and Reliability
   Quantitative Measures
      Concordance Correlation Coefficient
      Intraclass Correlation Coefficient
   Categorical Measures
   Adjusted Measures
      Chance-Corrected Measures
      Other Measures
   Models for Reliability and Agreement
      Log-Linear and Latent Class Models
      Common Correlation Model

3 Shoukri and Donner Model
   The 4 Parameter Shoukri and Donner Model
   The 3 Parameter Shoukri and Donner Model
   Method of Moments Point Estimation
   Maximum Likelihood Point Estimation
   Confidence Intervals and Hypothesis Testing
      Method of Moments Variance Estimation
      Maximum Likelihood Variance Estimation
      Wald-Type Confidence Intervals
      Goodness of Fit Approach for Confidence Intervals
      Profile Likelihood Confidence Intervals
      Wald-Type Hypothesis Testing
      Goodness of Fit Approach for Hypothesis Testing
   Simulation Studies
      Evaluation
      Generating the Data
      Point Estimates
      Hypothesis Testing
      Confidence Intervals
   Monte Carlo Simulation Results For the 3 Parameter Model
      Point Estimates
      Hypothesis Testing
      Confidence Intervals
   Monte Carlo Simulation Results For the 4 Parameter Model
      Point Estimates
      Hypothesis Testing

      3.8.3 Confidence Intervals
   Special Cases
      Relative Log-Likelihood
      Example of Perfect Reliability for One of the Raters

4 Extended Shoukri-Donner Model - Beginning Steps
   Sarmanov Distribution
   The 5 Parameter Shoukri and Donner Model
   Method of Moments Point Estimation
   Method of Moments Variance Estimation

5 Application to Data
   Example 1: Intra-rater Reliability and Inter-rater Agreement of Assessing Percentage of Diffusion-Perfusion Mismatch
   Example 2: Intra-rater Reliability and Inter-rater Agreement of Assessing Presence of Dysplasia
   Example 3: Intra-rater Reliability and Inter-rater Agreement of Assessing Mammograms
   Example 4: Intra-rater Reliability and Inter-rater Agreement of Lumen Narrowing

6 Conclusions and Recommendations
   Major Findings and Discussion
      Shoukri and Donner Model for 3 and 4 Parameters
      Special Cases
      Shoukri and Donner Model for 5 Parameters
   Recommendations
   Conclusion and Future Directions

Bibliography

List of Tables

1.1 Basic 2x2 contingency table
1.2 2x2 table for diagnostic tests
1.3 Example of high disagreement and high association
1.4 Binary data for 13 subjects being rated by 2 raters twice
1.5 Mismatch data at first time point
1.6 Mismatch data at second time point
2.1 Strength of agreement benchmarks for CCC
2.2 Estimate of the intraclass correlation for 2 raters each rating n subjects
2.3 Basic 2x2 table
2.4 Strength of agreement benchmarks for kappa
2.5 Common correlation model
3.1 Study design for 2 raters each rating n subjects twice
3.2 Joint probability distribution of the sum of ratings from rater 1 (X_i1) and rater 2 (X_i2)
3.3 Joint frequency table of the sum of the ratings from rater 1 (X_i1) and rater 2 (X_i2)
3.4 Method of moments estimates for the 3 parameter Shoukri and Donner model
3.5 Method of moments estimates for the 4 parameter Shoukri and Donner model
3.6 Category groupings for goodness of fit test for 3 parameter model
3.7 Category groupings for goodness of fit test for 4 parameter model
3.8 Observed and expected frequencies for goodness of fit test for mammogram data
3.9 Joint frequency table of the sum of the ratings from rater 1 (X_i1) and rater 2 (X_i2) for n_01 = n_11 = n_21 = 0
3.10 Parameter estimates for data in Table 3.9
4.1 Method of moments estimates for the 5 parameter Shoukri and Donner model
5.1 Joint frequency table of the sum of the ratings from rater 1 (X_i1) and rater 2 (X_i2) for mismatch data
5.2 Parameter estimates for mismatch data
5.3 Confidence intervals for mismatch data
5.4 P-values for mismatch data
5.5 Joint frequency table of the sum of the ratings from pathologist 1 (X_i1) and pathologist 2 (X_i2) for dysplasia data
5.6 Parameter estimates for dysplasia data
5.7 Confidence intervals for dysplasia data
5.8 P-values for dysplasia data

5.9 Joint frequency table of the sum of the ratings from radiologist 1 (X_i1) and radiologist 2 (X_i2) for mammogram data
5.10 Parameter estimates for mammogram data
5.11 Confidence intervals for mammogram data
5.12 P-values for mammogram data
5.13 Joint frequency table of the sum of the ratings from rater 1 (X_i1) and rater 2 (X_i2) for lumen narrowing data
5.14 Parameter estimates for lumen narrowing data
5.15 Confidence intervals for lumen narrowing data
5.16 P-values for lumen narrowing data

List of Figures and Illustrations

3.1 General example of a log profile likelihood depending on parameter θ
3.2 Empirical bias when π =
3.3 Empirical bias when π =
3.4 Empirical bias when π =
3.5 Empirical bias when π =
3.6 Empirical bias when π =
3.7 Empirical bias when π =
3.8 Distribution of deviation from the true value for ρ_w and ρ_b when π =
3.9 Distribution of deviation from the true value for ρ_w and ρ_b when π =
3.10 Distribution of deviation from the true value for ρ_w and ρ_b when π =
3.11 Plot of MME vs MLE for n = 75 subjects when π =
3.12 Plot of MME vs MLE for n = 75 subjects when π =
3.13 Plot of MME vs MLE for n = 75 subjects when π =
3.14 Empirical RMSE when π =
3.15 Empirical RMSE when π =
3.16 Empirical RMSE when π =
3.17 Empirical RMSE when π =
3.18 Empirical RMSE when π =
3.19 Empirical RMSE when π =
3.20 Empirical type I error levels when π =
3.21 Empirical type I error levels when π =
3.22 Empirical type I error levels when π =
3.23 Empirical type I error levels when π =
3.24 Empirical type I error levels when π =
3.25 Empirical type I error levels when π =
3.26 Empirical type I error levels when π =
3.27 Empirical type I error levels when π =
3.28 Empirical type I error levels when π =
3.29 Empirical coverage levels when π =
3.30 Empirical coverage levels when π =
3.31 Empirical coverage levels when π =
3.32 Empirical coverage levels when π =
3.33 Empirical coverage levels when π =
3.34 Empirical coverage levels when π =
3.35 Empirical coverage levels when π =
3.36 Empirical coverage levels when π =
3.37 Empirical coverage levels when π =
3.38 Empirical median widths when π =
3.39 Empirical median widths when π =
3.40 Empirical median widths when π =
3.41 Empirical median widths when π =
3.42 Empirical median widths when π =

3.43 Empirical median widths when π =
3.44 Empirical median widths when π =
3.45 Empirical median widths when π =
3.46 Empirical median widths when π =
3.47 Empirical bias when π = 0.1 for the 4 parameter model
3.48 Empirical bias when π = 0.1 for the 4 parameter model
3.49 Empirical bias when π = 0.3 for the 4 parameter model
3.50 Empirical bias when π = 0.3 for the 4 parameter model
3.51 Empirical RMSE when π = 0.1 for the 4 parameter model
3.52 Empirical RMSE when π = 0.1 for the 4 parameter model
3.53 Empirical RMSE when π = 0.3 for the 4 parameter model
3.54 Empirical RMSE when π = 0.3 for the 4 parameter model
3.55 Empirical RMSE when π = 0.5 for the 4 parameter model
3.56 Empirical RMSE when π = 0.5 for the 4 parameter model
3.57 Empirical type I error levels when π = 0.1 for the 4 parameter model
3.58 Empirical type I error levels when π = 0.1 for the 4 parameter model
3.59 Empirical type I error levels when π = 0.3 for the 4 parameter model
3.60 Empirical type I error levels when π = 0.3 for the 4 parameter model
3.61 Empirical type I error levels when π = 0.5 for the 4 parameter model
3.62 Empirical type I error levels when π = 0.5 for the 4 parameter model
3.63 Empirical coverage levels when π = 0.1 for the 4 parameter model
3.64 Empirical coverage levels when π = 0.1 for the 4 parameter model
3.65 Empirical coverage levels when π = 0.3 for the 4 parameter model
3.66 Empirical coverage levels when π = 0.3 for the 4 parameter model
3.67 Empirical coverage levels when π = 0.5 for the 4 parameter model
3.68 Empirical coverage levels when π = 0.5 for the 4 parameter model
3.69 Special case example: Contour plots for n_01 = n_11 = n_21 = 0 (perfect reliability for rater 2)
3.70 Special case example: ρ_c2 vs partial derivative of log-likelihood for varying values of ρ_c
3.71 Special case example: ρ_c2 vs partial derivative of log-likelihood for varying values of π
3.72 Special case example: ρ_c2 vs partial derivative of log-likelihood for varying values of ρ_b
3.73 Profile log-likelihood graphs of ρ_b, ρ_w1 and ρ_w2 for n_01 = n_11 = n_21 = 0
3.74 Profile log-likelihood graphs for mismatch data

List of Symbols, Abbreviations and Nomenclature

Abbreviation   Definition
CCC    Concordance Correlation Coefficient
CDF    Cumulative Distribution Function
CI     Confidence Interval
CIA    Coefficient of Individual Agreement
CIE    Coefficient of Individual Equivalence
DWI    Diffusion-Weighted MRI
FN     False Negative
FP     False Positive
GOF    Goodness of Fit
ICC    Intraclass Correlation Coefficient
LRT    Likelihood Ratio Test
ML     Maximum Likelihood
MLE    Maximum Likelihood Estimate
MM     Method of Moments
MME    Method of Moments Estimate
MR     Magnetic Resonance
MRI    Magnetic Resonance Imaging
MSD    Mean Squared Deviation
MSE    Mean Square Error
MSR    Mean Square Rater
MSS    Mean Square Subject
NPV    Negative Predictive Value
OR     Odds Ratio

PABAK  Prevalence Adjusted Bias Adjusted Kappa
PL     Profile Likelihood
PPV    Positive Predictive Value
PWI    Perfusion-Weighted MRI
RD     Risk Difference
RMSE   Root Mean Square Error
rmtt   Relative Mean Transit Time
RR     Risk Ratio
TN     True Negative
TP     True Positive

Chapter 1
Introduction to Agreement and Reliability

In order for a patient to receive proper and appropriate health care, one requires error-free assessment of clinical measurements. These clinical measurements are used to determine the appropriate course of action, such as whether or not a patient should receive a treatment or therapy. A diagnostic test that classifies an individual as having or not having a disease needs to produce accurate (close or identical to the true value) and reliable (consistent) results in order to ensure that an individual who needs treatment receives the correct therapy. These results need to be the same whether they are recorded by one specific physician at different visits on the same patient or by two different physicians on the same patient. One example of the need for agreement and reliability studies is the introduction of a new diagnostic test or measurement tool that may be preferred over a gold standard or older test for practical reasons such as cost, efficiency, and invasiveness. Another example is comparing the agreement between administrative data and chart data: researchers often use administrative data to determine the presence or absence of a medical condition instead of examining patient charts, which is much more time-consuming. However, it is necessary to determine the performance of the new diagnostic test, measurement tool, or administrative data before implementation by quantifying the level of accuracy and reliability. Agreement and reliability studies are investigations in which diagnostic tests or measurement tools are evaluated to determine their accuracy and reliability, respectively. The terms agreement and reliability are part of the broader category of association. Thus, when we have reliability and/or agreement between variables or measurements, we also have association. However, when there is an association between variables or measurements, we do not necessarily have agreement and/or reliability. Differences between association, agreement and reliability will be further discussed below.

1.1 Association

There are many ways to analyze data from a basic 2x2 study. In a 2x2 study, association measures such as the odds ratio, sensitivity, specificity, positive predictive value, negative predictive value, risk ratios, and risk differences [5, 6, 12, 20, 26, 33, 53, 62] are used to describe the relationship between two variables that have two categories each. For example, one can describe the relationship between an exposure (e.g. smoking status) and a disease (e.g. whether or not an individual develops heart disease). Consider Table 1.1, a basic 2x2 contingency table that summarizes the disease and exposure results collected by this 2x2 study on a total of n subjects. The frequency of each cell is shown. We define the association measures mentioned above in terms of these frequencies.

                         Disease
                    Yes       No        Total
Exposure   Yes       a         b        a + b
           No        c         d        c + d
Total              a + c     b + d     n = a + b + c + d

Table 1.1: Basic 2x2 contingency table

The first measure we discuss is the odds ratio [12, 20]. The estimated odds of exposed individuals developing the disease are odds_1 = a/b, while the estimated odds of unexposed individuals developing the disease are odds_2 = c/d.

Thus, the odds ratio (OR) is estimated as the ratio of these two odds [62]:

    OR = odds_1 / odds_2 = (a/b) / (c/d) = ad / (bc).

An odds ratio can range from 0 to ∞ [12]. If there is no difference between the odds of developing the disease for the two exposure statuses, the estimated OR will be 1. An OR of 1 indicates no association or relationship between disease and exposure; the further the deviation from 1, the stronger the association or relationship. If the OR is 2, for example, then the odds of developing the disease for exposed individuals are 2 times the odds of developing the disease for unexposed individuals. An OR < 1 indicates that exposed individuals have lower odds, while an OR > 1 indicates that exposed individuals have higher odds.

Next we discuss the risk ratio [12, 62]. The estimated risk (or proportion) of exposed individuals (e.g. smokers) developing the disease is

    risk_1 = a / (a + b),

while the estimated risk of unexposed individuals (e.g. non-smokers) developing the disease is

    risk_2 = c / (c + d).

The risk ratio (RR), or relative risk, is defined as the ratio of the proportion of exposed individuals developing the disease to the proportion of unexposed individuals developing the disease [62]:

    RR = risk_1 / risk_2 = [a / (a + b)] / [c / (c + d)].

It can also range from 0 to ∞. Similar to an OR, an RR of 1 indicates no association or relationship between disease and exposure. An RR < 1 indicates that exposed individuals have a lower risk, while an RR > 1 indicates that exposed individuals have a higher risk.

The risk difference (RD) is defined as the difference between the proportions:

    RD = risk_1 − risk_2 = a / (a + b) − c / (c + d).
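These measures are simple arithmetic on the four cell counts of Table 1.1. A minimal sketch in Python (the function name and the illustrative counts are ours, not from the thesis):

```python
def association_measures(a, b, c, d):
    """Association measures for a 2x2 table laid out as in Table 1.1:
    a = exposed with disease, b = exposed without disease,
    c = unexposed with disease, d = unexposed without disease."""
    odds_ratio = (a * d) / (b * c)   # OR = (a/b) / (c/d) = ad/(bc)
    risk1 = a / (a + b)              # risk of disease among the exposed
    risk2 = c / (c + d)              # risk of disease among the unexposed
    return odds_ratio, risk1 / risk2, risk1 - risk2  # OR, RR, RD

# Hypothetical counts: 30 of 100 smokers and 10 of 100 non-smokers develop disease.
print(association_measures(30, 70, 10, 90))  # OR ≈ 3.86, RR ≈ 3.0, RD ≈ 0.2
```

Here the OR, RR, and RD all indicate a positive association between exposure and disease, each deviating from its respective null value (1, 1, and 0).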

A risk difference can range from -1 to 1. A risk difference of 0 indicates no association or relationship between disease and exposure; the further the deviation from 0, the stronger the association or relationship.

Sensitivity and specificity [5, 53] are commonly used to assess the accuracy of a new test for diagnosing the presence or absence of a disease with respect to the true disease status, as determined by a traditionally used test or a test accepted as the gold standard. They can be estimated from the 2x2 table below (Table 1.2), which is Table 1.1 with a = TP, b = FP, c = FN, and d = TN.

                            Gold standard
                        Yes         No         Total
Test   Positive (+)     TP          FP         TP + FP
       Negative (-)     FN          TN         FN + TN
Total                 TP + FN     FP + TN      n = TP + FP + FN + TN

TP = true positive, FP = false positive, FN = false negative, TN = true negative

Table 1.2: 2x2 table for diagnostic tests

Sensitivity is the probability that the new test indicates presence of the disease when the gold standard indicates that it is present, and can be estimated as

    sensitivity = P(+ | disease) = TP / (TP + FN),

where TP = true positives and FN = false negatives. Specificity is the probability that the new test indicates absence of the disease when the gold standard indicates that it is absent, and can be estimated as

    specificity = P(- | no disease) = TN / (TN + FP),

where TN = true negatives and FP = false positives.

Positive and negative predictive values are similarly used in assessing the accuracy of a new test with respect to a gold standard [6]. The positive predictive value (PPV)

is the probability of the presence of disease given a positive test result, and the negative predictive value (NPV) is the probability of the absence of disease given a negative test result. They are estimated as

    PPV = P(disease | +) = TP / (TP + FP)

and

    NPV = P(no disease | -) = TN / (TN + FN).

Sensitivity, specificity, and the positive and negative predictive values are all proportions, so their values range from 0 to 1. The closer these values are to 1, the better the test. Sensitivity and specificity are both properties of the test itself; positive and negative predictive values, however, also depend on the prevalence of disease.

1.2 Reliability and Agreement

The above association measures should not be used to quantify agreement or reliability. A reliable measurement gives consistent results every time it is repeated under the same conditions, for example on the same subject. If a measurement is reliable, subjects can be distinguished or differentiated from each other despite measurement errors [21]. Reliability is typically defined as the ratio of the variability between ratings of the same subjects (i.e. by different raters or at different times) to the total variability of all ratings [21, 31], i.e.:

    variability between ratings of the same subjects / (variability between ratings of the same subjects + measurement error).    (1.1)

From this, we can see that reliability estimates will be low when there is little variability among the ratings obtained from the measurement instrument under evaluation, which happens when the range of ratings is small (little or no variability between ratings of the same subjects) or when prevalence is very high or very low [31]. Agreement measures how close or accurate the results of repeated measurements are, or the degree to which scores or ratings are identical [21, 32]. In this thesis, we will focus on inter-rater agreement and intra-rater reliability. Intra-rater (within-rater) reliability is the degree to which a measurement instrument is able to distinguish or differentiate among subjects when the same rater repeats the measurement two or more times [21, 32]. Here, we will consider two measurements (i.e. two time points) per rater. Perfect intra-rater reliability will occur when a rater gives the same rating at both time points on all subjects. Inter-rater agreement is the degree to which two or more raters achieve identical or congruent results under similar assessment conditions [32]. Perfect agreement between raters occurs when both raters yield the exact same rating on all subjects. High association, however, does not necessarily imply high agreement: raters could have perfect (or high) disagreement, which would still mean perfect (or high) association. Table 1.3 shows an example of a 2x2 study in which the raters have high disagreement, but still have high association. The 2 raters agree on only 45 of the 200 subjects: they agree that 25 subjects have the disease and that 20 subjects do not have the disease, while they disagree on the remaining 155 subjects. To illustrate, we introduce the simplest measure of agreement, the observed proportion of agreement p_0, which can be calculated as p_0 = (a + d)/n. Further measures of agreement will be discussed in the next chapter.

                              Rater 1
                     Disease    No disease    Total
Rater 2   Disease       25          75         100
       No disease       80          20         100
Total                  105          95        n = 200

Table 1.3: Example of high disagreement and high association
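The contrast between association and agreement can be checked numerically. A short sketch (function names are ours) using the Table 1.3 cell counts a = 25, b = 75, c = 80, d = 20:

```python
def odds_ratio(a, b, c, d):
    # OR = (a/b) / (c/d); a value far from 1 indicates strong association
    return (a * d) / (b * c)

def proportion_agreement(a, d, n):
    # simplest agreement measure: observed proportion of agreement p0 = (a + d)/n
    return (a + d) / n

a, b, c, d = 25, 75, 80, 20                       # Table 1.3 cell counts
print(odds_ratio(a, b, c, d))                     # ≈ 0.083: far from 1, strong association
print(proportion_agreement(a, d, a + b + c + d))  # 0.225: low agreement
```

The same table thus yields strong association but low agreement, which is exactly the point of the example.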

From the data in Table 1.3, we get

    OR = (25/75) / (80/20) = 0.083

and

    RR = (25/100) / (80/100) = 0.3125.

Since these are far from an OR or RR of 1, this suggests high association between the ratings of rater 1 and the ratings of rater 2 (an OR or RR equal to 1 would indicate no association). However, the observed proportion of agreement is p_0 = 45/200 = 0.225, which is relatively low agreement (or high disagreement). This illustrates that association measures cannot necessarily be used as measures of agreement or reliability.

1.3 Concurrent Assessment of Intra-rater Reliability and Inter-rater Agreement

Clustered binary data occur frequently in medical research. For example, in ophthalmology, pairs of eyes are evaluated through diagnostic procedures for the presence of certain diseases. Another example occurs in twin studies, where one may evaluate whether a trait of interest has an inherent genetic and/or environmental component by assessing or measuring the trait in twins. In this setting, standard methods such as logistic regression ignore the correlation between outcomes by assuming independence between individual observations [46]. However, an individual's right eye will tend to be more highly correlated with that individual's left eye than with other individuals' eyes. Similarly, in agreement or reliability studies, a rating from rater 1 on an individual will tend to be more highly correlated with a rating from rater 2 on the same individual than with other individuals' ratings. Thus, these standard methods are inadequate, and significant research has been done on the analysis of correlated binary data. In agreement and reliability studies, researchers typically use only half of the data to evaluate inter-rater agreement and the other half to evaluate intra-rater reliability, which results in less efficient estimates [7]. However, a model developed by Shoukri and Donner [50, 51, 52] allows for the concurrent assessment

of inter-rater agreement and intra-rater reliability by incorporating multiple measurements per rater per subject (which also increases the available sample size for estimating inter-rater agreement). This may allow fewer patients to be enrolled in a study, and thus reduce enrolment costs, when the cost of recruiting additional subjects is much higher than the cost of an additional measurement on existing subjects. Shoukri and Donner [50] showed that the gain in precision obtained from increasing the number of measurements per rater from 1 to 2 may allow fewer subjects to be recruited into a study with no net loss in efficiency for estimating inter-rater agreement. Slater [55] assessed the properties of the method of moments estimates from the Shoukri and Donner model [50] for some 3 parameter combinations (inter-rater agreement, intra-rater reliability, probability of a positive rating) and compared them to estimates from analysis of variance. Slater found that the estimates were negatively biased and appeared to be consistent. Slater also extended the goodness-of-fit approach used in the common correlation model [22] to the Shoukri and Donner model [50] and compared its inference properties (type I error rates) to those of a Wald test using a large-sample variance estimate. In general, the goodness-of-fit test provided type I error rates consistently closer to the nominal rates than those from the Wald test, with a few exceptions in which empirical type I errors were too high (i.e. more than 2.5% from α = 0.05). Properties of maximum likelihood estimation were not assessed. In the case of continuous data, method of moments estimation and maximum likelihood estimation have been developed, evaluated, and compared for the concurrent assessment of intra-rater reliability and inter-rater agreement [7].
For point estimation, Amuah [7] found that method of moments and maximum likelihood estimates perform adequately, with negligible empirical bias and comparable, moderate spread. Estimates from both methods of inter-rater agreement and intra-rater

reliability are root mean square error consistent. For hypothesis testing, empirical type I errors were comparable and within nominal levels (α = 5%). The methods used for confidence interval construction were also comparable, with coverage within nominal levels (95%). It was recommended that any of the methods could be used, unless convergence problems were encountered. The corresponding methods for binary data need to be further evaluated, including a comparison of method of moments estimation to maximum likelihood estimation, as well as an expansion of the goodness-of-fit approach for the agreement parameter.

1.4 Example: Percentage of Diffusion-Perfusion Mismatch

One example of data where subjects are assessed by more than one rater twice comes from a study assessing the intra-rater reliability and inter-rater agreement of the percentage of diffusion-perfusion mismatch [7, 18, 55]. Magnetic resonance (MR) images for 13 patients with acute strokes were independently assessed twice by 6 raters. The raw data are available and presented in Amuah's thesis [7]. The mismatch data are percentages (continuous data), but can be dichotomized at cutoff points of 10% or 20%, as was done in the study's further analyses [18]. Data from 2 of the raters will be used, since the focus of this thesis is 2 raters rating n subjects twice. This study will be further described in Chapter 5. If the cutoff point of 10% is chosen, the continuous data obtained from the study [7, 18] for the 2 raters yield the binary (dichotomized) data presented in Table 1.4: subjects with mismatch percentages greater than 10% are classified as having mismatch (1), and those with a mismatch percentage of 10% or less are classified as having no mismatch (0). One could analyze the data separately at each time point or by each rater, depending on the question one is trying to answer. For example, analyzing the data at each time point allows one to examine inter-rater

[Table 1.4: Binary data for 13 subjects being rated by 2 raters twice]

agreement. However, one would typically have data from only one time point to use for determining agreement between raters. With only the data from time point 1, one would have the results presented in Table 1.5; with only the data from time point 2, the results in Table 1.6. Using equation 2.25 from Section 2.3.2, where the data are assumed to follow a common correlation model (CCM), the data from time point 1 give an agreement estimate of 1, indicating perfect agreement. The data from time point 2 give an agreement estimate of 0.567, which indicates moderate agreement between raters (see Table 2.4, which provides some guidelines on benchmarks for agreement values). The two agreement estimates are fairly different, which shows that this study could potentially benefit from more subjects being enrolled, or from 2 measurements per subject, in which case the Shoukri and Donner model would allow concurrent estimation of inter-rater agreement and intra-rater reliability.

                           Rater 2
                 Mismatch (1)   No mismatch (0)   Total
Rater 1
   Mismatch (1)        9               0             9
No mismatch (0)        0               4             4
Total                  9               4          n = 13

Table 1.5: Mismatch data at first time point

[Table 1.6: Mismatch data at second time point (rater 2 totals: 11 mismatch, 2 no mismatch; n = 13)]
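For intuition, the time-point-1 estimate of 1 can be reproduced with a generic chance-corrected agreement statistic for a 2x2 table. This sketch uses the Bloch-Kraemer closed form of the intraclass kappa, which is not necessarily identical to the CCM estimator in equation 2.25; the cell counts are those implied by the perfect agreement at time point 1 (9 mismatch, 4 no mismatch, no disagreements):

```python
def intraclass_kappa(a, b, c, d):
    """Intraclass kappa for a 2x2 agreement table (Bloch-Kraemer form),
    where a and d are the agreeing cells and b, c the disagreeing cells."""
    num = 4 * (a * d - b * c) - (b - c) ** 2
    den = (2 * a + b + c) * (2 * d + b + c)
    return num / den

# Table 1.5 (time point 1): both raters agree on all 13 subjects
print(intraclass_kappa(9, 0, 0, 4))  # 1.0: perfect agreement
```

With any off-diagonal disagreement the statistic drops below 1, reaching -1 when the raters disagree on every subject.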

1.5 Research Objectives

The objectives of this thesis are as follows:

1. In a 2x2 study (2 raters classifying n subjects into one of two categories), evaluate the methodology for the concurrent assessment of intra-rater reliability and inter-rater agreement in the 4 parameter case (assuming a common probability of a positive rating by the 2 raters) and its further reduction to the 3 parameter case (further assuming a common intra-rater reliability) by deriving: a) point estimates for intra-rater reliability and inter-rater agreement, b) hypothesis tests for the point estimates, and c) confidence intervals using maximum likelihood, and comparing these results to the method of moments approach.

2. Describe the beginning steps of the methodology required when estimates of intra-rater reliability and inter-rater agreement are on the boundary.

3. Extend the methodology for the concurrent assessment of intra-rater reliability and inter-rater agreement to the more general 5 parameter case (allowing a different probability of a positive rating for each rater).

4. Provide recommendations to the medical research community on the usage of these models when assessing intra-rater reliability and inter-rater agreement on binary data, without jeopardizing the statistical validity of the methods.

Chapter 2 will discuss summary measures used to assess reliability and agreement in the categorical case, with a brief discussion of the quantitative case; in addition, models used to assess reliability and agreement will be discussed. Chapter 3 will

discuss the 4 parameter Shoukri-Donner model and estimation from the model using method of moments estimation and maximum likelihood estimation, including derivation of point estimates, confidence intervals, and hypothesis tests. In addition, the methods used for the Monte Carlo simulation studies will be discussed, and results will be provided for the 4 parameter model. The 3 parameter model, which assumes a common intra-rater reliability, will also be discussed, and similar simulation study results will be provided. This chapter will also present special cases where the maximum likelihood procedure needs to be modified to obtain estimates within the range of possible values for agreement and reliability. Similarly, Chapter 4 will present the beginning steps and development of the 5 parameter Shoukri-Donner model. Chapter 5 will present an application of the methods to several data sets and, finally, Chapter 6 will provide a discussion of the major findings, recommendations, and future directions.

Chapter 2
Measuring Agreement and Reliability

The focus of this chapter will be the development of agreement and reliability measures for 2 raters rating n subjects each, with a brief introduction to 2 measures commonly used in the quantitative (or continuous) case.

2.1 Quantitative Measures

The two main measures of reliability and agreement in the case where the outcome is quantitative are the concordance correlation coefficient [38] and the intraclass correlation [13].

2.1.1 Concordance Correlation Coefficient

When agreement is defined as the difference of the ratings of different raters, the concordance correlation coefficient (CCC) \rho_C is used. This measure was introduced by Lin [38]. When there are two raters, it measures agreement by assessing the variation of the linear relationship from the 45 degree line through the origin (the concordance line) [53]. Let X_{ij} denote the rating for the ith subject by rater j (i = 1, ..., n; j = 1, 2) and assume that E(X_{ij}) = \mu_j and var(X_{ij}) = \sigma_j^2. Lin [38] defined the CCC as

    \rho_C = 1 - \frac{E[(X_{i1} - X_{i2})^2]}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2} = \frac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2},    (2.1)

where E[(X_{i1} - X_{i2})^2] is the expected squared perpendicular deviation from the 45 degree line through the origin and \sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2 is the expected squared perpendicular deviation from the 45 degree line through the origin when X_{i1} and X_{i2} are uncorrelated. Lin [38] suggested a moment estimate of \rho_C given by

    \hat{\rho}_C = \frac{2S_{12}}{S_1^2 + S_2^2 + (\bar{x}_1 - \bar{x}_2)^2},    (2.2)

where \bar{x}_j = (1/n) \sum_{i=1}^{n} x_{ij}, S_j^2 = (1/n) \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2, and S_{12} = (1/n) \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2). It is simple to use, and its estimate using the sample counterparts is consistent and asymptotically normal for bivariate normal data. The CCC ranges from -1 to 1: a CCC of -1 corresponds to perfect negative agreement, while a CCC of 1 corresponds to perfect positive agreement. McBride [41] proposed strength of agreement criteria for the absolute value of Lin's CCC, as in Table 2.1.

    CCC Estimate     Strength of Agreement
    < 0.90           Poor
    0.90 to 0.95     Moderate
    0.95 to 0.99     Substantial
    > 0.99           Almost perfect

    Table 2.1: Strength of agreement benchmarks for CCC

2.1.2 Intraclass Correlation Coefficient

The intraclass correlation \rho has emerged as a universal and widely accepted reliability index for continuous data [53]; it is the ratio of the covariance among repeated observations to the total observed variance. Table 2.2 summarizes the ways in which one can estimate the intraclass correlation (ICC) from a one-way or two-way analysis of variance [7, 53, 54]. The one-way random effects model is not usually used in the cases we are considering, as subjects are usually rated by the same set of raters. As with the CCC, the ICC ranges from -1 to 1.

    Model         One-way Random Effects       Two-way Mixed Effects         Two-way Random Effects
    Assumptions   No random rater effect;      Raters are fixed (only        Raters are a random sample
                  each subject is rated by     raters of interest); each     from a population of raters;
                  a different set of raters    subject is rated by each      each subject is rated by
                                               rater                         each rater
    Estimate      (MSS - MSR)/(MSS + MSR)      (MSS - MSE)/(MSS + MSE)       (MSS - MSE)/(MSS + MSE + 2(MSR - MSE)/n)

    MSS = between subject mean square (subject mean square)
    MSR = within subject mean square (rater mean square)
    MSE = mean square error

    Table 2.2: Estimate of the intraclass correlation for 2 raters each rating n subjects

The estimate of the intraclass kappa \hat{\kappa}_I can be obtained from the formula for the intraclass correlation using the one-way random effects model with the 0-1 (binary) data [53]. The intraclass kappa will be discussed further in a later section.

2.2 Categorical Measures

The development of measures of reliability and agreement for categorical data is well traced by Shoukri [53] and will be summarized below, with some additional measures also discussed. Consider Table 2.3, a basic 2x2 table similar to the 2x2 contingency table in Chapter 1 (Table 1.1) but with different notation. Here we consider the case where two raters classified n subjects into one of two categories: disease or no disease.

                            Rater 1
                    Disease    No disease    Total
    Rater 2
    Disease          n_11        n_10         n_1.
    No disease       n_01        n_00         n_0.
    Total            n_.1        n_.0         n

    Table 2.3: Basic 2x2 table

As mentioned in section 1.2, the observed proportion of agreement is p_0, where in the 2x2 case (using the notation from Table 2.3),

    p_0 = \frac{n_{00} + n_{11}}{n}.    (2.3)
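The moment estimator in equation (2.2) above is simple to compute directly. Below is a minimal sketch (not code from this thesis), using hypothetical ratings of six subjects by two raters; note the use of 1/n rather than 1/(n-1) in the variance and covariance terms, as in Lin [38]:

```python
def ccc(x1, x2):
    # Moment estimate of Lin's concordance correlation coefficient, eq. (2.2),
    # using 1/n (not 1/(n-1)) in S_j^2 and S_12, as in Lin [38].
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s1 = sum((a - m1) ** 2 for a in x1) / n
    s2 = sum((b - m2) ** 2 for b in x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / n
    return 2.0 * s12 / (s1 + s2 + (m1 - m2) ** 2)

# Hypothetical ratings of 6 subjects by two raters
rater1 = [4.0, 5.5, 6.1, 7.2, 8.0, 9.3]
rater2 = [4.2, 5.4, 6.5, 7.0, 8.3, 9.1]
print(ccc(rater1, rater2))
```

Identical rating vectors give a CCC of exactly 1 (perfect positive agreement), and any mean shift between the raters pulls the estimate below the Pearson correlation, which is the sense in which the CCC penalizes departures from the 45 degree concordance line.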

It is the proportion of the total number of subjects on which the raters are in agreement, and it assumes no guessing by the raters.

2.2.1 Adjusted Measures

Variations similar to p_0 are presented as adjusted indices of agreement [53]. The Jacquard coefficient is the proportion of (1, 1) matches (agreement as "positive") in a set of comparisons that ignores (0, 0) matches (agreement as "negative") and is estimated as

    \hat{J} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}.    (2.4)

Another measure, the G coefficient [28], is estimated as

    \hat{G} = \frac{(n_{00} + n_{11}) - (n_{10} + n_{01})}{n},    (2.5)

which indicates perfect agreement when \hat{G} = 1, occurring if n_{10} = n_{01} = 0, and perfect disagreement when \hat{G} = -1, occurring if n_{00} = n_{11} = 0. Another adjusted measure of agreement is based on the concordance ratio and is estimated as

    \hat{C} = \frac{2n_{11}}{2n_{11} + n_{10} + n_{01}} = 1 - \frac{n_{10} + n_{01}}{2n_{11} + n_{10} + n_{01}},    (2.6)

and like \hat{J}, it ignores the (0, 0) cell and gives twice the weight to the (1, 1) cell [53]. The above measures include agreement which occurs from guessing. The intraclass correlation is one of the most widely known measures of agreement in a 2x2 study and is estimated by

    \hat{\rho} = \frac{4n_{11}n_{00} - (n_{10} + n_{01})^2}{(2n_{11} + n_{10} + n_{01})(2n_{00} + n_{10} + n_{01})}.    (2.7)

The intraclass correlation will be discussed further when the common correlation model is introduced.
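Equations (2.3) through (2.7) can all be computed directly from the four cell counts of Table 2.3. A minimal sketch, with hypothetical counts (not data from this thesis):

```python
def adjusted_measures(n11, n10, n01, n00):
    # Agreement indices for a 2x2 table, in the notation of Table 2.3.
    n = n11 + n10 + n01 + n00
    p0 = (n11 + n00) / n                        # observed agreement, eq. (2.3)
    J = n11 / (n11 + n10 + n01)                 # Jacquard coefficient, eq. (2.4)
    G = ((n00 + n11) - (n10 + n01)) / n         # G coefficient, eq. (2.5)
    C = 2 * n11 / (2 * n11 + n10 + n01)         # concordance ratio, eq. (2.6)
    rho = (4 * n11 * n00 - (n10 + n01) ** 2) / (
        (2 * n11 + n10 + n01) * (2 * n00 + n10 + n01)
    )                                           # intraclass correlation, eq. (2.7)
    return {"p0": p0, "J": J, "G": G, "C": C, "rho": rho}

# Hypothetical table: n11 = 40, n10 = 5, n01 = 10, n00 = 45
print(adjusted_measures(40, 5, 10, 45))
```

With perfect agreement (n_{10} = n_{01} = 0), every index above equals 1; the example counts give p_0 = 0.85 while \hat{J}, \hat{G}, \hat{C}, and \hat{\rho} each discount the raw agreement differently.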

2.2.2 Chance-Corrected Measures

The observed proportion of agreement and the above measures do not correct for agreement by chance. If raters were to randomly assign their ratings, they would sometimes agree by chance [61]. For example, by chance alone, one would expect better agreement on a two-category measurement scale than on a five-category measurement scale. To correct for this bias in favor of measurement instruments with a smaller number of categories, Bennett proposed an index of consistency S [11], which takes into account the number of categories k. It is estimated as

    \hat{S} = \frac{k}{k-1}\left(p_0 - \frac{1}{k}\right),

or in the 2x2 study (k = 2),

    \hat{S} = 2\left(p_0 - \frac{1}{2}\right).    (2.8)

As the number of categories k increases, \hat{S} increases for a fixed p_0 (as defined previously). However, this measure assumes that all categories are equally likely to be used by the raters; in many cases, there may be only two or three categories that the raters use more often, or the subjects may only fall into two or three different categories.

In 1955, Scott [49] proposed an index of inter-coder agreement \pi_s, which can be estimated as

    \hat{\pi}_s = \frac{p_0 - p_e}{1 - p_e},    (2.9)

where, in the case of a 2x2 study,

    p_e = \left(\frac{(n_{0.} + n_{.0})/2}{n}\right)^2 + \left(\frac{(n_{1.} + n_{.1})/2}{n}\right)^2.

Here, p_0 is the observed proportion of agreement between the two raters, as defined previously, and p_e is the proportion of agreement expected by chance. Thus, \pi_s is the ratio of the actual difference between observed and chance agreement to the maximum difference between observed and chance agreement. This measure corrects for the number of categories as well as the frequency with which each category is used, by fixing the marginals at the observed proportion for which each category was used. It assumes that the distribution of proportions over the categories is known and is taken to be equal for all raters.

Cohen [16] criticized the assumption that the distribution of proportions over the categories is equal for all raters (i.e., that the marginal distributions of the raters are equal). He proposed Cohen's kappa, which takes a similar form to Scott's \pi_s; the difference lies in how p_e is defined. Cohen does not assume that the marginal distributions are equal (i.e., n_{1.} \neq n_{.1} and n_{0.} \neq n_{.0}, or p_{1.} \neq p_{.1} and p_{0.} \neq p_{.0}, in the notation of Table 2.3). An estimate of Cohen's kappa [16] is

    \hat{\kappa} = \frac{p_0 - p_e}{1 - p_e},    (2.10)

where, in the case of a 2x2 study,

    p_e = \left(\frac{n_{0.}}{n}\right)\left(\frac{n_{.0}}{n}\right) + \left(\frac{n_{1.}}{n}\right)\left(\frac{n_{.1}}{n}\right).

Landis and Koch [34] provided benchmarks for the kappa statistic (Table 2.4). These benchmarks differ from those proposed by McBride [41] in the continuous case (Table 2.1).

    Kappa Estimate    Strength of Agreement
    < 0.00            Poor (less than chance agreement)
    0.00 to 0.20      Slight
    0.21 to 0.40      Fair
    0.41 to 0.60      Moderate
    0.61 to 0.80      Substantial
    0.81 to 1.00      Almost perfect

    Table 2.4: Strength of agreement benchmarks for kappa

Cohen's kappa treats all disagreements equally. Thus, Cohen developed the weighted kappa [17], motivated by studies in which some disagreements between ratings are of greater gravity than others. The weighted kappa is a chance-corrected proportion of weighted agreement. The weights 0 \leq w_{ij} \leq 1 (agreement weights for i = 1, ..., k; j = 1, ..., k) are assigned on rational or clinical grounds to the k^2 cells. Exact agreement is given maximal weight (w_{ii} = 1), all disagreements are given less than maximal weight (0 \leq w_{ij} < 1 for i \neq j), and the two raters are considered symmetrically (w_{ij} = w_{ji}). The observed weighted proportion of agreement is

    p_{ow} = \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,\frac{n_{ij}}{n},

and the expected weighted proportion of agreement is

    p_{ew} = \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\left(\frac{n_{i.}}{n}\right)\left(\frac{n_{.j}}{n}\right).

Thus, an estimate of the overall weighted kappa is

    \hat{\kappa}_w = \frac{p_{ow} - p_{ew}}{1 - p_{ew}}.    (2.11)

The overall weighted kappa is equivalent to Cohen's kappa when w_{ii} = 1 for all i and w_{ij} = 0 for i \neq j.

Although Aickin's \alpha is model-based [2], it provides a summary measure of agreement. It is based on agreement for a cause and the assumption that only some units are subject to classification by chance. Part of the purpose of the constant predictive probability model is to separate the two sources, random agreement and agreement for a cause, both of which tend to be included in the proportion of agreement expected by chance p_e, even though agreement for a cause is not intended to be captured there. The model is a mixture of two populations, of which the first is difficult to classify (where observers will be more likely to guess) and the second is easy to classify [2]. In the first population the observers/ratings agree by chance, and in the second population they always yield concordant ratings. Note that a concordant rating occurs when both observers classify the subject into the same category. Aickin's \alpha is defined as the fraction of the entire population which consists of subjects that are classified identically for a cause, rather than by chance. It follows the pattern of the kappa-like measures described above:

    \alpha = \frac{\sum_{i,j} d(i,j)\,p(i,j) - \sum_{i,j} d(i,j)\,p_1(i)\,p_2(j)}{1 - \sum_{i,j} d(i,j)\,p_1(i)\,p_2(j)},    (2.12)

where p_1(i) and p_2(j) are not the overall marginal distributions but the marginal distributions of the subpopulation of hard-to-classify subjects. Also, d(i,j) = 1 if i = j and d(i,j) = 0 otherwise, and p(i,j) is the joint probability distribution governing the classification of a subject in category i by rater 1 and category j by rater 2. The term \sum_{i,j} d(i,j)\,p_1(i)\,p_2(j) is similarly defined (as in the kappa-like measures) under the assumption that the raters are independent, but in kappa and kappa-like measures it is generally defined in terms of certain marginal distributions that occur in a model in which both chance and causal agreement are present. The correction in Aickin's method is made with use of the marginal distributions of the subjects that are hard to classify. Aickin notes that the proportion of chance-corrected agreement is the same for each of the agreement cells [2], which is why he terms his approach the constant predictive probability model.

2.2.3 Other Measures

Several measures have been developed to attempt to deal with issues related to kappa and kappa-like measures. The main issues related to kappa are its dependence on prevalence, or on the marginal distribution. Taking a closer look at Cohen's kappa,

    \hat{\kappa} = \frac{p_0 - p_e}{1 - p_e},

for fixed p_0, kappa attains its highest value when p_e is as small as possible [53]. The kappa for identical values of p_0 = 0.85 can be more than twice as high in one instance (0.70 when p_e = 0.50) as in the other (0.32 when p_e = 0.78). The next few observations are noted by Feinstein and Cicchetti [15, 25, 53]. A low value of kappa despite a high value of p_0 will occur only when the marginal totals are highly symmetrically unbalanced; this occurs when n_{0.} is very different from n_{1.} or when n_{.0} is very different from n_{.1}. They also note that unbalanced marginal totals can produce higher values of kappa than more balanced totals when n_{1.} is much higher than n_{0.} while n_{.1} is much smaller than n_{.0}. Another example discussed by Shoukri [53] attributed a paradoxical difference between good overall agreement and weak chance-corrected agreement to a high prevalence of negative cases. A similar situation could occur if there is a high prevalence of positive cases. These examples illustrate how kappa is dependent on prevalence (of disease, for example), or on the marginal distribution. In Aickin's approach [2], the proportion of agreement attributed to chance is smaller than in Cohen's approach, and thus Aickin's \alpha is always larger than Cohen's kappa. Aickin's \alpha is less sensitive to unbalanced marginal distributions.

Another issue arises when the assumption of observer independence (used when calculating the proportion of agreement expected by chance, p_e) is not met. The proportion of agreement expected by chance is determined assuming the two observers are independent (multiplying the marginal proportion for observer 1 by the marginal proportion for observer 2). When this assumption is not met, the estimate of chance agreement is inaccurate and possibly inappropriate. We are assuming both observers are guessing for every subject, which is unlikely. In addition, the proportions with which each category is used are the observed marginal proportions and may be (indeed, likely will be) different in different studies. This may make it hard to compare estimates of measures of agreement across studies.
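The dependence of kappa on the marginal distributions can be checked numerically. The sketch below uses hypothetical cell counts (chosen so that p_0 = 0.85 in both tables, consistent with the two instances quoted above) and computes Scott's \pi (2.9) and Cohen's \kappa (2.10):

```python
def scott_cohen(n11, n10, n01, n00):
    # Scott's pi (eq. 2.9) and Cohen's kappa (eq. 2.10) for a 2x2 table
    # in the notation of Table 2.3 (rows = rater 2, columns = rater 1).
    n = n11 + n10 + n01 + n00
    p0 = (n11 + n00) / n
    p1_row, p1_col = (n11 + n10) / n, (n11 + n01) / n  # "disease" marginals
    p0_row, p0_col = 1 - p1_row, 1 - p1_col
    pe_scott = ((p1_row + p1_col) / 2) ** 2 + ((p0_row + p0_col) / 2) ** 2
    pe_cohen = p1_row * p1_col + p0_row * p0_col
    pi_s = (p0 - pe_scott) / (1 - pe_scott)
    kappa = (p0 - pe_cohen) / (1 - pe_cohen)
    return pi_s, kappa

# Balanced marginals: p0 = 0.85, pe = 0.50, so kappa = 0.70
print(scott_cohen(45, 10, 5, 40))
# Unbalanced marginals (high prevalence): p0 = 0.85, pe = 0.78, so kappa ~ 0.32
print(scott_cohen(80, 10, 5, 5))
```

Both tables have identical observed agreement, yet the second table's heavy concentration of "disease" ratings inflates p_e and more than halves kappa, which is exactly the prevalence dependence described above.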
In other words, the proportion of agreement expected by chance estimates the proportion of times observers would agree if they were guessing for every subject.
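That interpretation can be illustrated with a small Monte Carlo check: if two raters assign "disease" independently at their marginal rates (hypothetical rates below, not data from this thesis), the observed proportion of agreement converges to p_e.

```python
import random

random.seed(42)

# Hypothetical marginal rates of a "disease" rating for two independent raters
p1, p2 = 0.9, 0.85
pe = p1 * p2 + (1 - p1) * (1 - p2)  # chance-expected agreement (0.78 here)

n = 200_000
agree = sum(
    (random.random() < p1) == (random.random() < p2)  # both guess; do they match?
    for _ in range(n)
)
print(agree / n, pe)  # simulated agreement is close to pe
```

With these rates p_e is 0.78, so even raters who are guessing on every subject would show high raw agreement, which is why p_0 alone overstates the quality of a rating procedure.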


More information

Person-Time Data. Incidence. Cumulative Incidence: Example. Cumulative Incidence. Person-Time Data. Person-Time Data

Person-Time Data. Incidence. Cumulative Incidence: Example. Cumulative Incidence. Person-Time Data. Person-Time Data Person-Time Data CF Jeff Lin, MD., PhD. Incidence 1. Cumulative incidence (incidence proportion) 2. Incidence density (incidence rate) December 14, 2005 c Jeff Lin, MD., PhD. c Jeff Lin, MD., PhD. Person-Time

More information

One-sample categorical data: approximate inference

One-sample categorical data: approximate inference One-sample categorical data: approximate inference Patrick Breheny October 6 Patrick Breheny Biostatistical Methods I (BIOS 5710) 1/25 Introduction It is relatively easy to think about the distribution

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Lecture 5: ANOVA and Correlation

Lecture 5: ANOVA and Correlation Lecture 5: ANOVA and Correlation Ani Manichaikul amanicha@jhsph.edu 23 April 2007 1 / 62 Comparing Multiple Groups Continous data: comparing means Analysis of variance Binary data: comparing proportions

More information

Categorical Data Analysis 1

Categorical Data Analysis 1 Categorical Data Analysis 1 STA 312: Fall 2012 1 See last slide for copyright information. 1 / 1 Variables and Cases There are n cases (people, rats, factories, wolf packs) in a data set. A variable is

More information

SAMPLE SIZE AND OPTIMAL DESIGNS FOR RELIABILITY STUDIES

SAMPLE SIZE AND OPTIMAL DESIGNS FOR RELIABILITY STUDIES STATISTICS IN MEDICINE, VOL. 17, 101 110 (1998) SAMPLE SIZE AND OPTIMAL DESIGNS FOR RELIABILITY STUDIES S. D. WALTER, * M. ELIASZIW AND A. DONNER Department of Clinical Epidemiology and Biostatistics,

More information

Fitting stratified proportional odds models by amalgamating conditional likelihoods

Fitting stratified proportional odds models by amalgamating conditional likelihoods STATISTICS IN MEDICINE Statist. Med. 2008; 27:4950 4971 Published online 10 July 2008 in Wiley InterScience (www.interscience.wiley.com).3325 Fitting stratified proportional odds models by amalgamating

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

The performance of estimation methods for generalized linear mixed models

The performance of estimation methods for generalized linear mixed models University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 2008 The performance of estimation methods for generalized linear

More information

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES REVSTAT Statistical Journal Volume 13, Number 3, November 2015, 233 243 MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES Authors: Serpil Aktas Department of

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

STA6938-Logistic Regression Model

STA6938-Logistic Regression Model Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

More information

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Yi-Hau Chen Institute of Statistical Science, Academia Sinica Joint with Nilanjan

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Overview In observational and experimental studies, the goal may be to estimate the effect

More information

A simulation study for comparing testing statistics in response-adaptive randomization

A simulation study for comparing testing statistics in response-adaptive randomization RESEARCH ARTICLE Open Access A simulation study for comparing testing statistics in response-adaptive randomization Xuemin Gu 1, J Jack Lee 2* Abstract Background: Response-adaptive randomizations are

More information

Marginal Structural Cox Model for Survival Data with Treatment-Confounder Feedback

Marginal Structural Cox Model for Survival Data with Treatment-Confounder Feedback University of South Carolina Scholar Commons Theses and Dissertations 2017 Marginal Structural Cox Model for Survival Data with Treatment-Confounder Feedback Yanan Zhang University of South Carolina Follow

More information

Non-Inferiority Tests for the Ratio of Two Proportions in a Cluster- Randomized Design

Non-Inferiority Tests for the Ratio of Two Proportions in a Cluster- Randomized Design Chapter 236 Non-Inferiority Tests for the Ratio of Two Proportions in a Cluster- Randomized Design Introduction This module provides power analysis and sample size calculation for non-inferiority tests

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score Causal Inference with General Treatment Regimes: Generalizing the Propensity Score David van Dyk Department of Statistics, University of California, Irvine vandyk@stat.harvard.edu Joint work with Kosuke

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

STAT 705: Analysis of Contingency Tables

STAT 705: Analysis of Contingency Tables STAT 705: Analysis of Contingency Tables Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Analysis of Contingency Tables 1 / 45 Outline of Part I: models and parameters Basic

More information

Cohen s s Kappa and Log-linear Models

Cohen s s Kappa and Log-linear Models Cohen s s Kappa and Log-linear Models HRP 261 03/03/03 10-11 11 am 1. Cohen s Kappa Actual agreement = sum of the proportions found on the diagonals. π ii Cohen: Compare the actual agreement with the chance

More information

An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications

An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications WORKING PAPER SERIES WORKING PAPER NO 7, 2008 Swedish Business School at Örebro An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications By Hans Högberg

More information

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score Interpret Standard Deviation Outlier Rule Linear Transformations Describe the Distribution OR Compare the Distributions SOCS Using Normalcdf and Invnorm (Calculator Tips) Interpret a z score What is an

More information

Covariate Balancing Propensity Score for General Treatment Regimes

Covariate Balancing Propensity Score for General Treatment Regimes Covariate Balancing Propensity Score for General Treatment Regimes Kosuke Imai Princeton University October 14, 2014 Talk at the Department of Psychiatry, Columbia University Joint work with Christian

More information

BIOS 6649: Handout Exercise Solution

BIOS 6649: Handout Exercise Solution BIOS 6649: Handout Exercise Solution NOTE: I encourage you to work together, but the work you submit must be your own. Any plagiarism will result in loss of all marks. This assignment is based on weight-loss

More information

ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS

ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS Background Independent observations: Short review of well-known facts Comparison of two groups continuous response Control group:

More information

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression Section IX Introduction to Logistic Regression for binary outcomes Poisson regression 0 Sec 9 - Logistic regression In linear regression, we studied models where Y is a continuous variable. What about

More information

Lecture 12: Effect modification, and confounding in logistic regression

Lecture 12: Effect modification, and confounding in logistic regression Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007 Today Categorical predictor create dummy variables just like for linear regression

More information

Correlation and simple linear regression S5

Correlation and simple linear regression S5 Basic medical statistics for clinical and eperimental research Correlation and simple linear regression S5 Katarzyna Jóźwiak k.jozwiak@nki.nl November 15, 2017 1/41 Introduction Eample: Brain size and

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Probability quantifies randomness and uncertainty How do I estimate the normalization and logarithmic slope of a X ray continuum, assuming

More information

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests Chapter 59 Two Correlated Proportions on- Inferiority, Superiority, and Equivalence Tests Introduction This chapter documents three closely related procedures: non-inferiority tests, superiority (by a

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Estimation and sample size calculations for correlated binary error rates of biometric identification devices

Estimation and sample size calculations for correlated binary error rates of biometric identification devices Estimation and sample size calculations for correlated binary error rates of biometric identification devices Michael E. Schuckers,11 Valentine Hall, Department of Mathematics Saint Lawrence University,

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Experimental designs for multiple responses with different models

Experimental designs for multiple responses with different models Graduate Theses and Dissertations Graduate College 2015 Experimental designs for multiple responses with different models Wilmina Mary Marget Iowa State University Follow this and additional works at:

More information

Part III: Unstructured Data. Lecture timetable. Analysis of data. Data Retrieval: III.1 Unstructured data and data retrieval

Part III: Unstructured Data. Lecture timetable. Analysis of data. Data Retrieval: III.1 Unstructured data and data retrieval Inf1-DA 2010 20 III: 28 / 89 Part III Unstructured Data Data Retrieval: III.1 Unstructured data and data retrieval Statistical Analysis of Data: III.2 Data scales and summary statistics III.3 Hypothesis

More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

Performance Evaluation and Comparison

Performance Evaluation and Comparison Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation

More information

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation Biost 58 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 5: Review Purpose of Statistics Statistics is about science (Science in the broadest

More information