Table 2.14 : Distribution of 125 subjects by laboratory and +/− category


Chapter 2 : The Kappa Coefficient : A Review

2.5. Kappa Coefficient and the Paradoxes

Kappa's Dependency on Trait Prevalence

On February 9, 2003 we received an e-mail from a researcher asking whether it would be possible to apply the AC1 coefficient [12] to the data in Table 2.14. The motivation behind this request is presumably the failure of Kappa to provide a reasonable estimate of the extent of agreement between the two laboratories.

[12] AC1 is an agreement statistic discussed in chapter 4, which we recommend as an alternative to Kappa (see Gwet (2008a)).

Table 2.14 : Distribution of 125 subjects by laboratory and +/− category

  Test            Reference Laboratory
  Laboratory        +       −     Total
  +               120       5      125
  −                 0       0        0
  Total           120       5      125

A look at Table 2.14 suggests that the test and the reference laboratories agree almost perfectly on the scoring of the 125 participating subjects. The only exception is a group of 5 subjects classified as positive by the test laboratory and as negative by the reference laboratory. The high agreement on one category is an indication of high prevalence of the positive trait in the subject population being tested. The Kappa coefficient associated with these data is obtained as follows:

$$p_a = (120 + 0)/125 = 0.96, \qquad p_e = (125 \times 120 + 0 \times 5)/125^2 = 0.96, \qquad \hat{\gamma}_\kappa = \frac{p_a - p_e}{1 - p_e} = 0.$$

Here is an example of a situation where a researcher would normally expect near-perfect agreement between observers, regardless of how it is measured. Yet Kappa yields a coefficient of 0, suggesting a total absence of agreement between laboratories. We want to make the following 2 comments:

(i) Regardless of how the notion of agreement between laboratories is defined in this particular instance, Kappa does not quantify it well. Therefore, this situation is a paradox.

(ii) An overall agreement probability $p_a$ of 0.96 is entirely expected, whereas a chance-agreement probability $p_e$ of the same magnitude, 0.96, is not. This raises questions about the very nature of the concept being represented by the chance-agreement probability $p_e$.

According to Cohen (1960), the inventor of the Kappa coefficient, $p_e$ represents "... the proportion of units for which agreement is expected by chance." An examination of the expression of $p_e$ suggests that it measures the agreement probability under the following 2 assumptions:

- Both laboratories classify all 125 subjects randomly.
- Each laboratory's random classification is performed according to the observed marginal probabilities (1.0 and 0 for the test laboratory, and 0.96 and 0.04 for the reference laboratory).

It is this second condition that is unreasonable and that causes the paradox. The observed marginals indicate that the test laboratory classifies all subjects into the + category with certainty (i.e. with a probability of 1), while the reference laboratory classifies 96% of the subjects into the same category. Generating random ratings according to these probabilities will dramatically, and perhaps artificially, increase the proportion of agreement by chance. The use of observed marginals to define chance agreement (or random rating) may not be reasonable if these marginals are very unbalanced toward one category. Moreover, $p_a$ and $p_e$ (as defined by Cohen (1960)) may not even have a common base for the difference $p_a - p_e$ to have a clear meaning (more on this in chapter 4).

The use of marginal probabilities as an objective means for quantifying the chance-agreement probability $p_e$ is questionable, and is at the origin of the Kappa paradoxes. On this issue, Feinstein and Cicchetti (1990, p. 548) say the following:

"The reasoning makes the assumption that each observer has a relatively fixed prior probability of making positive or negative responses. The assumption does not seem appropriate, however for most clinical observers. If unbiased, the observers will usually respond to whatever is presented in each particular instance of challenge. The observers may develop a fixed prior probability if they know in advance that the challenge population is predominantly normal or abnormal, positive or negative - but there is no reason to assume that such probabilities will be established in advance if the observers are blind to the characteristics of the challenge population."

In our view, this paradox is not created by a high trait prevalence (which is often unknown), as is often suggested in the literature. Instead, it finds its origin in the way the researcher defines agreement by chance. In the current formulation of Kappa, all or most ratings associated with one category could be used in the calculation of $p_e$ as if they were assigned randomly. This may lead to a chance-agreement probability that is higher than the overall agreement probability, producing a negative value for the Kappa coefficient. Even in the case of Kappa, trait prevalence will distort the coefficient's value only if it is high and the raters are assumed to have classified the subjects correctly. This problem is discussed extensively by Gwet (2008a).

Regardless of what causes the Kappa paradoxes, they pose a serious underestimation problem in the extreme situations where the table is very unbalanced in its marginals. Kraemer et al. (2002, p. 2114) attempted to defend the merit of Kappa and its paradoxical behavior in the situation illustrated in Table 2.14, by saying the following:

"It is useful to note that κ = 0 indicates either that the heterogeneity of the patients in the population is not well detected by the raters or ratings, or that the patients in the population are homogeneous. Consequently, it is well known that it is very difficult to achieve high reliability of any measure (binary or not) in a very homogeneous population (P near 0 or 1 for binary measures). That is not a flaw in kappa ... or any other measure of reliability, or a paradox. It merely reflects the fact that it is difficult to make clear distinctions between the patients in a population in which those distinctions are very rare or fine. In such populations, noise quickly overwhelms the signals."

This statement is very convoluted. It does not explain why a homogeneous population (i.e. high prevalence) should prevent two raters from agreeing on the classification of subjects that appear to have everything in common. Is Kappa measuring the extent of agreement among raters then? Moreover, what is known in statistical science is not that high reliability is difficult to achieve in very homogeneous populations (P = 0 or P = 1). It is rather that high reliability is difficult to achieve in very heterogeneous populations (i.e. P ≈ 0.5), which is the situation in which the variance reaches its maximum.

Kappa's Dependency on Marginal Homogeneity

Feinstein and Cicchetti (1990) presented Tables 2.15 and 2.16 to illustrate the second Kappa paradox, which is characterized by a low Kappa value associated with good agreement among raters on marginal counts. These authors argue that good agreement on marginals should translate into good interrater agreement. The Kappa value associated with the Table 2.15 data is $\hat{\gamma}_\kappa = 0.13$, which is half the Kappa value of $\hat{\gamma}_\kappa = 0.26$ associated with the Table 2.16 data. The point here is that observers A and B should not be penalized for having similar marginal probabilities, as in Table 2.15. Although reasonable, this statement is disputable and may be inaccurate if the overall agreement $p_a$ among raters is low.
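The first paradox can be checked numerically with a minimal sketch of Cohen's (1960) unweighted Kappa. The function name `cohen_kappa` and the variable names are illustrative choices, not taken from the text. The cell counts (120, 5, 0, 0) are an assumption inferred from the description of Table 2.14: 125 subjects, all classified + by the test laboratory, and 5 of them classified − by the reference laboratory.

```python
def cohen_kappa(table):
    """Return (p_a, p_e, kappa) for a square table of counts.

    table[k][l] is the number of subjects placed in category k by the
    first rater (rows) and in category l by the second rater (columns).
    """
    n = sum(sum(row) for row in table)
    q = len(table)
    # Observed marginal proportions of each rater
    row_marg = [sum(table[k]) / n for k in range(q)]
    col_marg = [sum(table[k][l] for k in range(q)) / n for l in range(q)]
    # Overall agreement: proportion of subjects on the diagonal
    p_a = sum(table[k][k] for k in range(q)) / n
    # Chance agreement under Cohen's assumption of random rating
    # according to the observed marginals
    p_e = sum(row_marg[k] * col_marg[k] for k in range(q))
    # Note: undefined (division by zero) when p_e equals 1
    kappa = (p_a - p_e) / (1 - p_e)
    return p_a, p_e, kappa


# Assumed Table 2.14 counts: rows = test laboratory (+, -),
# columns = reference laboratory (+, -)
table_2_14 = [[120, 5],
              [0, 0]]

p_a, p_e, kappa = cohen_kappa(table_2_14)
print(f"p_a = {p_a:.2f}, p_e = {p_e:.2f}, kappa = {kappa:.2f}")
```

Because the test laboratory's marginal for the + category equals 1, the chance-agreement term collapses to the reference laboratory's marginal of 0.96, which matches the overall agreement and drives Kappa to 0.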

The root cause of the second Kappa paradox is similar to that of the first Kappa paradox (the one associated with high trait prevalence). That is, to obtain Cohen's chance-agreement probability $p_e$ for Table 2.15, one must make the strong and often unreasonable assumption that the observed marginal probabilities, (0.7, 0.3) for Observer A and (0.6, 0.4) for Observer B, are fixed and predetermined "yes" and "no" classification propensities specific to each observer, and that the observers keep these propensities whether they classify a subject randomly or not. A major implication of this assumption for Table 2.15 is the unduly high chance-agreement probability on the "yes" category alone of $0.7 \times 0.6 = 0.42$, which leads to a low Kappa value.

Table 2.15 : Distribution of 100 subjects by rater : symmetrical imbalance
(rows: Observer B, Yes / No / Total; columns: Observer A, Yes / No / Total)

Table 2.16 : Distribution of 100 subjects by rater : asymmetrical imbalance
(rows: Observer B, Yes / No / Total; columns: Observer A, Yes / No / Total)

We indicated at the beginning of this section that the use of 3 or more nominal or ordinal categories may create different types of disagreements, some potentially more serious than others. Researchers may want to consider the less serious disagreements as partial agreements and treat them accordingly. This problem has been resolved by weighting the Kappa coefficient, and is briefly discussed in the next section.

2.6 Weighting of the Kappa Coefficient

Tables 2.17 and 2.18 contain pregnancy type data collected from 100 women who presented themselves in an Emergency Room with a positive pregnancy test and a second condition, which is either abdominal pain or vaginal bleeding. After reviewing their medical records, three reviewers (also referred to as abstractors) classified them into one of the following 3 pregnancy categories: Ectopic Pregnancy (Ectopic), Abnormal Intrauterine Pregnancy (ABN IUP), and Normal Intrauterine Pregnancy (NOR IUP).

Table 2.17 : Distribution of 100 pregnant women by pregnancy type and abstractor, as classified by abstractors 1 and 2
(rows: Abstractor 1; columns: Abstractor 2; categories: Ectopic, ABN IUP, NOR IUP, Total)

Table 2.18 : Distribution of 100 pregnant women by pregnancy type and abstractor, as classified by abstractors 1 and 3
(rows: Abstractor 1; columns: Abstractor 3; categories: Ectopic, ABN IUP, NOR IUP, Total)

The extent of agreement between abstractors 1 and 2, measured by the Kappa coefficient, is given by

$$\hat{\gamma}_\kappa(1, 2) = \frac{p_a - p_e}{1 - p_e} = \frac{0.89 - 0.4597}{1 - 0.4597} = 0.7964,$$

and is identical to the extent of agreement [14] between abstractors 1 and 3.

[14] The overall agreement probability $p_a$ and the chance-agreement probability $p_e$ are obtained as follows: $p_a = 0.89$ (the sum of the diagonal counts divided by 100), and $p_e = (13/100)^2 + (27/100)(24/100) + (60/100)(63/100) = 0.4597$.

However, the fundamental issue for a clinician in this chart review is whether the pregnancy is ectopic or intrauterine. That is, a disagreement over the nature (normal versus abnormal) of the intrauterine pregnancy is of secondary importance. This fact suggests that abstractors 1 and 2 are more in agreement than abstractors 1 and 3 are. But the Kappa coefficient does not reflect this reality.

This example features a reliability experiment where certain disagreements are more serious than others. The less serious disagreements, such as abnormal versus normal intrauterine pregnancies, are actually partial agreements that should be treated as such. To resolve this problem, Cohen (1968) proposed the Weighted Kappa coefficient, which typically assigns a weight of 1 to full (or diagonal) agreements, and assigns to the disagreements a weight whose magnitude decreases proportionally to their seriousness.

Cohen (1968) proposed the weighted Kappa in the context of 2 raters and $q$-item response categories. A set of weights $w_{kl}$ ($k, l = 1, \dots, q$) between 0 and 1 must be assigned to the $q^2$ cells of a contingency table similar to Table 2.7, prior to collecting rating data. This a-priori assignment of weights to cells aims at ensuring their independence from the observations, which makes them an integral part of the definition of the Weighted Kappa coefficient. For the classical unweighted Kappa of Cohen (1960) the weights are defined as $w_{kk} = 1$ for $k = 1, \dots, q$, and $w_{kl} = 0$ if $k \neq l$. Once the weights are assigned, the weighted Kappa can be defined as follows:

$$\hat{\gamma}_{\kappa w} = \frac{p_a - p_e}{1 - p_e}, \qquad (2.15)$$

where the weighted overall agreement probability $p_a$ and the weighted chance-agreement probability $p_e$ are respectively given by:

$$p_a = \sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl}\, p_{kl}, \quad \text{and} \quad p_e = \sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl}\, p_{k+}\, p_{+l}. \qquad (2.16)$$

Cohen (1968) suggests that the weights could be assigned in an arbitrary fashion, either by a group of experts or by a particular investigator. Two sets of weights that have been proposed in the literature are the Linear Weights, defined for every cell $(k, l)$ by $w_{kl} = 1 - |k - l|/(q - 1)$, and the Quadratic Weights, defined by $w_{kl} = 1 - (k - l)^2/(q - 1)^2$.

Example 2.5

Let us label the 3 response categories {Ectopic, ABN IUP, NOR IUP} of Table 2.17 numerically as categories 1, 2, and 3.

Table 2.19 : Linear and Quadratic Weights for the Pregnancy Classification Example

        Linear Weights              Quadratic Weights
         1      2      3             1      2      3
  1     1.00   0.50   0.00          1.00   0.75   0.00
  2     0.50   1.00   0.50          0.75   1.00   0.75
  3     0.00   0.50   1.00          0.00   0.75   1.00

Table 2.20 : Weighted Cell Proportions ($w_{kl}\, p_{kl}$) of Table 2.17 using Linear and Quadratic Weights
(two panels: Linear Weights, Quadratic Weights)
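The weight formulas and equations (2.15) and (2.16) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's implementation: the function names are hypothetical, categories are indexed from 0 (which leaves the differences k − l unchanged), and the 3 × 3 table of counts is made up for demonstration and is not the data of Table 2.17 or 2.18.

```python
def linear_weights(q):
    # w_kl = 1 - |k - l| / (q - 1)
    return [[1 - abs(k - l) / (q - 1) for l in range(q)] for k in range(q)]


def quadratic_weights(q):
    # w_kl = 1 - (k - l)^2 / (q - 1)^2
    return [[1 - (k - l) ** 2 / (q - 1) ** 2 for l in range(q)] for k in range(q)]


def identity_weights(q):
    # Unweighted Kappa: weight 1 on the diagonal, 0 elsewhere
    return [[1.0 if k == l else 0.0 for l in range(q)] for k in range(q)]


def weighted_kappa(table, weights):
    """Weighted Kappa of equation (2.15) for a square table of counts."""
    n = sum(sum(row) for row in table)
    q = len(table)
    p = [[table[k][l] / n for l in range(q)] for k in range(q)]
    p_row = [sum(p[k]) for k in range(q)]                        # p_{k+}
    p_col = [sum(p[k][l] for k in range(q)) for l in range(q)]   # p_{+l}
    # Weighted overall and chance-agreement probabilities, equation (2.16)
    p_a = sum(weights[k][l] * p[k][l] for k in range(q) for l in range(q))
    p_e = sum(weights[k][l] * p_row[k] * p_col[l]
              for k in range(q) for l in range(q))
    return (p_a - p_e) / (1 - p_e)


# Weights for q = 3 categories, as in the pregnancy classification example
print(linear_weights(3))
print(quadratic_weights(3))

# Hypothetical counts for illustration only (NOT the data of Table 2.17)
hypothetical = [[20, 5, 0],
                [4, 30, 6],
                [1, 4, 30]]

for name, w in [("unweighted", identity_weights(3)),
                ("linear", linear_weights(3)),
                ("quadratic", quadratic_weights(3))]:
    print(name, round(weighted_kappa(hypothetical, w), 4))
```

With identity weights the function reduces to the unweighted Kappa of Cohen (1960), which gives a simple consistency check on the implementation.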
