Inter-Rater Agreement


Engineering Statistics (EGC 630), Dec. 2008
http://core.ecu.edu/psyc/wuenschk/spss.htm
Degree of agreement/disagreement among raters

Psychologists commonly measure various characteristics by having a rater assign scores to observed people, other animals, other objects, or events. When using such a measurement technique, it is desirable to measure the extent to which two or more raters agree when rating the same set of things. This agreement can be treated as a sort of reliability statistic for the measurement procedure.

Continuous Ratings, Two Judges

Let us first consider a circumstance where we are comfortable with treating the ratings as a continuous variable. For example, suppose that we have two judges rating the aggressiveness of each of a group of children on a playground. If the judges agree with one another, then there should be a high correlation between the ratings given by the one judge and those given by the other. Accordingly, one thing we can do to assess inter-rater agreement is to correlate the two judges' ratings. Consider the following ratings (they also happen to be ranks) of ten subjects:

Subject:   1   2   3   4   5   6   7   8   9  10
Judge 1:  10   9   8   7   6   5   4   3   2   1
Judge 2:   9  10   8   7   5   6   4   3   1   2

In SPSS, click Analyze, Correlate, Bivariate, and move both judges' ratings into the analysis. Here is the output:

Correlations

                               judge1     judge2
judge1  Pearson Correlation     1          .964**
        Sig. (2-tailed)                    .000
judge2  Pearson Correlation     .964**    1
        Sig. (2-tailed)         .000

**. Correlation is significant at the 0.01 level (2-tailed).

Nonparametric Correlations

                                                  judge1     judge2
Kendall's tau_b  judge1  Correlation Coefficient  1.000       .867**
                         Sig. (2-tailed)            .         .000
                 judge2  Correlation Coefficient   .867**    1.000
                         Sig. (2-tailed)           .000        .
Spearman's rho   judge1  Correlation Coefficient  1.000       .964**
                         Sig. (2-tailed)            .         .000
                 judge2  Correlation Coefficient   .964**    1.000
                         Sig. (2-tailed)           .000        .

**. Correlation is significant at the 0.01 level (2-tailed).
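For readers working outside SPSS, the same three coefficients can be reproduced with SciPy (this is an illustrative sketch, not part of the original SPSS workflow; it assumes SciPy is installed):

```python
from scipy import stats

# Ratings of the ten subjects by the two judges (ranks 1-10)
judge1 = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
judge2 = [9, 10, 8, 7, 5, 6, 4, 3, 1, 2]

r, _ = stats.pearsonr(judge1, judge2)      # Pearson r
rho, _ = stats.spearmanr(judge1, judge2)   # Spearman rho
tau, _ = stats.kendalltau(judge1, judge2)  # Kendall tau-b

print(round(r, 3), round(rho, 3), round(tau, 3))  # 0.964 0.964 0.867
```

Because the ratings are already ranks, Pearson r and Spearman rho are identical here.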

The Pearson correlation is impressive, r = .964. If our scores are ranks, or we can justify converting them to ranks, we can compute the Spearman correlation coefficient or Kendall's tau. For these data, Spearman rho is .964 and Kendall's tau is .867. We must, however, consider the fact that two judges' scores could be highly correlated with one another but show little agreement. Consider the following data:

Subject:   1   2   3   4   5   6   7   8   9  10
Judge 4:  10   9   8   7   6   5   4   3   2   1
Judge 5:  90 100  80  70  50  60  40  30  10  20

The correlations between Judges 4 and 5 are identical to those between Judges 1 and 2, but Judges 4 and 5 obviously do not agree with one another well. Judges 4 and 5 agree on the ordering of the children with respect to their aggressiveness, but not on the overall amount of aggressiveness shown by the children. One solution to this problem is to compute the intraclass correlation coefficient. For the data above, the intraclass correlation coefficient between Judges 1 and 2 is .967, while that between Judges 4 and 5 is only .0535.

SPSS Steps: Analyze, Scale, Reliability Analysis
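The intraclass correlation that SPSS reports here (two-way model, absolute-agreement definition, single measures) can also be computed directly from the two-way ANOVA decomposition. A minimal NumPy sketch, using the judges' ratings from the text (the function name `icc_a1` is my own label, after McGraw and Wong's ICC(A,1) notation):

```python
import numpy as np

def icc_a1(ratings):
    """Single-measures intraclass correlation, two-way model,
    absolute-agreement definition (SPSS "Single Measures")."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape                  # n subjects (rows), k judges (columns)
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # judges
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k / n * (ms_cols - ms_err))

judges_1_2 = np.column_stack([[10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
                              [9, 10, 8, 7, 5, 6, 4, 3, 1, 2]])
judges_4_5 = np.column_stack([[10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
                              [90, 100, 80, 70, 50, 60, 40, 30, 10, 20]])
print(round(icc_a1(judges_1_2), 3))  # 0.967
print(round(icc_a1(judges_4_5), 4))  # 0.0535
```

Unlike Pearson r, this statistic penalizes Judges 4 and 5 for disagreeing about the magnitude of aggressiveness, not just the ordering.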

Intraclass Correlation Coefficient

                   Intraclass        95% Confidence Interval    F Test with True Value 0
                   Correlation(a)    Lower Bound  Upper Bound   Value    df1  df2  Sig
Single Measures    .967(b)           .873         .992          54.000   9    9    .000
Average Measures   .983(c)           .932         .996          54.000   9    9    .000

Two-way mixed effects model where people effects are random and measures effects are fixed.
a. Type A intraclass correlation coefficients using an absolute agreement definition.
b. The estimator is the same, whether the interaction effect is present or not.
c. This estimate is computed assuming the interaction effect is absent, because it is not estimable otherwise.

What if we have more than two judges, as below? We could compute Pearson r, Spearman rho, or Kendall tau for each pair of judges and then average those coefficients, but we still would have the problem of high coefficients when the judges agree on ordering but not on magnitude. We can, however, compute the intraclass correlation coefficient when there are more than two judges. For the data from three judges below, the intraclass correlation coefficient is .882.

Subject:   1   2   3   4   5   6   7   8   9  10
Judge 1:  10   9   8   7   6   5   4   3   2   1
Judge 2:   9  10   8   7   5   6   4   3   1   2
Judge 3:   8   7  10   9   6   3   4   5   2   1
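The same ANOVA arithmetic extends unchanged to three judges. A self-contained sketch (two-way sums of squares, absolute-agreement single measures, assuming NumPy is available):

```python
import numpy as np

# Columns of the array below are subjects 1-10; one row per judge,
# transposed so that rows = subjects and columns = judges
x = np.array([[10, 9, 8, 7, 6, 5, 4, 3, 2, 1],    # Judge 1
              [9, 10, 8, 7, 5, 6, 4, 3, 1, 2],    # Judge 2
              [8, 7, 10, 9, 6, 3, 4, 5, 2, 1]],   # Judge 3
             dtype=float).T
n, k = x.shape
grand = x.mean()
ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)
ms_err = (((x - grand) ** 2).sum()
          - (n - 1) * ms_rows - (k - 1) * ms_cols) / ((n - 1) * (k - 1))
icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err
                            + k / n * (ms_cols - ms_err))
print(round(icc, 3))  # 0.882
```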

Intraclass Correlation Coefficient

                   Intraclass        95% Confidence Interval    F Test with True Value 0
                   Correlation(a)    Lower Bound  Upper Bound   Value    df1  df2  Sig
Single Measures    .882(b)           .698         .966          21.203   9    18   .000
Average Measures   .957(c)           .874         .989          21.203   9    18   .000

Two-way mixed effects model where people effects are random and measures effects are fixed.
a. Type A intraclass correlation coefficients using an absolute agreement definition.
b. The estimator is the same, whether the interaction effect is present or not.
c. This estimate is computed assuming the interaction effect is absent, because it is not estimable otherwise.

The intraclass correlation coefficient is an index of the reliability of the ratings for a typical, single judge. We employ it when we are going to collect most of our data using only one judge at a time, but we have used two or (preferably) more judges on a subset of the data for purposes of estimating inter-rater reliability. SPSS calls this statistic the single measure intraclass correlation. If what we want is the reliability for all the judges averaged together, we need to apply the Spearman-Brown correction. The resulting statistic is called the average measure intraclass correlation in SPSS, and the inter-rater reliability coefficient by some others (see MacLennan, R. N., Interrater reliability with SPSS for Windows 5.0, The American Statistician, 1993, 47, 292-296). For our data,

  j(icc) / (1 + (j - 1)(icc)) = 3(.882) / (1 + 2(.882)) = .9573,

where j is the number of judges and icc is the intraclass correlation coefficient. This statistic is appropriate when the data for our main study involve having j judges rate each subject.

Rank Data, More Than Two Judges

When our data are rankings, we don't have to worry about differences in magnitude. In that case, we can simply employ Spearman rho or Kendall tau if there are only two judges, or Kendall's coefficient of concordance if there are three or more judges.
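The Spearman-Brown step is simple enough to check directly. A one-function sketch (the function name is my own):

```python
def average_measure_icc(icc, j):
    """Spearman-Brown correction: reliability of the mean of j judges'
    ratings, given the single-measure intraclass correlation."""
    return j * icc / (1 + (j - 1) * icc)

print(round(average_measure_icc(0.8821, 3), 3))  # 0.957
```

This reproduces SPSS's "Average Measures" value (.957) from its "Single Measures" value (.882).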
Consult David Howell's Statistical Methods for Psychology, 6th edition, for an explanation of Kendall's coefficient of concordance. Howell illustrates it with a Friedman chi-square testing the null hypothesis that the patches do not differ from one another with respect to how well they are liked. This null hypothesis is equivalent to the hypothesis that there is no agreement among the judges with respect to how pleasant the patches are. To convert the Friedman chi-square to Kendall's coefficient of concordance, we simply substitute into this equation:

  W = χ² / (J(n - 1))

where J is the number of judges and n is the number of things being ranked. If the judges gave ratings rather than ranks, you must first convert the ratings into ranks in order to compute the Kendall coefficient of concordance. Remember that ratings could be concordant in order but not in magnitude.

Kendall's Coefficient of Concordance. Take k columns with n items in each and rank each column from 1 to n. The null hypothesis is that the rankings disagree. Compute a sum of ranks SR_i for each row. Then

  S = Σ(SR_i)² − n(SR̄)², where SR̄ = k(n + 1)/2 is the mean of the SR_i's.

If H0 is disagreement, S can be checked against a table for this test. If S > S_α, reject H0. For n too large for the table, use

  χ²(n−1) = k(n − 1)W, where W = 12S / (k²(n³ − n))

is the Kendall Coefficient of Concordance and must be between 0 and 1.

Example: n = 6 applicants are rated by k = 3 officers. The ranks are below.

  Applicant   Rater 1   Rater 2   Rater 3   Rank Sum SR_i   SR_i²
  A              1         1         6            8           64
  B              6         5         3           14          196
  C              3         6         2           11          121
  D              2         4         5           11          121
  E              5         2         4           11          121
  F              4         3         1            8           64
                                       Total:    63          687

  SR̄ = 63/6 = 10.5 = k(n + 1)/2 = 3(7)/2.

Note that if we had complete disagreement, every applicant would have a rank sum of 10.5.

  S = Σ(SR_i)² − n(SR̄)² = 687 − 6(10.5)² = 25.5.

The Kendall Coefficient of Concordance says that the degree of agreement, on a zero-to-one scale, is

  W = 12S / (k²(n³ − n)) = 12(25.5) / (3²(6³ − 6)) = 0.162.

To test the null hypothesis of disagreement (α = .05), look up S_α in the table giving critical values of Kendall's S as a measure of concordance: for k = 3 and n = 6, S.05 = 103.9. Since S = 25.5 does not exceed 103.9, we accept the null hypothesis of disagreement.

Example: For n = 31 and k = 3 we get W = 0.10, and wish to test
  H0: Disagreement
  H1: Agreement
Since n = 31 is too large for the table, use χ² = k(n − 1)W = 3(30)(0.10) = 9.000. Using a χ² table, look up χ²(α; n−1) = χ²(.05; 30) = 43.773. Since 9 is below the table value, do not reject H0.
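The worked example above can be verified in a few lines. A NumPy sketch using the six applicants' ranks:

```python
import numpy as np

# Rows = applicants A-F, columns = raters 1-3
ranks = np.array([[1, 1, 6],
                  [6, 5, 3],
                  [3, 6, 2],
                  [2, 4, 5],
                  [5, 2, 4],
                  [4, 3, 1]])
n, k = ranks.shape
sr = ranks.sum(axis=1)                    # rank sums: 8 14 11 11 11 8
s = (sr ** 2).sum() - n * sr.mean() ** 2  # S = 687 - 661.5 = 25.5
w = 12 * s / (k ** 2 * (n ** 3 - n))      # Kendall's W
chi2 = k * (n - 1) * w                    # equivalent Friedman chi-square
print(s, round(w, 3), round(chi2, 2))     # 25.5 0.162 2.43
```

The last line also shows the conversion W = χ²/(k(n − 1)) running in the other direction: the Friedman chi-square for these ranks is k(n − 1)W.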

Categorical Judgments

Please re-read the discussion of the kappa statistic in David Howell's Statistical Methods for Psychology, 6th edition. Run the program Kappa.sas, from my SAS programs page, as an example of using SAS to compute kappa. It includes the data from Howell. Note that Cohen's kappa is appropriate only when you have two judges. If you have more than two judges you may use Fleiss' kappa.
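Cohen's kappa is also easy to compute without SAS. A minimal pure-Python sketch; the judgments below are hypothetical illustration data, not Howell's:

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa: chance-corrected agreement between two raters'
    categorical judgments of the same subjects."""
    n = len(rater1)
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n    # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_chance = sum(c1[cat] * c2[cat] for cat in c1) / n ** 2   # agreement expected by chance
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical yes/no judgments of ten subjects by two raters:
# they agree on 8 of 10, but half of that agreement is expected by chance
r1 = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"]
r2 = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "yes"]
print(round(cohen_kappa(r1, r2), 3))  # 0.6
```

Here observed agreement is .80 and chance agreement is .50, so kappa = (.80 − .50)/(1 − .50) = .60, illustrating how kappa discounts raw percent agreement for agreement expected by chance.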