The Benefits of a Model of Annotation

1 The Benefits of a Model of Annotation
Rebecca J. Passonneau and Bob Carpenter
Columbia University
Center for Computational Learning Systems
Department of Statistics
LAW VII, August 2013

2 Conventional Approach To Corpus Quality
Create gold-standard corpus $A = B \cup C$, $B \ll C$
- Every instance in B labeled by all (2 to 4) annotators for interannotator reliability (consistency)
- Every instance in C labeled by a single annotator for ground truth dataset
Assumptions
- There is only one way to be accurate
- If annotators are consistent, they must be accurate
- Corpus quality is good enough if annotators are consistent on B

3 Word Sense Judgments: Adjectival Fair
6 of 10 WordNet senses used, 6 annotators
Sometimes even many annotators agree 100%
- To be fair, I want to pass on to you a complaint that I do find valid, so you can better judge the situation.
  {sense1, sense1, sense1, sense1, sense1, sense1}
- High praise is tidy middling and middling is very fair!
  {sense5, sense5, sense5, sense5, sense5, sense5}
What is the true label if annotators do not agree?
- And our ideas of what constitutes a fair wage or a fair return on capital are historically contingent.
  {sense1, sense1, sense1, sense2, sense2, sense2}
- ... the federal government ... is wrangling for its fair share of the dividend.
  {sense1, sense1, sense2, sense2, sense8, sense8}

4 Problems
- Agreement measures consistency, not quality
- There is no way to distinguish above-average from below-average annotators
- B may not be a representative sample
- No information about the quality of individual instances
- No method to infer the ground truth label on instances where annotators do not agree

5 Outline
- Related Work
- Limitations of Agreement Measures
- Probabilistic Model
- MASC word sense corpus
- Crowdsourced version
- Model results
  - Class prevalence
  - Annotator accuracy and bias
  - Ground truth labels

6 Related Work
- Dawid and Skene, 1979; Albert and Dodd, 2008; Smyth, 1995
- Bruce and Wiebe, 1999; Snow et al., 2008
- Rzhetsky et al., 2009; Whitehill et al., 2009
- Hovy et al., 2013 (NAACL)

7 Pairwise Agreement $A \in [0, 1]$
- Considers I Items, J Annotators
- Sum all pairs of annotators who agree on the label
- Normalize by the total number of annotator pairs
- No reference to rate of labels $k \in 1{:}K$
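The counting procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `pairwise_agreement` and the list-of-lists input layout are assumptions, and the example data are the four label sets shown for adjectival "fair" on slide 3.

```python
from itertools import combinations

def pairwise_agreement(labels_by_item):
    """Fraction of annotator pairs that agree, pooled over all items.

    labels_by_item: list of lists; each inner list holds the labels
    that the annotators assigned to one item (hypothetical layout).
    """
    agreeing = 0
    total = 0
    for labels in labels_by_item:
        # Every unordered pair of annotators on this item counts once.
        for a, b in combinations(labels, 2):
            total += 1
            agreeing += int(a == b)
    if total == 0:
        raise ValueError("need at least one item with two or more labels")
    return agreeing / total

# The four "fair" sentences from slide 3, six annotators each.
fair_examples = [
    ["sense1"] * 6,
    ["sense5"] * 6,
    ["sense1"] * 3 + ["sense2"] * 3,
    ["sense1", "sense1", "sense2", "sense2", "sense8", "sense8"],
]
agreement = pairwise_agreement(fair_examples)
```

Each item contributes 15 annotator pairs; the four items have 15, 15, 6, and 3 agreeing pairs, so the pooled value is 39/60 = 0.65. Nothing in the computation depends on how often each sense occurs, which is the last point on the slide.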

8 Agreement Coefficients $IA \in [-1, 1]$
- Considers I Items, J Annotators, K Label Classes
- Chance-adjusted: $IA = \frac{A - C}{1 - C}$
- $C$ based on label proportions $\psi_{m,k}$ for each $m \in J$
- For annotators $m, n$, $m \neq n$:
  Kappa: $C_{m,n} = \sum_{k=1}^{K} \psi_{m,k}\,\psi_{n,k}$
  Alpha: $C_{m,n} = \sum_{k=1}^{K} \psi_k^2$
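A small sketch of the chance adjustment for one pair of annotators, following the formulas as written on this slide. The function names and the `pooled` flag are assumptions made for illustration. The pooled variant uses the slide's simplified chance term (the sum of squared pooled proportions) and omits any finite-sample correction that published definitions of alpha may apply.

```python
from collections import Counter

def proportions(labels, classes):
    """Relative frequency of each class in a list of labels."""
    counts = Counter(labels)
    return {k: counts[k] / len(labels) for k in classes}

def chance_adjusted(m_labels, n_labels, pooled=False):
    """(A - C) / (1 - C) for two annotators who labeled the same items.

    pooled=False: kappa-style chance term, sum_k psi_{m,k} * psi_{n,k}
    pooled=True:  alpha-style chance term as on the slide, sum_k psi_k^2
    """
    if len(m_labels) != len(n_labels) or not m_labels:
        raise ValueError("both annotators must label the same non-empty item list")
    classes = sorted(set(m_labels) | set(n_labels))
    observed = sum(a == b for a, b in zip(m_labels, n_labels)) / len(m_labels)
    if pooled:
        psi = proportions(list(m_labels) + list(n_labels), classes)
        chance = sum(psi[k] ** 2 for k in classes)
    else:
        psi_m = proportions(m_labels, classes)
        psi_n = proportions(n_labels, classes)
        chance = sum(psi_m[k] * psi_n[k] for k in classes)
    if chance == 1.0:
        raise ValueError("chance agreement is 1; the coefficient is undefined")
    return (observed - chance) / (1.0 - chance)

m = ["a", "a", "a", "b"]
n = ["a", "a", "b", "b"]
kappa = chance_adjusted(m, n)
alpha_style = chance_adjusted(m, n, pooled=True)
```

For this toy pair, observed agreement is 0.75. The kappa-style chance term is 0.5, giving 0.5, and the pooled chance term is 34/64, giving 7/15. The two coefficients differ only in how the chance term is estimated.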

9 Drawbacks to Agreement Metrics
1. Intrinsically pairwise
2. Agreement on error indistinguishable from correct agreement
3. When chance agreement is high, chance-adjusted agreement is low (high prevalence categories)
4. Annotators can have identical bias on a category
5. Item-level effects (difficulty) can inflate agreement-in-error
6. Decision boundaries for agreement quality are arbitrary
7. Confidence intervals (rarely computed) can be wide enough to cross decision boundaries
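Drawback 3 can be made concrete with a small worked example. The counts below are invented for illustration and do not come from the MASC data: two annotators label 100 items with two classes, one of which is very frequent. The helper name `kappa_from_counts` is likewise an assumption.

```python
def kappa_from_counts(both_a, m_a_n_b, m_b_n_a, both_b):
    """Observed agreement, chance agreement, and kappa from a 2x2 table.

    Arguments are counts of items by (annotator m label, annotator n label).
    """
    total = both_a + m_a_n_b + m_b_n_a + both_b
    observed = (both_a + both_b) / total
    # Proportion of items each annotator labeled "a".
    psi_m_a = (both_a + m_a_n_b) / total
    psi_n_a = (both_a + m_b_n_a) / total
    chance = psi_m_a * psi_n_a + (1 - psi_m_a) * (1 - psi_n_a)
    return observed, chance, (observed - chance) / (1 - chance)

# Hypothetical counts: 90 items both "a", 5 + 5 disagreements, none both "b".
observed, chance, kappa = kappa_from_counts(90, 5, 5, 0)
```

Raw agreement here is 0.90, but each annotator uses the frequent label 95% of the time, so chance agreement is 0.905 and kappa comes out slightly negative (about -0.05), even though the annotators agree on nine items out of ten.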

10 MASC Word Sense Sentence Corpus
- 116 lemmas: 29 adjectives, 46 nouns, 41 verbs
- 1,000 example sentences per lemma drawn from the MASC corpus, a heterogeneous corpus of 19 genres
- 2,392,873 words counting every sentence once
- 3,328,815 words counting every sentence once for each annotated word
- 7.2 WordNet senses per word

11 Word Sense Annotation Procedures
Annotators
- College students from Vassar, Columbia, Barnard
- Trained using guidelines from Christiane Fellbaum
- SATANiC Graphical User Interface, with SVN
Procedures
- Annotation rounds of 10 words per round, 2 to 4 annotators
- Pre-annotation sample: 50 sentences, to learn and review sense inventory
- 900 sentences annotated with one annotator per sentence
- 100 sentences annotated by all annotators for agreement

12 Agreement Results Go from High to Low
Table columns: Word, POS, Senses, α, Agreement (only the Word and POS values survive)

Word      POS
late      adj
high      adj
severe    adj
strike    noun
date      noun
success   noun
mature    verb
add       verb
ask       verb

- One label per instance may not be enough
- Model-based annotation evaluation can improve results

13 Crowdsourced Word Sense Annotation
- Amazon Mechanical Turk
- 45 of the 116 words
- Same sentences, 20 to 25 labels per sentence
- 1 HIT: 10 sentences (100 HITs per word)
- Extensive piloting of HIT design/pricing/qualifications
  - 90% lifetime approval rating
  - 20,000 approved HITs
  - U.S. domain
- 228 turkers

14 Sample HIT
[Figure: sample HIT]

15 Data Format
Each of the n observed labels is a tuple of
- an item $ii \in I$
- an annotator $jj \in J$
- a label $y \in K$
Table columns: $n$, $ii_n$, $jj_n$, $y_n$
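A sketch of how an item-by-annotator table could be flattened into the tuple format described on this slide. The function name `to_long_format`, the use of `None` for missing labels, and the 0-based indices are assumptions made for the example (the slides index from 1).

```python
def to_long_format(table):
    """Flatten table[i][j] into parallel lists (ii, jj, y).

    table[i][j] is annotator j's label for item i, or None when
    annotator j did not label item i (hypothetical input layout).
    """
    ii, jj, y = [], [], []
    for i, row in enumerate(table):
        for j, label in enumerate(row):
            if label is not None:
                ii.append(i)   # item index of observation n
                jj.append(j)   # annotator index of observation n
                y.append(label)  # label given in observation n
    return ii, jj, y

# Three items, three annotators, each item labeled by two of them.
table = [
    [0, 0, None],
    [1, None, 1],
    [None, 2, 0],
]
ii, jj, y = to_long_format(table)
```

The long format does not require every annotator to label every item, which suits both the MASC design (most sentences have one annotator) and the crowdsourced design (20 to 25 labels per sentence from a large pool of turkers).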

16 Dawid and Skene Model
Joint probability of:
- true labels $z_i \in 1{:}K$, with $z_i \sim \mathrm{Categorical}(\pi)$
- prevalence $\pi$, a point in the K-simplex with entries $\pi_k$
- annotator probability $\theta_{j,k,k'}$ of assigning $k'$ when the true label is $k$ (each $\theta_{j,k}$ is also a simplex, i.e., probabilities must be $\geq 0$ and sum to 1)
- observed labels $y_n \sim \mathrm{Categorical}(\theta_{jj[n],\,z[ii[n]]})$: annotator $jj[n]$'s response to item $ii[n]$, whose true category is $z[ii[n]]$
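The generative story on this slide can be sketched as a small simulator. This is an illustration only: the function name `simulate`, the nested-list layout `theta[j][k][k2]`, and the 0-based indices are assumptions, and only the Python standard library is used.

```python
import random

def simulate(pi, theta, ii, jj, seed=0):
    """Draw true labels z and observed labels y from the model.

    pi:     list of K class prevalences (sums to 1)
    theta:  theta[j][k][k2] = probability that annotator j assigns k2
            when the true label is k
    ii, jj: item and annotator index for each observation n
    """
    rng = random.Random(seed)
    classes = range(len(pi))
    n_items = max(ii) + 1
    # z_i ~ Categorical(pi)
    z = [rng.choices(classes, weights=pi)[0] for _ in range(n_items)]
    # y_n ~ Categorical(theta[jj[n]][z[ii[n]]])
    y = [rng.choices(classes, weights=theta[j][z[i]])[0] for i, j in zip(ii, jj)]
    return z, y

# Two classes, two annotators: annotator 0 is perfect, annotator 1 is noisy.
pi = [0.7, 0.3]
theta = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.8, 0.2], [0.4, 0.6]],
]
ii = [0, 0, 1, 1, 2, 2]
jj = [0, 1, 0, 1, 0, 1]
z, y = simulate(pi, theta, ii, jj)
```

Because annotator 0's rows of `theta` put all mass on the diagonal, that annotator's observed labels always equal the true labels, while annotator 1's labels can differ. Each annotator's accuracy and bias are fully described by that annotator's K-by-K response matrix.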

17 Inference
- Additively smoothed maximum likelihood estimation
- Equivalent to maximum a posteriori estimation in a Bayesian model with Dirichlet priors:
  $\theta_{j,k} \sim \mathrm{Dirichlet}(\alpha_k)$
  $\pi \sim \mathrm{Dirichlet}(\beta)$
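One common way to compute such smoothed estimates is expectation maximization (EM). The slide does not say which algorithm was used, so the following is only a sketch under that assumption. The function name `dawid_skene_em`, the single symmetric smoothing count `smooth`, the vote-based initialization, and the fixed iteration count are all choices made for this example. Adding a pseudo-count `smooth` to every cell corresponds to a MAP estimate under symmetric Dirichlet priors with concentration `smooth + 1`.

```python
import math

def dawid_skene_em(ii, jj, y, n_items, n_annotators, n_classes,
                   smooth=0.01, iters=50):
    """EM for the Dawid and Skene model with additive smoothing.

    Returns (pi, theta, q), where q[i][k] is the estimated probability
    that item i has true label k.
    """
    I, J, K = n_items, n_annotators, n_classes
    # Initialize q from smoothed per-item vote proportions.
    q = [[smooth] * K for _ in range(I)]
    for i, obs in zip(ii, y):
        q[i][obs] += 1.0
    q = [[v / sum(row) for v in row] for row in q]
    for _ in range(iters):
        # M-step: smoothed expected counts, then normalize.
        pi = [smooth + sum(q[i][k] for i in range(I)) for k in range(K)]
        pi = [p / sum(pi) for p in pi]
        theta = [[[smooth] * K for _ in range(K)] for _ in range(J)]
        for i, j, obs in zip(ii, jj, y):
            for k in range(K):
                theta[j][k][obs] += q[i][k]
        theta = [[[v / sum(row) for v in row] for row in mat] for mat in theta]
        # E-step: posterior over each item's true label, in log space.
        logq = [[math.log(pi[k]) for k in range(K)] for _ in range(I)]
        for i, j, obs in zip(ii, jj, y):
            for k in range(K):
                logq[i][k] += math.log(theta[j][k][obs])
        q = []
        for row in logq:
            top = max(row)
            w = [math.exp(v - top) for v in row]
            q.append([v / sum(w) for v in w])
    return pi, theta, q

# Five items, three annotators; only the last item has a disagreement.
votes = [[0, 0, 0], [0, 0, 0], [1, 1, 1], [1, 1, 1], [0, 0, 1]]
ii = [i for i, row in enumerate(votes) for _ in row]
jj = [j for row in votes for j, _ in enumerate(row)]
y = [label for row in votes for label in row]
pi, theta, q = dawid_skene_em(ii, jj, y, 5, 3, 2)
inferred = [row.index(max(row)) for row in q]
```

The returned `q` is what makes a model-based approach useful for the problems listed earlier: it gives a probability for each candidate label on every item, including items where annotators disagree, and `theta` gives a per-annotator response matrix rather than a single pairwise score.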

18 Prevalence: Add (v); α = 0.55; A = 0.72
[Figure: estimated sense prevalence for add-v. Series: MASC Freq, MASC Maj, MASC MLE, AMT Maj, AMT MLE. Categories: Other, Sense 1 through Sense 6.]

19 Prevalence: Help (v); α = 0.26; A = 0.58
[Figure: estimated sense prevalence for help-v. Series: MASC Freq, MASC Maj, MASC MLE, AMT Maj, AMT MLE. Categories: Other, Sense 1 through Sense 8.]

20 Annotator Heatmaps
Add (v); α = 0.55; A = 0.72
[Figure: annotator heatmaps for add-v]

21 Annotator Heatmaps
Help (v); α = 0.26; A = 0.58
[Figure: annotator heatmaps for help-v]

22 Ground Truth Labels
Add (verb); α = 0.55; A = 0.72
Table columns: Sense k, 0.99, Prop (only the row labels survive)

Sense k
make an addition; join
state or say further
bestow a quality on
constitute an addition
SubTot
Rest

23 Ground Truth Labels
Help (verb); α = 0.26; A = 0.58
Table columns: Sense k, 0.99, Prop (only the row labels survive)

Sense k
give assistance; be of service
improve the condition of
be of use
contribute to the furtherance of
SubTot
Rest

24 Cost Comparison

                              MASC       AMT
Lemmas                        116        45
Creating infrastructure       1 year     0.05 year
Annotation period             5 years    0.05 year
Cost of annotators            $80,000    $15,000
Cost per ground truth label   $0.70      $0.33
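The per-label costs can be roughly reproduced from figures stated elsewhere in the deck. The calculation below is an assumption about how they were derived, not something the slides state: it takes one ground truth label per sentence, 1,000 sentences per lemma (slide 10), 116 lemmas for MASC (slide 10), and 45 lemmas for AMT (slide 13).

```python
def cost_per_label(annotator_cost, lemmas, sentences_per_lemma=1000):
    """Annotator cost divided by the number of labeled sentences."""
    return annotator_cost / (lemmas * sentences_per_lemma)

masc_cost = cost_per_label(80_000, 116)  # trained annotators
amt_cost = cost_per_label(15_000, 45)    # crowdsourced labels
```

Under these assumptions the MASC figure comes out at about $0.69 per label and the AMT figure at about $0.33, close to the values in the table. The AMT cost covers 20 to 25 labels per sentence, so the cost per individual crowdsourced judgment is far lower than the cost per ground truth label.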

25 Summary of Contributions
- More information about the annotated data
- Higher quality corpus
- Lower cost
- Valuable corpus on moderately fine-grained word sense

26 Future Work
- Richer models, e.g., add a parameter for item difficulty
- Annotate more of the 71 remaining MASC lemmas
- Investigate/monitor utility of the corpus to
  - Train WSD
  - Study WordNet sense inventories
  - Evaluate WSD using probability distribution of senses per item
- Develop overall quality measure of the corpus (entropy)
