Evaluation Strategies


Evaluation
Regina Barzilay
April 26, 2004

Evaluation Strategies

- Intrinsic: comparison with an ideal output; subjective quality evaluation
- Extrinsic (task-based): impact of the output quality on the judge's performance in a particular task

Intrinsic Evaluation

Comparison with an ideal output. Challenges:

- Requires a large testing set
- Intrinsic subjectivity of some discourse-related judgments
- Hard to find corpora for training/testing
- Lack of standard corpora for most of the tasks
- Different evaluation methodology for different tasks

Especially suited for classification tasks:

- Typical measures: precision, recall, and F-measure (see the sketch after this list)
- A confusion matrix can help to understand the results
- Statistical significance tests are used to validate improvement
- Must include baselines, including a straw baseline (majority class, or random) and comparable methods
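The slides give no code; the following is a minimal sketch of how precision, recall, and F-measure can be computed for one target class. The label lists and the function name are invented for illustration.

```python
# Hypothetical example: per-class precision, recall, and F1 from gold and
# predicted labels. tp/fp/fn are counted with respect to a single target class.

def precision_recall_f1(gold, predicted, target):
    tp = sum(1 for g, p in zip(gold, predicted) if p == target and g == target)
    fp = sum(1 for g, p in zip(gold, predicted) if p == target and g != target)
    fn = sum(1 for g, p in zip(gold, predicted) if p != target and g == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold      = ["pos", "pos", "neg", "neg", "pos"]   # toy gold labels (assumed)
predicted = ["pos", "neg", "neg", "pos", "pos"]   # toy system output (assumed)
print(precision_recall_f1(gold, predicted, "pos"))  # (0.667, 0.667, 0.667)
```

F-measure here is the balanced F1, the harmonic mean of precision and recall; a majority-class or random baseline should be scored with the same function for comparison.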

Intrinsic Evaluation: Comparison with an Ideal Output

Advantages:

- Results can be easily reproduced
- Allows isolating different factors contributing to system performance

Disadvantages:

- In the presence of multiple ideal outputs, penalizes alternative solutions
- The distance between an ideal and a machine-generated output may not be proportional to the human perception of quality

Intrinsic Evaluation: Subjective Quality Evaluation

Advantages:

- Doesn't require any testing data
- Gives an easily understandable performance evaluation

Disadvantages:

- Requires several judges and a mechanism for dealing with disagreement
- Depends tightly on the quality of instructions
- Hard to reproduce

Task-based Evaluation

Examples:

- Dialogue systems: book a flight satisfying some requirements
- Summarization systems: retrieve a story about X from a collection of summaries
- Summarization systems: determine whether a paper X should be part of the related work for a paper on topic Y

Task-based Evaluation

Advantages:

- Doesn't require any testing data
- Gives an easily understandable performance evaluation

Disadvantages:

- Hard to isolate different components
- Hard to find a task with good discriminative power
- Requires multiple judges
- Hard to reproduce

Today

- Basics of annotation; agreement computation
- Estimating the difference in the distribution of two sets:
  - significance of a method's improvement
  - impact of a certain change on the system's performance
- Comparing rating schemes

Large Annotation Efforts

- Dialogue acts
- Coreference
- Discourse relations
- Summarization

The first three are available through LDC; the last one is available through DUC.

Developing an Annotation Scheme

Main steps:

1. Basic scheme
2. Preliminary annotation
3. Informal evaluation
4. Scheme revision and re-coding
5. Coding manual
6. Formal evaluation: inter-coder reliability
7. Ready to code real data

Basic Scheme

- Preliminary categories that seem to cover the range of phenomena of interest
- Different categories functionally important and/or easy to distinguish

Example: Dialogue Act Classification

- Informativeness: difference in conditions/effects vs. confidence in label
- Generalization vs. distinctions; example: state, assert, inform, confess, concede, affirm, claim
- Granularity: complex, multi-functional acts vs. simple acts (the latter relies on multi-class classification)

Example: Dialogue Act Classification (Taxonomy Principles)

- Activity-specific:
  - must cover activity features
  - make crucial distinctions
  - avoid irrelevant distinctions
- General:
  - aim to cover all activities
  - specific activities work in a sub-space
  - activity-specific clusters as macros

Preliminary Annotation

- Algorithm:
  - automated annotation if possible
  - semi-automated (partial, supervised decisions)
  - decision trees for human annotators
- Definitions, guidelines
- Trial run with multiple annotators, ideally following the official guidelines or algorithm rather than being informally taught

Informal Evaluation and Development

- Analysis of problematic annotations:
  - Are some categories missing?
  - Are some categories indistinguishable for some coding decisions?
  - Do categories overlap?
- Meetings between annotators and scheme designers and users
- Revision of annotation guidelines
- More annotations
- Result: annotation manual

Formal Evaluation

- Controlled coding procedures:
  - individuals coding unseen data
  - coding on the basis of the manual
  - no discussion between coders
- Evaluation of inter-coder reliability:
  - confusion matrix
  - statistical measure of agreement

Reliability of Annotations

- The performance of an algorithm has to be evaluated against some kind of correct solution, the key.
- For most linguistic tasks, "correct" can be defined using human performance (not linguistic intuition).
- If different humans get different solutions for the same task, it is questionable which solution is correct and whether the task can be solved by humans at all.
- Measures of reliability have to be used to test whether human performance is reliable.
- If human performance is indeed reliable, the solution produced by humans can be used as a key against which an algorithm is evaluated.

Reliability of Annotations (cont.)

- Kowtko et al. (1992) and Litman & Hirschberg use pairwise agreement between naive annotators.
- Silverman et al. (1992) have two groups of annotators: a small group of experienced annotators and a large group of naive annotators. Assumption: the annotations are reliable if there is only a small difference between the groups.
- However, what does reliability mean in these cases?

Agreement: Balanced Distribution

Two annotators assign each of ten items to one of the categories A, B, C; each cell gives the number of annotators who chose that category:

Item   A  B  C
  1    2  0  0
  2    2  0  0
  3    2  0  0
  4    0  2  0
  5    0  2  0
  6    0  2  0
  7    0  0  2
  8    0  0  2
  9    0  0  2
 10    1  1  0

The annotators agree on 9 of the 10 items: P(A) = 9/10 = 0.9.

Agreement: Balanced Distribution (cont.)

For the same table, the agreement expected by chance follows from the overall category proportions (T = 20 annotations in total, of which 7 are A, 7 are B, and 6 are C):

$$P(E) = (A/T)^2 + (B/T)^2 + (C/T)^2 = 0.35^2 + 0.35^2 + 0.3^2 = 0.335$$

Agreement: Skewed Distribution

Item   A  B  C
  1    2  0  0
  2    2  0  0
  3    2  0  0
  4    2  0  0
  5    2  0  0
  6    2  0  0
  7    2  0  0
  8    2  0  0
  9    2  0  0
 10    0  1  1

P(A) = 9/10 = 0.9

$$P(E) = (A/T)^2 + (B/T)^2 + (C/T)^2 = 0.9^2 + 0.05^2 + 0.05^2 = 0.815$$

Kappa

The kappa statistic can be used when multiple annotators have to assign markables to one of a set of non-ordered classes. Kappa is defined as

$$K = \frac{P(A) - P(E)}{1 - P(E)}$$

where P(A) is the actual agreement between annotators and P(E) is the agreement expected by chance (a code sketch follows).
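A minimal sketch of the computation above, assuming the two-annotator tables from the previous slides; the function and variable names are my own.

```python
# Hypothetical helper: kappa for two annotators assigning items to unordered
# categories. Each row holds the number of annotators who chose each category.

def kappa(table):
    n_items = len(table)
    total = sum(sum(row) for row in table)          # T: total annotations (20)
    # P(A): fraction of items on which both annotators agree
    p_a = sum(1 for row in table if max(row) == 2) / n_items
    # P(E): chance agreement from the overall category proportions
    p_e = sum((sum(row[j] for row in table) / total) ** 2
              for j in range(len(table[0])))
    return (p_a - p_e) / (1 - p_e)

balanced = [[2, 0, 0]] * 3 + [[0, 2, 0]] * 3 + [[0, 0, 2]] * 3 + [[1, 1, 0]]
skewed   = [[2, 0, 0]] * 9 + [[0, 1, 1]]
print(round(kappa(balanced), 2))  # 0.85
print(round(kappa(skewed), 2))    # 0.46
```

The sketch makes the skew effect concrete: observed agreement is 0.9 in both tables, but the skewed distribution raises chance agreement and roughly halves kappa.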

Kappa Interpretation

- Complete agreement: K = 1; random agreement: K = 0
- In our example, K for the balanced set is 0.85 and for the skewed one is 0.46
- Typically, K > 0.8 indicates good reliability
- Many statisticians do not like kappa! (alternative: interclass agreement)

Today

- Basics of annotation; agreement computation
- Estimating the difference in the distribution of two sets:
  - significance of a method's improvement
  - impact of a certain change on the system's performance
- Comparing rating schemes

Student's t-test

- Goal: determine whether two distributions are different
- Samples are selected independently
- Example scenario: we want to test whether adding parsing information improves the performance of a summarization system
- Null hypothesis: the difference is due to chance
- For N = 10, the 95% confidence interval for the mean is $\bar{X} \pm 2.26\,\sigma/\sqrt{N-1}$
- Statistical significance: the probability that the difference is due to chance

Paired Data

- Goal: determine the impact of a certain factor on a given distribution
- The test is performed on the same sample
- Example scenario: we want to test whether adding parsing information improves the performance of a summarization system on a predefined set of texts
- Null hypothesis: the actual mean difference is consistent with zero (both test variants are sketched below)
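A hedged sketch of both tests using SciPy. The score arrays are invented for illustration; in the paired case the i-th entries are assumed to come from the same text under the two conditions.

```python
# Hypothetical summarization scores with and without parsing information
# (invented numbers, one score per test text).
from scipy import stats

baseline   = [0.61, 0.58, 0.66, 0.63, 0.59, 0.64, 0.60, 0.62, 0.65, 0.57]
with_parse = [0.64, 0.60, 0.69, 0.66, 0.61, 0.68, 0.63, 0.66, 0.67, 0.60]

# Independent samples: treat the two score sets as unrelated distributions.
t, p = stats.ttest_ind(baseline, with_parse)
print(f"unpaired t-test: t = {t:.2f}, p = {p:.3f}")

# Paired data: the same texts under both conditions, so test whether the
# mean per-text difference is consistent with zero.
t, p = stats.ttest_rel(baseline, with_parse)
print(f"paired t-test:   t = {t:.2f}, p = {p:.3f}")
```

On data like this, the paired test is typically more sensitive, because per-text variation cancels out of the differences.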

Anova

- Goal: determine the impact of a certain factor on several distributions (assumes a cause-effect relation)
- Samples are selected independently
- Null hypothesis: the difference is due to chance
- Computation:

$$F = \frac{\text{found variation of the group averages}}{\text{expected variation of the group averages}}$$

- Interpretation: if F ≈ 1 the null hypothesis is correct, while large values of F confirm the impact

Chi-squared Test

- Goal: compare observed counts with expected counts
- Example scenario: we want to test whether the number of backchannels in a dialogue predicted by our algorithm is consistent with their distribution in real text
- Assume a normal distribution with mean $\mu$ and standard deviation $\sigma$:

$$\chi^2 = \sum_{i=1}^{k} \frac{(x_i - \mu)^2}{\sigma^2}$$

Chi-squared Test (cont.)

- In some cases we don't know the standard deviation for each count (a sketch of this form follows):

$$X^2 = \sum_{i=1}^{k} \frac{(x_i - E_i)^2}{E_i}, \qquad E_i = p_i N$$

- Assume a Poisson distribution (the standard deviation equals the square root of the expected count)
- Restriction: not applicable for small $E_i$
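A minimal sketch of the goodness-of-fit form above using SciPy, with invented backchannel counts and proportions, plus a one-way ANOVA call for the F-ratio test from the previous slide (the group scores are likewise invented).

```python
# Hypothetical counts: backchannels per dialogue type as predicted by an
# algorithm, against proportions p_i estimated from real text.
from scipy.stats import chisquare, f_oneway

observed = [18, 7, 5]                  # x_i, invented for illustration
p = [0.6, 0.25, 0.15]                  # p_i from the reference corpus (assumed)
N = sum(observed)
expected = [p_i * N for p_i in p]      # E_i = p_i * N

stat, pvalue = chisquare(observed, f_exp=expected)
print(f"X^2 = {stat:.3f}, p = {pvalue:.3f}")   # small X^2: counts are consistent

# One-way ANOVA over several groups: F near 1 supports the null hypothesis,
# large F indicates the factor has an impact.
g1, g2, g3 = [0.61, 0.63, 0.60], [0.64, 0.66, 0.65], [0.71, 0.70, 0.73]
F, pvalue = f_oneway(g1, g2, g3)
print(f"F = {F:.1f}, p = {pvalue:.4f}")
```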

Today

- Basics of annotation; agreement computation
- Estimating the difference in the distribution of two sets
- Comparing rating schemes

Kendall's τ

- Goal: estimate the agreement between two orderings
- Computation:

$$\tau = 1 - \frac{2I}{N(N-1)/2}$$

where N is the sample size and I is the minimal number of interchanges required to map the first ordering into the second.
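A minimal sketch of the formula above; assuming "interchanges" means swaps of adjacent items, I equals the number of pairwise inversions between the two orderings, which is what is counted here. Names are my own.

```python
# Hypothetical implementation: Kendall's tau between two rankings of the
# same items, via the inversion count I.
from itertools import combinations

def kendall_tau(order_a, order_b):
    n = len(order_a)
    pos = {item: i for i, item in enumerate(order_b)}
    # I: item pairs that appear in opposite relative order in the two rankings
    inversions = sum(1 for x, y in combinations(order_a, 2) if pos[x] > pos[y])
    return 1 - 2 * inversions / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0: identical orderings
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0: completely reversed
print(kendall_tau([1, 2, 3, 4], [2, 1, 3, 4]))  # ~0.67: one interchange
```

τ ranges from 1 (identical orderings) to -1 (completely reversed); for tie-free rankings like these, scipy.stats.kendalltau returns the same values.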