Information Retrieval Tutorial 6: Evaluation

Professor: Michel Schellekens, TA: Ang Gao. University College Cork, 2012-11-30.

Overview

Precision and recall

Precision (P) is the fraction of retrieved documents that are relevant:
Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)

Recall (R) is the fraction of relevant documents that are retrieved:
Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)

Precision and recall

                 Relevant               Nonrelevant
Retrieved        true positives (TP)    false positives (FP)
Not retrieved    false negatives (FN)   true negatives (TN)

P = TP/(TP + FP)
R = TP/(TP + FN)
accuracy = (TP + TN)/(TP + FP + FN + TN)
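These formulas translate directly into code. A minimal Python sketch follows; the TP/FP/FN/TN counts in the example are invented for illustration, not taken from the slides.

    def precision(tp, fp):
        """Fraction of retrieved documents that are relevant."""
        return tp / (tp + fp)

    def recall(tp, fn):
        """Fraction of relevant documents that are retrieved."""
        return tp / (tp + fn)

    def accuracy(tp, fp, fn, tn):
        """Fraction of all relevant/nonrelevant decisions that are correct."""
        return (tp + tn) / (tp + fp + fn + tn)

    # Hypothetical contingency table: 25 relevant retrieved, 75 junk retrieved,
    # 50 relevant missed, 10,000 nonrelevant correctly not retrieved.
    tp, fp, fn, tn = 25, 75, 50, 10_000
    print(precision(tp, fp))         # 0.25
    print(recall(tp, fn))            # 0.333...
    print(accuracy(tp, fp, fn, tn))  # ~0.988, dominated by the huge TN count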

Accuracy

Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct. In terms of the contingency table above, accuracy = (TP + TN)/(TP + FP + FN + TN).

Why is accuracy not a useful measure for web information retrieval?

Answer: In an IR setting, normally only a small fraction of the documents in the collection are relevant, so TN >> TP. Even for a good IR system that retrieves only relevant documents, the gap in accuracy between it and a poor system (such as one that always returns nothing) is tiny, so accuracy does not help us evaluate IR systems.

Exercise

The snoogle search engine below always returns 0 results ("0 matching results found"), regardless of the query. Why does snoogle demonstrate that accuracy is not a useful measure in IR?

Why accuracy is a useless measure in IR

Simple trick to maximize accuracy in IR: always say no and return nothing. You then get 99.99% accuracy on most queries.

Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk. It's better to return some bad hits as long as you return something. We use precision, recall, and F for evaluation, not accuracy.
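To make the return-nothing trick concrete, here is a small sketch; the collection size and number of relevant documents are invented for illustration.

    # Hypothetical collection: 1,000,000 documents, only 50 relevant to the query.
    N, n_relevant = 1_000_000, 50

    # "Snoogle": retrieve nothing. TP = FP = 0, FN = 50, TN = 999,950.
    tp, fp, fn, tn = 0, 0, n_relevant, N - n_relevant

    accuracy = (tp + tn) / N
    recall = tp / (tp + fn)
    print(f"accuracy = {accuracy:.4%}")  # 99.9950% -- looks great
    print(f"recall   = {recall:.4%}")    # 0.0000%  -- finds nothing at all
    # Precision is undefined here (nothing retrieved); by convention it is often set to 0.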

Precision/recall tradeoff

You can increase recall by returning more docs. Recall is a non-decreasing function of the number of docs retrieved. A system that returns all docs has 100% recall!

The converse is also true (usually): it's easy to get high precision for very low recall.

Which is better: IR system 1 with P = 63%, R = 57%, or IR system 2 with P = 69%, R = 60%?

A combined measure: F

F allows us to trade off precision against recall:

F = 1 / (α(1/P) + (1 - α)(1/R)) = (β² + 1)PR / (β²P + R),  where β² = (1 - α)/α

α ∈ [0, 1] and thus β² ∈ [0, ∞].

Most frequently used: balanced F with β = 1 (i.e. α = 0.5). This is the harmonic mean of P and R: 1/F = (1/2)(1/P + 1/R), i.e. F = 2PR/(P + R).
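A short sketch of the general F measure, applied to the two systems from the precision/recall tradeoff slide (P = 63%, R = 57% versus P = 69%, R = 60%):

    def f_measure(p, r, beta=1.0):
        """F = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives the harmonic mean."""
        b2 = beta ** 2
        return (b2 + 1) * p * r / (b2 * p + r)

    print(f_measure(0.63, 0.57))  # ~0.599  (system 1)
    print(f_measure(0.69, 0.60))  # ~0.642  (system 2: higher P and higher R, so higher F1)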

F: Exercise

                 relevant    not relevant
retrieved              20              40          60
not retrieved          60       1,000,000   1,000,060
                       80       1,000,040   1,000,120

P = 20/(20 + 40) = 1/3
R = 20/(20 + 60) = 1/4
F1 = 2 / (1/P + 1/R) = 2 / (3 + 4) = 2/7
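The same calculation can be checked exactly with Python's fractions module:

    from fractions import Fraction

    P = Fraction(20, 20 + 40)   # 1/3
    R = Fraction(20, 20 + 60)   # 1/4
    F1 = 2 / (1 / P + 1 / R)    # harmonic mean of P and R
    print(P, R, F1)             # 1/3 1/4 2/7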

F: Why harmonic mean?

Why don't we use a different mean of P and R as a measure? E.g., the arithmetic mean (P + R)/2.

The simple (arithmetic) mean is 50% for a return-everything search engine, which is too high. Desideratum: punish really bad performance on either precision or recall. Taking the minimum achieves this, but the minimum is not smooth and is hard to weight. F (the harmonic mean) is a kind of smooth minimum.
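A quick numeric sketch of this point; the precision value for the hypothetical return-everything engine is invented for illustration.

    P, R = 0.0001, 1.0   # return-everything engine: perfect recall, almost no precision

    arithmetic = (P + R) / 2
    harmonic = 2 * P * R / (P + R)

    print(arithmetic)  # ~0.5     -- rewards the useless engine
    print(harmonic)    # ~0.0002  -- close to min(P, R), punishes it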

Difficulties in using precision, recall and F

We need relevance judgments for information-need/document pairs, but they are expensive to produce. For alternatives to using precision/recall and having to produce relevance judgments ...

Framework for the evaluation of an IR system

Test collection, consisting of (i) a document collection, (ii) a test suite of information needs, and (iii) a set of relevance judgements for each doc-query pair.

Gold-standard judgement of relevance: classification of a document either as relevant or as irrelevant with respect to an information need.
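One possible way to represent such gold-standard judgements in code; the dictionary layout, identifiers and example values below are illustrative assumptions, not part of the tutorial.

    # Gold-standard judgements: for each (query_id, doc_id) pair,
    # True if the document is relevant to that information need.
    qrels = {
        ("q1", "d7"): True,
        ("q1", "d9"): False,
        ("q2", "d3"): True,
    }

    def is_relevant(query_id, doc_id):
        """Binary gold-standard relevance; unjudged pairs default to nonrelevant."""
        return qrels.get((query_id, doc_id), False)

    print(is_relevant("q1", "d7"))  # True
    print(is_relevant("q1", "d4"))  # False (unjudged)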

Assessing relevance

How good is an IR system at satisfying an information need? This requires agreement between judges, computable via the kappa statistic:

kappa = (P(A) - P(E)) / (1 - P(E))

where P(A) is the proportion of agreements within the judgements and P(E) is the agreement we would expect by chance.

Assessing relevance: an example

Consider the following judgements (from Manning et al., 2008):

                       Judge 2
                   Yes     No    Total
Judge 1   Yes      300     20     320
          No        10     70      80
          Total    310     90     400

P(A) is the proportion of agreements within the judgements:
P(A) = 370/400

P(E) is the proportion of expected agreements:
P(E) = P(rel)² + P(notrel)²
P(rel) = (1/2)(320/400) + (1/2)(310/400) = (320 + 310)/800
P(notrel) = (80 + 90)/800

kappa = (P(A) - P(E)) / (1 - P(E)) = 0.776

Exercise

Consider the following judgements:

                       Judge 2
                   Yes     No    Total
Judge 1   Yes      120     30     150
          No        30     20      50
          Total    150     50     200

P(A) = (120 + 20)/200 = 0.7
P(rel) = (150 + 150)/400 = 0.75    P(notrel) = (50 + 50)/400 = 0.25
P(E) = P(rel)² + P(notrel)² = 0.75² + 0.25² = 0.625
k = (P(A) - P(E)) / (1 - P(E)) = (0.7 - 0.625)/(1 - 0.625) = 0.2
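As a check on both the worked example (k ≈ 0.776) and this exercise (k = 0.2), a small sketch of the kappa computation from a 2x2 agreement table:

    def kappa(yes_yes, yes_no, no_yes, no_no):
        """Kappa statistic for two judges, from their 2x2 agreement counts."""
        n = yes_yes + yes_no + no_yes + no_no
        p_agree = (yes_yes + no_no) / n
        # Pooled marginals: probability that a randomly chosen judgement says "relevant".
        p_rel = (2 * yes_yes + yes_no + no_yes) / (2 * n)
        p_chance = p_rel ** 2 + (1 - p_rel) ** 2
        return (p_agree - p_chance) / (1 - p_chance)

    print(round(kappa(300, 20, 10, 70), 3))  # 0.776  (Manning et al. example)
    print(round(kappa(120, 30, 30, 20), 3))  # 0.2    (exercise above)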

Assessing relevance (continued)

Interpretation of the kappa statistic k: values of k in the interval [2/3, 1.0] are seen as acceptable; with smaller values, the relevance assessment methodology needs to be redesigned. Note that the kappa statistic can be negative if the agreement between judgements is worse than random.

In case of large variations between judgements, one can choose one assessor as a gold standard. This has considerable impact on the absolute assessment but little impact on the relative assessment.

Exercise

Below is a table showing how two human judges rated the relevance of a set of 12 documents to a particular information need (0 = nonrelevant, 1 = relevant). Let us assume that you have written an IR system that for this query returns the set of documents {4, 5, 6, 7, 8}.

(a) Calculate the kappa measure between the two judges.
(b) Calculate precision, recall, and F1 of your system if a document is considered relevant only if the two judges agree.
(c) Calculate precision, recall, and F1 of your system if a document is considered relevant if either judge thinks it is relevant.

Solution

Part a.
P(A) = 4/12
P(rel) = 12/24    P(notrel) = 12/24
P(E) = P(rel)² + P(notrel)² = 1/2
k = (P(A) - P(E)) / (1 - P(E)) = (1/3 - 1/2)/(1 - 1/2) = -1/3

Part b. Relevant = {3, 4}, Retrieved = {4, 5, 6, 7, 8}
P = 1/5    R = 1/2    F1 = 2PR/(P + R) = 2/7

Part c. Relevant = {3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, Retrieved = {4, 5, 6, 7, 8}
P = 5/5 = 1    R = 5/10 = 1/2    F1 = 2PR/(P + R) = 2/3
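A set-based check of parts b and c, using the document IDs given in the solution:

    def prf(relevant, retrieved):
        """Precision, recall and F1 for sets of relevant and retrieved doc IDs."""
        tp = len(relevant & retrieved)
        p = tp / len(retrieved)
        r = tp / len(relevant)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    retrieved = {4, 5, 6, 7, 8}
    print(prf({3, 4}, retrieved))             # (0.2, 0.5, ~0.286)  -- part b
    print(prf(set(range(3, 13)), retrieved))  # (1.0, 0.5, ~0.667)  -- part c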