Advances in IR Evaluation. Ben Carterette, Evangelos Kanoulas, Emine Yilmaz

Course Outline. Intro to evaluation: evaluation methods, test collections, measures, comparable evaluation. Low-cost evaluation. Advanced user models: web search models, novelty & diversity, sessions. Reliability: significance tests, reusability. Other evaluation setups.

Cost vs Reliability. [Diagram with labels: Judgment Effort, Judges, Search Engines, Results.]

Average? INQ604 - 0.281, ok8alx - 0.324, CL99XT - 0.373.

How much is the average a product of your IR system, and how much is chance? With a slightly different set of topics, would the average change?

Reliability. Reliability: the extent to which results reflect real differences (not due to chance). Variability in effectiveness scores is due to: (1) differentiation in the nature of documents (corpus), (2) differentiation in the nature of topics, (3) incompleteness of relevance judgments, (4) inconsistency among assessors. [Matrix of systems (sys 1 ... sys M) by topics (1 ... k) with cells R (relevant), N (non-relevant), or ? (unjudged).]

Reliability. Variability in effectiveness scores is due to: (1) differentiation in the nature of documents, (2) differentiation in the nature of topics, (3) incompleteness of judgments, (4) inconsistency among assessors. Studies: document corpus size [Hawking and Robertson J of IR 2003]; topics vs assessors [Voorhees SIGIR98, Banks et al Inf Retr 99, Bodoff and Li SIGIR07]; effective size of the topic set for reliable evaluation [Buckley and Voorhees SIGIR00/SIGIR02, Sanderson and Zobel SIGIR05]; topics vs documents (Million Query track) [Allan et al TREC07, Carterette et al SIGIR08].

Is C really better than A? [Figure: effectiveness scores (y-axis, 0 to 1) per topic (x-axis) for the systems being compared.]

Comparison and Significance. Variability in effectiveness scores: when observing a difference in effectiveness scores across two retrieval systems, does this difference occur by random chance? Significance testing estimates the probability p of observing a certain difference in effectiveness given that H0 is true. In IR evaluation, H0: the two systems are in effect the same and any difference in scores is due to random chance.

Significance Testing. The significance testing framework: two hypotheses, e.g. H0: μ = 0 and Ha: μ ≠ 0; system performance measurements over a sample of topics; a test statistic T computed from those measurements; the distribution of the test statistic; and a p-value, which is the probability of sampling T from the distribution obtained by assuming H0 is true.

Student's t-test. A parametric test. Assumptions: (1) effectiveness score differences are meaningful; (2) effectiveness score differences follow a normal distribution. Statistic: t = (mean(B-A) / σ(B-A)) √n (see the worked example below). The t-test performs well even when the normality assumption is violated [Hull SIGIR93].

Student's t-test.

Query   A      B      B-A
1       0.25   0.35   +0.10
2       0.43   0.84   +0.41
3       0.39   0.15   -0.24
4       0.75   0.75    0.00
5       0.43   0.68   +0.25
6       0.15   0.85   +0.70
7       0.20   0.80   +0.60
8       0.52   0.50   -0.02
9       0.49   0.58   +0.09
10      0.50   0.75   +0.25

mean(B-A) = 0.214, σ(B-A) = 0.291, t = (mean(B-A) / σ(B-A)) √n = 2.33

Student's t-test. mean(B-A) = 0.214, σ(B-A) = 0.291, t = (mean(B-A) / σ(B-A)) √n = 2.33, p-value = 0.02.
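Below is a minimal sketch of this paired t-test in Python using SciPy. The per-query scores are taken from the table above; the choice of SciPy and of the one-sided alternative are mine, not the slides'.

```python
# Paired t-test on the per-query A/B scores from the table above.
from scipy import stats

a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
b = [0.35, 0.84, 0.15, 0.75, 0.68, 0.85, 0.80, 0.50, 0.58, 0.75]

# alternative="greater" tests Ha: mean(B - A) > 0 (one-sided), which is
# consistent with the p-value of about 0.02 quoted on the slide.
t_stat, p_value = stats.ttest_rel(b, a, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")   # roughly t = 2.33, p = 0.02
```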

Sign Test. A non-parametric test that ignores the magnitude of differences. The null hypothesis for this test is that P(B > A) = P(A > B) = ½. Statistic: the number of pairs where B > A.
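A minimal sketch of the sign test on the same ten queries, assuming SciPy 1.7+ for stats.binomtest; the tie on query 4 is dropped, as is standard for this test.

```python
# Sign test: count how often B beats A and compare against Binomial(n, 0.5).
from scipy import stats

a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
b = [0.35, 0.84, 0.15, 0.75, 0.68, 0.85, 0.80, 0.50, 0.58, 0.75]

wins = sum(bi > ai for ai, bi in zip(a, b))   # pairs where B > A
n = sum(bi != ai for ai, bi in zip(a, b))     # non-tied pairs

result = stats.binomtest(wins, n, p=0.5)      # exact binomial test under H0
print(f"B > A on {wins}/{n} queries, p = {result.pvalue:.3f}")
```

On this data B wins 7 of 9 non-tied queries, which the exact binomial test does not find significant at the 0.05 level; discarding the magnitudes loses the information that made the t-test significant.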

Wilcoxon Signed-Ranks Test. A non-parametric test. Statistic: w = sum of the signed ranks R_i over i = 1 ... N, where R_i is the signed rank of the i-th absolute difference and N is the number of differences ≠ 0.

Wilcoxon Signed-Ranks Test.

Query   A      B      B-A     Sorted   Signed rank
1       0.25   0.35   +0.10   -0.02    -1
2       0.43   0.84   +0.41   +0.09    +2
3       0.39   0.15   -0.24   +0.10    +3
4       0.75   0.75    0.00   -0.24    -4
5       0.43   0.68   +0.25   +0.25    +5
6       0.15   0.85   +0.70   +0.25    +6
7       0.20   0.80   +0.60   +0.41    +7
8       0.52   0.50   -0.02   +0.60    +8
9       0.49   0.58   +0.09   +0.70    +9
10      0.50   0.75   +0.25

w = 35 => p = 0.025 (the zero difference on query 4 is discarded)
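The same test in SciPy, as a minimal sketch. Note that SciPy reports W+ (the sum of the positive ranks) rather than the sum of signed ranks w used above; the two are related by w = 2*W+ - N(N+1)/2. The p-value comes out close to, but not necessarily identical to, the 0.025 on the slide, since SciPy may use a normal approximation when there are tied differences.

```python
# Wilcoxon signed-ranks test on the paired A/B scores; zero_method="wilcox"
# discards the zero difference on query 4, matching the hand computation.
from scipy import stats

a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
b = [0.35, 0.84, 0.15, 0.75, 0.68, 0.85, 0.80, 0.50, 0.58, 0.75]

stat, p_value = stats.wilcoxon(b, a, zero_method="wilcox", alternative="greater")
print(f"W+ = {stat}, p = {p_value:.3f}")   # W+ = 40, i.e. w = 2*40 - 45 = 35
```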

Randomisation test. Loop many times { (1) load the topic scores for the 2 systems; (2) randomly swap the two values per topic; (3) compute the average for each system; (4) compute the difference between the averages; (5) add the difference to an array }. Sort the array. If the actual difference falls outside 95% of the differences in the array, the two systems are significantly different. A sketch of this procedure follows.
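A minimal sketch of this procedure in Python; instead of sorting the array and checking whether the observed difference lies outside the central 95%, it directly counts how often a permuted difference is at least as extreme as the observed one, which gives the same decision at the 0.05 level.

```python
# Randomisation (permutation) test for paired per-topic scores of two systems.
import random

def randomisation_test(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided p-value for the observed difference in mean scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = sum(scores_b) / n - sum(scores_a) / n
    extreme = 0
    for _ in range(trials):
        # Step 2 above: randomly swap the two systems' scores on each topic.
        diff_sum = 0.0
        for sa, sb in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                sa, sb = sb, sa
            diff_sum += sb - sa
        if abs(diff_sum / n) >= abs(observed):
            extreme += 1
    return extreme / trials

a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
b = [0.35, 0.84, 0.15, 0.75, 0.68, 0.85, 0.80, 0.50, 0.58, 0.75]
print(f"randomisation p = {randomisation_test(a, b):.3f}")
```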

Inference from Hypothesis Tests. We often use tests to make an inference about the population based on a sample, e.g. infer that H0 is false in the population of topics. That inference can be wrong for various reasons: sampling bias, measurement error, random chance. There are two classes of errors: Type I, or false positives: we reject H0 even though it is true; Type II, or false negatives: we fail to reject H0 even though it is false.

Errors in Inference. A significance test is basically a classifier:

                              H0 false        H0 true
p < 0.05 (reject H0)          correct         Type I error
p ≥ 0.05 (do not reject H0)   Type II error   correct

We can't actually know whether H0 is true or not; if we could, we wouldn't need the test. But we can set up the test to control the expected Type I and Type II error rates.

Expected Type I Error Rate. The test parameter α is used to decide whether to reject H0 or not: if p < α, then reject H0. Choosing α is equivalent to stating an expected Type I error rate, e.g. if p < 0.05 is considered significant, we are saying that we expect to incorrectly reject H0 5% of the time. Why? Because when H0 is true, every p-value is equally likely to be observed; 5% of the time we will observe a p-value less than 0.05, and therefore there is a 5% Type I error rate.

[Figure: typical distribution of the test statistic assuming H0 is true, and the corresponding distribution of p-values assuming H0 is true; 5% chance of incorrectly concluding H0 is false.]
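The uniformity claim on the previous slide is easy to check by simulation; a minimal sketch, assuming normally distributed per-topic score differences with true mean zero (an illustrative choice, not part of the slides).

```python
# When H0 is true, p-values are uniform, so about 5% fall below 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_topics, n_experiments = 50, 10_000

# Each row is one simulated experiment: 50 per-topic differences drawn under H0.
diffs = rng.normal(loc=0.0, scale=0.1, size=(n_experiments, n_topics))
p_values = stats.ttest_1samp(diffs, popmean=0.0, axis=1).pvalue

print(f"fraction of p-values below 0.05: {np.mean(p_values < 0.05):.3f}")  # ~0.05
```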

Expected Type II Error Rate. What about Type II errors? False negatives are bad: if we can't reject H0 when it's false, we may miss out on interesting results. What is the distribution of p-values when H0 is false? Problem: there is only one way H0 can be true, but there are many ways it can be false.

[Figure: typical distribution of the test statistic assuming H0 is true; one of many possible distributions if H0 is false; and the distribution of p-values for that alternative, giving a 0.64 probability of rejecting H0 and a 0.36 Type II error rate.]

Power and Expected Type II Error Rate. Power is the probability of rejecting H0 when it is false, equivalent to (1 - Type II error rate). The power parameter is β. Choosing a value of β therefore entails setting an expected Type II error rate. But what does it mean to choose a value of β? In the previous slide, β was calculated post hoc, not chosen a priori.

Effect Size. A measure of the magnitude of the difference between two systems. Effect size is dimensionless, intuitively similar to a % change in performance. The bigger the population effect size, the more likely we are to find a significant difference in a sample. Before starting to test, we can say "I want to be able to detect an effect size of h with probability β", e.g. if there is at least a 5% difference, the test should say the difference is significant with 80% probability (h = 0.05, β = 0.8).

Sample Size. Once we have chosen α, β, h (Type I error rate, power, and desired effect size), we can determine the sample size needed to make the error rates come out as desired: n = f(α, β, h). This usually involves a linear search; there are software tools to do this (a sketch follows below). Basically: sample size n increases with β if the other parameters are held constant; if you want more power, you need more samples.
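As a minimal sketch of such a tool, statsmodels can solve n = f(α, β, h) for a paired t-test; note that it expects a standardized effect size (Cohen's d), whereas the h = 0.05 on the previous slide is a raw difference, so converting between the two requires the standard deviation of the score differences. The particular numbers below are illustrative.

```python
# Sample-size calculation for a one-sample / paired t-test with statsmodels.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                         alternative="two-sided")
print(f"topics needed: {n:.1f}")   # roughly 34 topics for these settings

# Holding alpha and effect size fixed, asking for more power costs more topics.
for power in (0.7, 0.8, 0.9):
    n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=power)
    print(f"power {power}: {n:.1f} topics")
```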

Implications for Low-Cost Evaluation. First consider how Type I and Type II errors can happen due to experimental design rather than randomness. Sampling bias usually increases Type I error; measurement error usually increases Type II error. When judgments are missing, measurements are more errorful, and therefore Type II error is higher than expected.

Implications for Low-Cost Evaluation. What is the solution? If Type II error increases, power decreases. To get power back up to the desired level, the sample size must increase. Therefore: deal with reduced numbers of judgments by increasing the number of topics. Of course, each new topic requires its own judgments; a cost-benefit analysis finds the sweet spot where power is as desired within the available budget for judging.

Tradeoffs in Experimental Design. [Figure: plots relating the power of the experiment, the number of judgments, and the number of queries, with curves shown for 50 and 100 queries.]

Criticism of Significance Tests. Note: the probability p is not the probability that H0 is true; p = P(Data | H0) ≠ P(H0 | Data).

Criticism of Significance Tests. Are significance tests appropriate for IR evaluation? Is there any single best test to be used? [Saracevic CSL68, Van Rijsbergen 79, Robertson IPM90, Hull SIGIR93, Zobel SIGIR98, Sanderson and Zobel SIGIR05, Cormack and Lynam SIGIR06, Smucker et al SIGIR09]. Are they useful at all? With a large number of samples any difference in effectiveness will be statistically significant, e.g. 30,000 queries in the new Yahoo! and MSR collections. Strong form of hypothesis testing [Meehl PhilSci67, Popper 59, Cohen AmerPsy94].

Criticism of Significance Tests. What to do? Improve our data (make it as representative as possible). Report confidence intervals [Cormack and Lynam SIGIR06, Yilmaz et al SIGIR08, Carterette et al SIGIR08]. Examine whether results are practically significant [Allan et al SIGIR05, Thomas and Hawking CIKM06, Turpin and Scholer SIGIR06, Joachims SIGIR02, Radlinski et al CIKM08, Radlinski and Craswell SIGIR10, Sanderson et al SIGIR10].

Cost vs Reusability. [Diagram with labels: Judgment Effort, Judges, Search Engines, Results.]

Reusability. Why reusable test collections? The high cost of constructing test collections: amortize the cost by reusing test collections, and make retrieval system results comparable, with test collections used by systems that did not participate while the collections were constructed. Low cost vs reusability: if not all relevant documents are identified, the effectiveness of a system that did not contribute to the pool may be underestimated.

Reusability. [Diagram: a new system S returns a ranked list on Topic 1, but the existing relevance judgments for Topics 1 ... N leave many of its retrieved documents unjudged (cells marked N or ?).]

Reusability. Reusability is hard to guarantee. Can we test for reusability? Leave-one-out [Zobel SIGIR98, Voorhees CLEF01, Sanderson and Joho SIGIR04, Sakai CIKM08]. Through experimental design [Carterette et al SIGIR10].

Reusability - Leave-one-out. [Matrices: systems sys 1 ... sys M by topics 1 ... k; one matrix of relevance judgments (R/N) and one showing which documents (letters) each system contributed for judging.]

Reusability - Leave-one-out. Assume that a system did not participate; evaluate the non-participating system with the rest of the judgments. [Matrices as before, with the held-out system's unique contributions now unjudged (?).]

Reusability. Reusability is hard to guarantee. Can we test for reusability? Leave-one-out [Zobel SIGIR98, Voorhees CLEF01, Sanderson and Joho SIGIR04, Sakai CIKM08]. Through experimental design [Carterette et al SIGIR10].

Low-Cost Experimental Design.

Query 1   0.353   0.421   0.331
Query 2   0.251   0.122   0.209
Query 3   0.487   0.434   0.421
Query 4   0.444   0.446   0.386
Query 5   0.320   0.290   0.277
Query 6   0.299   0.302   0.324

[Table of per-query effectiveness scores for three systems.]

Sampling and Reusability. Does sampling produce a reusable collection? We don't know, and we can't simulate it: holding systems out would produce a different sample, meaning we would need judgments that we don't have.

Experimenting on Reusability. Our goal is to define an experimental design that allows us to simultaneously: acquire relevance judgments, test hypotheses about differences between systems, and test the reusability of the topics and judgments. What does it mean to test reusability? Test a null hypothesis that the collection is reusable; reject that hypothesis if the data demands it; never accept that hypothesis.

Reusability for Evaluation. We focus on evaluation (rather than training, failure analysis, etc). Three types of evaluation: within-site, a group wants to internally evaluate their systems; between-site, a group wants to compare their systems to those of another group; participant-comparison, a group wants to compare their systems to those that participated in the original experiment (e.g. TREC). We want data for each of these.

[Design grid: topic subsets T0 (topics 1 ... n), T1 (topics n+1 ... n+15), T2 (topics n+16 ...) assigned across Sites 1-6; highlighted are the all-site baseline, the within-site reuse set, and the within-site baseline (for site 6).]

[Design grid as before; highlighted are the all-site baseline, the between-site reuse set, and the between-site baseline (for sites 5 and 6).]

[Design grid as before; highlighted are the all-site baseline and the participant-comparison set.]

Statistical Analysis. The goal of the statistical analysis is to try to reject the hypothesis about reusability, i.e. to show that the judgments are not reusable. Three approaches: show that measures such as average precision on the baseline sets do not match measures on the reuse sets; show that significance tests in the baseline sets do not match significance tests in the reuse sets; show that rankings in the baseline sets do not match rankings in the reuse sets. Note: within confidence intervals!

Agreement in Significance. Perform significance tests on all pairs of systems in a baseline set and on all pairs of systems in a reuse set. If the aggregate outcomes of the tests disagree significantly, reject reusability.

Within-Site Example. Some site submitted five runs to the TREC 2004 Robust track. Within-site baseline: 210 topics; within-site reuse: 39 topics. Perform 5*4/2 = 10 paired t-tests with each group of topics and aggregate agreement in a contingency table (a sketch of this procedure follows).
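A minimal sketch of this all-pairs procedure, assuming per-topic score matrices for the runs are available; the random data here is only a placeholder for the actual TREC 2004 Robust scores.

```python
# Tally agreement of pairwise significance tests on baseline vs reuse topics.
from itertools import combinations
import numpy as np
from scipy import stats

def agreement_table(baseline_scores, reuse_scores, alpha=0.05):
    """Both arguments are arrays of shape (n_runs, n_topics). Returns a 2x2
    table with rows = reuse outcome, columns = baseline outcome
    (index 0 = significant, index 1 = not significant)."""
    table = np.zeros((2, 2), dtype=int)
    for i, j in combinations(range(baseline_scores.shape[0]), 2):
        p_base = stats.ttest_rel(baseline_scores[i], baseline_scores[j]).pvalue
        p_reuse = stats.ttest_rel(reuse_scores[i], reuse_scores[j]).pvalue
        table[int(p_reuse >= alpha), int(p_base >= alpha)] += 1
    return table

# Placeholder data: 5 runs, 210 baseline topics and 39 reuse topics.
rng = np.random.default_rng(0)
print(agreement_table(rng.uniform(size=(5, 210)), rng.uniform(size=(5, 39))))
```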

Within-Site Example.

                   baseline p < 0.05   baseline p ≥ 0.05
reuse p < 0.05             6                   0
reuse p ≥ 0.05             3                   1

3 significant differences in the baseline set are not significant in the reuse set. 70% agreement: is that bad?

Expected Errors. Compare the observed error rate to the expected error rate. To estimate the expected error rate, use power analysis (Cohen, 1992): what is the probability that the observed difference over 210 topics would be found significant? What is the probability that the observed difference over 39 topics would be found significant? Call these probabilities q1 and q2.

Expected Errors. For each pair of runs: q1 = probability that the observed difference is significant over 210 queries; q2 = probability that the observed difference is significant over 39 queries. Then accumulate: expected number of true positives += q1*q2; expected number of false positives += q1*(1-q2); expected number of false negatives += (1-q1)*q2; expected number of true negatives += (1-q1)*(1-q2).
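A minimal sketch of estimating q1 and q2 for one pair of runs with a power analysis; the standardized effect size below is a placeholder, and the original analysis follows Cohen (1992), so details may differ.

```python
# Power of detecting the observed difference over 210 vs 39 topics, and the
# pair's contribution to the expected contingency table.
from statsmodels.stats.power import TTestPower

power = TTestPower()
d_observed = 0.2   # placeholder: standardized mean difference for this pair

q1 = power.power(effect_size=d_observed, nobs=210, alpha=0.05)  # baseline topics
q2 = power.power(effect_size=d_observed, nobs=39, alpha=0.05)   # reuse topics

# Accumulate exactly as on the slide.
contribution = {"true positives": q1 * q2,
                "false positives": q1 * (1 - q2),
                "false negatives": (1 - q1) * q2,
                "true negatives": (1 - q1) * (1 - q2)}
print(q1, q2, contribution)
```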

Observed vs Expected Errors.

Observed:
                   baseline p < 0.05   baseline p ≥ 0.05
reuse p < 0.05             6                   0
reuse p ≥ 0.05             3                   1

Expected:
                   baseline p < 0.05   baseline p ≥ 0.05
reuse p < 0.05           7.098               0.073
reuse p ≥ 0.05           2.043               0.786

Perform a Χ² goodness-of-fit test to compare the tables: p-value = 0.88. Do not reject reusability (for new systems like these).
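The comparison of the two tables can be done as below; a minimal sketch in which degrees of freedom are taken as the number of cells minus one, so the p-value need not reproduce the 0.88 on the slide exactly.

```python
# Chi-square goodness-of-fit of observed vs expected agreement counts.
import numpy as np
from scipy.stats import chi2

observed = np.array([6.0, 0.0, 3.0, 1.0])          # cells of the observed table
expected = np.array([7.098, 0.073, 2.043, 0.786])  # cells of the expected table

stat = np.sum((observed - expected) ** 2 / expected)
p_value = chi2.sf(stat, df=len(observed) - 1)
print(f"X^2 = {stat:.2f}, p = {p_value:.2f}")
```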

Validation of Design and Analysis. Three tests: will we reject reusability when it is not true? When reusability is true, will the design+analysis be robust to random differences in topic sets? When reusability is true, will the design+analysis be robust to random differences in held-out sites?

Differences in Topic Samples. Set-up: simulate the design, but guarantee reusability. Randomly choose k sites to hold out and use them to define the baseline and reuse sets. The performance measure on each system/topic is simply the one calculated using the original judgments; reusability is true because all measures are exactly the same as when sites are held out.

Observed vs Expected Errors (Within-Site).

Observed:
                   baseline p < 0.05   baseline p ≥ 0.05
reuse p < 0.05           196                   2
reuse p ≥ 0.05            57                  45

Expected:
                   baseline p < 0.05   baseline p ≥ 0.05
reuse p < 0.05          189.5                 4.3
reuse p ≥ 0.05           62.1                44.1

Perform a Χ² goodness-of-fit test to compare the tables: p-value = 0.58. Do not reject reusability (for new systems like these), based on TREC 2008 Million Query track data.

Differences in Held-Out Sites. Set-up: a simulation of the design with TREC Robust data (249 topics, many judgments each). Randomly hold two of the 12 submitting sites out; simulate pools of depth 10, 20, 50, 100; calculate average precision over the simulated pool. Previous work suggests reusability is true. (Only within-site analysis is possible.)

Observed vs Expected Errors (Within-Site).

Observed:
                   baseline p < 0.05   baseline p ≥ 0.05
reuse p < 0.05           130                  17
reuse p ≥ 0.05           127                 160

Expected:
                   baseline p < 0.05   baseline p ≥ 0.05
reuse p < 0.05          135.4                13.9
reuse p ≥ 0.05          121.6               163.1

Perform a Χ² goodness-of-fit test to compare the tables: p-value = 0.74. Do not reject reusability (for new systems like these), based on TREC 2004 Robust track data.

Course Outline. Intro to evaluation: evaluation methods, test collections, measures, comparable evaluation. Low-cost evaluation. Advanced user models: web search models, novelty & diversity, sessions. Reliability: significance tests, reusability. Other evaluation setups.

Evaluation using crowd-sourcing [Sheng et al KDD08, Bailey et al SIGIR08, Alonso and Mizzaro SIGIR09, Kazai et al SIGIR09, Yang et al WSDM09, Tang and Sanderson ECIR10, Sanderson et al SIGIR10]. Cheap but noisy judgments. Large load under a single assessor per topic: can you motivate an MTurker to judge ~1,500 documents? Multiple assessors (not MTurkers) per topic works fine [Trotman and Jenkinson ADCS07]. Inconsistency across assessors: malicious activity? Noise? Diversity in information needs (query aspects)?

Online Evaluation [Joachims et al SIGIR05, Radlinski et al CIKM08, Wang et al KDD09]. Use clicks as an indication of relevance? Rank bias: users tend to click on documents at the top of the list independent of their relevance. Quality bias: users tend to click on less relevant documents if the overall quality of the search engine is poor.

Online Evaluation. Evaluate by watching user behaviour: a real user enters a query; record how users respond; measure statistics about these responses. Common online evaluation metrics: click-through rate (assumes more clicks means better results); queries per user (assumes users come back more with better results); probability that a user skips over results they have considered (pSkip).

Online Evaluation: Interleaving [Radlinski et al CIKM08]. A way to compare rankers online: given the two rankings produced by two methods, present a combination of the rankings to users; the ranking providing more of the clicked results wins. Treat a flight as an active experiment.

Team Draft Interleaving.

Ranking A:
1. Napa Valley - The authority for lodging... www.napavalley.com
2. Napa Valley Wineries - Plan your wine... www.napavalley.com/wineries
3. Napa Valley College www.napavalley.edu/homex.asp
4. Been There Tips Napa Valley www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine www.napavintners.com
6. Napa County, California - Wikipedia en.wikipedia.org/wiki/Napa_Valley

Ranking B:
1. Napa County, California - Wikipedia en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley - The authority for lodging... www.napavalley.com
3. Napa: The Story of an American Eden books.google.co.uk/books?isbn=
4. Napa Valley Hotels - Bed and Breakfast www.napalinks.com
5. NapaValley.org www.napavalley.org
6. The Napa Valley Marathon www.napavalleymarathon.org

Presented Ranking:
1. Napa Valley - The authority for lodging... www.napavalley.com
2. Napa County, California - Wikipedia en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden books.google.co.uk/books?isbn=
4. Napa Valley Wineries - Plan your wine... www.napavalley.com/wineries
5. Napa Valley Hotels - Bed and Breakfast www.napalinks.com
6. Napa Valley College www.napavalley.edu/homex.asp
7. NapaValley.org www.napavalley.org

B wins!

Credit Assignment. The team with more clicks wins. Randomization removes presentation-order bias. Each impression with clicks gives a preference for one of the rankings (unless there is a tie). By design, if the input rankings are equally good, they have an equal chance of winning. Statistical test to run: ignoring ties, is the fraction of impressions where the new method wins statistically different from 50%? A sketch of the interleaving and credit assignment follows.
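A minimal sketch of team-draft interleaving and its credit assignment, following the description above; the document identifiers are illustrative, and this is not the production algorithm of any particular engine.

```python
# Team-draft interleaving of two rankings, plus per-impression credit.
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Return the combined ranking and the set of documents picked by each team."""
    combined, team_a, team_b = [], set(), set()
    ia = ib = 0
    while ia < len(ranking_a) and ib < len(ranking_b):
        # The team with fewer picks goes next; a coin flip breaks ties.
        a_turn = (len(team_a) < len(team_b) or
                  (len(team_a) == len(team_b) and rng.random() < 0.5))
        ranking, team, idx = ((ranking_a, team_a, ia) if a_turn
                              else (ranking_b, team_b, ib))
        # Skip documents already placed by the other team.
        while idx < len(ranking) and ranking[idx] in combined:
            idx += 1
        if idx < len(ranking):
            combined.append(ranking[idx])
            team.add(ranking[idx])
        if a_turn:
            ia = idx + 1
        else:
            ib = idx + 1
    return combined, team_a, team_b

def credit(clicked_docs, team_a, team_b):
    """Return 'A', 'B', or 'tie' for one impression, based on clicks per team."""
    a_clicks = sum(d in team_a for d in clicked_docs)
    b_clicks = sum(d in team_b for d in clicked_docs)
    return "A" if a_clicks > b_clicks else "B" if b_clicks > a_clicks else "tie"

ranking_a = ["napavalley.com", "napavalley.com/wineries", "napavalley.edu"]
ranking_b = ["wikipedia:Napa_County", "napavalley.com", "books:american-eden"]
combined, team_a, team_b = team_draft_interleave(ranking_a, ranking_b)
print(combined, credit(["wikipedia:Napa_County"], team_a, team_b))
```

Over many impressions, the fraction of non-tied impressions won by the new ranker can then be compared to 50% with the same exact binomial test sketched earlier.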

Course Outline. Intro to evaluation: evaluation methods, test collections, measures, comparable evaluation. Low-cost evaluation. Advanced user models: web search models, novelty & diversity, sessions. Reliability: significance tests, reusability. Other evaluation setups.

By the end of this course you will be able to evaluate your retrieval algorithms (a) at low cost, (b) reliably, and (c) effectively.

Many thanks to Mark Sanderson @RMIT for some of the significance testing slides, and to Filip Radlinski @MSR for many of the online evaluation slides.