Low-Cost and Robust Evaluation of Information Retrieval Systems


Low-Cost and Robust Evaluation of Information Retrieval Systems

A Dissertation Presented by
BENJAMIN A. CARTERETTE

Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

September 2008

Computer Science

© Copyright by Benjamin A. Carterette 2008
All Rights Reserved

Low-Cost and Robust Evaluation of Information Retrieval Systems

A Dissertation Presented by
Benjamin A. Carterette
B.Sc., MIAMI UNIVERSITY
M.Sc., UNIVERSITY OF MASSACHUSETTS AMHERST
Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Approved as to style and content by:
James Allan, Chair
Javed A. Aslam, Member
W. Bruce Croft, Member
Ramesh K. Sitaraman, Member
John Staudenmayer, Member
Andrew G. Barto, Department Chair, Computer Science


Acknowledgments

This work would not have been possible without the support of my friends, family, and colleagues. Foremost among them is my advisor, James Allan, who suggested a small project on evaluation my first semester as a graduate student. That project didn't pan out, but it led to this work and the attendant publications, recognition, and opportunity. James offered bottomless support and was always willing to dispense advice. As I begin my own professorship, James will be my primary model for how to advise and teach.

I got into information retrieval in the first place by a fortuitous accident: Fazli Can, who was assigned to be my academic advisor my freshman year at Miami University, is a longtime IR researcher. He took me under his wing and guided me through my first research project. I owe him a great debt for setting me on this path.

I have also been fortunate to have excellent mentors during my summer internships at Yahoo! and Microsoft Research. Rosie Jones at Yahoo! took me on initially with no more than a positive word from a former intern, and I finished that first internship as a better scientist with much more confidence in my ability as a researcher. Sue Dumais at Microsoft saw the importance of work on evaluation and helped me get the support I needed. The work I did that summer is something that none of us likely would have done on our own, and I believe it opens many doors for future research. From both Rosie and Sue I learned the value of seeking out collaborations; much of this work would not have been possible without the help of collaborators, including Javed Aslam, Paul Bennett, Max Chickering, Evangelos Kanoulas, Josh Lewis, Virgil Pavlu, Desi Petkova, Ramesh Sitaraman, and Mark Smucker. And even when there was no formal collaboration, many people contributed to this work just by asking challenging questions, through stimulating discussions, or by offering support. There are too many to name them all, but they include other UMass faculty, notably Bruce Croft and R. Manmatha, my friends and colleagues in the Center for Intelligent Information Retrieval, and the people I worked with at Yahoo! and Microsoft.

My parents and brother and sister have been endlessly supportive and patient, even when I failed to write or call for long stretches. They have always been willing to come to my side when I needed them, and for that I thank them deeply. I hope that I can provide the same when they need me.

This work was supported in part by the Center for Intelligent Information Retrieval, in part by SPAWARSYSCEN-SD grant number N , in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR C-0023, and in part by Microsoft Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect those of the sponsor.

Abstract

Research in Information Retrieval has progressed against a background of rapidly increasing corpus size and heterogeneity, with every advance in technology quickly followed by a desire to organize and search more unstructured, more heterogeneous, and even bigger corpora. But as retrieval problems get larger and more complicated, evaluating the ranking performance of a retrieval engine gets harder: evaluation requires human judgments of the relevance of documents to queries, and for very large corpora the cost of acquiring these judgments may be insurmountable. This cost limits the types of problems researchers can study as well as the data they can be studied on.

We present methods for understanding performance differences between retrieval engines in the presence of missing and noisy relevance judgments. The work introduces a model of the cost of experimentation that incorporates the cost of human judgments as well as the cost of drawing incorrect conclusions about differences between engines in both the training and testing phases of engine development. By adopting a view of evaluation concerned more with distributions over performance differences than with estimates of absolute performance, the expected cost can be minimized so as to reliably differentiate between engines with less than 1% of the human effort that has been used in past experiments.


Table of Contents

Acknowledgments
Abstract
List of Tables
List of Figures
List of Symbols

1. Introduction
   1.1 Low-Cost Comparative Evaluation
   1.2 Robust Evaluation
   1.3 Contributions
   1.4 Organization
2. Information Retrieval Evaluation
   2.1 Experimentation in Information Retrieval
   2.2 Relevance Judgments
   2.3 Testing Hypotheses
   2.4 Summary and Directions
3. Comparative Evaluation
   3.1 Algorithmic Document Selection
   3.2 Precision Measures
   3.3 Recall Measures
   3.4 Summary Measures
   3.5 Empirical Evaluation
4. Confidence
   4.1 Distribution of Measures Over Judgments
   4.2 A Significance Test Based on Judgments
   4.3 Probabilistic Stopping Conditions
   4.4 Experiments
5. Ranking Retrieval Systems
   5.1 Hypotheses About Rankings
   5.2 An Efficient Algorithm
   5.3 Conclusion
6. Robust Evaluation
   6.1 Confidence and Probability of Relevance
   6.2 Features for Modeling Relevance
   6.3 An Efficient Algorithm
   6.4 Experiments
7. Hypothesis Testing and Experimental Design
   7.1 Hypothesis Tests in Information Retrieval
   7.2 Testing Hypotheses with Incomplete Judgments
   7.3 Experimental Design
   7.4 The TREC Million Query Track
   7.5 Summary
8. Conclusion
   8.1 Future Work
   8.2 Are Assessors Necessary?

Appendix: Experimental Data
   A.1 Data Overview
   A.2 Data Analysis

Bibliography

List of Tables

- 3.1 Mean number of judgments required by MTC algorithms to prove a difference between two ranked lists over a single topic
- Mean number of judgments required by pooling to prove a difference between two ranked lists over a single topic
- Mean number of judgments required by MTC algorithms to prove a difference between two ranked lists over 50 topics
- Mean number of judgments required by pooling to prove a difference between two ranked lists over 50 topics
- Mean number of judgments required by MTC algorithms to prove a difference between two ranked lists from the same site over 50 topics
- Mean number of judgments required by pooling to prove a difference between two ranked lists from the same site over 50 topics
- Values of precision@5 and average precision depend on the assignment of relevance to 10 documents x^10_i. Assignments are numbered arbitrarily, with x^10_i being the ith assignment
- Number of judgments made and resulting accuracy when comparing two systems over a single topic using MTC by either prior with α = 0.05, β =
- Number of judgments made and resulting accuracy when comparing two systems over a single topic using incremental pooling by either prior with α = 0.05, β =
- Number of judgments made and resulting accuracy when comparing two systems over 50 topics using MTC by either prior with α = 0.05, β =
- 4.5 Number of judgments made and resulting accuracy when comparing two systems over 50 topics using incremental pooling by either prior with α = 0.05, β =
- Three example system rankings of ten documents. Each element in a set is the rank of the document corresponding to the element's index. J_10 are the true relevance judgments
- Values of average precision depend on the assignment of relevance to 10 documents. Assignments x^10_i are numbered arbitrarily. The fifth line (x^10_95) shows the actual judgments and the true ranking of systems
- Results of running MTC to rank five systems with α = and relevance probabilities based on the simple document-weights model
- Results of using incremental pooling to rank five systems with α = and relevance probabilities based on the simple document-weights model
- Results of running MTC to rank increasing numbers of systems with different values of α and relevance probabilities based on the simple document-weight model
- Result of using MTC methods to evaluate systems from the same participating site
- The first column shows the number of judgments needed to reach 95% confidence and the percent decrease from the full set of qrels. The next two columns show rank correlations and pairwise accuracy between the true ranking and the estimated ranking by EMAP for the full set of systems and the subset of pairs with significant differences. All MTC+sim results are statistically significant improvements over incremental pooling
- Confidence that P(ΔMAP < 0) and accuracy of prediction when generalizing a set of relevance judgments acquired using the standard MTC algorithm and MTC plus expert aggregation probabilities. Each bin contains over 1,000 trials from the ad hoc 3, 5-8 sets. Median judged is the number of judgments to reach 95% confidence on the first two systems. Mean τ is the average rank correlation for all 10 systems
- Confidence vs. accuracy of MTC with probability estimates when a pair of systems retrieved 0-30% documents in common (broken out into 0%-10%, 10%-20%, and 20%-30%)
- 6.8 Mean number of judgments, pairwise accuracy of original runs, rank correlation of all runs, and accuracy at detecting significant differences in reusability experiments. All numbers are averaged over 100 trials
- Judgments collected for the 2007 Million Query Track
- Performance on 149 Terabyte topics, 1692 partially-judged topics per EMAP, and 1084 partially-judged queries per statMAP, along with the number of unjudged documents in the top 100 for both sets
- A.1 Statistics of the TREC collections used in experiments for this work


List of Figures

- 2.1 Illustration of the precision-recall curve and the measures that can be derived directly from it. The corpus in this example is only 100 documents, 33 of which are relevant
- Coefficient matrix of sum precision with documents ordered by Algorithm 3.6. If S_2 = {1, 2}, then E SP_{S_2} is the sum of the shaded cells on the left. If S_{n-2} = {1, 2, ..., n-2}, then E SP_{S^C_{n-2}} is the sum of the shaded cells on the right
- The bounds of DCG@100 at iteration j (left) and the weight of the document selected at iteration j (right) for TREC-3 systems clartm and pircs1, topic 164 (similarity coefficient )
- The bounds of DCG@100 at iteration j (left) and the weight of the document selected at iteration j (right) for TREC-3 systems clartm and pircs1 (similarity coefficient )
- The bounds of DCG@100 at iteration j (left) and the weight of the document selected at iteration j (right) for TREC-3 systems nyuir1 and nyuir2 (similarity coefficient )
- Relationship between the similarity between two ranked lists and the number of judgments required to prove that they have a difference in average precision (left) or mean average precision (right)
- Relationship between the difference in average precision (left) and mean average precision (right) and the number of judgments required to prove that they are different
- Distribution of precision@5 and Δprecision@5 under the binomial prior (top) and uniform prior (bottom). The red curves are normal distributions with the same mean and variance
- 4.2 Distribution of and under the binomial prior (top) and uniform prior (bottom). The simulated collection size is n = 100. Red curves in the top left plot illustrate the five normal distributions being mixed
- Distribution of sum precision, average precision, and Δaverage precision over a simulated collection of size n = 100 under the binomial prior (top) and uniform prior (bottom)
- The left plot shows the bounds on AP changing as judgments are made for systems crnlea and inq101. The thick lines are the expectation of AP and the 95% confidence interval. The right plot shows the test statistic T changing as more judgments are made
- As test parameters α and β increase from 0 to 0.5, the number of judgments required to reach the stopping condition decreases
- Rank plots: Each 2-dimensional plot has axes determined by the difference in φ between two systems. Each point is determined by an assignment of relevance to 10 documents. All of the points in a region bounded by solid lines represent the ranking of systems that labels that region
- Rank plots, each highlighting the effect of judging a different document relevant (documents 1-5 across the top; documents 6-10 across the bottom). Red points indicate values that are still possible after judging a document
- A rank plot after judging documents 4, 5, and 8. Dark red points indicate possible values after all three of those judgments. Light red points are possible values after judging documents 4, 5, or 8 individually
- Possible values if document 4 is judged relevant. Contour lines show the bivariate normal approximation to the joint distribution of the difference in φ between pairs of systems
- System graph. Nodes are systems; edge weights are the variance of the difference in a measure. Bold edges are in the graph's minimum spanning tree
- An example mapping of vector space similarities to probabilities by logistic regression. The left plot shows document vectors in 3-dimensional space. The right shows probability of relevance increasing with similarity to FBIS (a relevant document) and decreasing with similarity to FT (nonrelevant)
- 6.2 Example mappings from rank to probability of relevance. Depending on prior parameters γ_0, γ_1 (which are based on the number of known relevant and nonrelevant documents), probabilities decrease with rank at different rates
- The choice of parameter α (Type I error) and the number of systems being compared determine the number of judgments, the Kendall's τ correlation, and the rank distance
- Number of judgments versus accuracy of ranking. Each point is the mean of 50 trials with a fixed α and number of systems. The solid lines are fitted using least squares. The horizontal dashed lines represent good performance (τ = 0.9 or d_rank = ). Regardless of experimental setting, a good ranking requires about 500 judgments
- Result of using MTC methods with document similarity features to rank full sets of TREC systems. Each circle represents a system; open circles are the systems in their true ordering by MAP, while closed circles are the predicted ranking by a small set of judgments and EMAP. Red squares indicate manual runs
- Confidence vs. accuracy of expert aggregation model. The solid line is the perfect result; performance should be on or above this line. Each point represents at least 500 pairwise comparisons
- Confidence vs. accuracy of document similarity model. Performance should be on or above the solid line
- Relationship between significance test p-values and likelihood ratios for the corresponding models
- Examples of likelihood ratios given increased expectations of AP calculated based on topic sample T and judgments J. The topic population standard deviation in this example is σ_AP =
- Cumulative distribution of AP assuming the population mean is μ_AP = 0.05 and the standard deviation is σ_AP =
- As the number of topics and number of judgments per topic increase, power increases, as does total cost. Note especially the points at the back center-left, which have high power but low cost, in comparison to the points at the back right, which have high power and high cost
- 7.5 EMAP and statMAP evaluation results sorted by evaluation over 149 Terabyte topics
- From left, evaluation over Terabyte queries versus statMAP evaluation, evaluation over Terabyte queries versus EMAP evaluation, and statMAP evaluation versus EMAP evaluation
- Stability of MTC and statAP
- Total assessor cost required to reach a stable ranking. The number of queries required to reach τ = 0.9 is indicated on the plot for both MTC (blue) and statAP (red)
- A.1 True mean average precisions of TREC systems. Each circle represents a retrieval run. Red squares indicate manual runs
- A.2 Mean similarity between submitted runs for each TREC (left) and mean similarity between runs from the same site (right). Similarity between all runs has tended to increase from TREC-4 on even as similarity between runs from the same site decreased or remained constant
- A.3 Similarity between pairs of systems by topic. The topics at each TREC demonstrate a wide range of similarities. The lines are the median similarity
- A.4 Distribution of Kendall's τ rank correlations between the scores of systems over pairs of topics for each TREC. If the correlations are normally distributed, we cannot reject the hypothesis that topics were independently sampled

List of Symbols

Generally speaking, capital italic letters (e.g. X, Y) represent random variables, lowercase italics (e.g. x, y) represent a scalar or a value of a random variable, bold lowercase letters (e.g. x, y) represent vectors, and bold uppercase letters (e.g. X, Y_{n×m}) represent matrices (which may be accompanied by a size). Greek letters (e.g. μ, α) usually represent parameters either selected by a user or estimated from data. Sans serif letters (e.g. A, B) represent ranked lists, i.e. sets in which each element is the rank of a document. All-caps style (e.g. algorithm) represents algorithm/subroutine names. Calligraphic letters (e.g. J, R) generally represent sets, and sometimes population spaces over which a corresponding random variable is defined. When necessary, however, these rules may be violated.

n: The number of documents in a corpus.
d_i: A document in the corpus; 1 ≤ i ≤ n.
x_i: The relevance of document i.
x^n: Relevance judgments to all n documents: x^n = {x_1, x_2, ..., x_n}.
X_i: A Bernoulli random variable for the relevance of document i.
p_i: The probability that document i is relevant: p_i = p(x_i = 1).
p^n; p: A set or vector of probabilities of relevance: p^n = p = {p_1, p_2, ..., p_n}.
X^n: A random variable over all possible assignments of relevance: X^n = {X_1, X_2, ..., X_n}.
p(X^n): The probability distribution of assignments of relevance.
𝒳^n: The space of 2^n relevance assignments.
J: A set of relevance judgments: J = {x_j1, x_j2, ..., x_jn}.
T: A set of topics. Frequently overloaded to mean the size of the set as well.
t: A topic in T.
R: The set of judged relevant documents for a topic. Sometimes overloaded to mean the number of relevant documents as well.
A_i: The rank of document d_i in system A.
A: The ranking of documents by system A: A = {A_1, A_2, ..., A_n}.
φ: The value of an arbitrary evaluation measure φ for a particular system with a particular set of judgments.
Δφ: The difference in φ between two systems on the same topic.
P(φ): The probability distribution of values of φ over possible assignments of relevance to unjudged documents, i.e. over values of the random variable X^n.
Eφ: The expectation of φ over values of the random variable X^n.
Var(φ): The variance of φ over values of the random variable X^n.
A ≻ B: System A is comparatively better than B by some evaluation measure φ, i.e. φ_A > φ_B.
A ≈ B: System A is equivalent to B by some evaluation measure φ, i.e. φ_A = φ_B.
m: The number of systems in a set being evaluated.
R_m: A random variable over the space of possible rankings of m systems.
σ: A particular ranking of m systems; σ ∈ R_m.
H(·): Shannon entropy of a random variable.
d: The number of features in a model of relevance.
F: A feature matrix with n rows and d columns.
θ: A parameter vector of length d, or the parametrization of a model of φ.
L(·): The likelihood function.
φ̄: The average of φ over a set of topics.
σ_φ: The standard deviation of φ over a set of topics.
α, β: Type I and Type II error rates, respectively.
C; C_T; C_j: Cost function C and components C_T (cost of sampling a query) and C_j (cost of making a judgment).
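As a quick illustration of how these symbols combine, the display below simply restates the verbal definitions of Eφ and Var(φ) as sums over the space of relevance assignments; the product form of p(x^n) is one convenient choice under an independence assumption, not a requirement of the notation.

```latex
% Expectation and variance of a measure over relevance assignments,
% written directly from the definitions of p(X^n), E\phi, and Var(\phi) above.
E\phi = \sum_{x^n \in \mathcal{X}^n} \phi(x^n)\, p(x^n),
\qquad
\mathrm{Var}(\phi) = \sum_{x^n \in \mathcal{X}^n} \bigl(\phi(x^n) - E\phi\bigr)^2\, p(x^n).
% If the X_i are modeled as independent Bernoulli variables (an assumption
% made here for illustration), the distribution factors as
p(x^n) = \prod_{i=1}^{n} p_i^{\,x_i} (1 - p_i)^{1 - x_i}.
```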

Chapter 1

Introduction

Research and development in information retrieval progresses by a cycle of system design, implementation, and experimentation. Broadly speaking, experimentation consists of interacting with a system and evaluating its retrieved results. Decisions at the design phase often cannot be made without empirical results from previous implementations; the experimentation phase therefore must be short, yet its conclusions reliable. Complicating matters is the fact that the corpora users want to organize and search, and the types of search activities they want to perform, grow larger, more complicated, and more heterogeneous as time goes on.

The first corpus assembled for the express purpose of evaluating automatic retrieval systems was the Cranfield collection of about 1,400 abstracts of work on aerodynamics. It was designed with simple keyword search in mind. Research corpora since then have grown in the number and length of the documents, the heterogeneity of their content, and the types of search tasks they are intended for: from 1,400 aerodynamics abstracts to 3,200 general computer science abstracts, to 348,000 medical research abstracts, to the TIPSTER collections of millions of full-text general news articles, to the web, with its billions of highly diverse documents in semi-structured HTML and other formats.

Evaluation of retrieval results is typically approached from one of two directions: user-based or system-based (Voorhees, 2002). In a user-based evaluation, a group of subjects is assembled to interact with the system in a controlled way. The subjects may be measured by quantities such as the time they spend on a given task and their ability to find relevant material using the system. They may also be asked to answer survey questions about their satisfaction with the engine. While these types of studies are invaluable for certain types of questions, they are highly inefficient as part of a rapid development cycle. Not only are they expensive and time-consuming, but users' reported satisfaction often does not correlate well with their ability to find relevant material; in particular, they often lack a good sense of the engine's recall, the proportion of relevant material found.

A system-based evaluation also relies on a group of users, but in a way that allows the data they provide to be reused in subsequent experiments. Specifically, a group of assessors develops a set of information needs that are representative of the types of needs of actual users of the system. These information needs are distilled down to a query, which is run against a system or a set of systems. The documents retrieved by these systems are then judged with respect to their relevance to the information need. These judgments can then be used to train and evaluate new systems. The information needs and judgments can be packaged as a test collection

and redistributed to other research groups, amortizing the cost over possibly thousands of experiments. This makes the system-based evaluation a necessity for the development cycle.

But as corpora get larger and more heterogeneous, the number of judgments needed to draw strong conclusions about systems grows in tandem. In addition, different search tasks require different judgment criteria; as the types of tasks that users want to perform diversify, the amount of judging effort required to test all of them explodes. The expense of obtaining relevance judgments has a major detrimental effect on information retrieval development: it restricts the types of problems that can be studied and the data they can be studied on. Without either incurring the expense of obtaining a set of relevance judgments or a one-time user study, researchers are largely limited to studying problems that have been studied at TREC (the Text REtrieval Conferences organized by NIST) and related conferences such as NTCIR and CLEF that have a stated goal of assembling test collections with large sets of relevance judgments for distribution to the research community.

To a certain extent this is positive. Before a task can be studied at TREC, its proposers must convince the research community that it is interesting and worth the time and expense of constructing a test collection. This has served the field of IR well, preventing it from becoming highly fractured as each research group pursues its own notion of information retrieval influenced but not defined by the community at large. On the other hand, individual groups often do need to collect their own judgments, particularly when developing a system for a new corpus or pursuing a task that extends beyond the traditional boundaries of information retrieval. Furthermore, even for the problems that have been studied at TREC, there is still a relative dearth of data; document relevance has been assessed on a large scale for fewer than 1,000 topics since the first TREC in 1992 (Voorhees and Harman, 2005). As well, the continued growth of corpora has begun to make the process difficult even with the resources of NIST; recent work suggests that traditional methods for acquiring judgments (such as system pooling (Sparck Jones and van Rijsbergen, 1976)) can no longer find enough relevant documents for reliable evaluation (Buckley et al., 2006).

The broad goals of this work are to provide methods for focusing judging effort on those documents most informative for an evaluation, and to reliably answer evaluation questions when relevance judgments are missing on a large scale. To that end, we focus on some specific evaluation questions to the exclusion of others, in particular questions of comparative evaluation.

1.1 Low-Cost Comparative Evaluation

In a retrieval engine design cycle, the evaluation question is less often "how good is this system?" than "how much better (or worse) is this system than one previously developed?" In particular, in an iterative design cycle, systems are often built incrementally; presumably, developers building on a system from a previous design cycle already know that it meets some baseline standard of performance. The question is whether the changes improve it, and by how much. The comparative evaluation

task is focused on identifying these relative performance differences between systems rather than making absolute measurements of performance.

By focusing on whether one system is better than another (i.e. whether it retrieved more relevant documents at higher ranks), certain documents may clearly carry more information about relative performance differences than others. Two systems that retrieved exactly the same documents in exactly the same order, for instance, clearly have identical retrieval performance regardless of the relevance of the documents. If two documents are swapped in the ranking of one system while all others are identical (for example, the top-ranked and 10th-ranked document in one system are ranked 10th and first, respectively, by the other), those two swapped documents are the only ones that need be judged. With two pairs of swapped documents, one pair at ranks 1 and 10 and the other at, say, 50 and 60, all four documents are informative, but those at ranks 1 and 10 are almost certainly more informative than the ones at ranks 50 and 60. However, if the second pair appears at ranks 2 and 100, the relative value is not as clear; it may depend on which evaluation measures are of interest or which documents are more likely to be relevant.

Supposing documents can be ordered by informativeness implies that the marginal value of a judgment decreases as assessors work down the list. This can be leveraged to determine the likelihood of a comparison of two systems going one way or the other: if the remaining unjudged documents are such that nearly all of them would have to be relevant to change the relative ordering of the systems, the confidence in the conclusion should be high even though the existing judgments are not sufficient to prove it. Assigning probabilities to possible assignments of relevance to unjudged documents produces a distribution over relative orderings of systems; in this way the likelihood can be estimated directly.

1.2 Robust Evaluation

This idea of the likelihood of the relative ordering of systems naturally suggests an evaluation model in which the judgments themselves are the observations or data on which hypotheses about systems are evaluated. The standard approach in information retrieval is to test hypotheses about evaluation measures given calculations of those measures over a sample of topics, e.g. that the mean difference in some measure over the population of topics is zero. Variance due to missing judgments is not modeled: evaluation measures are typically summed over relevance judgments, with some assumption made about the relevance of unjudged documents. Their variance accumulates in the model's error terms; if judgments are not missing at random, there is no guarantee that the conclusions drawn from the test are reliable. In particular, a malicious adversary with access to systems being evaluated could construct a set of judgments that assessors agree on but that nevertheless results in incorrect yet statistically significant conclusions.

By taking the judgments as the observations, missing judgments induce a space of hypotheses about orderings of systems, with the likelihood of a hypothesis conditioned on the judgments themselves. Every judgment an assessor makes reduces the volume of that space and focuses the distribution on a smaller set of possible orderings. In this model, conclusions about relative orderings of systems come with

a confidence based on the distribution of possible assignments of relevance to unjudged documents. If this distribution can be sensibly defined, this model is robust in the sense that the confidence estimates are predictive statements about hypotheses: a hypothesis with 70% confidence has a 70% chance of being true given a complete set of judgments. A malicious adversary cannot force incorrect conclusions to be drawn, even when the evaluator does not know anything about the distribution of systems that may be evaluated; the worst it can do is force low confidence in the result of the evaluation.

Given that researchers, not knowing results of future research, do not know what the distribution of retrieval systems is like, we argue that evaluation results with confidence estimates that model unjudged documents are the best that can be achieved, assuming the confidence estimates are trustworthy. A low-confidence result simply means that more judgments are necessary, and the methods suggested above provide a guide to acquiring them.

1.3 Contributions

This work, then, is devoted to low-cost acquisition of relevance judgments by intelligent selection of documents, and robust evaluation through reliable estimates of the likelihood of hypotheses about relative orderings of systems. If the subject of this work is experimental design in information retrieval, its overarching contribution is a theory of experimental design and evaluation based on the judgments themselves rather than measures calculated over judgments. We provide a framework in which the variance in measures due to missing judgments can be effectively understood and controlled.

Major Contributions

The major contributions of this work are the elements of our theory of experimental design. These are:

1. Algorithms for acquiring relevance judgments to rank systems by common evaluation measures such as precision, recall, NDCG, and average precision.
2. An understanding of the space of hypotheses about rankings of systems through the idea of confidence and distributions of measures over the space of possible judgments (illustrated in the sketch below).
3. Models for estimating the probabilities of relevance of unjudged documents, using available judgments as training data.

From these elements we can show through both theoretical and empirical analysis that:

4. The optimal experimental design for information retrieval consists of several hundred queries for which assessors make a few dozen judgments each.
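To make the idea behind items 2 and 3 concrete, here is a minimal Monte Carlo sketch: each unjudged document is treated as a Bernoulli variable with an assumed probability of relevance, and the confidence that one run beats another at precision@k is estimated by sampling complete assignments. This is an illustration only; the function names, the default probability of 0.5, and the brute-force sampling are not the algorithms developed in later chapters.

```python
import random

def precision_at_k(ranking, rel, k=10):
    """Precision at rank k given a dict of 0/1 relevance values."""
    return sum(rel.get(d, 0) for d in ranking[:k]) / k

def confidence_a_beats_b(run_a, run_b, judged, p_rel, k=10, trials=10000, seed=0):
    """Estimate the confidence that run_a outscores run_b at precision@k
    by sampling relevance for the unjudged documents.

    judged: dict doc -> 0/1 for documents an assessor has already judged.
    p_rel:  dict doc -> assumed probability of relevance for unjudged documents
            (0.5 is used when no estimate is supplied).
    """
    rng = random.Random(seed)
    unjudged = [d for d in set(run_a[:k]) | set(run_b[:k]) if d not in judged]
    wins = 0
    for _ in range(trials):
        rel = dict(judged)
        for d in unjudged:
            rel[d] = 1 if rng.random() < p_rel.get(d, 0.5) else 0
        if precision_at_k(run_a, rel, k) > precision_at_k(run_b, rel, k):
            wins += 1
    return wins / trials
```

Sampling makes the definition of confidence transparent; the chapters that follow characterize these distributions over relevance assignments directly (and approximate them analytically) rather than estimating them by brute force.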

Minor Contributions

While much of the literature on this subject takes a purely empirical approach to verification, we argue that theoretical analysis is required for two reasons: first, the available data reflects an evaluation task (ranking a large set of systems from a variety of sites) that almost no one outside of NIST would actually perform; second, the retrieval runs that that data comprises are not representative of the types of runs that IR researchers and developers would typically evaluate. If the assumptions of a theoretical analysis are clearly understood, the result's generalizability can be directly evaluated based on those assumptions. Thus each of the elements above is justified through a combination of theoretical and empirical analysis. Specifically:

1. We prove that some of the selection algorithms are optimal in the sense that the number of judgments required to prove that there is a difference between two systems is no greater than that required by any other algorithm with no knowledge of the distribution of relevant documents.
2. We show empirically that these algorithms require up to 30% fewer judgments than a simple incremental variant of the standard pooling method, and up to 50% fewer than an unordered pool, to prove that there is a difference between a pair of systems.
3. We demonstrate that using the probabilistic stopping condition results in high-confidence conclusions about relative differences between systems that are more accurate and require fewer judgments on average than the simple incremental pooling method.
4. We use power analysis to provide a theoretical justification for using a large set of queries with small sets of judgments in an information retrieval experiment.
5. We show that by using all the methods above, we can approximate the results of past TREC evaluations with a high degree of accuracy but as little as 1% of the judging effort undertaken by NIST.

Additionally, this work makes some arguments about test collection reusability, and also presents a new way to visualize rankings of systems and the information that a judgment provides about a ranking. In particular:

1. We argue that the reusability of a test collection should be understood in terms of the calibration of estimates of confidence in conclusions about relative differences between systems.
2. We show empirically that very small test collections can be reusable in the above sense when our models of relevance are applied.
3. We present a geometric interpretation of the space of hypotheses about rankings of m systems in which assignments of judgments result in points in (m-1)-dimensional space, and rankings are represented by sets of points in a hyperplane-bounded region (see the sketch after this list).
4. We show how the effect of a judgment can be understood in this space, and how it leads to a simple way to approximate the information provided by a judgment.
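To give item 3 in the second list a concrete form, the sketch below enumerates every assignment of relevance to a handful of documents and maps each assignment to a point whose coordinates are differences in precision@k between pairs of systems; the runs, document ids, and choice of precision@k are made up for illustration.

```python
from itertools import product

def prec_at_k(ranking, rel, k=3):
    """Precision at rank k for a ranked list of doc ids and a 0/1 relevance map."""
    return sum(rel[d] for d in ranking[:k]) / k

# Three hypothetical runs over the same six documents.
docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
A = ["d1", "d2", "d3", "d4", "d5", "d6"]
B = ["d2", "d1", "d4", "d3", "d6", "d5"]
C = ["d6", "d5", "d4", "d3", "d2", "d1"]

# Every assignment of relevance to the 6 documents maps to a point
# (phi_A - phi_B, phi_A - phi_C); the signs of those two differences,
# together with the sign of their difference (= phi_B - phi_C),
# determine which ranking of {A, B, C} the assignment supports.
points = {}
for bits in product([0, 1], repeat=len(docs)):
    rel = dict(zip(docs, bits))
    point = (prec_at_k(A, rel) - prec_at_k(B, rel),
             prec_at_k(A, rel) - prec_at_k(C, rel))
    points[point] = points.get(point, 0) + 1

print(f"{len(points)} distinct points from {2 ** len(docs)} assignments")
```

With m = 3 systems the points live in a 2-dimensional (that is, m-1 dimensional) plane, and the regions bounded by the lines where one difference is zero or the two differences are equal correspond to the possible rankings; this is the picture behind the rank plots listed among the figures above.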

Finally, we describe the TREC 2007 Million Query Track, a recent effort that provides some empirical evidence that evaluating over many queries with a few judgments each provides results that are similar to evaluating over a few queries with many judgments each.

1.4 Organization

We will build a theory of evaluation based on the judgments gradually over five chapters. Initially the focus is on comparing two systems over a fixed set of topics (Chapters 3 and 4); these two chapters, perhaps counterintuitively, are the densest. They are followed by generalizations to ranking a set of systems over a fixed set of topics (Chapters 5 and 6), and then making inferences about the population of topics and designing experiments that will allow strong inferences to be made (Chapter 7). Specifically:

Chapter 2, Information Retrieval Evaluation, is a survey of the history of experimentation and evaluation in the field of information retrieval, with a particular focus on work on strategies for selecting documents to judge and methods for coping with incomplete judgments.

Chapter 3, Comparative Evaluation, covers algorithmic selection of documents to judge when comparing pairs of systems. Different measures require different strategies; some of the strategies are provably optimal.

Chapter 4, Confidence, formally defines the confidence in a comparative evaluation of two systems. Confidence is defined by distributions over missing judgments; these are derived for a few of the most common evaluation measures. Confidence is used to define a stopping condition for an algorithmic selection process.

Chapter 5, Ranking Retrieval Systems, generalizes the previous two chapters to ranking sets of systems rather than comparing two. Exact results are computationally much more difficult; approximations are necessary for efficient algorithms.

Chapter 6, Robust Evaluation, proposes that robust evaluation requires models of relevance that make good predictions on unjudged documents. This ties into the notion of reusability in that a good model produces good confidence estimates on unseen systems. Models based on document similarity and on the performance of retrieval systems are proposed and evaluated.

Chapter 7, Hypothesis Testing and Experimental Design, generalizes further to allow testing hypotheses about the population of topics given judgments from a sample. This leads to experimental design based on judgments: determining both the number of topics that must be sampled and the number of documents that must be judged to ensure a test with high power. We then present results and analysis of the TREC 2007 Million Query Track, which provides empirical verification of that design.

Chapter 8, Conclusion, concludes the work with a look towards future directions in information retrieval evaluation.


Chapter 2

Information Retrieval Evaluation

It is generally agreed that the first system-based evaluation was performed by Cleverdon (Cleverdon, 1967; Voorhees, 2002; Voorhees and Harman, 2005). Cleverdon and Mills (1963) collected research papers on aerodynamics. They sent questionnaires to the authors asking what the basic research questions were that inspired the paper and how they would judge each of their cited references relevant to each of those questions. Through this method (and with some additional judging labor) they assembled a test collection of about 1,500 documents, 400 queries, and relevance judgments for every document to every query. The Cranfield methodology of evaluation using such test collections has come to be the standard in information retrieval evaluation.

This chapter starts with a discussion of experimentation in the context of the Cranfield methodology. The key ingredients in an information retrieval experiment are presented and possible complications pointed out. The following sections discuss the work that has been done by the information retrieval community to overcome those complications and place this work in the context of that previous work.

2.1 Experimentation in Information Retrieval

At the top level of any information retrieval experiment is the retrieval task. The choice of task dictates or influences all other experimental design choices, from the definition of relevance, to the measurements of system performance that are interesting, to the number and type of queries to test, to the size and type of the corpora that are indexed and retrieved from. Let us consider each of these in turn.

Retrieval Tasks and Relevance

People use search engines for a wide variety of purposes. The prototypical task is ad hoc retrieval: a user enters an arbitrary query and the engine returns a ranked list of documents that match the query according to some model. The query is considered a one-time event, with the corpus unchanging from the time the query is entered to the time the results are returned. This contrasts with tasks such as routing or filtering, in which the corpus changes over time and new documents are constantly checked for relevance against standing queries, or question answering, in which the query is a natural-language question and the ranked list

consists of natural-language answers that are not necessarily literally attested in any document in the corpus. Known-item retrieval (Beitzel et al., 2003b), home page finding (Hawking and Craswell, 2001), and topic distillation (Craswell and Hawking, 2002) are similar to ad hoc retrieval in the basic framework, but differ in that users performing one of these tasks are looking for something more specific than information about a topic; they are looking for (respectively) a particular document, a home page, or pages that are entries into hyperlinked sites of dispersed but relevant material. Novel-item retrieval has the goal of retrieving things, often passages rather than documents, that have not been seen before (Harman, 2002) (where "before" is generally taken to mean "higher in a ranked list").

More than any other factor, the retrieval task dictates the definition of relevance. In tasks like ad hoc and routing, a document is typically considered relevant if it contains any information about the topic. Known-item retrieval and home page finding typically have one relevant document (or a very small set). For the topic distillation task, relevant pages may contain no information about the topic at all; it is sufficient for them to be a gateway to a site about the topic. Determining whether pages are novel depends on knowing what has been ranked above them (Carbonell and Goldstein, 1998), or possibly on what a user has seen in the past.

Even within the domain of a particular task, relevance can be difficult to define precisely (Mizzaro, 1997; Belkin, 1981; Saracevic et al., 1988). Perhaps the most widely-used notion of relevance is system-oriented, in which the relevance of a document to a query is dependent only on the representations of the document and query, and not on any external features of the user that submitted the query, the state of the corpus, other documents in a ranking, and so on (Lavrenko, 2004). This sort of definition clearly facilitates repeatable system-based evaluations, though other definitions that make relevance ranking-dependent or corpus-dependent certainly do not make such evaluations impossible. Definitions of relevance that are conditioned to a greater degree on the user result in experiments that are not as easily repeatable, as a user's needs can evolve in time even as the corpus is considered static.

Then there is the question of whether relevance is binary, graded, continuous, or preferential. Topical relevance is typically defined on a binary scale: a document is either relevant (on-topic) or it is not. On the web, where queries are very short, often ambiguous, and users tend not to look past the top 10 documents retrieved, relevance on a 3- or 5-point graded scale has become standard (cf. Burges et al. (2005); Carterette and Jones (2007) for web-inspired work on ranking and evaluating using graded judgments). Rorvig (1990) proposed that relevance be assessed as a relative property between two documents, which could translate into a continuous measure of utility.

This work is largely concerned with the ad hoc retrieval task and a binary, system-oriented notion of relevance. However, the methods presented can in principle be applied to other tasks, scales of relevance, and definitions of relevance that are pair-based or set-based.

Test Collections

A test collection comprises a corpus of documents to be searched, a set of information needs (usually in the form of topics or queries) for which there are relevant documents in the corpus, and relevance judgments indicating whether and to what degree each document is relevant to each information need. Some of the most frequently-used test collections in the literature are the Cranfield collection of aerodynamics technical papers, the OHSUMED corpus of medical abstracts, the CACM corpus of abstracts from Communications of the ACM, the INSPEC corpus, the TREC news corpora, consisting of articles from the New York Times, Wall Street Journal, Associated Press, and more, and the TREC GOV corpora, consisting of millions of web pages in the .gov domain. The first four of these are all under 30,000 documents; it is only in the last 20 years that large test collections have become widely available for IR research (Voorhees and Harman, 2005). Each of these has an associated set of information needs either taken from actual needs of users of the corpus or developed for the express purpose of retrieval research.

In 1980, a corpus of 11,000 documents was considered unusually large (q.v. Robertson et al., 1980). Part of the reason is the computational power available at the time, of course, but by 1990 corpus sizes had not grown significantly larger (Voorhees and Harman, 2005). I have not found any paper before 1992 using a collection of more than 30,000 documents. The limited size negatively affected the accuracy of evaluation measures (Tague, 1981) as well as the measured quality of the systems themselves (Hawking and Robertson, 2003). The corpora that were available came to be used for experiments that were far removed from what they had originally been constructed for (Robertson, 1981). There was a need for general-purpose test collections that could be shared across the community.

Putting together a test collection that is interesting for IR research is a difficult task (Tague, 1981). The documents should be heterogeneous enough that one can expect performance to generalize to other collections, but homogeneous enough that query samples will cover a fair amount of the possible space of information needs. The information needs should be representative of the space of needs users of the corpus will have, and large enough that statistically significant conclusions about systems can be reached.

The first large test collections for general research purposes were put together by NIST for the first TREC (Text REtrieval Conference) in 1992 (Harman, 1992; Voorhees and Harman, 2005). Where previous collections had been no more than several megabytes, the new TIPSTER collection was about 3 gigabytes and consisted of over a million documents (Harman, 1992). The sets of information needs were smaller (50 topics rather than the several hundred queries in previous collections), but more tightly focused, well-defined, and assembled with greater quality control (Voorhees and Harman, 2005).

With larger test collections, the problem of acquiring relevance judgments (already well-known in 1976 (Sparck Jones and van Rijsbergen)) became much harder. In the Cranfield tests, Cleverdon could afford to obtain relevance judgments on nearly all of the documents to all of the queries. With a million documents, a compromise had to be made.
Random sampling was considered but discarded because the sample would have to be a significant fraction of the collection in order to find enough relevant documents (Harman, 1992). Instead, a biased sampling method called pooling (Harman, 1992) was adopted (pooling was first proposed by Sparck Jones and van Rijsbergen (1976), and had been used with smaller collections before TREC (e.g. Salton et al., 1983)). Using the pooling method, only documents that were retrieved by some system would be judged, ensuring that judging was limited to those documents least likely to be nonrelevant.

In practice, an IR developer may have a corpus ready to be searched, for example an intranet, a large database of short texts, or a collection of books, and a set of topics that can be sampled. The problem is not in finding a corpus or topics, but in acquiring enough relevance judgments to allow them to have confidence in the evaluation.

Repeatability and Reusability

In an ideal setting, system-based evaluations over test collections are repeatable: the same experimental environment produces the same results every time. This means that published results can be verified by other research groups, and results can be directly compared across research groups and over time. Additionally, an ideal test collection is reusable: though its relevance judgments may have been collected from a particular set of systems, they should be complete enough that they can reliably evaluate new systems with properties that are as yet unknown. This way, new systems can be reliably compared to old; as well, the high cost of collecting judgments can be amortized over many experiments. Needless to say, a complete set of judgments is always reusable. Whether the pooling method creates reusable test collections was answered in the positive by Zobel (1998). Interestingly, while pooling failed to find up to 50% of the relevant documents for some topics, these missing judgments did not seriously affect comparisons between systems. However, some recent work suggests that the pooling method is not sufficient for the larger GOV corpora (Buckley et al., 2006).

Assessor Agreement

The system-oriented notion of relevance treats it as a semantic property, concerned with the meaning of the query and the document. A more user-oriented approach is more pragmatic, concerned more with what is needed by the user at the time the query was entered. In either case, relevance judgments are subject to disagreement among assessors. Cormack et al. (1998) proposed a method for acquiring relevance judgments called Interactive Searching and Judging (ISJ), by which assessors submit a query to a retrieval engine, judge retrieved documents, and then reformulate the query based on what they have learned in making those judgments. Using this method to acquire judgments for a subset of the TREC topics produced a set of judgments that overlapped only 33% with the NIST judgments. Despite this, the evaluation of systems over those topics did not substantially change; other studies have independently reached the same conclusion (Voorhees, 1998; Harter, 1996). This suggests that evaluations are rather robust to disagreement between assessors. Presumably there is much greater agreement on some documents than others, and those documents that assessors agree highly on tend to be the ones that are most useful for differentiating between systems.
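Before turning to a more extreme form of disagreement, it may help to make the pooling method discussed above concrete. The following is a minimal sketch of depth-k pool construction; the run format and the depth are illustrative assumptions, not a description of NIST's actual procedure.

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from each submitted run.

    runs: dict mapping run name -> list of doc ids in ranked order.
    Returns the set of documents to send to assessors for judging;
    documents outside the pool are conventionally treated as nonrelevant.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

# Hypothetical example: three runs over the same topic.
runs = {
    "run_a": ["d3", "d1", "d7", "d2"],
    "run_b": ["d1", "d4", "d3", "d9"],
    "run_c": ["d8", "d1", "d5", "d3"],
}
print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd3', 'd4', 'd8']
```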

Soboroff et al. (2001) investigated an extreme form of disagreement in which there is some agreement about the number of relevant documents for a query, but almost no agreement whatsoever about which documents are relevant: they formed a pool of the top N retrieved documents, as is normally done, but rather than assess any of the documents in the pool, they simply selected a random subset (with size drawn from a normal distribution) to be relevant. The result was that performance evaluated using these pseudo-rels correlated fairly highly with performance over NIST's relevance judgments. Subsequent investigation has shown that this is likely because retrieval systems tend to retrieve many relevant documents in common, but tend not to retrieve as many nonrelevant documents in common (Carterette and Allan, 2007a; Aslam and Savell, 2003). At a higher level it seems to suggest that assessor disagreement is not a great concern for experimentation.

Evaluation Measures

The choice of evaluation measure is influenced by both the definition of relevance used by the task and the attributes of system performance we want to measure (Tague-Sutcliffe, 1991; Tague, 1981). As described above, much of the work in information retrieval generally and evaluation specifically assumes that relevance is system-oriented, i.e. independent of the user. Furthermore, it is often convenient to assume relevance is binary: a document is either relevant or not relevant. Given those two assumptions, the two basic questions are "is the system good at retrieving relevant documents?" and "does the system do a good job of ranking the documents it retrieves?" These are answered by recall and precision. We may formally define them in the language of a contingency table over the corpus:

                      relevant    not relevant
    retrieved            TP            FP
    not retrieved        FN            TN

A relevant document that was retrieved is a true positive (TP). A nonrelevant document that was retrieved is a false positive (FP). Unretrieved relevant documents are false negatives (FN), and unretrieved nonrelevant documents are true negatives (TN). Precision is then defined as

$$\text{precision} = \frac{TP}{TP + FP}$$

and recall as

$$\text{recall} = \frac{TP}{TP + FN},$$

i.e. the proportion of retrieved documents that are relevant and the proportion of relevant documents that were retrieved, respectively.

Many of the standard statistics based on contingency tables ($\chi^2$; ROC curves; accuracy $= \frac{TP + TN}{TP + FP + FN + TN}$; specificity $= \frac{TN}{TN + FP}$) are inappropriate for IR research because they are partially based on counts of true negatives. True negatives dominate the highly heterogeneous corpora found in research and on the web: to a good approximation, every document is nonrelevant to any given query, and therefore getting good performance by these statistics is as simple as not retrieving any documents!
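A minimal sketch of these two definitions applied to a ranked list with a rank cutoff (the cutoff view is discussed next); the function name and data layout are illustrative.

```python
def precision_recall_at_k(ranking, relevant, k):
    """Precision and recall at rank cutoff k.

    ranking:  list of doc ids in retrieved order.
    relevant: set of doc ids judged relevant for the topic.
    Documents at ranks 1..k count as retrieved; everything else as not retrieved.
    """
    retrieved = ranking[:k]
    tp = sum(1 for d in retrieved if d in relevant)
    fp = len(retrieved) - tp
    fn = len(relevant) - tp
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical topic: three relevant documents, one run.
run = ["d4", "d9", "d1", "d7", "d2"]
rels = {"d1", "d2", "d5"}
print(precision_recall_at_k(run, rels, k=5))  # (0.4, approximately 0.667)
```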

This was not always the case; some older corpora were homogeneous enough that relevant documents outnumbered nonrelevant documents for some queries. Some work therefore criticized the reliance on precision and recall on measurement- or information-theoretic grounds and suggested alternatives that counted true negatives (Guazzo, 1977; Raghavan et al., 1989; Bollmann, 1985). Precision and recall have not yet been replaced in mainstream evaluations, though in web evaluations the precision-like measures DCG and NDCG (Jarvelin and Kekalainen, 2002), described in more detail below, have become common.

Retrieval systems typically return a ranking of documents rather than a classification of relevant or not relevant. To evaluate precision and recall, we put a threshold at some rank, for example 10, and for the purposes of the contingency table say that all documents above that rank are retrieved and all below it are not retrieved. As we increase the rank threshold, TPs and FPs increase while FNs and TNs decrease. Precision fluctuates as the rank threshold increases, tending to decrease, while recall never decreases. The precision-recall curve shows how precision changes as recall increases; it provides an intuition for the costs and benefits of increasing the rank threshold. A non-interpolated precision-recall curve can appear jagged, with precision rising each time a new relevant document is found. Interpolation smooths the curve by finding the maximum value of precision at any rank such that recall is at least one of a set of k increasing values. An 11-point curve, interpolating precision at recall levels 0, 0.1, 0.2, ..., 1.0, is typical (van Rijsbergen, 1979).

Precision and recall at multiple points give a good deal of information about the performance of a system. Nonetheless, it is sometimes desirable to combine them into a single measure. A pointwise precision+recall measure is F, defined as the weighted harmonic mean of precision and recall:

$$F_\beta = \frac{(\beta + 1)PR}{\beta P + R}$$

where P is precision at a given rank, R is recall at the same rank, and β is the amount to weight precision relative to recall (van Rijsbergen, 1979). If β = 1, F is proportional to the ratio of the area to the perimeter of the rectangle formed using the (recall, precision) point as the upper right corner (Figure 2.1 shows a rectangle that may be formed at rank 10).

Another rank-based summary statistic is R-precision, the precision at the rank equal to the total number of relevant documents. R-precision is also the point at which precision, recall, and F_1 are equal. Aslam et al. (2005) showed that R-precision is very informative in that it can be used to infer the entire precision-recall curve. However, its informativeness is reduced if relevant documents are scarce.

A convenient and informative summary statistic can be defined in terms of the precision-recall curve: average precision is the area under that curve, though it is more often defined as the mean of the precisions at each point of recall. [1]

[1] The use of the phrase "average precision" seems to be an example of semantic shift in action. Early references define it as an average of precisions at some cutoff over several topics. By 1977, average precision was a set of averages of 11-point interpolated precision over all topics (e.g. Yu and Salton, 1977). Due to the desire to report a single number over all topics, average precision soon came to mean the average of those 11 averages. In 1988, that double average was described in two different ways depending on which average was applied first: the aforementioned average of 11 averages, and the average of the 11-point interpolated precisions for each topic then averaged over all topics (e.g. Furnas et al., 1988). Thinking of it as averaging over interpolated points seems to have allowed the conceptual jump to averaging over non-interpolated points, and by the second TREC in 1993 average precision was being used as the average of precisions at points of recall for a single topic (Harman, 1993) (though the other definitions were still in use). This then required the introduction of the inelegant "mean average precision" (first seen in the literature in 1996 (Voorhees and Harman, 1996)) to describe the average of non-interpolated average precisions over a set of topics.

35 2.2. Relevance Judgments 15 precision measures both the ability of the system to rank relevant documents highly and its ability to find relevant documents, so it takes into account both precision and recall and measures them over the entire ranking, not just individual points. As a result, it is highly informative, but it loses some of the information provided by the curve or a sequence of precision-recall points. Average precision can also be seen as a weighted pairwise preference metric: it is effectively reduced every time a nonrelevant document is ranked above a relevant document. Bpref is also based on pairwise preferences, but defined more intuitively so that there is a gain when a relevant document is ranked above a nonrelevant document (Buckley and Voorhees, 2004). Unlike the other measures described here, bpref is calculated over only the documents that have been judged. Conventionally, unjudged documents are assumed to be nonrelevant for evaluation purposes; bpref does not rely on this assumption. Discounted cumulative gain (DCG) does not require binary judgments; it is designed to handle judgments of arbitrary grades. To calculate DCG, relevance grades are mapped to numeric gains, which are discounted by a function of the rank the documents were retrieved at (generally a logarithmic function (Jarvelin and Kekalainen, 2002; Burges et al., 2005), but sometimes a linear function (Huffman and Hochster, 2007)). These discounted gains are summed to a particular rank cutoff, often rank 10 to reflect the fact that users tend not to look past the top 10 documents retrieved. Dividing DCG by the maximum possible value of DCG (i.e. the DCG of a perfect ranking of documents) produces normalized discounted cumulative gain (NDCG). Other measures that have appeared in the literature include reciprocal rank (the reciprocal of the rank of the first relevant document retrieved); expected search length; fallout; normalized precision and recall; interpolated average precision; and microaveraged precision and recall. The measures presented above are the most common for evaluating ad hoc retrieval. This work is focused on ranking retrieval systems by precision, recall, DCG, NDCG, and average precision, with occasional nods to other measures. Figure 2.1 shows an example precision-recall curve. It shows the relationships between precision, recall, reciprocal rank, R-precision, average precision, F 1, and interpolated precision. Each measure provides information about the others. 2.2 Relevance Judgments With the exception of bpref, the evaluation measures described above require a certain level of completeness in the relevance judgments. To calculate precision at 5, soon came to mean the average of those 11 averages. In 1988, that double average was described in two different ways depending on which average was applied first: the aforementioned average of 11 averages, and the average of the 11-point interpolated precisions for each topic then averaged over all topics (e.g. Furnas et al., 1988). Thinking of it as averaging over interpolated points seems to have allowed the conceptual jump to averaging over non-interpolated points, and by the second TREC in 1993 average precision was being used as the average of precisions at points of recall for a single topic (Harman, 1993) (though the other definitions were still in use). 
This then required the introduction of the inelegant mean average precision (first seen in the literature in 1996 (Voorhees and Harman, 1996)) to describe the average of non-interpolated average precisions over a set of topics.
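To make the measures described in this section concrete, the following is a minimal Python sketch (not code from this dissertation) of precision, recall, average precision, and NDCG for a single ranked list under binary judgments. The document ids and judgments are invented, unjudged documents are treated as nonrelevant by the usual convention, and the log2(rank + 1) discount is just one of the published DCG variants mentioned above.

```python
# Minimal sketch of standard ad hoc evaluation measures, binary relevance assumed.
import math

def precision_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def average_precision(ranking, relevant):
    # Mean of precision at each rank where a relevant document appears,
    # normalized by the total number of relevant documents.
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def dcg_at_k(ranking, relevant, k):
    # Binary gains with the log2(rank + 1) discount.
    return sum(1.0 / math.log2(r + 1)
               for r, d in enumerate(ranking[:k], start=1) if d in relevant)

def ndcg_at_k(ranking, relevant, k):
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(k, len(relevant)) + 1))
    return dcg_at_k(ranking, relevant, k) / ideal

run = ["d3", "d1", "d7", "d2", "d9"]   # invented ranked list
qrels = {"d1", "d2", "d5"}             # invented relevant set
print(precision_at_k(run, qrels, 5), average_precision(run, qrels), ndcg_at_k(run, qrels, 5))
```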

36 16 Chapter 2. Information Retrieval Evaluation Figure 2.1. Illustration of the precision-recall curve and the measures that can be derived directly from it. The corpus in this example is only 100 documents, 33 of which are relevant. we need to know whether each of the top five documents is relevant or nonrelevant. To calculate recall, we need to know the number of relevant documents in the corpus and since documents can be relevant without containing any query term, this could require judging every document in the corpus. Since a large collection will necessarily have many unjudged documents, unjudged documents are assumed to be nonrelevant by convention, but any measure that takes recall into consideration (e.g. average precision, F, R-precision) must be considered at best an estimate: Zobel (1998) showed that it is rare to find all of the relevant documents for a query in a large corpus. As discussed above, the high cost of acquiring relevance judgments made experiments on large corpora difficult until the first TREC conference in By providing a large corpus of documents and topics and soliciting retrieval results

37 2.2. Relevance Judgments 17 from research groups using diverse algorithms (both manual and automatic), it became possible to get a sense of which documents were likely to be retrieved by any arbitrary system. A large set of relevance judgments could then be acquired simply by judging the documents that were retrieved by the submitted systems. Specifically, TREC used the pooling method (Sparck Jones and van Rijsbergen, 1976): the top N documents retrieved by each system are pooled, then every document in the pool is judged for relevance. The pool depth has generally been taken to be N = 100, though in some cases it is reduced to N = 50 or even N = 10. In 1997, retrieval results over 50 topics resulted in 72,270 relevance judgments. That is only 0.26 percent of the 27.8 million judgments that could be made for the 50 topics against the entire 556,000-document corpus. Despite being such a small fraction, it would take an assessor that can make three judgments per minute 17 days of around-the-clock work to produce that many judgments! This is quite infeasible for research groups without the resources of NIST. As a result, larger test collections have sparked an investigation into reduced effort in relevance judgments. We distinguish between three types of reduced-effort studies: those that are aimed at collecting relevance judgments with less effort than TREC, those that attempt to automate evaluation to some degree (with or without relevance judgments), and those that attempt to estimate evaluation measures given limited sets of judgments. The work proposed herein intersects all three of these areas Low-Cost Methods Although the pooling method results in a small subset of the total number of possible judgments, it seems to be more than sufficient for research purposes (Zobel, 1998; Voorhees, 1998). This suggests that it might not be necessary to pool 100 documents. Pools of depth 20, 10, or even 5 result in good approximations to evaluations with a pool of depth 50 or 100 (Carterette and Allan, 2005), though whether these smaller collections would be able to be reused to evaluate unseen systems is an open question. Another option is to construct topics such that only a subset of the collection could be relevant. An example is restricting topics to events that happened between certain dates, as some queries in the GALE evaluation do (Kumaran and Allan, 2007). Another example is known item retrieval, in which topics are defined to have only one relevant document (Beitzel et al., 2003b). These sorts of topics do not provide enough variance to allow us to make general statements about differences between systems (Sparck Jones and van Rijsbergen, 1976). Sanderson and Joho (2004) judged documents retrieved by a single retrieval system rather than a set of systems. As long as the system performed relatively well, this proved to provide sufficient judgments to evaluate other systems from which judgments were not acquired. Along with ISJ, Cormack et al. (1998) proposed Move-to-Front pooling (MTF). MTF imposes a priority ordering on the pool based on the assumptions that higherranked documents are more likely to be relevant, and that documents from systems that have recently discovered a relevant document are more likely to be relevant. MTF discovers relevant documents faster than traditional pooling, and achieves very good approximate evaluations with much less work. Zobel (1998) proposed a

38 18 Chapter 2. Information Retrieval Evaluation similar method that first judges a shallow pool, then, based on the results of the evaluation, extrapolates which systems and topics are likely to provide more relevant documents, and extends the pool using more documents from those. Zobel also discovers relevant documents faster than traditional pooling. These methods rely on an experimenter making the assumption that unjudged documents are nonrelevant, in which case judgments of nonrelevance are noninformative. Some recent work in this domain has treated document selection as a sampling problem, with the goal of ensuring that any errors in an evaluation due to missing judgments are distributed in such a way that the resulting measurements have no bias and minimum variance among all unbiased estimators (Aslam et al., 2006; Aslam and Pavlu, 2008). In this way even a very small set of judgments can produce reliable (if high-variance) estimates of performance. Carterette et al. (2006) proposed an algorithmic method for selecting documents to quickly and reliably determine whether there is a relative difference between two systems with a minimum number of judgments. The method does not rely on assumptions about the distributions of relevant documents in ranked lists or that unjudged documents are nonrelevant; instead, it uses the form of a particular evaluation measure to determine which documents are most informative. These methods will be developed in detail in this work Automatic and Semi-automatic Methods Semi-automatic methods for building test collections generally involve using a small set of human judgments or other input to infer the relevance of unjudged documents. The classic example (from outside the field of IR) is the BLEU measure for machine translation evaluation Papineni et al. (2002). BLEU takes a small set of reference translations provided by human translators and compares a candidate translation against the references using a geometric mean of n-gram matches. Carterette et al. (2003) used this to evaluate retrieval systems by taking a few reference judgments and comparing unjudged documents against those. The results were not substantially different from using the reference judgments alone, however. Automatic and semi-automatic methods in IR have been more often applied in web environments in which the collection sizes are much larger and there is some external information about relevance in the form of user behavior or manual classifications. Joachims (2002, 2003); Radlinski and Joachims (2006) have considered using clicks by users to evaluate (or train) retrieval systems with no explicit relevance judgments. One method is based on merging ranked lists from two competing algorithms into a single ranked list, then counting clicks on results from one versus clicks on results from the other. Another method randomly permutes a list before presenting it to the user; this removes the presentation bias present in clicks. Carterette and Jones (2007) considered the use of clicks to evaluate sponsored search results by predicting the relevance of advertisements given clicks over the full ranking. Though the BLEU method was unsuccessful, text-based features have been used to predict relevance in other situations. Beitzel et al. (2003a) have described the use of editor-drive taxonomies such as the Open Directory Project to automate evaluation. They found that matching web document titles to categories can provide

39 2.2. Relevance Judgments 19 a reliable evaluation. Jensen et al. (2007) have followed up on this, attempting to measure exactly how much reliability is lost by transitioning from manual to automatic judgments. Buettcher et al. (2007) used document similarities as features in an SVM classifier to predict the relevance of unjudged documents, and Carterette and Allan (2007b) used similarities as features in a logistic regression classifier. Another approach is to take advantage of the similarity between systems that allows consistent evaluation results even with high assessor disagreement Soboroff et al. (2001). Aslam and Yilmaz (2006, 2007) estimate the relevance of unjudged documents by minimizing an objective function based on mean squared error. A substantially more complicated model is presented by Carterette (2007), in which the ranking performance of systems is used to generate measure-independent predictions of relevance Estimating Performance The classic approach to estimating performance in the presence of incomplete judgments is, as stated above, to simply assume that all unjudged documents are nonrelevant. If the judgments are relatively complete to begin with, this assumption is entirely reasonable. When judgments are missing in large numbers other estimation methods are required. The bpref (for binary preference) measure discussed in Section is one of the first attempts to explicitly acknowledge unjudged documents (Buckley and Voorhees, 2004). The core idea is very similar to Kendall s τ rank correlation (Kendall, 1970): count the number of document pairs in a ranked list that have been swapped from their true ordering by relevance and normalize to a range of [0, 1]. To fit the retrieval evaluation with binary judgments requires a few modifications: first, pairs of documents with the same judgment are considered to have no preference ordering and are excluded from the counts of swapped pairs. Second, only the first R judged nonrelevant documents in a ranking count (where R is the number of judged relevant documents). This prevents bpref from always being near 1 when a system ranks a few relevant documents above thousands of judged nonrelevant documents. Kendall s τ is a lower bound on average precision (Joachims, 2002), so it is not surprising that bpref correlates well with average precision. The correlation tends to break down with very small sets of judgments, however (Sekai, 2006). Yilmaz and Aslam (2006) proposed a modification to average precision called inferred average precision (infap) in which the precision at the rank of an unjudged document is taken to be a Laplacian-smoothed ratio of relevant documents to judged documents that are ranked above it. This measure is very robust to missing judgments, though it requires a smoothing parameter ɛ. A simpler approach is to simply exclude unjudged documents from the ranking entirely, shifting the ranks of subsequent documents down to simulate a ranking in which only judged documents were retrieved (Sakai, 2007). This approach seems to be superior for measures based on graded judgments in particular. The sampling method referred to above induces a minimum-variance unbiased estimator of an evaluation measure in addition to being a low-cost sampling method (Aslam et al., 2006). This method also allows the placement of confidence intervals

40 20 Chapter 2. Information Retrieval Evaluation over a measure, and can be adapted to use existing judgments (i.e. out-of-sample judgments) (Aslam and Pavlu, 2008). Carterette et al. (2006) introduced the idea of estimating an evaluation measure as an expectation over possible assignments of relevance to unjudged documents. In that work, the relevance of a document was assumed to be uniformly distributed; later work has focused on obtaining better predictions of relevance (Carterette, 2007; Carterette and Allan, 2007b). This idea is developed in more detail in this work. 2.3 Testing Hypotheses The need for significance tests was identified early in the history of retrieval evaluation (Sparck Jones and van Rijsbergen, 1976). Robertson identified the need for controlled research collections that restricted the amount of variance that would come from anything but the algorithms being tested (Robertson, 1981). Early work is very concerned with whether retrieval evaluations meet the assumptions made by parametric tests like the t-test, and as a result recommend nonparametric tests such as the sign test or the Wilcoxon signed rank test (van Rijsbergen, 1979). Subsequent work by Savoy (1997) showed that it is unlikely that retrieval measures are normally distributed over topics. Zobel (1998) and Sanderson and Zobel (2005) used subsamples of topics to argue that the Wilcoxon test is the most appropriate for information retrieval. However, as Smucker et al. (2007) point out, the Wilcoxon test is a test for the difference in median, while means are usually reported. Thus the use of the Wilcoxon test can be misleading. In fact the t-test is quite robust to violations of its assumptions (Lehmann, 1997), and Smucker et al. show that the performance of the t-test is nearly identical to the performance of the non-parametric Fisher exact distribution test. The question of topic sample size is strongly related to questions of hypothesis testing (Bodoff and Li, 2007; Carterette and Smucker, 2007). The sample size must be large enough that the variance in mean performance does not outweigh variance in system performance (Bodoff and Li, 2007), but not so large that very small differences between systems are identified as significant (Carterette and Smucker, 2007). Early work suggested that topic sample sizes of 75 were mandatory (Sparck Jones and van Rijsbergen, 1976), though subsequent work has shown that a sample size of 25 is often enough (Zobel, 1998; Buckley and Voorhees, 2000). More recent work has suggested that large samples might in fact be more desirable, since they require fewer judgments per topic (Sanderson and Zobel, 2005; Carterette and Smucker, 2007; Carterette et al., 2008b). These ideas are developed more in this work. Although the idea goes back at least as far as 1981 (q.v. Robertson, 1981), Cormack and Lyman (2006) recently described a test for significance over varying document collections. Rather than test over different topics, they test a single topic over multiple corpora. Jensen et al. (2007) considered the question of reproducibility: if a difference is significant on one set of queries, will it be significant on another set? They investigated how reproducibility is hurt by using automatic relevance judgments.

41 2.4. Summary and Directions 21 This is another work that suggested that large topic sizes with a few judgments each may be superior to the alternative. 2.4 Summary and Directions Questions of information retrieval experimental design have been of interest to the research community since very early in the history of the field of automatic IR. Until relatively recently, most work on this subject has assumed that relevance judgments need be complete enough to contain a large fraction of the relevant documents in order to accurately estimate measures like average precision with a strong recall component. Recent work has suggested that judgments of nonrelevance can be useful for both estimating evaluation measures and increasing the confidence in a comparison of systems, and that unjudged documents can sometimes simply be ignored without serious adverse effect. As corpora continue to grow, greater understanding of the effect of missing judgments will be required for unbiased evaluation or comparison of retrieval engines. The goal of this work is to take steps towards providing that understanding, backing them up with both theory and experiment. At the highest level this is a question of experimental design: how many topics are needed, and which documents need to be judged for those topics, to reach reliable conclusions about differences between retrieval engines executing a particular task on a particular corpus? To answer this, instead of testing hypotheses about evaluation measures, we will test hypotheses about the judgments themselves. The first step is acquiring judgments that will allow high-confidence conclusions to be drawn from those tests; this is the focus of the next two chapters.
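As a concrete companion to the discussion of significance testing in Section 2.3 above, the sketch below applies a paired t-test and a Wilcoxon signed-rank test to hypothetical per-topic average precision scores for two systems. The numbers are invented and scipy is assumed to be available; the point is only that the two tests answer slightly different questions (a difference in means versus a shift in location).

```python
# Illustrative paired significance tests over per-topic AP scores (invented data).
from scipy import stats

ap_sys_a = [0.32, 0.45, 0.12, 0.67, 0.50, 0.28, 0.41, 0.39, 0.22, 0.58]
ap_sys_b = [0.30, 0.40, 0.15, 0.60, 0.48, 0.25, 0.44, 0.35, 0.20, 0.55]

t_stat, t_p = stats.ttest_rel(ap_sys_a, ap_sys_b)   # paired t-test on mean difference
w_stat, w_p = stats.wilcoxon(ap_sys_a, ap_sys_b)    # signed-rank test on the median
print(f"t-test p = {t_p:.3f}, Wilcoxon p = {w_p:.3f}")
```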


43 Chapter 3 Comparative Evaluation Suppose a researcher or developer has a choice of two retrieval systems to deploy for some task. Given that she prefers the one that performs better, she must first evaluate the two systems. If there are no choices beyond those two, the degree of difference is unlikely to be a concern. The approach of least effort, then, is to simply determine which of the two is better, ignoring the magnitude and significance of the difference. Scenarios like that play out frequently in retrieval system design. Parameter tuning, for instance, results in a large set of retrieval outputs from different parameter settings; there is no reason not to take the system with the best relative performance, regardless of its absolute performance, if the chosen system must be from that set. Incremental changes to a system may determine future decisions, but whether they do or not depends on whether they had any effect in the first place. And even when the degree of difference does matter, the first question about the two systems is whether they are different at all. Questions about the degree of difference, if necessary, come next. If such changes do not affect the final ranking of documents, then there is no need to evaluate any systems: they retrieved the same documents at every rank and thus have identical performance. If the ranking is minimally affected, for example if the documents at ranks 1 and 10 are swapped but all others remain in place, then only those documents need be judged to understand the relative difference in performance. We can take advantage of the similarities between the systems to determine whether they are different with minimal judging effort. As the rankings diverge, there are more possible judgments that could be made. If, for example, the documents at ranks 1 and 10 swap, and the documents at ranks 50 and 60 swap, then all four of those documents are interesting. The question is how interesting each document is, and whether each is still interesting after having judged some others. In this case it may be trivial, but as the rankings diverge more and more, it becomes less obvious which documents should be judged and in what order. In this chapter we develop algorithmic methods for determining the documents to judge and the order to judge them in to determine whether two systems are different. The algorithms select the documents that will prove two systems are different, so that the sign of the difference can be known no matter what the relevance of the unjudged documents is. Some of these algorithms can be shown to be optimal in the sense that no other method (with the same knowledge about the distributions of systems or relevance) can be expected to require fewer judgments.
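The following sketch (an illustration of the idea above, not one of the algorithms developed in this chapter) makes the observation concrete: given two ranked lists, only the documents appearing in the top k of one system but not the other can affect a comparison at rank k, so they are the only candidates for judging.

```python
# Illustration: documents in one system's top k but not the other's are the
# only ones that can change a top-k comparison between the two systems.
def candidate_documents(run_a, run_b, k=10):
    top_a, top_b = set(run_a[:k]), set(run_b[:k])
    return top_a ^ top_b  # symmetric difference

run_a = ["d1", "d2", "d3", "d4", "d5"]   # invented rankings
run_b = ["d1", "d9", "d3", "d4", "d8"]
print(sorted(candidate_documents(run_a, run_b, k=5)))  # ['d2', 'd5', 'd8', 'd9']
```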

44 24 Chapter 3. Comparative Evaluation 3.1 Algorithmic Document Selection As suggested above, the goals of an algorithm procedure are to: 1. determine how interesting a document is for comparing two systems, and 2. determine whether existing judgments are sufficient to prove the sign of the difference. These questions are predicated on the researcher having selected an evaluation measure. In the following sections, we will show how to algorithmically select documents for comparing two systems by different retrieval measures. The algorithms have a pattern independent of measure: 1. Compute a weight for each document reflecting its effect on our knowledge of the difference. These weights are based on the positions of documents within both rankings, and in some cases on existing judgments. 2. Send the highest-weight document to an assessor for judging. The judgment gives some evidence that the difference is either positive or negative, depending on assumptions about the relevance of unjudged documents. 3. If necessary, update the weights to reflect the judgment. 4. Determine whether there is any assignment of relevance that could result in the inferior system catching up, changing the sign from what our current evidence suggests. From this section on we need to be able to unambiguously refer to a document that may have been retrieved by more than one system. In the literature, evaluation measures have nearly always been defined with documents numbered by the rank they were retrieved at. This does not work when considering two different systems: if they retrieved two different documents at rank 1, for instance, we cannot unambiguously use d 1, since it could refer to either document. The evaluation measures below are therefore defined in terms of a completely arbitrary numbering of documents: d 1 may have been retrieved at rank 1 by some system, but it is just as likely to have been retrieved at rank 1,000 or rank 1,000,000. Since evaluation measures are typically defined in terms of ranks in some way, we will need to be able to recover the rank at which a document was retrieved by a particular system. Define a ranked list (or retrieval system), denoted A, as a set of integers 1..n, n the size of the corpus, such that element i in A corresponds to the rank of document d i in that system s output. Then A i is shorthand for the rank of document d i by the system denoted A. Likewise, x i represents the relevance of document d i. Since this work is primarily concerned with binary relevance, x i {0, 1}. When d i has not been judged, x i could be either 0 or 1; we may assume either depending on what we want to know. Retrieval results usually comprise ranked lists over a set of topics or queries. If topics are independent (as they are in the ad hoc task), the judgment on document 1 for topic 1 has no bearing on the judgment on document 1 for topic 2. The corpus can effectively be treated as having size nt, where T is the number of topics; document 1 in topic 1 might be identified as d 1, while in topic 2 as d n+1. For the sake of notational clarity, Topics will not generally be specifically identified; all should be clear from context.
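As a concrete rendering of this notation (an illustrative sketch, not code from the dissertation), one can store each system's output as a map from document id to rank and keep judgments in a separate map, with None standing in for an unjudged x_i:

```python
# A[i] and B[i] give the rank of document i in each system's output; x[i] holds
# its judgment (1, 0, or None if unjudged). Document ids here are invented.
from typing import Dict, Optional

def ranks(run: list) -> Dict[str, int]:
    """Map document id -> rank (1-based) for one system's ranked list."""
    return {doc: r for r, doc in enumerate(run, start=1)}

A = ranks(["d1", "d2", "d3", "d4"])
B = ranks(["d2", "d4", "d1", "d3"])
x: Dict[str, Optional[int]] = {d: None for d in set(A) | set(B)}  # all unjudged
x["d2"] = 1  # a judgment of relevance for document d2
```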

3.2 Precision Measures

Precision measures focus on the amount of relevant material at the top of the ranked list, without regard for the amount of relevant material in the corpus. Since precision measures are among the easiest to understand, we start by detailing algorithmic selection procedures for them.

Precision

As discussed in Section 2.1.3, precision at rank k is simply the ratio of relevant documents retrieved by rank k to the number of documents retrieved. Precision is perhaps the most intuitive of measures, the easiest to implement, and the one that users seem to respond to the most. This section, which is straightforward and simple, should serve as an introduction to the methods used for more difficult measures. Precision can be expressed in terms of relevance indicators x_i as

  prec@k = (1/k) Σ_{i=1}^{n} x_i I(A_i ≤ k)    (3.1)

where I is the indicator function: I(·) = 1 if its argument is true and 0 otherwise.

For comparative evaluation, we are interested in whether there is a difference in precision between systems A and B, i.e. whether Δprec@k = prec_A@k − prec_B@k is positive, negative, or zero. We may express Δprec@k as

  Δprec@k = (1/k) Σ_{i=1}^{n} x_i (I(A_i ≤ k) − I(B_i ≤ k)).    (3.2)

It is easy to see the implication of this for selecting documents to judge: if a document is ranked within the top k by both A and B, it has no effect on Δprec@k. Likewise, if it is ranked below k by both, it has no effect. The only documents that tell us anything about Δprec@k are those that were retrieved in the top k by A but not by B or vice versa. Let w_i = I(A_i ≤ k) − I(B_i ≤ k) be the weight of document d_i; only documents with weight w_i = 1 or w_i = −1 need be judged.

Suppose we judge a few documents according to w_i and find that system A has retrieved more of the relevant ones than B. At this point we have reason to believe that A might be better than B, e.g. if all unjudged documents turn out to be nonrelevant, then Δprec@k > 0. The question is whether there is any possible assignment of relevance to the unjudged documents that could result in B catching up to A, i.e. any assignment of relevance that would result in Δprec@k ≤ 0. We can answer this by calculating a lower bound on Δprec@k (denoted Δprec@k⁻):

  Δprec@k⁻ = min_{x^n ∈ X^n} (1/k) Σ_{i=1}^{n} x_i (I(A_i ≤ k) − I(B_i ≤ k))

subject to the constraint that if x_i has been judged, its value remain fixed. Here X^n is the set of all 2^n possible assignments of relevance to the n documents in the

corpus, and x^n is a particular assignment of relevance to all n documents. If the lower bound is greater than 0, then the judgments made are sufficient to prove that A is better than B; no more judging need be done.

Calculating the bound is a 0-1 integer programming problem, so naïvely it is NP-hard. But we can make use of the weights w_i again to calculate it efficiently. The best possible outcome for B is that every unjudged document that favors B (i.e. that has w_i = −1) turns out to be relevant, and at the same time every unjudged document that favors A (i.e. that has w_i = 1) turns out to be nonrelevant. If prec_A@k − prec_B@k > 0 even in this case, there is no possible way for the sign of Δprec@k to change. This, then, is the lower bound, and it can be expressed in terms of x_i and known judgments J as:

  ΔTP@k⁻ = Σ_{i∈J} w_i x_i + Σ_{i∉J} w_i I(w_i < 0);   Δprec@k⁻ = ΔTP@k⁻ / k    (3.3)

where ΔTP@k⁻ is the lower bound on the difference in the number of true positives (relevant and retrieved documents) in the top k. It is easy to prove that this is a lower bound: if any document included in the second sum were omitted, Δprec@k⁻ would be larger. If any document not included in the second sum were to be included, Δprec@k⁻ would be no less (because that document would have w_i ≥ 0). The first sum cannot be changed, since it is entirely a function of the judged documents. An upper bound can be calculated similarly:

  ΔTP@k⁺ = Σ_{i∈J} w_i x_i + Σ_{i∉J} w_i I(w_i > 0);   Δprec@k⁺ = ΔTP@k⁺ / k    (3.4)

Given equations 3.3 and 3.4, the effect on the bounds of judging any given document d_i can be calculated:

  If w_i > 0, judging d_i nonrelevant decreases the upper bound by 1/k.
  If w_i > 0, judging d_i relevant increases the lower bound by 1/k.
  If w_i < 0, judging d_i nonrelevant increases the lower bound by 1/k.
  If w_i < 0, judging d_i relevant decreases the upper bound by 1/k.
  If w_i = 0, judging d_i has no effect.

As long as we judge only documents with w_i ≠ 0, every judgment is guaranteed to reduce the distance between the bounds. It is thus fairly easy to see that judging c documents with w_i ≠ 0 results in a minimum Δprec@k⁺ − Δprec@k⁻ compared to any other set of c documents, i.e. such an algorithm reduces the distance between bounds at least as much as any other algorithm. This result sets the stage for the results in the next few sections.

Minimal Test Collection for Difference in Precision

We would actually like a slightly better result than the one above: we would like an algorithm that determines whether Δprec@k < 0 or Δprec@k > 0 with minimal judgments. We argue that Algorithm 3.1, which alternates between selecting

documents with w_i = 1 and documents with w_i = −1, is the best that can be done without any knowledge of the distribution of relevance.

Algorithm 3.1. MTC-Precision. Minimal test collection for a difference in precision.
  1: J ← ∅
  2: for each document i do
  3:   w_i ← I(A_i ≤ k) − I(B_i ≤ k)
  4: while Δprec@k⁻ < 0 < Δprec@k⁺ do
  5:   i* ← arg max_i w_i (over unjudged documents)
  6:   j* ← arg min_i w_i (over unjudged documents)
  7:   x_{i*}, x_{j*} ← judgments on documents i* and j*
  8:   J ← J ∪ {i*, j*}

Let k_c be the number of documents that A and B retrieved in common in the top k. Let R_A be the number of relevant documents retrieved by A in the remaining k_A = k − k_c, and R_B be the same for B. Then the probability that a document selected uniformly at random from those with w_i = 1 is relevant is p_A = R_A / (k − k_c), and the probability that a document selected uniformly at random from those with w_i = −1 is relevant is p_B = R_B / (k − k_c). The expected value of the lower bound (up to the constant factor 1/k, which does not affect any of the comparisons below) after c judgments to documents selected by Algorithm 3.1 (with c/2 from A, i.e. having w_i = 1, and c/2 from B, i.e. having w_i = −1) is:

  E_MTC[ΔTP@k⁻] = Σ_{i=1}^{c/2} p_A − Σ_{i=1}^{c/2} p_B − (k − k_c − c/2)
               = (c/2) p_A − (c/2) p_B − (k − k_c − c/2).

The final term comes from the k − k_c − c/2 remaining unjudged documents with w_i = −1, which the lower bound assumes will turn out to be relevant (unjudged documents with w_i = 1 are assumed nonrelevant and contribute nothing).

Any algorithm that is not strictly alternating will either take more than c/2 documents from A and fewer than c/2 from B, or more than c/2 from B and fewer from A. We will consider what happens to the expected lower bound in both cases. If a different algorithm (denoted ALG) selects c′ more from A (and thus c′ fewer from B), the expected lower bound becomes

  E_ALG[ΔTP@k⁻] = (c/2) p_A + c′ p_A − (c/2) p_B + c′ p_B − (k − k_c − c/2) − c′.

This algorithm increases the lower bound over the alternating algorithm if and only if p_A + p_B > 1. (Recall that the goal is to increase the lower bound, i.e. to show that it is greater than zero when it is in fact true that Δprec@k > 0.)

If ALG selects c′ more from B (and c′ fewer from A), the expected lower bound becomes

  E_ALG[ΔTP@k⁻] = (c/2) p_A − c′ p_A − (c/2) p_B − c′ p_B − (k − k_c − c/2) + c′.

This algorithm increases the lower bound over the alternating algorithm if and only if p_A + p_B < 1.
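A compact Python sketch of Algorithm 3.1 for a single topic follows (an illustration under simplifying assumptions, not a definitive implementation): the judge callback stands in for the human assessor, and the bounds follow Eqs. 3.3 and 3.4.

```python
def mtc_precision(A, B, k, judge):
    """A, B map doc id -> rank (1-based); judge(doc) returns 1 or 0."""
    docs = set(A) | set(B)
    # Eq. 3.2 weights: +1 if only A ranks the doc in the top k, -1 if only B does.
    w = {d: (A.get(d, k + 1) <= k) - (B.get(d, k + 1) <= k) for d in docs}
    judged = {}

    def bounds():
        # Eqs. 3.3-3.4: unjudged w<0 docs assumed relevant for the lower bound,
        # unjudged w>0 docs assumed relevant for the upper bound.
        base = sum(w[d] * judged[d] for d in judged)
        lo = base + sum(min(w[d], 0) for d in docs if d not in judged)
        hi = base + sum(max(w[d], 0) for d in docs if d not in judged)
        return lo / k, hi / k

    lo, hi = bounds()
    while lo < 0 < hi:
        picked = False
        for target in (1, -1):  # alternate a document favoring A, then one favoring B
            pool = [d for d in docs if d not in judged and w[d] == target]
            if pool:
                judged[pool[0]] = judge(pool[0])
                picked = True
        if not picked:
            break  # nothing informative left to judge
        lo, hi = bounds()
    sign = 1 if lo > 0 else (-1 if hi < 0 else 0)
    return sign, judged
```

The alternation in the while loop mirrors lines 5-8 of Algorithm 3.1; once one bound crosses zero, the sign of the difference is proved and judging stops.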

We will not assume anything about the distributions of p_A and p_B (except that they are identically distributed, a reasonable assumption considering that the system labels can be swapped). It is possible that the distributions could be such that one of those strategies is indeed better than the alternation strategy. But there is also the upper bound to consider. The algorithm that prefers A can only decrease the expected upper bound if p_A + p_B < 1, in direct contradiction to the condition for increasing the expected lower bound. The same is true for the algorithm that prefers B. Therefore using one of these biased algorithms entails focusing on one of the bounds at the expense of the other. But if p_A and p_B are identically distributed, then Δprec@k > 0 as often as Δprec@k < 0, and focusing on one bound can only be expected to help in at most 50% of system comparisons. Thus an algorithm that shows a preference for A or B that is not based on any knowledge of p_A or p_B cannot do better than the alternating algorithm on average.

Note that this result only applies if the algorithm has no information about p_A or p_B. An algorithm that learns as judgments are made may be able to do better; we will revisit this in a later chapter.

Discounted Cumulative Gain Family

Discounted Cumulative Gain (DCG) is a precision-like measure that assigns a gain for the relevance of a document and a discount by the rank it was retrieved at. DCG was originally defined with a logarithmic discounting function, but it can be seen more generally as a family of measures, with each member defined by a gain function g(x_i) that maps values of relevance to real numbers and a discounting function d(x) that maps ranks to real numbers. Let us write

  DCG_A@k = Σ_{i=1}^{n} g(x_i)/d(A_i) · I(A_i ≤ k).

As defined by Jarvelin and Kekalainen (2002), g(x) = x and d(x; b) = 1 if x ≤ b and log_b x otherwise. A slightly modified instance uses gain function g(x) = 2^x − 1 and discounting function d(x) = log_2(x + 1) (e.g. Burges et al. (2005)). Huffman and Hochster (2007) discounted by rank rather than the log of the rank: d(x) = x. Precision at k can also be seen as a member of the DCG family, with g(x) = x and d(x) = k. In that sense, this section can be seen as a generalization of the previous one. We make a few assumptions about g and d, namely that g(0) = 0 (i.e. a nonrelevant document contributes nothing) and that d(x) is nondecreasing (i.e. greater ranks are never penalized less). Both of these assumptions hold in every published version of DCG.

As in the previous section, define a difference in DCG between two systems as

  ΔDCG@k = DCG_A@k − DCG_B@k = Σ_{i=1}^{n} g(x_i) (I(A_i ≤ k)/d(A_i) − I(B_i ≤ k)/d(B_i)).    (3.5)
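The family view lends itself to a small parameterized sketch (illustrative only, with invented graded judgments): each published variant is just a choice of gain function g and discount function d.

```python
# DCG as a family of measures parameterized by a gain g and a discount d.
import math

def dcg_family_at_k(ranking, gain, k, g, d):
    # gain[doc] is the (possibly graded) relevance of doc; missing docs get 0.
    return sum(g(gain.get(doc, 0)) / d(rank)
               for rank, doc in enumerate(ranking[:k], start=1))

# Jarvelin & Kekalainen (2002): g(x) = x, d(r) = 1 for r <= b, log_b(r) otherwise.
jk_g = lambda x: x
jk_d = lambda r, b=2: 1.0 if r <= b else math.log(r, b)
# Burges et al. (2005) style: g(x) = 2^x - 1, d(r) = log2(r + 1).
burges_g = lambda x: 2 ** x - 1
burges_d = lambda r: math.log2(r + 1)
# Precision@k as a family member (binary gains): g(x) = x, d(r) = k.
prec_g, prec_d = (lambda x: x), (lambda r, k=10: k)

run, rels = ["d1", "d2", "d3"], {"d1": 2, "d3": 1}   # invented graded judgments
print(dcg_family_at_k(run, rels, 10, burges_g, burges_d))   # Burges-style DCG@10
print(dcg_family_at_k(run, rels, 10, jk_g, jk_d))           # Jarvelin-Kekalainen DCG@10
```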

Document weights are again determined by how much they would add to ΔDCG@k if relevant:¹

  w_i = g(1) (I(A_i ≤ k)/d(A_i) − I(B_i ≤ k)/d(B_i)).    (3.6)

In this case, w_i = 0 for documents retrieved at the same rank by both systems or documents not retrieved in the top k by either system. Bounds on ΔDCG@k can be defined in terms of w_i and the judged documents J:

  ΔDCG@k⁻ = Σ_{i∈J} w_i x_i + Σ_{i∉J} w_i I(w_i < 0)    (3.7)
  ΔDCG@k⁺ = Σ_{i∈J} w_i x_i + Σ_{i∉J} w_i I(w_i > 0).    (3.8)

The intuition for the lower bound is that the best-case outcome for B is all unjudged negative-weight documents turning out to be relevant and all unjudged positive-weight documents turning out to be nonrelevant; similarly for the upper bound and the best-case outcome for A. Either case must be tempered by the weights of the judged relevant documents. Again we can describe the effect on the bounds of judging any given document d_i:

  If w_i > 0, judging d_i nonrelevant decreases the upper bound by w_i.
  If w_i > 0, judging d_i relevant increases the lower bound by w_i.
  If w_i < 0, judging d_i nonrelevant increases the lower bound by |w_i|.
  If w_i < 0, judging d_i relevant decreases the upper bound by |w_i|.
  If w_i = 0, judging d_i has no effect.

The intuition for the bounds above suggests why judging documents greedily by weight is a good strategy: the unjudged document with the greatest |w_i| moves one of the bounds the most no matter how it is judged, raising the lower bound (if w_i > 0 and it is relevant, or w_i < 0 and it is nonrelevant) or lowering the upper bound (in the opposite cases) by |w_i|. Putting these pieces together produces the greedy algorithm below.

Algorithm 3.2. MTC-DCG. Minimal test collection for a difference in DCG.
  1: J ← ∅
  2: for each document i do
  3:   w_i ← g(1) (I(A_i ≤ k)/d(A_i) − I(B_i ≤ k)/d(B_i))
  4: while ΔDCG@k⁻ < 0 < ΔDCG@k⁺ do
  5:   i* ← arg max_i |w_i| (over unjudged documents)
  6:   x_{i*} ← judgment on document i*
  7:   J ← J ∪ {i*}

When running over a set of topics, the algorithm is essentially the same. The only difference is that w_i is calculated for each document for each topic (nT weights rather than just n).

¹ Though DCG can handle graded (non-binary) judgments, we are only concerned with binary judgments here. Carterette and Jones (2007) considered the more general graded case.
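A sketch of the quantities used by Algorithm 3.2, under the same single-topic, binary-judgment assumptions as the precision sketch above: weights follow Eq. 3.6 (with g(1) = 1), the bounds follow Eqs. 3.7 and 3.8, and the next document to judge is the unjudged one with the largest |w_i|.

```python
# Illustrative building blocks for MTC-DCG; the surrounding greedy loop is the
# same shape as in the precision sketch, but selection is by largest |w_i|.
import math

def dcg_weights(A, B, k, d=lambda r: math.log2(r + 1)):
    """Eq. 3.6 with g(1) = 1: weight of each document for the DCG difference."""
    docs = set(A) | set(B)
    contrib = lambda rank: 1.0 / d(rank) if rank <= k else 0.0
    return {doc: contrib(A.get(doc, k + 1)) - contrib(B.get(doc, k + 1))
            for doc in docs}

def dcg_bounds(w, judged):
    """Eqs. 3.7-3.8: lower bound takes unjudged w<0 as relevant, upper takes w>0."""
    base = sum(w[doc] * rel for doc, rel in judged.items())
    lo = base + sum(min(wd, 0.0) for doc, wd in w.items() if doc not in judged)
    hi = base + sum(max(wd, 0.0) for doc, wd in w.items() if doc not in judged)
    return lo, hi

def next_to_judge(w, judged):
    unjudged = [doc for doc in w if doc not in judged and w[doc] != 0.0]
    return max(unjudged, key=lambda doc: abs(w[doc])) if unjudged else None
```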

50 30 Chapter 3. Comparative Evaluation Minimal Test Collection for Dierence in DCG The result for precision in Section generalizes as well. The greedy algorithm above is optimal in the sense that the sign of will be proved with the minimum expected number of judgments compared to other algorithms that have no knowledge of the distribution of relevance. The proof works essentially the same way: evaluate the change in the expected lower and upper bounds when documents are selected by some other approach. We will show that a necessary condition for a non-greedy algorithm to increase the lower bound is that either p A + p B > 1 or p A + p B < 1, and, as in Section 3.2.1, when the necessary condition for the lower bound holds, the opposite (and disjoint) condition holds for the upper bound. Let J be a set of c judged documents. Denote the expected lower bound after c iterations of Algorithm 3.2 as E MTC DCG@k. E MTC DCG@k = i J w i p A I(w i > 0) + i J w i p B I(w i < 0) + i / J w i I(w i < 0) Note that w i w j for i J, j / J. Here p A and p B are defined slightly differently than before: p A is the probability that a document selected uniformly at random from those with positive weight (i.e. more preferred by A than by B) is relevant and p B is the probability that a document with negative weight is relevant. Suppose there is an alternative algorithm that selects documents differently. Denote its expected lower bound as E ALG DCG@k. We can model a different selection process by considering the effect of replacing document i J with document j / J. Since document weights are independent, we need only consider replacing one. First consider an algorithm (denoted ALG) that selects a different set of documents, but one with the same distribution of signs of document weights. In this case, weight w i > 0 and w j > 0, or w i < 0 and w j < 0. If w i > 0 and w j > 0, then E ALG DCG@k E MTC DCG@k : since w i w j, the first sum cannot be any greater, and no other sum changes. If w i < 0 and w j < 0, then w i is taken away from the second sum and added to the third sum (since i is removed from S), and w j is removed from the third sum and added to the second sum (since it is added to S). Canceling the terms in common leaves E ALG DCG@k E MTC DCG@k = w i p B + w j p B w j + w i = w j (p B 1) w i (p B 1) = (w j w i )(p B 1) 0 because w i w j < 0. Therefore E ALG E MTC. Analogous results can be shown for the upper bound. Therefore an algorithm that selects different documents with the same set of weight signs cannot do any better than the greedy approach for either bound. Now consider the case of an algorithm (still denoted ALG) that selects different signed weights. This algorithm necessarily selects a different set of documents.

51 3.3. Recall Measures 31 If w i < 0 and w j > 0, E ALG DCG@k E MTC DCG@k = w i p B + w i + w j p A = w i (1 p B ) + w j p A If ALG outperforms MTC, then E ALG DCG@k > E MTC DCG@k w i (1 p B ) + w j p A > 0. The greedy selection mechanism guarantees that w i w j, so w i (1 p B ) w i p A > 0, which implies that p A + p B < 1. Now consider the upper bound. If w i < 0 and w j > 0, then E ALG DCG@k E MTC DCG@k = w j p A w i p B + w i = w j p A + w i (1 p B ) and E ALG DCG@k < E MTC DCG@k only if p A + p B < 1 the opposite of the condition under which the same algorithm improves the lower bound. Though the alternative is expected to do better when DCG@k > 0, it is expected to do worse when DCG@k < 0. If p A and p B are identically distributed, both cases are equally likely, and therefore it is no better than the greedy approach on average. Likewise, if w i > 0 and w j < 0, E ALG DCG@k E MTC DCG@k = w i p A + w j p B w j = w j (1 p B ) w i p A which is greater than zero only if 1 > p A + p B. And the same algorithm can only improve the upper bound if p A +p B > 1. Again assuming p A and p B to be identically distributed, an algorithm that selects fewer positive-weight documents and more negative-weight documents cannot uniformly outperform the greedy algorithm. Thus, if p A and p B are identically distributed, the greedy algorithm Alg. 3.2 does at least as well as any alternative algorithm that has no information about those distributions. This result holds for any measure that can be expressed as a member of the DCG family (subject to the constraints on the gain and discount functions mentioned above). 3.3 Recall Measures Generally speaking, recall measures are concerned with the proportion of relevant material that was retrieved at the top of the ranked list. Because they are normalized, the technical details of algorithmic selection procedures are somewhat more complicated than for precision measures: judgments interact between topics in that the normalization for a topic is dependent on judgments, so more information must be considered to compute document weights.

52 32 Chapter 3. Comparative Evaluation Recall As described in Chapter 2, recall is the proportion of relevant documents retrieved. We can express it in terms of relevance indicators x i as rec@k = 1 n i=1 x i n x i I(A i k) (3.9) It has the same numerator as precision the number of documents relevant and retrieved by rank k and therefore, when comparing systems over a single topic, the same algorithm that works for finding a difference in precision works for recall as well. We therefore initially define w i identically to how it was defined for precision: w i = I(A i k) I(B i k). For a single topic, the bounds on rec@k are essentially equivalent to Eqs. 3.3 and 3.4; the denominator can be ignored. When comparing over multiple topics, however, the denominator influences the bounds and the selection of documents. Specifically, the form of the bound depends on whether rec@k for an arbitrary topic has already been proven to be positive or negative: the denominator of the bound needs to be as small as possible when the difference has not been proved, but as large as possible when it has been, in order to put the bound as close to zero as possible. i=1 rec@k = TP@k R (3.10) rec@k = TP@k R + (3.11) where R and R + are bounds on the number of relevant documents and conditional on whether there are enough judgments to prove the difference: { R i J = x i + i / J w i I(w i < 0) if difference not proved i J x i + n J otherwise; { R + i J = x i + i / J w ii(w i > 0) if difference not proved i J x i + n J otherwise. (3.12) (3.13) Let rec@k be the average of rec@k over a set of topics. The bounds on rec@k are the means of the bounds on the individual topics. We should select the document with the greatest expected effect on these mean bounds. For a given topic for which the difference has not yet been proven, judging d i has the following effects: If w i > 0, judging d i nonrelevant decreases the upper bound by subtracting 1 from both numerator and denominator. If w i > 0, judging d i relevant increases the lower bound by adding 1 to both numerator and denominator. If w i < 0, judging d i nonrelevant increases the lower bound by adding 1 to the numerator and subtracting 1 from the denominator. If w i < 0, judging d i relevant decreases the upper bound by subtracting 1 from the numerator and adding 1 to the denominator.

53 3.3. Recall Measures 33 If w i = 0, judging d i relevant increases the denominator by 1 in both bounds. If w i = 0, judging d i nonrelevant has no effect. Unlike with precision measures, there is no symmetry between judging a document relevant and judging it nonrelevant. Also, each judgment now influences future judgments: topics for which many relevant documents are known to exist will benefit less from another judgment than topics for which few relevant documents are known. Consider the following example: we are interested in rec@5 for a particular topic and the two systems have retrieved two documents in common in the top 5. When we begin, the upper bound is 3/3 (all documents uniquely retrieved by A are relevant; none uniquely retrieved by B are) and the lower bound is 3/3 (vice versa). Judging a document from A nonrelevant will reduce the upper bound to 2/2 no change. Likewise, judging a document from B nonrelevant will increase the lower bound to 2/2. But judging a document from A relevant will increase the lower bound to 2/4 (still taking B s three unique documents, but now forced to take the new relevant one from A), and judging a document from B relevant will decrease the upper bound to 2/4. Judging some other document relevant would change both bounds, to 3/4 and 3/4 respectively. Clearly we would prefer to judge some document relevant. But since we are not (yet) assuming any prior information about the distribution of relevance, and since all judgments of nonrelevance have the same effect, our best course of action is to order documents by their effect if judged relevant, and then judge the one with the greatest effect. Note that all documents for a given topic have the same effect, but documents from some topics may have a greater effect than documents from other topics. Specifically, a document from a topic for which the two systems have more documents in common will have a greater effect. In fact, for recall it is always preferable to judge a document relevant rather than nonrelevant. We can see this by comparing the effects of judging a particular document d i on the two bounds. Consider the upper bound rec@k = TP@k /R +. Judging a document with w i < 0 relevant results in a new upper bound of: rec@k = TP@k 1 R by the fourth of the rules enumerated above. The net decrease obtained by judging document i, then, is: rec@k rec@k = TP@k R + TP@k 1 R TP@k + R+ = R + (R +. (3.14) + 1) On the other hand, judging a document with w i > 0 nonrelevant results in a new upper bound of: rec@k = TP@k 1 R + 1

54 34 Chapter 3. Comparative Evaluation by the first rule above. The net decrease in this case is: = TP@k R + TP@k 1 R + 1 TP@k + R+ = R + (R +. (3.15) 1) We can show that judging a document relevant is always better than judging a document nonrelevant by subtracting Eq from Eq. 3.14, giving 2R + ( TP@k 1) R + (R + + 1)(R + 1) which is positive as long as TP@k > 0 which it must be, or else we already have enough judgments to rule out rec@k > 0. By a similar argument (omitted), we can show that judging a document relevant always increases the lower bound more as well. Therefore let w i be the maximum of the change in the upper bound and the change in the lower bound if i is relevant: { TP@k + R w i + = max R + (R +, + 1) TP@k R R (R + 1) }. (3.16) Following the pattern of previous sections, we propose the following algorithm: Algorithm 3.3. MTC-Recall. Minimal test collection for a difference in recall. 1: J 2: while rec@k < 0 < rec@k do 3: for each unjudged document i do 4: calculate w i (Eq. 3.16) 5: i arg max i w i (over unjudged documents) 6: x i judgment on document i. 7: J J i Minimal Test Collection for Dierence in Recall We would like to prove not only that the strategy described above will reduce the distance between bounds the most on average, but also that it will do so when performed sequentially that is, when the same greedy strategy is used at time 1, time 2, and so on. It is not immediately obvious that this would be the case, since each judgment entails changing the effects at the next time step. Furthermore, as in the previous sections, we would like to show that if rec@k > 0, this greedy strategy will prove it at least as fast as any other strategy. We will use induction to show that the expected lower bound is at least as high under the greedy strategy as any other strategy that has no knowledge of the distribution of relevance. The weight calculation and document selection steps of Alg. 3.3 can be viewed as a topic selection heuristic followed by application of the alternation algorithm MTC-Precision to that topic. Therefore we will show that MTC-Recall (denoted MTC) is superior to an algorithm (denoted ALT) that uses a different topic

55 3.3. Recall Measures 35 selection heuristic but uses the alternation algorithm within a topic, which in turn is superior to one that uses the same topic selection heuristic as ALT but uses a non-alternating algorithm within topics (denoted ALG). An alternative algorithm using a different topic selection heuristic may have judged more documents for some topics but fewer for others. Let t 1 be the topic chosen for the dth judgment by our algorithm. Suppose we have previously judged c t1 documents for topic t 1. Let t 2 be a topic for which we have judged c t2 documents, the alternative has judged c t 2 documents, and c t 2 c t2. The proof is by induction. The base case is trivial, as the mean lower bound for MTC at least as large as the mean lower bound for both ALT and ALG by construction (lines 3 5 of Algorithm 3.3). Assume the induction hypothesis that E ALT rec@k E MTC rec@k after d judgments. Let E rec@k be the expectation after d + 1 judgments. Denote the expectation for a particular topic t after c t judgments as E rec@k t, and after c t + 1 judgments as E rec@k t. Suppose the (d+1)st judgment by MTC is to a document in topic t 1. As above, we will suppose that we can select a document among the first d + 1 selected by ALT that is from a different topic (denoted t 2 ) for which ALT has selected more documents than MTC. Then we may express the expected lower bound of the mean difference in recall after d + 1 judgments as the expected lower bound of the mean difference in recall after d judgments minus the lower bound of recall for either t 1 or t 2, plus the lower bound of recall for the same topic, or: E ALT rec@k E MTC rec@k = E ALT rec@k 1 T E ALT rec@k t2 + 1 T E ALT rec@k t2 E MTC rec@k + 1 T E MTC rec@k t1 1 T E MTC rec@k t1. (3.17) Now consider the expected gain of judging the (c t1 + 1)st document relevant for topic t 1 : E MTC rec@k t1 E MTC rec@k t1 ( [ ] [ ]) TP@k + 1 TP@k = p E R E + 1 R + (1 p) 0 [ ] TP@k R = pe R (R + 1) which is p times the expected weight of that document. Therefore we know that the expected gain of judging the selected document is at least as much as the expected gain from judging any other document in any other topic, including t 2, i.e. E MTC rec@k t 1 E MTC rec@k t1 E MTC rec@k t 2 E MTC rec@k t2. Now, since ALT has selected more documents from t 2 than MTC (i.e., c t 2 > c t2 ), it follows that the expected effect of the (c t 2 + 1)st document by ALT is no greater than the expected effect of the (c t2 + 1)st document by MTC, or E MTC rec@k t 2 E MTC rec@k t2 Pulling the above together: E ALT rec@k t 2 E ALT rec@k t2.

56 36 Chapter 3. Comparative Evaluation E ALT rec@k E MTC rec@k (induction hypothesis) E MTC rec@k t1 E MTC rec@k t1 E MTC rec@k t2 E MTC rec@k t2 (by construction) E ALT rec@k t2 E ALT rec@k t2 (because c t 2 c t2 ) E ALT rec@k E MTC rec@k (by Eq. 3.17). Therefore the expected lower bound after d + 1 judgments by MTC-Recall is greater than the expected lower bound after d + 1 judgments by ALT. A similar result can be shown for the expected upper bound. It is now simple to show that ALT is at least as good as an algorithm that judges documents from the same topics but does not alternate between systems; it follows directly from MTC-Precision s optimality Normalized Discounted Cumulative Gain Family Normalized Discounted Cumulative Gain (NDCG) is, as the name suggests, DCG normalized to the range [0, 1]. Like DCG, NDCG can be written in a general form: n i=1 NDCG = I(A i k)g(x i )/d(a i ) n i=1 I(P (3.18) i k)g(x i )/d(p i ) where P is the perfect ordering of documents with all relevant documents ranked above all nonrelevant documents. The difference in NDCG can then be written as: NDCG@k = 1 n ( I(Ai k) g(x i ) I(B ) i k) (3.19) N k d(a i=1 i ) d(b i ) n N k = I(P i k)g(x i )/d(p i ). i=1 Define the weight of a document w i as in Eq Note that after k relevant documents have been found, N k is fixed; additional relevant documents will neither increase nor decrease it. The bounds of NDCG@k therefore depend on the number of relevant documents known to exist. Furthermore, unlike the previous sections, we cannot calculate e.g. the lower bound by taking every document with w i < 0. Because each subsequent document contributes less to the denominator, there is a point at which taking an additional document will increase the lower bound rather than decrease it. Let W be the cutoff for document weights: if w i W, we will take document i for the lower bound; otherwise we will not. Let W + be defined likewise for the upper bound (if w i W +, we will take document i). We could now recast the bounds for DCG in terms of W and W + : DCG@k = i J DCG@k = i J w i x i + i / J w i I(w i W ) w i x i + i / J w i I(w i W + )

57 3.3. Recall Measures 37 with W = W + = 0. For NDCG, the cutoffs are not so easily defined. Instead of trying to define them, let us express the bounds of NDCG explicitly in terms of W and W + : 1 NDCG@k = min W N k 1 NDCG@k = max W + N + k i J w i x i + i / J w i I(w i W ) (3.20) w i x i + i J i / J w i I(w i W + ) (3.21) The denomators N k and N + k are normalization constants for the two bounds. Similarly to recall, they depend on how many documents have been judged relevant, which documents retrieved by either system have not been judged, and whether the difference has already been proved. Both are lower-bounded by any documents already judged relevant: N k, N + k k,r i=1 g(1)/d(i). Each then has an additional component in the number of unjudged documents with w i W or w i W + respectively. N k N + k { min{k,r} = i=1 g(1)/d(i) + V i=1 g(1)/d(r + i)i(r + i k) if difference not proven k i=1 { g(1)/d(i) otherwise min{k,r} = i=1 g(1)/d(i) + V + i=1 g(1)/d(r + i)i(r + i k) if difference not proven k i=1 g(1)/d(i) otherwise where V is the number of unjudged documents with w i W and W + is the number with w i W +. Taking these expressions together with the bounds above, we can see that calculating the bound requires minimizing or maximizing over a term that appears in the limit of a sum. It is unlikely that there is any closed-form solution to this. Nonetheless, they can be solved efficiently by a simple linear search. Describing the effect of a judgment as we did in previous sections is consequently much more difficult. As with recall, the effects are not symmetric for the bounds; unlike recall, it is not even necessarily the case that a relevant judgment is always more informative than a nonrelevant judgment. Thus evaluating the possible effect of a judgment may require the complete computation of both bounds with hypothesized judgments of both relevance and nonrelevance. As Algorithm 3.4 shows, this

58 38 Chapter 3. Comparative Evaluation is quite a bit more involved than the previous sections, but it is still efficiently computable. Algorithm 3.4. MTC-NDCG. Minimal test collection for a difference in NDCG. 1: J 2: while NDCG@k < 0 < NDCG@k do 3: for each unjudged document i do 4: calculate bounds NDCG@k, NDCG@k for both hypothesized judgments (denote these w R+ i, w R i, w N+ i, w N i for relevant upper bound, relevant lower bound, nonrelevant upperbound, and nonrelevant lower bound, respectively). 5: w i = max{wr+ i, w R i, w N+ i, w N i } 6: i arg max i w i (over unjudged documents) 7: x i judgment on document i. 8: J J i Because of the truncated normalization factor, the proof is very technical. It has been omitted for the sake of brevity. 3.4 Summary Measures We are often interested in some combination of recall and precision that tells us something about how they trade off over increasing rank cut-offs. Three such measures that are commonly used are F, R-precision, and average precision F As described in Chapter 2, F is proportional to the weighted harmonic mean of precision and recall. As such, essentially the same argument holds for F as for recall. If TP = k prec = R rec then 1 + β F = 1/prec + βrec 1 + β 1/prec + βrec = (1 + β)tp A@k (1 + β)tp B@k k + βr k + βr (1 + β) TP@k =. k + βr This is essentially recall with an extra multiplicative factor on the two variable quantities, plus a constant in the denominator. The document selection strategy is not appreciably different we just need to consider β when determining the effect of judging a document. The details are left to the reader.
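As a quick numerical check of the reduction above (a sketch with made-up counts, using the form of F_β given earlier in this work), the difference in F at rank k computed directly from precision and recall matches (1 + β)ΔTP@k / (k + βR):

```python
# With complete binary judgments, F_beta reduces to (1 + beta) * TP@k / (k + beta * R),
# so the difference between two runs is (1 + beta) * (TP_A@k - TP_B@k) / (k + beta * R).
def f_beta(tp, k, R, beta=1.0):
    prec, rec = tp / k, tp / R
    return (1 + beta) / (1 / prec + beta / rec) if tp > 0 else 0.0

k, R, beta = 10, 7, 1.0        # invented counts: rank cutoff, relevant docs, beta
tp_a, tp_b = 5, 3              # relevant-and-retrieved counts at rank k for A and B
direct = f_beta(tp_a, k, R, beta) - f_beta(tp_b, k, R, beta)
reduced = (1 + beta) * (tp_a - tp_b) / (k + beta * R)
assert abs(direct - reduced) < 1e-12
print(direct, reduced)
```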

R-Precision

R-precision is the precision at rank R, where R is the number of relevant documents. Equivalently, it is the recall at rank R, or the F_1 at rank R; as it happens, all three are equal at that point. It can be written as:

Rprec = \frac{1}{\sum x_i} \sum x_i\, I(A_i \le \sum x_i).   (3.22)

Letting R = \sum x_i, we can write the difference in R-precision as

\Delta Rprec = \frac{1}{R} \sum x_i\, (I(A_i \le R) - I(B_i \le R)).   (3.23)

As with precision and recall, it is still the case that a document placed above rank R or below rank R by both systems has no effect on a difference in R-precision; now, however, R is an unknown quantity.

If R is unknown, can ΔRprec be bounded? The idea behind the bound is similar to that used for precision and recall: all documents uniquely retrieved above a certain rank by one system are considered relevant, while the rest are considered nonrelevant. The difference is that the bound occurs at a particular rank k* near R that must be found. Finding that rank turns out not to be a difficult problem: it is simply the rank at which the proportion of unjudged, uniquely-retrieved documents is maximized. Let

k^* = \arg\max_k \frac{1}{k} \sum_{i \notin J} |I(A_i \le k) - I(B_i \le k)|

(with the constraint that k* cannot be less than the number of judged relevant documents). Then:

\overline{\Delta Rprec} = \overline{\Delta prec@k^*}; \qquad \underline{\Delta Rprec} = \underline{\Delta prec@k^*}.   (3.24)

The rank k* can be found by iterating from R to n and keeping a hash table of documents seen. Each time a new document is seen, a counter is incremented; if a previously-seen document is found, the counter is decremented. At k = n, the counter will be equal to zero (a code sketch of this scan appears below).

This simple algorithm tells us how to define w_i. The document that has the greatest effect on the bounds is the one that takes the longest to be canceled out; its net effect is the number of iterations it stays uncanceled. Another way to see this is in terms of precision at each rank: knowing that the document at rank k is relevant means that the precisions at ranks k, k+1, k+2, ..., n must be at least 1/k, 1/(k+1), 1/(k+2), ..., 1/n. Knowing that the document is nonrelevant likewise means that those precisions are at most (k-1)/k, k/(k+1), ..., (n-1)/n. As long as k* is such that the document has not been canceled, it is therefore contributing to the bounds on precision at k*. Let

w_i = |I(A_i \le k^*) - I(B_i \le k^*)| \sum_{k=1}^{n} \Big( \frac{I(A_i \le k)}{k} - \frac{I(B_i \le k)}{k} \Big) = |I(A_i \le k^*) - I(B_i \le k^*)| \, \big( (H_n - H_{A_i - 1}) - (H_n - H_{B_i - 1}) \big) = |I(A_i \le k^*) - I(B_i \le k^*)| \, (H_{B_i - 1} - H_{A_i - 1}),   (3.25)

where H_n = \sum_{i=1}^{n} 1/i is the nth harmonic number. As with DCG, w_i will be positive if i is ranked higher by A than by B. Multiplying by |I(A_i \le k^*) - I(B_i \le k^*)| ensures that the document will have some effect on the bound, though it also means that the weights must be updated each time k* changes. The update is fairly minimal, though.
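The following is a minimal sketch of the counter-based scan for k* described above; it is an illustration under the assumption that rank_a and rank_b map every document to a rank in the respective runs, not a reference implementation.

def find_kstar(rank_a, rank_b, judged, n, r_judged):
    """Return the rank k* >= r_judged maximizing the proportion of unjudged
    documents retrieved by only one of the two systems above that rank."""
    docs_at = {}                        # rank -> documents appearing at that rank
    for doc, r in list(rank_a.items()) + list(rank_b.items()):
        docs_at.setdefault(r, []).append(doc)
    seen, counter = set(), 0
    k_min = max(r_judged, 1)
    best_k, best_score = k_min, float("-inf")
    for k in range(1, n + 1):
        for doc in docs_at.get(k, []):
            if doc in judged:
                continue
            if doc in seen:             # the other system reached it: cancel out
                counter -= 1
            else:                       # uniquely retrieved so far
                seen.add(doc)
                counter += 1
        if k >= k_min and counter / k > best_score:
            best_k, best_score = k, counter / k
    # if every document is retrieved by both systems, the counter returns to 0 at k = n
    return best_k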

If we were trying to prove a difference in precision at k, this algorithm is clearly as good as the one presented in Section 3.2.1; it is the same algorithm, except that it imposes more ordering information on the document weights. That additional information simply ensures that the algorithm is likely to be optimal not only for precision at k, but for precision at all ranks from A_i to B_i - 1. The bigger that range is, the more likely the algorithm is to be optimal for whatever the true value of R is.

Average Precision

Average precision is the mean of the precisions at the ranks at which relevant documents were retrieved. By Equation 3.1, precision at rank i is \frac{1}{i} \sum_{j=1}^{n} x_j I(A_j \le i), so average precision can be written as:

AP = \frac{1}{R} \sum_{i=1}^{n} x_i\, prec@A_i = \frac{1}{R} \sum_{i=1}^{n} x_i \frac{1}{A_i} \sum_{j=1}^{n} x_j I(A_j \le A_i) = \frac{1}{R} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{I(A_j \le A_i)}{A_i} x_i x_j,   (3.26)

again letting R be the number of relevant documents, R = \sum x_i. Each pair of documents i, j is counted twice, once with I(A_j \le A_i)/A_i and once with I(A_i \le A_j)/A_j. Clearly, if i \ne j only one of those can be non-zero, and the denominator will be whichever of the two ranks is greater. Let us therefore define a_{ij} = 1/\max\{A_i, A_j\} and write average precision as:

AP = \frac{1}{R} \sum_{i=1}^{n} \sum_{j \ge i} a_{ij} x_i x_j.   (3.27)

Given a second system B, let b_{ij} = 1/\max\{B_i, B_j\} and c_{ij} = a_{ij} - b_{ij}. We can then express a difference in average precisions as:

\Delta AP = \frac{1}{R} \sum_{i=1}^{n} \sum_{j \ge i} c_{ij} x_i x_j.   (3.28)

Let \Delta SP = \sum_{i} \sum_{j \ge i} c_{ij} x_i x_j (SP is sum precision, following Aslam et al. (2006)). We will first focus on ΔSP.
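As a concrete illustration of the coefficients just defined, the sketch below builds the matrix of c_ij values from two rankings and evaluates ΔSP for a given assignment of relevance. It is a small example under the assumption that both systems rank the same n documents; it is not the experimental code used later in the chapter.

import numpy as np

def coefficient_matrix(rank_a, rank_b):
    """Upper-triangular matrix of c_ij = 1/max(A_i, A_j) - 1/max(B_i, B_j)."""
    a = 1.0 / np.maximum.outer(rank_a, rank_a)
    b = 1.0 / np.maximum.outer(rank_b, rank_b)
    return np.triu(a - b)               # keep each pair once (j >= i)

def delta_sp(C, x):
    """Delta-SP = sum over j >= i of c_ij * x_i * x_j for a 0/1 relevance vector."""
    x = np.asarray(x, dtype=float)
    return float(x @ C @ x)

# Four documents ranked differently by the two systems:
rank_a = np.array([1, 2, 3, 4])
rank_b = np.array([3, 1, 4, 2])
C = coefficient_matrix(rank_a, rank_b)
print(delta_sp(C, [1, 0, 1, 0]))

Dividing delta_sp by the number of relevant documents in x gives the corresponding ΔAP.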

To find the bounds on ΔSP, we return to the combinatorial optimization framework used earlier in this chapter. The upper bound \overline{\Delta SP} is:

\overline{\Delta SP} = \max_{x^n \in X^n} \sum_i \sum_{j \ge i} c_{ij} x_i x_j,

subject to the constraint that if i has been judged, x_i must be fixed. Note that the objective function is quadratic. It can be written in matrix-vector form as:

\max_{x \in X^n} x^\top C x,

where x is an n \times 1 vector of 0-1 relevance judgments and C is an n \times n upper-triangular matrix of coefficients c_{ij}. This can be seen as an instance of the quadratic knapsack problem, which is NP-hard in general. One approach sometimes taken in such problems is to relax the constraints by allowing the variables to be real-valued: x_i \in \mathbb{R}, 0 \le x_i \le 1. It then becomes a quadratic programming problem with linear constraints. But since C is triangular rather than negative semi-definite, the objective function is not concave, and the relaxed problem can have infinitely many solutions. As we will show, we can in fact maximize this expression as long as we are starting off with no prior judgments. It is clear, however, that the problem starts off difficult.

Suppose we assume all unjudged documents are nonrelevant, as is conventional. Consider the effect of judging a document i relevant: it would either increase or decrease ΔSP by an amount determined by the coefficients c_{ij}. The very first document judged relevant (call it d_1) would add only c_{11} to ΔSP. Because of the quadratic nature of ΔSP, the second (d_2) would add c_{22} + c_{12}. The third would add c_{33} + c_{13} + c_{23}, and so on. Thus if R is a set of k relevant documents, judging an additional document i relevant would add c_{ii} + \sum_{k \in R} c_{ik} to ΔSP.

We could also assume all documents are relevant, which is unconventional but interesting to consider. Judging a document nonrelevant then has an effect: judging document i nonrelevant removes c_{ii} from ΔSP, but also c_{i1}, c_{i2}, c_{i3}, \ldots, c_{in}, since if x_i = 0 then x_1 x_i = 0, x_2 x_i = 0, and so on. In general, then, if N is a set of k nonrelevant documents, judging a new document i nonrelevant will subtract c_{ii} + \sum_{k \notin N} c_{ik}.

This suggests an algorithm: for each document, assign a relevant weight w_i^R = c_{ii} + \sum_{k \in J} c_{ik} x_k and a nonrelevant weight w_i^N = c_{ii} + \sum_{k \in J} c_{ik} x_k + \sum_{k \notin J} c_{ik}. Without knowing anything about the distribution of relevance, we cannot favor selecting documents by relevant weight or nonrelevant weight; instead we assign w_i = \max\{|w_i^R|, |w_i^N|\} and judge the document with maximum w_i. If the document is judged relevant, it is added to the set R; if not, it is added to the set N. The weights are then updated accordingly.

This is shown in pseudo-code in Algorithm 3.5.

Algorithm 3.5. mtc. Select documents to judge to prove a difference in SP.
1: J ← ∅
2: while \underline{\Delta SP} < 0 < \overline{\Delta SP} do
3:   for each unjudged document i do
4:     w_i^R ← c_{ii} + \sum_{k \in J} c_{ik} x_k
5:     w_i^N ← c_{ii} + \sum_{k \in J} c_{ik} x_k + \sum_{k \notin J} c_{ik}
6:     w_i ← max{|w_i^R|, |w_i^N|}
7:   i* ← arg max_i w_i
8:   x_{i*} ← judgment on document i*
9:   J ← J ∪ i*

At this point we do not even know how to calculate the bounds in line 2, much less argue that this algorithm would reduce the distance between them optimally. In the following sections we will argue that a static version of this algorithm is optimal under certain conditions.

Minimal Test Collection for a Difference in Sum Precision

Though solving a quadratic knapsack problem is NP-hard in general, it may be that our values (the coefficients c_{ij}) are constrained enough that an efficient optimal algorithm exists. In this section we will prove a general optimality result about Algorithm 3.6, which is similar to mtc. Note that it does not ask for judgments and that it only calculates the relevant weight of a document.

Algorithm 3.6. mtc-max. Sort documents in non-increasing order by weight.
1: S ← ∅
2: for k in 1 to n do
3:   for each document i ∉ S do
4:     w_i ← c_{ii} + \sum_{j \in S} c_{ij}
5:   S_k ← arg max_i w_i
6: return S
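A direct transcription of Algorithm 3.6 into code may make the greedy ordering easier to follow. The sketch below assumes C is the upper-triangular coefficient matrix from the earlier ΔSP example; it is illustrative only.

import numpy as np

def mtc_max_order(C):
    """Greedy mtc-max ordering: repeatedly take the document maximizing
    w_i = c_ii + sum of c_ij over already-selected documents j."""
    n = C.shape[0]
    Csym = C + np.triu(C, 1).T          # make c_ij lookups order-insensitive
    selected, remaining = [], set(range(n))
    for _ in range(n):
        best, best_w = None, float("-inf")
        for i in remaining:
            w = Csym[i, i] + sum(Csym[i, j] for j in selected)
            if w > best_w:
                best, best_w = i, w
        selected.append(best)
        remaining.remove(best)
    return selected

The first k entries of the returned ordering form the set S_k analyzed below; taking the arg min instead of the arg max gives the mtc-min variant discussed later in this section.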

Before the proof, two lemmas establish constraints on the coefficients c_{ij}. Some additional notation will be useful: i \succ_A j \succ_A k means that i is ranked above j, which is ranked above k, by system A. This notation is intended to illustrate the relationships between the coefficients: specifically, regardless of the ranks at which the documents appear, it is true that a_{ii} > a_{ij} = a_{jj} > a_{ik} = a_{jk} = a_{kk} (where a_{ij} = 1/\max\{A_i, A_j\}). Two adjacent rankings similarly illustrate the relationships between the coefficients c_{ij}, as seen below.

Lemma 3.1. For all i, j, either c_{ii} \ge c_{ij} \ge c_{jj} or c_{jj} \ge c_{ij} \ge c_{ii}.

An intuitive understanding of this is that, given two documents i and j, their respective differences in reciprocal ranks c_{ii} and c_{jj} place bounds on their total value c_{ii} + c_{ij} + c_{jj}.

Proof. Recall that c_{ij} = a_{ij} - b_{ij}. There are four cases to consider, each a different relative ranking of i and j in A and B:

case 1: i \succ_A j and i \succ_B j;
case 2: i \succ_A j and j \succ_B i;
case 3: j \succ_A i and i \succ_B j;
case 4: j \succ_A i and j \succ_B i.

In the first case, c_{ij} = c_{jj}, since a_{ij} = a_{jj} and b_{ij} = b_{jj}. Without knowing the values of the ranks, the relationship between c_{ii} and c_{jj} is indeterminate, but it must be the case that either c_{ii} \ge c_{ij} = c_{jj} or c_{jj} = c_{ij} \ge c_{ii}.

In the second case, a_{ii} > a_{ij} and b_{ii} = b_{ij}, so c_{ii} = a_{ii} - b_{ii} > a_{ij} - b_{ij} = c_{ij}. Further, a_{jj} = a_{ij} and b_{ij} < b_{jj}, so c_{ij} = a_{ij} - b_{ij} > a_{jj} - b_{jj} = c_{jj}. Therefore c_{ii} > c_{ij} > c_{jj}.

The third case is analogous to the second, but with c_{jj} > c_{ij} > c_{ii}. The fourth is analogous to the first, but with either c_{jj} \ge c_{ij} = c_{ii} or c_{ii} = c_{ij} \ge c_{jj}. Therefore every possible ranking of documents i and j results in either c_{ii} \ge c_{ij} \ge c_{jj} or c_{jj} \ge c_{ij} \ge c_{ii}.

The second lemma establishes a relationship between coefficients based on the order in which documents are selected by the algorithm. It will be convenient to assume that documents are numbered in order of their selection by the algorithm, so that x_1 is the relevance of the first document selected, x_2 the second, and in general x_i the relevance of the ith document selected. Since the numbering of documents is entirely arbitrary, there is no loss of generality.

Lemma 3.2. If i is the ith document selected by Algorithm 3.6, then c_{ij} \ge c_{jj} for all documents j > i.

In other words, if we already have document i, taking document j will add the difference in reciprocal ranks c_{jj} and another quantity that is at least as large as c_{jj}. Note that this does not necessarily mean that c_{ij} \le c_{ii}: as Lemma 3.1 suggests, it is possible that c_{jj} = c_{ij} > c_{ii}.

Proof. The proof is by induction on i. The base case, c_{1j} \ge c_{jj}, follows because c_{11} \ge c_{jj} by lines 4 and 5 of Algorithm 3.6, and c_{11} \ge c_{1j} \ge c_{jj} by Lemma 3.1. Assume the induction hypothesis: c_{i-1,j} \ge c_{jj} for an arbitrary i > 1 and all j > i - 1. The proof continues by contradiction. Suppose c_{jj} > c_{ij}. Select an arbitrary k < i (i.e. a document selected at an earlier iteration), so that c_{kj} \ge c_{jj} and c_{ki} \ge c_{ii}, both by the induction hypothesis.

Each of these three constraints has relative rankings consistent with it:

c_{jj} > c_{ij}: j \succ_A i with i \succ_B j, or j \succ_A i with j \succ_B i.
c_{kj} \ge c_{jj}: k \succ_A j with k \succ_B j, or k \succ_A j with j \succ_B k, or j \succ_A k with j \succ_B k.
c_{ki} \ge c_{ii}: k \succ_A i with k \succ_B i, or k \succ_A i with i \succ_B k, or i \succ_A k with i \succ_B k.

Let us consider all relative rankings of i, j, k that are consistent with the above and the consequent relationship between c_{jk} and c_{ik}:

A: k \succ j \succ i, B: k \succ i \succ j, giving c_{jk} = c_{jj} > c_{ii} = c_{ik};
A: k \succ j \succ i, B: i \succ k \succ j, giving c_{jk} > c_{ik};
A: k \succ j \succ i, B: i \succ j \succ k, giving c_{jk} > c_{ik};
A: j \succ k \succ i, B: i \succ j \succ k, giving c_{jk} > c_{ik};
A: j \succ i \succ k, B: i \succ j \succ k, giving c_{jk} = c_{ik};
A: k \succ j \succ i, B: k \succ j \succ i, giving c_{jk} = c_{jj} > c_{ij} = c_{ik};
A: k \succ j \succ i, B: j \succ k \succ i, giving c_{jk} > c_{jj} > c_{ii} = c_{ik};
A: k \succ j \succ i, B: j \succ i \succ k, giving c_{jk} > c_{ik};
A: j \succ k \succ i, B: j \succ k \succ i, giving c_{jk} \ge c_{jj} > c_{ii} = c_{ik};
A: j \succ k \succ i, B: j \succ i \succ k, giving c_{jk} > c_{ik};
A: j \succ i \succ k, B: j \succ i \succ k, giving c_{jk} = c_{ik}.

No matter how the three documents are arranged, if c_{jj} > c_{ij} then c_{jk} \ge c_{ik}. Thus c_{jj} + \sum_{k \in S} c_{jk} > c_{ii} + \sum_{k \in S} c_{ik}. But if that were true, we would have selected document j at iteration i; we selected document i because c_{ii} + \sum_{k \in S} c_{ik} \ge c_{jj} + \sum_{k \in S} c_{jk} (lines 4 & 5 of Alg. 3.6). The assumption leads to a contradiction, and therefore c_{ij} \ge c_{jj}.

Now we are ready for the main result. Let p be the probability that a document is relevant, and p' be the probability that two documents are both relevant. Define the expected value of a difference in sum precision over a set of documents S as:

E[\Delta SP_S] = \sum_{i \in S} c_{ii}\, p + \sum_{i, j \in S,\, j > i} c_{ij}\, p'.

Theorem 3.1. Let \Delta SP_{S_k} be the difference in sum precisions calculated over S_k, the first k documents selected by Algorithm 3.6. Let U_k \ne S_k be another set of k documents, and let \Delta SP_{U_k} be the difference in sum precisions calculated over U_k. Then E[\Delta SP_{S_k}] \ge E[\Delta SP_{U_k}].

In other words, the first k documents selected by Algorithm 3.6 result in an expected difference in sum precisions no smaller than that of any other set of k. The proof is fairly technical, but the intuition is that there is an intermediate set S'_k that differs from S_k by only one document and that has expected difference in sum precisions E[\Delta SP_{S'_k}] such that E[\Delta SP_{S_k}] \ge E[\Delta SP_{S'_k}] \ge E[\Delta SP_{U_k}]. The trick is figuring out which document in S_k to swap out to make this true: it will be a document l that is in U_k (and not, obviously, in S_k), and whose difference in reciprocal ranks c_{ll} is greater than the difference in reciprocal ranks of all other documents in U_k that are also not in S_k. After finding that document, the result follows by applying Lemmas 3.1 and 3.2.

Proof. The proof is by induction on k. The base case is trivial: if k = 1, c_{11} \ge c_{uu} for all u by lines 4 & 5 of Alg. 3.6, so E[\Delta SP_{S_1}] = c_{11}\, p \ge c_{uu}\, p = E[\Delta SP_{U_1}]. Assume the induction hypothesis that E[\Delta SP_{S_{k-1}}] \ge E[\Delta SP_{U_{k-1}}] for all U_{k-1} \ne S_{k-1}. Note that since the documents are numbered by their algorithmic order, all documents in S_k have index at most k, while there are documents in U_k that have index greater than k. Note also that S_{k-1} \subset S_k; therefore

E[\Delta SP_{S_k}] = E[\Delta SP_{S_{k-1}}] + c_{kk}\, p + \sum_{j \in S_{k-1}} c_{jk}\, p'.

Let l be the document in U_k such that l \ge k and l has greater difference in reciprocal ranks than any other document u \in U_k with u \ge k, i.e. c_{ll} \ge c_{uu} for all u \in U_k, u \ge k. Lemma 3.1 ensures that l exists by imposing an ordering on the differences in reciprocal ranks c_{ii}. By Lemma 3.2, c_{jl} \ge c_{ll} for all j < l; it follows that c_{jl} \ge c_{ll} \ge c_{ul} \ge c_{uu} \ge c_{uv} \ge c_{vv} \ge \ldots for u, v \in U_k, u, v \notin S_k.

Let S'_k be the set obtained by replacing document k with the document l defined above, i.e. S'_k = S_k - k + l. By lines 4 & 5 of Alg. 3.6,

E[\Delta SP_{S_k}] = E[\Delta SP_{S_{k-1}}] + c_{kk}\, p + \sum_{j \in S_{k-1}} c_{jk}\, p' \ge E[\Delta SP_{S_{k-1}}] + c_{ll}\, p + \sum_{j \in S_{k-1}} c_{jl}\, p' = E[\Delta SP_{S'_k}].

Removing l from both S'_k and U_k gives E[\Delta SP_{S'_k - l}] = E[\Delta SP_{S_{k-1}}], which is no less than E[\Delta SP_{U_k - l}] by the induction hypothesis. It remains to be shown that \sum_{j \in S_{k-1}} c_{jl}\, p' \ge \sum_{u \in U_k - l} c_{ul}\, p'. Let j be an arbitrary document in S_{k-1}. If there is a document u \in U_k - l such that j = u, then c_{jl}\, p' = c_{ul}\, p'. Otherwise, c_{jl}\, p' \ge c_{ll}\, p' \ge c_{ul}\, p' by the application of Lemma 3.2 above. Then

E[\Delta SP_{S_k}] \ge E[\Delta SP_{S'_k}] = E[\Delta SP_{S_{k-1}}] + c_{ll}\, p + \sum_{j \in S_{k-1}} c_{jl}\, p' \ge E[\Delta SP_{U_k - l}] + c_{ll}\, p + \sum_{u \in U_k - l} c_{ul}\, p' = E[\Delta SP_{U_k}]

and the proof is complete.

From this general result, we can prove several things.

Corollary 3.1. Let \Delta SP^+ be the difference in sum precisions calculated over all documents with w_i > 0 in the ordered set S produced by Algorithm 3.6. Then \overline{\Delta SP} = \Delta SP^+.

Proof. Theorem 3.1 holds with p = 1, p' = 1, since for the purpose of computing a bound we can consider any document relevant.

C = \begin{pmatrix} c_{11} & c_{12} & c_{13} & \cdots & c_{1,n-2} & c_{1,n-1} & c_{1n} \\ & c_{22} & c_{23} & \cdots & c_{2,n-2} & c_{2,n-1} & c_{2n} \\ & & c_{33} & \cdots & c_{3,n-2} & c_{3,n-1} & c_{3n} \\ & & & \ddots & \vdots & \vdots & \vdots \\ & & & & c_{n-2,n-2} & c_{n-2,n-1} & c_{n-2,n} \\ & & & & & c_{n-1,n-1} & c_{n-1,n} \\ & & & & & & c_{nn} \end{pmatrix}

Figure 3.1. Coefficient matrix of sum precision with documents ordered by Algorithm 3.6. If S_2 = \{1, 2\}, then E[\Delta SP_{S_2}] is the sum of the shaded cells on the left. If S_{n-2} = \{1, 2, \ldots, n-2\}, then E[\Delta SP_{S^C_{n-2}}] is the sum of the shaded cells on the right.

Corollary 3.2. Let E_N[\Delta SP_{S_k}] = \sum_{i \in S_k} c_{ii}\, q + \sum_{i, j \in S_k,\, j > i} c_{ij}\, q', where q = 1 - p and q' = 1 - p'. Then E_N[\Delta SP_{S_k}] \ge E_N[\Delta SP_{U_k}].

Note that whereas p' \le p, since it is the probability that two documents are both relevant, q' \ge q, since it is the probability that at least one of two documents is not relevant. This does not change the proof, however.

Corollary 3.3. Let S^C_k = C - S_k be the last n - k documents in the order of Alg. 3.6. Then E[\Delta SP_{S^C_k}] \le E[\Delta SP_{U^C_k}].

Proof. E[\Delta SP_{S^C_k}] = E[\Delta SP_C] - E[\Delta SP_{S_k}] \le E[\Delta SP_C] - E[\Delta SP_{U_k}] = E[\Delta SP_{U^C_k}].

The final result of this chapter puts the above pieces together into an algorithm that selects the k documents expected to increase the lower bound of ΔSP by the greatest amount. The idea is to take the top k_1 documents with the expectation that they will be relevant, and the bottom k_2 documents with the expectation that they will be nonrelevant. The numbers k_1 and k_2 can be found by a simple linear search, looking for the maximum expected increase in \underline{\Delta SP}. The top k_1 documents increase \underline{\Delta SP} by an expected E[\Delta SP_{S_{k_1}}]; if any of those is nonrelevant, it has no effect on the lower bound. The bottom k_2 documents, if nonrelevant, increase \underline{\Delta SP} by an expected -E_N[\Delta SP_{S^C_{n-k_2}}]; any that are relevant have no effect, since the lower bound would already have included them. Therefore, by the results above,

\underline{\Delta SP} + E[\Delta SP_{S_{k_1}}] - E_N[\Delta SP_{S^C_{n-k_2}}] \ge \underline{\Delta SP} + E[\Delta SP_{U_{k_1}}] - E_N[\Delta SP_{U^C_{n-k_2}}].

We must be careful to note that this is not necessarily the optimal increase in the lower bound.

Each document in S_{k_1} that is relevant will bring with it additional coefficients: the ones that are joint with the documents in the lower bound that have not been judged nonrelevant (illustrated in Figure 3.1). Consider a boundary case: c_{ii} is small, c_{uu} is large but no greater than c_{ii}, all c_{uj} are non-negative and as large as possible, and all c_{ij} are non-positive and as small as possible. It can be shown that for A_i \ge 2, there are at most two possible values of B_u that would satisfy the constraints above. But setting A_i and B_u in such a way precludes equivalent conditions from holding symmetrically, i.e. the upper bound algorithm remains optimal. As we move away from the boundary condition, it becomes harder to place document u in system B in such a way that the lower bound algorithm is non-optimal, and there is never a way to place it such that the upper bound algorithm is non-optimal.

This argument is based on only one document in S and one in U. As the number of documents increases, the argument becomes much more complex, with constraints on the harmonic means of the ranks of documents. Nevertheless, it appears to be the case that when the lower-bound algorithm is non-optimal, the upper-bound algorithm is optimal, and vice versa. We tentatively conclude that there is no (static) algorithm lacking knowledge of p that is uniformly better than the one proposed.

Of course, we do not know in advance that the lower bound is the one we need to increase. Arguments similar to those above hold for decreasing the upper bound with an algorithm mtc-min identical to Algorithm 3.6 except for taking the arg min at line 5. The top k documents in mtc-min's ordering will have a lot of overlap with the bottom k documents in mtc-max's ordering, and vice versa. Given a budget of k documents total, binary search can be used to determine which documents to judge from the two orderings in order to both increase the lower bound and decrease the upper bound.

An Efficient Algorithm for Average Precision

After the detail above, it is not difficult to adapt the results to average precision. When taking the top k documents to set the bound, we need to consider that each additional one could potentially decrease the bound by increasing the denominator. It can be shown that in the ordering produced by Algorithm 3.6, documents should be taken as long as \max_i w_i > \Delta AP, i.e. as long as the maximum-weight document's weight is greater than the difference in average precision over the current set. This may result in the bound consisting of only one document, but that is perfectly fine.

To keep processing efficient, only four documents will be weighted and one of those four judged. The four are:

1. The unjudged document included in the upper bound with maximum weight relative to the other upper bound documents.
2. The unjudged document with maximum weight relative to the lower bound documents.
3. The unjudged document included in the lower bound with minimum weight relative to the other lower bound documents.
4. The unjudged document with minimum weight relative to the upper bound documents.

Though none of these documents is guaranteed to reduce either bound by a maximal amount, each is likely to have a significant effect on the bounds. If the alternative is computing a weight for every document against both bounds, some reduction in optimality is probably worthwhile. Recomputing bounds against judged documents using the same algorithm will produce values that are only approximate and not necessarily bounds at all. Again, we may be willing to trade optimality for efficiency, especially if one judgment cannot make a large difference.

3.5 Empirical Evaluation

While we were able to prove optimality in expectation for some of these algorithms, ultimately they will be used to evaluate real retrieval systems. We need an empirical argument to show that they are efficient in practice. In this section we evaluate the number of judgments it actually takes to prove that there is a difference in some evaluation measure between two systems.

The retrieval runs we test on are those submitted to the TREC ad hoc tracks described in Appendix A. These runs come from a variety of research groups and companies involved in search engine development, and represent a wide variety of retrieval methods. We are interested in the algorithms' performance in terms of the number of judgments it takes to prove a difference in some evaluation measure for single topics as well as averaged over sets of topics. We have chosen five measures for which to present results: precision at rank 10, recall at rank 10, DCG (with the log_2 discounting function) at rank 10, NDCG at rank 10, and average precision.

The experimental procedure is as follows:

1. Randomly select a pair of systems from one of the TREC sets.
2. Randomly select a topic (unless evaluating over the full set).
3. Run the algorithm to completion.

This process runs for each TREC set and each evaluation measure. We used the same random seed so that the same pairs and topics would be evaluated in each experiment. This allows direct comparisons of results.

A baseline algorithm that, like the various MTC algorithms in this chapter, has no knowledge of the distribution of relevance is the pooling method, in which the top N documents retrieved by both systems are pooled and every document in the pool is judged. Traditionally, the documents in a pool are served in a random order. For the strongest possible comparison, we have implemented an incremental pooling approach by which the pool is judged in descending order of retrieval rank: the top-ranked documents are judged first, followed by the second-ranked documents, and so on until rank N (a sketch of this judging order appears below). To make the comparison even stronger, we used the bounds to determine whether the difference was proven; in this way we are testing whether judging by decreasing document weight is superior to a simpler approach that judges by decreasing rank. We can also test whether the bounds provide any benefit by comparing to a randomly-ordered pool of depth N.
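The judging order used by this baseline is simple to state in code. The sketch below is an illustration under the assumption that runs are given as ranked lists of document ids; ties in best rank are broken arbitrarily.

def incremental_pool_order(run_a, run_b, depth):
    """Order pooled documents by the best rank either run gives them, to depth N."""
    best_rank = {}
    for run in (run_a, run_b):
        for rank, doc in enumerate(run[:depth], start=1):
            best_rank[doc] = min(rank, best_rank.get(doc, rank))
    return [doc for doc, _ in sorted(best_rank.items(), key=lambda kv: (kv[1], kv[0]))]

# Example:
print(incremental_pool_order(["d3", "d1", "d7"], ["d1", "d9", "d3"], depth=3))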

Table 3.1. Mean number of judgments required by MTC algorithms to prove a difference between two ranked lists over a single topic.

Because the TREC corpora are large (hundreds of thousands of documents), the retrieval systems do not rank every document. This limits the ability of the algorithms to find every relevant document. We make the simplifying assumption that the top 100 documents ranked by the two systems being compared comprise the corpus. This allows a comparison to standard depth-100 pooling. Since there is no way to impose an ordering on documents that have not been retrieved, any method would require some compromise.

Results

Table 3.1 shows the mean number of judgments required to prove that a difference between two systems exists on a single topic by five different evaluation measures. Table 3.2 shows the mean number required by the incremental pooling method. The numbers are close for the rank-truncated measures, though this is expected with a cutoff as high as rank 10. For average precision, which is calculated over the entire list, MTC requires 18-27% fewer judgments on average to prove that a difference exists. Comparing to the number of judgments in the entire unordered pool, MTC requires 37-49% fewer judgments on average.

Note that the numbers of judgments required for precision at 10 and recall at 10 are (nearly) identical, as are the numbers required for DCG@10 and NDCG@10. This is because the normalization factors are immaterial when evaluating over a single topic; the documents that prove a difference in precision are the same as those that prove a difference in recall, and vice versa. Note also that precision and recall required more judgments on average than DCG and NDCG for both methods. This is because precision and recall have no way to order the documents that are interesting: a document at rank 10 in one system and rank 11 in the other is just as informative for a difference in precision or recall as a document at rank 1 in one system and rank 1000 in the other. For DCG and NDCG, the latter is clearly more informative than the former, and the weights are able to capture that.

Figure 3.2 shows an example of the upper and lower bounds decreasing and increasing (respectively) with the number of judgments. The evaluation measure is DCG@100 (using a deeper rank cutoff to show a more gradual decline). The two systems in this comparison, clartm and pircs1, have a nearly 60% difference in DCG@100; as a result, the upper bound decreased faster than the lower bound increased.

Table 3.2. Mean number of judgments required by pooling to prove a difference between two ranked lists over a single topic.

Figure 3.2. The bounds of ΔDCG@100 at iteration j (left) and the weight of the document selected at iteration j (right) for TREC-3 systems clartm and pircs1, topic 164.

It took a little over 60 judgments for the upper bound to finally drop below zero. The weight of the document chosen at each iteration is shown in the adjacent panel. Note the steep decline: the first documents judged provide much more information than later judgments.

Evaluation Over Sets of Topics

Systems would seldom be evaluated over a single topic; it is much more common to evaluate over 50 or more topics. Table 3.3 shows the number of judgments needed to prove a difference in each of five evaluation measures over the 50 topics in each TREC data set (49 for TREC-4). Over a set of topics, recall and NDCG require more judgments than precision and DCG; this is expected from the earlier analysis. Precision, DCG, NDCG, and average precision seem to scale roughly linearly in the number of topics; in each case the number of judgments made per topic is comparable to the number made for single topics.

The numbers for recall may at first appear to be in error. They are not.

Remember that once a difference in recall has been proven for a particular topic, calculating the bound requires assuming that every unjudged document is relevant. Therefore even when the bounds for individual topics are on the correct side of zero, they are so close to zero that they provide negligible counterweight against the unproven topics that are still on the wrong side of zero. NDCG and AP, despite having normalization factors in the number of relevant documents, do not suffer from the same problem, because every document taken for the denominator affects the numerator as well, so only limited numbers of relevant documents can be assumed.

Table 3.3. Mean number of judgments required by MTC algorithms to prove a difference between two ranked lists over 50 topics.

Table 3.4. Mean number of judgments required by pooling to prove a difference between two ranked lists over 50 topics.

Table 3.4 shows the number of judgments required by the incremental pooling method to prove the difference. As expected, MTC outperforms the pooling method except in the case of recall. This is a case where incremental pooling actually does have more information than MTC: it knows that systems tend to retrieve relevant documents at higher ranks, and by judging from the top down it is able to focus its judging effort for recall better than MTC in this particular case. Note that when MTC is based on a measure that uses order information, it outperforms incremental pooling by a large margin.

Figure 3.3 shows the bounds of ΔDCG@100 changing as more judgments are made for TREC-3 systems clartm and pircs1 over all 50 topics. The rate of decrease is quite similar to that depicted in Figure 3.2, which featured the same two systems over a single topic. It is much smoother, however, because it takes place over thousands of judgments rather than dozens. The adjacent panel again shows that the first documents judged contribute by far the most, with a seemingly exponential dropoff from the first judgment to the 1000th.

Figure 3.3. The bounds of ΔDCG@100 at iteration j (left) and the weight of the document selected at iteration j (right) for TREC-3 systems clartm and pircs1.

Table 3.5. Mean number of judgments required by MTC algorithms to prove a difference between two ranked lists from the same site over 50 topics.

Within-Site Evaluation

One of the motivations of this work is to provide methods for smaller research groups to efficiently collect relevance judgments. We can simulate the application of these algorithms within a particular group by only comparing systems that were submitted by the same site: each site participating in a TREC track is allowed to submit multiple runs. These are frequently smaller variations on a method rather than the larger variations seen between sites.

Table 3.5 shows the number of judgments required to prove a difference in five evaluation measures over 50 topics when system pairs are drawn from within a randomly chosen site. Note that the numbers are smaller than those in Table 3.3 in all cases, though the judging effort is still rather significant. The recall numbers are still very high. Table 3.6 shows the results for incremental pooling. Again, the numbers are smaller than when comparing systems between sites, but significantly more than required by MTC.

Table 3.6. Mean number of judgments required by pooling to prove a difference between two ranked lists from the same site over 50 topics.

Figure 3.4. The bounds of ΔDCG@100 at iteration j (left) and the weight of the document selected at iteration j (right) for TREC-3 systems nyuir1 and nyuir2.

Figure 3.4 shows an example of the bounds changing for two very similar systems, nyuir1 and nyuir2. Though they retrieved many documents in common, one is quite a bit better than the other; the MTC methods are able to home in on the documents that prove that difference. Here the upper bound barely decreases, as judging the most informative documents nearly always results in increasing the lower bound.

Similarity Between Rankings

Another way of saying that there is less within-site variation in retrieved results than between-site variation is that systems submitted by the same site tend to be more similar to each other. Here we investigate the relationship between ranking similarity and the number of judgments required. In Appendix A, we propose a measure of similarity based on differences in reciprocal ranks (Eq. A.1).

Figure 3.5 shows the relationship between similarity and the number of judgments required for single-topic evaluation (left). Each point is a different pair of systems; pairs from different sites and pairs from the same site are plotted together. Note that the relationship is very noisy, though there is a significant negative correlation.

Figure 3.5. Relationship between the similarity between two ranked lists and the number of judgments required to prove that they have a difference in average precision (left) or mean average precision (right).

While systems that are very similar (as those from the same site are) require few judgments, those that are not can vary wildly in the number of judgments required: they can be either very far apart in terms of performance (in which case few judgments are needed) or very close together (in which case many are necessary). The right plot, over multiple topics, shows an even more pronounced negative relationship.

Figure 3.6 shows the relationship between the difference in average precision for a single topic (left), the difference in mean average precision (right), and the number of documents judged. As conjectured, a larger difference in quality results in fewer judgments needed to prove it. Of course, before making judgments we cannot know the relative difference between the systems, so this is not a useful predictive feature.

Conclusion

We have shown that the MTC methods provide a significant reduction, relative to a variation on the standard pooling method, in the number of judgments needed to prove that a difference between two systems exists by some evaluation measure. However, proving the difference still requires quite a few judgments: on average, more than 100 judgments per topic are required to prove a difference in mean average precision. The steeply declining value of judgments shown in Figures 3.2 and 3.3 suggests that it might be possible to stop judging before having proved that the difference exists and still have a high degree of confidence in the conclusion. This is the subject of the next chapter.

Figure 3.6. Relationship between the difference in average precision (left) and mean average precision (right) and the number of judgments required to prove that they are different.

76

Chapter 4

Confidence

The results of the previous chapter show that, despite being minimal, the number of judgments required to prove that two systems are different can be quite high, especially for measures that incorporate recall. But "prove" is a very high standard when only a handful of the O(2^n) possible remaining assignments of relevance could result in the sign of the difference changing; it is unlikely that the worse system will catch up. Furthermore, as the results and some of the proofs in the last chapter suggest, the marginal value of a judgment decreases dramatically. After all, each document is chosen with the expectation that a judgment will rule out as many ways for the systems to be different as possible. If a (k+1)st judgment has low value for our confidence in a comparison, it may not be worth making.

This chapter takes the first steps toward quantifying the notion of "unlikely to catch up." Within the bounds on a measure defined in the previous chapter, some values are more likely than others, depending on how many and which documents need to be relevant to achieve that value. Which values are likely and which are not determines how likely it is that more judgments would change our decision. We can use this to develop a probabilistic stopping condition for the judging process: when the likelihood of the relative difference between two systems is high, judging can stop.

4.1 Distribution of Measures Over Judgments

Determining whether a system is likely to catch up requires understanding the distribution of the values between the bounds on a measure. Since evaluation measures are calculated over judgments, all information about this distribution is contained in the distribution of possible judgments.

As an example, suppose we have a set of 10 unjudged documents and a retrieval system A for which we want to know average precision and precision@5. Though we do not know the relevance of any of the documents, we can still calculate the distributions of the two measures: each of the 2^{10} possible assignments of relevance results in a particular value of prec@5 and AP (Table 4.1); the distribution is defined by the probability of each of those assignments.

Let \phi : S \times X^n \to \mathbb{R} be an arbitrary evaluation measure, where S is the space of retrieval system results and X^n is the space of possible assignments of relevance x^n = \{x_1, x_2, \ldots, x_n\}. For a given retrieval system A \in S, the distribution of \phi over relevance judgments is P(\phi).

Table 4.1. Values of precision@5 and average precision depend on the assignment of relevance to 10 documents x^{10}_i. Assignments are numbered arbitrarily, with x^{10}_i being the ith assignment.

As Table 4.1 suggests, one way to calculate the probability that \phi = z is to simply count the number of assignments of relevance that result in \phi = z. But since there are 2^n possible assignments of relevance, this is naively infeasible. The next few sections are devoted to constructing P(\phi) without counting assignments of relevance.

As in the previous chapter, J is the set of existing judgments (|J| \ge 0). The judgments in J constrain the assignments of relevance that are possible. If, for example, document 1 has been judged relevant, then all assignments that have x_1 = 0 are impossible; they have probability 0. Every judgment reduces the number of possible assignments of relevance by half. The distribution of a measure is conditional on these judgments:

P(\phi = z \mid J) = \sum_{x^n \in X^n} P(\phi = z \mid J, x^n)\, p(X^n = x^n \mid J),   (4.1)

where X^n = \{X_1, X_2, \ldots, X_n\} is a random variable over the space X^n. Since \phi is completely determined by an assignment of relevance, P(\phi = z \mid J, x^n) is either 1 or 0, depending on whether the assignment x^n results in \phi = z. The distribution therefore reduces to the distribution of assignments of relevance.

In the previous chapter we were interested in the sign of a difference in some measure; likewise, in this chapter we will ultimately be concerned with the probability that the sign of the difference is less than, greater than, or equal to zero. The confidence in a result will be defined as

confidence = P(\Delta\phi < 0 \mid J).   (4.2)

The end goal of this chapter is to have a stopping condition for the judging process based on high confidence in the result. Before defining that, we must show how to calculate confidence for various retrieval measures.
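For small collections, Eq. 4.1 can be evaluated by brute force, which is useful for building intuition (and for checking the closed forms derived below). The sketch here enumerates all assignments consistent with the current judgments; the uniform weighting over assignments corresponds to the binomial prior introduced in the next section, and other priors can be encoded through the weight argument. It is illustrative only and feasible only for small n.

from itertools import product
from collections import Counter

def average_precision(x):
    """AP of a ranked list whose position i (1-based) has relevance x[i-1]."""
    rel = sum(x)
    if rel == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, xi in enumerate(x, start=1):
        if xi:
            hits += 1
            total += hits / rank
    return total / rel

def measure_distribution(n, judged, measure, weight=lambda x: 1.0):
    """P(measure = z | judged); judged maps positions (0-based) to 0/1."""
    dist = Counter()
    for x in product((0, 1), repeat=n):
        if any(x[i] != r for i, r in judged.items()):
            continue                     # inconsistent with existing judgments
        dist[round(measure(x), 6)] += weight(x)
    total = sum(dist.values())
    return {z: w / total for z, w in sorted(dist.items())}

# Distribution of AP over 10 unjudged documents, as in the Table 4.1 example:
print(measure_distribution(10, judged={}, measure=average_precision))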

Prior Probability of Relevance

In the previous chapter we assumed that algorithms had no information about the distribution of relevance. Though we will continue to make that assumption, Eq. 4.1 requires a prior p(x^n \mid J). A noninformative or flat prior would seem to carry the least risk, but there are at least three possible flat distributions:

1. All assignments of relevance consistent with the judgments J are equally likely, i.e. p(X^n = x^n \mid J) = 1/2^{n - |J|}. Assignments of relevance not consistent with J have probability 0.

2. All values of R, the number of relevant documents, are equally likely, and the R relevant documents are distributed uniformly at random (but consistent with J) among the documents; with no prior judgments, p(X^n = x^n) = p(x^n \mid R)\, p(R) = \binom{n}{R}^{-1} \frac{1}{n+1}, where R = \sum x_i.

3. All values of \phi that are consistent with J are equally likely.

The last is not necessarily consistent; there may be no distribution over assignments of relevance that makes all values of \phi equally likely. Furthermore, it ties the evaluation too closely to a particular measure; P(prec@5) and P(rec@5) cannot both be uniformly distributed, for instance. Between the other two, the better choice is not immediately clear.

Both lead to the same assumption about the relevance of an individual document. For the first, p_i = p(X_i = 1) must be 1/2 for all unjudged i, because if any were not 1/2 then p(x^n) \ne 1/2^n. For the second, for each value of R, the number of relevant documents in a sample of size m is distributed hypergeometrically. If the sample is of size 1, then

p(X_i = 1) = \sum_{R=0}^{n} p(X_i = 1 \mid R)\, p(R) = \sum_{R=0}^{n} \frac{1}{n+1} \frac{\binom{R}{1}\binom{n-R}{0}}{\binom{n}{1}} = \sum_{R=0}^{n} \frac{1}{n+1} \cdot \frac{R}{n} = \frac{n(n+1)/2}{n(n+1)} = \frac{1}{2}.

Under the first, however, the relevance of each document is independent and identically distributed, while under the second, the relevance of documents is interdependent. It is easy to see that the first leads to the conclusion that R, being the sum of i.i.d. Bernoulli trials, is binomially distributed; therefore we will refer to it as the binomial prior. The second explicitly assumes a uniform distribution of R, so we will call it the uniform prior.

In the following sections, we consider how the choice of prior determines the distribution of a measure. Throughout this chapter, k \le n will be used to refer to a given rank. For the most part, we will assume that there are no prior judgments, i.e. J = \emptyset.

This will simplify the presentation considerably, though it is straightforward to adapt the results to include existing judgments: n can be thought of as the number of unjudged documents, R as the number relevant among those unjudged, and k as the number unjudged in the top k retrieved (overloading notation a bit). Any existing judgments simply place invariant bounds on a measure, and the distribution is over the values between the bounds.

Distribution of Precision

As in the previous chapter, precision is perhaps the easiest measure to work with and thus serves as a good sandbox for understanding these ideas. Intuitively it is clear how precision will be distributed: if the relevance of documents is i.i.d. with probability p_i = 1/2, precision will have a binomial distribution. If the number of relevant documents is uniformly distributed, and relevant documents are distributed uniformly throughout the list, precision will be uniformly distributed as well. It is just a matter of working out the details. Plugging into Equation 4.1 gives:

P(prec@k = z \mid J) = \sum_{x^n \in X^n} P(prec@k = z \mid x^n, J)\, p(x^n \mid J).

For simplicity, let x^k be the set of judgments of the top-k-ranked items, and x^{n-k} the set of judgments on the remaining n - k items (x^n = x^k \cup x^{n-k}). We can use this to show that the distribution of precision is equal to the distribution of the number of relevant documents retrieved:

\sum_{x^n \in X^n} P(prec@k = z \mid x^n, J)\, p(x^n \mid J) = \sum_{x^k : \sum x^k = kz} p(x^k \mid J) \sum_{x^{n-k}} p(x^{n-k} \mid J) = P(TP@k = kz \mid J) \sum_{x^{n-k}} p(x^{n-k} \mid J).

The relevance of the remaining n - k documents has no effect whatsoever on precision at k. Their probabilities sum to 1, leaving us with P(prec@k = z \mid J) = P(TP@k = kz \mid J).

Under the binomial prior, TP@k is essentially a sum of k Bernoulli random variables, and therefore is binomially distributed:

P(prec@k = z \mid J) = P(TP@k = kz \mid J) = \binom{k}{kz} \Big(\frac{1}{2}\Big)^{kz} \Big(\frac{1}{2}\Big)^{k - kz},

i.e. the fraction of ways that kz relevant documents can be distributed among k unjudged documents. Precision can therefore be approximated with a normal distribution, and as k \to n, the normal approximation improves. Of course, it is a somewhat strange distribution: most of the mass is concentrated around the expectation, with very low values and very high values having near-zero probability. It seems quite risky to essentially rule out many possible values of precision without having made any judgments. We will come back to this below.
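A short computation makes the concentration of this binomial-prior distribution visible. The sketch below just tabulates the pmf of TP@k (equivalently prec@k) for k = 10; it is a numerical illustration, not part of the derivation.

from math import comb

def prec_pmf_binomial(k, tp):
    """P(TP@k = tp) under the binomial prior: Binomial(k, 1/2)."""
    return comb(k, tp) / 2**k

k = 10
for tp in range(k + 1):
    print(f"prec@{k} = {tp/k:.1f}: {prec_pmf_binomial(k, tp):.4f}")

With k = 10, the values 0.0 and 1.0 each receive probability about 0.001 while 0.5 receives about 0.246, which is exactly the riskiness discussed above.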

With the uniform prior, we are given R relevant documents, but they are distributed randomly among the unjudged documents. We need to calculate the number of ways that kz of them can fall into the top k. This is the hypergeometric distribution: of n documents of which R are relevant, the probability that kz are relevant in a sample of size k is \binom{R}{kz}\binom{n-R}{k-kz} / \binom{n}{k}. The uniform prior therefore gives:

P(prec@k = z \mid J) = P(TP@k = kz \mid J) = \sum_{R=0}^{n} P(TP@k = kz \mid J, R)\, p(R \mid J) = \sum_{R=0}^{n} \frac{1}{n+1} \binom{R}{kz}\binom{n-R}{k-kz} \Big/ \binom{n}{k} = \frac{1}{k+1}.

The final identity can be verified using the Wilf-Zeilberger (WZ) method described by Petkovsek et al. (1996, chapter 7) and implemented for the computer algebra system Maxima (de Souza et al., 2004) by Caruso (1999). Note that the final answer does not depend on n, z, or R; therefore precision is uniformly distributed under this prior. This distribution is both easy to compute and not risky in the way the binomial distribution is: whereas the binomial distribution would rule out many possible values that are in fact very likely to occur in reality, this distribution rules out nothing and gives preference to nothing.

Expectation and Variance

Having shown that precision has a distribution over possible assignments of relevance, we can see that precision is itself a discrete random variable, analogous to flipping k (biased) coins and taking the average number of heads. Like any random variable, it has an expectation and variance. The definitions of expectation and variance would compel us to sum over all possible assignments of relevance:

E[prec@k] = \sum_{x^k \in X^k} p(X^k = x^k)\, \frac{1}{k} \sum_{i=1}^{n} x_i I(A_i \le k)

Var(prec@k) = E[prec@k^2] - (E[prec@k])^2 = \sum_{x^k \in X^k} p(X^k = x^k)\, \frac{1}{k^2} \Big( \sum_{i=1}^{n} x_i I(A_i \le k) \Big)^2 - (E[prec@k])^2.

But since we have shown that precision is distributed either binomially or uniformly (depending on the prior), calculating the expectation and variance is much simpler than that. The expectation in both cases is 1/2 = \frac{1}{k} \cdot \frac{k}{2}, while the variance is \frac{1}{k^2} \cdot \frac{k}{4} for the binomial prior and \frac{1}{k^2} \cdot \frac{k^2 + 2k}{12} for the uniform prior. These expressions are given in any probability and statistics textbook, but it is worth considering their derivations in more detail: other evaluation measures do not have such standard distribution functions, and we will have to derive expectations and variances ourselves.

Precision can be written in terms of Bernoulli random trials X_i for the relevance of document i:

prec@k = \frac{1}{k} \sum_{i=1}^{n} X_i I(A_i \le k).

Note that this is identical to the definition of precision from Section 3.2.1, only substituting random variables X_i for the variables x_i. The expectation of precision is then a function of the sum of the expectations of the X_i:

E[prec@k] = \frac{1}{k} \sum_{i=1}^{n} E[X_i]\, I(A_i \le k).

We showed above that E[X_i] = 1/2 for all i under both priors. The expectation of precision is therefore a sum of k of these expectations:

E[prec@k] = \frac{1}{k} \sum \frac{1}{2} I(A_i \le k) = \frac{1}{k} \cdot \frac{k}{2} = \frac{1}{2}.

To determine the variance, we can expand the square and apply linearity of expectations to the result:

Var(prec@k) = \frac{1}{k^2} \Big( \sum_{i=1}^{n} E[X_i]\, I(A_i \le k) + \sum_{i=1}^{n} \sum_{j \ne i} E[X_i X_j]\, I(A_i \le k) I(A_j \le k) \Big) - \frac{1}{2^2}.

Since both X_i and I(A_i \le k) are either 0 or 1, squaring them is just an identity function. We know E[X_i] = 1/2. E[X_i X_j] reduces to the probability that both documents are relevant; this depends on the prior. In the binomial case, the two documents are independently relevant, so E[X_i X_j] = p_i p_j = 1/4 (where p_i = p(X_i = 1)). In the uniform case, the probability that both are relevant is a function of the probability of finding two relevant documents in a sample of two documents from a corpus of n documents with R relevant:

E[X_i X_j] = p(X_i = 1, X_j = 1) = \sum_{R=0}^{n} p(X_i = 1, X_j = 1 \mid R)\, p(R) = \sum_{R=0}^{n} \frac{1}{n+1} \frac{\binom{R}{2}\binom{n-R}{0}}{\binom{n}{2}} = \sum_{R=0}^{n} \frac{R(R-1)}{n(n+1)(n-1)} = \frac{n(n+1)(2n+1)/6 - n(n+1)/2}{n(n+1)(n-1)} = \frac{(2n-2)/6}{n-1} = \frac{1}{3}.

Therefore the variance is:

binomial: Var(prec@k) = \frac{1}{k^2}\Big(\frac{k}{2} + \frac{k^2 - k}{4}\Big) - \frac{1}{2^2} = \frac{1}{4k}

uniform: Var(prec@k) = \frac{1}{k^2}\Big(\frac{k}{2} + \frac{k^2 - k}{3}\Big) - \frac{1}{2^2} = \frac{1}{k^2} \cdot \frac{k^2 + 2k}{12},

just as stated above.
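As a sanity check on these expressions (not part of the derivation), the simulation below draws assignments of relevance under each prior and compares the empirical moments of prec@k with the formulas above. The sample size and the assumption that the top k documents occupy positions 1 through k are arbitrary choices for illustration.

import random
from statistics import mean, pvariance

def simulate_prec(n, k, prior, trials=50_000, seed=0):
    rng, vals = random.Random(seed), []
    for _ in range(trials):
        if prior == "binomial":
            x = [rng.random() < 0.5 for _ in range(n)]
        else:                              # uniform prior on R
            R = rng.randint(0, n)
            rel = set(rng.sample(range(n), R))
            x = [i in rel for i in range(n)]
        vals.append(sum(x[:k]) / k)
    return mean(vals), pvariance(vals)

n, k = 100, 10
for prior, pred_var in [("binomial", 1 / (4 * k)), ("uniform", (k**2 + 2 * k) / (12 * k**2))]:
    m, v = simulate_prec(n, k, prior)
    print(f"{prior}: mean={m:.3f} (0.500)  var={v:.4f} (predicted {pred_var:.4f})")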

Distribution of Difference in Precision

As this work is ultimately more interested in the comparative evaluation problem, the distribution of interest is really the distribution of a difference in precision between two systems over the same topic or set of topics. As it turns out, despite the two priors leading to very different posteriors on precision, their posteriors for a difference in precision are fairly similar. The distribution of the difference is:

P(\Delta prec@k = z \mid J) = P(prec_A - prec_B = z \mid J) = P(TP_A - TP_B = kz \mid J).

As illustrated in Section 3.2.1, subtracting the two precisions removes the dependencies between the systems, namely the documents that they retrieved in common. Let k' be the number of unique documents retrieved by each system, and let us overload notation a bit by using \Delta prec@k' to mean the difference in precision among the k' unique documents. Then \Delta prec@k = \Delta prec@k'.

Where we go from there depends again on which prior we have assumed. With the binomial prior, the two precisions are effectively independent:

P(\Delta TP@k' = k'z \mid J) = \sum_{z_A = 0}^{k'} P(TP_A = z_A \mid J)\, P(TP_B = z_A - k'z \mid J) = \sum_{z_A = 0}^{k'} \binom{k'}{z_A} \frac{1}{2^{k'}} \binom{k'}{z_A - k'z} \frac{1}{2^{k'}} = \binom{2k'}{k' - k'z} \frac{1}{2^{2k'}},   (4.3)

which is again a binomial distribution. The uniform prior decomposes as follows:

P(\Delta TP@k' = k'z \mid J) = \sum_{R} \sum_{z_A} P(TP_A = z_A \mid J, R)\, P(TP_B = z_A - k'z \mid J, R, z_A)\, p(R \mid J)   (4.4)

= \sum_{R} \sum_{z_A} \frac{1}{n+1} \frac{\binom{R}{z_A}\binom{n-R}{k'-z_A}}{\binom{n}{k'}} \cdot \frac{\binom{R-z_A}{z_B}\binom{n-k'-R+z_A}{k'-z_B}}{\binom{n-k'}{k'}}, \qquad z_B = z_A - k'z.   (4.5)

Again, this expression reduces to a closed form that can be verified using the WZ method.

The two distributions are illustrated in Figure 4.1. Note that both are symmetric around the same mean, but the distribution resulting from the binomial prior has heavier tails and is modeled fairly closely by a normal distribution. Despite the distribution of precision assigning low probabilities to some values, the corresponding distribution of the difference in precision actually seems quite reasonable! The uniform prior is strongly peaked at \Delta prec = 0 because there are more ways for the difference to be 0 than any other value, and all of them have equal probability.
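The contrast just described can be reproduced with a short simulation in the spirit of Figure 4.1. The sketch below assumes, purely for illustration, that each system uniquely retrieves k' documents and that the remainder of the collection is retrieved by neither; it is not the code used to produce the figure.

import random
from collections import Counter

def simulate_delta_prec(n, k_unique, prior, trials=20_000, seed=0):
    """Empirical distribution of prec_A@k' - prec_B@k' under the two priors."""
    rng, counts = random.Random(seed), Counter()
    for _ in range(trials):
        if prior == "binomial":
            x = [rng.random() < 0.5 for _ in range(n)]
        else:                               # uniform prior on R
            R = rng.randint(0, n)
            rel = set(rng.sample(range(n), R))
            x = [i in rel for i in range(n)]
        tp_a = sum(x[:k_unique])                       # A's unique documents
        tp_b = sum(x[k_unique:2 * k_unique])           # B's unique documents
        counts[(tp_a - tp_b) / k_unique] += 1
    return {z: c / trials for z, c in sorted(counts.items())}

for prior in ("binomial", "uniform"):
    print(prior, simulate_delta_prec(n=100, k_unique=5, prior=prior))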

Figure 4.1. Distribution of prec@5 and Δprec@5 under the binomial prior (top) and uniform prior (bottom). The red curves are normal distributions with the same mean and variance.

Even for a measure as simple as precision, the computation that the uniform prior requires to find the posterior is difficult (though the WZ method can verify the hypergeometric identities, it cannot find them; I found them by inspection and trial and error). Of course, it can be simulated instead,

but that shifts the computation from a one-time cost of researcher time to an ever-present cost of processor time plus researcher time waiting for the simulation to complete. For more complicated measures, however, simulation may be the only reasonable solution.

Distribution of Recall

To determine the probability that recall has a given value, we must consider that, unlike precision, the value depends on both the number of relevant documents in the top k and the total number of relevant documents. So to determine P(rec@k = 0.1), we need to sum P(TP@k = 1, R = 10), P(TP@k = 2, R = 20), \ldots, P(TP@k = z, R = 10z), z \le k.

Under the binomial prior, we can decompose recall as follows:

P(rec@k = z \mid J) = \sum_{i=0}^{k} P(TP@k = i, R = i/z \mid J) = \sum_{i=0}^{k} P(TP@k = i \mid J)\, P(R = i/z \mid J, TP@k = i).

Both TP@k and R are distributed binomially, so we can rewrite this as:

P(rec@k = z \mid J) = \sum_{i=0}^{k} \binom{k}{i} \frac{1}{2^k} \binom{n-k}{i/z - i} \frac{1}{2^{n-k}} = \frac{1}{2^n} \sum_{i=0}^{k} \binom{k}{i}\binom{n-k}{i/z - i}.   (4.6)

This sum is not properly hypergeometric and does not appear to reduce any further. As n increases, binomial distributions become increasingly normal. It is therefore reasonable to treat P(R \mid J, TP@k) as normal, making P(rec@k \mid J) a binomial mixture of normals. This is illustrated in Figure 4.2. Like precision, the distribution of recall assigns very low probabilities to some eminently reasonable possibilities.

The uniform prior again requires decomposition over R:

P(rec@k = z \mid J) = \sum_{R} P(TP@k = zR \mid R, J)\, p(R \mid J) = \sum_{R} \frac{1}{n+1} \binom{R}{zR}\binom{n-R}{k - zR} \Big/ \binom{n}{k},

i.e. a uniform mixture of hypergeometric distributions. Figure 4.2 shows this distribution. Note that rec@5 = 0 is more likely than any other value.

Expectation and Variance

As with precision (Section 4.1.2), we can rewrite Eq. 3.9 in terms of Bernoulli random trials for relevance:

rec@k = \frac{1}{\sum X_i} \sum X_i\, I(A_i \le k).   (4.7)

The expected value and variance of recall are calculated over all possible assignments of relevance:

E[rec@k] = \sum_{x^k \in X^k} p(X^k = x^k)\, \frac{1}{\sum x_i} \sum x_i\, I(A_i \le k)

Var(rec@k) = \sum_{x^k \in X^k} p(X^k = x^k) \Big( \frac{1}{\sum x_i} \sum x_i\, I(A_i \le k) \Big)^2 - (E[rec@k])^2.

It is, of course, infeasible to sum over 2^n possible assignments of relevance, but, as with precision, we can take advantage of the linearity of expectation:

E[rec@k] = \sum_i E\Big[\frac{X_i}{\sum X_i}\Big] I(A_i \le k)   (4.8)

Var(rec@k) = \sum_i E\Big[\frac{X_i}{(\sum X_i)^2}\Big] I(A_i \le k) + \sum_i \sum_{j \ne i} E\Big[\frac{X_i X_j}{(\sum X_i)^2}\Big] I(A_i \le k) I(A_j \le k) - (E[rec@k])^2.   (4.9)

But in this case, E[X_i / \sum X_i] is not as straightforward as precision's E[X_i]. It is computable, but it depends on the prior: specifically, on the prior distribution of R = \sum X_i. Consider the three expectations E[X_i / \sum X_i], E[X_i / (\sum X_i)^2], and E[X_i X_j / (\sum X_i)^2] for both priors. Clearly we need only consider the cases where X_i = 1, since when X_i = 0 the entire expression is 0. We can then see each expectation as the expectation of the reciprocal of the random variable 1 + R_{n-1} = 1 + \sum_{j \ne i} X_j, where the sum is over all variables but the X_i in the numerator.

Under the binomial prior, R_{n-1} has a binomial distribution:

E\Big[\frac{X_i}{\sum X_i}\Big] = p(X_i = 1)\, E\Big[\frac{1}{1 + R_{n-1}}\Big] + p(X_i = 0) \cdot 0 = \frac{1}{2} \sum_{R=0}^{n-1} \binom{n-1}{R} \frac{1}{2^{n-1}} \frac{1}{1 + R} = \frac{1}{2} \cdot \frac{1 - (1/2)^n}{n/2} \approx 1/n

(provable using the WZ method). The approximation is off by O(2^{-n}/n) and negligible even for relatively small n. Similarly, the other two expectations are roughly \frac{1/2}{(n/2)^2} and \frac{1/4}{(n/2)^2} respectively, both with errors of O(2^{-n}/n^2). Therefore, under the binomial prior we have:

E[rec@k] \approx \sum \frac{1}{n} I(A_i \le k) = \frac{k}{n}   (4.10)

Var(rec@k) \approx \frac{1/2}{(n/2)^2} \sum I(A_i \le k) + \frac{1/4}{(n/2)^2} \sum_i \sum_{j \ne i} I(A_i \le k) I(A_j \le k) - \Big(\frac{k}{n}\Big)^2 = \frac{k/2}{(n/2)^2} + \frac{(k^2 - k)/4}{(n/2)^2} - \Big(\frac{k}{n}\Big)^2   (4.11)

with errors in O(k\, 2^{-n}/n) and O(k^2\, 2^{-n}/n^2) respectively.

Under the uniform prior, R is uniform, so the expectation involves the reciprocal of a uniform random variable. For each value of R, we must consider the probability that X_i = 1: given a universe of n documents of which R are relevant, if we sampled just one, what is the probability it would be relevant?

E\Big[\frac{X_i}{\sum X_i}\Big] = \sum_{R=1}^{n} \frac{1}{R}\, p(R)\, p(X_i = 1 \mid R) = \sum_{R=1}^{n} \frac{1}{R} \cdot \frac{1}{n+1} \cdot \frac{\binom{R}{1}\binom{n-R}{0}}{\binom{n}{1}} = \sum_{R=1}^{n} \frac{1}{R} \cdot \frac{1}{n+1} \cdot \frac{R}{n} = \frac{1}{n+1}.

The other two expectations are similar, but not identical:

E\Big[\frac{X_i}{(\sum X_i)^2}\Big] = \sum_{R=1}^{n} \frac{1}{R^2} \cdot \frac{1}{n+1} \cdot \frac{\binom{R}{1}\binom{n-R}{0}}{\binom{n}{1}} = \frac{H_n}{n^2 + n}

E\Big[\frac{X_i X_j}{(\sum X_i)^2}\Big] = \sum_{R=1}^{n} \frac{1}{R^2} \cdot \frac{1}{n+1} \cdot \frac{\binom{R}{2}\binom{n-R}{0}}{\binom{n}{2}} = \frac{n - H_n}{n^3 - n},

where H_n = \sum_{i=1}^{n} 1/i is the nth harmonic number. Therefore, under the uniform prior, we have:

E[rec@k] = \frac{1}{n+1} \sum I(A_i \le k) = \frac{k}{n+1}

Var(rec@k) = \frac{k H_n}{n^2 + n} + \frac{(k^2 - k)(n - H_n)}{n^3 - n} - \Big(\frac{k}{n+1}\Big)^2.

Note that the expectations are similar regardless of prior, but the uniform prior results in significantly more variance.
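The closed forms above are easy to tabulate; the sketch below simply evaluates the two sets of expressions side by side for one illustrative choice of n and k (the binomial-prior expressions are the approximations of Eqs. 4.10 and 4.11).

from math import fsum

def recall_moments_binomial(n, k):
    mean = k / n
    var = (k / 2) / (n / 2) ** 2 + ((k**2 - k) / 4) / (n / 2) ** 2 - (k / n) ** 2
    return mean, var

def recall_moments_uniform(n, k):
    H_n = fsum(1 / i for i in range(1, n + 1))
    mean = k / (n + 1)
    var = k * H_n / (n**2 + n) + (k**2 - k) * (n - H_n) / (n**3 - n) - mean**2
    return mean, var

n, k = 100, 10
print("binomial:", recall_moments_binomial(n, k))
print("uniform: ", recall_moments_uniform(n, k))

For n = 100 and k = 10 the two expectations are nearly identical (0.100 versus 0.099) while the uniform-prior variance is roughly four times larger, consistent with the remark above.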

Distribution of Difference in Recall

The distribution of a difference in recall is P(\Delta rec@k = z \mid J) = P(rec_A - rec_B = z \mid J). As with precision, we can factor this:

P(rec_A - rec_B = z \mid J) = \sum_{z_1, z_2 : z_1 - z_2 = z} P(rec_A = z_1 \mid J)\, P(rec_B = z_2 \mid J, rec_A = z_1).

Since the two systems have been run on the same topic(s), the number of relevant documents is the same for both. Any relevant documents that the two systems retrieved in common in the top k cancel, so we need only consider the documents that were uniquely retrieved by each system. Under the binomial prior, we can further factor by the numbers of uniquely retrieved relevant documents. Expanding on Eq. 4.6, we have

P(rec_A - rec_B = z \mid J) = \sum_{z_1 - z_2 = z} P(rec_A = z_1 \mid J)\, P(rec_B = z_2 \mid J) = \frac{1}{2^n} \sum_{z_1 - z_2 = z} \sum_{i=0}^{k'} \sum_{j=0}^{k'} \binom{k'}{i}\binom{k'}{j}\binom{n - 2k'}{i/z_1 - i - j},   (4.12)

i.e. proportional to the number of ways to place i relevant documents in the k' uniquely retrieved by A, j relevant documents in the k' uniquely retrieved by B, and the remaining relevant documents in the n - 2k' documents retrieved by neither.

With the uniform prior we can similarly count possible placements of relevant documents, but summing over the number of relevant documents rather than the number of relevant retrieved documents:

P(rec_A - rec_B = z \mid J) = \sum_{z_1 - z_2 = z} \sum_{R=1}^{n} P(R \mid J)\, P(rec_A = z_1 \mid R, J)\, P(rec_B = z_2 \mid R, J, z_1) = \sum_{z_1 - z_2 = z} \sum_{R} \frac{1}{n+1} \frac{\binom{R}{z_1 R}\binom{n-R}{k' - z_1 R}}{\binom{n}{k'}} \cdot \frac{\binom{R - z_1 R}{z_2 R}\binom{n - k' - R + z_1 R}{k' - z_2 R}}{\binom{n - k'}{k'}}.   (4.13)

Figure 4.2 shows these distributions, simulated for k = 5 over a collection of size n = 100. Note that despite having long tails (covering every value from -1 to 1), the uniform prior gives a very high probability to the difference in recall being 0. The binomial prior does not cover all the values, but its tails are heavier; it also more closely resembles a normal distribution. Again, the binomial prior seems to have produced a more reasonable distribution for differences, despite being risky as a distribution for the measure itself.

Figure 4.2. Distribution of recall@5 and Δrecall@5 under the binomial prior (top) and uniform prior (bottom). The simulated collection size is n = 100. Red curves in the top left plot illustrate the five normal distributions being mixed.

A naïve application of the Central Limit Theorem will therefore not help. Instead, we appeal to the theory of limits of stochastic processes, specifically martingale limit theory. A martingale is a type of stochastic process in which the expectation of the ith observation, conditional on the previous i − 1 observations, is equal to the (i − 1)st observation, or E[Y_i | Y_1, ..., Y_{i-1}] = Y_{i-1} (Motwani and Raghavan, 1995). Any sequence of independent random variables (like coin flips) can be treated as a martingale, though in general martingales can be used to model sequences of

random variables with some dependence among summands, as in the quadratic equation above. A Doob martingale is a special type of martingale defined in terms of conditional expectations of a function f(X_1, X_2, ...) of random variables X_i (Dubhashi and Panconesi, 2005). We will define a Doob martingale sequence of SP_n = R · AP_n (SP being sum precision as in the previous chapter) in the number of documents in the corpus.

SP_n = \sum_i a_{ii} X_i + \sum_{j<i} a_{ij} X_i X_j
E\,SP_n = \frac{1}{2}\sum_i a_{ii} + \frac{1}{4}\sum_{j<i} a_{ij}.

(The expectation comes from applying linearity of expectations and assuming that X_i, X_j are independent; see below.) Let

Y_0 = E\,SP_n
Y_1 = E[SP_n | X_1] = a_{11}X_1 + \frac{1}{2}\sum_{j>1} a_{1j}X_1 + \frac{1}{2}\sum_{i>1} a_{ii} + \frac{1}{4}\sum_{i>j>1} a_{ij}
Y_{n-1} = E[SP_n | X_1, ..., X_{n-1}] = \sum_{i<n} a_{ii}X_i + \sum_{j<i<n} a_{ij}X_iX_j + \frac{1}{2}a_{nn} + \frac{1}{2}\sum_{j<n} a_{nj}X_j
Y_n = E[SP_n | X_1, ..., X_{n-1}, X_n] = SP_n.

The sequence {Y_0, Y_1, ..., Y_{n-1}, Y_n} is a Doob martingale. Now let Z_i = Y_i − Y_{i-1}, and define SP′_n = SP_n − E SP_n = Σ Z_i. Then the sequence {SP′_1, SP′_2, ..., SP′_n} is a martingale. We want to use martingale limit theory to show that SP′_n / (Σ Z_i^2)^{1/2} → N(0, 1). Martingale central limit theorems generally hold under three conditions (Hall and Heyde, 1980):

1. The contribution of any summand is asymptotically negligible.
2. The variance is well-behaved, i.e. it converges to a finite random variable.
3. Any summand's contribution to the variance is asymptotically bounded in n.

We will show that each of these holds for SP′_n = Σ Z_i, and its distribution therefore converges to normal.

Theorem 4.1. Let Z_{ni} = n^{-1/2} Z_i (with Z_i as defined above), and k_n = n. Let SP′_n = Σ Z_{ni}. Then {Z_{ni}, n ≥ 1, k_n ≥ i ≥ 1} is a martingale array, and SP′_n converges in distribution to a Gaussian.

Proof. To prove convergence, the three conditions above must hold. In turn:

1. The contribution of the largest summand is asymptotically negligible, i.e. max_i |Z_{ni}| →_p 0 as n → ∞. This holds trivially, as n → ∞ while |Z_i| ≤ 1.

2. The variance converges to a finite random variable, i.e. Σ_i Z_{ni}^2 →_p η^2. First, note that E[(Σ_i Z_{ni})^2] = Σ_i E[Z_{ni}^2]; this is a property of martingales in general, following from the definition of a martingale given above. Then

\sum_i Z_{ni}^2 = n^{-1}\sum_i (Y_i - Y_{i-1})^2 \to \bar{Y}^2

where \bar{Y}^2 is the limit of n^{-1}\sum_i (Y_i - Y_{i-1})^2. \bar{Y}^2 is finite: subtracting Y_{i-1} from Y_i leaves only the contribution of the ith document, which cannot be infinite. Furthermore, \bar{Y}^2 > 0 with probability 1: there is no assignment of relevance that could result in Y_i − Y_{i-1} = 0 for all i.

3. The maximum variance of any summand is bounded in n, i.e. E[max_i Z_{ni}^2] ∈ o(n). Note that Z_{ni} = n^{-1/2}(Y_i − Y_{i-1}), and the expectation of Y_i − Y_{i-1} is bounded by the contribution to SP_n of the ith document. Since the contribution of any document is a sum of reciprocal ranks, it follows that E[max_i Z_{ni}^2] ≤ H_n/n, where H_n is the nth harmonic number. Since H_n grows slower than n, H_n/n = o(n).

Since SP′_n is a martingale and the three conditions hold, it follows from Theorem 3.3 of Hall and Heyde (1980, page 64) that SP′_n / (Σ_i Z_{ni}^2)^{1/2} is normally distributed with mean 0 and variance 1.

So sum precision is normally distributed, but what about average precision? We will show that the distribution of average precision converges to the distribution of sum precision, and therefore transitively converges to a normal distribution. To do that, we will first show that for any given value of sum precision z (and corresponding value of average precision z/R obtained by dividing z by the number of relevant documents), the number of assignments of relevance that produce a sum precision less than z is equal to the number of assignments of relevance that produce an average precision less than z/R. This is not as straightforward as it first appears: SP and AP are not perfectly correlated; for a given z, there may exist an assignment of relevance x such that SP(x) < z while AP(x) > z/R.

Let x^n_1 be an assignment of relevance, and let SP_{n1} and AP_{n1} = SP_{n1}/R_{n1} be the values of sum precision and average precision obtained from that assignment. We claim that for all x^n_2 such that the resulting sum precision SP_{n2} < SP_{n1} and the resulting average precision AP_{n2} > AP_{n1}, there is, in the limit, an assignment x′^n_2 such that the resulting sum precision SP′_{n2} > SP_{n1} and the resulting average precision AP′_{n2} < AP_{n1}. Additionally, for all x^n_2 such that SP_{n2} > SP_{n1} and AP_{n2} < AP_{n1}, there is an x′^n_2 such that SP′_{n2} < SP_{n1} and AP′_{n2} > AP_{n1}. In other words, as n → ∞, the number of assignments of relevance that produce sum precisions less than SP_{n1} converges to the number of assignments of relevance that produce average precisions less than AP_{n1}.

Our argument is informal. Essentially, given assignments of relevance x^n_1, x^n_2 such that SP_{n2} < SP_{n1} and AP_{n2} > AP_{n1}, we can construct x′^n_2 from x^n_2 by turning documents at the bottom of the list relevant until SP′_{n2} > SP_{n1}. Using the lowest-ranked documents ensures that their contribution to SP′_{n2} is outweighed by their contribution to the number of relevant documents R_{n2}. As n → ∞, it becomes more and more likely that we can add enough relevant documents to make SP′_{n2} > SP_{n1} and AP′_{n2} < AP_{n1}. A similar argument holds for x^n_1, x^n_2 such that SP_{n2} > SP_{n1} and AP_{n2} < AP_{n1}, except that x′^n_2 is constructed by turning documents at the top of the list nonrelevant until SP′_{n2} < SP_{n1} and AP′_{n2} > AP_{n1}. This means that while the pairs of assignments of relevance x^n_1, x^n_2 that result in SP_{n1} < SP_{n2} may be different from those pairs that result in AP_{n1} < AP_{n2}, the number of such pairs of assignments is asymptotically the same.

If all assignments are equally likely, as they are under the binomial prior, then it should follow that the distribution of AP_n converges to that of SP_n. To prove that it does, we must show that the difference in cumulative probability asymptotically approaches zero. Let sAP_n = (AP_n − E AP_n)/\sqrt{\mathrm{Var}(AP_n)} and sSP_n = (SP_n − E SP_n)/\sqrt{\mathrm{Var}(SP_n)}. We must show that P(sAP_n ≤ z/R) − P(sSP_n ≤ z) → 0 for all z. We have already shown that SP_n ≤ z if and only if AP_n ≤ z/R in the limit. Standardization is a monotonic transformation, so this holds for the standardized versions as well. Then for an arbitrary z, any assignment of relevance that produces sSP_n ≤ z will also, in the limit, produce sAP_n ≤ z/R. Therefore P(sAP_n ≤ z/R) and P(sSP_n ≤ z) are both equal to the count of assignments of relevance that would result in sSP_n ≤ z divided by 2^n, the total number of relevance assignments, and therefore their difference is asymptotically negligible. Since AP_n → SP_n and SP_n → N, we conclude that AP_n → N as well.

Figure 4.3 shows simulated distributions of sum precision and average precision with n = 100. Note that n = 100 is not large enough to achieve convergence; deviations from the normal distribution are visible. It is also not large enough for AP to converge to SP, as careful inspection of the two histograms reveals. Note in both cases that many values have near-zero probability, contrary to our experience with evaluating retrieval systems.

It is unlikely that the uniform prior leads to any efficiently computable closed-form distribution for AP. Given R relevant documents, there is no choice but to look at all \binom{n}{R} ways to place them in the ranked list. Figure 4.3 shows the empirical distributions of SP and AP under the uniform prior. Smaller values of SP are more likely, but less likely for AP. Nevertheless, AP is close to uniformly distributed.

Expectation and Variance

As with precision and recall, calculating the expectation of AP involves summing over all 2^n possible assignments of relevance:

E\,AP = \sum_{x^n \in X^n} p(X^n = x^n)\,\frac{1}{\sum x_i}\left(\sum_i a_{ii} x_i + \sum_{j<i} a_{ij} x_i x_j\right). \qquad (4.15)

We can apply linearity of expectations to Eq. 4.15:

E\,AP = \sum_i a_{ii}\,E\left[\frac{X_i}{\sum X_i}\right] + \sum_{j<i} a_{ij}\,E\left[\frac{X_i X_j}{\sum X_i}\right]

resulting in something very much like our recall expectation. We have shown that E[X_i/ΣX_i] ≈ (1/2)/(n/2) or = 1/(n+1), depending on the prior. E[X_iX_j/ΣX_i] is easily found using the same method: when documents are independently relevant, it is an easy extension to show that E[X_iX_j/ΣX_i] ≈ (1/4)/(n/2). When they are not, it requires only a simple extension of the algebra used earlier in this section to show that E[X_iX_j/ΣX_i] = 1/(2(n+1)). Thus:

binomial: \quad E\,AP \approx \frac{1}{n/2}\left(\frac{1}{2}\sum a_{ii} + \frac{1}{4}\sum a_{ij}\right) + \epsilon \qquad (4.16)
uniform: \quad E\,AP = \frac{1}{n+1}\left(\sum a_{ii} + \frac{1}{2}\sum a_{ij}\right). \qquad (4.17)

And since there are O(n^2) expectation terms, each with O(2^{-n}/n) error, ε = O(n^2) · O(2^{-n}/n) = O(n\,2^{-n}).

Variance can be found by using Var(AP) = E AP^2 − (E AP)^2, expanding the expression, and again applying linearity of expectations. Multiplying out E AP^2 is a long and arduous process, and we will spare the details. Suffice to say, after calculating all necessary expectations, the variance is:

binomial: \quad \mathrm{Var}(AP) \approx \frac{1}{(n/2)^2}\left(\frac{1}{4}\sum a_{ii}^2 + \frac{3}{16}\sum a_{ij}^2 + \frac{1}{4}\sum a_{ii}a_{ij} + \frac{1}{8}\sum a_{ij}a_{ik}\right) + \epsilon. \qquad (4.18)

There are O(n^3) terms with O(2^{-n}/n) error each, so ε = O(n^2\,2^{-n}). For the uniform prior:

\mathrm{Var}(AP) = \frac{H_n}{n^2+n}\sum a_{ii}^2 + \frac{n-H_n}{n^3-n}\left(\sum a_{ij}^2 + 2\sum a_{ii}a_{jj} + 2\sum a_{ii}a_{ij}\right) + \frac{n^2-5n+4H_n}{2n^4-4n^3-2n^2+4n}\,2\sum a_{ij}a_{ik} + \frac{2n^3-15n^2+49n-36H_n}{6n^5-30n^4+30n^3+30n^2-36n}\,2\sum a_{ij}a_{kl} - \frac{1}{(n+1)^2}\left(\sum a_{ii} + \frac{1}{2}\sum a_{ij}\right)^2 \qquad (4.19)

which, despite being complicated, is efficiently computable.

Distribution of Difference in Average Precision

Because AP is normal under the binomial prior, it follows that ΔAP is normal as well. Determining the expectation and variance is as simple as replacing a_{ij} in Eqs. 4.16 and 4.18 with c_{ij} = a_{ij} − b_{ij} = 1/max{A_i, A_j} − 1/max{B_i, B_j}. Figure 4.3 shows the distribution of ΔAP over simulated relevance judgments. Even with n only 100, the deviation from a normal distribution is slight; in fact, the hypothesis that ΔAP is normally distributed cannot be rejected at p = 0.05 using the Anderson-Darling test for normality. Note also that the distribution covers a range of values that roughly matches what has been observed at TREC.
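The normal-approximation claim can also be probed by simulation. The sketch below (Python with NumPy and SciPy) draws relevance vectors from the binomial prior for two randomly permuted rankings of n = 100 documents (the rankings, the seed, and the number of trials are arbitrary choices for illustration, not the setup behind the figures), computes ΔAP through the quadratic form, and applies the Anderson-Darling test mentioned above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials = 100, 2000

# two hypothetical systems: ranks[d] is the rank (1..n) of document d
ranks_a = rng.permutation(n) + 1
ranks_b = rng.permutation(n) + 1

def ap(ranks, x):
    """AP via the quadratic form: sum over pairs of relevant docs of 1/max(rank)."""
    rel = np.flatnonzero(x)
    if rel.size == 0:
        return 0.0
    m = 1.0 / np.maximum.outer(ranks[rel], ranks[rel])
    # count each unordered pair once, diagonal included
    return (m.sum() + np.trace(m)) / 2.0 / rel.size

dap = np.empty(trials)
for t in range(trials):
    x = rng.integers(0, 2, size=n)       # binomial prior, p_i = 1/2
    dap[t] = ap(ranks_a, x) - ap(ranks_b, x)

result = stats.anderson(dap, dist='norm')
print("Anderson-Darling statistic %.3f, 5%% critical value %.3f"
      % (result.statistic, result.critical_values[2]))
```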

Figure 4.3. Distribution of sum precision, average precision, and Δaverage precision over a simulated collection of size n = 100 under the binomial prior (top) and uniform prior (bottom).

If the distribution of AP has no nice closed form under the uniform prior, a difference in AP is unlikely to have a closed form either. Figure 4.3 shows the distribution. Most of the mass covers the same values as the binomial prior, though, like precision and recall, it is more sharply centered at zero. Once again the binomial prior, despite leading to a poor posterior on the measure, seems to result in a reasonable (and more easily computable) posterior on the difference.

Other Measures

As presented in Chapter 3, most measures can be understood as precision-like (a weighted sum over the relevance of individual documents) or recall-like (the same, but normalized). Under the binomial prior, the distributions of precision-like measures will converge to normal by essentially the same argument as for precision: the sum of independent random variables tends to be normally distributed. Similarly, recall-like measures will tend to converge to mixtures of normal distributions. The uniform prior is computationally more difficult, and for measures like DCG, NDCG, and R-precision, there seems to be no nice closed form. The only recourse is simulation: draw values of R from a discrete uniform distribution, pick R documents

to be relevant, and calculate the value of the measure of interest. This creates an additional source of error and takes time.

Distributions Over a Set of Topics

The distribution results above are given for retrieval system results over a single topic. Systems are generally compared over a set of topics, since there is variance due to topics as well as systems and missing judgments. Fortunately, for most IR tasks, topics are considered to be independent. Extending these results to a set is therefore simple. The expectations are simply averaged over T topics, and the variances are summed and divided by T^2. As well, the distribution of every measure, regardless of prior, will tend to converge to normal as the number of topics increases. This is a simple application of the Central Limit Theorem, which says that the distribution of a sum of independent random variables is asymptotically normal. Note that this does not account for the variance in a measure due to the topic sample. That will wait until a later chapter.

4.2 A Significance Test Based on Judgments

Significance tests are based on comparing the probability of a set of observations given a null hypothesis to the probability of the observations given an alternative. We can set up a test for relevance judgments J and the difference in an evaluation measure Δφ, where the test statistic T is the likelihood ratio of the null hypothesis to the alternative.

H_0: Δφ > 0 \qquad (4.20)
H_1: Δφ ≤ 0 \qquad (4.21)
T = \frac{P(J \mid Δφ > 0)}{P(J \mid Δφ ≤ 0)} \qquad (4.22)

Note that P(J | Δφ > 0) is reversed from what we have been calculating in the previous sections. In a table like Table 4.1, only for Δφ, this probability would be the number of assignments of relevance that are consistent with J and for which Δφ > 0, divided by the number such that Δφ > 0. Since this once again involves counting over 2^n assignments of relevance, we need to simplify, which we can do by applying Bayes' identity:

T = \frac{P(J \mid Δφ > 0)}{P(J \mid Δφ ≤ 0)} = \frac{P(Δφ > 0 \mid J)\,p(J)/P(Δφ > 0)}{P(Δφ ≤ 0 \mid J)\,p(J)/P(Δφ ≤ 0)} = \frac{P(Δφ > 0 \mid J)}{P(Δφ ≤ 0 \mid J)}\cdot\frac{P(Δφ ≤ 0)}{P(Δφ > 0)} \qquad (4.23)

where P(Δφ > 0) = Σ_{x^n} P(Δφ > 0 | x^n) p(x^n), i.e. it is the prior probability that Δφ > 0, before any judgments have been made. It is in calculating T that normal approximations become very useful: instead of counting 2^n assignments, we can simply plug values into the normal cumulative density function; it only requires

the calculation of an expectation and variance, both of which are computable in polynomial time. The binomial prior is therefore ideal, since it often results in a posterior distribution that can be approximated as normal. The uniform prior, on the other hand, will generally require Monte Carlo simulation to calculate these probabilities.

With the distribution results and hypothesis test, we now have a way to understand the variance in a comparison of two systems due to the missing judgments. Note, however, that this is only a proper hypothesis test if the observations J are sampled independently and identically. Typical methods for selecting documents to judge would not meet that requirement; how this affects the conclusions we draw from the test is a question to be answered empirically.

This framework allows for testing more elaborate hypotheses, e.g. P((φ_1 − φ_2)/φ_1 < λ) (the percent difference is less than λ). The distribution results from above may or may not hold for such cases; we will not evaluate them here, but whether they do or not, Monte Carlo simulation can still be used to calculate the test statistic.

4.3 Probabilistic Stopping Conditions

The next step is to combine understanding variance due to missing judgments with controlling variance due to missing judgments: we can select the minimal set of documents to judge that will reduce variance the most, thus requiring the least effort for maximum confidence in the result of a comparison of two systems.

The procedure of selecting a document and judging it is a classic sequential sampling procedure. The classic significance test for sequential sampling is the sequential probability ratio test (SPRT) (Wald, 1947; Wetherill, 1975). In the SPRT, a likelihood ratio such as that in Eq. 4.22 is compared to an interval (B, A):

B < \frac{P(J \mid Δφ > 0)}{P(J \mid Δφ ≤ 0)} < A.

If, after a judgment, the likelihood ratio has fallen below B, we conclude that Δφ ≤ 0. If it rises above A, we conclude that Δφ > 0. The test assumes that the tth sample is selected independently of the decision of whether or not to stop sampling after t − 1 sequential tests. Again we apply Bayes' rule and compare the resulting statistic, Eq. 4.23, to (B, A):

B < \frac{P(Δφ > 0 \mid J)}{P(Δφ ≤ 0 \mid J)}\cdot\frac{P(Δφ ≤ 0)}{P(Δφ > 0)} < A.

A and B are chosen in order to obtain specified Type I and Type II error probabilities α and β:

A = \frac{1 - β}{α}, \qquad B = \frac{β}{1 - α}.

A and B are thus related to the cost of false positives and false negatives. In this case false positives and false negatives are equally costly, so we set α = β = 0.05, resulting in A = 19 and B ≈ 0.053. Though the presentation is different from our previous work (Carterette et al., 2006), it is equivalent if P(Δφ) is symmetric around its mean.
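With normal approximations to the prior and posterior of Δφ, the test statistic of Eq. 4.23 and the comparison to (B, A) reduce to a few evaluations of the normal CDF. The sketch below (Python with SciPy) shows the computation; the expectation and standard deviation values are placeholders standing in for the quantities derived earlier in this chapter, not numbers from any experiment.

```python
from scipy.stats import norm

alpha = beta = 0.05
A, B = (1 - beta) / alpha, beta / (1 - alpha)     # A = 19, B ~ 0.053

def test_statistic(post_mean, post_sd, prior_mean, prior_sd):
    """Eq. 4.23 with normal approximations to the posterior and prior of Dphi."""
    p_pos_post = 1.0 - norm.cdf(0.0, loc=post_mean, scale=post_sd)     # P(Dphi>0 | J)
    p_pos_prior = 1.0 - norm.cdf(0.0, loc=prior_mean, scale=prior_sd)  # P(Dphi>0)
    return (p_pos_post / (1.0 - p_pos_post)) * ((1.0 - p_pos_prior) / p_pos_prior)

# placeholder moments: a posterior after some judgments vs. the no-judgment prior
T = test_statistic(post_mean=0.02, post_sd=0.01, prior_mean=0.0, prior_sd=0.05)
if T >= A:
    print("stop: conclude Dphi > 0 (T = %.2f)" % T)
elif T <= B:
    print("stop: conclude Dphi <= 0 (T = %.2f)" % T)
else:
    print("keep judging (T = %.2f)" % T)
```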

The generic sequential algorithm is shown in pseudocode below.

Algorithm 4.1. MTC algorithm with probabilistic stopping condition.
1: while B < \frac{P(Δφ > 0 \mid J)\,P(Δφ ≤ 0)}{P(Δφ ≤ 0 \mid J)\,P(Δφ > 0)} < A do
2:   Calculate document weights w_i (if necessary).
3:   i ← arg max_i w_i
4:   Request judgment x_i on document i.
5:   J ← J ∪ {x_i}

If Monte Carlo simulation is used to calculate T, the delay in selection may be unacceptable (or, if few trials are used, the error rates may be higher than the selection of α and β would predict). Note that the algorithm does not update the prior p(x^n) as judgments are made. In fact we can update it, but there are questions of which information to use and how to use it; we will not tackle this until Chapter 6.

4.4 Experiments

The experiments in this section evaluate the sequential algorithm with the probabilistic stopping condition calculated using the distribution results in this chapter. We ran the sequential algorithm for different measures over pairs of retrieval systems for both single topics and sets of topics. For the sets-of-topics case, Δφ is taken as the average of the Δφs calculated for each topic.

We want to evaluate whether the stopping condition does what it is supposed to, i.e. for a given α and β, are the false positive and false negative rates actually α and β? Ideally, the error rates should be equal to α and β. If they are less, it means we incorrectly estimated confidence and as a result did more judgments than needed; in fact we do expect them to be less for reasons related to the lack of prior updating that we will get to in Chapter 6. If they are greater, then we have a problem: we cannot trust that the judgments acquired using the algorithm truly identify a difference between systems.

The experimental process is the same as in the previous chapter: sample two systems, sample a topic, and run the algorithm to judge documents. This time there are two evaluation criteria, the number of judgments and the accuracy at determining the sign of the difference, as well as several parameters that can vary independently: the test parameters α and β, the distribution (exact, normal approximation, or empirical Monte Carlo), and, in the case of Monte Carlo simulation, the number of trials. Again, we compare to the incremental pooling method, this time allowing it to use the probabilistic stopping condition for the strongest possible comparison.

Results

Table 4.2 shows the results of evaluating a single topic using precision at 10, recall at 10, and average precision using these methods. The stopping condition used

exact probabilities where possible: for precision and recall, these are based on Equations 4.3 and 4.12 for the binomial prior, and Equations 4.5 and 4.13 for the uniform prior. Average precision probabilities were calculated using the normal convergence result for the binomial prior, and by Monte Carlo sampling (with 1,000 samples) for the uniform.

              binomial                        uniform
  TREC        prec@10   rec@10   AP           prec@10   rec@10   AP
  TREC        -         92%      97%          100%      100%     96%
  TREC        -         95%      99%           99%      100%     99%
  TREC        -         98%      97%          100%      100%    100%
  TREC        -         95%      97%          100%      100%     97%
  TREC        -         97%      97%          100%       99%     96%
  TREC        -         93%      99%           97%       97%     99%

Table 4.2. Number of judgments made and resulting accuracy when comparing two systems over a single topic using MTC by either prior with α = 0.05, β = 0.05.

The number of judgments required for precision and recall is not substantially less than the number required to prove the difference (Table 3.1 in Chapter 3), mainly because the rank cutoff is too small to allow substantial differences to appear. For average precision, however, the probabilistic stopping condition allows more than a 50% reduction in judging effort, and, as the accuracy numbers show, without seriously compromising the ability to distinguish between systems.

The parameters for the stopping condition were α = 0.05, β = 0.05, meaning a willingness to tolerate a 5% false positive rate and a 5% false negative rate. Thus accuracy should be at least 95%. Indeed, as the table shows, except for precision at 10, accuracy is either above 95% or only slightly below for both priors. Precision accuracies may be lower than expected for the binomial prior due to its likely overestimation of the relevance of unjudged documents (it is unlikely that p_i = 0.5 uniformly over systems and topics) combined with the fact that precision offers no opportunity for smoothing through normalization as recall and average precision do.

Table 4.3 shows the comparable results for the incremental pooling with probabilistic stopping condition approach. Note that the number of judgments required for precision and recall is actually less than that required by MTC. This suggests that the ordering information incremental pooling takes advantage of provides a small gain. While the accuracy for precision is comparable, the accuracy for recall is much lower. Average precision required around 50% more judgments with the pooling method. The accuracy statistics are slightly higher simply because there are more judgments.

              binomial                        uniform
  TREC        prec@10   rec@10   AP           prec@10   rec@10   AP
  TREC        -         86%      98%          100%      100%     97%
  TREC        -         91%      99%          100%      100%     99%
  TREC        -         92%      99%          100%      100%    100%
  TREC        -         93%      99%          100%      100%     97%
  TREC        -         90%      99%           99%       99%     97%
  TREC        -         91%      99%          100%      100%    100%

Table 4.3. Number of judgments made and resulting accuracy when comparing two systems over a single topic using incremental pooling by either prior with α = 0.05, β = 0.05.

Figure 4.4. The left plot shows the bounds on ΔAP changing as judgments are made for systems crnlea and inq101. The thick lines are the expectation of ΔAP and the 95% confidence interval. The right plot shows the test statistic T changing as more judgments are made.

Figure 4.4 shows the bounds on the difference in average precision along with the expected difference and the 95% confidence interval for TREC-3 systems crnlea and inq101 as the number of judgments increases. Note that though the bounds are wide, many of the values within the bounds are very unlikely: they would require most of the documents retrieved by one system to be relevant and most of the documents retrieved by the other to be nonrelevant. Thus the confidence intervals are comparably much tighter than the bounds. Though the expectation hovers around zero, the last few judgments provide enough evidence to push the

95% confidence interval above zero. Note that the bounds at this point are still quite wide. The adjacent plot shows the test statistic (Eq. 4.22) changing with the number of judgments. Since the expectation fluctuates around zero, the test statistic fluctuates around one, until the 40th through 72nd judgments start to provide enough evidence that the difference is in fact positive.

Evaluation Over Sets of Topics

As discussed in Section 4.1.6, regardless of the distribution of a measure over a single topic, its mean is normally distributed over a set of topics. Thus the stopping condition over a set of topics is based on a normal distribution with mean and variance equal to the sum of the expectations and variances for each topic as calculated according to the respective prior distribution.

Table 4.4 shows the results of running MTC over sets of 50 topics with either prior on individual topics. Note that the increase in judgments when going from one topic to 50 is much less than the increase in the previous chapter. In fact, the number of judgments per topic appears to decrease as topics are added: while a single topic required about 40 judgments on average to find a difference in average precision, 50 topics require fewer than 10 judgments per topic. The same pattern holds for precision and recall and for both priors.

Note also that the accuracy numbers for precision are much more in line with the expectation based on α and β being 0.05. The recall and average precision numbers, however, are lower than expected. This could be because of how accuracy is computed: we compared to the true AP and recall over the full set of TREC judgments, but the algorithm could only hope to identify relevant documents from the two systems being evaluated. Indeed, when we recompute AP and recall based only on the documents retrieved, the accuracy numbers increase to nearly 100%. Interestingly, there is no appreciable difference between the priors in performance on precision, but the performance on recall is worse with the uniform prior, and performance on average precision is worse with the binomial prior.

Table 4.5 shows the results using incremental pooling. Again, pooling required more judgments for precision, many more for average precision, but many fewer for recall. The reason for the vast difference in judgments for recall is the same as in the previous chapter: after completing judging within the top k retrieved, incremental pooling continues to have a guide for judging documents, while MTC is left adrift.

Additional Analysis

Failure Analysis

We looked at pairs for which the sign of the difference in one of the measures had been incorrectly predicted. In all such cases, the difference in the measure is small. For example, the average difference in MAP over incorrectly-predicted cases is , while the overall average difference in MAP is . Moreover, MAP values in the former case range from to 0.06, with 80% of values below . Because they were so close in MAP, incorrect pairs required more judgments: 489 on average for wrong predictions versus 368 on average for correct predictions.

              binomial                        uniform
  TREC        prec@10   rec@10   AP           prec@10   rec@10   AP
  TREC        -         94%      90%           99%       94%     91%
  TREC        -         96%      94%           99%       96%     97%
  TREC        -         91%      91%           99%       91%     91%
  TREC        -         94%      91%          100%       94%     91%
  TREC        -         91%      89%           99%       91%     89%
  TREC        -         94%      91%          100%       94%     91%

Table 4.4. Number of judgments made and resulting accuracy when comparing two systems over 50 topics using MTC by either prior with α = 0.05, β = 0.05.

              binomial                        uniform
  TREC        prec@10   rec@10   AP           prec@10   rec@10   AP
  TREC        -         91%      98%           99%       91%     96%
  TREC        -         90%      99%           99%       90%    100%
  TREC        -         87%      99%           99%       87%    100%
  TREC        -         91%      99%          100%       91%     96%
  TREC        -         89%      99%           99%       89%     93%
  TREC        -         90%      99%          100%       90%    100%

Table 4.5. Number of judgments made and resulting accuracy when comparing two systems over 50 topics using incremental pooling by either prior with α = 0.05, β = 0.05.

This suggests that even the full TREC judgments may not be enough to reliably distinguish these systems; it is unlikely that they are significantly different.

Test Parameters

The test parameters α and β can be varied from 0 (no tolerance for errors) to 1 (no concern for errors). Since lower values are expected to produce fewer errors, we would also expect lower values to require more judgments before reaching a stopping point. Figure 4.5 shows how the number of judgments decreases as α and β increase from 0 to 0.5. Note the steep increase in the number of judgments needed as both α and β approach 0.

Figure 4.5. As test parameters α and β increase from 0 to 0.5, the number of judgments required to reach the stopping condition decreases.

Distribution Choices

Each measure has certain options for the distribution of φ used to compute the test statistic. We presented exact distributions for precision and recall under both priors. They could instead be approximated by normal distributions, or estimated empirically using Monte Carlo sampling; in fact, for average precision, these are the only two options. The concern with using approximations or sampling is that they will result in more errors. Monte Carlo sampling may be especially susceptible to this in a sequential setting, since the more often it is performed, the more likely it is to produce a value far out in the tail of the distribution. The probability of this happening can be reduced by increasing the number of simulations, but that adds more time to the evaluation.

We tested each measure with the normal approximation and Monte Carlo sampling, keeping track of the number of judgments required, the error rate, and the amount of real time taken to run. Interestingly, the error rates are not appreciably

different; in fact, error rates with the binomial prior and distributional approximations tend to be better than with the exact probabilities for precision and recall. This could mean that exact probabilities for such coarse measures are not particularly useful. It goes without saying that Monte Carlo simulation takes much longer than using a distribution function; the average time increase for only 1,000 MC simulations is about 350%.

Recommendations

Based on the above discussion and analysis, when deploying these methods in a production setting, we recommend using low error rates α = 0.05 and β = 0.05. This ensures that only the most similar systems have a chance of being compared incorrectly, while still requiring dramatically fewer judgments than traditional alternatives. We also recommend the use of normal approximations to the distributions, since they are easy to compute and do not seem to introduce more errors or require significantly more judgments.


Chapter 5

Ranking Retrieval Systems

Chapters 3 and 4 were concerned with comparing pairs of retrieval systems: controlling the variance in a comparison due to missing judgments in the former; understanding the variance due to missing judgments in the latter. But it is seldom the case that one would want to compare only two systems; tuning a retrieval system or performing feature or model selection would potentially involve dozens of systems whose outputs differ depending on retrieval model, parameter settings, features, and so on. To minimize the judgments needed, we would like to take the similarity between systems into account in our selection process: if all the systems are the same, no judging need be done. Naïvely one could simply make all pairwise comparisons between systems, but this would not take advantage of all of the available information: a document useful for distinguishing between systems A and B may not necessarily be useful for distinguishing between systems A and C, or B and C, or any other pair.

The goal of this chapter is to generalize the methods from the last two chapters to ranking a set of retrieval systems: selection of documents that will best focus assessor effort towards finding the correct ranking, and evaluation of the confidence that a particular ranking is the right one. The basic framework is based on selecting the document that has greatest mutual information with the space of hypotheses about rankings of systems, but to do this efficiently requires some approximations and assumptions that emerge from a geometric understanding of the ranking hypothesis space.

Throughout this chapter the three systems (A, B, and C) in Table 5.1 will be used to illustrate points. This is a very small example, a single topic with a corpus of only ten documents, but it will prove useful for visualization. There is also new notation: as in Chapter 4, φ is an arbitrary evaluation measure; we will use A ≻ B to mean φ_A > φ_B, A ≈ B for φ_A = φ_B, and A ⪰ B to mean that φ_A ≥ φ_B.

5.1 Hypotheses About Rankings

In the previous chapter we defined confidence as the probability that a difference between two systems is less than zero given some judgments: P(Δφ < 0 | J), or equivalently, P(φ_A < φ_B | J) = P(A ≺ B | J). The sign of the difference is a convenient way to represent an ordering of two systems. If we consider the sign to have two outcomes (concluding equality only if all documents have been judged and the

C = {d_1, d_2, d_3, d_4, d_5, d_6, d_7, d_8, d_9, d_10}
A = {5, 9, 6, 4, 1, 3, 8, 7, 2, 10}
B = {8, 9, 6, 1, 4, 5, 2, 10, 3, 7}
C = {4, 7, 10, 5, 6, 9, 2, 1, 8, 3}
J_10 = {0, 1, 0, 1, 0, 0, 0, 1, 0, 0}

  rank    A       B       C
  1       d_5     d_4     d_8
  2       d_9     d_7     d_7
  3       d_6     d_9     d_10
  4       d_4     d_5     d_1
  5       d_1     d_6     d_4
  6       d_3     d_3     d_5
  7       d_8     d_10    d_2
  8       d_7     d_1     d_9
  9       d_2     d_2     d_6
  10      d_10    d_8     d_3

Table 5.1. Three example system rankings of ten documents. Each element in a set is the rank of the document corresponding to the element's index. J_10 are the true relevance judgments.

  x^{10}_i     X_1 ... X_10          AP_A   AP_B   AP_C   R   ranking
  -            -                     -      -      -      -   A ≻ B ≻ C
  -            -                     -      -      -      -   C ≻ A ≻ B
  -            -                     -      -      -      -   C ≻ A ≻ B
  -            -                     -      -      -      -   A ≻ B ≻ C
  x^{10}_{95}  0 1 0 1 0 0 0 1 0 0   0.29   0.51   0.61   3   C ≻ B ≻ A
  -            -                     -      -      -      -   B ≻ C ≻ A
  -            -                     -      -      -      -   B ≻ C ≻ A
  -            -                     -      -      -      -   C ≻ B ≻ A
  -            -                     -      -      -      -   A ≻ B ≻ C
  -            -                     -      -      -      -   A ≻ B ≻ C

Table 5.2. Values of average precision depend on the assignment of relevance to 10 documents. Assignments x^{10}_i are numbered arbitrarily. The fifth line (x^{10}_{95}) shows the actual judgments and the true ranking of systems.

two systems are still equivalent), its variance is as good a quantity as any to use in a decision-making process; the previous chapters have largely been concerned with selecting documents to reduce that variance and then understanding it in terms of missing judgments.

When there are more than two systems to compare, the problem becomes harder: there is no natural mapping of rankings of three or more systems to numeric values such that a particular ranking has a single representation. Since the variance of a ranking depends heavily on what mapping we choose, our decisions would depend on that choice. Decisions about which document to choose and whether to stop judging should be able to be made independently of any mapping of outcomes to labels, and thus we require a different loss function.

Let R_m be a random variable over the space of possible rankings of m systems (Kochanski (2007) shows that the number of total pre-orderings of m items is approximately m!/(2(\log 2)^{m+1}) = O(m!\,1.44^m)). Each ranking has some proba-

bility, depending on the judgments available and the prior probabilities of relevance; Table 5.2 shows how the ranking of the three example systems by average precision depends on the relevance judgments. This variable can have an expectation and variance only if we provide some mapping of rankings to numeric values, but it has an entropy that is independent of any mapping. Shannon (1948) defined the information entropy of a discrete random variable X over a space X as

H(X) = -\sum_{x \in X} p(x) \log_2 p(x).

The ranking random variable has an entropy:

H(R_m) = -\sum_\sigma p(\sigma) \log_2 p(\sigma) \qquad (5.1)

where σ is a partially-ordered permutation and the sum is over all O(m!\,1.44^m) permutations, e.g. {A ≻ B ≻ C}. As we did for confidence, we can expand this as a sum over possible assignments of relevance x^n = {x_1, x_2, ..., x_n} ∈ X^n:

H(R_m) = -\sum_\sigma\left(\sum_{x^n \in X^n} p(\sigma \mid x^n)\,p(x^n)\right)\log_2\left(\sum_{x^n \in X^n} p(\sigma \mid x^n)\,p(x^n)\right). \qquad (5.2)

Because p(σ | x^n) is either 1 or 0 (depending on whether relevance assignment x^n results in ranking σ), this is effectively a sum over all 2^n possible assignments of relevance.

Assuming a flat prior given no judgments, as in Chapter 4, all rankings are a priori equally probable. The distribution over rankings is uniform, and therefore the entropy of R_m is maximized (Cover and Thomas, 1991). Each judgment made helps differentiate the systems, contributing some information about the ranking of systems and thereby decreasing the entropy of R_m. When all documents have been judged, the probability of one ranking is 1 and the rest zero; entropy is minimized. Thus, intuitively, the best document to judge is the one that carries the most information about the distribution of rankings, i.e. the one that is expected to reduce the entropy of R_m by the greatest amount.

One way of figuring this is by calculating the mutual information between the distribution of R_m and the distribution of relevance of each document X_i. In general, the mutual information between two random variables X and Y is defined as

I(X; Y) = \sum_{x \in X}\sum_{y \in Y} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}.

It can be thought of as the amount of information that would be learned about X by observing Y (Cover and Thomas, 1991). We can define the mutual information between the rank hypotheses R_m and the relevance of document i, X_i, as:

I(R_m; X_i) = \sum_\sigma p(\sigma, X_i = 0)\log\frac{p(\sigma, X_i = 0)}{p(\sigma)p(X_i = 0)} + p(\sigma, X_i = 1)\log\frac{p(\sigma, X_i = 1)}{p(\sigma)p(X_i = 1)}.
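For the ten-document example of Table 5.1 these quantities are small enough to compute exactly by enumeration. The sketch below (Python) is one way to do so; it assumes that every assignment of relevance is equally likely (the binomial prior with p_i = 1/2), takes φ to be average precision, breaks ties in AP arbitrarily by system name, and uses the conditional-entropy form of mutual information discussed in the next paragraph. All of those are simplifying choices made for this sketch.

```python
from itertools import product
from collections import Counter
from math import log2

# rank of document d (0-based index) in each system, from Table 5.1
A = [5, 9, 6, 4, 1, 3, 8, 7, 2, 10]
B = [8, 9, 6, 1, 4, 5, 2, 10, 3, 7]
C = [4, 7, 10, 5, 6, 9, 2, 1, 8, 3]
n = 10

def ap(ranks, x):
    """AP as the sum over pairs of relevant documents of 1/max(rank), over R."""
    rel = [ranks[d] for d in range(n) if x[d]]
    if not rel:
        return 0.0
    return sum(1.0 / max(ri, rj) for ri in rel for rj in rel if rj <= ri) / len(rel)

def ranking(x):
    scores = [(ap(r, x), name) for r, name in ((A, 'A'), (B, 'B'), (C, 'C'))]
    return '>'.join(name for _, name in sorted(scores, reverse=True))

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

p_sigma = Counter()
p_sigma_given = {(i, v): Counter() for i in range(n) for v in (0, 1)}
for x in product([0, 1], repeat=n):          # binomial prior: each x has prob 2^-n
    w = 0.5 ** n
    s = ranking(x)
    p_sigma[s] += w
    for i in range(n):
        p_sigma_given[(i, x[i])][s] += 2 * w  # renormalize by P(X_i = v) = 1/2

H = entropy(p_sigma)
print("H(R_m) = %.4f bits" % H)
for i in range(n):
    mi = H - 0.5 * entropy(p_sigma_given[(i, 1)]) - 0.5 * entropy(p_sigma_given[(i, 0)])
    print("document %2d: I(R_m; X_%d) = %.4f bits" % (i + 1, i + 1, mi))
```

The per-document values produced this way can be compared against the by-inspection argument that follows for documents 4, 5, and 8.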

Mutual information I(X; Y) is equivalent to H(X) − H(X | Y), where H(X | Y) is the conditional entropy of X given Y. This is slightly easier to work with:

I(R_m; X_i) = H(R_m) - H(R_m \mid X_i) = H(R_m) - p(X_i = 1)H(R_m \mid X_i = 1) - p(X_i = 0)H(R_m \mid X_i = 0)

where H(R_m | X_i = 1) is Eq. 5.1, summing over only the assignments of relevance with x_i = 1 (and H(R_m | X_i = 0) is found by summing over the assignments with x_i = 0).

Calculating mutual information involves summing over factorially many rankings or exponentially many assignments of relevance; no matter whether the sum is factorial or merely exponential, it is not efficiently computable for large m or n. There has been some recent work on efficiently estimating factorial probability distributions, but it is not nearly efficient enough for the on-line algorithms we will need it for. In order to be able to use this information-theoretic framework, then, we will need efficient approximations of the mutual information.

Ranking Hypothesis Space

It will prove useful to have a geometric interpretation of a ranking and of the entropy of a ranking. Suppose there is an (m − 1)-dimensional space, with each dimension representing the difference between a pair of systems. If m = 3, for example, we have a 2-dimensional space with dimensions in φ_A − φ_B and φ_B − φ_C. In fact the dimensions can be chosen differently; they could equivalently be φ_A − φ_C and φ_B − φ_C, or in general any of the m! ways to choose (m − 1) pairs of differences between systems. A point in the space with dimensions in φ_A − φ_B and φ_A − φ_C, then, is determined by those differences, which in turn are determined by a particular assignment of relevance. Therefore there are 2^n possible points, each the result of a different possible set of judgments to all documents. Figure 5.1 shows six different rank plots, each with different dimensions depending on which differences between the example systems are picked. Each plot has 2^10 = 1024 points.

Note that there are regions bounded by lines (hyperplanes in general) defined by the difference between two systems; every assignment of relevance that produces a point in a particular region defines a ranking. The regions in these plots represent the rankings of systems that they are labeled by. Entropy can therefore be visualized as a sum over all points in a plot, and is therefore a function of the number of different regions the points fall into and how densely populated each region is: at one extreme, all regions have the same number of points and entropy is maximized (every total ordering has the same probability). At the other, all possible points fall into a single region and entropy is minimized. In between, some regions will have more points than others; a few regions with many points indicates lower entropy than many regions with a few points.

Any judgment cuts the space of points in half, reducing the number of points in some of the regions and thereby decreasing the entropy. The plots in Figure 5.2 have the same dimensions as the top left plot in Figure 5.1; each plot shows the result of judging a different document relevant. Note that some judgments, despite cutting the space in half, do not reduce the dispersion of points, i.e. they do not

Figure 5.1. Rank plots: each 2-dimensional plot has axes determined by the difference in φ between two systems. Each point is determined by an assignment of relevance to 10 documents. All of the points in a region bounded by solid lines represent the ranking of systems that labels that region.

appreciably reduce the number of regions that points appear in or the density of points within those regions. Other judgments create compact spaces that span a few regions, with most of the mass concentrated in one to three regions. Intuitively, these are the ones we want to focus on. In this example in particular, the 4th, 5th, and 8th plots (which represent the result of judging documents 4, 5, or 8 relevant, respectively) would put most of the ranking mass into four regions, while any other would keep most of the mass in five or six regions. Documents 2 and 3, in contrast, do not seem to reduce the entropy at all. Therefore any of 4, 5, or 8 would be preferable to 2 or 3; by inspection, it seems that 4 and 8 are superior to 5, but 4 and 8 are roughly equal. This depends to some extent on the dimensions chosen. The plots are not shown, but in some, 5 looks a little tighter than 4 or 8. The exact calculation of mutual information indicates that a judgment on document 8 would provide the most information about the rank distribution, with document 4 very close behind.

Judging documents 4, 5, and 8 is enough to prove that A is the worst of the three systems; see Figure 5.3. It will take more judgments to prove the difference between B and C. This example demonstrates the connection between the entropy of the ranking, the mutual information with a judgment, and the geometric interpretation of a ranking: the judgment that results in the densest dispersion of points across regions

Figure 5.2. Rank plots, each highlighting the effect of judging a different document relevant (documents 1–5 across the top; documents 6–10 across the bottom). Red points indicate values that are still possible after judging a document.

Figure 5.3. A rank plot after judging documents 4, 5, and 8. Dark red points indicate possible values after all three of those judgments. Light red points are possible values after judging documents 4, 5, or 8 individually.

is the one that would give the least entropy to the rank distribution, and thus has the greatest mutual information with that distribution. This is the document that

should be judged. We shall use this idea to construct an efficient algorithm for selecting documents to judge to determine the right ranking of systems.

5.2 An Efficient Algorithm

Though there has been recent work on efficient estimation of distributions over factorial random variables, ours is literally an on-line problem: we cannot keep assessors waiting while we pick the next document to judge, and we need a fast algorithm that we can compute for every unjudged document. We are willing to sacrifice a great deal of precision in which document we select, especially once we factor in noise in judgments and the fact that no single judgment can have a very large effect. Thus our approximations can be quite loose.

Joint Distribution of Measures

Equation 5.1 is a sum over the probability of factorially many rankings. The probability of a ranking is technically the sum of the probabilities of all assignments of relevance that result in that ranking. If n is large, it may be feasible to model φ as continuous rather than discrete and combinatorial, thereby potentially reducing the computational effort. Supposing that f(φ_A, φ_B, φ_C) is a continuous joint probability density function, we can calculate the probability of each possible permutation:

p(A ≻ B ≻ C) = \int\!\int^{\phi_A}\!\!\int^{\phi_B} f(\phi_A, \phi_B, \phi_C)\,d\phi_C\,d\phi_B\,d\phi_A
p(A ≻ C ≻ B) = \int\!\int^{\phi_A}\!\!\int^{\phi_C} f(\phi_A, \phi_B, \phi_C)\,d\phi_B\,d\phi_C\,d\phi_A

As shown in Chapter 4, using the binomial prior for the probability of relevance often produces an approximately normal posterior for evaluation measures. Extending this, f could be a multivariate normal density function:

f(\mathbf{x}) = \frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^{\top}\Sigma^{-1}(\mathbf{x}-\mu)\right) \qquad (5.3)
\mu = \left[\,E\phi_A \;\; E\phi_B \;\; E\phi_C\,\right] \qquad (5.4)
\Sigma = \begin{bmatrix} \mathrm{Var}(\phi_A) & \mathrm{Cov}(\phi_A,\phi_B) & \mathrm{Cov}(\phi_A,\phi_C) \\ \mathrm{Cov}(\phi_A,\phi_B) & \mathrm{Var}(\phi_B) & \mathrm{Cov}(\phi_B,\phi_C) \\ \mathrm{Cov}(\phi_A,\phi_C) & \mathrm{Cov}(\phi_B,\phi_C) & \mathrm{Var}(\phi_C) \end{bmatrix} \qquad (5.5)

where Var(φ_A) = Eφ_A^2 − (Eφ_A)^2 and Cov(φ_A, φ_B) = Eφ_Aφ_B − Eφ_A Eφ_B. There is no formal justification for this extension from univariate normality to multivariate normality, but it is convenient.

I know of no way to compute these integrals. But, following the use of differences between systems to define a rank space, we can preserve the ranking by looking at differences between adjacent systems. Rather than calculate P(A ≻ B ≻ C), we

can calculate P(φ_A − φ_B > 0, φ_B − φ_C > 0). We lose information about the values of the measures, but we do not lose any information about the ranking of systems. And the integrals are much easier to compute: letting Δ_{AB}φ = φ_A − φ_B,

p(σ_1) = p(A ≻ B ≻ C) = \int_0^\infty\!\!\int_0^\infty f_1(\Delta_{AB}\phi, \Delta_{BC}\phi)\,d\Delta_{BC}\phi\,d\Delta_{AB}\phi
p(σ_2) = p(A ≻ C ≻ B) = \int_0^\infty\!\!\int_0^\infty f_2(\Delta_{AC}\phi, \Delta_{CB}\phi)\,d\Delta_{CB}\phi\,d\Delta_{AC}\phi

(Rankings σ are numbered arbitrarily from 1 to m!.) This has the slight downside of requiring O(m!) different distributions f_1, f_2, ..., f_{m!}. If these are multivariate normal distributions, they are each parametrized by a different µ and Σ, for instance:

\mu_1 = \left[\,E\Delta_{AB}\phi \;\; E\Delta_{BC}\phi\,\right] \qquad (5.6)
\Sigma_1 = \begin{bmatrix} \mathrm{Var}(\Delta_{AB}\phi) & \mathrm{Cov}(\Delta_{AB}\phi, \Delta_{BC}\phi) \\ \mathrm{Cov}(\Delta_{AB}\phi, \Delta_{BC}\phi) & \mathrm{Var}(\Delta_{BC}\phi) \end{bmatrix} \qquad (5.7)
f_1(\Delta_{AB}\phi, \Delta_{BC}\phi) = \frac{1}{(2\pi)^{(m-1)/2}|\Sigma_1|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\mu_1)^{\top}\Sigma_1^{-1}(\mathbf{x}-\mu_1)\right) \qquad (5.8)

This framework cannot truly handle two systems being equal: that case will have probability zero. A possible resolution is to define e.g. P(A ≈ B ≻ C) so that it integrates over a small interval for the differences between A and B, and between B and C. We will essentially ignore the problem; as in previous chapters, we will conclude that two systems are equal only after having judged every (information-carrying) document they both retrieved.

It should be clear that despite being defined by different distributions, these probabilities will nevertheless sum to 1: the complement of A ≻ B ≻ C is the set of remaining rankings, and clearly the probability of that ranking plus the probability of its complement is 1.

Figure 5.4 shows the effect of judging the fourth document in our example relevant (as in the fourth upper plot of Figure 5.2). The bivariate normal distribution of system differences is represented by contour lines. This suggests that we can approximate the density of points by the area within the outermost ellipse (or more generally, the volume of the normal ellipsoid). The perpendicular red lines intersect at the mean of the bivariate distribution, and the two eigenvectors of the covariance matrix are parallel to these two lines. The area of any of the ellipses, then, is proportional to the product of the eigenvalues.

Approximate Mutual Information

When considering the discrete distribution of R_m we were concerned with the Shannon entropy. Having relaxed to a continuous joint distribution, we are now concerned with differential entropy h(X), the continuous extension of the ideas of information entropy (Cover and Thomas, 1991). Closed-form expressions for differential entropy are known for many distributions, and in particular, the differential entropy of a multivariate normal distribution is proportional to the determinant of the covariance matrix, also known as the generalized variance. Therefore h(R_m) ∝ |Σ|,

Figure 5.4. Possible values if document 4 is judged relevant. Contour lines show the bivariate normal approximation to the joint distribution of the difference in φ between pairs of systems.

with Σ defined as for the distributions above. Mutual information, being a function of entropy, is also proportional to |Σ|, and therefore the best document to choose is the one that produces the biggest decrease in |Σ|. Let Σ_{R_i} be the covariance matrix that results from judging document i relevant and Σ_{N_i} be the result of judging document i nonrelevant. Then mutual information can be approximated as

|\Sigma| - p(X_i = 0)|\Sigma_{N_i}| - p(X_i = 1)|\Sigma_{R_i}|.

As discussed above, in Figure 5.4 the approximate mutual information is proportional to the area of one of the normal ellipses, which in turn is proportional to the product of the eigenvalues of the covariance matrix. The product of the covariance eigenvalues is, in fact, equal to the determinant of the covariance matrix, the generalized variance. In each plot in Figure 5.2, approximate mutual information is proportional to the area of an ellipse that contains all of the red points. At least by inspection, selecting documents this way appears to be a good alternative to calculating the actual mutual information.

Calculating the determinant of an m × m matrix is in O(m^3) (Cormen et al., 2001). Given that it must be calculated for each unjudged document to determine the effect of judging that document, for large enough m this is still not practical.

Generalized Variance Bound

As long as no two systems are identical (or more precisely, as long as there is no pair A, B such that Var(φ_A − φ_B) = 0), the covariance matrix is Hermitian and positive definite. The determinant of a Hermitian positive definite matrix is no greater

than the product of its diagonals (Shilov, 1977). Using this approximation would therefore only entail calculating variances of differences between pairs of systems rather than the entire covariance matrix and its eigenvalues, reducing computation time to that of calculating m − 1 variances of Δφ. For most measures, including precision, recall, and NDCG, variance is in O(n), so this is efficient. The document that when judged results in the minimum \prod_{i,j} \mathrm{Var}(\Delta_{ij}\phi) is therefore the one that puts the least upper bound on the volume of the normal ellipsoid. The tighter this bound is, the better the selection will be; we will return to this below.

When calculating variance is computationally expensive, as it is for average precision (O(n^3)), this approximation is still far too time-consuming. In these cases, an alternative option is to return to the selection mechanism of the previous chapters, picking the document that has the greatest weight over all pairs of systems. This corresponds to the document that reduces the variance for one pair. If we can select the pair that contributes the most to generalized variance, then this is equivalent to minimizing the maximum eigenvalue, which, geometrically, is proportional to the length of the major axis of the normal ellipsoid. So this also has the effect of reducing the density of points over regions.

Dimension Selection

When selecting the document that minimizes an upper bound on generalized variance, we would like that bound to be as tight as possible. Since |\Sigma| \le \prod_{i,j} \mathrm{Var}(\Delta_{ij}\phi), the tightness of the bound depends on which m − 1 of the m(m − 1)/2 pairs have been chosen as bases; recall that this choice affects the visualization of the rank plots (Figure 5.1). (Note that |Σ| itself does not depend on the dimensions, since the determinant is invariant under changes of basis.) Therefore we would like to pick the m − 1 pairs that have the least variance but that still carry all information about the maximum likelihood ranking. Not every set of m − 1 pairs preserves the ranking; with four systems A, B, C, D, the three pairs A−B, A−C, B−C tell us everything about the relative ordering of systems A, B, C, but nothing about the position of D relative to them.

Consider a graph G in which the nodes A, B, C, ... represent systems. Between every two nodes there is an edge e_{AB}, e_{AC}, ... with weight equal to the variance of the difference in φ over the missing judgments. An example is shown in Figure 5.5, with edge weights underlined and linked to their respective edges. We propose the following: taking pairs of systems that form a spanning tree in this graph preserves all information about the maximum likelihood ranking. The minimum spanning tree comprises the set of pairs that will produce the tightest upper bound on generalized variance.

Theorem 5.1. Let G = (V, E) be a fully-connected graph with edge weights such that e_{ij} + e_{jk} = e_{ik} for all nodes i, j, k. Given a spanning tree T ⊆ E, all edge weights e_{ij} are recoverable.

Proof. Let u, v ∈ V be arbitrary nodes. We will show that if e_{uv} ∉ T, then e_{uv} can be determined from T. Since T is a spanning tree, it contains a path connecting v

Figure 5.5. System graph. Nodes are systems; edge weights are the variance of the difference in a measure. Bold edges are in the graph's minimum spanning tree.

and u. Suppose {e_{u i_1}, e_{i_1 i_2}, ..., e_{i_k v}} ⊆ T is the sequence of edges connecting u to v. Then

e_{uv} = e_{u i_1} + e_{i_1 v}
       = e_{u i_1} + e_{i_1 i_2} + e_{i_2 v}
       = e_{u i_1} + e_{i_1 i_2} + e_{i_2 i_3} + e_{i_3 v}
       ...
       = e_{u i_1} + e_{i_1 i_2} + ... + e_{i_k v}

and the proof is complete: e_{uv} has been recovered only from edges in T.

Corollary 5.1. If G is a graph with retrieval systems as nodes and edges e_{AB} equal to E[φ_A − φ_B], the ranking of systems by Eφ can be recovered from any spanning tree T.

Proof. Theorem 5.1 allows recovery of all edge weights e_{ij} = E[φ_i − φ_j]. Since E[φ_i − φ_j] > 0 and E[φ_j − φ_k] > 0 implies E[φ_i − φ_k] > 0 (i.e. evaluation measures are transitive), the ranking can be recovered.

Theorem 5.2. Let G = (V, E) be a fully-connected graph. Let T ⊆ E be a minimum spanning tree of G. Let M be an adjacency matrix with diagonal entries equal to the weights of the edges in T and off-diagonal entries equal to the weights of edges not in T. If M is Hermitian and positive semidefinite, then |M| ≤ \prod_i M_{ii}, and furthermore, this bound is tight for all spanning trees T.

Proof. The bound |M| ≤ \prod_i M_{ii} is known as Hadamard's inequality; a proof can be found in linear algebra texts (e.g. Shilov (1977)).

The product \prod_i M_{ii} could be decreased by replacing an arbitrary M_{ii} with a smaller number. But since M_{ii} must be the weight of an edge in T, this could only happen by replacing an edge in T with one of smaller weight. If there is an edge outside of T that has weight less than an edge in T, then either T could not be a minimum spanning tree or that edge would create a cycle and cause T to no longer span G. Therefore there is no spanning tree and associated adjacency matrix N such that \prod_i N_{ii} < \prod_i M_{ii}, and therefore the bound is tight over all spanning trees.

Corollary 5.2. If G is a graph with retrieval systems as nodes and edges e_{AB} = Var(φ_A − φ_B), T is a minimum spanning tree in G, and M is the adjacency matrix described above, then

|\Sigma| \le \prod_{e \in T} e = \prod_{(A,B) \in T} \mathrm{Var}(\phi_A - \phi_B) \le \prod_{i,j} \mathrm{Var}(\phi_i - \phi_j).

Proof. Since Σ is a covariance matrix, it is Hermitian and positive semi-definite. The theorem holds.

Although the minimum spanning tree procedure produces the tightest Hadamard bound, that bound can still become arbitrarily bad depending on the covariance matrix. In particular, more similarity between systems generally produces looser bounds. It is much faster than calculating the generalized variance for each document, though, as the minimum spanning tree can be found in O(m^2) time using Prim's algorithm (Cormen et al., 2001). At first glance this would appear to defeat the purpose of using the Hadamard bound in the first place: the MST requires calculating the variance between all pairs of systems, not just the m − 1 that the bound requires. The difference is that the bound needs to be calculated for every document at every iteration, while the spanning tree need only be found once at each iteration, or even after every k iterations.

Maximum Likelihood Ranking

The maximum likelihood ranking is the one with greatest probability given the relevance judgments. This is the ranking that p(R_m) is peaked at, and from the geometric interpretation we can see that it is the ranking represented by the region containing the greatest number of possible points. If the probability density function f is normal, the ML ranking is simply the one that results from ranking systems by the expected value of φ as defined in Chapter 4. If there are missing judgments, this ranking, denoted σ_ML, is the most natural to take as the hypothesized ranking; additional judgments will either prove or disprove that the maximum likelihood ranking is the correct one. Note that this is the ranking recovered from the system graph's minimum spanning tree by Corollary 5.1.

Stopping Condition

In Section 4.3 we defined a stopping condition based on the probability that two systems are ordered according to some null hypothesis. That idea is extended here: the null hypothesis is that any ranking but the maximum likelihood one is correct;

H_0: σ ≠ σ_ML
H_1: σ = σ_ML

T = P(J | ¬σ_ML) / P(J | σ_ML)
  = [P(¬σ_ML | J) P(σ_ML)] / [P(σ_ML | J) P(¬σ_ML)]
  = [(1 − P(σ_ML | J)) / P(σ_ML | J)] · [P(σ_ML) / (1 − P(σ_ML))]

A priori we may say that P(σ) = 1/m! for all σ, so P(σ_ML)/(1 − P(σ_ML)) = 1/(m! − 1). As in Section 4.3, a sequential stopping condition is that the value of T falls outside the range (A, B), with

A = β / (1 − α),    B = (1 − β) / α,

where α and β are the desired false positive and false negative rates.

There is a problem with this test: it is really a test of O(m!) different hypotheses rather than a test of one versus an alternative. This means that the desired error rates α and β will not necessarily translate into the same observed error rates; in fact, they will translate into much higher error rates. The null hypothesis is very likely to be true when few judgments are available, for the simple reason that the maximum-likelihood ranking based on a handful of judgments is unlikely to be the correct ranking. One solution is to set β = 0, ensuring that we never decide in favor of the null hypothesis. Then, because there is a tradeoff between Type I errors and Type II errors, we must also select a very small (but nonzero) α to ensure that we do not reject the null hypothesis out of hand, even though we know a priori that we will reject it at some point. We just need to ensure that we have enough time to make sure we are rejecting the right null hypothesis. This means that though there is only one parameter to worry about, it is not clear what the right value of it is, except that it is on the order of 1/m!. All is not lost, however, as we can still test hypotheses about individual pairs of systems, thus determining that we are in some subset of regions with some high probability.
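
As a small numerical illustration of the quantities involved (not part of the original argument): T combines the posterior odds against the maximum likelihood ranking with the prior odds 1/(m! − 1) under the uniform prior over rankings. The particular values of P(σ_ML | J) and m below are made up.

# Sketch of the test statistic T. The Bayes decomposition and the uniform 1/m!
# prior over rankings follow the text; the numeric values are illustrative.

from math import factorial

def T_statistic(p_ml_given_j: float, m: int) -> float:
    """T = P(J | not sigma_ML) / P(J | sigma_ML) via Bayes' rule with P(sigma) = 1/m!."""
    posterior_odds = (1.0 - p_ml_given_j) / p_ml_given_j   # P(not sigma_ML | J) / P(sigma_ML | J)
    prior_odds = 1.0 / (factorial(m) - 1)                  # P(sigma_ML) / (1 - P(sigma_ML))
    return posterior_odds * prior_odds

# Even a very confident posterior leaves T scaled by the 1/(m! - 1) prior odds;
# the text notes that the one remaining parameter, alpha, ends up on the order of 1/m!.
for m in (3, 5, 8):
    print(m, f"T = {T_statistic(0.99, m):.3e}", f"1/m! = {1.0 / factorial(m):.3e}")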

Algorithm

Putting the pieces together produces the generic sequential algorithm below:

Algorithm 5.1. Rank-MTC. Select documents to judge to rank a set of m retrieval systems.
1: initialize S_{n×m} and the thresholds [A, B]
2: while A < T < B do
3:   calculate Σ_{m×m} using S and x
4:   Δ_{(m−1)×m} ← Mst(Σ)
5:   S'_{n×(m−1)} ← SΔᵀ
6:   for each unjudged document i do
7:     calculate Σ'^R_i and Σ'^N_i using S' and x
8:     w_i ← |Σ'^R_i| + |Σ'^N_i| (by approximation)
9:   i ← arg max_i w_i
10:  x_i ← judgment on document i

The Mst subroutine finds the minimum spanning tree for the variance graph. It returns an (m − 1) × m matrix in which each row represents an edge in the spanning tree. Each row has a 1 and a −1 in the positions of the nodes it connects. The matrix product of S and this matrix produces a matrix of system differences S'. It is from this matrix that we determine the maximum likelihood ranking σ_ML and the covariance matrix Σ' (as defined earlier). Note that we do not actually calculate |Σ'|; we only calculate its diagonal entries, which are multiplied together to approximate its determinant.

If the measure is linear, like precision or DCG, or can be treated as linear divided by a constant, like recall and NDCG with the first prior, then there are some shortcuts that can be taken to reduce computation time. The initial computation of Σ (line 3) is O(nm²), but it can be updated to reflect new judgments with O(m²) operations. Likewise, calculating the diagonals of Σ' (lines 7 and 8) is O(nm), but only needs to be done once each round; the effect of each judgment can be computed in O(m) operations. Thus for these measures the running time of the while block is O(nm²). This is an indication of the amount of time an assessor would have to wait for the next document to be served.

A measure like average precision, however, cannot feasibly work with this algorithm. Calculating a single variance alone is simply too expensive (O(n³)). One possible alternative is to calculate, for each pair of systems, a weight w_i for each document as before, and pick the document with the greatest weight over all pairs. The maximum w_i across all pairs can be found in only O(m) operations: since, loosely speaking, w_i = Σ_j c_ij x_j = Σ_j a_ij x_j − Σ_j b_ij x_j, the maximum would simply be the maximum Σ_j a_ij x_j minus the minimum Σ_j b_ij x_j. Therefore we can compute those quantities for each system and then find the maximum weight over all pairs.

Another alternative is to approximate average precision with a linear DCG-like measure. An instance of DCG with g(x) = x and d(x) = 1/x is the sum of the reciprocal ranks at which relevant documents appear, and thus is equivalent to calculating sum precision while ignoring the off-diagonal entries of the matrix C (Section 3.4.3). Alternatively, an instance of DCG with g(x) = x and d(x) = H_n − H_{x−1} is equivalent to the average of the precisions at every rank, not restricted only to the ranks at which relevant documents appear.
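
Below is a simplified sketch of this selection loop for a linear measure (precision@k). It keeps the overall shape of Algorithm 5.1 (pick the judgment that most reduces the remaining uncertainty about the pairwise differences, record it, repeat) but, as a deliberate simplification, replaces the spanning-tree and determinant machinery with a plain sum of pairwise variances and stops on a fixed budget rather than the (A, B) test. All data are synthetic.

# A simplified, illustrative sketch of the Rank-MTC selection loop for a linear
# measure (precision@k) under the independence assumption.

import numpy as np

rng = np.random.default_rng(1)
n_docs, m, k, budget = 500, 4, 10, 30
# c[i, s]: contribution of document i to system s's precision@k (1/k if in the top k).
c = np.zeros((n_docs, m))
for s in range(m):
    c[rng.permutation(n_docs)[:k], s] = 1.0 / k

oracle = rng.random(n_docs) < 0.3           # hidden judgments, revealed one at a time
p = np.full(n_docs, 0.5)                    # prior belief of relevance
judged = np.zeros(n_docs, dtype=bool)
pairs = [(a, b) for a in range(m) for b in range(a + 1, m)]

for _ in range(budget):
    # For a linear measure, Var(phi_A - phi_B) = sum_i (c_iA - c_iB)^2 p_i q_i over
    # unjudged documents; judging document i removes its term from every pair.
    q = p * (1.0 - p) * ~judged
    reduction = np.zeros(n_docs)
    for a, b in pairs:
        reduction += (c[:, a] - c[:, b]) ** 2 * q
    if reduction.max() <= 0:                # no unjudged document affects any comparison
        break
    i = int(np.argmax(reduction))
    judged[i] = True
    p[i] = float(oracle[i])                 # record the "assessor" judgment

print("expected precision@%d per system:" % k, np.round(c.T @ p, 3))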

Both of these correlate well with average precision, though of course it is where the correlations break down that problems can arise.

5.3 Conclusion

This chapter has introduced a mathematical formalism and algorithm for selecting documents to judge to reduce the volume of a space of hypotheses about relative orderings of systems. There is a piece missing that prevents it from working in practice: estimates of the probability of each possible assignment of relevance. By focusing on the documents that are most informative about the hypothesis space, the algorithm focuses on those systems that are most different from the others. It will thus tend to ask for judgments on the least-frequently retrieved documents, or the documents that systems disagreed about the most. In practice this means that a large set of systems that are based on similar assumptions (e.g. that documents that contain the query terms with high frequency are more likely to be relevant, modulo the frequency of those terms throughout the corpus) and thus retrieved similar documents will not have their documents judged to a significant extent. Because relevant documents appear at a much lower rate than the uniformity assumption would predict, these systems will end up looking much better than the others from which many documents were judged. This method cannot be applied to an evaluation of systems until this is addressed.


Chapter 6

Robust Evaluation

In Chapter 4 we argued that prior to making any judgments, relevance should be assumed to be uniformly distributed, whether by the number of relevant documents in the corpus or by the assignments of relevance themselves. But after acquiring some judgments, we have some information about what makes a document relevant and what doesn't, and thus some knowledge about the distribution of relevance for each document as well as the distribution of assignments of relevance to every document. Rather than continue with the same uniformity assumption, we can use this information to update our distributions. In fact, for these methods to work in practice, for arbitrarily large sets of systems over a period of time, that is, for them to be robust, we must update the distributions of relevance to reflect known judgments.

Suppose a researcher creates a new retrieval system that has not retrieved any of the documents judged in previous evaluations over the same topics. If the existing test collection is formed from a reasonable set of systems, the a priori hypothesis about the new system, before making any additional judgments, should be that it is bad, simply because it has failed to retrieve any of the previously-judged relevant documents. But if relevance is assumed to be uniformly distributed, it will look quite good, both on its own and possibly in comparison to other systems that did retrieve judged documents, depending on the proportion of those that are relevant. To resolve this, we need a way to estimate relevance reliably.

In this chapter we introduce three models for estimating the relevance of unjudged documents. This, of course, is the classic question of information retrieval (how likely it is that a document is relevant to a given query) and has inspired a great deal of excellent work. This evaluation setting has some differences from the retrieval setting, namely that it requires true probabilities of relevance and that there are existing judgments to work from. Estimating relevance also ties into the question of reusability, which we will address as well.

6.1 Confidence and Probability of Relevance

Section 4.3 presented a probabilistic stopping condition for the document selection algorithms based on the confidence in the difference between two systems. The stopping condition is parametrized by α and β, which are chosen as the desired expected Type I and Type II (false positive and false negative) error rates, respectively.

As touched on in that section, this traditionally requires that the data (in this case judgments J) are sampled randomly. Clearly the algorithms that we have developed are not sampling randomly. It is not even particularly clear what it means to judge a random sample of documents: it depends on the measure and the systems being evaluated. Certainly we do not want judgments to be dependent on one particular measure or one particular set of systems; these would have very low value beyond the single evaluation they were acquired for.

While a random sample is sufficient to attain the expected error rates, it is not necessary. An alternative condition is that the distribution P(Δφ < 0 | J) is calibrated to those error rates. Recall that confidence is defined in terms of probabilities of assignments of relevance:

P(Δφ < 0 | J) = Σ_{x^n ∈ X^n} P(Δφ < 0 | x^n, J) p(x^n | J).

If p(x^n) is 1/2^n, the semantics of confidence are that it is the proportion of possible assignments of relevance that result in Δφ < 0. The semantics required to achieve the desired expected error rate are different. Instead of summing over possible assignments of relevance for a given pair of systems, it must be the case that no matter what the systems are, what the evaluation measure is, and what judgments are in the set J, if P(Δφ < 0 | J) = α, the sign of Δφ is negative with probability α. But if the two semantics are calibrated, i.e. a probability of α under one semantics implies a probability of α under the other, then α and β will be the expected error rates no matter what.

The distribution of relevance p(x^n) offers a solution. Probabilities can be assigned to assignments of relevance in such a way that, for a given distribution of systems, the two semantics match.

This raises the question of the semantics of the probability of relevance. Consider a single document i, and let p_i = P(X_i = 1). A traditional ("frequentist") view might be that if p_i = ρ, then out of n judgments to document i, nρ of them will be that it is relevant and n(1 − ρ) that it is not, much like a coin flip. But relevance is less like a coin flip and more like the weather, to use another common example in the probability and statistics literature. A meteorologist can predict a 70% chance of rain, but when the day is done, it either will or will not have rained; the day cannot be repeated n times to see if it rains 0.7n times. Likewise, a document is judged relevant or not, and while assessors may disagree about the relevance of some documents, the majority are clearly not relevant.

Instead, we adopt a Bayesian view: the probability of relevance should be thought of as a belief that a document is relevant. After a judgment the belief will be 1 or 0, but before that, the belief may be anything between those two values, depending on features of the document. Documents are not relevant in a vacuum. Two documents on the same topic are likely to be relevant to the same queries. The documents ranked by a system that is very similar to one known to be good are more likely to be relevant than those ranked by a known bad system. Documents that are clicked on more often, or viewed longer, or come from trusted sources may be more likely to be relevant. Such features, in conjunction with documents known to be relevant or not, can be leveraged to assign beliefs.

This idea can be contorted to fit with a frequentist view by considering α to be the proportion of documents that are relevant among all those for which the belief in relevance is α.
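
For intuition, a confidence of this kind can be approximated by Monte Carlo: sample assignments x^n from per-document beliefs and count how often the difference in the measure comes out negative. The sketch below does this for precision@10 with two invented ranked lists and random beliefs; it is an illustration of the definition, not the dissertation's implementation.

# Sketch: Monte Carlo estimate of P(dphi < 0 | J) from per-document beliefs p_i,
# treating documents as independent Bernoullis. All data are made up.

import numpy as np

rng = np.random.default_rng(2)
n, k, samples = 100, 10, 20000
p = rng.random(n)                       # beliefs P(X_i = 1); judged docs would be 0 or 1
in_top_A = rng.permutation(n)[:k]       # documents in system A's top k
in_top_B = rng.permutation(n)[:k]       # documents in system B's top k

X = rng.random((samples, n)) < p        # sampled assignments x^n
prec_A = X[:, in_top_A].mean(axis=1)    # precision@k of A under each assignment
prec_B = X[:, in_top_B].mean(axis=1)
confidence = float((prec_A - prec_B < 0).mean())
print(f"P(prec@{k} difference < 0 | beliefs) = {confidence:.3f}")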

Then assigning beliefs to documents given features becomes a problem of constraint satisfaction: find a model M parametrized by θ that assigns beliefs so that marginal distributions over feature values will match frequencies of relevance in the training judgments J. We will refer to this as the calibration condition. The natural model is the one that makes the fewest additional assumptions about the data, i.e. the maximum entropy model. A set of judgments augmented by a set of probabilities of relevance is considered robust if the calibration condition is achieved.

Let us consider the probability of a ranking given a set of judgments as well as an n × d feature matrix F:

P(R = σ | J, F) = Σ_{x^n ∈ X^n} P(σ | x^n, J, F) p(x^n | J, F).     (6.1)

The distribution p(x^n | J, F) is estimated from a model of relevance, and that model is what we wish to find. If p(x^n | J, F) is parametrized by θ, we have:

p(x^n | J, F) = ∫_θ p(x^n | J, F, θ) p(θ | J, F) dθ.

We can apply Bayes' rule to find the distribution of θ:

p(θ | J, F) ∝ p(J | θ, F) p(θ).

Assuming the i ∈ J are conditionally independent given θ and F, then

L(θ) = p(θ | J, F) ∝ ∏_{i ∈ J} p(x_i | θ, F_i) p(θ).

How we proceed from here depends on the assumptions we are willing to make. If x_i is a linear function of θ and F_i (row i of the feature matrix), maximizing the likelihood L(θ) is equivalent to maximizing the entropy. Specifically, assuming that

x_i = F_i θ = θ_0 + θ_1 F_{i1} + · · · + θ_d F_{id}

creates a (Bayesian) linear regression problem; if the prior p(θ) is Gaussian with zero mean, finding θ is a standard least-squares problem. Alternatively, assuming that

logit(x_i) = log(x_i / (1 − x_i)) = F_i θ

creates a (Bayesian) logistic regression problem. Since judgments are either 0 or 1, the logistic regression formulation makes the most sense.
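
Before fitting any particular model, the calibration condition can be checked empirically: among documents assigned belief near α, roughly an α fraction should turn out to be relevant. The sketch below bins synthetic beliefs and compares predicted and observed relevance rates; the data are generated to be calibrated by construction, purely for illustration.

# Sketch of checking the calibration condition. In practice the beliefs would
# come from the fitted relevance model and the labels from the qrels.

import numpy as np

rng = np.random.default_rng(3)
beliefs = rng.random(5000)
labels = rng.random(5000) < beliefs       # perfectly calibrated by construction

bins = np.linspace(0.0, 1.0, 11)
which = np.digitize(beliefs, bins) - 1
for b in range(10):
    mask = which == b
    if mask.any():
        print(f"belief in [{bins[b]:.1f}, {bins[b+1]:.1f}): "
              f"predicted {beliefs[mask].mean():.2f}, observed {labels[mask].mean():.2f}")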

Logistic Regression

Logistic regression is a powerful tool in statistics and machine learning. Because it is defined for 0-1 response variables, it is useful in IR; two notable instances are as a probabilistic retrieval model (Gey, 1994) and as a document classifier (Genkin et al., 2007).

In logistic regression, the response variables X_i are assumed to have a Bernoulli distribution with parameter p_i. Given observed data x, the likelihood of a particular vector of parameters p is therefore

L(p) = ∏_i p_i^{x_i} (1 − p_i)^{1 − x_i}.

Logistic regression is a linear model, which means that the expected values of the parameters p_i are a linear function of features (covariates) F_i and model parameters θ that are to be found:

p_i = g^{−1}(θ_0 + θ_1 F_{i1} + · · · + θ_d F_{id}) = g^{−1}(F_i θ).

The most natural choice for g^{−1} is the function that has the log-odds as its inverse, so that

g(p_i) = log(p_i / (1 − p_i)) = F_i θ,    p_i = exp(F_i θ) / (1 + exp(F_i θ)).

Then we may write the likelihood as

L(p) = L(θ) = ∏_i ( exp(F_i θ) / (1 + exp(F_i θ)) )^{x_i} ( 1 / (1 + exp(F_i θ)) )^{1 − x_i}.

This expression can be maximized using the iteratively reweighted least squares (IRLS) algorithm, in which the problem is treated as a weighted linear regression (solved by least squares) and the weights are updated iteratively. Given a θ found by IRLS, then, the probability of every sampled instance is found by applying the inverse link g^{−1}:

p^n = E[X^n] = exp(Fθ) / (1 + exp(Fθ));    p_i = E[X_i] = exp(F_i θ) / (1 + exp(F_i θ)).

Finding the probability of a particular assignment to all x^n variables (as in Eq. 6.1) is a simple product over these fitted parameters:

p(x^n | J, F) = ∏_i p_i^{x_i} (1 − p_i)^{1 − x_i}.

In the Bayesian regression framework, a prior distribution is specified for θ, and Markov-chain Monte Carlo (MCMC) simulation is used to obtain a posterior distribution (Gelman et al., 2004). MCMC is too slow for an online problem such as ours, so we sidestep the issue by assuming a Gaussian prior on θ and ignoring all information about the posterior except its expectation.

Since logistic regression seems to be such a natural fit in our framework, that is the model structure we will assume. Within that model structure, different sets of features may produce very different results; which features are best is an empirical question to be addressed in Section 6.2.

There is the slight problem that the assumption of the data being randomly sampled has not really disappeared. Instead, it has been shifted from the stopping condition itself to training the logistic regression model. However, it is easier to deal with the non-randomness in training, through prior selection and feature selection, smoothing, training with cross-validation, and semi-supervised techniques.
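
The IRLS procedure itself is short enough to sketch. The following is a minimal, ridge-penalized version (the λ term stands in for the Gaussian prior mentioned above); the synthetic features and labels and the specific λ and iteration count are illustrative, not the dissertation's settings.

# A compact IRLS sketch for logistic regression, following the formulation
# p_i = exp(F_i theta) / (1 + exp(F_i theta)) used in the text.

import numpy as np

def fit_logistic_irls(F, x, lam=1e-3, iters=25):
    """Return theta maximizing the (ridge-penalized) logistic log-likelihood."""
    n, d = F.shape
    theta = np.zeros(d)
    for _ in range(iters):
        eta = F @ theta
        p = 1.0 / (1.0 + np.exp(-eta))           # fitted probabilities of relevance
        W = p * (1.0 - p)                        # IRLS weights
        # Weighted least-squares update on the working response z.
        z = eta + (x - p) / np.maximum(W, 1e-8)
        A = F.T @ (W[:, None] * F) + lam * np.eye(d)
        theta = np.linalg.solve(A, F.T @ (W * z))
    return theta

rng = np.random.default_rng(4)
F = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 3))])   # intercept + features
true_theta = np.array([-0.5, 1.5, -2.0, 0.7])
x = rng.random(200) < 1.0 / (1.0 + np.exp(-F @ true_theta))     # 0/1 "judgments"
theta = fit_logistic_irls(F, x.astype(float))
p_hat = 1.0 / (1.0 + np.exp(-F @ theta))                        # beliefs for each document
print(np.round(theta, 2))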

Robustness and Reusability

Supposing there is a set of features that can satisfy the relevance belief conditions under the modeling assumptions of logistic regression, robustness, and therefore reusability, is achieved: since the probability of a particular ranking is based on global beliefs about the relevance of documents and not tied to the systems being ranked, a set of judgments and a set of beliefs can be ported to any other set of systems or any other evaluation measure without biasing the error rates.

This is reusability in the sense that confidence estimates can be "trusted": a high confidence means that few or no additional judgments are needed to make a reliable decision about the relative quality of systems; low confidence means that more judgments are probably needed before any strong conclusions can be made. The key is that even though the judgments may have come from a completely different set of systems, they are still trustworthy if the model is good (i.e. if the calibration condition is satisfied).

How is reusability achieved across sites in practice? Appendix A discusses qrels, files of relevance judgments made available to the research community. In this framework, in addition to the qrels, an additional set of relevance estimates would also be distributed. These estimates can be augmented as necessary by other sites.

Distribution of Measures

The distribution results of Chapter 4 relied on the priors being "nice": uniform either over possible assignments of relevance or over numbers of relevant documents. If p_i can be different for every document, the problem would seem to regress back to a problem of summing over O(2^n) possible assignments. However, it is not that bad: the binomial prior, by virtue of assuming that documents are independently relevant, can accommodate these probabilities of relevance quite easily, and in fact it requires a weaker assumption: not that the relevance of two documents is independent, but that the relevance of those documents is conditionally independent given the features. The more information about relevance the features contain, then, the more likely the assumption is to be met, and the more applicable that prior will be.

Additionally, even without uniform p_i, the approximate normality distribution results for the first prior generally hold. Take, for instance, precision at k, written as

prec@k = (1/k) Σ_{i=1}^n X_i I(A_i ≤ k).

Even if p_i ≠ 1/2, it is still approximately normal, because the Central Limit Theorem only requires that the variates be independent. Here are the expectations and variances of evaluation measures of interest under the new regime (q_i = 1 − p_i):

E[prec@k] = (1/k) Σ_{i=1}^n p_i I(A_i ≤ k)
Var(prec@k) = (1/k²) Σ_{i=1}^n p_i q_i I(A_i ≤ k)

E[DCG@k] = Σ_i p_i g(1) d(A_i) I(A_i ≤ k)
Var(DCG@k) = Σ_i p_i q_i g(1)² d(A_i)² I(A_i ≤ k)

E[rec@k] ≈ (1 / Σ_i p_i) Σ_{i=1}^n p_i I(A_i ≤ k)
Var(rec@k) ≈ (1 / (Σ_i p_i)²) Σ_{i=1}^n p_i q_i I(A_i ≤ k)

E[AP] ≈ (1 / Σ_i p_i) ( Σ_i a_ii p_i + Σ_{i≠j} a_ij p_i p_j )
Var(AP) ≈ (1 / (Σ_i p_i)²) ( Σ_i a_ii p_i q_i + Σ_{i≠j} a_ij p_i p_j (1 − p_i p_j) + 2 Σ a_ii a_ij p_i p_j q_i + 2 Σ a_ij a_ik p_i p_j p_k q_i ).

The approximations, which had previously been on the order of n2^n, are now on the order of n Σ_i p_i. The expectations and variances of the differences in measures are easily construed from the above.
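
As a quick numerical check of the precision@k and DCG@k formulas above, the sketch below compares them against Monte Carlo samples of assignments drawn from independent per-document beliefs. The ranks, beliefs, gain g(1) = 1, and the particular discount d(r) = 1/log2(r + 1) are illustrative choices.

# Closed-form expectation/variance of prec@k and DCG@k under independent p_i,
# checked against Monte Carlo sampling of assignments of relevance.

import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 10
p = rng.random(n)                       # beliefs P(X_i = 1); judged docs would be 0 or 1
q = 1.0 - p
rank = rng.permutation(n) + 1           # A_i: rank of document i in one system
top = rank <= k
d = 1.0 / np.log2(rank + 1)             # discount d(A_i), an illustrative choice

E_prec, V_prec = p[top].sum() / k, (p * q)[top].sum() / k**2
E_dcg, V_dcg = (p * d)[top].sum(), (p * q * d**2)[top].sum()

X = rng.random((100000, n)) < p         # sampled assignments x^n
prec = X[:, top].sum(axis=1) / k
dcg = (X[:, top] * d[top]).sum(axis=1)
print(f"prec@{k}: E {E_prec:.4f} vs {prec.mean():.4f}   Var {V_prec:.5f} vs {prec.var():.5f}")
print(f"DCG@{k}:  E {E_dcg:.4f} vs {dcg.mean():.4f}   Var {V_dcg:.5f} vs {dcg.var():.5f}")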

6.2 Features for Modeling Relevance

In this section we investigate three different feature sets that can be used within the logistic framework developed above. One is based on similarities between documents, the other two on the performance of the retrieval systems themselves: one on the document weights with respect to the evaluation measure, one on preferences between documents expressed by retrieval systems.

Document Similarity

The cluster hypothesis states that "closely associated documents tend to be relevant to the same requests" (van Rijsbergen, 1979). If true, it suggests that a document similar to a judged relevant document is likely to be relevant itself. Though the hypothesis says nothing about nonrelevant documents, it is probable that a document similar to a judged nonrelevant document is also likely to not be relevant. We could take advantage of this for learning a model of relevance by using the similarities between documents as features of relevance (Carterette and Allan, 2007b). This is similar to an approach taken by Diaz (2005), though that model was unsupervised and based on scores returned by retrieval systems, while ours can be supervised by the judgments and based on probabilities of relevance.

Let F be an n × n matrix in which entry F_ij = sim(i, j) is an estimate of the similarity between documents i and j. In the logistic regression framework, then,

log(p_i / (1 − p_i)) = θ_0 + θ_1 sim(i,1) + θ_2 sim(i,2) + · · · + θ_n sim(i,n).
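
A compact sketch of this feature set, with invented data: similarities between documents serve as the columns of F, the model is fit on the judged documents, and the fitted model supplies beliefs for the unjudged ones. The toy document vectors, the median-split "judgments", and the use of scikit-learn's LogisticRegression as the fitting routine are all stand-ins rather than the dissertation's setup.

# Sketch: similarity features F[i, j] = sim(i, j) plugged into a logistic model
# trained on judged documents, producing beliefs for the rest.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n, judged_n = 60, 20
docs = rng.random((n, 8))                       # toy document representations
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
F = unit @ unit.T                               # F[i, j] = cosine similarity sim(i, j)

judged = np.arange(judged_n)                    # indices of judged documents
labels = (docs[judged, 0] > np.median(docs[judged, 0])).astype(int)   # toy judgments

# Features of document i: its similarities to every document (one column per doc),
# as in log(p_i / (1 - p_i)) = theta_0 + sum_j theta_j sim(i, j).
model = LogisticRegression(C=1.0, max_iter=1000).fit(F[judged], labels)
beliefs = model.predict_proba(F)[:, 1]          # P(X_i = 1) for all documents
print(np.round(beliefs[judged_n:judged_n + 5], 2))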

Smoothing

Because there are more features than judged documents, the system is overspecified; the likelihood is unbounded. There are two solutions. The first is regularization, in which an extra term is added to the log likelihood:

log L(θ) = Σ_i [ x_i log( exp(F_i θ) / (1 + exp(F_i θ)) ) + (1 − x_i) log( 1 / (1 + exp(F_i θ)) ) ] + λ θᵀθ,

where λ is the regularization parameter, which can be interpreted as a prior on the coefficients θ. Larger values of λ are stronger priors, keeping the values of θ closer to zero.

The second solution is to place a prior on relevance X_i that reflects the proportion of documents that are known to be relevant. This is a semi-supervised approach, allowing the use of every document, even those that have not been judged, for training. In this model we set:

x_i = 1 if i has been judged relevant,
x_i = 0 if i has been judged nonrelevant,
x_i = (R + 1) / (R + N + 2) if i has not been judged,

where R and N are the numbers of documents judged relevant and nonrelevant, respectively.

Measures of Similarity

There are many ways to measure the similarity between documents, depending on document models, assumptions about the measure space, parameter settings, and other factors. Among the most well-known and well-studied is the cosine similarity, in which documents are modeled as vectors in V-dimensional space and the similarity between two documents is the angle between their respective vectors (Salton and McGill, 1983):

sim(i, j) = cos(i, j) = Σ_{t ∈ V} w_it w_jt / ( sqrt(Σ_{t ∈ V} w_it²) sqrt(Σ_{t ∈ V} w_jt²) ),

where w_it is the weight of term t in document i. Term weights are generally calculated using the term frequency in the document (tf) and the inverse document frequency in the collection (idf).

Figure 6.1 shows a simple example of the logistic regression mapping vector space similarities to probabilities. There are five documents, each represented by a vector in the left plot. Similarities are calculated between each document and the two documents labeled FT (a nonrelevant document) and FBIS (a relevant document); these similarities are the features. The right plot shows how the logistic regression model learns that increasing similarity to FBIS and decreasing similarity to FT result in increasing probability of relevance. Each of the five documents maps to a point on the surface based on its similarity to those two documents; the FT document that is very similar to FT and not similar to FBIS is not likely to be relevant, while the FT document that is somewhat similar to FBIS and not similar to FT is more likely to be relevant.

Figure 6.1. An example mapping of vector space similarities to probabilities by logistic regression. The left plot shows document vectors in 3-dimensional space. The right shows probability of relevance increasing with similarity to FBIS (a relevant document) and decreasing with similarity to FT (nonrelevant).
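
A minimal illustration of tf-idf weighting and the cosine similarity above, on three toy "documents"; the tokenization and the particular weighting (raw tf times log(N/df)) are simple stand-ins for whatever weighting scheme is actually used.

# Sketch: tf-idf term weights and cosine similarity sim(i, j) = cos(i, j).

import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "information retrieval evaluation with relevance judgments"]
N = len(docs)
tokenized = [d.split() for d in docs]
df = Counter(t for toks in tokenized for t in set(toks))     # document frequencies

def tfidf(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(toks) for toks in tokenized]
print(round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))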

There are other measures of similarity based on other document models. If documents are modeled as multinomial distributions over the vocabulary (as in language modeling approaches), similarity could be defined as proportional to the KL-divergence between two document models. Similarity could also be defined on a manifold rather than in Euclidean space, or as a projection from a higher-dimensional space.

Practical Considerations

For a large corpus it is impractical to train over every document: for one thing, the feature matrix would be too large to work with; for another, it is likely that the relevant class would be too small to learn effectively. A practical solution is to train using the union of the judged documents and the top k unjudged documents retrieved by the systems being evaluated.

It may also be prudent to limit the number of features. The cluster hypothesis could be taken verbatim (as a statement about relevant documents only), leaving the features to be the similarities between a document i and the judged relevant documents. Or the features could be limited to similarities to only judged documents.

In the vector space model, very common words like "the" and "of" may cause two documents to appear more similar than they really are. A list of such words, a stopword list, can be consulted so that these words are not counted. Similarly, if two documents contain two different words that are clearly morphological variants of the same root word (e.g. "running" and "ran"), their similarity may be underestimated. Stemming algorithms such as Porter's or Krovetz's create equivalence classes of such words for better estimates of similarity. Both of these techniques have the additional advantage of reducing the dimensionality of the vocabulary space, creating more parsimonious models. This can be taken to an extreme, as in binning models (Anh and Moffat, 2005): terms are placed into k bins by some combination of tf and idf, and similarity is calculated over only those bins.
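
The pooling and feature-restriction steps described above are straightforward to sketch. In the snippet below the ranked lists, the judged set, and the similarity matrix are random placeholders; only the "judged documents plus top-k unjudged per system" pool and the judged-only feature columns follow the text.

# Sketch: build the training pool and restrict features to similarities with
# judged documents. All inputs are synthetic placeholders.

import numpy as np

rng = np.random.default_rng(7)
n_docs, m, k = 1000, 3, 20
ranked = [rng.permutation(n_docs) for _ in range(m)]        # each system's ranking
judged = set(rng.choice(n_docs, size=50, replace=False).tolist())

# Training pool: judged documents plus the top-k unjudged documents from every system.
pool = set(judged)
for r in ranked:
    pool.update([d for d in r if d not in judged][:k])
pool = sorted(pool)

# Feature matrix restricted to similarities with judged documents only.
sim = rng.random((n_docs, n_docs))                          # stand-in for sim(i, j)
judged_list = sorted(judged)
F = sim[np.ix_(pool, judged_list)]                          # |pool| x |judged| features
print(F.shape)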


More information

Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing

Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing 1. Purpose of statistical inference Statistical inference provides a means of generalizing

More information