Axiometrics: Axioms of Information Retrieval Effectiveness Metrics

Size: px

Start display at page:

Download "Axiometrics: Axioms of Information Retrieval Effectiveness Metrics"

Mervin Armstrong
5 years ago
Views:

1 Proceeings of the Secon Australasian Web Conference (AWC 214), Aucklan, New Zealan Axiometrics: Axioms of Information Retrieval Effectiveness Metrics Ey Maalena Stefano Mizzaro Department of Maths an Computer Science University of Uine Uine, Italy Abstract The evaluation of retrieval effectiveness has playe an is playing a central role in Information Retrieval (IR). A specific issue is that there are literally ozens (most likely more than one hunre) IR effectiveness metrics, an counting. In this paper we propose an axiomatic approach to IR effectiveness metrics. We buil on the notions of measure, measurement, an similarity; they allow us to provie a general efinition of IR effectiveness metric. On this basis, we provie a efinition of some common metrics an we then propose an justify some axioms that every effectiveness metric shoul satisfy. We also iscuss some future evelopments. 1 Introuction Effectiveness evaluation is of paramount importance in Information Retrieval (IR). IR has become one of the most evaluation-oriente fiels in computer science since the first IR systems (IRS) were evelope in the late 195 s. Several effectiveness metrics have been propose so far. A survey in 26 (Demartini & Mizzaro 26) counte more than 5 metrics, taking into account only the system oriente effectiveness metrics. In an extene version of the survey (Demartini et al. n..), yet unpublishe, about one hunre metrics are collecte, let alone user-oriente ones or metrics for tasks somehow relate to IR, like filtering, clustering, recommenation, summarization, etc. As state for example in (Robertson 26), there is nothing close to agreement on a common metric that everyone will use. It is a iffuse opinion that ifferent metrics evaluate ifferent aspects of retrieval behavior (Buckley & Voorhees 2, Robertson 26). Each of these metrics has its own avantages but also limitations. Metric choice is neither a simple task, nor it is without consequences: an inaequate metric might mean to waste research efforts improving systems towar a wrong target. However, some researchers simply o not investigate into the suitability of the metric for the problem itself an they seem to choose just the most popular metrics for their experiments. We cannot exclue the temptation for researchers to choose, among all available metrics, those that help corroborating their claims, or even to esign a new metric to this aim. It is not clear what to o when two metrics isagree. Copyright c 214, Australian Computer Society, Inc. This paper appeare at the The Secon Australasian Web Conference (AWC214), Aucklan, New Zealan, January 214. Conferences in Research an Practice in Information Technology (CRPIT), Vol. 155, S. Cranefiel, A. Trotman, J. Yang, E. Reprouction for acaemic, not-for-profit purposes permitte provie this text is inclue. It is clear that a better unerstaning of the formal properties of effectiveness metrics woul help to avoi wasting time in tuning retrieval systems accoring to effectiveness metrics inaequate to specific purposes, an it will also inuce researchers to make explicit an clarify the assumptions behin metrics. This paper proposes an axiomatic approach to effectiveness metrics, presenting some basic axioms that any reasonable metric shoul satisfy an that are formulate in a general way. The paper is structure as follows. Although there has not been much work on formal accounts of IR effectiveness metrics, in Sect. 2 we briefly recall such previous work. To have a common language to state the axioms, we then efine a general framework. In Sect. 3 we rely upon the notions of measurement, measure, an scale of measurement to provie a common efinition of both the output of an IR system an a relevance assessment by a human juge; the notion of similarity between the two is then analyze in Sect. 4. An analysis of the measurement scales use in IR an their implications for similarity is also propose. Measure an similarity allow us to efine the notion of effectiveness metric in Sect. 5. In Sect. 6 some metrics are efine within the framework, to emonstrate its effectiveness power. In Sect. 7, some axioms are state. Conclusions an future work are presente in Sect. 8. This stuy is one of the first steps within a larger project, name Axiometrics, that is a mix of the wors axioms (or axiomatic) an metric an stans for axiomatic approach to IR effectiveness metrics. Axiometrics is one of the research irections iscusse an selecte as interesting at the recent SWIRL meeting ( One companion paper has alreay been publishe on this topic (Busin & Mizzaro 213); although the present work is similar in the first part, we propose a rather ifferent set of axioms an several refinements an improvements: since Sects. 6 an 7 are completely rewritten, about 4% of the paper is novel. 2 Relate Work Although formal approaches have high importance in the IR fiel, they have mainly focusse on the retrieval process rather than on effectiveness metrics themselves (see, e.g., (Fang et al. 24, Fang & Zhai 25)). However, some research specific to effectiveness metrics oes exist, an it is briefly iscusse here. Early attempts have been mae by Swets (Swets 1963), who liste some properties of IR effectiveness metrics, an van Rijsbergen (van Rijsbergen 1979, Ch. 7), who followe an axiomatic approach. In (Bollmann 1984), Bollmann iscusses the risk of obtaining inconsistent evaluations on a ocument collection an 39

2 CRPIT Volume The Web 213 on its subcollections. Two axioms on effectiveness metrics, name the Axiom of monotonicity an the Archimeean axiom, are propose, an their implication is presente as a theorem. These approaches are evelope on the basis of binary relevance (either a ocument is relevant or it is not) an binary retrieval (either a ocument is retrieve or it is not) only. Here we o not make any assumption on the notions of relevance an retrieval (binary, ranke, continuous, etc.). Our approach is meant to be more general. Yao (Yao 1995) focusses on the notion of user preferences to measure the relevance (or usefulness) of ocuments. He aopts a framework where user jugements are escribe as a weak orer. On this basis he then proposes a new effectiveness metric that compares the relative orer of ocuments. The propose metric is prove to be appropriate through an axiomatic approach. More recently, Amigó et al. in (Amigó et al. 29) focus their formal analysis on evaluation metrics for text clustering algorithms fining four basic formal constraints. These constraints shoul be intuitive an coul point out the limitations of each metric. Moreover, it shoul be possible to prove formally which constraint is satisfie by a metric or a metric family, not just empirically. They foun BCube metrics (BCube precision an Bcube recall) to be the only ones satisfying the four propose constraints. Still Amigó et al. in (Amigó et al. 211) iscuss, with a similar approach, a unifie comparative view of metrics for ocument filtering. They obtain that no metric for ocument filtering can satisfy all esirable properties unless a smoothing process is performe. Finally, in an even more recent work (Amigó et al. 213), they start from a set of formal constraints to efine a general metric for ocument organization tasks, that inclue retrieval, clustering, an filtering. 3 Measurement 3.1 Measurement, Measures, an Scales Measurement can be efine as a process aime at etermining a relationship between a physical quantity an a unit of measurement (Wikipeia 212). In 1946, Stevens (Stevens 1946) efine measurement for social sciences as the assignment of numerals to objects or events accoring to some rule. A more recent an wiely aopte efinition of measurement, propose by Michell (Michell 1997), is the numerical estimation an expression of the magnitue of one quantity relative to another. A particularly iscusse issue is how the measurement is expresse. Stevens propose the four stanar measurement scales (Stevens 1946): Nominal, Orinal, Interval, Ratio. This classification has become a traition in various fiels an it provies useful insights (Robertson 26), although it is rather simple, leaves asie some subtleties, an it has been criticize (Velleman & Wilkinson 1993). Inee, ifferent scales an classification have been introuce through time (Michell 1997). A slightly ifferent an quite common classification inclues: Nominal, Orinal, Interval, Log-Interval, Ratio, Absolute. A further one is mae up of ten levels of measurement (Chrisman 1998): (1) Nominal, (2) Grae membership, (3) Orinal, (4) Interval, (5) Log-Interval, (6) Extensive Ratio, (7) Cyclical Ratio, (8) Derive Ratio, (9) Counts, an (1) Absolute. Also terminology is iscusse: for example, Chrisman s position (Chrisman 1998) is that the term level shoul be preferre to scale (in this paper we use scale ). Another iscipline that provies a useful backgroun is Software Measurement (Zuse 1997), where a istinction is mae between measurement an measure: a measurement is the process through which values are assigne to attributes of entities of the real worl; a measure is the result of that process, so it is the assignment of a value to an entity with the goal of characterizing a specifie attribute. In the rest of this paper we refer to measurement an measure with these meanings. 3.2 IR as Relevance Measurement The evaluation process in IR is base on two quantities: (i) an automate evaluation, by an IR system, of the possible relevance of a ocument, an (ii) the user s (or assessor s) estimation of the relevance of a ocument. We propose to use the above concepts to moel these two quantities: given a query, a system tries to measure the relevance of the ocuments to the query, for example to rank the ocuments; given (a escription of) an information nee, an assessor tries to measure the relevance of the ocuments to the nee. We therefore have two kins of relevance measurements (an measures as well): one mae by a system an referre to in the following as system relevance measure(ment), an one mae by a human an referre to in the following as user / assessor / human relevance measure(ment). A notion of measure / measurement common to both quantities (system an assessor) will allow us to efine a notion of similarity among them. In IR, alternative terms to measurement have been an are use, e.g., assessment, jugment, preiction, score, estimate, amount. Some of what follows coul be expresse also without reference to measurement; however, our choice provies a goo groun an allows us to exploit the measurement machinery. Yet on terminological issues, we also note that since in the IR fiel it is common to speak of effectiveness metrics, we stick with this terminology, also because using metrics for the effectiveness metrics avois confusion with the measure(ment) of relevance mae by IRSs an humans. However, let us remark that effectiveness measure woul perhaps be more appropriate because metrics satisfies some peculiar properties that are not require for measures in their general efinitions: not all measures are metrics. Turning to measurement scales, in IR it is common to use two ifferent scales for system an assessor measurements. Moreover, all the items of the traitional scales make sense, as shown by some examples: Nominal: categorizing the ocuments, e.g., as long / short, or on one specific topic among the topics in a set (e.g., Java vs. C++) or from a specific author, etc. A more subtle question is whether relevance classification (into relevant an nonrelevant) is a nominal classification as well: we will come back shortly on this. Orinal: the usual rank of retrieve ocuments by an IRS, but also grae relevance assessments like in the four level relevance scale HRP N ( Highly relevant, Relevant, Partially relevant, an Non-relevant ). Interval: IRSs that try to calibrate their Retrieval Status Values (RSV) may use internal interval scales before proviing the ranke output. Ratio an Absolute. IRSs that try to estimate the amount of relevance in a ocument, as has been suggeste several times (Swets 1963, 4

3 Nominal: Orinal: Interval Ratio / Absolute Unrelate categories Relate Categories: Across scales Within scales < (strict orer) (orer with equality) P. O. (Partial Orer) Ranke Categories Figure 1: Measurement scales in IR Della Mea & Mizzaro 24), or human juges that use magnitue scale estimation (Eisenberg 1988). On a more careful look, however, there are some peculiar features in IR. The usual relevant / nonrelevant binary scale, when use by both IRS an human assessor, can inee be seen as a nominal scale (an this leas to efining some well known metrics like precision an recall). However, the situation is more complex. When a grae relevance scale using more than two values is use, it seems unquestionable that it is not nominal but orinal. This is clear when comparing a grae relevance scale by a human assessor an the usual ranking of ocuments by an IRS. In other terms, in IR, the categories in a nominal scale are often (although not always) relate or, more precisely, ranke. For instance, let us consier a four level relevance scale HRP N, that is recently being use an iscusse quite often. Within this scale, the four categories are naturally ranke, an thus H is more similar to R, less to P an even less to N (an so on), whereas there is no rank among the categories in a classical nominal scale: the HRN P scale is actually orinal. Furthermore, it is quite customary to collapse H, R, P an N into the binary relevance scale R an N as either HR R an PN N or HRP R an N N ( rigi an relaxe mapping in (NTCIR Project 212)). This means that there exists a relationship across the two scales HRPN an RN, an also that the RN scale is orinal itself. However, in the classical binary relevance an binary retrieval case, the RN scale is nominal: the scale of a relevance measurement can be nominal or orinal epening on the scale of the other relevance measurement. Finally, the rank of categories can be partial: if a ocument is on Java, a misclassification into the C++ category is a smaller error than a classification into Sport, but there is not a total ranking of the scale C++, Java, Sport. To make things even more complex, the relationship across two scales may concern ifferent scales: for example, it is possible to compare a ranke output by an IRS (orinal scale) with a relevance jugment by a human assessor using magnitue scale estimation (Eisenberg 1988) (ratio or absolute scale). Thus an IRS an a relevance assessor express measurements of relevance of the ocuments in a set. Each measurement can be expresse in one of the scales in Fig Notation In the following, q represents a query, a ocument, Q a set of queries, an D a set of ocuments. We also use subscripts an primes with the obvious meaning. To istinguish ifferent measurements, we enote by ρ a relevance measurement, by a system relevance measurement, an by an assessor, user or, generally, human relevance measurement. We use Proceeings of the Secon Australasian Web Conference (AWC 214), Aucklan, New Zealan ρ,, an to represent both the measurement process an the measure itself. Given a query q an a ocument, the relevance measurement ρ applie to q an returns the relevance measure ρ(q, ). To represent the relevance measure of many ocuments (those in the set D q ) to a single query q we simply write ρ (q, D q ), that can be efine in set theoretic terms as ρ (q, D q ) = {ρ (q, ) : D q }, an recursively as ρ (q, D q {}) = {ρ (q, )} ρ (q, D q ). The relevance measure of the ocuments D q in the set of ocuments D for the corresponing queries q in Q is ρ (Q, D). It can be efine in set theoretic terms as ρ (Q, D) = {ρ (q, D q ) : ρ (q, ) ρ (q, D q ), q Q, D q D, D q is the subset corresponing to q}, an recursively as ρ (Q {q}, D) = ρ (q, D q ) ρ (Q, D). ρ (q, ) is usually a known value. When it is not known, often it can be compare, for instance by stating that ρ (q, ) < ρ (q, ), i.e., the ocument is more relevant to the query q than the ocument, accoring to the measurement ρ. The scale of a measurement ρ is enote by scale (ρ). Scales are represente by ouble square brackets an. For example, if ρ is on the above mentione four level relevance scale HRP N, then we write scale(ρ) = H, R, P, N ; a rank scale is enote by Rank ; an so on. 4 Similarity By exploiting the previous notions we can frame the notion of similarity between two relevance measurements. We nee a criterion to juge when two relevance measurements, one from the IRS an one from the human assessor, are similar. Ieally, an IRS shoul use the same measurement scale of the human assessor an provie the same measurement of the human assessor. However, IRSs are far from being perfect, an therefore the very same measurement is almost never provie. The aim of an IRS is to provie the measurement that is most similar to the human assessor / user one. Moreover, often the scales are ifferent: scale () can be fixe a priori, e.g., when a test collection provies human relevance assessments, an scale () epens on the retrieval algorithm at han, an ifferent approaches have ifferent scales. Of course, two measurements expresse on two ifferent scales can not be ientical (e.g., a rank can not be ientical to a measurement expresse on a category scale, the usual a-hoc retrieval situation). Summarizing, the IRS shoul provie the measurement that, accoring to the chosen scales, is the most similar to the human one. 4.1 Notation The similarity of two relevance measurements ρ (on a set of ocuments D an a set of queries Q) an ρ (on D an Q ) is enote as sim(ρ(q, D), ρ (Q, D )). If Q = Q then we use the notation sim Q (ρ(d), ρ (D )), or sim Q (ρ(d), ρ (D )), to inicate that Q is common. If D = D then we write sim D (ρ (Q), ρ (Q )). If both Q = Q an D = D we enote the similarity as sim Q,D (ρ, ρ ). We also use the same notation for single ocuments an queries. For example, if both ocument an query 41

4 CRPIT Volume The Web 213 sets contain only one element, i.e., Q = Q = {q} an D = D = {}, we write sim (ρ, ρ ). If there is no ambiguity on the sets of queries an ocuments, then we simply write sim (ρ, ρ ). Following the notation of Sect. 3.3, the similarity between two systems is given by sim (, ), the agreement among juges is sim (, ), while the similarity of a human relevance measurement an a system relevance measurement is sim (, ), that correspons to the effectiveness of an IRS in measuring relevance as the user oes. In the following, for the sake of simplicity, we mainly eal with this last case, sim (, ), but most of the results hol for any pair of relevance measurements. It is important to note that, in this context, it is not always possible express similarity as a number, i.e., it is possible that we can not know, nor even express, the value of sim (, ). More often, similarities can be compare so we can say if sim (, ) < sim (, ) or sim (, ) > sim (, ). For example, let us consier two IRSs that behave in exactly the same way on a set of ocuments D an a set of queries Q, an provie two relevance measurements an. Their similarities with an assessor measure are exactly the same for each ocument. Therefore, sim Q,D (, ) = sim Q,D (, ). If a new ocument / D is available an the two systems have a ifferent behavior on it such that sim Q, (, ) > sim Q, (, ) ( evaluates the ocument in a more similar way to the assessor than ), we can now infer that the similarity on the new whole collection of ocuments D {} is higher for than for : sim (, ) > sim (, Q,D {} Q,D {} ). Note that we i not nee to assign specific values to the various measurements to be able to compare them. Also note that, in general, sim q,d {} (, ) sim q,d (, ) + sim (, ). In the following of this paper we aopt a ifferent approach from (Busin & Mizzaro 213) an we will not use the similarity of sets of queries (Q) an/or ocuments (D): we will nee sim (, ) only, since we will be ealing with sets when working on metrics. 4.2 Similarity an Measurement Scale This generic notion of similarity can be specialize by taking into account the ifferent scales to obtain operational efinitions. The scenario is quite complex. Since is fixe an varies (e.g., by choosing a ifferent retrieval algorithm), similarity is not symmetrical: sim(ρ, ρ ) an sim(ρ, ρ) nee to be efine inepenently. Moreover, each of an can be over nine scales (see Fig. 1): this leas up to 9 9 = 81 cases, see Tab. 1. Actually, some combinations are rule out, because either the combination oes not make sense (e.g., categories can not be both relate an unrelate), or no notion of similarity can be efine (a nominal scale with unrelate categories oes not allow to efine a notion of similarity). However, all the other combinations make sense: although it might seem strange to have a human assessors that ranks the ocuments an an IRS that categorizes them into H, R, P, N, this is not impossible in principle. For space limitations, we only hint at how sim can be efine for some pairs of scales; the escription is at an intuitive level, but it can be formalize. (a) (b) Figure 2: Categories with across-scales relations If an are expresse on ifferent an unrelate category scales, nothing can be sai. For instance, if scale () = A, B (i.e., is a category scale with two categories A an B), an scale () = C, D, an there is no relationship neither across scales (i.e., A an B have nothing to o with neither C nor D) neither within scales (i.e., no relations between A an B, nor between C an D), then sim (, ) can not be efine. For instance, if is a measurement of the topicality of the ocuments an classifies the source of the ocuments (e.g., web page, article, technical report, etc.) then of course an can not be compare. Even if one an only one of scale () an scale () has some within scale relationship, this is not enough to efine any sensible similarity measure (if the two measures share the same categories then a similarity can be efine, but this woul mean that an acrossscales relationship exists). The following postulate simply rules out the unrelate categories scale from further analysis. Postulate 1 (Unrelate categories). Given two measurements, if the scale of at least one of them is an unrelate categories scale, then it is not possible to efine a similarity between the two measurements. There might exist across-scales relationships: some categories of scale () can be relate to some categories of scale (), or even equal (i.e., a particular case of across-scales relationships is scales with the same categories). In this case, similarity can be efine an it epens on the number of correctly an wrongly classifie ocuments: moving one ocument from a wrong category into a correct one (i.e., correctly classifying one more ocument) increases similarity. For instance, let scale() = A, B (i.e., it is expresse on a binary scale), an scale() = A, B, C, D. Let us also assume that it makes sense that ocuments in A shoul also be in A or B, an ocuments in B shoul also be in C or D (see Fig. 2(a)). In such a case, sim (, ) can be efine. Moving items from A to B (or viceversa) or from C to D (or vice-versa) oes not affect sim (, ); similarity varies by moving items from A or B to C or D (or vice-versa), or from A to B (or vice-versa). We can increase sim (, ) by moifying in in such a way that correctly classifies more ocuments. Any other moification (e.g., moving a wrongly classifie item into another wrong category) oes not affect the similarity. Let us consier another ifferent case. Let an be two system relevance measurements such that scale() = scale( ) = Rank an let be a human relevance measurement. If is obtaine from by swapping two ajacent ocuments an moving a more relevant ocument above a less relevant one without affecting anything else (see Fig. 2(b)), then sim(, ) > sim(, ). The notions of more an less relevant are efine if scale() is any orinal, interval, 42

5 Nominal Proceeings of the Secon Australasian Web Conference (AWC 214), Aucklan, New Zealan Nominal Orinal Interval Ratio / Un- Relate < P.O. Ranke Absolute relate Acr. With. Categ. Unrelate Relate Across Within < Orinal P.O. Ranke Categories Interval Ratio / Absolute Table 1: Similarity between relevance measurement scales. Empty cells mean that similarity can be efine. means that the combination oes not make sense. means that no notion of similarity can be efine. or ratio / absolute scale. The move of a less relevant ocument below a more relevant one is equivalent. Other changes can always be obtaine by several ajacent swaps (as it is well known, any permutation is a sequence of swaps). 5 Effectiveness Metric On the basis of the concepts of measurement, measurement scales, an similarity we now turn to moeling the effectiveness metrics itself. An effectiveness metric provies a numerical representation of the similarity between two relevance measurements. A metric is then a function that takes as arguments two measurements an, a set of ocuments D, an a set of queries Q, an provies as output a numeric value (usually in R): metric : D Q R. (1) A metric is efine on the basis of five components: scale(), scale(), a notion of similarity sim, how the values on single ocuments are average over the set D (we enote the corresponing averaging function with avgd), an how these averages are average over the set Q (avgq). We can write metric (scale(), scale(), sim, avgd, avgq). In most cases we o not nee to specify all the components of the effectiveness metric. Also, we o not nee to refer to the metric itself, but rather to the value of the metric in a specific measurement (the context will resolve the ambiguity). For instance we enote with metric((q, D), (Q, D)) the effectiveness metric value obtaine when evaluating with respect to on the set of queries Q an on the set of ocuments D. Usually, Q an D are common to an, so we often write metric Q,D (, ).When not neee, we omit Q an D as in metric(, ), an sometimes in place of sets Q an D we also use single elements q an, as in metric (, ). 6 Some Metrics In this section we efine the sim, avgd an avgq functions for some common metrics, to emonstrate that the framework an notation shoul be general enough to moel most (if not all) effectiveness metrics. Table 2 presents tentative efinitions of some common metrics. The table shoul be unerstanable, but we briefly iscuss some metrics. For instance, for both Precision an Recall, scale() = scale() = R, N. Let Rel = { D () = R} an Ret = { D () = R} be the sets of relevant an retrieve ocuments, respectively. Similarity is (see the table) 1 if a ocument is both retrieve an relevant an otherwise., the avgd functions for Precision an Recall are the arithmetic means over the sets Ret an Rel, respectively, an both the avgq functions are the arithmetic mean over the set Q. For MAP an MAP-like metrics (GMAP, logitap, yaap, etc.) we have two ifferent scales: scale() = R, N an scale() = Rank. Similarity is more complex on a rank (see the formula in the table), since to unerstan the similarity of a ocument we nee to analize also other ocuments in the rank. Similarity can then be use to efine AP values an then MAP is obtaine using as avgq the arithmetic mean of the AP values. GMAP is similar to MAP, the only ifference being on avgq since in GMAP the geometric mean is use. GMAP can also be efine in an equivalent way as the average of logarithms, an logitap efinition is similar. (an yaap shoul be similar as well). The table also inclues MAP@n, i.e., MAP compute averaging only the AP values of the relevant ocuments retrieve in the first n rank positions, an consiering as the AP of ocuments retrieve after rank n (this is the metric use in TREC-like settings). The last row efines ADM (Della Mea & Mizzaro 24). 7 Axioms We now can list some axioms: they efine properties that, ceteris paribus, any effectiveness metric shoul satisfy. Axioms can also be interprete as a set of constraints on a search space. We formalize as axioms the properties of similarity between relevance measurements (Subsection 7.1), we then present some axioms that efine the relationships between similarity an metrics (Subsection 7.2), an we then present metric-specific axioms (Subsection 7.3). 7.1 Similarity The first axioms represent basic constraints on similarity, an metrics are not concerne yet. Axiom 1 (Similarity of ocuments). Let q be a query, an two ocuments, a human relevance measurement an a system relevance measurement such that (q, ) = (q, ) an (q, ) = (q, ). sim (, ) = sim (, ). Axiom 2 (Similarity of queries). Let q an q be two queries, a ocument, a human relevance measurement an a system relevance measurement such that (q, ) = (q, ) an (q, ) = (q, ). sim (, ) = sim (, ). q, 43

6 CRPIT Volume The Web 213 Metric scale() scale() sim(, ) avgd avgq Precision R, N { 1 if () = () otherwise, Pq = 1 Ret Recall Rq = 1 Ret sim 1 (, ) Rel sim 1 (, ) q Q P q q Q R q P@n = 1 n Rel () n sim 1 (, ) q Q q R-Prec R-Precq = 1 R, N Rank 1 if () = R 1 if () = N ( ) = R ( ) > () otherwise. MAP APq = 1 MAP@n APq = 1 Rel () sim 1 (, ) Rel sim 1 (, ) Rel & () n sim 1 (, ) q Q R-Prec q q Q AP q q Q AP q GMAP APq = 1 Rel sim (, ) q Q AP q 1 q Q log AP q logitap APq = 1 Rel sim 1 (, ) q Q AP +ɛ log 1 AP +ɛ ADM [, +1] [, +1] (q, ) (q, ) ADMq = 1 1 D i D sim (, ) i q Q ADM q Table 2: Metrics on the basis of their components as per formula. 1 44

7 Proceeings of the Secon Australasian Web Conference (AWC 214), Aucklan, New Zealan (a) Divergent (b) Divergent (c) Overestimate (a) Overestimate (b) Unerestimate () Unerestimate (c) Divergent () Divergent Figure 3: Two systems having equal similarity to Figure 4: Two systems with ifferent similarity to Axiom 3 (Similarity of two systems). Let q be a query, a ocument, a human relevance measurement an an two system relevance measurements such that Axiom 5 (Systems with ifferent effectiveness). Let q be a query, a ocument, a human relevance measurement an an two system relevance measurements such that (q, ) = (q, ). sim (, ) > sim (, ) (4) sim (, ) > sim (, ). (5) (2) sim (, ) = sim (, ). (3) Let us remark that (3) oes not entail (2). Diagrams like those in Fig. 3 can be helpful to intuitively unerstan the situation: Figs. 3(a) an 3(b) represent the cases in which an respectively overestimate an unerestimate (or vice-versa) by the same amount; then the similarity (represente in the figure by the two arcs on the right) of the two systems is the same, but obviously (2) oes not hol. Conversely, as state by the axiom, when (2) hols then (3) hols as well (Figs. 3(c) an 3()) From Similarity to Metric Different systems an metric (, ) > metric (, ). Remark 2. Conition (4) means that is less wrong than. The combination of (4) an (5) means that the two systems are wrong in the same irection: if overestimates (unerestimates), then overestimates (unerestimates) it even more. Figs. 4(a) an 4(b) show these two cases. Conition (4) rules out the other two situations, shown in Figs. 4(c) an 4(), in which no constraint on the metric can be state for the same reasons mentione in Remark 1. The following axiom sets a constraint on the metric in one of the two last cases of Fig. 3 (Figs. 3(c) an 3()) Axiom 4 (Systems with equal effectiveness). Let q be a query, a ocument, a human relevance measurement an an two system relevance measurements such that We now turn to compare a system measurement for two ocuments an. Let us assume, without loss of generality, that is more relevant than (() > ( )). We can consier two cases: Different ocuments sim (, ) > sim (, ) (see Fig. 5(a)); (q, ) = (q, ). sim (, ) < sim (, ) (Fig. 5(b)). metric (, ) = metric (, ). Remark 1. Note that by using, in this axiom, a conition like (2) an not like (3) the first two cases of Fig. 3 are rule out, an inee in those cases we cannot state any constraint on the metric: a recalloriente metric woul give a higher value to a system overestimating all the ocuments (retrieving all ocuments means that recall is 1), whereas a precisionoriente metric woul o the opposite. In the first case we have a smaller error in the more relevant ocument an a larger error in less relevant ocument. In such a case, no constraint can be state on the metric since, as it is often state, earlier rank positions are more important than later ones. Conversely, the secon case allows to state some axioms. We analyze it an we start by observing that, since the system coul overestimate or unerestimate the ocuments an, the case of Fig. 5(b) can be subivie into four cases: 45

8 CRPIT Volume The Web 213 an (6) an (8) hol (i.e., both an are overestimate), then metric (, ) > metric (, ). The secon axiom concerns the case of Fig. 6(b). Axiom 7 (Unerestimate ocuments). Let q be a query, an two ocuments, a human relevance measurement an a system relevance measurements such that () > ( ), (a) Best top similarity (b) Best bottom similarity () > ( ), sim (, ) < sim (, ), Figure 5: Two ocuments with ifferent similarity an (7) an (9) hol (i.e., both an are unerestimate), then metric (, ) > metric (, ). (c) Divergent () Divergent (a) Both over- (b) Both estimate unerestimate Remark 3. The conition (1) rules out the critical case in which both ocuments an are unerestimate but there is a swap as shown in Fig. 7. In such a case, although the similarity is higher for, no constraint can be impose on the metric, again for the reason of Remark 1 about top rank positions. In Axiom 6, this aitional conition is not necessary, because if both ocuments are overestimate an similarity is higher for the less relevant ocument, then no swap is possible. Figure 6: Four possible cases overestimates both an (see Fig. 6(a)); unerestimates both an (Fig. 6(b)); overestimates (Fig. 6(c)); unerestimates (Fig. 6()). an an unerestimates overestimates Only the first two cases allow to express some constraints on the metric, again for the same reason of Remark 1 (an 2). We analyze the first two cases. Let us start by noting that: if overestimates then sim (( ), ()) < sim ((), ( )); In the following we will nee to write that a metric value is more affecte by a ocument than by another ocument. Formally, we efine: Definition 1. We write that Ametric(,) if an only if sim (( ), ()) > sim ((), ( )); (7) metric (, ) metric (, ) > q,d q,d {} if overestimates then sim ((), ( )) > sim ((), ( )); metric (, ) metric (, ) sim ((), ( )) < sim ((), ( )). (9) We can now state the following two axioms. The first concerns the case of Fig. 6(a). Axiom 6 (Overestimate ocuments). Let q be a query, an two ocument, a human relevance measurement an a system relevance measurements such that () > ( ), q,d (to be rea as affects metric value more than ). sim (, ) < sim (, ) q,d { } (8) an if unerestimates then 46 Figure 7: Critical situation (6) if unerestimates then (1) Analogously, we will write wmetric(,) an we will also v, an with similar meanings. A similar notation hols for queries. Axiom 8 (System relevance). Let q be a query, an two ocuments, a human relevance measurement an a system relevance measurement such that sim (, ) = sim (, ), () > ( ), an () ( ). Ametric(,). (11)

9 This means that if system relevance measures on two ocuments an are equally correct, an system relevance of is higher than system relevance of, then the effectiveness metric shoul be more affecte by than by (provie that is not less relevant than ). As alreay mentione, it is usually state that early rank positions affect a metric value more than later rank positions. This can be erive as a corollary of the previous axiom (that states a more general principle, inepenent of the scales) simply by taking scale() = Rank. A symmetric axiom can also be state on user relevance measurement: a metric shoul weigh more, an be more affecte, by more relevant ocuments. This is perhaps less intuitive than the previous one, but it oes inee seem natural in this framework. Moreover, it is quite easy for an IRS to evaluate a nonrelevant ocument as non-relevant, since the vast majority of ocuments in the atabase are non-relevant. Thus, an IRS stating that a non-relevant ocument is non-relevant is somehow oing an easy job, an shoul not be reware too much for it. On the other han it shoul be reware when correctly ientifying a relevant ocument. This is generalize an formalize as follows. Axiom 9 (User relevance). Let q be a query, an two ocuments, a human relevance measurement an a system relevance measurement such that: sim (, ) = sim (, ), () > ( ), an () ( ). (12) metric(,). Remark 4. Conitions (11) in Axiom 8 an (12) in Axiom 9 an are neee to rule out the case in which the two axioms woul result inconsistent. Finally, the following axiom eals with the last case. Axiom 1 (Same relevance). Let q be a query, an two ocuments, a human relevance measurement an a system relevance measurement such that sim (, ) = sim (, ). If () = ( ) an () = ( ) then 7.3 Metrics metric(,). We now turn to the last set of axioms, that are specifically about metrics. The following axiom formalizes Swets s properties (see Sect. 2). To simplify its formulation we enote by the theoretically worst performance, i.e., the relevance measure that gives the worst possible performance accoring to a given assessor relevance measure. Axiom 11 (Zero an maximum). An effectiveness metric shoul have a true zero in an a maximum value M. The theoretically worst (best) performances shoul give (M) as the metric value. As a normalization convention let M = 1 such that metric, range(metric) = [, 1], metric (, ) = 1, an metric (, ) =. Axiom 12 (Document monotonicity). Let q be a query, D an D two sets of ocuments such that D D =, a human relevance measurement an Proceeings of the Secon Australasian Web Conference (AWC 214), Aucklan, New Zealan an two system relevance measurements such that: 1 an metric (, ) > metric (, q,d (=) q,d ) (13) (>) metric (, ) > metric (, ). (14) q,d (=) q,d (=) metric (, ) > metric (, q,d D (=) q,d D ). (15) (>) A similar axiom hols for queries, as follows. Axiom 13 (Query monotonicity). Let Q an Q be two query sets such that Q Q =, D a ocument set, a human relevance measurement an an two system relevance measurements such that: an metric (, ) > metric (, Q,D (=) Q,D ) (>) metric (, ) > metric Q,D (=) (=) Q,D (, ). metric (, ) > metric (, Q Q,D (=) Q Q,D ). (>) These two last axioms can also be interprete as constraints on the avgd an avgq functions, respectively. 8 Conclusions an Future Work Builing on measure, measurement, an similarity, we have efine a framework that has been teste by using it to efine some common metrics an to propose some axioms on IR effectiveness metrics. Our contribution is fourfol: (i) the proposal of using measurement to moel in a uniform way both system output an human relevance assessment, an the analysis of the ifferent measurement scales use in IR; (ii) the notions of similarity among ifferent measurement scales an the consequent efinition of metric; (iii) the efinitions of some metrics within the framework; an (iv) the axioms themselves. A future irection concerns the measurement scales: we propose a set of scales specific for IR, an this proposal has been aequate for this paper; however, the literature on measurement scales is quite rich, an the IR case shoul perhaps be linke more carefully with it. An obvious future irection is the efinition of some theorems, that we have omitte for space limitations. Again for space limitations we have omitte axioms on iversity, novelty an session metrics, as well as metrics taking into account the notion of ifficulty / ease of query an ocuments. On these issues, something is hinte in (Busin & Mizzaro 213). Strictness of an effectiveness metric is another interesting property: a metric is strict if having a high value of it implies that also the other metrics will have a high value (i.e., being effective accoring to a 1 In this axiom the equal = an less than < signs have obviously to be paire in the appropriate way, row by row. We use this notation for the sake of brevity an to avoi to state three ifferent an very similar axioms. 47

10 CRPIT Volume The Web 213 strict metric means also being effective accoring to other metrics). The set of axioms coul be more structure: some of them coul be more basic an some others coul be erive from the basic ones. It is also possible that more funamental axioms can be foun, for example on the basic notions of measurement an similarity, an that the axioms state in this paper can inee be theorems erive from those more funamental axioms. On similar issues, although we have state our axioms in a general way, axioms (an theorems) for specific IR tasks an scenarios coul probably be erive from the set of general axioms. Of course, it woul be interesting to analyze other existing metrics, to see if they satisfy the axioms. It is also possible that ifferent sets of axioms can be ientifie. In this paper we have shown a possible set, but the question is left open whether there exist other ifferent sets, an if they are equivalent or perhaps even contraictory. Inee the axioms in (Busin & Mizzaro 213) are completely ifferent from those propose here; we leave as future work a etaile comparison of the two sets, but we remark that the combination of this paper an (Busin & Mizzaro 213) emonstrates that the framework is expressive an allows to formally reason on effectiveness metrics. Besies axioms an theorems, it woul be interesting to think of esierata (esirable properties, supporte by common sense, that coul be useful in some scenarios) an empirical properties (those that emerge from ata, i.e., from actual test collections an system comparisons). Those coul cover aspects like robustness or statistical correlation between effectiveness metrics. We believe that our research has also shown that basic features of metrics might be quite ifferent from those usually iscusse in the classical a-hoc retrieval situation (binary / category relevance an ranking retrieval), where we usually speak of early rank positions, rank swaps, etc. Finally, it woul be interesting to implement a software to analyze metrics by specifying some parameters corresponing to the values of specific components. Acknowlegments We thank Julio Gonzalo an Enrique Amigó for long an interesting iscussions, Evangelos Kanoulas an Enrique Alfonseca for helping to frame the Axiometrics research project, Arjen e Vries for suggesting the name Axiometrics, an organizers of (an participants to) SWIRL 212. This work has been partially supporte by a Google Research Awar. References Amigó, E., Gonzalo, J., Artiles, J. & Verejo, F. (29), A comparison of extrinsic clustering evaluation metrics base on formal constraints, Information Retrieval 12(4), Amigó, E., Gonzalo, J. & Verejo, F. (211), A comparison of evaluation metrics for ocument filtering, in CLEF, Vol of LNCS, Springer, pp Amigó, E., Gonzalo, J. & Verejo, F. (213), A general evaluation measure for ocument organization tasks, in G. J. F. Jones, P. Sherian, D. Kelly, M. e Rijke & T. Sakai, es, SIGIR, ACM, pp Bollmann, P. (1984), Two axioms for evaluation measures in information retrieval, in SIGIR 84, British Computer Society, Swinton, UK, pp Buckley, C. & Voorhees, E. M. (2), Evaluating evaluation measure stability, in SIGIR, ACM, New York, NY, USA, pp Busin, L. & Mizzaro, S. (213), Axiometrics: An Axiomatic Approach to Information Retrieval Effectiveness Metrics, in ICTIR 213 Proceeings of the 4th International Conference on the Theory of Information Retrieval. To appear. Chrisman, N. R. (1998), Rethinking levels of measurement for cartography, Cartography an Geographic Information Science 25(4), Della Mea, V. & Mizzaro, S. (24), Measuring retrieval effectiveness: A new proposal an a first experimental valiation, Journal of the American Society for Information Science an Technology 55(6), Demartini, G., Kazai, G. & Mizzaro, S. (n..), A survey an classification of information retrieval effectiveness metrics. Draft. Demartini, G. & Mizzaro, S. (26), A Classification of IR Effectiveness Metrics, in ECIR 26, Vol of LNCS, pp Eisenberg, M. B. (1988), Measuring relevance jugments, Informat. Process. & Management 24(4), Fang, H., Tao, T. & Zhai, C. (24), A formal stuy of information retrieval heuristics, in SIGIR 4, ACM, New York, NY, USA, pp Fang, H. & Zhai, C. (25), An exploration of axiomatic approaches to information retrieval, in SI- GIR 5, pp Michell, J. (1997), Quantitative science an the efinition of measurement in psychology, British Journal of Psychology 88(3), NTCIR Project (212), jp/ntcir/inex-en.html. [Last visit: August 213]. Robertson, S. (26), On GMAP: an other transformations, in CIKM 6, New York, USA, pp Stevens, S. S. (1946), On the theory of scales of measurement, Science 13 (2684), Swets, J. A. (1963), Information retrieval systems, Science 141, van Rijsbergen, C. J. (1979), Information Retrieval, 2n en, Butterworths. Velleman, P. F. & Wilkinson, L. (1993), Nominal, Orinal, Interval, an Ratio Typologies Are Misleaing, The American Statistician 47(1), Wikipeia (212), Measurement Wikipeia, the free encyclopeia, wiki/measurement. [Last visit: August 213]. Yao, Y. Y. (1995), Measuring retrieval effectiveness base on user preference of ocuments, Journal of the American Society for Information Science 46(2), Zuse, H. (1997), A Framework of Software Measurement, Walter e Gruyter. 48

Axiometrics: Axioms of Information Retrieval Effectiveness Metrics

Axiometrics: Axioms of Information Retrieval Effectiveness Metrics ABSTRACT Ey Maalena Department of Maths Computer Science University of Uine Uine, Italy ey.maalena@uniu.it There are literally ozens most