Lower-Bounding Term Frequency Normalization

Yuanhua Lv, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, ylv@uiuc.edu
ChengXiang Zhai, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, czhai@cs.uiuc.edu

ABSTRACT
In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. Analysis results show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization, which can be shown analytically to fix or alleviate the problem. Our experimental results demonstrate that the proposed method, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence from randomness approach, to significantly improve the average precision, especially for verbose queries.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms, Theory

Keywords
Term frequency, lower bound, formal constraints, data analysis, document length, BM25+, Dir+, PL2+

1. INTRODUCTION
Optimization of retrieval models is a fundamentally important research problem in information retrieval because an improved retrieval model would lead to improved performance for all search engines. Many effective retrieval models have been proposed and tested, such as vector space models [16, 18], classical probabilistic retrieval models [13, 8, 14, 15], language models [12, 20], and the divergence from randomness approach [1]. However, it remains a significant challenge to further improve these state-of-the-art models and design an ultimately optimal retrieval model.
In order to further develop more effective models, it is necessary to understand the deficiencies of the current retrieval models [4]. For example, in [18], it was revealed that the traditional vector space model retrieves documents with probabilities different from their probabilities of relevance, and the analysis led to the pivoted normalization retrieval function, which has been shown to be substantially more effective than the traditional vector space model.
In this work, we reveal a common deficiency of existing retrieval models in optimizing the TF normalization component and propose a general way to address this deficiency that can be applied to multiple state-of-the-art retrieval models to improve their retrieval accuracy. Previous work [4] has shown that all the effective retrieval models tend to rely on a reasonable way to combine multiple retrieval signals, such as term frequency (TF), inverse document frequency (IDF), and document length. A major challenge in developing an effective retrieval model lies in the fact that multiple signals generally interact with each other in a complicated way. For example, document length normalization is meant to regularize the TF heuristic, which, if applied alone, would have a tendency to overly reward long documents due to their high likelihood of matching a query term more times than a short document. On the other hand, document length normalization can also overly penalize long documents [18, 4]. What is the best way of combining multiple signals has been a long-standing open challenge. In particular, a direct application of a sound theoretical framework, such as the language modeling approach to retrieval, does not automatically ensure that we achieve the optimal combination of necessary retrieval heuristics, as shown in [4].
To tackle this challenge, formal constraint analysis was proposed in [4]. The idea is to define a set of formal constraints to capture the desirable properties of a retrieval function related to combining multiple retrieval signals. These constraints can then be used to diagnose the deficiency of an existing model, which in turn provides insight into how to improve an existing model. Such an axiomatic approach has been shown to be useful for motivating and developing more effective retrieval models [6, 7, 2].

In this paper, we follow this axiomatic methodology and reveal a common deficiency of the current retrieval models in their TF normalization component, and we propose a general strategy to fix this deficiency in multiple state-of-the-art retrieval models. Specifically, we show that the normalized TF may approach zero when the document is very long, which often causes a very long document with a non-zero TF (i.e., matching a query term) to receive a score too close to, or even lower than, the score of a short document with a zero TF (i.e., not matching the corresponding query term). As a result, the occurrence of a query term in a very long document would not ensure that this document be ranked above other documents where the query term does not occur, leading to unfair over-penalization of very long documents. The root cause of this deficiency is that the component of TF normalization by document length is not lower-bounded properly, i.e., the score gap between the presence and absence of a query term can be infinitely close to zero or even negative.
In order to diagnose this problem, we first propose two desirable constraints to capture the heuristic of lower-bounding TF in a formal way, so that it is possible to apply them to any retrieval function analytically. We then use constraint analysis to examine several representative retrieval functions and show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. Motivated by this understanding, we propose a general and efficient methodology for introducing a sufficiently large lower bound for TF normalization, which can be applied directly to current retrieval models. Constraint analysis shows analytically that the proposed methodology can successfully fix or alleviate the problem. Our experimental results on multiple standard collections demonstrate that the proposed methodology, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25 [14, 15], language models [12, 20], and the divergence from randomness approach [1], to significantly improve their average precision, especially when queries are verbose. Due to its effectiveness, efficiency, and generality, the proposed methodology can work as a patch to fix or alleviate the problem in current retrieval models, in a plug-and-play way.

2. RELATED WORK
Developing effective retrieval models is a long-standing central challenge in information retrieval. Many different retrieval models have been proposed and tested, such as vector space models [16, 18], classical probabilistic retrieval models [13, 8, 14, 15], language models [12, 20], and the divergence from randomness approach [1]; a few representative retrieval models will be discussed in detail in Section 3.1. In our work, we reveal and address a common bug of these retrieval models (i.e., TF normalization is not lower-bounded properly) and develop a general plug-and-play patch to fix or alleviate this bug.
Term frequency is the earliest and arguably the most important retrieval signal in retrieval models [5, 8, 7, 4]. The use of TF can be dated back to Luhn's pioneering work on automatic indexing [9]. It is widely recognized that linear scaling in term frequency puts too much weight on repeated occurrences of a term.
Thus, TF is often upper-bounded through some sub-linear transformations [5, 8] to prevent the contribution from repeated occurrences from growing too large. In particular, in Okapi BM25 [14, 15], it is easy to show that there is a strict upper bound k_1 + 1 for TF normalization. However, the other interesting direction, lower-bounding TF, has not been well addressed before. Our recent work [11] appears to be the first study that notices the inappropriate lower bound of TF in BM25 through empirical analysis, but it gives no theoretical diagnosis of the problem. Besides, the approach proposed in [11] is not generalizable to lower-bound TF normalization in retrieval models other than BM25. In this paper, we extend [11] to show analytically and empirically that lower-bounding TF is necessary for all representative retrieval models and develop a general approach to effectively lower-bound TF in these retrieval models.
Document length normalization also plays an important role in almost all existing retrieval models to fairly retrieve documents of all lengths [18, 4], since long documents tend to use the same terms repeatedly (higher TF). For example, both Okapi BM25 [14, 15] and the pivoted normalization retrieval model [18] use the pivoted length normalization scheme [18], which uses the average document length as the pivoted length to coordinate the normalization effects for documents longer than this pivoted length and documents shorter than it. The PL2 model, a representative of the divergence from randomness models [1], also uses the average document length to control document length normalization. A common deficiency of all these existing length normalization methods is that they tend to force the normalized TF to approach zero when documents are very long. As a result, a very long document with a non-zero TF could receive a score too close to, or even lower than, the score of a short document with a zero TF, which is clearly unreasonable. Although some existing studies have attempted to use a sub-linear transformation of document length (e.g., the square root of document length [3]) to heuristically replace the original document length in length normalization, they are not guaranteed to solve the problem and often lose to standard document length normalization, such as pivoted length normalization, in terms of retrieval accuracy. Our work aims at addressing this inherent weakness of traditional document length normalization in a more general and effective way.
Constraint analysis has been explored in information retrieval to diagnostically evaluate existing retrieval models [4, 5], introduce novel retrieval signals into existing retrieval models [19], and guide the development of new retrieval models [6, 2]. The constraints in these studies are basic and are designed mostly based on the analysis of some common characteristics of existing retrieval formulas. Although we also use constraint analysis, the proposed constraints are novel and are inspired by our empirical finding of a common deficiency of the existing retrieval models. Moreover, although some existing constraints (e.g., the LNCs and TF-LNC in [4, 5]) are also meant to regularize the interactions between TF and document length, they tend to be loose and cannot capture the heuristic of lower-bounding TF normalization. For example, the modified Okapi BM25 satisfies all the constraints proposed in [4, 5], but it still fails to lower-bound TF normalization properly. In this sense, the proposed two new constraints are complementary to existing constraints [4, 5].

Table 1: Document and query term weighting components of representative retrieval functions.
BM25: G(c(t,Q)) = \frac{(k_3+1)\,c(t,Q)}{k_3+c(t,Q)},  F(c(t,D),|D|,td(t)) = \frac{(k_1+1)\,c(t,D)}{k_1\left((1-b)+b\frac{|D|}{avdl}\right)+c(t,D)} \cdot \log\frac{N+1}{df(t)}
PL2: G(c(t,Q)) = c(t,Q),  F = \frac{1}{tfn_t^D+1}\left(tfn_t^D\log_2(tfn_t^D\,\lambda_t)+\log_2 e\left(\frac{1}{\lambda_t}-tfn_t^D\right)+0.5\log_2(2\pi\,tfn_t^D)\right) if tfn_t^D>0 and \lambda_t>1, and F = 0 otherwise
Dir: G(c(t,Q)) = c(t,Q),  F = \log\frac{\mu}{|D|+\mu}+\log\left(1+\frac{c(t,D)}{\mu\,p(t|C)}\right)
Piv: G(c(t,Q)) = c(t,Q),  F = \frac{1+\log(1+\log c(t,D))}{(1-s)+s\frac{|D|}{avdl}} \cdot \log\frac{N+1}{df(t)} if c(t,D)>0, and F = 0 otherwise

Table 2: Notation.
c(t, D): frequency of term t in document D
c(t, Q): frequency of term t in query Q
N: total number of documents in the collection
df(t): number of documents containing term t
td(t): any measure of the discrimination value of term t
|D|: length of document D
avdl: average document length
c(t, C): frequency of term t in collection C
p(t|C): probability of term t given by the collection language model [20]

3. MOTIVATION OF LOWER-BOUNDING TF NORMALIZATION
In this section, we discuss and analyze a common deficiency (i.e., the lack of an appropriate lower bound for TF normalization) of four state-of-the-art retrieval functions, which respectively represent the classical probabilistic retrieval model (Okapi BM25 [14, 15]), the divergence from randomness approach (PL2 [1]), the language modeling approach (Dirichlet prior smoothing [20]), and the vector space model (pivoted normalization [18, 17]).
An effective retrieval function is generally comprised of two basic separable components: a within-query scoring formula for weighting the occurrences of a term in the query and a within-document scoring formula for weighting the occurrences of this term in a document. We will represent each retrieval function in terms of these two separable components to make it easier for us to focus on studying the document-side weighting:

S(Q, D) = \sum_{t \in Q} G(c(t,Q)) \cdot F(c(t,D), |D|, td(t))    (1)

where S(Q, D) is the total relevance score assigned to document D with respect to query Q, and G and F are the within-query and within-document scoring functions, respectively. In Table 1, we show how this general scheme can be used to represent all four major retrieval models. Other related notations are listed in Table 2. Note that most of the notations were also used in previous work, e.g., [4], and will be adopted throughout our paper.

3.1 Deficiency of Existing Functions
3.1.1 Okapi BM25 (BM25)
The Okapi BM25 method [14, 15] is a representative retrieval function of the classical probabilistic retrieval model. The BM25 retrieval function is summarized in the second row of Table 1. Following previous work [4], we modify the original IDF formula of BM25 to avoid the problem of possibly negative IDF values. The within-document scoring function of BM25 can be re-written as follows:

F_{BM25}(c(t,D), |D|, td(t)) = \frac{(k_1+1)\cdot tfn_t^D}{k_1 + tfn_t^D} \cdot \log\frac{N+1}{df(t)}    (2)

where k_1 is a parameter, and tfn_t^D is the TF normalized by document length using pivoted length normalization [18]:

tfn_t^D = \frac{c(t,D)}{(1-b) + b\,\frac{|D|}{avdl}}    (3)

where b is the slope parameter in pivoted normalization. When a document is very long (i.e., |D| is much larger than avdl), tfn_t^D can be very small and approach 0. Consequently, F_{BM25} will also approach 0, as if t did not occur in D. This can be seen clearly in Figure 1: when |D_2| becomes very large, the score difference between D_2 and D_1 appears to be very small. This by itself would not necessarily be a problem; the problem is that the occurrence of t in a very long document D_2 fails to ensure that D_2 is ranked above other documents where t does not occur.
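To make the behavior just described concrete, here is a minimal Python sketch (our own illustration, not code from the paper) of the two-component scoring scheme of Equation (1) with BM25's G and F from Table 1; the collection statistics (N, df, avdl) and the two example documents are made up for illustration.

```python
import math

def G_bm25(qtf, k3=1000.0):
    """Within-query weight (k3+1)*c(t,Q) / (k3 + c(t,Q))."""
    return (k3 + 1.0) * qtf / (k3 + qtf)

def F_bm25(tf, dl, avdl, df, N, k1=1.2, b=0.75):
    """Within-document weight with pivoted length normalization and modified IDF."""
    if tf == 0:
        return 0.0
    tfn = tf / ((1.0 - b) + b * dl / avdl)                            # Eq. (3)
    return (k1 + 1.0) * tfn / (k1 + tfn) * math.log((N + 1.0) / df)   # Eq. (2)

def score(query_tf, doc_tf, dl, avdl, df, N):
    """Eq. (1): sum over query terms of G * F."""
    return sum(G_bm25(query_tf[t]) * F_bm25(doc_tf.get(t, 0), dl, avdl, df[t], N)
               for t in query_tf)

N, avdl = 1_000_000, 500
df = {"computer": 1_000, "virus": 1_000}
q = {"computer": 1, "virus": 1}
d_short = {"computer": 2}                 # short document, matches one term twice
d_long = {"computer": 1, "virus": 1}      # very long document, matches both terms
print(score(q, d_short, 500, avdl, df, N))      # noticeably higher
print(score(q, d_long, 50_000, avdl, df, N))    # close to zero despite matching both terms
```

Under these (made-up) statistics the short document matching a single query term outscores the very long document matching both terms by more than an order of magnitude, which is exactly the over-penalization pattern discussed above.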
This suggests that the occurrences of a query term in very long documents may not be rewarded properly by BM25, and thus those very long documents can be overly penalized, which, as we will show later, is indeed true.

Figure 1: Comparison of the within-document term scores (i.e., F) of documents D_1 and D_2 with respect to a query term t against different document lengths, for BM25, PL2, Dir, and Piv, where we assume c(t, D_1) = 0 and c(t, D_2) = 1. The x-axis and y-axis stand for the document length and the within-document term score, respectively.

3.1.2 PL2 Method (PL2)
The PL2 method is a representative retrieval function of the divergence from randomness framework [1]. In this paper, we use the modified PL2 formula derived by Fang et al. [5] instead of the original PL2 formula [1]. The only difference between this modified PL2 function and the original PL2 function is that the former essentially ignores non-discriminative query terms. It has been shown that the modified PL2 is more effective and robust than the original PL2 [5]. The modified PL2 (still called PL2 for convenience in the following sections) is presented in the third row of Table 1, where \lambda_t = N / c(t,C) is the term discrimination value, and tfn_t^D is the TF normalized by document length:

tfn_t^D = c(t,D) \cdot \log_2\left(1 + c\,\frac{avdl}{|D|}\right)    (4)

where c > 0 is a retrieval parameter. We can see that, when a document is very long, tfn_t^D can be very small and approach 0, which is very similar to the corresponding component in BM25. What is worse, when tfn_t^D is sufficiently small, the within-document score F_{PL2} will, surprisingly, be a negative number. However, as shown in Table 1, even if the term is missing, i.e., c(t,D) = 0, F_{PL2} still receives a default score of zero. This interesting observation is illustrated in Figure 1. It suggests that a very long document that matches a query term may be penalized even more than another document (of arbitrary length) that does not match the term; consequently, those very long documents tend to be overly penalized by PL2.

3.1.3 Dirichlet Prior Method (Dir)
The Dirichlet prior method is one of the best performing language modeling approaches [20]. It is presented in the fourth row of Table 1, where \mu is the Dirichlet prior. It is observed that the within-document scoring function F_{Dir} is monotonically decreasing in the document length variable, and when a document D_2 is very long, even if it matches a query term, the within-document score of this term can still be arbitrarily small. This score can be smaller than that of any average-length document D_1 that does not match the term, as shown clearly in the third panel of Figure 1. Thus, the Dirichlet prior method can also overly penalize very long documents.

3.1.4 Pivoted Normalization Method (Piv)
The pivoted normalization retrieval function [17, 4] represents one of the best performing vector space models. The detailed formula is shown in the last row of Table 1, where s is the slope parameter. Similarly, the analysis of the pivoted normalization method also shows that it tends to overly penalize very long documents, as shown in the last panel of Figure 1.

3.2 Likelihood of Retrieval/Relevance
Our analysis above has shown that, in principle, all these retrieval functions tend to overly penalize very long documents. Now we turn to seeking empirical evidence to see if this common deficiency hurts document retrieval in practice. Inspired by Singhal et al.'s finding that a good retrieval function should retrieve documents of all lengths with chances similar to their likelihood of relevance [18], we compare the retrieval pattern of different retrieval functions with the relevance pattern. We follow the binning analysis strategy proposed in [18] and plot the two patterns against all document lengths on WT10G in Figure 2. Due to space reasons, we only plot BM25 as an example, but we observed that the other retrieval functions have similar trends to BM25. The plot shows clearly that BM25 retrieves very long documents with chances much lower than their likelihood of relevance. This empirically confirms our previous analysis that very long documents tend to be overly penalized.

Figure 2: Comparison of retrieval and relevance probabilities against all document lengths when using BM25 on short (left) and verbose (right) queries.

4. FORMAL CONSTRAINTS ON LOWER-BOUNDING TF NORMALIZATION
A critical question is thus: how can we regulate the interactions between term frequency and document length when a document is very long, so that we can fix this common deficiency of current retrieval models?
To answer this question, we first propose two desirable heuristics that any reasonable retrieval function should implement to properly lower-bound TF normalization when documents are very long: (1) there should be a sufficiently large gap between the presence and absence of a query term, i.e., the effect of document length normalization should not cause a very long document with a non-zero TF to receive a score too close to, or even lower than, that of a short document with a zero TF; and (2) a short document that only covers a very small subset of the query terms should not easily dominate over a very long document that contains many distinct query terms.
Next, in order to analytically diagnose the problem of over-penalizing very long documents, we propose two formal constraints to capture the above two heuristics of lower-bounding TF normalization, so that it is possible to apply them to any retrieval function analytically. The two constraints are defined as follows:

LB1: Let Q be a query. Assume D_1 and D_2 are two documents such that S(Q, D_1) = S(Q, D_2). If we reformulate the query by adding another term q \notin Q into Q, where c(q, D_1) = 0 and c(q, D_2) > 0, then S(Q \cup \{q\}, D_1) < S(Q \cup \{q\}, D_2).

LB2: Let Q = \{q_1, q_2\} be a query with two terms q_1 and q_2. Assume td(q_1) = td(q_2), where td(t) can be any reasonable measure of term discrimination value. If D_1 and D_2 are two documents such that c(q_2, D_1) = c(q_2, D_2) = 0, c(q_1, D_1) > 0, c(q_1, D_2) > 0, and S(Q, D_1) = S(Q, D_2), then S(Q, D_2 \cup \{q_1\} - \{t_2\}) < S(Q, D_1 \cup \{q_2\} - \{t_1\}), for all t_1 and t_2 such that t_1 \in D_1, t_2 \in D_2, t_1 \notin Q and t_2 \notin Q.
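The two constraints can also be checked numerically for any concrete scoring function. Below is a minimal sketch (our own illustration, with BM25 plugged in and made-up collection statistics); it exploits the fact that, when the two documents start with equal scores and td(q_1) = td(q_2), only the per-term score gains matter.

```python
import math

def F_bm25(tf, dl, avdl, k1=1.2, b=0.75, N=1_000_000, df=1_000):
    """Within-document BM25 weight; returns 0 for a missing term."""
    if tf == 0:
        return 0.0
    tfn = tf / ((1.0 - b) + b * dl / avdl)
    return (k1 + 1.0) * tfn / (k1 + tfn) * math.log((N + 1.0) / df)

def satisfies_LB1(F, dl1, dl2, avdl):
    # Adding a query term q with c(q,D1)=0 and c(q,D2)=1 must strictly favor D2.
    return F(0, dl1, avdl) < F(1, dl2, avdl)

def satisfies_LB2(F, dl1, dl2, avdl):
    # The first occurrence of a new term in D1 must beat a repeated occurrence
    # of an already-matched term (with the same discrimination value) in D2.
    first_occurrence_gain = F(1, dl1, avdl) - F(0, dl1, avdl)
    repeat_gain = F(2, dl2, avdl) - F(1, dl2, avdl)
    return first_occurrence_gain > repeat_gain

avdl = 500
print(satisfies_LB1(F_bm25, dl1=500_000, dl2=avdl, avdl=avdl))  # True: BM25 never violates LB1
print(satisfies_LB2(F_bm25, dl1=2_000, dl2=avdl, avdl=avdl))    # True
print(satisfies_LB2(F_bm25, dl1=500_000, dl2=avdl, avdl=avdl))  # False: a very long D1 violates LB2
```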

The first constraint, LB1, captures the basic heuristic of a 0-1 gap in TF normalization, i.e., the gap between the presence and absence of a term should not be closed by document length normalization. Specifically, if a query term does not occur in document D_1 but occurs in document D_2, and both documents receive the same relevance score from matching other query terms, then D_1 should be scored lower than D_2, no matter what the lengths of D_1 and D_2 are. In other words, the occurrence of a query term in a very long document should still be able to differentiate this document from other documents where the query term does not occur. In fact, when F(0, |D|, td(t)) is a document-independent constant, LB1 can be derived from a basic TF constraint, TFC1 [4]. Here, F(0, |D|, td(t)) is the document weight for a query term t not present in document D, i.e., t \in Q but t \notin D. This property is presented below in Theorem 1.

Theorem 1. LB1 is implied by the TFC1 constraint if the within-document weight for any missing term is a document-independent constant.

Proof: Let Q be a query. Assume D_1 and D_2 are two documents such that S(Q, D_1) = S(Q, D_2). We reformulate query Q by adding another term q \notin Q into the query, where c(q, D_1) = 0 and c(q, D_2) > 0. Let D_3 be another document, generated by replacing all the occurrences of q in D_2 with a non-query term t \notin Q \cup \{q\}; then c(q, D_3) = 0 and S(Q, D_3) = S(Q, D_2) = S(Q, D_1). Due to the assumption that the document weight for the missing term q is a document-independent constant, it follows that S(Q \cup \{q\}, D_1) = S(Q \cup \{q\}, D_3). Finally, since |D_3| = |D_2| and c(q, D_3) = 0 < c(q, D_2), according to TFC1, we get S(Q \cup \{q\}, D_1) = S(Q \cup \{q\}, D_3) < S(Q \cup \{q\}, D_2).

However, when the document weights for missing terms are document-dependent, LB1 is not redundant, in the sense that it cannot be derived from other constraints such as the proposed LB2 and the seven constraints proposed in [4]. For example, the Dirichlet prior retrieval function, as shown in Table 1, has a document-dependent weighting function for a missing term, namely \log\frac{\mu}{|D|+\mu}. As will be shown later, the Dirichlet prior method violates LB1, although it satisfies LB2 and most of the constraints proposed in [4].
The second constraint, LB2, states that if two terms have the same discrimination value, a repeated occurrence of one term is not as important as the first occurrence of the other. LB2 essentially captures the intuition that covering more distinct query terms should be rewarded sufficiently, even if the document is very long. For example, given a query Q = {computer, virus}, if two documents D_1 and D_2 with identical relevance scores with respect to Q both match "computer" but neither matches "virus", then if we add an occurrence of "virus" to D_1 to generate D_3 and add an occurrence of "computer" to D_2 to generate D_4, we should ensure that D_3 has a higher score than D_4. This intuitively makes sense because D_3 is more likely to be related to "computer virus", while D_4 may be just about other aspects of "computer".
LB1 and LB2 are two necessary constraints to ensure that very long documents are not overly penalized. When either is violated, the retrieval function would likely not perform well for very long documents, and there should be room to improve the retrieval function by improving its ability to satisfy the corresponding constraint.

5. CONSTRAINT ANALYSIS ON CURRENT RETRIEVAL MODELS
5.1 Okapi BM25 (BM25)
BM25 satisfies TFC1 [4], and the within-document weight for any missing term is always 0. Therefore, BM25 satisfies LB1 unconditionally according to Theorem 1. We now examine LB2.
Due to the sub-linear property of TF normalization, we only need to check LB2 in the case when c(q_1, D_2) = 1, since when c(q_1, D_2) > 1 it is even harder to violate the constraint. Consider a common case when |D_2| = avdl. It can be shown that the LB2 constraint is equivalent to the following constraint on |D_1|:

\frac{|D_1|}{avdl} < \frac{2(k_1+1)}{k_1^2 \cdot b} + 1    (5)

This means that LB2 is satisfied only if |D_1| is smaller than a certain upper bound. Thus, a sufficiently long document would violate LB2. Note that this upper bound on |D_1| is a monotonically decreasing function of both b and k_1. This suggests that a larger b or k_1 would lead BM25 to violate LB2 more easily, which is confirmed by our experiments.

5.2 PL2 Method (PL2)
In Fang et al.'s work [5], the TFC1 constraint is regarded as equivalent to requiring that the first partial derivative of the formula with respect to the TF variable be positive, which has been shown to hold for the modified PL2 [5]. However, the PL2 function is not continuous when the TF variable is zero, and what is worse,

\lim_{c(t,D) \to 0^+} F_{PL2}(c(t,D), |D|, td(t)) < F_{PL2}(0, |D|, td(t)) = 0    (6)

which shows that even the modified PL2 still fails to satisfy TFC1. So we cannot use Theorem 1 for PL2; we thus check both LB1 and LB2 directly.
Since the optimal setting of parameter c is usually larger than 1 [4], we consider a common case when |D_2| = c \cdot avdl. Similar to the analysis on BM25, we only need to examine LB2 for c(q_1, D_2) = 1. Both LB1 and LB2 turn out to be approximately equivalent to upper bounds on |D_1| that are proportional to c \cdot avdl and that shrink as \lambda_t grows (Equations 7 and 8); due to the space limit, we cannot show all the derivation details. We can see that both LB1 and LB2 set an upper bound on document length, suggesting that a very long document would violate both LB1 and LB2. However, the upper bound introduced by LB1 is always larger than that introduced by LB2, so we focus on LB2 in the following sections.
The upper bound on document length in LB2 is monotonically increasing with c and decreasing with \lambda_t. This suggests that, when c is very small, there is a serious concern that long documents would be overly penalized; on the other hand, a more discriminative term also violates the constraint more easily. These analyses are confirmed by our experiments.
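As a quick sanity check of Equation (5), the following few lines (ours, not from the paper) compute the longest document length, relative to avdl, for which BM25 can still satisfy LB2 under a few typical (b, k1) settings.

```python
def lb2_length_bound(k1: float, b: float) -> float:
    """Upper bound on |D1| / avdl from Equation (5)."""
    return 2.0 * (k1 + 1.0) / (k1 ** 2 * b) + 1.0

for k1, b in [(1.2, 0.3), (1.2, 0.75), (2.0, 0.75)]:
    print(f"k1={k1}, b={b}: LB2 holds only if |D1| < {lb2_length_bound(k1, b):.1f} * avdl")
# Output: roughly 11.2, 5.1, and 3.0 times avdl -- a larger b or k1 shrinks
# the bound, i.e., makes LB2 easier to violate, matching the analysis above.
```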

5.3 Dirichlet Prior Method (Dir)
With Dir, the within-document weight for a missing term is \log\frac{\mu}{|D|+\mu}, which is document-dependent. So Theorem 1 is not applicable to the Dirichlet method, and we need to examine LB1 and LB2 directly.
First, we only check LB1 at the point c(q, D_2) = 1, which is the easiest case for LB1 to be violated. Considering the common case that |D_1| = avdl, the LB1 constraint is equivalent to the following constraint on |D_2|:

|D_2| < avdl + \frac{avdl + \mu}{p(q|C)\cdot\mu}    (9)

This shows that the Dirichlet method can only satisfy LB1 if |D_2| is smaller than a certain upper bound, suggesting again that a very long document would violate LB1. This upper bound is monotonically decreasing with both p(q|C) and \mu. On the one hand, a non-discriminative (i.e., large p(q|C)) term q violates LB1 easily; for example, if \mu\,p(q|C) = 1, the upper bound is as low as 2\,avdl + \mu. Thus, the Dirichlet method would overly penalize very long documents more for verbose queries. On the other hand, a large \mu would also worsen the situation according to Formula 9. These observations are all confirmed by our experimental results.
Next, we turn to LB2, which is equivalent to

\frac{n + 1 + \mu\,p(q|C)}{n + \mu\,p(q|C)} < \frac{1 + \mu\,p(q|C)}{\mu\,p(q|C)}    (10)

where n \in \{1, 2, \ldots\}. Interestingly, this inequality is always satisfied, suggesting that the Dirichlet method satisfies LB2 unconditionally. We thus expect that the Dirichlet method would have some advantages in cases where other retrieval functions tend to violate LB2.

5.4 Pivoted Normalization Method (Piv)
It is easy to show that the pivoted normalization method also satisfies LB1 unconditionally. We now examine LB2. Similar to the analysis on BM25, we only need to check LB2 in the case c(q_1, D_2) = 1. Considering a common case when |D_2| = avdl, we see that LB2 is equivalent to the following constraint on |D_1|:

\frac{|D_1|}{avdl} < \frac{0.899}{s} + 1    (11)

This means that LB2 is satisfied only if |D_1| is smaller than a certain upper bound, and this upper bound is a monotonically decreasing function of s. So, in principle, a larger s would lead the pivoted normalization method to violate LB2 more easily, which can also explain why the optimal setting of s tends to be small [4]. Of course, if s were set to zero, LB2 would be satisfied, but that would turn off document length normalization completely, which would clearly lead to non-optimal retrieval performance.

6. A GENERAL APPROACH FOR LOWER-BOUNDING TF NORMALIZATION
The analysis above shows analytically that all these state-of-the-art retrieval models tend to overly penalize very long documents. In order to avoid overly penalizing very long documents, we need to lower-bound TF normalization to make sure that the gap of the within-document scores F(c(t,D), |D|, td(t)) between c(t,D) = 0 and c(t,D) > 0 is sufficiently large. However, we do not want the addition of this new constraint to change the implementations of other retrieval heuristics in these state-of-the-art retrieval functions, because the implementations of existing retrieval heuristics in these retrieval functions have been shown to work quite well [4].
We propose a general heuristic approach to achieve this goal by defining an improved within-document scoring formula F' as shown in Equation 12, where \delta is a pseudo TF value to control the scale of the TF lower bound, and l is a pseudo document length which is document-independent.
In this new formula, a retrieval-model-specific but document-independent value, F(\delta, l, td(t)) - F(0, l, td(t)), serves as an ensured gap between matching and missing a term: if c(t,D) > 0, the component of TF normalization by document length is lower-bounded by this document-independent value, no matter how large |D| is.

F'(c(t,D), |D|, td(t)) = \begin{cases} F(c(t,D), |D|, td(t)) + F(0, l, td(t)) & \text{if } c(t,D) = 0 \\ F(c(t,D), |D|, td(t)) + F(\delta, l, td(t)) & \text{otherwise} \end{cases}    (12)

It is easy to verify that F'(c(t,D), |D|, td(t)) satisfies all the basic retrieval heuristics [4] that are satisfied by F(c(t,D), |D|, td(t)): first, it is trivial to show that if F satisfies the TFCs, F' will also satisfy them; second, F(\delta, l, td(t)) and F(0, l, td(t)), as two special points of F(c(t,D), |D|, td(t)), satisfy the TDC constraint in exactly the same way as F, and so does F'; finally, since the newly introduced components are document-independent, they raise no problem for the LNCs and TF-LNC.
The proposed methodology is very efficient, as it only adds a retrieval-model-specific but document-independent value to those standard retrieval functions. For a query Q, we only need to calculate |Q| such values, which can even be done offline. Therefore, our method incurs almost no additional computational cost.
Finally, we can obtain the corresponding lower-bounded retrieval function by substituting F'(c(t,D), |D|, td(t)) for F(c(t,D), |D|, td(t)) in each retrieval function. Take BM25 as an example. Obviously F(0, |D|, td(t)) = 0. In F(\delta, l, td(t)), since l is a constant document length variable used for document length normalization, its influence can be absorbed into the TF variable \delta; we thus simply set l = avdl. Then we obtain

F(\delta, avdl, td(t)) = \frac{(k_1+1)\,\delta}{k_1 + \delta} \cdot \log\frac{N+1}{df(t)}

Clearly, parameter k_1 can also be absorbed into \delta, and the above formula simplifies further to \delta \cdot \log\frac{N+1}{df(t)}. Finally, we derive a lower-bounded BM25 function, namely BM25+, as shown in Formula 13; a sketch of how this lower bound plugs into BM25 in practice is given after the formulas. Similarly, we can derive a lower-bounded Dirichlet prior method (Dir+), a lower-bounded pivoted normalization method (Piv+), and a lower-bounded PL2 (PL2+), which are presented in Formulas 14, 15, and 16, respectively.

S_{BM25+}(Q,D) = \sum_{t \in Q \cap D} \frac{(k_3+1)\,c(t,Q)}{k_3 + c(t,Q)} \cdot \left[\frac{(k_1+1)\,c(t,D)}{k_1\left((1-b)+b\frac{|D|}{avdl}\right)+c(t,D)} + \delta\right] \cdot \log\frac{N+1}{df(t)}    (13)

S_{Dir+}(Q,D) = \sum_{t \in Q \cap D} c(t,Q) \cdot \left[\log\left(1+\frac{c(t,D)}{\mu\,p(t|C)}\right) + \log\left(1+\frac{\delta}{\mu\,p(t|C)}\right)\right] + |Q| \cdot \log\frac{\mu}{|D|+\mu}    (14)

S_{Piv+}(Q,D) = \sum_{t \in Q \cap D} c(t,Q) \cdot \left[\frac{1+\log(1+\log c(t,D))}{(1-s)+s\frac{|D|}{avdl}} + \delta\right] \cdot \log\frac{N+1}{df(t)}    (15)

S_{PL2+}(Q,D) = \sum_{t \in Q \cap D,\ \lambda_t > 1} c(t,Q) \cdot \left[\frac{tfn_t^D\log_2(tfn_t^D\,\lambda_t)+\log_2 e\left(\frac{1}{\lambda_t}-tfn_t^D\right)+0.5\log_2(2\pi\,tfn_t^D)}{tfn_t^D+1} + \frac{\delta\log_2(\delta\,\lambda_t)+\log_2 e\left(\frac{1}{\lambda_t}-\delta\right)+0.5\log_2(2\pi\,\delta)}{\delta+1}\right]    (16)
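The following minimal Python sketch (our own illustration, not the authors' implementation) shows the plug-and-play nature of the patch for BM25: the only change relative to the standard function is the document-independent gap \delta \cdot \log\frac{N+1}{df(t)} added whenever the term is present, as in Formula 13.

```python
import math

def bm25_F(tf, dl, avdl, df, N, k1=1.2, b=0.75):
    """Standard within-document BM25 weight (Table 1); 0 for a missing term."""
    if tf == 0:
        return 0.0
    tfn = tf / ((1.0 - b) + b * dl / avdl)
    return (k1 + 1.0) * tfn / (k1 + tfn) * math.log((N + 1.0) / df)

def bm25_plus_F(tf, dl, avdl, df, N, delta=1.0, k1=1.2, b=0.75):
    """Lower-bounded within-document weight used by BM25+ (Formula 13)."""
    if tf == 0:
        return 0.0
    return bm25_F(tf, dl, avdl, df, N, k1, b) + delta * math.log((N + 1.0) / df)

# A very long matching document now keeps a guaranteed gap over any
# non-matching document, no matter how long it grows:
N, avdl, df = 1_000_000, 500, 1_000
print(bm25_F(1, 500_000, avdl, df, N))        # ~0.02: almost indistinguishable from a non-match
print(bm25_plus_F(1, 500_000, avdl, df, N))   # >= delta * log((N+1)/df), roughly 6.9 here
```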

Next, we check LB1 and LB2 on these four improved retrieval functions.

6.1 Lower-Bounded BM25 (BM25+)
It is trivial to verify that BM25+ still satisfies LB1 unconditionally. To examine LB2, we apply an analysis method consistent with our analysis of BM25 in Section 5.1. The LB2 constraint on BM25+ is equivalent to

\frac{k_1}{k_1+2} < \frac{k_1+1}{k_1\left((1-b)+b\frac{|D_1|}{avdl}\right)+1} + \delta    (17)

which can be shown to be satisfied unconditionally if

\delta \geq \frac{k_1}{k_1+2}    (18)

Clearly, if we set \delta to a sufficiently large value, BM25+ is able to satisfy LB2 unconditionally, which is also confirmed by our experiments: BM25+ works very well when we set \delta = 1.

6.2 Lower-Bounded PL2 (PL2+)
We only need to check LB2 on PL2+, since it is easier to violate than LB1. With a similar analysis strategy as used for analyzing PL2, the LB2 constraint on PL2+ is again approximately equivalent to an upper bound on |D_1| proportional to c \cdot avdl, but now also depending on \delta (Formula 19); due to the space limit, we cannot show all the derivation details in this section. We can see that, given a \delta, the right side of Formula 19 (i.e., the upper bound on |D_1|) is no longer monotonically decreasing with \lambda_t, in contrast to PL2: it is minimized at a finite value of \lambda_t determined by \delta. This interesting difference is shown clearly in Figure 3. Thus, if we set \delta to an appropriate value that makes the minimum upper bound still large enough (e.g., larger than the length of the longest document), PL2+ will not violate LB2.

Figure 3: Comparison of the upper bounds of document length for PL2 and PL2+ (with several values of \delta) to satisfy LB2, as a function of \lambda_t.

6.3 Lower-Bounded Dirichlet Method (Dir+)
It is easy to show that Dir+ also satisfies LB2 unconditionally. We analyze Dir+ in the same way as we analyzed Dir, and obtain the following equivalent constraint for LB1:

|D_2| < avdl + (avdl + \mu)\left(\frac{1+\delta}{\mu\,p(q|C)} + \frac{\delta}{(\mu\,p(q|C))^2}\right)    (20)

We can see that, although Dir+ does not guarantee that LB1 is always satisfied, it indeed enlarges the upper bound of document length as compared to Dir in Section 5.3, and thus makes the constraint harder to violate. Generally, if we set \delta to a sufficiently large value, the chance that very long documents are overly penalized is reduced.

6.4 Lower-Bounded Pivoted Method (Piv+)
It is easy to verify that Piv+ also satisfies LB1. Regarding LB2, similar to our analysis on Piv, the LB2 constraint on Piv+ is equivalent to

\log(1+\log 2) < \frac{1}{(1-s)+s\frac{|D_1|}{avdl}} + \delta    (21)

which is always satisfied if

\delta \geq \log(1+\log 2) \approx 0.53    (22)

This shows that Piv+ is able to satisfy LB2 unconditionally with a sufficiently large \delta.

Table 3: Document set characteristics: TREC query topics, number of queries with relevance judgments, average length of short and verbose queries, total number of relevance judgments, number of documents, average document length, and standard deviation of document length, for the WT2G, WT10G, Terabyte, and Robust04 collections.

7. EXPERIMENTS
7.1 Testing Collections and Evaluation
We use four TREC collections: WT2G, WT10G, Terabyte, and Robust04, which represent different sizes and genres of text collections. WT2G, WT10G, and Terabyte are small, medium, and large Web collections, respectively. Robust04 is a representative news dataset. We test two types of queries, short queries and verbose queries, which are taken from the title and the description fields of the TREC topics, respectively.
We use the Lemur toolkit and the Indri search engine (http://www.lemurproject.org/) to carry out our experiments. For all the datasets, the preprocessing of documents and queries is minimal, involving only Porter's stemming. An overview of the involved query topics, the average length of short/verbose queries, the total number of relevance judgments, the total number of documents, the average document length, and the standard deviation of document length in each collection is shown in Table 3.
We employ a two-fold cross-validation for parameter tuning, where the query topics are split into even- and odd-numbered topics as the two folds. The top-ranked 1000 documents of each run are compared in terms of their mean average precision (MAP), which also serves as the objective function for parameter training. In addition, the precision at the top 10 documents (P@10) is also considered. Our goal is to see whether the proposed general heuristic works well for improving each of the four retrieval functions.
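As a rough illustration of this tuning protocol (not the actual Lemur/Indri tooling; run_retrieval and mean_average_precision below are hypothetical helpers), the two-fold even/odd split and grid search could look like this:

```python
from itertools import product

def cross_validate(topics, param_grid, run_retrieval, mean_average_precision):
    """Two-fold cross-validation over even- vs. odd-numbered topics."""
    even = [t for t in topics if t % 2 == 0]
    odd = [t for t in topics if t % 2 == 1]
    test_maps = []
    for train, test in [(even, odd), (odd, even)]:
        # Pick the parameter setting that maximizes MAP on the training fold...
        best = max(
            (dict(zip(param_grid, vals)) for vals in product(*param_grid.values())),
            key=lambda params: mean_average_precision(run_retrieval(train, **params)),
        )
        # ...and report MAP of that setting on the held-out fold.
        test_maps.append(mean_average_precision(run_retrieval(test, **best)))
    return sum(test_maps) / len(test_maps)

# Illustrative BM25+ grid (the exact ranges used in the paper's runs may differ):
bm25_plus_grid = {
    "b": [round(0.1 * i, 1) for i in range(1, 10)],      # 0.1 .. 0.9
    "k1": [round(0.2 * i, 1) for i in range(1, 21)],     # 0.2 .. 4.0
    "delta": [round(0.1 * i, 1) for i in range(0, 16)],  # 0.0 .. 1.5
}
```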

Table 4: Comparison of BM25 and BM25+ (MAP and P@10) using cross validation on WT2G, WT10G, Terabyte, and Robust04, for short and verbose queries. Superscripts 1/2/3/4 indicate that the corresponding MAP improvement is significant at increasingly stringent levels (0.05 and below) using the Wilcoxon test.

7.2 BM25+ vs. BM25
In both BM25+ and BM25, we train b and k_1 using cross validation, where b is tuned from 0.1 to 0.9 in increments of 0.1 and k_1 is tuned up to 4.0 over a similar grid. In addition, in BM25+, parameter \delta is trained using cross validation over values up to 1.5 in increments of 0.1, but we also create a special run in which \delta is fixed to 1 empirically (labeled as BM25+ (\delta = 1)).
The comparison results of BM25+ and BM25 are presented in Table 4. The results demonstrate that BM25+ outperforms BM25 consistently in terms of MAP and also achieves P@10 scores better than or comparable to BM25. The MAP improvements of BM25+ over BM25 are much larger on the Web collections than on the news collection; in particular, the MAP improvements on all Web collections are statistically significant. This is likely because there are generally more very long documents in Web data, where the problem of BM25, i.e., overly penalizing very long documents, would presumably be more severe. For example, Table 3 shows that the standard deviation of document length is indeed larger on the three Web collections than on Robust04. Another interesting observation is that BM25+, even with a fixed \delta = 1, still works effectively and stably across collections and outperforms BM25 significantly. This empirically confirms the constraint analysis result in Section 6.1 that, when \delta \geq k_1/(k_1+2), BM25+ satisfies LB2 unconditionally. It thus suggests that the proposed constraints can even be used to guide parameter tuning. We further plot the curves of MAP improvements of BM25+ over BM25 against different \delta values in Figure 4, which demonstrates that, when \delta is set to a value around 1, BM25+ works very well across all collections. Therefore, \delta can be safely eliminated from BM25+ by setting it to a default value of 1.
Regarding different query types, we observe that BM25+ improves more on verbose queries than on short queries; the MAP improvements on the Web collections are substantially larger for verbose queries than for short queries. We hypothesize that BM25 may overly penalize very long documents more seriously when queries are verbose, and thus there is more room for BM25+ to boost the performance. To verify this hypothesis, we collect the optimal settings of b and k_1 for BM25 in Table 5, which shows that the optimal settings of b and k_1 are clearly larger for verbose queries than for short queries. Recall that our constraint analysis in Section 5.1 has shown that the likelihood of BM25 violating LB2 is monotonically increasing with the parameters b and k_1. We can now conclude that BM25 indeed tends to overly penalize very long documents more when queries are more verbose.

Table 5: Optimal settings of b and k_1 in BM25 on each collection, for short and verbose queries.

So far we have shown that BM25+ is more effective than BM25, but is it really because BM25+ has alleviated the problem of overly penalizing very long documents? To answer this question, we plot the retrieval pattern of BM25+ against the relevance pattern in the same way as in Section 3.2. The pattern comparison is presented in Figure 5. We can see that the retrieval pattern of BM25+ is more similar to the relevance pattern, especially for the retrieval of very long documents. This suggests that BM25+ indeed retrieves very long documents more fairly.

Figure 5: Comparison of retrieval and relevance probabilities against all document lengths when using BM25 (left) and BM25+ (right) for retrieval. It shows that BM25+ alleviates the problem of BM25 of overly penalizing very long documents.

Figure 4: Performance sensitivity of BM25+ to \delta, where the y-axis shows the relative MAP improvement of BM25+ over BM25 on each collection for short (left) and verbose (right) queries, and the suffix -even/-odd indicates that only even/odd-numbered query topics are used.

Table 6: Comparison of PL2 and PL2+ (MAP) using cross validation on WT2G, WT10G, and Robust04, for short and verbose queries. Superscripts 1/2/3/4 indicate that the corresponding MAP improvement is significant at increasingly stringent levels (0.05 and below) using the Wilcoxon test.

Table 7: Optimal settings of c in PL2 on each collection, for short and verbose queries.

7.3 PL2+ vs. PL2
In both PL2+ and PL2, we train parameter c using cross validation, where c is tuned over 57 values starting from 0.5. In addition, in PL2+, parameter \delta is also trained using cross validation over values up to 1.5 in increments of 0.1. We also create a special run of PL2+ in which \delta is fixed to 0.8 empirically, without training.
The comparison results of PL2+ and PL2 are presented in Table 6. The results show that PL2+ outperforms PL2 consistently, and even if we fix \delta = 0.8, PL2+ still achieves stable improvements over PL2. Specifically, PL2+ improves significantly over PL2 on verbose queries, yet it only improves slightly on short queries; PL2+ also appears to be less sensitive to the genre of the collections, since it improves significantly over PL2 on the news data (verbose queries) as well. We hypothesize that PL2 may overly penalize very long documents seriously on verbose queries but works well on short queries, and thus there is more room for PL2+ to improve the performance on verbose queries than on short queries. To verify this, we collect the optimal settings of c in PL2 and show them in Table 7. We can see that the optimal settings of c are huge for short queries as compared to those for verbose queries, presenting an obvious contrast. As a result, recalling the upper bound of document length in Formula 8, verbose queries would be more likely to violate LB2 even if a document is not very long (e.g., a news article), while short queries would have only a very small chance of violating LB2 even if a document is very long. Again, we can see that our constraint analysis is consistent with the empirical results.

Table 8: Comparison of Dir and Dir+ (MAP) using cross validation on WT2G, WT10G, and Robust04, for short and verbose queries. Superscripts 1/2/3/4 indicate that the corresponding MAP improvement is significant at increasingly stringent levels (0.05 and below) using the Wilcoxon test.

7.4 Dir+ vs. Dir
In both Dir+ and Dir, we train parameter \mu using cross validation over a wide range of values. In addition, in Dir+, parameter \delta is also trained; similarly, we also create a special run in which \delta is fixed to 0.05 empirically, without training.
The comparison of Dir+ and Dir is presented in Table 8. Overall, we observe that Dir+ improves over Dir consistently and significantly across different collections, and even if we fix \delta = 0.05 without training, Dir+ still outperforms Dir significantly in most cases. Note that, similar to BM25+ and PL2+, Dir+ works more effectively on verbose queries, which is consistent with our constraint analysis that Dir is more likely to overly penalize very long documents when a query contains more non-discriminative terms. In addition, we further compared Dir+ and Dir thoroughly by varying \mu; Dir+ is consistently better than Dir no matter how we change the \mu value. Moreover, comparing Table 8 with Tables 4 and 6, we can see that Dir works clearly better on verbose queries than BM25 and PL2. One possible explanation is that Dir satisfies LB2 unconditionally, while BM25 and PL2 do not.
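For completeness, here is a minimal sketch (ours, with made-up \mu and p(t|C) values) of the Dir and Dir+ within-document term weights of Table 1 and Formula 14, showing both the over-penalization of Dir and the document-independent gap \log(1 + \delta/(\mu\,p(t|C))) that Dir+ adds for matched terms.

```python
import math

def dir_F(tf, dl, mu, p_coll):
    """Dirichlet prior weight: log(mu/(|D|+mu)) + log(1 + c(t,D)/(mu*p(t|C)))."""
    return math.log(mu / (dl + mu)) + math.log(1.0 + tf / (mu * p_coll))

def dir_plus_F(tf, dl, mu, p_coll, delta=0.05):
    """Dir+ (Formula 14): matched terms get an extra log(1 + delta/(mu*p(t|C)))."""
    gap = math.log(1.0 + delta / (mu * p_coll)) if tf > 0 else 0.0
    return dir_F(tf, dl, mu, p_coll) + gap

mu, p = 2000.0, 1e-5
short_miss = dir_F(0, 1_000, mu, p)      # short document, term absent
long_match = dir_F(1, 200_000, mu, p)    # very long document, term present
print(long_match < short_miss)           # True: the matching long document scores below the short non-match
print(dir_plus_F(1, 200_000, mu, p) - dir_plus_F(0, 200_000, mu, p))  # ensured, length-independent gap
```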

7.5 Piv+ vs. Piv
In both Piv+ and Piv, we train s using cross validation, where s is tuned over a fine grid of values up to 0.5. In addition, in Piv+, parameter \delta is also trained.
The comparison results of Piv+ and Piv are presented in Table 9. Unfortunately, Piv+ does not improve over Piv significantly in most of the cases, which, however, is also as we expected: although there is an upper bound of document length for Piv to satisfy LB2, as shown in Formula 11, this upper bound is often very large because the optimal setting of parameter s is often very small, as presented in Table 10. Nevertheless, Piv+ works much better than Piv when s is large, as observed in our experiments.

Table 9: Comparison of Piv and Piv+ (MAP) using cross validation on WT2G, WT10G, and Robust04, for short and verbose queries. A superscript indicates that the corresponding MAP improvement is significant at the 0.05 level using the Wilcoxon test.

Table 10: Optimal settings of s in Piv on each collection, for short and verbose queries.

7.6 Summary
Our experiments demonstrate empirically that the proposed general methodology can be applied to state-of-the-art retrieval functions to successfully fix or alleviate their problem of overly penalizing very long documents. We have derived three effective retrieval functions: BM25+ (Formula 13), PL2+ (Formula 16), and Dir+ (Formula 14). All of them are as efficient as, but more effective than, their corresponding standard retrieval functions, i.e., BM25, PL2, and Dir, respectively. There is an extra parameter \delta in the derived formulas, but we can set it to default values (i.e., \delta = 1 for BM25+, \delta = 0.8 for PL2+, and \delta = 0.05 for Dir+) that perform quite well. The proposed retrieval functions can potentially replace their corresponding standard retrieval functions in all retrieval applications.

8. CONCLUSIONS
In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. We find that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization, which can be shown analytically to fix or alleviate the problem. Our experimental results on standard collections demonstrate that the proposed methodology, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25 [14, 15], language models [20], and the divergence from randomness approach [1], to significantly improve the average precision, especially for verbose queries. Our work has also helped reveal interesting differences in the behavior of these state-of-the-art retrieval models. Due to its effectiveness, efficiency, and generality, the proposed methodology can work as a patch to fix or alleviate the problem in current retrieval models, in a plug-and-play way.

9. ACKNOWLEDGMENTS
We thank the anonymous reviewers for their useful comments.
This material is based upon work supported by the National Science Foundation under Grant Numbers IIS-0713581, CNS-0834709, and CNS-1028381, by NIH/NLM grant 1R01LM009153-01, a Sloan Research Fellowship, and a Yahoo! Key Scientific Challenge Award.

10. REFERENCES
[1] G. Amati and C. J. Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357-389, October 2002.
[2] S. Clinchant and E. Gaussier. Information-based models for ad hoc IR. In SIGIR '10, pages 234-241, 2010.
[3] R. Cummins and C. O'Riordan. An axiomatic comparison of learned term-weighting schemes in information retrieval: clarifications and extensions. Artif. Intell. Rev., 28(1):51-68, June 2007.
[4] H. Fang, T. Tao, and C. Zhai. A formal study of information retrieval heuristics. In SIGIR '04, pages 49-56, 2004.
[5] H. Fang, T. Tao, and C. Zhai. Diagnostic evaluation of information retrieval models. ACM Trans. Inf. Syst., 29(2):7:1-7:42, April 2011.
[6] H. Fang and C. Zhai. An exploration of axiomatic approaches to information retrieval. In SIGIR '05, pages 480-487, 2005.
[7] H. Fang and C. Zhai. Semantic term matching in axiomatic approaches to information retrieval. In SIGIR '06, pages 115-122, 2006.
[8] N. Fuhr. Probabilistic models in information retrieval. The Computer Journal, 35(3):243-255, 1992.
[9] H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM J. Res. Dev., 1(4):309-317, October 1957.
[10] Y. Lv and C. Zhai. Adaptive term frequency normalization for BM25. In CIKM '11, 2011.
[11] Y. Lv and C. Zhai. When documents are very long, BM25 fails! In SIGIR '11, pages 1103-1104, 2011.
[12] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR '98, pages 275-281, 1998.
[13] S. E. Robertson and K. S. Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129-146, 1976.
[14] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR '94, pages 232-241, 1994.
[15] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In TREC-3, pages 109-126, 1994.
[16] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613-620, 1975.
[17] A. Singhal. Modern information retrieval: a brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4), 2001.
[18] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In SIGIR '96, pages 21-29, 1996.
[19] T. Tao and C. Zhai. An exploration of proximity measures in information retrieval. In SIGIR '07, pages 295-302, 2007.
[20] C. Zhai and J. D. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In SIGIR '01, pages 334-342, 2001.