
Probability Kinematics in Information Retrieval: a case study

F. Crestani
Dipartimento di Elettronica e Informatica, Universita' di Padova, Italy

C.J. van Rijsbergen
Department of Computing Science, University of Glasgow, Scotland

Abstract

In this paper we discuss the dynamics of probabilistic term weights in different IR retrieval models. We present four models based on different notions of retrieval. Two of these are classical probabilistic models long in use in IR; the other two are based on a logical technique for evaluating the probability of a conditional, called Imaging, one being a generalisation of the other. We analyse the transfer of probabilities occurring in the representation space at retrieval time for these four models, compare their retrieval performance using classical test collections, and discuss the results.

Contents

1 Introduction
2 The representation space
   2.1 Possible World Semantics
   2.2 The term space
3 Probability kinematics for retrieval models
   3.1 Retrieval by Joint Probability
   3.2 Retrieval by Conditional Probability
   3.3 Retrieval by Logical Imaging
   3.4 Retrieval by General Logical Imaging
4 Probability, similarity and opinionated functions
5 A case study
6 Experiments with test collections
7 Comparative evaluation
8 Conclusions
A A case study of probability dynamics
   A.1 Retrieval by Joint Probability
   A.2 Retrieval by Conditional Probability
   A.3 Retrieval by Logical Imaging
   A.4 Retrieval by General Logical Imaging

1 Introduction

In Information Retrieval (IR), probabilistic modelling means the use of a model to rank documents in decreasing order of their evaluated or, more often, estimated probability of relevance to a user's information need, expressed by means of a query. In an IR system based on a probabilistic model, the user is always guided to examine first the documents most likely to be relevant to their need, that is, those at the top of the ranked list. The use of a probabilistic model in IR assures that we can obtain "optimal retrieval performance" if we rank documents according to their probability of being judged relevant to a query [15]. However, this rule, called the Probability Ranking Principle, refers only to "optimal retrieval", which is different from "perfect retrieval". Optimal retrieval can be defined precisely for probabilistic IR because optimality can be proved theoretically, owing to a provable relationship between ranking and the probabilistic interpretation of precision and recall. Perfect retrieval relates to the objects of IR systems themselves, i.e. documents and information needs; but since IR systems use representations of these objects, perfect retrieval is not an appropriate goal for computer-based systems. Despite that, probabilistic models based on the Probability Ranking Principle have been shown to give the highest levels of retrieval effectiveness currently available.

Although there are a few operative IR systems based on probabilistic or semi-probabilistic models, there are still obstacles to getting many probabilistic models accepted in the commercial IR world. One major obstacle is that of finding methods for estimating the probabilities of relevance that are both effective and computationally efficient. Past and present research has made much use of formal probability theory and statistics to solve the problems of estimation. In mathematical terms the problem consists in estimating the probability P(R | q, d), i.e. the probability of relevance given a query q and a document d, for every document in the collection, and ranking the documents according to this measure. This is very difficult because of the large number of variables involved in the representation of documents compared with the small amount of feedback data available about the relevance of documents (sometimes referred to as the "curse of dimensionality").

In 1986 Van Rijsbergen [22] proposed the use in IR of techniques based on non-classical conditional logic. This would enable the estimation of P(R | q, d) by the evaluation of P(d → q) instead, thus proposing to use the probability of a conditional to estimate the conditional probability. The evaluation of the probability P(d → q) should follow this logical uncertainty principle:

"Given any two sentences x and y, a measure of the uncertainty of y → x related to a given data set is determined by the minimal extent to which we have to add information to the data set, to establish the truth of y → x."

That proposal initiated a new line of research (see for example [12, 13, 3, 2]), but in that paper nothing was said about how "uncertainty" and "minimal" might be quantified. A few years later, moving into Modal Logic, Van Rijsbergen proposed to estimate the probability of the conditional by a process called Logical Imaging [23]. This technique has been explored in more detail in [5].

In this paper we explore further the use of the probability of a conditional, namely P(d → q), to estimate the conditional probability P(R | q, d). We propose the use in IR of a technique called General Logical Imaging (or simply "General Imaging"), proposed by Gardenfors [7] in the context of Belief Revision theory. This technique is a generalisation of Imaging [10] that enables a more general redistribution of probabilities than Imaging. We analyse and compare the probability kinematics of Imaging and General Imaging with the more classical probabilistic models that have been the basis of many IR probabilistic models: the Joint Probability model and the Conditional Probability model.

The paper is structured as follows. Section 2 describes the model that will be used to represent documents and queries in the rest of the paper. Section 3 describes the probability kinematics of four different retrieval models. An example of the different results that can be achieved using these four models is given in Section 5. The retrieval performance of these models is evaluated using the probability and similarity functions described in Section 4 on some standard test collections described in Section 6. The results are presented, compared, and discussed in Section 7. Section 8 ends the paper with the conclusions of the experimental investigation.

2 The representation space

In probabilistic IR the task of the system can be formalised as follows. If we assume binary relevance judgements, i.e. the set R of possible relevance judgements contains only the two judgements relevant (R) and not-relevant (R̄), then according to the Probability Ranking Principle the task of the system is to rank the documents according to their probability of being relevant, P(R | q, d), where q and d are, strictly speaking, the real query and the real document. Of course we can only estimate this probability by using the available query and document representations. The probability P(R | q, d) is the Retrieval Status Value (RSV) that will be used to rank documents.

The difficulty of applying probabilistic IR therefore lies in two different problems: representation and estimation.

The problem of estimating P(R | q, d) is tackled in this paper from a theoretical point of view. In the past many researchers have tried to estimate P(R | q, d) in many ways; it is in fact from these attempts that many probabilistic models for IR have sprung. In Section 3 we report on four models, analyse their differences, and draw some interesting conclusions.

The problem of representing documents and queries is very difficult to tackle, and we ignore representation issues in this paper. We use a representation model that helps us in the analysis of the probabilistic models presented in Section 3. This representation model is based on a particular semantics: the "Possible World Semantics" [11].

2.1 Possible World Semantics

Possible World Semantics (PWS) was introduced by Kripke [9] in the context of Modal Logic. In this semantics the truth value of a logical sentence is evaluated in the context of a world. The word "world" has been used by a number of logicians in this connection, and seems to be the most convenient one, but perhaps some such phrase as "conceivable or envisageable state of affairs" ([8], p. 75) would convey the idea more clearly. PWS has been used in Modal systems to give a semantics for Necessity (a sentence is true in every possible world) and Possibility (a sentence is true in at least one possible world). (Here we simply refer to the Modal System S5 and not to more complex systems.)

Without entering into the details of this semantics, we would like to point out the main reasons why we use it. We use PWS because it enables the evaluation of a conditional sentence without explicitly defining the operator "→". What it requires is a clustering of the space of events (worlds) by means of a primitive relation of neighbourhood. This clustering enables us to define an accessibility relation that is necessary for the evaluation of the conditional. According to the PWS, the truth value of the conditional y → x in a world w is equivalent to the truth value of the consequent x in the closest world w_y to w where the antecedent y is true [19]. In Section 3.3 we explain how we can use this result in the context of IR, but first let us examine how we can use the PWS to model our probabilistic retrieval space.

2.2 The term space

One of the most often used IR models is the Vector Space Model (VSM) [17]. In this model a document is represented by means of a vector whose elements are numbers representing the presence or absence of certain features in that document, for example the presence or absence of some index terms. (For simplicity of exposition we only consider the binary case.) The document representation space is therefore multidimensional, with as many dimensions as the number of features used to represent documents. A document is represented in this space as a single point. A query is also a point, and document retrieval is performed as a function of the distance in the representation space between query and documents. The semantics of the VSM is therefore that of a multidimensional geometrical space. Many IR models use this representation space as the underlying space.

In this paper we use a different approach. The semantics of our representation space is based on the PWS. We can use the PWS in the context of IR by considering a term as a possible world. This view was proposed in [5]. According to it, a term is represented as a "vector of documents". This is the inverse of the representation model used in the VSM. Intuitively it can be understood as "if you want to know the meaning of a term, then look at all the documents in which that term occurs". This idea is not new in IR (see for example [1, 14]) and it has been widely used for the evaluation of term-term similarity (see Section 4).

More formally, we assume a set of terms T, the set of our possible worlds. We also assume a "prior" probability distribution P assigning to each term (world) t ∈ T a probability P(t) so that Σ_t P(t) = 1. We then assume we have a document collection D. The documents are represented using terms in T, as is common in IR, but in our representation space we use documents to represent terms, in such a way that a term t is represented by a binary vector whose elements denote the presence or absence of documents in the term representation. Finally, we assume we have a query q, also represented using terms in T.

3 Probability kinematics for retrieval models

In the following sections we examine the different movements of probability that take place in four different retrieval models. Our purpose is to show how the probability associated with terms changes and shifts in different ways in different models. We do not intend to associate directly any of the models here with existing IR models; however, these four models can be considered to be the archetypes of the most common IR models.
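As a concrete illustration (ours, not from the paper), the following Python sketch builds the "term as possible world" representation on a fragment of the running example; the uniform prior is an assumption made only for the sketch, since Section 4 uses IDF instead.

```python
from collections import defaultdict

# Toy fragment: each document is the set of index terms it contains.
docs = {
    "d1": {"t1", "t5", "t6"},
    "d2": {"t1", "t6"},
}
query = {"t1", "t4", "t6"}

# Invert the representation: a term becomes a binary vector over documents
# (stored here as the set of documents in which the term occurs).
term_worlds = defaultdict(set)
for doc_id, terms in docs.items():
    for t in terms:
        term_worlds[t].add(doc_id)

# A "prior" over terms; uniform purely for illustration.
prior = {t: 1.0 / len(term_worlds) for t in term_worlds}
assert abs(sum(prior.values()) - 1.0) < 1e-9  # sum_t P(t) = 1

def I(w, rep):
    """Indicator I(w, y): 1 if world w occurs in the representation of y."""
    return 1 if w in rep else 0
```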

The probability kinematics of the first two models shows how the VSM and the probabilistic retrieval model may be explained in terms of the framework adopted for probability transfer. The last two models are new and are based on a completely different approach to the transfer of probability; their origin lies in the field of non-classical logics. We show that in principle, i.e. without entering into complex "ad hoc" weighting and retrieval schemas, these last two models perform better than the first two. This result suggests that an improvement in retrieval effectiveness can be obtained by designing IR systems based on probabilistic models that use a completely new kind of probability kinematics.

We explain the changes in the probability space by using the representation model described in Section 2, taking as an example a particular document d and a query q. We suppose we have a document d represented by terms t1, t5, and t6, and a query q represented by terms t1, t4, and t6. Each of these terms has a "prior" probability associated with it, indicated by P(t). In the following we show how the RSV of document d is evaluated by different retrieval models, and we concentrate our attention on how the probabilities associated with terms change and move from term to term during the evaluation of the RSV. We indicate the new "posterior" probability associated with terms by P_d(t), to highlight the fact that it is obtained by taking into consideration a particular document d.

In the following sections we make extensive use of the following function:

    I(w, y) = 1 if w occurs in y, 0 otherwise

We use this function to evaluate the presence or absence of a representation element w in the representation of y. In the context of the PWS, assuming that y is a logical sentence and w is a possible world, we can also interpret I(w, y) as follows:

    I(w, y) = 1 if y is true at w, 0 otherwise

3.1 Retrieval by Joint Probability

We call Retrieval by Joint Probability (RbJP) the ranking and retrieval of documents obtained by estimating the probability of relevance with the probability of the joint event of having both the query and the document true for a term:

    P(R | q, d) ≈ P(q, d)

RbJP evaluates the RSV of a document in accordance with the following formula:

    P(q, d) = Σ_t P(t) I(t, d) I(t, q)

where, for a document d, we compute the sum of the probabilities of all terms that are present both in that document and in the query. In the PWS this is equivalent to the sum of the probabilities of the worlds in which both the document and the query are true.

In the RbJP model there is no transfer of probabilities: the "prior" probability P(t) associated with term t does not change. That is why we do not need to evaluate a "posterior" probability P_d(t). It is important to note that the RbJP model is the archetype of many IR models currently in use. Most IR models that are based on the evaluation of a similarity between documents and query are built on the idea of a joint probability measure. Both Dice's and Jaccard's coefficients, for example, are based on this idea, as can easily be seen once the normalisation factors are removed (see [21], p. 39). The Cosine Correlation used by the Vector Space Model is again a normalised version of RbJP. Of course, we do not intend to undervalue the importance of normalisation factors; we just want to point out that the probability kinematics of all these IR models does not change once a normalisation factor is introduced: it remains substantially the same as that of RbJP.
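As a minimal sketch (ours), the scoring rule can be written directly; `prior` is the hypothetical P(t) dictionary from the snippet above.

```python
def rsv_rbjp(doc_terms, query_terms, prior):
    """P(q, d) = sum_t P(t) I(t, d) I(t, q): no probability is moved."""
    return sum(p for t, p in prior.items()
               if t in doc_terms and t in query_terms)
```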

To show how we evaluate the RSV in the case of RbJP, we show an example in Table 1, where we report the evaluation of P(q, d).

[Table 1: Evaluation of P(q, d). Columns: t, P(t), I(t, d), I(t, q), P(t) I(t, d) I(t, q).]

The evaluation process is the following:

1. Identify the terms occurring in the document d (third column of the table).
2. Identify the terms occurring in the query q (fourth column).
3. Evaluate P(q, d) by summing the probabilities of all terms present in both document and query (fifth column).

[Figure 1: Graphical interpretation of the evaluation of P(q, d).]

A graphical interpretation of RbJP using the PWS is given in Figure 1, where each term is represented by a world with its "prior" probability measure expressing the importance of the term in the term space T. The shadowed terms are those occurring in document d (Figure 1(a)); P(q, d) is obtained by summing the probabilities of all terms occurring in both the document and the query representations, that is, summing the probabilities of the shadowed terms also occurring in q (Figure 1(b)).

3.2 Retrieval by Conditional Probability

In the case of Retrieval by Conditional Probability (RbCP) we estimate the probability of relevance of a document by evaluating the conditional probability of the query given that document:

    P(R | q, d) ≈ P(q | d)

P(q | d) can be evaluated as follows:

    P(q | d) = P_d(q) = Σ_t P_d(t) I(t, q) = Σ_{t ∈ d} P(t) (1 + α_d) I(t, q)

where P_d(t) is the "posterior" probability distribution over the set of terms occurring in d, obtained by conditioning on the document d itself, and α_d is the ratio by which the "prior" probability is modified.

The value α_d is the ratio between the sum of the probabilities of the terms not occurring in d and the sum of the probabilities of those occurring in d:

    α_d = (Σ_{t ∉ d} P(t)) / (Σ_{t ∈ d} P(t))

It should be noticed that the transfer of probabilities that takes place in this model provides the minimal revision of the "prior" probability necessary to make d certain that does not distort the profile of probability ratios. In fact the "posterior" probability is directly proportional to the "prior" probability, so the ratios among the probabilities associated with the terms are left constant after the contraction of the representation space due to the certainty of the conditioning event d.

Let us now see an example of this evaluation using query q and document d. Table 2 reports the evaluation of P(q | d).

[Table 2: Evaluation of P(q | d). Columns: t, P(t), I(t, d), P_d(t), I(t, q), P_d(t) I(t, q).]

The evaluation process is the following:

1. Identify the terms occurring in the document d (third column of the table).
2. Evaluate the "posterior" probability P_d(t) by transferring the probabilities from terms not occurring in the document to terms occurring in it. The probabilities are transferred proportionally, so that each term occurring in the document d receives a portion of the sum of the probabilities of the terms not occurring in the document proportional to its "prior" probability (fourth column).
3. Evaluate I(t, q) for each term, i.e. determine the terms occurring in the query (fifth column).
4. Evaluate P_d(t) I(t, q) for all terms (sixth column) and evaluate P_d(q) by summation (bottom of sixth column).
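A sketch of this proportional transfer (ours, using the same assumed structures as the earlier snippets):

```python
def rsv_rbcp(doc_terms, query_terms, prior):
    """P(q | d): condition the prior on d, then sum over query terms."""
    mass_in = sum(p for t, p in prior.items() if t in doc_terms)
    mass_out = sum(p for t, p in prior.items() if t not in doc_terms)
    alpha = mass_out / mass_in  # ratio by which priors of terms in d grow
    # P(t)(1 + alpha) for t in d is equivalent to P(t) / mass_in.
    posterior = {t: (p * (1 + alpha) if t in doc_terms else 0.0)
                 for t, p in prior.items()}
    return sum(p for t, p in posterior.items() if t in query_terms)
```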

[Figure 2: Graphical interpretation of the evaluation of P(q | d).]

It is interesting to see a graphical interpretation of this process. In Figure 2(a) each term is represented by a world with its "prior" probability expressing the importance of the term in the term space. The shadowed terms occur in document d. The conditioning process transfers the probability from terms not occurring in the document d to those occurring in it, as depicted in Figure 2(b). In Figure 2(c) the terms with null probability disappear and those occurring in the query q are taken into consideration. The "posterior" probabilities P_d(t) of terms occurring in the query are summed to evaluate P(q | d).

3.3 Retrieval by Logical Imaging

Imaging is a process developed in the framework of Modal Logic that enables the evaluation of a conditional sentence without explicitly defining the operator "→" [19]. Imaging was extended by Lewis [10] to the case where there is a probability distribution on the worlds. In this case the evaluation of P(y → x) causes a shift of the original probability P from a world w to the closest world w_y where y is true. Probability is neither created nor destroyed; it is simply moved from a "not-y-world" to a "y-world" to derive a new probability distribution P_y. This process is called "deriving P_y from P by imaging on y". We will not go into a detailed explanation of the Imaging process; the interested reader can look at the papers by Stalnaker [19], Lewis [10], and Gardenfors [7] for more details.

We use Imaging in IR with the purpose of estimating the probability of relevance of a document by means of the probability of the conditional d → q:

    P(R | q, d) ≈ P(d → q)

We call Retrieval by Logical Imaging (RbLI) the model based on such an RSV. A detailed explanation of this model can be found in [5].

Briefly, in RbLI we consider the process of Imaging on d over all the possible terms t in T. More formally:

    P(d → q) = P_d(q) = Σ_t P_d(t) I(t, q) = Σ_t P(t) I(t_d, q)

where t_d is the closest term to t for which d is true or, in other words, the most similar term to t that occurs in the document d. The application of this technique to IR requires an appropriate measure of similarity over the term space T to enable the identification of t_d; we tackle this problem in Section 4.

It should be noticed that Imaging provides the minimal revision of the "prior" probability in the sense that it involves no gratuitous movement of probability from a world to dissimilar worlds. In fact, the revision of the "prior" probability necessary to make d certain is obtained by adopting the least drastic change in the probability space. This is achieved by transferring the probability of each term not occurring in the document d to its closest (most similar) term occurring in it, so that the total distance covered in the transfer is minimal. A detailed comparison between Conditionalisation and Imaging can be found in [6].
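A minimal sketch of imaging on d (ours); `most_similar` is a hypothetical per-term ranking of all other terms by decreasing similarity, such as the EMIM ordering of Section 4 would produce:

```python
def rsv_rbli(doc_terms, query_terms, prior, most_similar):
    """P(d -> q) by imaging on d.

    most_similar[t] lists all other terms in decreasing order of
    similarity to t; each term's prior moves, undivided, to the most
    similar term occurring in d (or stays at t if t itself is in d).
    """
    posterior = {t: 0.0 for t in prior}
    for t, p in prior.items():
        if t in doc_terms:
            t_d = t  # d is already true at t: probability stays put
        else:
            t_d = next(u for u in most_similar[t] if u in doc_terms)
        posterior[t_d] += p
    return sum(p for t, p in posterior.items() if t in query_terms)
```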

For a practical example of the evaluation of RbLI, let us suppose we have the same query q and document d of the previous sections. Table 3 reports the evaluation of P(d → q) by imaging on d.

[Table 3: Evaluation of P(d → q) by imaging on d. Columns: t, P(t), I(t, d), t_d, P_d(t), I(t, q), P_d(t) I(t, q).]

The evaluation process is the following:

1. Identify the terms occurring in the document d (third column of the table).
2. Determine for each term in T the term t_d, i.e. the most similar term to t for which I(t, d) = 1. This is done using a similarity measure on the term space (fourth column).
3. Evaluate P_d(t) by transferring the probabilities from terms not occurring in the document to terms occurring in it (fifth column).
4. Evaluate I(t, q) for each term, i.e. identify the terms occurring in the query (sixth column).
5. Evaluate P_d(t) I(t, q) for all terms (seventh column) and evaluate P_d(q) by summation (bottom of seventh column).

[Figure 3: Graphical interpretation of the evaluation of P(d → q) by imaging on d.]

A graphical interpretation of this process is depicted in Figure 3. As we said, we assume we have a measure of similarity on the term space; using it, we can transfer probability from each term not occurring in the document d to its most similar term occurring in d. After the transfer of probability, terms with null probability disappear and those occurring in the query q are taken into consideration, so that their "posterior" probabilities P_d(t) can be summed to evaluate P_d(q).

3.4 Retrieval by General Logical Imaging

The idea of General Imaging originated from an attempt to overcome one of the restrictive assumptions Lewis made for Stalnaker's semantics of conditionals [10]. The assumption concerns the "uniqueness" of the world w_y, that is, the uniqueness of the world most similar to w where y is true. In [7], p. 110, Gardenfors proposes a generalisation of the Imaging process that does not rely on this assumption. (In that book he also characterises this generalisation of Imaging in terms of a homomorphic condition that does not presuppose any kind of possible world semantics, but we will remain faithful to our semantics in the rest of this paper.) The starting point of the generalisation is the use of a (degenerate) probability function to represent the fact that in any possible world w a proposition y is either true or false. Using our formalism, I(w, y) = P_w(y): in this case P_w(y) = 1 if y is true in w, and P_w(y) = 0 if y is false in w. Lewis called such a probability function opinionated because "it would represent the beliefs of someone who was absolutely certain that the world w was actual and who therefore held a firm opinion about every question" (see [10], p. 145).

Suppose we have a set of possible worlds W. We can generalise Imaging by considering that, instead of having P_w(y) = 1 only for a single world w_y, we can have P_w(y) > 0 for a few worlds. More formally:

    P_w(y) = 0 if y is false at w, and P_w(y) > 0 otherwise

with the requirement that Σ_w P_w(y) = 1. We then have, for the probability that w assigns to the conditional:

    P_w(y → x) = Σ_{w'} P_w(w') P_{w'}(x)

where the sum ranges over the closest worlds w' to w at which y is true. Now we assume a probability distribution over the set of possible worlds W (the "prior" probability) so that, according to the classical rules of probability, Σ_w P(w) = 1. Hence we define P(y) as follows:

    P(y) = Σ_w P(w) P_w(y)

From this probability distribution we can derive a new probability distribution P' so that:

    P'(w') = Σ_w P(w) P_w(w') I(w', w)

where:

    I(w', w) = 1 if w' belongs to the set W_y of w, and 0 otherwise

with W_y as the set of the closest worlds to w where y is true. It can be proved, with a demonstration similar to the one reported in [10], p. 142, that P(y → x) = P'(x) or, using a terminology more appropriate to highlight the Imaging process on y, that:

    P(y → x) = P_y(x)

where P_y is the "posterior" probability distribution derived from the "prior" probability P by General Imaging on y.

In other words, P_y can be obtained by transferring the probability from every world w to W_y, which in this case is the set of most similar (closest) worlds to w where y is true. The transfer of probability is performed according to the opinionated probability function P_w. Lewis's Imaging, as reported in Section 3.3, is the special case of General Imaging in which P_w(y) = 1 for a single w.

Retrieval by General Logical Imaging (RbGLI) can be regarded as the process of applying General Imaging on d in order to evaluate the RSV of the document d. More formally:

    P(d → q) = P_d(q) = Σ_t P_d(t) I(t, q),  with  P_d(t) = Σ_{t'} P(t') P_{t'}(t)

where P_{t'}(d) is the opinionated probability of d in t', and P_{t'}(t) is the opinionated probability of term t in t' (non-zero only when t is one of the closest terms to t' occurring in d), which is necessary to evaluate the former. The application of the above technique to IR again requires an appropriate measure of similarity over the term space T, to enable the identification of the set of closest terms to t where d is true. We will tackle this problem in Section 4.

Table 4 reports an example of the evaluation of P(d → q) by general imaging on d.

[Table 4: Evaluation of P(d → q) by general imaging on d. Columns: t, P(t), I(t, d), t_d, P_d(t), I(t, q), P_d(t) I(t, q).]

The evaluation process is the following:

1. Identify the terms occurring in the document d (third column of the table).

2. Determine, for each term not occurring in the document (those with I(t, d) = 0), the most similar terms (in this example only two terms) occurring in the document (those with I(t, d) = 1). This is done using a similarity measure on the term space (fourth column).

3. Evaluate P_d(t) by transferring the probabilities from terms not occurring in the document to terms occurring in it (fifth column), using an opinionated probability function, also called a transfer function. In this example the transfer function prescribes that the most similar term to the one under consideration gets 2/3 of its probability, while the second most similar gets the remaining 1/3.

4. Evaluate I(t, q) for each term, i.e. determine the terms occurring in the query (sixth column).

5. Evaluate the probabilities P_d(t) I(t, q) for all terms (seventh column) and evaluate P_d(q) by summation (bottom of seventh column).

[Figure 4: Graphical interpretation of the evaluation of P(d → q) by general imaging on d.]

A graphical interpretation of this process is depicted in Figure 4. As can be seen in the picture, in the case of RbGLI the transfer of probability is performed from each term not occurring in the document d to the k_t most similar terms occurring in d. In the above example k_t = 2 for every term, but k_t can be set to any other integer as long as 1 ≤ k_t ≤ l_t, where l_t is the number of terms occurring in the document d. The value of k_t can in theory be different for every term. For k_t = 1 for every term, RbGLI defaults to RbLI. For k_t = l_t for every term, the transfer may seem similar to RbCP, but it must be noted that the probability transfer in RbGLI is performed by taking into account the similarity between terms, and not the ratio of "prior" probabilities as in the case of RbCP.
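A sketch of general imaging with a fixed k_t = k (ours); the geometric weights reproduce the 2/3-1/3 split of the example for k = 2 and generalise to the halving scheme described in Section 4:

```python
def rsv_rbgli(doc_terms, query_terms, prior, most_similar, k=2):
    """P(d -> q) by general imaging on d.

    Each term's prior is spread over its k most similar terms occurring
    in d, with geometrically halving weights: for k = 2 the closest
    receiver gets 2/3 and the next one 1/3, as in the paper's example.
    """
    posterior = {t: 0.0 for t in prior}
    for t, p in prior.items():
        if t in doc_terms:
            posterior[t] += p  # d is already true at t: nothing moves
            continue
        receivers = [u for u in most_similar[t] if u in doc_terms][:k]
        # Weights 2^(m-1), ..., 2, 1 over m receivers, normalised to one.
        raw = [2.0 ** (len(receivers) - 1 - i) for i in range(len(receivers))]
        total = sum(raw)
        for u, w in zip(receivers, raw):
            posterior[u] += p * w / total
    return sum(p for t, p in posterior.items() if t in query_terms)
```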

4 Probability, similarity and opinionated functions

In order to perform an experimental investigation into the previous four retrieval models we have three requirements:

1. a "prior" probability distribution over the set of worlds, which should reflect the importance of each world (its mass, to take an analogy with planets and stars) in the universe;
2. a measure of similarity (which is related to distance) between worlds;
3. an opinionated probability function over the set of possible worlds for every document in the collection.

Notice that the first requirement is present in every one of the four models, the second only in RbLI and RbGLI, and the third only in RbGLI. According to our view in which a term is a world, these three requirements become: a probability distribution, a measure of similarity, and a set of opinionated functions on the term space T.

The problem of determining an appropriate "prior" probability distribution over the set of terms used to index a document collection is one of the oldest in IR, and many models have been proposed for this purpose. The problem can be translated into finding a measure of the importance of a term in the term space, where this importance is related to the ability of the term to discriminate between relevant and non-relevant documents. In IR several discrimination measures have been proposed (see for example [21, 16]). For the experiments reported in this paper we used the Inverse Document Frequency (IDF), a measure which assigns high discrimination power to terms with low and medium collection frequency. IDF is defined as:

    IDF(t) = log(N / n)

where n is the number of documents in which t occurs, and N is the number of documents in the collection. Strictly speaking, this is not a probability measure, since Σ_t IDF(t) ≠ 1; however, we can assume it to be monotone with respect to P(t). We can use this estimate because we require only a ranking of the documents: the exact probability values are not used. We also chose this measure because it does not require relevance information; at this stage of our work we prefer not to require relevance information, which would have to come from Relevance Feedback with the user.
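A direct transcription of this estimate (ours); normalising the IDF weights into a pseudo-prior is an assumption, made here only because the sketches above expect weights that sum to one, and a monotone rescaling does not change any ranking.

```python
import math

def idf_prior(docs):
    """IDF(t) = log(N / n_t), rescaled into a pseudo-prior over terms."""
    N = len(docs)
    n = {}
    for terms in docs.values():
        for t in terms:
            n[t] = n.get(t, 0) + 1
    idf = {t: math.log(N / n_t) for t, n_t in n.items()}
    total = sum(idf.values())
    # Monotone rescaling so the weights sum to one, as P(t) requires.
    return {t: v / total for t, v in idf.items()} if total > 0 else idf
```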

The problem of measuring the similarity between terms, and its use to define the accessibility among worlds, is more difficult. It is very important to choose the appropriate measure, since much of RbLI and RbGLI depends on it. In this paper we decided to use the Expected Mutual Information Measure (EMIM) between terms as the measure of accessibility between worlds. The EMIM between two terms is often interpreted as a measure of the statistical information contained in the first term about the other one (or vice versa, it being a symmetric measure). EMIM is defined as follows:

    I(i, j) = Σ_{t_i, t_j} P(t_i, t_j) log [ P(t_i, t_j) / (P(t_i) P(t_j)) ]

where i and j are binary variables representing terms. When we apply this measure to binary variables we can estimate the EMIM between two terms using the technique proposed in [20]. This technique makes use of co-occurrence data, which can simply be derived by a statistical analysis of the term occurrences in the collection. Using this measure we can then evaluate, for every term, a ranking of all the other terms according to their decreasing level of similarity with it. We store this information in a file which is used at run time to determine, for a term, its nearest neighbour occurring in the document under consideration.

The definition of a possibly different opinionated function on the term space for every document in the collection seems to be a very heavy requirement. However, it is quite common in IR to use indexing techniques that assign to every term a different weight for each document. The problem here is to identify the most appropriate measure given the semantics of the opinionated function; we will not tackle this problem in this paper. In the evaluation reported in Section 7 we use a monotonically decreasing transfer function that transfers from a term t a decreasing fraction of its probability to all the other terms in the term space, once they are ordered in decreasing order of similarity. In particular, to simplify computations, in the evaluation of P(d → q) by general imaging on d, from each term not occurring in d we transfer its probability only to the first 10 most similar terms occurring in d. The transfer function we use works in such a way that the i-th of these 10 terms, ordered in decreasing order of similarity, always gets a fraction of P(t) that is double the one the (i+1)-th gets. A graphical model of this function is depicted in Figure 5.

[Figure 5: A graphical model of the term probability redistribution on its 10 closest terms (percentage of probability received vs. terms in order of similarity).]
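A sketch of the EMIM estimate from binary occurrence data (ours); plain maximum-likelihood frequencies stand in for the estimation technique of [20], which the paper does not detail here. Ranking, for each term, all other terms by decreasing `emim` yields the `most_similar` structure assumed in the earlier sketches.

```python
import math

def emim(doc_sets, t_i, t_j):
    """Expected mutual information between two binary term variables.

    doc_sets maps each document id to its set of terms; probabilities
    are plain relative frequencies over the collection.
    """
    N = len(doc_sets)
    eps = 1e-12  # guard against division by zero in degenerate cells
    score = 0.0
    for vi in (True, False):       # t_i present / absent
        for vj in (True, False):   # t_j present / absent
            n = sum(1 for ts in doc_sets.values()
                    if (t_i in ts) == vi and (t_j in ts) == vj)
            p_ij = n / N
            p_i = sum(1 for ts in doc_sets.values() if (t_i in ts) == vi) / N
            p_j = sum(1 for ts in doc_sets.values() if (t_j in ts) == vj) / N
            if p_ij > 0:
                score += p_ij * math.log(p_ij / max(p_i * p_j, eps))
    return score
```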

5 A case study

In order to better understand the differences in the retrieval results that can be achieved by using different probabilistic models to evaluate the RSV, we now give a simple example using a very small collection made of only 5 documents and a term space made up of only 6 terms. We will show that we can obtain different rankings, and different retrieval, of these documents according to the retrieval model used. Let us suppose we have the following term space:

    T = {t1, t2, t3, t4, t5, t6}

and the following document collection:

    D = {d1, d2, d3, d4, d5}

Let us suppose that both documents and query are represented by means of terms in T, as explained in Section 2.2, as follows:

    d1 = {t1, t5, t6}
    d2 = {t1, t6}
    d3 = {t1, t2, t3, t5}
    d4 = {t1, t2}
    d5 = {t1, t3, t6}

and the query

    q = {t1, t4, t6}

What our IR system is supposed to do is provide the user who submitted the query q with a ranking of the documents in D, so that the ones most relevant to the query are at the top. The ranking is produced according to the RSV, which is evaluated according to a particular retrieval model. In order to use the RbLI and RbGLI models we also need a measure of similarity between terms. This has been evaluated using a similarity function S(t_i, t_j). The similarity between terms is used to rank, for every term, all the other terms in T, so that the closest one is at the top of the ranking. For this example the ranking produced after evaluating the term-term similarity is the following (each row lists the other terms in decreasing order of similarity to the first term):

    t1: t2 t5 t6 t4 t3
    t2: t4 t1 t6 t5 t3
    t3: t4 t5 t6 t2 t1
    t4: t3 t2 t5 t1 t6
    t5: t6 t3 t2 t1 t4
    t6: t5 t3 t1 t4 t2

In order to use RbGLI we also need an opinionated function. We used this simple one: the probability is moved from a term only to its two closest terms, so that the closest gets 2/3 of the term's probability while the second closest gets 1/3 of it.

We do not report here the details of all the evaluations needed to determine the RSV of each document for each model; we refer the interested reader to the Appendix of this paper, and here only discuss the results. The rankings produced by the retrieval models are the following:

    RbJP:  d1 = d2 = d5 > d3 = d4
    RbCP:  d2 > d5 > d4 > d1 > d3
    RbLI:  d2 > d5 > d1 > d4 > d3
    RbGLI: d2 > d5 > d1 > d4 > d3
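Tying the earlier sketches together (ours), the snippet below encodes this toy collection and similarity ranking; since the prior P(t) used in the Appendix is not reproduced here, a uniform prior is assumed, so the printed rankings illustrate the mechanics rather than guarantee the paper's exact ties and orderings.

```python
docs = {
    "d1": {"t1", "t5", "t6"}, "d2": {"t1", "t6"},
    "d3": {"t1", "t2", "t3", "t5"}, "d4": {"t1", "t2"},
    "d5": {"t1", "t3", "t6"},
}
query = {"t1", "t4", "t6"}
most_similar = {  # rows of the term-term similarity ranking above
    "t1": ["t2", "t5", "t6", "t4", "t3"],
    "t2": ["t4", "t1", "t6", "t5", "t3"],
    "t3": ["t4", "t5", "t6", "t2", "t1"],
    "t4": ["t3", "t2", "t5", "t1", "t6"],
    "t5": ["t6", "t3", "t2", "t1", "t4"],
    "t6": ["t5", "t3", "t1", "t4", "t2"],
}
prior = {t: 1 / 6 for t in most_similar}  # assumed; the paper's P(t) is not given here

for name, rsv in (("RbJP", rsv_rbjp), ("RbCP", rsv_rbcp),
                  ("RbLI", rsv_rbli), ("RbGLI", rsv_rbgli)):
    extra = (most_similar,) if name in ("RbLI", "RbGLI") else ()
    scores = {d: rsv(terms, query, prior, *extra) for d, terms in docs.items()}
    print(name, sorted(scores, key=scores.get, reverse=True))
```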

We can draw some interesting observations from these results:

1. RbJP produces some misleading results. Documents d1, d2, and d5 are all ranked at the top and at the same level of estimated relevance. This is because they all share the same 2 terms in common with the query. The fact that document d2 has all of its terms in common with the query, while d1 and d5 have only 2 out of 3 terms in common with it, is not taken into due account. Documents d3 and d4 are also ranked together, since they have only one, and the same, term in common with the query; in this case, however, this is done quite correctly.

2. RbCP, RbLI, and RbGLI put document d2 at the top of the ranking. This is correct, since the document representation, i.e. terms t1 and t6, is completely included in the query representation. However, the fact that the query also mentions some other index term is not taken into consideration. In [5] we pointed out, in complete agreement with [13], how this fact could be taken into account by evaluating P(q → d). This could be done using either Imaging or General Imaging. A model that combines both P(d → q) and P(q → d) seems to be a very powerful tool for IR.

3. Both RbLI and RbGLI rank document d1 before document d4, as opposed to RbCP, which ranks d1 after d4. The behaviour of RbCP is not correct, since document d1 shares 2 terms with the query while d4 shares only 1. The reason why RbCP does this is that d4 is represented by only 2 terms, while d1 is represented by 3 terms. This causes a movement of probability such that term t1 (the only term d4 has in common with the query) receives quite a large share of the probability moved from all the terms in the term space that are not present in d4. This is due to the fact that the "prior" probability of t1 is double that of t2 (the other term occurring in d4), so the probability moved over to t1 is double that moved over to t2. In RbLI and RbGLI the movement of probability is more balanced, and it is also heavily influenced by the fact that most terms in T are more similar to t2 than to t1. The term t2 therefore gets a larger share of the probability moved from terms not occurring in d4 to terms occurring in d4 (see Appendix).

4. Both documents d3 and d4 have only one term in common with the query. They are ranked at the same level by RbJP, while they are ranked quite differently by RbCP, RbLI, and RbGLI. RbCP ranks d4 quite high up in the ranking, for the reasons explained above, while d3 is ranked last, since the probability from terms not occurring in d3 is moved over to a larger number of terms than for d4 and is therefore more widespread. In particular, quite a lot of the probability moved around ends up on term t5, which has a large prior probability. In RbLI and RbGLI the movement of probability is governed by the similarity between terms instead of by the "prior" probability. The fact that document d4 is ranked before document d3 is completely due to the similarities among terms in the term space; different values of similarity among terms could have caused a different ranking. In the case of document d3, none of the terms not occurring in it is very close to term t1 (the term d3 has in common with the query), and therefore no probability is moved over to it. The same can be said of document d4, where most of the probability moved around ends up on term t2, which does not occur in the query.

5. There is quite an interesting interpretation of the behaviour of RbLI and RbGLI in the context of word sense disambiguation. Let us imagine we have a term t_x that can be interpreted in two different senses, S1 and S2, and let us suppose sense S1 is more common than S2 in the collection of documents D. In this case a similarity measure based on co-occurrence will pick up mainly the sense S1 for t_x, giving high levels of similarity between t_x and other terms used in the same context and with the same sense. Let us now suppose term t_x is used in a query, and see how two documents using t_x in the two different senses will be retrieved. Imagine, for simplicity, that the two documents d_s1 and d_s2 have the same number of terms related to their respective senses. According to RbLI and RbGLI, term t_x will receive a certain amount of probability coming from terms that are close to it but not occurring in the document. In document d_s1 there will certainly be some terms closely related to the S1 sense of t_x; their probability will therefore be retained and not transferred over to t_x. On the other hand, document d_s2 will have fewer terms close to t_x, and there will be a large amount of probability shifted from terms close to t_x but not occurring in d_s2. This movement of probability will cause term t_x to have a larger probability in d_s2 than in d_s1, and so document d_s2 will be ranked higher than d_s1 in the list of (estimated) relevant documents. RbLI and RbGLI therefore have the effect of putting at the top of the ranking documents with uncommon senses of the terms used in the query. This effect is, however, levelled down in queries and documents with a large number of terms.

From this case study we believe we can conclude that RbLI and RbGLI perform more correctly than RbJP and RbCP. However, before making such a strong statement we should test these four retrieval models using some standard testing technique, that is, using some document test collections and producing comparative precision and recall tables. We do this in the following sections.

6 Experiments with test collections

In order to study the retrieval effectiveness of the four models under consideration, a series of tests was performed using some document test collections. Given the heavy computations necessary to evaluate the EMIM between every two terms, and given the theoretical nature of our goals, we decided to use some small but well studied test collections.

The choice fell on the following three test collections, which have been extensively used in the field of IR: the Cranfield 1400, the CACM, and the NPL test collections. The main characteristics of these three test collections are summarised in Table 5. Among the large number of references to these test collections we will only mention the following: [4, 18].

[Table 5: Test collections data. Rows: documents, queries, terms in doc., terms in query, avg. doc. length, avg. query length, avg. rel. doc.; columns: Cranfield, CACM, NPL.]

Precision and recall tables have been evaluated using their standard definitions. (Precision is the proportion of the retrieved set of documents that is relevant to the query; recall is the proportion of all documents in the collection that are relevant to a query and that are actually retrieved.) The method of linear interpolation has been used to determine the standard values corresponding to intervals of 10% in the recall figures. We are aware that linear interpolation estimates the best possible performance between two adjacent observed points (see [21], p. 152). However, since the purpose of our experiments is to compare the retrieval effectiveness of the four proposed retrieval models only with each other, we think that this does not produce any misleading result; on the contrary, it enables us to compare the models taking into account their best possible performance.

At the bottom of the precision and recall tables we also report two synthetic measures of effectiveness: the average precision, which is the simple average of all the precision values, and the F measure (see [21], p. 174), a combined effectiveness measure that takes into consideration both precision and recall as follows:

    F = 1 / (α (1/P) + (1 − α) (1/R))

where P and R are respectively the precision and recall values, and α is a parameter that enables us to give more importance either to precision (when α → 1) or to recall (when α → 0). We use the value α = 0.5, so that equal importance is given to precision and recall. We evaluated the F value for each pair of precision and recall values, and what is reported in the tables is the maximum value for each model. The percentage increase from one model to another is also reported.
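A direct computation of this measure (ours):

```python
def f_measure(p, r, alpha=0.5):
    """F = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 weighs P and R equally."""
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

# For alpha = 0.5 this is the harmonic mean of precision and recall:
assert abs(f_measure(0.4, 0.6) - 0.48) < 1e-9
```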

[Table 6: Comparison of the performance of RbJP, RbLI, RbCP, and RbGLI using the Cranfield 1400 test document collection. Precision (%) at standard recall levels, plus Avg (%chg) and Max F (%chg).]

7 Comparative evaluation

We performed a comparative evaluation of the retrieval effectiveness of the four models presented above, using the document collections and the experimental settings reported in Sections 4 and 6. The results are reported in Tables 6, 7, and 8, and also in the form of Recall/Precision graphs in Figures 6, 7, and 8. We can observe that:

- any model in which there is a probability transfer (RbLI, RbCP, and RbGLI) performs better than a model in which there is no such transfer (RbJP);

- any model in which the probability transfer is from one term to a set of terms ("one-to-many" transfer, as in RbCP and RbGLI) performs better than models in which either there is no transfer (RbJP) or the transfer is from one term to a single other term ("one-to-one" transfer, as in RbLI);

- a one-to-many transfer that takes into account the similarity between the donor and the receivers (RbGLI) performs better than a one-to-many transfer that takes into account the probability ratio between the receivers (RbCP).

[Table 7: Comparison of the performance of RbJP, RbLI, RbCP, and RbGLI using the CACM test document collection. Precision (%) at standard recall levels, plus Avg (%chg) and Max F (%chg).]

[Table 8: Comparison of the performance of RbJP, RbLI, RbCP, and RbGLI using the NPL test document collection. Precision (%) at standard recall levels, plus Avg (%chg) and Max F (%chg).]

[Figure 6: Precision and recall graphs for the Cranfield test collection.]

These findings are consistent over the three document collections, as the percentage changes in the average precision show. Despite the fact that we are using a simple term weighting schema and that we are experimenting on small test collections, we think we can nonetheless conclude that it is important to study further the probability kinematics of probabilistic IR retrieval models, in order to be able to improve their performance. An interesting result of our study of the kinematics of the four models presented is that we can obtain higher levels of retrieval effectiveness by taking into consideration the similarity between the objects involved in the transfer of probability. We intend to pursue this result to design a new probabilistic model for IR.

8 Conclusions

In this study of probability kinematics in IR, we believe we have shown that, in principle, a probability transfer that takes into account a measure of similarity between the donor and the recipient is more effective in the context of IR than a probability transfer that does not. Most current probabilistic retrieval models are based on a probability kinematics that does not take into account similarity between terms or between documents, unless "ad hoc" weighting schemas, mostly based on clustering, are used.

[Figure 7: Precision and recall graphs for the CACM test collection.]

[Figure 8: Precision and recall graphs for the NPL test collection.]

We would therefore like to suggest a further investigation into more complex and optimised models for probabilistic retrieval, where the probability kinematics follows a non-classical approach. The use of General Imaging is one such approach, but others can be developed using results achieved in other fields, such as Logics and Belief Revision theory. Our current results, summarised in this paper, seem to suggest that an improvement in retrieval effectiveness can be obtained by designing IR systems using probabilistic models that are based upon a different kind of probability kinematics. In our future work we intend to pursue this research direction further by implementing and testing a new probabilistic IR system based upon General Imaging.

Acknowledgements

The authors would like to thank Iain Campbell and Mark Sanderson for the interesting and lively discussions on the topics of this paper. We thank them even if they do not seem to believe that in IR, as in most other fields, theory ought to come before experimentation.


More information

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Information Retrieval and Topic Models Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare

More information

Rough Sets for Uncertainty Reasoning

Rough Sets for Uncertainty Reasoning Rough Sets for Uncertainty Reasoning S.K.M. Wong 1 and C.J. Butz 2 1 Department of Computer Science, University of Regina, Regina, Canada, S4S 0A2, wong@cs.uregina.ca 2 School of Information Technology

More information

TEACHER NOTES FOR ADVANCED MATHEMATICS 1 FOR AS AND A LEVEL

TEACHER NOTES FOR ADVANCED MATHEMATICS 1 FOR AS AND A LEVEL April 2017 TEACHER NOTES FOR ADVANCED MATHEMATICS 1 FOR AS AND A LEVEL This book is designed both as a complete AS Mathematics course, and as the first year of the full A level Mathematics. The content

More information

Computational Tasks and Models

Computational Tasks and Models 1 Computational Tasks and Models Overview: We assume that the reader is familiar with computing devices but may associate the notion of computation with specific incarnations of it. Our first goal is to

More information

An Attempt To Understand Tilling Approach In proving The Littlewood Conjecture

An Attempt To Understand Tilling Approach In proving The Littlewood Conjecture An Attempt To Understand Tilling Approach In proving The Littlewood Conjecture 1 Final Report For Math 899- Dr. Cheung Amira Alkeswani Dr. Cheung I really appreciate your acceptance in adding me to your

More information

Characterization of Semantics for Argument Systems

Characterization of Semantics for Argument Systems Characterization of Semantics for Argument Systems Philippe Besnard and Sylvie Doutre IRIT Université Paul Sabatier 118, route de Narbonne 31062 Toulouse Cedex 4 France besnard, doutre}@irit.fr Abstract

More information

Designing and Evaluating Generic Ontologies

Designing and Evaluating Generic Ontologies Designing and Evaluating Generic Ontologies Michael Grüninger Department of Industrial Engineering University of Toronto gruninger@ie.utoronto.ca August 28, 2007 1 Introduction One of the many uses of

More information

Probabilistic Reasoning. (Mostly using Bayesian Networks)

Probabilistic Reasoning. (Mostly using Bayesian Networks) Probabilistic Reasoning (Mostly using Bayesian Networks) Introduction: Why probabilistic reasoning? The world is not deterministic. (Usually because information is limited.) Ways of coping with uncertainty

More information

A note on fuzzy predicate logic. Petr H jek 1. Academy of Sciences of the Czech Republic

A note on fuzzy predicate logic. Petr H jek 1. Academy of Sciences of the Czech Republic A note on fuzzy predicate logic Petr H jek 1 Institute of Computer Science, Academy of Sciences of the Czech Republic Pod vod renskou v 2, 182 07 Prague. Abstract. Recent development of mathematical fuzzy

More information

Proof techniques (section 2.1)

Proof techniques (section 2.1) CHAPTER 1 Proof techniques (section 2.1) What we have seen so far: 1.1. Theorems and Informal proofs Argument: P 1 P n Q Syntax: how it's written Semantic: meaning in a given interpretation Valid argument:

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 12: Language Models for IR Outline Language models Language Models for IR Discussion What is a language model? We can view a finite state automaton as a deterministic

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Relevant Logic. Daniel Bonevac. March 20, 2013

Relevant Logic. Daniel Bonevac. March 20, 2013 March 20, 2013 The earliest attempts to devise a relevance logic that avoided the problem of explosion centered on the conditional. FDE, however, has no conditional operator, or a very weak one. If we

More information

Exhaustive Classication of Finite Classical Probability Spaces with Regard to the Notion of Causal Up-to-n-closedness

Exhaustive Classication of Finite Classical Probability Spaces with Regard to the Notion of Causal Up-to-n-closedness Exhaustive Classication of Finite Classical Probability Spaces with Regard to the Notion of Causal Up-to-n-closedness Michaª Marczyk, Leszek Wro«ski Jagiellonian University, Kraków 16 June 2009 Abstract

More information

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1 Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency Srihari: CSE 626 1 Text Retrieval Retrieval of text-based information is referred to as Information Retrieval (IR)

More information

Conditional probabilistic reasoning without conditional logic

Conditional probabilistic reasoning without conditional logic Conditional probabilistic reasoning without conditional logic Fabrizio Sebastiani Istituto di Elaborazione dell Informazione Consiglio Nazionale delle Ricerche Via S. Maria, 46-56126 Pisa (Italy) E-mail:

More information

Latent Semantic Analysis. Hongning Wang

Latent Semantic Analysis. Hongning Wang Latent Semantic Analysis Hongning Wang CS@UVa Recap: vector space model Represent both doc and query by concept vectors Each concept defines one dimension K concepts define a high-dimensional space Element

More information

A FRAGMENT OF BOOLE S ALGEBRAIC LOGIC SUITABLE FOR TRADITIONAL SYLLOGISTIC LOGIC

A FRAGMENT OF BOOLE S ALGEBRAIC LOGIC SUITABLE FOR TRADITIONAL SYLLOGISTIC LOGIC A FRAGMENT OF BOOLE S ALGEBRAIC LOGIC SUITABLE FOR TRADITIONAL SYLLOGISTIC LOGIC STANLEY BURRIS 1. Introduction Boole introduced his agebraic approach to logic in 1847 in an 82 page monograph, The Mathematical

More information

A Little Deductive Logic

A Little Deductive Logic A Little Deductive Logic In propositional or sentential deductive logic, we begin by specifying that we will use capital letters (like A, B, C, D, and so on) to stand in for sentences, and we assume that

More information

Introduction: What Is Modal Logic?

Introduction: What Is Modal Logic? Introduction: What Is Modal Logic? Strictly speaking, modal logic studies reasoning that involves the use of the expressions necessarily and possibly. The main idea is to introduce the symbols (necessarily)

More information

Logicality of Operators

Logicality of Operators Logicality of Operators Tomoya Sato Abstract Characterizing logical operators has been crucially important in the philosophy of logic. One reason for this importance is that the boundary between logically

More information

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events Massimo Franceschet Angelo Montanari Dipartimento di Matematica e Informatica, Università di Udine Via delle

More information

1 The Well Ordering Principle, Induction, and Equivalence Relations

1 The Well Ordering Principle, Induction, and Equivalence Relations 1 The Well Ordering Principle, Induction, and Equivalence Relations The set of natural numbers is the set N = f1; 2; 3; : : :g. (Some authors also include the number 0 in the natural numbers, but number

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Modeling Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Retrieval Evaluation, Modern Information Retrieval,

More information

On the Revision of Probabilistic Belief States. Craig Boutilier. Department of Computer Science. University of British Columbia

On the Revision of Probabilistic Belief States. Craig Boutilier. Department of Computer Science. University of British Columbia On the Revision of Probabilistic Belief States Craig Boutilier Department of Computer Science University of British Columbia Vancouver, British Columbia CANADA, V6T 1Z4 Abstract In this paper we describe

More information

National Accelerator Laboratory

National Accelerator Laboratory Fermi National Accelerator Laboratory FERMILAB-Conf-97/149-E DØ QCD Results Using the? Jet-Finding Algorithm in p p Collisions at p s = 1800 GeV D. Lincoln For the D Collaboration University of Michigan

More information

A version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd

A version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd CDMTCS Research Report Series A Version of for which ZFC can not Predict a Single Bit Robert M. Solovay University of California at Berkeley CDMTCS-104 May 1999 Centre for Discrete Mathematics and Theoretical

More information

Perfect matchings in highly cyclically connected regular graphs

Perfect matchings in highly cyclically connected regular graphs Perfect matchings in highly cyclically connected regular graphs arxiv:1709.08891v1 [math.co] 6 Sep 017 Robert Lukot ka Comenius University, Bratislava lukotka@dcs.fmph.uniba.sk Edita Rollová University

More information

Uncertainty and Rules

Uncertainty and Rules Uncertainty and Rules We have already seen that expert systems can operate within the realm of uncertainty. There are several sources of uncertainty in rules: Uncertainty related to individual rules Uncertainty

More information

INTENSIONS MARCUS KRACHT

INTENSIONS MARCUS KRACHT INTENSIONS MARCUS KRACHT 1. The Way Things Are This note accompanies the introduction of Chapter 4 of the lecture notes. I shall provide some formal background and technology. Let a language L be given

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

SEMANTICAL CONSIDERATIONS ON NONMONOTONIC LOGIC. Robert C. Moore Artificial Intelligence Center SRI International, Menlo Park, CA 94025

SEMANTICAL CONSIDERATIONS ON NONMONOTONIC LOGIC. Robert C. Moore Artificial Intelligence Center SRI International, Menlo Park, CA 94025 SEMANTICAL CONSIDERATIONS ON NONMONOTONIC LOGIC Robert C. Moore Artificial Intelligence Center SRI International, Menlo Park, CA 94025 ABSTRACT Commonsense reasoning is "nonmonotonic" in the sense that

More information

Mathematics: applications and interpretation SL

Mathematics: applications and interpretation SL Mathematics: applications and interpretation SL Chapter 1: Approximations and error A Rounding numbers B Approximations C Errors in measurement D Absolute and percentage error The first two sections of

More information

Richard DiSalvo. Dr. Elmer. Mathematical Foundations of Economics. Fall/Spring,

Richard DiSalvo. Dr. Elmer. Mathematical Foundations of Economics. Fall/Spring, The Finite Dimensional Normed Linear Space Theorem Richard DiSalvo Dr. Elmer Mathematical Foundations of Economics Fall/Spring, 20-202 The claim that follows, which I have called the nite-dimensional normed

More information

Graph Theory. Thomas Bloom. February 6, 2015

Graph Theory. Thomas Bloom. February 6, 2015 Graph Theory Thomas Bloom February 6, 2015 1 Lecture 1 Introduction A graph (for the purposes of these lectures) is a finite set of vertices, some of which are connected by a single edge. Most importantly,

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 11: Probabilistic Information Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk

More information

6.867 Machine learning, lecture 23 (Jaakkola)

6.867 Machine learning, lecture 23 (Jaakkola) Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context

More information

Boolean and Vector Space Retrieval Models

Boolean and Vector Space Retrieval Models Boolean and Vector Space Retrieval Models Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) 1

More information

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events Massimo Franceschet Angelo Montanari Dipartimento di Matematica e Informatica, Università di Udine Via delle

More information

PRIME GENERATING LUCAS SEQUENCES

PRIME GENERATING LUCAS SEQUENCES PRIME GENERATING LUCAS SEQUENCES PAUL LIU & RON ESTRIN Science One Program The University of British Columbia Vancouver, Canada April 011 1 PRIME GENERATING LUCAS SEQUENCES Abstract. The distribution of

More information

CHAPTER 1 INTRODUCTION TO BRT

CHAPTER 1 INTRODUCTION TO BRT CHAPTER 1 INTRODUCTION TO BRT 1.1. General Formulation. 1.2. Some BRT Settings. 1.3. Complementation Theorems. 1.4. Thin Set Theorems. 1.1. General Formulation. Before presenting the precise formulation

More information

An Alternative To The Iteration Operator Of. Propositional Dynamic Logic. Marcos Alexandre Castilho 1. IRIT - Universite Paul Sabatier and

An Alternative To The Iteration Operator Of. Propositional Dynamic Logic. Marcos Alexandre Castilho 1. IRIT - Universite Paul Sabatier and An Alternative To The Iteration Operator Of Propositional Dynamic Logic Marcos Alexandre Castilho 1 IRIT - Universite Paul abatier and UFPR - Universidade Federal do Parana (Brazil) Andreas Herzig IRIT

More information

cse541 LOGIC FOR COMPUTER SCIENCE

cse541 LOGIC FOR COMPUTER SCIENCE cse541 LOGIC FOR COMPUTER SCIENCE Professor Anita Wasilewska Spring 2015 LECTURE 2 Chapter 2 Introduction to Classical Propositional Logic PART 1: Classical Propositional Model Assumptions PART 2: Syntax

More information

Towards Collaborative Information Retrieval

Towards Collaborative Information Retrieval Towards Collaborative Information Retrieval Markus Junker, Armin Hust, and Stefan Klink German Research Center for Artificial Intelligence (DFKI GmbH), P.O. Box 28, 6768 Kaiserslautern, Germany {markus.junker,

More information

Genuine atomic multicast in asynchronous distributed systems

Genuine atomic multicast in asynchronous distributed systems Theoretical Computer Science 254 (2001) 297 316 www.elsevier.com/locate/tcs Genuine atomic multicast in asynchronous distributed systems Rachid Guerraoui, Andre Schiper Departement d Informatique, Ecole

More information

ONLINE APPENDICES FOR INCENTIVES IN EXPERIMENTS: A THEORETICAL INVESTIGATION BY AZRIELI, CHAMBERS & HEALY

ONLINE APPENDICES FOR INCENTIVES IN EXPERIMENTS: A THEORETICAL INVESTIGATION BY AZRIELI, CHAMBERS & HEALY ONLINE APPENDICES FOR INCENTIVES IN EXPERIMENTS: A THEORETICAL INVESTIGATION BY AZRIELI, CHAMBERS & HEALY Appendix B. Modeling Games as Decisions In this appendix we describe how one can move seamlessly

More information

Lecture 14 - P v.s. NP 1

Lecture 14 - P v.s. NP 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) February 27, 2018 Lecture 14 - P v.s. NP 1 In this lecture we start Unit 3 on NP-hardness and approximation

More information

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures

More information

Boxlets: a Fast Convolution Algorithm for. Signal Processing and Neural Networks. Patrice Y. Simard, Leon Bottou, Patrick Haner and Yann LeCun

Boxlets: a Fast Convolution Algorithm for. Signal Processing and Neural Networks. Patrice Y. Simard, Leon Bottou, Patrick Haner and Yann LeCun Boxlets: a Fast Convolution Algorithm for Signal Processing and Neural Networks Patrice Y. Simard, Leon Bottou, Patrick Haner and Yann LeCun AT&T Labs-Research 100 Schultz Drive, Red Bank, NJ 07701-7033

More information

Expressive Power, Mood, and Actuality

Expressive Power, Mood, and Actuality Expressive Power, Mood, and Actuality Rohan French Abstract In Wehmeier (2004) we are presented with the subjunctive modal language, a way of dealing with the expressive inadequacy of modal logic by marking

More information

THE ROLE OF COMPUTER BASED TECHNOLOGY IN DEVELOPING UNDERSTANDING OF THE CONCEPT OF SAMPLING DISTRIBUTION

THE ROLE OF COMPUTER BASED TECHNOLOGY IN DEVELOPING UNDERSTANDING OF THE CONCEPT OF SAMPLING DISTRIBUTION THE ROLE OF COMPUTER BASED TECHNOLOGY IN DEVELOPING UNDERSTANDING OF THE CONCEPT OF SAMPLING DISTRIBUTION Kay Lipson Swinburne University of Technology Australia Traditionally, the concept of sampling

More information

DR.RUPNATHJI( DR.RUPAK NATH )

DR.RUPNATHJI( DR.RUPAK NATH ) Contents 1 Sets 1 2 The Real Numbers 9 3 Sequences 29 4 Series 59 5 Functions 81 6 Power Series 105 7 The elementary functions 111 Chapter 1 Sets It is very convenient to introduce some notation and terminology

More information

Introduction to Metalogic

Introduction to Metalogic Introduction to Metalogic Hans Halvorson September 21, 2016 Logical grammar Definition. A propositional signature Σ is a collection of items, which we call propositional constants. Sometimes these propositional

More information

FACTORIZATION AND THE PRIMES

FACTORIZATION AND THE PRIMES I FACTORIZATION AND THE PRIMES 1. The laws of arithmetic The object of the higher arithmetic is to discover and to establish general propositions concerning the natural numbers 1, 2, 3,... of ordinary

More information

rules strictly includes the class of extended disjunctive programs. Interestingly, once not appears positively as above, the principle of minimality d

rules strictly includes the class of extended disjunctive programs. Interestingly, once not appears positively as above, the principle of minimality d On Positive Occurrences of Negation as Failure 3 Katsumi Inoue Department of Information and Computer Sciences Toyohashi University of Technology Tempaku-cho, Toyohashi 441, Japan inoue@tutics.tut.ac.jp

More information

Precis of Aristotle s Modal Syllogistic

Precis of Aristotle s Modal Syllogistic Philosophy and Phenomenological Research Philosophy and Phenomenological Research Vol. XC No. 3, May 2015 doi: 10.1111/phpr.12185 2015 Philosophy and Phenomenological Research, LLC Precis of Aristotle

More information

Module 7 D and Equivalent Systems

Module 7 D and Equivalent Systems Module 7 D and Equivalent Systems G J Mattey May 2, 2007 Contents 1 he Semantical System DI 1 2 he Derivational System DD 4 3 he Axiom System D 6 4 Applications of the D-Systems 6 In this and subsequent

More information

2 Exercises 1. The following represent graphs of functions from the real numbers R to R. Decide which are one-to-one, which are onto, which are neithe

2 Exercises 1. The following represent graphs of functions from the real numbers R to R. Decide which are one-to-one, which are onto, which are neithe Infinity and Counting 1 Peter Trapa September 28, 2005 There are 10 kinds of people in the world: those who understand binary, and those who don't. Welcome to the rst installment of the 2005 Utah Math

More information

A Dynamic Logic of Iterated Belief Change Bernd van Linder Utrecht University Department of Computer Science P.O. Box TB Utrecht The Nethe

A Dynamic Logic of Iterated Belief Change Bernd van Linder Utrecht University Department of Computer Science P.O. Box TB Utrecht The Nethe A Dynamic Logic of Iterated Belief Change Bernd van Linder Utrecht University Department of Computer Science P.O. Box 80.089 3508 TB Utrecht The Netherlands Email: bernd@cs.ruu.nl Abstract In this paper

More information

A Structuralist Account of Logic

A Structuralist Account of Logic Croatian Journal of Philosophy Vol. VIII, No. 23, 2008 Majda Trobok, Department of Philosophy University of Rijeka A Structuralist Account of Logic The lynch-pin of the structuralist account of logic endorsed

More information

Neale and the slingshot Fabrice Correia

Neale and the slingshot Fabrice Correia a Neale and the slingshot Fabrice Correia 'Slingshot arguments' is a label for a class of arguments which includes Church's argument to the effect that if sentences designate propositions, then there are

More information

Predicates, Quantifiers and Nested Quantifiers

Predicates, Quantifiers and Nested Quantifiers Predicates, Quantifiers and Nested Quantifiers Predicates Recall the example of a non-proposition in our first presentation: 2x=1. Let us call this expression P(x). P(x) is not a proposition because x

More information

Bayesian vs frequentist techniques for the analysis of binary outcome data

Bayesian vs frequentist techniques for the analysis of binary outcome data 1 Bayesian vs frequentist techniques for the analysis of binary outcome data By M. Stapleton Abstract We compare Bayesian and frequentist techniques for analysing binary outcome data. Such data are commonly

More information

Price Competition and Endogenous Valuation in Search Advertising

Price Competition and Endogenous Valuation in Search Advertising Price Competition and Endogenous Valuation in Search Advertising Lizhen Xu Jianqing Chen Andrew Whinston Web Appendix A Heterogeneous Consumer Valuation In the baseline model, we assumed that consumers

More information

Comparing Measures of Central Tendency *

Comparing Measures of Central Tendency * OpenStax-CNX module: m11011 1 Comparing Measures of Central Tendency * David Lane This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 1.0 1 Comparing Measures

More information

Propositional Logic Truth-functionality Definitions Soundness Completeness Inferences. Modal Logic. Daniel Bonevac.

Propositional Logic Truth-functionality Definitions Soundness Completeness Inferences. Modal Logic. Daniel Bonevac. January 22, 2013 Modal logic is, among other things, the logic of possibility and necessity. Its history goes back at least to Aristotle s discussion of modal syllogisms in the Prior Analytics. But modern

More information

The WENO Method for Non-Equidistant Meshes

The WENO Method for Non-Equidistant Meshes The WENO Method for Non-Equidistant Meshes Philip Rupp September 11, 01, Berlin Contents 1 Introduction 1.1 Settings and Conditions...................... The WENO Schemes 4.1 The Interpolation Problem.....................

More information

Pell's Equation. Luke Kanczes

Pell's Equation. Luke Kanczes Pell's Equation Luke Kanczes Introduction. Pell's Equation Pells equation is any Diophantine equation which takes the form [] x Dy = () for positive integers x and y, where D is a xed positive integer

More information