Lecture 9: Probabilistic IR The Binary Independence Model and Okapi BM25

Lecture 9: Probabilistic IR The Binary Independence Model and Okapi BM25 Trevor Cohn (Slide credits: William Webber) COMP90042, 2015, Semester 1

What we ll learn in this lecture Probabilistic models for IR As a model of relevance, P(R d, q) Binary independence model Okapi BM25

Probabilistic vs. geometric models Fundamental calculations: Geometric How similar sim(d, q) is document d to query q? Probabilistic What is the probability P(R = 1 d, q) that d is relevant to q? Different starting points, but leading to similar formulations.

Probabilistic models Probabilistic models: Clearer theoretical basis that geometric Particularly when considering extensions & modifications Probabilistic models initially not very successful, but now widespread We look at classical probabilistic models up to BM25 Next lecture, language models

(Discrete) Probability Recap Random Variable variable that can take on a value, e.g., A {0, 1} Probability distribution likelihood of taking on a value, e.g., P(A = 1), P(A = 0) Shorthand: P(A) P(A = 1), P(Ā) P(A = 0). Joint probability distribution what is the likelihood of several RVs each taking on specific values. E.g., P(A, B). Conditional probability distribution given some side information, what is the likelihood of the RV taking on certain values. E.g., P(A B), P(A B), P(Ā B), P(Ā B).

Representing Probability Distributions as vectors and matrices Volcanic eruption (V) Cloudy given volcanic eruption, P(C V ) Cloudy and volcanic eruption, P(C, V ) V P(C) 0 0.001 1 0.999 C V = 0 V = 1 0 0.7 0.95 1 0.3 0.05 C V = 0 V = 1 0 0.699 0.00005 1 0.300 0.00095

Basic Rules of Probability Conditioning P(A B) = Chain rule P(A, B) P(B) = P(A) P(B) P(B) if A and B are independent P(A, B, C) = P(A)P(B A)P(C A, B) = P(C)P(B C)P(A B, C) =... Marginalisation property P(A) = b P(A, B = b) = P(A, B) + P(A, B)

Bayes Rule The conditioning forumlation and chain rule imply: P(A B) = = P(A, B) = P(B A)P(A) P(B) P(B) P(B A) P(A) X {A,Ā} P(B X )P(X ) Commonly referred to as Bayes rule or Bayes theorem. Volcano Example If we know it s cloudy, was this caused by a volcano? I.e., what is P(V C)? 0.95 0.001 P(V C) = 0.3 0.999 + 0.95 0.001 = 0.0032 Now 3 more likely than before!

Bayes theorem P(A B) = P(B A) P(B) P(A) P(A) is the prior probability (distribution) of A We then observe B P(B A)/P(B) is support that B provides for A (known as the likelihood and the evidence, resp.) P(B A) P(B) = 1 if A and B are independent P(A B) is posterior probability of A

Bayes theorem for relevance P(R d, q) = P(d R, q) P(d q) P(R q) (1) P(R q) can be understood as proportion of documents in collection that are relevant to query P(d R, q) is probability that a (retrieved) document relevant to q looks like d P(d q) = P(d R, q) P(R q) + P(d R, q) P( R q) is probability of observing (retrieved) document, regardless of relevance Effectively a normaliser such that (1) is valid, P(R d, q) + P( R d, q) = 1. OK, but how do we go about estimating these values?

Rank-equivalence given query Probability Ranking Principle (PRP) Assume output is ranking Assume that relevance of each document is independent Then optimal ranking is by decreasing probability of relevance For ranking, we only care about Relative probability for given query This allows various simplifications to Equation 1 Provided they are monotonic, i.e., for transformation f (), P(A) > P(B) f (P(A)) > f (P(B)) e.g., f = log, ± constant, a positive constant,...

Odds-based matching score Take odds ratio between relevance and irrelevance: O(R d, q) = P(R d, q) P( R d, q) = P(R q)p(d R,q) P(d q) P( R q)p(d R,q) P(d q) = P(R q) P(d R, q) P( R q) P(d R, q) P(d R, q) = O(R q) P(d R, q) where O(R q) captures the first term which does not involve d (and can be ignored in ranking). We have removed 2 of the 3 terms from Equation 1.

Binary independence model How to estimate P(d R, q) and P(d R, q)? Must be based on attributes of d and q Binary indendence model Binary Doc attributes are presence of terms (not frequency) Independence Term appearances independent given relevance Represent: Document as binary vector d Query as binary vector q over words.

BIM odds ratio Under BIM, odds ratio resolves to: O(R d, q) T t=1 = t:d t=1 = t:d t=1 P(d t R, q) P(d t R, q) P(d t R, q) P(d t R, q) p t u t t:d t=0 t:d t=0 1 p t 1 u t where p t = P(d t R, q) and u t = P(d t R, q). Note similarity to Naive Bayes. P( d t R, q) P( d t R, q)

Query terms only Assume non-query terms equally likely to occur in relevant as irrelevant documents prescence or abscence of these terms irrelevant many factors cancel, leaving O(R d, q) t:q t=d t=1 = t:d t=q t=1 t:d t=q t=1 p t u t t:q t=1 d t=0 p t (1 u t ) u t (1 p t ) p t (1 u t ) u t (1 p t ) t:q t=1 1 p t 1 u t where the last step drops the constant scaling factor. (1 p t ) (1 u t )

Retreival status value The retreival status value (RSV d ) is defined as the log transformation of the odds-ratio 1 RSV d = log = t:d t=q t=1 t:d t=q t=1 p t (1 u t ) u t (1 p t ) log p t(1 u t ) u t (1 p t ) The RSV can now be used for ranking. Note the similarity with geometric vector space models, here the weight of term t is: w t = log + log 1 u t 1 p t u t as used in RSV d = 1 up to a scaling constant p t t:d t=q t=1 w t

Assessment-time estimation Term weights w t still depends upon unknown p t = P(d t R, q) and u t = P(d t R, q). If we have relevance judgements, p t and u t can be estimated as ˆp t = 1 R û t = 1 R d R d t d R d t But generally R is unknown at retreival time. How to estimate?

Retrieval-time estimation: u t Probablity of term occurrence in non-relevant document Assume relevant documents rare If all documents in the collection are not relevant then Overall, this gives u t = f t N log 1 u t u t log N f t f t log N f t Look familiar?

Retrieval-time estimation: p t Setting log p t /(1 p t ) to a constant E.g., p t = 0.5 removes p t /(1 p t ) term entirely Relevance score of doc is just sum of IDFs Plausible for binary model Or based on analysis of empirical data, e.g., Greiff, A theory of term weighting, SIGIR, 1998

Summary: Binary independence model Document attributes in BIM are term occurrences {0, 1} Models p t = P(d t = 1 R, q), the chance of a relavent document containing t and u t = P(d t = 1 R, q), the chance of an irrelevant document containing t Weight w t of query term t occurring in document d is then: w t = log If p t = 0.5, this approximates IDF. p t 1 p t + log 1 u t u t

Okapi BM25 Inspired by the BIM probabilistic formulation Why not try other alteratives for term weighting? Can we capture various aspects in a simple formula? idf tf document length query tf Then seek to tune each component... Results in effective and widely used models, Okapi BM25

Okapi BM25 [ w t = log N f ] t + 0.5 f t + 0.5 (idf) (k 1 + 1)f d,t ) k 1 ((1 b) + b L d L ave + f d,t (tf and doc. length) (k 3 + 1) f q,t k 3 + f q,t (query tf) Parameters k 1, b, and k 3 need to be tuned (k 3 only for very long queries). k 1 = 1.5 and b = 0.75 common defaults. BM25 highly effective, most widely used weighting in IR Robertson, Walker et al. (1994, 1998).

What have we achieved? Pros Started from plausible probabilistic model of term distribution Shown how it can be made to fit something like TF*IDF Providing a probabilistic justification TF*IDF-like approaches Cons Directly trying to estimate P(f dt R) not practicable in retrieval (too many parameters, not enough evidence) Such approaches end up as ad-hoc as geometric model Progress requires letting query tell us what relevance looks like This the approach of language models

Looking back and forward Back Probabilistic IR models estimate P(R d, q) (or monotonic function thereof) Probability derived from attributes (term occurrences) of documents BIM naive Bayes assumptions Binary attributes (term occurs or doesn t) Term occurrences independent Provides probabilistic justification for IDF BM25 builds on this using heuristic weighting formula to account for term frequency, document length, query term frequency etc

Looking back and forward Forward Language models an alternative probabilistic IR framework

Further reading Chapter 11, Probabilistic information retrieval of Manning, Raghavan, and Schutze, Introduction to Information Retrieval. http://nlp.stanford.edu/ir-book/pdf/11prob.pdf Sparck Jones, Walker, and Robertson, A Probabilistic MOdel of Information Retrieval, IPM, 2000.