Chapter 4: Advanced IR Models

1 Chapter 4: Advanced IR Models
4.1 Probabilistic IR
4.1.1 Principles
4.1.2 Probabilistic IR with Term Independence
4.1.3 Probabilistic IR with 2-Poisson Model (Okapi BM25)
IRDM WS

2 4.1.1 Probabilistic Retrieval: Principles (Robertson and Sparck Jones 1976)

Goal: Ranking based on
\[ sim(d, q) = P[R \mid d] = P[\text{doc } d \text{ is relevant for query } q \mid d \text{ has term vector } X_1, \ldots, X_m] \]

Assumptions:
- Relevant and irrelevant documents differ in their terms.
- Binary Independence Retrieval (BIR) Model:
  - Probabilities for term occurrence are pairwise independent for different terms.
  - Term weights are binary, in {0,1}.
- For terms that do not occur in query q, the probabilities for such a term occurring are the same for relevant and irrelevant documents.

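3 4.1.2 Probabilistic IR with Term Independence: Ranking Proportional to Relevance Odds

\begin{align*}
sim(d,q) &= O(R \mid d) = \frac{P[R \mid d]}{P[\neg R \mid d]} && \text{(odds for relevance)} \\
&= \frac{P[d \mid R]\, P[R]}{P[d \mid \neg R]\, P[\neg R]} && \text{(Bayes' theorem)} \\
&\sim \frac{P[d \mid R]}{P[d \mid \neg R]} = \prod_{i} \frac{P[X_i \mid R]}{P[X_i \mid \neg R]} && \text{(independence or linked dependence)}
\end{align*}

\[ sim(d,q)' = \sum_{i \in q} \log \frac{P[X_i \mid R]}{P[X_i \mid \neg R]} = \sum_{i \in q} \log P[X_i \mid R] - \sum_{i \in q} \log P[X_i \mid \neg R] \]

($X_i = 1$ if d includes the i-th term, 0 otherwise)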

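4 Probabilistic Retrieval: Ranking Proportional to Relevance Odds (cont.)

With binary features and estimators $p_i = P[X_i = 1 \mid R]$ and $q_i = P[X_i = 1 \mid \neg R]$:

\begin{align*}
sim(d,q)' &= \sum_{i \in q} \log\left(p_i^{X_i} (1-p_i)^{1-X_i}\right) - \sum_{i \in q} \log\left(q_i^{X_i} (1-q_i)^{1-X_i}\right) \\
&= \sum_{i \in q} X_i \log\frac{p_i}{1-p_i} + \sum_{i \in q} \log(1-p_i) - \sum_{i \in q} X_i \log\frac{q_i}{1-q_i} - \sum_{i \in q} \log(1-q_i) \\
&= \sum_{i \in q} X_i \log\frac{p_i}{1-p_i} + \sum_{i \in q} X_i \log\frac{1-q_i}{q_i} + \sum_{i \in q} \log\frac{1-p_i}{1-q_i} \\
&\sim \sum_{i \in q} X_i \log\frac{p_i}{1-p_i} + \sum_{i \in q} X_i \log\frac{1-q_i}{q_i} =: sim(d,q)''
\end{align*}

The last sum in the third line does not depend on d, so it can be dropped without changing the ranking.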

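5 Probabilistic Retrieval: Robertson / Sparck Jones Formula

Estimate $p_i$ and $q_i$ based on a training sample (query q on a small sample of the corpus) or based on intellectual assessment of the first round's results (relevance feedback).

Let N be the number of docs in the sample, R the number of relevant docs in the sample, $n_i$ the number of docs in the sample that contain term i, and $r_i$ the number of relevant docs in the sample that contain term i.

Estimate:
\[ p_i = \frac{r_i}{R}, \qquad q_i = \frac{n_i - r_i}{N - R} \]
or, with Lidstone smoothing ($\lambda = 0.5$):
\[ p_i = \frac{r_i + 0.5}{R + 1}, \qquad q_i = \frac{n_i - r_i + 0.5}{N - R + 1} \]

This gives
\[ sim(d,q)'' = \sum_{i \in q} X_i \log\frac{r_i + 0.5}{R - r_i + 0.5} + \sum_{i \in q} X_i \log\frac{N - n_i - R + r_i + 0.5}{n_i - r_i + 0.5} \]

Weight of term i in doc d:
\[ \log\frac{(r_i + 0.5)\,(N - n_i - R + r_i + 0.5)}{(R - r_i + 0.5)\,(n_i - r_i + 0.5)} \]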
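The term weight above can be computed directly from the four counts. The following is a minimal sketch, not a reference implementation: the function name `rsj_weight` and the `lam` parameter are hypothetical, and the consistency check on the counts is an added assumption.

```python
import math


def rsj_weight(r_i: int, n_i: int, R: int, N: int, lam: float = 0.5) -> float:
    """Robertson / Sparck Jones weight of term i with Lidstone smoothing.

    r_i: relevant docs containing term i, n_i: docs containing term i,
    R: relevant docs in the sample, N: docs in the sample.
    """
    # Assumed sanity check: the counts must describe a valid 2x2 table.
    if not (0 <= r_i <= R <= N and r_i <= n_i <= N and n_i - r_i <= N - R):
        raise ValueError("inconsistent counts")
    p_i = (r_i + lam) / (R + 2 * lam)            # P[X_i = 1 | relevant]
    q_i = (n_i - r_i + lam) / (N - R + 2 * lam)  # P[X_i = 1 | irrelevant]
    return math.log(p_i / (1 - p_i)) + math.log((1 - q_i) / q_i)
```

With `lam = 0.5` the denominators become R + 1 and N - R + 1, matching the smoothed estimates, so the return value equals the closed-form weight of term i.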

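6 Probabilistic Retrieval: tf*idf Formula

Assumptions (without training sample or relevance feedback):
- $p_i$ is the same for all i.
- Most documents are irrelevant.
- Each individual term i is infrequent.

This implies:
\[ \sum_{i} X_i \log\frac{p_i}{1-p_i} = c \sum_{i} X_i \quad \text{with constant } c \]
\[ q_i = P[X_i = 1 \mid \neg R] \approx \frac{df_i}{N} \]
\[ \frac{1-q_i}{q_i} = \frac{N - df_i}{df_i} \approx \frac{N}{df_i} \]

so that
\[ sim(d,q)'' = \sum_{i} X_i \log\frac{p_i}{1-p_i} + \sum_{i} X_i \log\frac{1-q_i}{q_i} \approx c \sum_{i} X_i + \sum_{i} X_i \log idf_i \quad \text{with } idf_i = \frac{N}{df_i} \]

This is a scalar product over the product of tf and dampened idf values for query terms.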
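Under these assumptions the score needs only document frequencies. A minimal sketch, assuming a dictionary `df` of document frequencies and binary term incidence; the function name `idf_score` and its parameters are hypothetical:

```python
import math


def idf_score(doc_terms: set, query_terms: list, df: dict, N: int, c: float = 0.0) -> float:
    """Approximate sim(d, q)'' as c * sum(X_i) + sum(X_i * log(N / df_i))."""
    score = 0.0
    for term in set(query_terms):
        # X_i = 1 only if the query term occurs in the document.
        if term in doc_terms and df.get(term, 0) > 0:
            score += c + math.log(N / df[term])
    return score
```

Rare terms (small `df`) contribute more than frequent ones, which is the idf effect the approximation is meant to capture.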

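7 Example for Probabilistic Retrieval

Documents with relevance feedback (R = 2, N = 4). The term incidence entries of d1 to d4 are lost; $n_i$ and $r_i$ are the values implied by $p_i$ and $q_i$ under Lidstone smoothing with $\lambda = 0.5$:

        t1    t2    t3    t4    t5    t6
n_i     2     1     2     3     2     0
r_i     2     1     1     2     1     0
p_i     5/6   1/2   1/2   5/6   1/2   1/6
q_i     1/6   1/6   1/2   1/2   1/2   1/6

Query q: t1 t2 t3 t4 t5 t6

Score of new document d5 (with Lidstone smoothing with $\lambda = 0.5$), where d5 ∩ q = <1 1 0 0 0 1>:
\[ sim(d_5, q) = \log 5 + \log 1 + \log 0.2 + \log 5 + \log 5 + \log 5 \]
using
\[ sim(d,q)'' = \sum_{i} X_i \log\frac{p_i}{1-p_i} + \sum_{i} X_i \log\frac{1-q_i}{q_i} \]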
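The example score can be checked numerically. The sketch below takes $p_i$ and $q_i$ from the example; the incidence vector for d5 is an assumption, inferred from the score terms (only t1, t2 and t6 have $q_i = 1/6$, which is what yields the three $\log 5$ contributions). The name `bir_score` is hypothetical, and natural logarithms are assumed since the slide does not state a base.

```python
import math

# Smoothed estimates for terms t1..t6, as listed in the example.
p = [5 / 6, 1 / 2, 1 / 2, 5 / 6, 1 / 2, 1 / 6]
q = [1 / 6, 1 / 6, 1 / 2, 1 / 2, 1 / 2, 1 / 6]

# Assumption: d5 contains exactly t1, t2 and t6 (inferred from the score terms).
x_d5 = [1, 1, 0, 0, 0, 1]


def bir_score(x, p, q):
    """sim(d, q)'' = sum_i X_i * (log(p_i / (1 - p_i)) + log((1 - q_i) / q_i))."""
    return sum(
        x_i * (math.log(p_i / (1 - p_i)) + math.log((1 - q_i) / q_i))
        for x_i, p_i, q_i in zip(x, p, q)
    )


score_d5 = bir_score(x_d5, p, q)  # equals log 125 = 3 * log 5
```

The six logarithms sum to $\log(5 \cdot 1 \cdot 0.2 \cdot 5 \cdot 5 \cdot 5) = \log 125$, roughly 4.83 with natural logarithms.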

8 Laplace Smoothing (with Uniform Prior)

Probabilities $p_i$ and $q_i$ for term i are estimated by MLE for a binomial distribution (repeated coin tosses for relevant docs, showing term i with probability $p_i$; repeated coin tosses for irrelevant docs, showing term i with probability $q_i$).

To avoid overfitting to feedback/training, the estimates should be smoothed (e.g. with a uniform prior). Instead of estimating $p_i = k/n$, estimate
- (Laplace's law of succession): $p_i = (k+1) / (n+2)$
- or, with heuristic generalization (Lidstone's law of succession): $p_i = (k+\lambda) / (n+2\lambda)$ with $\lambda > 0$ (e.g. $\lambda = 0.5$).

And for a multinomial distribution (n times w-faceted dice), estimate:
\[ p_i = \frac{k_i + 1}{n + w} \]
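The three estimators differ only in the pseudo-count and the number of outcomes, so one function can cover them. A minimal sketch; the function name `lidstone` and the parameter names `lam` and `w` are hypothetical:

```python
def lidstone(k: int, n: int, lam: float = 0.5, w: int = 2) -> float:
    """Smoothed estimate (k + lam) / (n + w * lam).

    k: observed successes, n: trials, lam: pseudo-count per outcome,
    w: number of possible outcomes (2 for the binomial case).
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if n < 0 or not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    return (k + lam) / (n + w * lam)
```

Calling `lidstone(k, n, lam=1)` gives Laplace's law of succession, and `lidstone(k_i, n, lam=1, w=w)` gives the multinomial estimate. Unlike the plain MLE k/n, the result is defined even for n = 0 and never reaches exactly 0 or 1.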

9 BM25: Motivations
- Estimates for probabilistic term weights based on assumptions on the tf
- Estimates about the relevance of a term based on the notion of eliteness of terms
- Assumptions about the relationships between eliteness and document length

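10 Okapi BM25

Approximation of the Poisson model by a similarly-shaped function:
\[ w := \frac{tf}{k_1 + tf} \cdot \log\frac{p(1-q)}{q(1-p)} \]

finally leads to Okapi BM25 (which achieved best TREC results):
\[ w_j(d) := \frac{(k_1 + 1)\, tf_j}{k_1 \left( (1-b) + b\, \frac{length(d)}{avgdoclength} \right) + tf_j} \cdot \log\frac{N - df_j + 0.5}{df_j + 0.5} \]

or in the most comprehensive, tunable form:
\[ score(d,q) := \sum_{j=1}^{|q|} \log\frac{N - df_j + 0.5}{df_j + 0.5} \cdot \frac{(k_1 + 1)\, tf_j}{k_1 \left( (1-b) + b\, \frac{len(d)}{\Delta} \right) + tf_j} \cdot \frac{(k_3 + 1)\, qtf_j}{k_3 + qtf_j} \; + \; k_2 \cdot |q| \cdot \frac{\Delta - len(d)}{\Delta + len(d)} \]

with $\Delta$ = avgdoclength, $tf_j$ and $qtf_j$ the frequencies of term j in d and in q, tuning parameters $k_1$, $k_2$, $k_3$, b, a non-linear influence of tf, and consideration of doc length.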
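A minimal sketch of the simpler BM25 form, summing the per-term weight over the query terms. It omits the query-term-frequency factor and the $k_2$ length correction of the full form. The function name `bm25_score`, the tokenized-list inputs, and the defaults `k1=1.2` and `b=0.75` are assumptions for illustration (they are commonly used values, not taken from the slides).

```python
import math
from collections import Counter


def bm25_score(query, doc, df, N, avgdl, k1=1.2, b=0.75):
    """Sum of BM25 term weights for one document.

    query, doc: lists of tokens; df: term -> document frequency;
    N: number of documents; avgdl: average document length in tokens.
    """
    tf = Counter(doc)
    # Length normalization: long documents need more occurrences for the same weight.
    norm = k1 * ((1 - b) + b * len(doc) / avgdl)
    score = 0.0
    for term in set(query):
        f = tf[term]
        if f == 0 or term not in df:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        # The tf component saturates: it approaches (k1 + 1) as f grows.
        score += idf * (k1 + 1) * f / (norm + f)
    return score
```

For example, `bm25_score(["okapi"], ["okapi", "bm25", "okapi"], {"okapi": 2}, N=10, avgdl=3.0)` weights a term that occurs twice in an average-length document. Note that the idf factor turns negative for terms occurring in more than half of the documents, which some implementations guard against.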

11 Eliteness in BM25

17 Poisson Mixtures for Capturing tf Distribution
Katz's K-mixture: distribution of tf values for term "said"
Source: Church/Gale 1995

18 Averaging Eliteness according to document length info

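22 4.1.3 Probabilistic IR with Poisson Model (Okapi BM25)

Generalize the term weight
\[ w = \log\frac{p(1-q)}{q(1-p)} \]
into
\[ w = \log\frac{p_{tf}\, q_0}{q_{tf}\, p_0} \]
with $p_j$, $q_j$ denoting the probability that the term occurs j times in a relevant / irrelevant doc.

Postulate Poisson (or Poisson-mixture) distributions:
\[ p_{tf} = e^{-\lambda} \frac{\lambda^{tf}}{tf!}, \qquad q_{tf} = e^{-\mu} \frac{\mu^{tf}}{tf!} \]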
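As a worked step, consider the simplest case of a single Poisson distribution per class, writing $\lambda$ and $\mu$ for the Poisson means of the term frequency in relevant and irrelevant documents (symbols assumed here for illustration). Substituting into the generalized weight, the exponentials and factorials cancel:

```latex
\[
w = \log\frac{p_{tf}\, q_0}{q_{tf}\, p_0}
  = \log\frac{e^{-\lambda}\,\frac{\lambda^{tf}}{tf!}\; e^{-\mu}}
             {e^{-\mu}\,\frac{\mu^{tf}}{tf!}\; e^{-\lambda}}
  = \log\frac{\lambda^{tf}}{\mu^{tf}}
  = tf \cdot \log\frac{\lambda}{\mu}
\]
```

In this special case the weight grows linearly in tf. The Poisson-mixture variants and the BM25 approximation replace this with a weight that saturates as tf increases.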

23 Additional Literature: Probabilistic IR
- Grossman/Frieder, Sections 2.2 and 2.4
- S.E. Robertson, K. Sparck Jones: Relevance Weighting of Search Terms, JASIS 27(3), 1976
- S.E. Robertson, S. Walker: Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval, SIGIR 1994
- K.W. Church, W.A. Gale: Poisson Mixtures, Natural Language Engineering 1(2), 1995
- C.T. Yu, W. Meng: Principles of Database Query Processing for Advanced Applications, Morgan Kaufmann, 1997, Chapter 9
- D. Heckerman: A Tutorial on Learning with Bayesian Networks, Technical Report MSR-TR-95-06, Microsoft Research, 1995
