Chapter 4: Advanced IR Models

4.1 Probabilistic IR
  4.1.1 Principles
  4.1.2 Probabilistic IR with Term Independence
  4.1.3 Probabilistic IR with 2-Poisson Model (Okapi BM25)

IDM WS 2005 4-1
4.1.1 Probabilistic Retrieval: Principles
(Robertson and Sparck Jones 1976)

Goal: ranking based on

  sim(doc d, query q) = P[R \mid d] = probability that doc d is relevant for query q,

where d has term vector X_1, ..., X_m.

Assumptions:
- Relevant and irrelevant documents differ in their terms.
- Binary Independence Retrieval (BIR) Model:
  - Probabilities for term occurrence are pairwise independent for different terms.
  - Term weights are binary {0,1}.
- For terms that do not occur in query q, the probabilities for such a term
  occurring are the same for relevant and irrelevant documents.
4.1.2 Probabilistic IR with Term Independence:
Ranking Proportional to Relevance Odds

sim(d, q) := O(R \mid d) = \frac{P[R \mid d]}{P[\bar{R} \mid d]}
  = \frac{P[d \mid R] \, P[R]}{P[d \mid \bar{R}] \, P[\bar{R}]}    (Bayes' theorem; odds for relevance)

  \sim \frac{P[d \mid R]}{P[d \mid \bar{R}]}
  = \prod_i \frac{P[X_i \mid R]}{P[X_i \mid \bar{R}]}    (independence or linked dependence)

sim(d, q) = \sum_{i \in q} \left( \log P[X_i \mid R] - \log P[X_i \mid \bar{R}] \right)

(X_i = 1 if d includes the i-th term, 0 otherwise)
Probabilistic Retrieval: Ranking Proportional to Relevance Odds (cont.)

With binary features and the estimators p_i := P[X_i = 1 \mid R] and q_i := P[X_i = 1 \mid \bar{R}]:

sim(d, q) \sim \sum_{i \in q \cap d} \log \frac{p_i}{q_i} \;+\; \sum_{i \in q \setminus d} \log \frac{1 - p_i}{1 - q_i}

  = \sum_{i \in q \cap d} \left( \log \frac{p_i}{1 - p_i} + \log \frac{1 - q_i}{q_i} \right) \;+\; \sum_{i \in q} \log \frac{1 - p_i}{1 - q_i}

The last sum is constant for a given query, so it suffices to rank by

sim(d, q)'' := \sum_{i \in q \cap d} \left( \log \frac{p_i}{1 - p_i} + \log \frac{1 - q_i}{q_i} \right)
Probabilistic Retrieval: Robertson / Sparck Jones Formula

Estimate p_i and q_i based on a training sample (query q on a small sample of the
corpus) or based on intellectual assessment of the first round's results
(relevance feedback):

Let N be the #docs in the sample, R the #relevant docs in the sample,
    n_i the #docs in the sample that contain term i,
    r_i the #relevant docs in the sample that contain term i.

Estimate:

  p_i := \frac{r_i}{R},   q_i := \frac{n_i - r_i}{N - R}

or (Lidstone smoothing with \lambda = 0.5):

  p_i := \frac{r_i + 0.5}{R + 1},   q_i := \frac{n_i - r_i + 0.5}{N - R + 1}

sim(d, q)'' = \sum_{i \in q \cap d} \left( \log \frac{r_i + 0.5}{R - r_i + 0.5} - \log \frac{n_i - r_i + 0.5}{N - n_i - R + r_i + 0.5} \right)

Weight of term i in doc d:

  w_i := \log \frac{(r_i + 0.5)(N - n_i - R + r_i + 0.5)}{(R - r_i + 0.5)(n_i - r_i + 0.5)}
Probabilistic Retrieval: tf*idf Formula

Assumptions (without training sample or relevance feedback):
- p_i is the same for all i.
- Most documents are irrelevant.
- Each individual term i is infrequent.

This implies:

  \log \frac{p_i}{1 - p_i} = c  with constant c,

  q_i = \frac{df_i}{N},   \frac{1 - q_i}{q_i} = \frac{N - df_i}{df_i} \approx \frac{N}{df_i}

sim(d, q)'' = \sum_{i \in q \cap d} \left( \log \frac{p_i}{1 - p_i} + \log \frac{1 - q_i}{q_i} \right)
  \approx c \cdot \sum_{i \in q \cap d} \log \frac{N}{df_i} = c \cdot \sum_{i \in q \cap d} idf_i

Scalar product over the product of tf and dampened idf values for query terms.
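How good is the approximation \log \frac{1 - q_i}{q_i} \approx \log \frac{N}{df_i} for an infrequent term? A small numerical sketch (the corpus size N and document frequency df are illustrative numbers, not from the slides):

```python
import math

# Illustrative numbers: a corpus of 1,000,000 docs,
# a term appearing in 100 of them.
N, df = 1_000_000, 100

q = df / N                      # estimated P[term occurs | irrelevant]
exact = math.log((1 - q) / q)   # exact BIR weight contribution
approx = math.log(N / df)       # idf approximation

# For infrequent terms the two values are nearly identical.
print(exact, approx)
```

For frequent terms (df close to N) the approximation degrades, which is consistent with the "each individual term is infrequent" assumption above.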
Example for Probabilistic Retrieval

Documents with relevance feedback (q: t1 t2 t3 t4 t5 t6; R = 2, N = 4):

       t1   t2   t3   t4   t5   t6 | relevant
  d1    1    0    1    1    0    0 |    1
  d2    1    1    0    1    1    0 |    1
  d3    0    0    0    1    1    0 |    0
  d4    0    0    1    0    0    0 |    0

  n_i   2    1    2    3    2    0
  r_i   2    1    1    2    1    0
  p_i  5/6  1/2  1/2  5/6  1/2  1/6
  q_i  1/6  1/6  1/2  1/2  1/2  1/6

Score of new document d5 (with Lidstone smoothing, \lambda = 0.5),
d5 \cap q: <1 1 0 0 0 1>:

  sim(d5, q) = log 5 + log 1 + log 0.2 + log 5 + log 5 + log 5

using

  sim(d, q)'' = \sum_{i \in q \cap d} \left( \log \frac{p_i}{1 - p_i} + \log \frac{1 - q_i}{q_i} \right)
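The example can be reproduced with a short script; this sketch recomputes the Lidstone-smoothed estimates and the score of d5 from the table above:

```python
import math

# Term-document matrix from the example (rows d1..d4, columns t1..t6)
docs = [
    [1, 0, 1, 1, 0, 0],  # d1 (relevant)
    [1, 1, 0, 1, 1, 0],  # d2 (relevant)
    [0, 0, 0, 1, 1, 0],  # d3 (irrelevant)
    [0, 0, 1, 0, 0, 0],  # d4 (irrelevant)
]
relevant = [True, True, False, False]
N, R, m, lam = len(docs), sum(relevant), 6, 0.5

n = [sum(doc[i] for doc in docs) for i in range(m)]    # docs containing term i
r = [sum(doc[i] for doc, rel in zip(docs, relevant) if rel)
     for i in range(m)]                                # relevant docs containing term i

# Lidstone-smoothed estimates (lambda = 0.5)
p = [(r[i] + lam) / (R + 2 * lam) for i in range(m)]           # 5/6, 1/2, 1/2, 5/6, 1/2, 1/6
q = [(n[i] - r[i] + lam) / (N - R + 2 * lam) for i in range(m)]  # 1/6, 1/6, 1/2, 1/2, 1/2, 1/6

# Score of the new document d5, summed over the query terms it contains
d5 = [1, 1, 0, 0, 0, 1]
score = sum(math.log(p[i] / (1 - p[i])) + math.log((1 - q[i]) / q[i])
            for i in range(m) if d5[i])
# score = log 5 + log 1 + log 0.2 + log 5 + log 5 + log 5 = log 125
```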
Laplace Smoothing (with Uniform Prior)

Probabilities p_i and q_i for term i are estimated by MLE for a binomial
distribution (repeated coin tosses for relevant docs, showing term i with
prob. p_i; repeated coin tosses for irrelevant docs, showing term i with
prob. q_i).

To avoid overfitting to feedback/training data, the estimates should be
smoothed (e.g. with a uniform prior). Instead of estimating p_i = k/n,
estimate (Laplace's law of succession):

  p_i = (k + 1) / (n + 2)

or, with heuristic generalization (Lidstone's law of succession):

  p_i = (k + \lambda) / (n + 2\lambda)  with \lambda > 0 (e.g. \lambda = 0.5)

And for a multinomial distribution (n throws of a w-faceted dice) estimate:

  p_i = (k_i + 1) / (n + w)
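The estimators translate directly into code; a minimal sketch (the function names are mine):

```python
def lidstone(k, n, lam=0.5):
    """Lidstone's law of succession (k + lam)/(n + 2*lam);
    lam = 1 gives Laplace's law (k + 1)/(n + 2)."""
    return (k + lam) / (n + 2 * lam)

def laplace_multinomial(counts):
    """Add-one smoothing for a w-faceted dice thrown n times."""
    n, w = sum(counts), len(counts)
    return [(k + 1) / (n + w) for k in counts]

# With no observations at all, the smoothed estimates fall back to the
# uniform prior instead of being undefined like the MLE k/n:
print(lidstone(0, 0))                   # 0.5
print(laplace_multinomial([0, 0, 0]))   # uniform 1/3 each
```

Note that the smoothed multinomial estimates still sum to 1, so they remain a valid probability distribution.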
BM25: Motivations

- Estimates for probabilistic term weights based on assumptions about term frequencies (tf)
- Estimates about the relevance of a term based on the notion of "eliteness" of terms
- Assumptions about the relationships between eliteness and document length
Okapi BM25

Approximation of the Poisson model by a similarly shaped function:

  w := \frac{tf}{k_1 + tf} \cdot \log \frac{p(1-q)}{q(1-p)}

finally leads to Okapi BM25 (which achieved best TREC results):

  w_i(d) := \frac{(k_1 + 1) \, tf_i(d)}{k_1 \left( (1-b) + b \, \frac{length(d)}{avgdoclength} \right) + tf_i(d)} \cdot \log \frac{N - df_i + 0.5}{df_i + 0.5}

or, in the most comprehensive, tunable form:

  score(d, q) := \sum_{i=1..|q|} \log \frac{N - df_i + 0.5}{df_i + 0.5} \cdot \frac{(k_1 + 1) \, tf_i(d)}{K + tf_i(d)} \cdot \frac{(k_3 + 1) \, qtf_i}{k_3 + qtf_i} \;+\; k_2 \cdot |q| \cdot \frac{\Delta - len(d)}{\Delta + len(d)}

  with K := k_1 \left( (1-b) + b \, \frac{len(d)}{\Delta} \right), \Delta := avgdoclength,

tuning parameters k_1, k_2, k_3, b, non-linear influence of tf,
and consideration of doc length.
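The simple w_i(d) form translates directly into code; a sketch using the commonly quoted defaults k_1 = 1.2 and b = 0.75 (these default values are not stated on the slide):

```python
import math

def bm25_weight(tf, df, doc_len, N, avgdl, k1=1.2, b=0.75):
    """BM25 weight of one term in one document (simple form).

    tf: term frequency in the document, df: document frequency of the term,
    N: corpus size, avgdl: average document length.
    """
    idf = math.log((N - df + 0.5) / (df + 0.5))
    K = k1 * ((1 - b) + b * doc_len / avgdl)
    return idf * (k1 + 1) * tf / (K + tf)

def bm25_score(query_terms, doc_tf, df, doc_len, N, avgdl):
    """Score a document as the sum of its BM25 term weights over the query."""
    return sum(bm25_weight(doc_tf.get(t, 0), df[t], doc_len, N, avgdl)
               for t in query_terms)
```

Note the saturation: as tf grows, the tf factor approaches k_1 + 1, so a single term cannot dominate the score the way a linear-in-tf weight would.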
Eliteness in BM25
(figure; slide content not recovered)
Poisson Mixtures for Capturing tf Distribution

Katz's K-mixture: distribution of tf values for the term "said"
(figure; Source: Church/Gale 1995)
Averaging Eliteness According to Document Length Info
(figure; slide content not recovered)
4.1.3 Probabilistic IR with Poisson Model (Okapi BM25)

Generalize the term weight

  w = \log \frac{p(1-q)}{q(1-p)}

into

  w = \log \frac{p_{tf} \, q_0}{q_{tf} \, p_0}

with p_{tf}, q_{tf} denoting the probability that the term occurs tf times
in a relevant / irrelevant doc.

Postulate Poisson (or Poisson-mixture) distributions:

  p_{tf} = e^{-\lambda} \frac{\lambda^{tf}}{tf!},   q_{tf} = e^{-\mu} \frac{\mu^{tf}}{tf!}
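Plugging the Poisson pmfs into the generalized weight, the exponentials and factorials cancel, leaving w = tf · log(λ/μ): linear growth in tf, which is exactly what BM25's tf/(k_1 + tf) factor later dampens. A sketch (the λ and μ values in the test are illustrative):

```python
import math

def poisson_pmf(k, mean):
    """P[X = k] for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def poisson_term_weight(tf, lam, mu):
    """w = log(p_tf * q_0 / (q_tf * p_0)) with Poisson means
    lam (relevant docs) and mu (irrelevant docs)."""
    num = poisson_pmf(tf, lam) * poisson_pmf(0, mu)
    den = poisson_pmf(tf, mu) * poisson_pmf(0, lam)
    return math.log(num / den)

# Closed form after cancellation: tf * log(lam / mu), i.e. linear in tf.
```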
Additional Literature

Probabilistic IR:
- Grossman/Frieder: Sections 2.2 and 2.4
- S.E. Robertson, K. Sparck Jones: Relevance Weighting of Search Terms,
  JASIS 27(3), 1976
- S.E. Robertson, S. Walker: Some Simple Effective Approximations to the
  2-Poisson Model for Probabilistic Weighted Retrieval, SIGIR 1994
- K.W. Church, W.A. Gale: Poisson Mixtures, Natural Language Engineering 1(2), 1995
- C.T. Yu, W. Meng: Principles of Database Query Processing for Advanced
  Applications, Morgan Kaufmann, 1997, Chapter 9
- D. Heckerman: A Tutorial on Learning with Bayesian Networks, Technical Report
  MSR-TR-95-06, Microsoft Research, 1995