CS47300: Web Information Search and Management
Probabilistic Retrieval Models
Prof. Chris Clifton
7 September 2018
Material adapted from a course created by Dr. Luo Si, now leading Alibaba research group

Why probabilities in IR?
User Information Need -> Query Representation: our understanding of the user need is uncertain.
Documents -> Document Representation: an uncertain guess of whether a document has relevant content.
How to match?
In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties?

Jan-17 20 Christopher W. Clifton
Probabilistic IR topics
- Classical probabilistic retrieval model: probability ranking principle, etc.
- Binary independence model (~ Naive Bayes)
- (Okapi) BM25
- Bayesian networks for text retrieval
- Language model approach to IR: an important emphasis in recent work
Probabilistic methods have been a recurring theme in Information Retrieval. Traditionally: neat ideas, but they didn't win on performance.

The document ranking problem
We have a collection of documents; a user issues a query; a list of documents needs to be returned. The ranking method is the core of an IR system: in what order do we present documents to the user? We want the best document to be first, the second best second, etc.
Idea: rank by probability of relevance of the document w.r.t. the information need: P(R=1 | document, query).
Recall a few probability basics
For events A and B:
  p(A, B) = p(A and B) = p(A|B) p(B) = p(B|A) p(A)
Bayes' Rule (posterior = likelihood x prior / evidence):
  p(A|B) = p(B|A) p(A) / p(B) = p(B|A) p(A) / sum over X in {A, not-A} of p(B|X) p(X)
Odds:
  O(A) = p(A) / p(not-A) = p(A) / (1 - p(A))

The Probability Ranking Principle (PRP)
"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data."
[1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)
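The Bayes' Rule and odds identities above can be checked numerically. A minimal sketch, with made-up probabilities (the helper names and numbers are illustrative, not from the slides):

```python
# Bayes' rule for a binary event A given evidence B.
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """p(A|B) = p(B|A) p(A) / [p(B|A) p(A) + p(B|~A) p(~A)]"""
    evidence = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / evidence

def odds(p):
    """O(A) = p(A) / (1 - p(A))"""
    return p / (1 - p)

p_rel = 0.01                                 # hypothetical prior p(A): doc is relevant
p_post = bayes_posterior(0.8, p_rel, 0.1)    # p(B|A)=0.8, p(B|~A)=0.1
print(p_post)        # posterior probability of relevance given evidence B
print(odds(p_post))  # the same belief expressed as odds
```

Note that even strong evidence (0.8 vs. 0.1) leaves the posterior small when the prior is small, which is the usual situation in retrieval: most documents are not relevant.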
Probability Ranking Principle (PRP)
Let x represent a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query, with R=1 meaning relevant and R=0 not relevant.
We need to find p(R=1|x), the probability that a document x is relevant:
  p(R=1|x) = p(x|R=1) p(R=1) / p(x)
  p(R=0|x) = p(x|R=0) p(R=0) / p(x)
  p(R=0|x) + p(R=1|x) = 1
p(R=1), p(R=0): prior probability of retrieving a relevant or non-relevant document.
p(x|R=1), p(x|R=0): probability that if a relevant (non-relevant) document is retrieved, it is x.

Probability Ranking Principle (PRP)
Simple case: no selection costs or other utility concerns that would differentially weight errors.
PRP in action: rank all documents by p(R=1|x).
Theorem: using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss. Provable if all probabilities are correct, etc. [e.g., Ripley 1996]
How do we compute all those probabilities? We do not know the exact probabilities, and have to use estimates.
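Once estimates of p(R=1|x) are in hand, "PRP in action" is nothing more than a sort. A minimal sketch with hypothetical, made-up probability estimates for five documents:

```python
# Hypothetical estimated relevance probabilities p(R=1|x) per document.
est = {"d1": 0.12, "d2": 0.57, "d3": 0.05, "d4": 0.31, "d5": 0.57}

# PRP: present documents in order of decreasing probability of relevance.
ranking = sorted(est, key=est.get, reverse=True)
print(ranking)  # most probably relevant document first
```

Ties (here d2 and d5) are broken arbitrarily; the PRP itself says nothing about tie order.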
Probabilistic Ranking
Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes' Theorem, it is possible to derive weights theoretically." — Van Rijsbergen

Binary Independence Model
Traditionally used in conjunction with the PRP.
"Binary" = Boolean: documents are represented as binary incidence vectors of terms: x = (x_1, ..., x_n), where x_i = 1 iff term i is present in document x.
"Independence": terms occur in documents independently.
Different documents can be modeled as the same vector.
Binary Independence Model
Queries: binary term incidence vectors. Given query q, for each document d we need to compute p(R|q,d). Replace this with computing p(R|q,x), where x is the binary term incidence vector representing d. We are interested only in ranking.
Use odds and Bayes' Rule:
  O(R|q,x) = p(R=1|q,x) / p(R=0|q,x)
           = [p(R=1|q) p(x|R=1,q) / p(x|q)] / [p(R=0|q) p(x|R=0,q) / p(x|q)]
           = [p(R=1|q) / p(R=0|q)] * [p(x|R=1,q) / p(x|R=0,q)]
The first factor is constant for a given query; the second needs estimation.
Using the independence assumption:
  p(x|R=1,q) / p(x|R=0,q) = product over i=1..n of p(x_i|R=1,q) / p(x_i|R=0,q)
So:
  O(R|q,x) = O(R|q) * product over i=1..n of p(x_i|R=1,q) / p(x_i|R=0,q)
Binary Independence Model
Since each x_i is either 0 or 1, split the product by term presence:
  O(R|q,x) = O(R|q) * product over {i: x_i=1} of p(x_i=1|R=1,q)/p(x_i=1|R=0,q)
                    * product over {i: x_i=0} of p(x_i=0|R=1,q)/p(x_i=0|R=0,q)
Let p_i = p(x_i=1|R=1,q) and r_i = p(x_i=1|R=0,q).
Assume, for all terms not occurring in the query (q_i=0), that p_i = r_i; those factors cancel, leaving only query terms:
  O(R|q,x) = O(R|q) * product over {i: x_i=q_i=1} of p_i/r_i
                    * product over {i: x_i=0, q_i=1} of (1-p_i)/(1-r_i)

Model parameters:
                         document relevant (R=1)   not relevant (R=0)
  term present (x_i=1)          p_i                      r_i
  term absent  (x_i=0)        1 - p_i                  1 - r_i
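The two products above can be evaluated directly. A minimal sketch with hypothetical p_i, r_i estimates for a three-term query (the term names and numbers are invented for illustration):

```python
def odds_product(present, absent, p, r):
    """Per-term odds factors from the BIM decomposition:
    p_i/r_i for query terms present in the document,
    (1-p_i)/(1-r_i) for query terms absent from it.
    Multiplying by the query-constant O(R|q) would give O(R|q,x)."""
    out = 1.0
    for i in present:
        out *= p[i] / r[i]
    for i in absent:
        out *= (1 - p[i]) / (1 - r[i])
    return out

# Hypothetical estimates; "model" is a query term absent from the document.
p = {"web": 0.6, "search": 0.8, "model": 0.4}
r = {"web": 0.3, "search": 0.2, "model": 0.35}
print(odds_product(["web", "search"], ["model"], p, r))
```

Terms with q_i = 0 never appear in either product, which is exactly the p_i = r_i cancellation assumed above.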
Binary Independence Model
Multiply and divide the matching-term product by (1-r_i)/(1-p_i), so the absent-term product extends over all query terms:
  O(R|q,x) = O(R|q) * product over {i: x_i=q_i=1} of [p_i(1-r_i)] / [r_i(1-p_i)]
                    * product over {i: q_i=1} of (1-p_i)/(1-r_i)
The first product runs over all matching query terms; the second runs over all query terms and is therefore constant for each query.
Retrieval Status Value (the only quantity to be estimated for ranking):
  RSV = log of product over {i: x_i=q_i=1} of p_i(1-r_i)/(r_i(1-p_i))
      = sum over {i: x_i=q_i=1} of log [p_i(1-r_i)/(r_i(1-p_i))]
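The RSV sum is easy to compute once p_i and r_i estimates exist. A minimal sketch, again with invented per-term estimates for a two-term query:

```python
import math

def rsv(matching_terms, p, r):
    """Retrieval Status Value: sum over query terms present in the
    document of log[p_i (1 - r_i) / (r_i (1 - p_i))]."""
    return sum(math.log(p[i] * (1 - r[i]) / (r[i] * (1 - p[i])))
               for i in matching_terms)

# Hypothetical per-term estimates.
p = {"web": 0.6, "search": 0.8}   # p_i: P(term present | relevant)
r = {"web": 0.3, "search": 0.2}   # r_i: P(term present | non-relevant)
print(rsv(["web", "search"], p, r))  # higher RSV => rank earlier
```

Because the query-constant factors are dropped, RSV differences between documents preserve the ordering that O(R|q,x) would give.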
Binary Independence Model
It all boils down to computing the RSV:
  RSV = sum over {i: x_i=q_i=1} of c_i ;  c_i = log [p_i(1-r_i) / (r_i(1-p_i))]
The c_i are log odds ratios. They function as the term weights in this model.
So, how do we compute the c_i's from our data?

Binary Independence Model
Estimating RSV coefficients in theory: for each term i, look at this table of document counts:

  Documents   Relevant   Non-relevant     Total
  x_i = 1        s          n - s           n
  x_i = 0      S - s    N - n - S + s     N - n
  Total          S         N - S            N

Estimates:
  p_i ~ s/S ;  r_i ~ (n - s)/(N - S)
  c_i ~ K(N, n, S, s) = log [s (N - n - S + s)] / [(n - s)(S - s)]
For now, assume no zero terms.
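Plugging the table's count estimates into c_i gives the weight directly. A minimal sketch with invented counts (the variable names follow the table above):

```python
import math

def term_weight(N, n, S, s):
    """c_i ~ K(N, n, S, s) = log[ s (N - n - S + s) / ((n - s)(S - s)) ].
    N: total docs, n: docs containing term i,
    S: relevant docs, s: relevant docs containing term i.
    Assumes no zero cells, as the slide does."""
    return math.log(s * (N - n - S + s) / ((n - s) * (S - s)))

# Hypothetical counts: 1000 docs, term in 100; 10 relevant, term in 8.
print(term_weight(N=1000, n=100, S=10, s=8))
```

A term that is common among the relevant documents but rare elsewhere (as here) gets a large positive weight; a term equally common in both groups gets a weight near zero.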
Estimation: key challenge
If non-relevant documents are approximated by the whole collection, then r_i (probability of occurrence in non-relevant documents for the query) is n/N, and
  log (1-r_i)/r_i = log (N - n - S + s)/(n - s) ~ log (N - n)/n ~ log N/n = IDF!

Estimation: key challenge
p_i (probability of occurrence in relevant documents) cannot be approximated as easily. p_i can be estimated in various ways:
- from relevant documents, if we know some; relevance weighting can be used in a feedback loop
- as a constant (Croft and Harper combination match); then we just get df_i weighting of terms (with p_i = 0.5):
    RSV = sum over {i: x_i=q_i=1} of log N/n_i
- as proportional to the probability of occurrence in the collection; Greiff (SIGIR 1998) argues for p_i = 1/3 + (2/3) df_i/N
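The Croft–Harper case (p_i = 0.5) reduces the RSV to a pure IDF-style sum, which only needs document frequencies. A minimal sketch with hypothetical document frequencies in a 1000-document collection:

```python
import math

def rsv_idf(query_terms, doc_terms, df, N):
    """Croft-Harper style RSV with p_i = 0.5: sum of log(N / n_i)
    over query terms that occur in the document (an IDF-like weight)."""
    return sum(math.log(N / df[t])
               for t in query_terms if t in doc_terms)

# Hypothetical document frequencies n_i.
df = {"web": 400, "retrieval": 20}
score = rsv_idf(["web", "retrieval"], {"web", "retrieval", "model"}, df, N=1000)
print(score)  # the rarer term "retrieval" contributes the larger weight
```

This recovers the familiar inverse-document-frequency weighting from the probabilistic model rather than by heuristic argument, which is the point of the derivation above.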