Probabilistic Information Retrieval CE-324: Modern Information Retrieval Sharif University of Technology
1 Probabilistic Information Retrieval
CE-324: Modern Information Retrieval, Sharif University of Technology. M. Soleymani, Fall 2016.
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
2 Why probabilities in IR?
User Information Need → Query Representation: our understanding of the user need is uncertain. How to match? Documents → Document Representation: an uncertain guess of whether a doc has relevant content.
In traditional IR systems, matching between each doc and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties?
3 Probabilistic IR
Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR. Traditionally: neat ideas, but they didn't win on performance. It may be different now.
4 Probabilistic IR topics
- Classical probabilistic retrieval model: Probability Ranking Principle; Binary Independence Model (we will see that it is a Naïve Bayes text categorization); (Okapi) BM25
- Bayesian networks for IR [we will not cover this topic]
- Language model approach to IR: an important emphasis on this approach in recent work
5 The document ranking problem
Problem specification: we have a collection of docs; a user issues a query; a list of docs needs to be returned. The ranking method is the core of an IR system: in what order do we present documents to the user? Idea: rank by probability of relevance of the doc w.r.t. the information need: $P(R = 1 \mid \text{doc}, \text{query})$
6 Probability Ranking Principle (PRP)
"If a reference retrieval system's response to each request is a ranking of the docs in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data." [1960s/1970s] S. Robertson, W. S. Cooper, M. E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)
7 Recall a few probability basics
Product rule: $p(a, b) = p(a \mid b)\, p(b)$
Sum rule: $p(a) = \sum_b p(a, b)$
Bayes' rule (posterior from prior): $p(a \mid b) = \dfrac{p(b \mid a)\, p(a)}{p(b)} = \dfrac{p(b \mid a)\, p(a)}{p(b \mid a)\, p(a) + p(b \mid \bar{a})\, p(\bar{a})}$
Odds: $O(a) = \dfrac{p(a)}{p(\bar{a})} = \dfrac{p(a)}{1 - p(a)}$
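To make the rules concrete, here is a minimal Python sketch; the numbers are invented for illustration, not from the slides. It applies the sum rule, Bayes' rule, and the odds transform to a toy relevance event:

```python
# Toy example (assumed numbers): posterior and odds for a relevance event.
p_a = 0.1               # prior p(a): probability a doc is relevant
p_b_given_a = 0.6       # p(b | a): evidence likelihood given relevance
p_b_given_not_a = 0.05  # p(b | not a)

# Sum rule over the two cases of a gives p(b); Bayes' rule gives p(a | b).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
posterior = p_b_given_a * p_a / p_b

# Odds: O(a) = p(a) / (1 - p(a))
odds = posterior / (1 - posterior)
print(round(posterior, 3), round(odds, 3))  # 0.571 1.333
```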
8 Probability Ranking Principle (PRP)
d: doc; q: query; R: relevance of a doc w.r.t. a given (fixed) query; R = 1: relevant; R = 0: not relevant.
We need to find the probability that a doc d is relevant to a query q: $p(R = 1 \mid d, q)$, with $p(R = 0 \mid d, q) = 1 - p(R = 1 \mid d, q)$
9 Probability Ranking Principle (PRP)
$p(R = 1 \mid d, q) = \dfrac{p(d \mid R = 1, q)\, p(R = 1 \mid q)}{p(d \mid q)}$
$p(R = 0 \mid d, q) = \dfrac{p(d \mid R = 0, q)\, p(R = 0 \mid q)}{p(d \mid q)}$
$p(d \mid R = 1, q)$: probability of d in the class of docs relevant to the query q. $p(d \mid R = 0, q)$: probability of d in the class of docs non-relevant to the query q.
10 Probability Ranking Principle (PRP)
Bayes optimal decision rule: d is relevant iff $P(R = 1 \mid d, q) > P(R = 0 \mid d, q)$, i.e., d is relevant iff $P(R = 1 \mid d, q) > 0.5$.
But the PRP in action: rank all documents by $P(R = 1 \mid d, q)$.
Theorem: using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss. Provable if all probabilities are correct, etc. [e.g., Ripley 1996]
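A minimal sketch of the PRP in action, assuming a hypothetical estimator `prob_relevant(d, q)` is available; the principle itself is just a sort by estimated probability of relevance:

```python
# PRP ranking sketch: sort docs by estimated P(R = 1 | d, q), highest first.
# `prob_relevant` is a hypothetical estimator, e.g. the BIM developed below.
def prp_rank(docs, query, prob_relevant):
    return sorted(docs, key=lambda d: prob_relevant(d, query), reverse=True)
```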
11 Probability Ranking Principle (PRP)
How do we compute all those probabilities? We do not know the exact probabilities and have to use estimates. The Binary Independence Model (BIM), which we discuss next, is the simplest model.
12 Probabilistic retrieval strategy
Estimate how terms contribute to relevance: how do things like tf, df, and length influence your judgments about doc relevance? (A more nuanced answer is the Okapi formula: Spärck Jones / Robertson.) Combine the estimated values to find the doc relevance probability, then order docs by decreasing probability.
13 Probabilistic ranking
Basic concept: "For a given query, if we know some docs that are relevant, terms that occur in those docs should be given greater weighting in searching for other relevant docs. By making assumptions about the distribution of terms and applying Bayes' theorem, it is possible to derive weights theoretically." — van Rijsbergen
14 Binary Independence Model
Traditionally used in conjunction with the PRP.
"Binary" = Boolean: docs are represented as binary incidence vectors of terms: $\vec{x} = [x_1, x_2, \ldots, x_M]$, where $x_i = 1$ iff term i is present in the document.
"Independence": terms occur in docs independently.
Equivalent to the multivariate Bernoulli Naïve Bayes model sometimes used for text categorization [we will see this in the next lectures].
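A minimal sketch of this representation, on invented toy documents: term counts and order are discarded, and each doc becomes an incidence vector over the vocabulary.

```python
# Toy data (assumed): docs as binary term-incidence vectors.
docs = ["the cat sat", "the dog sat on the cat", "a dog barked"]
vocab = sorted({t for d in docs for t in d.split()})

def incidence_vector(doc: str) -> list:
    """x_i = 1 iff term i is present in the doc; frequency is ignored."""
    present = set(doc.split())
    return [1 if t in present else 0 for t in vocab]

X = [incidence_vector(d) for d in docs]
```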
15 Binary Independence Model
We will use odds and Bayes' rule:
$O(R \mid q, \vec{x}) = \dfrac{P(R = 1 \mid q, \vec{x})}{P(R = 0 \mid q, \vec{x})} = \dfrac{P(R = 1 \mid q)}{P(R = 0 \mid q)} \cdot \dfrac{P(\vec{x} \mid R = 1, q)}{P(\vec{x} \mid R = 0, q)}$
16 Binary Independence Model
$O(R \mid q, \vec{x}) = \dfrac{P(R = 1 \mid q)}{P(R = 0 \mid q)} \cdot \dfrac{P(\vec{x} \mid R = 1, q)}{P(\vec{x} \mid R = 0, q)}$ — the first factor is constant for a given query; the second needs estimation.
Using the independence assumption:
$\dfrac{P(\vec{x} \mid R = 1, q)}{P(\vec{x} \mid R = 0, q)} = \prod_{i=1}^{M} \dfrac{p(x_i \mid R = 1, q)}{p(x_i \mid R = 0, q)}$
So: $O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{i=1}^{M} \dfrac{p(x_i \mid R = 1, q)}{p(x_i \mid R = 0, q)}$
17 Binary Independence Model
Since each $x_i$ is either 0 or 1:
$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i = 1} \dfrac{P(x_i = 1 \mid R = 1, q)}{P(x_i = 1 \mid R = 0, q)} \cdot \prod_{x_i = 0} \dfrac{P(x_i = 0 \mid R = 1, q)}{P(x_i = 0 \mid R = 0, q)}$
Let $p_i = P(x_i = 1 \mid R = 1, q)$ and $u_i = P(x_i = 1 \mid R = 0, q)$.
Assume, for all terms not occurring in the query ($q_i = 0$), that $p_i = u_i$. This can be changed (e.g., in relevance feedback).
18 Probabilities
                       document relevant (R=1)   not relevant (R=0)
term present, x_i = 1          p_i                     u_i
term absent,  x_i = 0        1 − p_i                 1 − u_i
Then...
19 Binary Independence Model
$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \dfrac{p_i}{u_i} \cdot \prod_{x_i = 0,\, q_i = 1} \dfrac{1 - p_i}{1 - u_i}$  (matching query terms; non-matching query terms)
$= O(R \mid q) \cdot \prod_{x_i = q_i = 1} \dfrac{p_i (1 - u_i)}{u_i (1 - p_i)} \cdot \prod_{q_i = 1} \dfrac{1 - p_i}{1 - u_i}$  (all matching terms; all query terms)
20 Binary Independence Model
$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \dfrac{p_i (1 - u_i)}{u_i (1 - p_i)} \cdot \prod_{q_i = 1} \dfrac{1 - p_i}{1 - u_i}$ — the last product is constant for each query.
The only quantity to be estimated for rankings is the Retrieval Status Value (RSV):
$RSV = \log \prod_{x_i = q_i = 1} \dfrac{p_i (1 - u_i)}{u_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \dfrac{p_i (1 - u_i)}{u_i (1 - p_i)}$
21 Binary Independence Model
All boils down to computing the RSV:
$RSV = \sum_{x_i = q_i = 1} c_i$, where $c_i = \log \dfrac{p_i (1 - u_i)}{u_i (1 - p_i)}$
The $c_i$'s function as the term weights in this model. So, how do we compute the $c_i$'s from our data?
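A minimal sketch of the RSV computation, assuming the per-term probabilities $p_i$ and $u_i$ are already given as dictionaries keyed by term:

```python
import math

def c_weight(p_i: float, u_i: float) -> float:
    """Term weight c_i = log[ p_i (1 - u_i) / (u_i (1 - p_i)) ]."""
    return math.log(p_i * (1 - u_i) / (u_i * (1 - p_i)))

def rsv(query_terms: set, doc_terms: set, p: dict, u: dict) -> float:
    """Sum c_i over the terms present in both the query and the doc."""
    return sum(c_weight(p[t], u[t]) for t in query_terms & doc_terms)
```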
22 BIM: example
$q = \{x_1, x_2\}$. Relevance judgments from 20 docs, together with the distribution of $x_1, x_2$ within these docs, give:
$p_1 = 8/12$, $u_1 = 3/8$; $p_2 = 7/12$, $u_2 = 4/8$.
$c_1 = \log \frac{10}{3}$, $c_2 = \log \frac{7}{5}$
[Table of document counts by incidence pattern $(x_1, x_2) \in \{(1,1), (1,0), (0,1), (0,0)\}$ not reproduced.]
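As a sanity check, the slide's weights can be reproduced with the `c_weight` sketch above:

```python
import math
c1 = math.log((8/12) * (1 - 3/8) / ((3/8) * (1 - 8/12)))  # = log(10/3)
c2 = math.log((7/12) * (1 - 4/8) / ((4/8) * (1 - 7/12)))  # = log(7/5)
assert math.isclose(c1, math.log(10/3)) and math.isclose(c2, math.log(7/5))
```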
23 Binary Independence Model
Estimating RSV coefficients in theory. For each term i, look at this table of document counts:
              Relevant     Non-Relevant          Total
x_i = 1          s_i         df_i − s_i           df_i
x_i = 0        S − s_i    N − df_i − S + s_i    N − df_i
Total             S            N − S                N
Estimates: $p_i = \dfrac{s_i}{S}$, $u_i = \dfrac{df_i - s_i}{N - S}$
Weight of the i-th term: $c_i = \log \dfrac{s_i / (S - s_i)}{(df_i - s_i) / (N - df_i - S + s_i)}$
For now, assume no zero terms.
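A minimal sketch of the count-based estimate; as the slide says, it assumes no zero cells (a smoothing constant such as 1/2 per cell, which appears in the feedback slides below, is the usual fix):

```python
import math

def c_from_counts(s: int, S: int, df: int, N: int) -> float:
    """c_i = log[ (s/(S-s)) / ((df-s)/(N-df-S+s)) ], assuming no zero cells.
    s: relevant docs containing term i, S: number of relevant docs,
    df: docs containing term i, N: collection size."""
    return math.log((s / (S - s)) / ((df - s) / (N - df - S + s)))
```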
24 Estimation — key challenge
If non-relevant docs are approximated by the whole collection: $u_i = df_i / N$ (the probability of occurrence in non-relevant docs for the query), and
$\log \dfrac{1 - u_i}{u_i} = \log \dfrac{N - df_i}{df_i} \approx \log \dfrac{N}{df_i}$ — IDF!
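A quick numeric check with invented numbers shows how close the two forms are when $df_i \ll N$:

```python
import math
N, df = 1_000_000, 100
print(math.log((N - df) / df))  # 9.2102...
print(math.log(N / df))         # 9.2103...  (essentially identical)
```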
25 Estimation — key challenge
$p_i$ (the probability of occurrence in relevant docs) cannot be approximated as easily as $u_i$. $p_i$ can be estimated in various ways:
- constant (Croft and Harper combination match); then we just get idf weighting of terms
- proportional to the probability of occurrence in the collection; Greiff (SIGIR 1998) argues for $1/3 + 2/3 \cdot df_i/N$
- from relevant docs, if we know some; relevance weighting can be used in a feedback loop
26 Probabilistic relevance feedback
1. Guess $p_i$ and $u_i$ and use them to retrieve a first set of relevant docs VR.
2. Interact with the user to refine the description: the user specifies some definite members with R = 1 (the set VR) and R = 0 (the set VNR).
3. Re-estimate $p_i$ and $u_i$: $p_i = \dfrac{|VR_i| + 1/2}{|VR| + 1}$, $u_i = \dfrac{|VNR_i| + 1/2}{|VNR| + 1}$ (where $VR_i$, $VNR_i$ are the docs in VR, VNR containing term i).
4. Repeat, thus generating a succession of approximations to the relevant docs.
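A minimal sketch of the re-estimation step; the 1/2 and 1 are the smoothing constants in the slide's formulas, which keep estimates away from 0 and 1:

```python
def reestimate(VR_i: int, VR: int, VNR_i: int, VNR: int):
    """Step 3: p_i, u_i from the judged relevant (VR) and non-relevant (VNR)
    set sizes, where VR_i / VNR_i count the docs containing term i."""
    p_i = (VR_i + 0.5) / (VR + 1)
    u_i = (VNR_i + 0.5) / (VNR + 1)
    return p_i, u_i
```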
27 Probabilistic relevance feedback
1. Guess $p_i$ and $u_i$ and use them to retrieve a first set of relevant docs VR.
2. Interact with the user to refine the description: learn some definite members with R = 1 and R = 0.
3. Re-estimate $p_i$ and $u_i$, or combine the new info with the original guess (a Bayesian update):
$p_i^{(t+1)} = \dfrac{|VR_i| + \kappa\, p_i^{(t)}}{|VR| + \kappa}$, where κ is the prior weight.
4. Repeat, thus generating a succession of approximations to the relevant docs.
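A minimal sketch of this Bayesian variant: instead of replacing the old estimate, blend it with the new evidence, with κ acting as the weight (in pseudo-counts) given to the prior; the default value here is an assumption for illustration:

```python
def bayes_update(VR_i: int, VR: int, p_prev: float, kappa: float = 5.0) -> float:
    """p_i^(t+1) = (|VR_i| + kappa * p_i^(t)) / (|VR| + kappa)."""
    return (VR_i + kappa * p_prev) / (VR + kappa)
```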
28 Iteratively estimating p_i (= pseudo-relevance feedback)
1. Assume that $p_i$ is constant over all $x_i$ in the query: $p_i = 0.5$ (even odds) for any given doc.
2. Determine a guess of the relevant doc set: V is a fixed-size set of the highest-ranked docs under this model.
3. We need to improve our guesses for $p_i$ and $u_i$. Let $V_i$ be the set of docs in V containing $x_i$: $p_i = \dfrac{|V_i| + 1/2}{|V| + 1}$. Assume that if a doc is not retrieved, it is not relevant: $u_i = \dfrac{df_i - |V_i| + 1/2}{N - |V| + 1}$.
4. Go to 2 until convergence, then return the ranking.
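A minimal sketch of the whole loop. `rank` and `contains` are hypothetical helpers (not from the slides): `rank(p, u)` returns doc ids ordered by RSV under the current weights, and `contains(t)` returns the set of doc ids containing term t.

```python
def pseudo_feedback(query_terms, N, df, rank, contains, k=10, iters=5):
    p = {t: 0.5 for t in query_terms}         # step 1: even odds
    u = {t: df[t] / N for t in query_terms}   # non-relevant ~ whole collection
    for _ in range(iters):                    # fixed iterations stand in for
        V = set(rank(p, u)[:k])               # a convergence test; step 2
        for t in query_terms:                 # step 3: re-estimate p_i, u_i
            V_t = len(V & contains(t))
            p[t] = (V_t + 0.5) / (len(V) + 1)
            u[t] = (df[t] - V_t + 0.5) / (N - len(V) + 1)
    return p, u
```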
29 PRP and BIM
Getting reasonable approximations of the probabilities is possible, but it requires restrictive assumptions:
- Boolean representation of docs/queries/relevance
- term independence
- terms that do not appear in the query don't affect the outcome
- doc relevance values are independent
Some of these assumptions can be removed. Problem: the models either require partial relevance information or can only derive somewhat inferior term weights.
30 Removing term independence
In general, index terms aren't independent, and dependencies can be complex. van Rijsbergen (1979) proposed a model of simple tree dependencies — exactly Friedman and Goldszmidt's Tree Augmented Naïve Bayes (AAAI 13, 1996) — in which each term is dependent on one other. In the 1970s, estimation problems held back the success of this model.
31 A key limitation of the BIM
The BIM was designed for titles or abstracts, not for modern full-text search, like much of the original IR work. We want to pay attention to term frequency and doc lengths, just as in the other models we've discussed.
32 Okapi BM25
BM25 = "Best Match 25" (they had a bunch of tries!). Developed in the context of the Okapi system. Started to be increasingly adopted by other teams during the TREC competitions. It works well. Goal: relax some assumptions of the BIM while not adding too many parameters (Spärck Jones et al. 2000). I'll omit the theory, but show the form.
33 Recall: BIM
Boils down to:
$RSV^{BIM} = \sum_{x_i = q_i = 1} c_i^{BIM}$, $c_i^{BIM} = \log \dfrac{p_i (1 - u_i)}{(1 - p_i)\, u_i}$ (a log odds ratio)
                       document relevant (R=1)   not relevant (R=0)
term present, x_i = 1          p_i                     u_i
term absent,  x_i = 0        1 − p_i                 1 − u_i
Simplifies to (with constant $p_i = 0.5$):
$RSV^{BIM} = \sum_{x_i = q_i = 1} \log \dfrac{N}{df_i}$
34 Early versions of BM25
Version 1: using the saturation function:
$c_i^{BM25v1}(tf_i) = c_i^{BIM} \cdot \dfrac{tf_i}{k_1 + tf_i}$
Version 2: with the BIM simplification to IDF:
$c_i^{BM25v2}(tf_i) = \log \dfrac{N}{df_i} \cdot \dfrac{(k_1 + 1)\, tf_i}{k_1 + tf_i}$
The $(k_1 + 1)$ factor doesn't change the ranking, but makes the term score 1 when $tf_i = 1$. Similar to tf-idf, but term scores are bounded.
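A minimal sketch of the version-2 saturation factor; note the bound: the factor equals 1 at $tf_i = 1$ and approaches $k_1 + 1$ as $tf_i$ grows, unlike raw tf:

```python
def saturation(tf: float, k1: float = 1.2) -> float:
    """(k1 + 1) * tf / (k1 + tf): equals 1.0 at tf = 1, asymptote k1 + 1."""
    return (k1 + 1) * tf / (k1 + tf)

# tf = 1, 2, 5, 20 -> 1.0, 1.375, 1.774, 2.075 (bounded by k1 + 1 = 2.2)
```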
35 Document length normalization
Longer documents are likely to have larger tf values. Why might documents be longer?
- Verbosity: suggests the observed tf is too high
- Larger scope: suggests the observed tf may be right
A real document collection probably has both effects, so we should apply some kind of normalization.
36 Document length normalization
Document length: $dl = \sum_{i \in V} tf_i$; avdl: average document length over the collection.
Length normalization component: $B = (1 - b) + b \cdot \dfrac{dl}{avdl}$, $0 \le b \le 1$
b = 1: full document length normalization; b = 0: no document length normalization.
37 Okapi BM25
Factor in the frequency of each term versus the doc length:
$RSV^{BM25} = \sum_{i \in q} c_i^{BM25}(tf_i)$, $c_i^{BM25}(tf_i) = \log \dfrac{N}{df_i} \cdot \dfrac{(k_1 + 1)\, tf_i}{k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right) + tf_i}$
$tf_i$ is the term frequency of term i in doc d; dl is the length of d and avdl is the average doc length; $k_1$ and b are tuning parameters.
38 Okapi BM25
$RSV^{BM25} = \sum_{i \in q} \log \dfrac{N}{df_i} \cdot \dfrac{(k_1 + 1)\, tf_i}{k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right) + tf_i}$
$k_1$ controls term frequency scaling: $k_1 = 0$ is the binary model; large $k_1$ is raw term frequency.
b controls doc length normalization: b = 0 is no length normalization; b = 1 is relative frequency (fully scale by doc length).
Typically, $k_1$ is set around 1.2 to 2 and b around 0.75.
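Putting the pieces together, a minimal BM25 scorer under the slide's formula (no query-term-frequency component; the data structures here are assumptions for illustration):

```python
import math

def bm25_score(query_terms, tf, dl, avdl, df, N, k1=1.5, b=0.75):
    """tf: term -> frequency in this doc; df: term -> document frequency;
    dl: this doc's length; avdl: average doc length; N: collection size."""
    B = (1 - b) + b * dl / avdl              # length normalization component
    score = 0.0
    for t in query_terms:
        if tf.get(t, 0) > 0:
            idf = math.log(N / df[t])
            score += idf * (k1 + 1) * tf[t] / (k1 * B + tf[t])
    return score
```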
39 Resources
S. E. Robertson and K. Spärck Jones. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Science 27(3): 129–146.
C. J. van Rijsbergen. 1979. Information Retrieval. 2nd ed. London: Butterworths, chapter 6. [Most details of the math]
N. Fuhr. 1992. Probabilistic Models in Information Retrieval. The Computer Journal 35(3): 243–255. [Easiest read, with BNs]
F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. "Is This Document Relevant?... Probably": A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4): 528–552. [Adds very little material that isn't in van Rijsbergen or Fuhr]