Probabilistic Ranking Principle. Hongning Wang
1 Probabilistic Ranking Principle

Hongning Wang
2 Notion of relevance

How relevance has been modeled in IR:
- Relevance as (Rep(q), Rep(d)) similarity; different representations & similarity measures:
  - Vector space model (Salton et al. 75)
  - Regression model (Fox 83)
  - Prob. distr. model (Wong & Yao 89)
- Relevance as P(r=1|q,d), r in {0,1}: probability of relevance (today's lecture)
  - Document generation: generative model, classical prob. model (Robertson & Sparck Jones 76)
  - Query generation: LM approach (Ponte & Croft 98; Lafferty & Zhai 01a)
- Relevance via P(d->q) or P(q->d): probabilistic inference, different inference systems:
  - Prob. concept space model (Wong & Yao 95)
  - Inference network model (Turtle & Croft 91)

CS@UVa, CS 6501: Information Retrieval
3 Basic concepts in probability

- Random experiment: an experiment with uncertain outcome, e.g. tossing a coin, picking a word from text
- Sample space S: all possible outcomes of an experiment, e.g. tossing 2 fair coins, S = {HH, HT, TH, TT}
- Event E: E is a subset of S; E happens iff the outcome is in E, e.g. E = {HH} (all heads), E = {HH, TT} (same face)
  - Impossible event ({}), certain event (S)
- Probability of an event: 0 <= P(E) <= 1
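The two-fair-coins example above can be checked by enumerating the sample space; a minimal sketch (the `prob` helper is an illustration, not part of the lecture):

```python
import itertools
from fractions import Fraction

# Sample space for tossing 2 fair coins: S = {HH, HT, TH, TT}
S = [''.join(o) for o in itertools.product('HT', repeat=2)]

def prob(event):
    """P(E) = |E| / |S| under a uniform distribution over outcomes."""
    return Fraction(len([o for o in S if o in event]), len(S))

print(prob({'HH'}))        # 1/4  (all heads)
print(prob({'HH', 'TT'}))  # 1/2  (same face)
print(prob(set()))         # 0    (impossible event)
print(prob(set(S)))        # 1    (certain event)
```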
4 Essential probability concepts

- Probability of events
  - Mutually exclusive events: P(A or B) = P(A) + P(B)
  - General events: P(A or B) = P(A) + P(B) - P(A and B)
  - Independent events: P(A and B) = P(A)P(B)
    - Joint probability, often written simply as P(AB)
5 Recap: TF normalization

- Two views of document length
  - A doc is long because it is verbose
  - A doc is long because it has more content
- Raw TF is inaccurate
  - Document length variation
  - Repeated occurrences are less informative than the first occurrence
  - Relevance does not increase proportionally with the number of term occurrences
- Generally penalize long docs, but avoid over-penalizing: pivoted length normalization
6 Recap: document frequency

Idea: a term is more discriminative if it occurs in fewer documents.
7 Recap: cosine similarity

- Angle between two TF-IDF vectors:
  cosine(V_q, V_d) = (V_q . V_d) / (|V_q|_2 |V_d|_2) = (V_q / |V_q|_2) . (V_d / |V_d|_2)
- Document length normalized (unit vectors)

[Figure: query and documents D1, D2 as unit vectors in a TF-IDF space with axes "Sports" and "Finance"]
8 Recap: fast computation of cosine in retrieval

- Maintain a score accumulator for each doc when scanning the postings
- Example: query = "info security", S(d,q) = g(t1) + ... + g(tn) [sum of TF of matched terms]
  - Postings for "info": (d1, 3) (d2, 4) (d3, 1) (d4, 5)
  - Postings for "security": (d2, 3) (d4, 1) (d5, 3)
  - Accumulators after "info": d1=3, d2=4, d3=1, d4=5; after "security": d2=7, d4=6, d5=3
- Can be easily applied to TF-IDF weighting!
- Keep only the most promising accumulators for top-K retrieval
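The accumulator scan above can be sketched in a few lines; this uses the postings from the slide and raw TF as the term weight g(t) (swapping in TF-IDF is a one-line change):

```python
from collections import defaultdict

# Postings from the slide: term -> list of (doc_id, term_frequency)
postings = {
    'info':     [('d1', 3), ('d2', 4), ('d3', 1), ('d4', 5)],
    'security': [('d2', 3), ('d4', 1), ('d5', 3)],
}

def score(query_terms, postings, top_k=None):
    """One accumulator per doc; add g(t) (here raw TF) while scanning postings."""
    acc = defaultdict(float)
    for t in query_terms:
        for doc, tf in postings.get(t, []):
            acc[doc] += tf          # replace with a TF-IDF weight if desired
    ranked = sorted(acc.items(), key=lambda x: -x[1])
    return ranked[:top_k] if top_k else ranked

print(score(['info', 'security'], postings))
# d2 (4+3=7) ranks first, then d4 (5+1=6)
```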
9 Recap: advantages of VS model

- Empirically effective! (top TREC performance)
- Intuitive
- Easy to implement
- Well studied / mostly evaluated
- The SMART system: developed at Cornell, still widely used
- Warning: many variants of TF-IDF!
10 Recap: disadvantages of VS model

- Assumes term independence
- Assumes query and document are represented the same way
- Lack of predictive adequacy
- Arbitrary term weighting
- Arbitrary similarity measure
- Lots of parameter tuning!
11 Essential probability concepts

- Conditional probability: P(B|A) = P(AB) / P(A)
  - Bayes' rule: P(B|A) = P(A|B)P(B) / P(A)
  - For independent events: P(B|A) = P(B)
- Total probability: if A_1, ..., A_n form a non-overlapping partition of S,
  - P(B) = P(B and A_1) + ... + P(B and A_n)
  - P(A_i|B) = P(B|A_i)P(A_i) / P(B)
  - This allows us to compute P(A_i|B) based on P(B|A_i)
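A tiny numeric check of total probability and Bayes' rule; the partition {A1, A2} and its probabilities are hypothetical numbers chosen for illustration:

```python
from fractions import Fraction as F

# Hypothetical two-way partition of the sample space with priors and likelihoods
P_A = {'A1': F(1, 2), 'A2': F(1, 2)}          # priors P(A_i)
P_B_given_A = {'A1': F(3, 4), 'A2': F(1, 4)}  # likelihoods P(B|A_i)

# Total probability: P(B) = sum_i P(B|A_i) P(A_i)
P_B = sum(P_B_given_A[a] * P_A[a] for a in P_A)

# Bayes' rule: P(A_i|B) = P(B|A_i) P(A_i) / P(B)
posterior = {a: P_B_given_A[a] * P_A[a] / P_B for a in P_A}
print(P_B)        # 1/2
print(posterior)  # A1 -> 3/4, A2 -> 1/4; posteriors sum to 1
```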
12 Interpretation of Bayes' rule

- Hypothesis space H = {H_1, ..., H_n}; evidence E
- P(H_i|E) = P(E|H_i)P(H_i) / P(E)
  - P(H_i|E): posterior probability of H_i
  - P(H_i): prior probability of H_i
  - P(E|H_i): likelihood of the data/evidence given H_i
- If we want to pick the most likely hypothesis H*, we can drop P(E): P(H_i|E) is proportional to P(E|H_i)P(H_i)
13 Theoretical justification of ranking

As stated by William Cooper: "If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data is made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."

Ranking by probability of relevance leads to optimal retrieval effectiveness.
14 Justification from decision theory

- Two types of loss:
  - Loss(retrieved | non-relevant) = a1
  - Loss(not retrieved | relevant) = a2
- phi(d,q): probability of d being relevant to q
- Expected loss for the decision of including d in the final results:
  - Retrieve: (1 - phi(d,q)) * a1
  - Not retrieve: phi(d,q) * a2
- Your decision criterion?
15 Justification from decision theory

- We make the decision: if (1 - phi(d,q)) * a1 < phi(d,q) * a2, retrieve d; otherwise do not
  - i.e., check whether phi(d,q) > a1 / (a1 + a2)
- Ranking documents in descending order of phi(d,q) minimizes the loss
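The decision rule above reduces to a fixed threshold on phi; a minimal sketch, with hypothetical loss values a1 = 1 (false alarm) and a2 = 3 (miss), giving threshold 1/4:

```python
def should_retrieve(phi, a1, a2):
    """Retrieve d iff the expected loss of retrieving is lower than skipping:
       (1 - phi) * a1 < phi * a2, i.e. phi > a1 / (a1 + a2)."""
    return phi > a1 / (a1 + a2)

a1, a2 = 1.0, 3.0   # hypothetical costs; threshold = 1 / (1 + 3) = 0.25
for phi in (0.1, 0.25, 0.4):
    print(phi, should_retrieve(phi, a1, a2))
# 0.1 -> False, 0.25 -> False (strict inequality), 0.4 -> True
```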
16 According to the PRP, all we need is

- A relevance measure function F(q,d) such that for all q, d1, d2: F(q,d1) > F(q,d2) iff P(Rel|q,d1) > P(Rel|q,d2)
- Assumptions:
  - Independent relevance
  - Sequential browsing
- Most existing research on IR models so far has fallen into this line of thinking
- Limitations?
17 Probability of relevance

- Three random variables: query Q, document D, relevance R in {0,1}
- Goal: rank D based on P(R=1|Q,D)
  - Actually, one only needs to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank the documents
- Several different ways to define P(R=1|Q,D)
18 Conditional models for P(R=1|Q,D)

- Basic idea: relevance depends on how well a query matches a document
  - P(R=1|Q,D) = g(Rep(Q,D); theta)
  - Rep(Q,D): feature representation of the query-doc pair, e.g. #matched terms, highest IDF of a matched term, doclen
- Use training data with known relevance judgments to estimate the parameters theta
- Apply the model to rank new documents
- Special case: logistic regression (a functional form)
19 Regression for ranking?

- Linear regression: y = w^T X
  - Relationship between a scalar dependent variable y and one or more explanatory variables
- In a ranking problem:
  - X: features about a query-document pair
  - y: relevance label of the document for the given query
20 Features/attributes for ranking

Typical features considered in ranking problems (over the M terms t_j shared by query and document):

- X1 = (1/M) Sum_j log f(t_j): average absolute query frequency
- X2: query length (QL)
- X3 = (1/M) Sum_j log DF(t_j): average absolute document frequency
- X4: document length (DL)
- X5 = (1/M) Sum_j log (N / n_{t_j}): average inverse document frequency
- X6 = log M: number of terms in common between query and document
21 Regression for ranking

- Linear regression: y = w^T X
  - Relationship between a scalar dependent variable y and one or more explanatory variables
- But y is discrete in a ranking problem!
  - e.g., predict y = 1 iff w^T X > 0.5
- What if we have an outlier?

[Figure: 0/1 relevance labels fit by a regression line; a single outlier shifts the optimal regression model and the 0.5 decision boundary]
22 Regression for ranking

- Logistic regression: P(R=1|Q,D) = sigma(w^T X) = 1 / (1 + exp(-w^T X))
  - Directly models the posterior of document relevance
  - Sigmoid function
- What if we have an outlier?

[Figure: sigmoid curve y(x); far less sensitive to outliers than the linear fit]
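A minimal sketch of ranking by the logistic posterior; the three features and the weight vector here are hypothetical (in practice the weights are fit on judged query-doc pairs). Note that since the sigmoid is monotone, ranking by sigma(w^T x) is the same as ranking by w^T x:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_relevant(x, w):
    """P(R=1|Q,D) = sigma(w^T x) for a feature vector x of the query-doc pair."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# Hypothetical features per query-doc pair: (#matched terms, max IDF, doclen)
w = [0.8, 0.5, -0.01]                        # assumed weights, for illustration
docs = {'d1': [2, 3.2, 120], 'd2': [1, 1.1, 40]}
ranked = sorted(docs, key=lambda d: p_relevant(docs[d], w), reverse=True)
print(ranked)  # d1 scores higher than d2 under these weights
```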
23 Conditional models for P(R=1|Q,D): pros & cons

- Advantages:
  - Absolute probability of relevance available
  - May re-use all the past relevance judgments
- Problems:
  - Performance heavily depends on the selection of features
  - Little guidance on feature selection
- Will be covered in more detail in later learning-to-rank discussions
24 Generative models for P(R=1|Q,D)

- Basic idea: compute the odds O(R=1|Q,D) using Bayes' rule:
  O(R=1|Q,D) = P(R=1|Q,D) / P(R=0|Q,D) = [P(Q,D|R=1) / P(Q,D|R=0)] * [P(R=1) / P(R=0)]
  - The factor P(R=1)/P(R=0) is ignored for ranking
- Assumption: relevance is a binary variable
- Variants:
  - Document generation: P(Q,D|R) = P(D|Q,R)P(Q|R)
  - Query generation: P(Q,D|R) = P(Q|D,R)P(D|R)
28 Document generation model

O(R=1|Q,D) is proportional to P(D|Q,R=1) / P(D|Q,R=0)
- P(D|Q,R=1): model of relevant docs for Q; P(D|Q,R=0): model of non-relevant docs for Q
- Assume independent attributes A_1 ... A_k (why?)
- Let D = d_1 ... d_k, where d_i in {0,1} is the value of attribute A_i; similarly Q = q_1 ... q_k
- Then:
  O(R=1|Q,D) ~ Prod_{i: d_i=1} [P(A_i=1|Q,R=1) / P(A_i=1|Q,R=0)] * Prod_{i: d_i=0} [P(A_i=0|Q,R=1) / P(A_i=0|Q,R=0)]
  (terms occurring in the doc, then terms not occurring in the doc)
- Notation, per term A_i:

  document                relevant (R=1)   nonrelevant (R=0)
  term present (d_i=1)         p_i              u_i
  term absent  (d_i=0)       1 - p_i          1 - u_i
29 Document generation model

O(R=1|Q,D) ~ Prod_{i: d_i=q_i=1} (p_i / u_i) * Prod_{i: d_i=0, q_i=1} [(1 - p_i) / (1 - u_i)]
           = Prod_{i: d_i=q_i=1} [p_i (1 - u_i) / (u_i (1 - p_i))] * Prod_{i: q_i=1} [(1 - p_i) / (1 - u_i)]

- Important tricks:
  - Assumption: terms not occurring in the query are equally likely to occur in relevant and non-relevant documents, i.e., p_t = u_t, so only query terms matter
  - The second product ranges over all query terms and does not depend on the document, so it can be ignored for ranking

  document                relevant (R=1)   nonrelevant (R=0)
  term present (d_i=1)         p_i              u_i
  term absent  (d_i=0)       1 - p_i          1 - u_i
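The rank-equivalent log-odds above can be checked numerically; the binary vectors and the values of p_i and u_i below are hypothetical, and the third term deliberately has p_3 = u_3 (a non-query term) so it contributes nothing:

```python
import math

# Binary term vectors; p_i = P(A_i=1|Q,R=1), u_i = P(A_i=1|Q,R=0) (hypothetical)
query = [1, 1, 0]        # q_i
doc   = [1, 0, 0]        # d_i
p     = [0.8, 0.6, 0.3]
u     = [0.3, 0.3, 0.3]  # p_3 = u_3: term 3 is not in the query

def log_odds(doc, query, p, u):
    """Rank-equivalent log O(R=1|Q,D): sum over query terms present in the doc
       of log[p_i(1-u_i) / (u_i(1-p_i))]."""
    s = 0.0
    for d_i, q_i, p_i, u_i in zip(doc, query, p, u):
        if d_i == 1 and q_i == 1:
            s += math.log(p_i * (1 - u_i) / (u_i * (1 - p_i)))
    return s

print(round(log_odds(doc, query, p, u), 3))  # log(0.56/0.06) ~ 2.234
```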
30 obertson-sparck Jones Model obertson & Sparck Jones 76 logo 1 D ank k 1 d q 1 p 1 u log u 1 p k 1 d q 1 p log 1 p + 1 u log u SJ model Two parameters for each term : p 11: prob. that term occurs n a relevant doc u 10: prob. that term occurs n a non-relevant doc How to estmate these parameters? Suppose we have relevance judgments pˆ # rel. doc wth # rel. doc + 1 uˆ # nonrel. doc wth # nonrel. doc and +1 can be justfed by Bayesan estmaton as prors er-query estmaton! CS@UVa CS 6501: Informaton etreval 30
31 Parameter estimation

- General setting:
  - Given a hypothesized probabilistic model that governs the random experiment
  - The model gives a probability p(D|theta) of any data, which depends on the parameter theta
  - Now, given actual sample data X = {x_1, ..., x_n}, what can we say about the value of theta?
- Intuitively, take our best guess of theta; "best" means best explaining/fitting the data
- Generally an optimization problem
32 Maximum likelihood vs. Bayesian

- Maximum likelihood estimation
  - "Best" means the data likelihood reaches its maximum: theta = argmax_theta P(X|theta)
  - Issue: small sample size
- Bayesian estimation
  - "Best" means being consistent with our prior knowledge and explaining the data well:
    theta = argmax_theta P(theta|X) = argmax_theta P(X|theta) P(theta)
  - A.k.a. maximum a posteriori (MAP) estimation
  - Issue: how to define the prior?
- ML: frequentist's point of view; MAP: Bayesian's point of view
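The small-sample issue can be seen with a coin-flip example (not from the slides): with only 4 tosses, the MLE is pulled to an extreme, while a MAP estimate under a hypothetical Beta(5,5) prior (which favors a fair coin) stays closer to 0.5:

```python
# Coin flips: 3 heads in 4 tosses; theta = P(heads)
heads, tosses = 3, 4

# MLE: argmax_theta P(X|theta) = heads / tosses
theta_mle = heads / tosses

# MAP with a Beta(a, b) prior: posterior is Beta(heads + a, tosses - heads + b),
# whose mode is (heads + a - 1) / (tosses + a + b - 2)
a, b = 5, 5   # hypothetical prior favoring a fair coin
theta_map = (heads + a - 1) / (tosses + a + b - 2)

print(theta_mle)  # 0.75
print(theta_map)  # 7/12, pulled toward the prior mode 0.5
```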
33 Illustration of Bayesian estimation

- Posterior: p(theta|X) is proportional to p(X|theta) p(theta)
- Likelihood: p(X|theta), X = (x_1, ..., x_N)
- Prior: p(theta)

[Figure: prior, likelihood and posterior curves over theta, marking the prior mode, the posterior mode (the MAP estimate), and the ML estimate theta_ml]
34 Maximum likelihood estimation

- Data: a document d with counts c(w_1), ..., c(w_N)
- Model: multinomial distribution p(W|theta) with parameters theta_i = p(w_i) (e.g. probabilities of "computer", "information", "retrieval", "science", "literature", ...)
- Maximum likelihood estimator: theta = argmax_theta p(W|theta)

  p(W|theta) = [N! / (c(w_1)! ... c(w_N)!)] Prod_i theta_i^{c(w_i)}, proportional to Prod_i theta_i^{c(w_i)}
  log p(W|theta) = Sum_i c(w_i) log theta_i

- Using the Lagrange multiplier approach (with the requirement from probability that Sum_i theta_i = 1):
  L(W, theta) = Sum_i c(w_i) log theta_i + lambda (Sum_i theta_i - 1)
  Set partial derivatives to zero: dL/dtheta_i = c(w_i)/theta_i + lambda = 0, so theta_i = -c(w_i)/lambda
  Since Sum_i theta_i = 1, lambda = -Sum_i c(w_i), giving the ML estimate theta_i = c(w_i) / Sum_j c(w_j)
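The closed-form result of the derivation above (relative frequencies) in code; the example sentence is made up for illustration:

```python
from collections import Counter

def mle_unigram(text):
    """ML estimate for a multinomial over words: theta_i = c(w_i) / sum_j c(w_j),
       the closed form from the Lagrange-multiplier derivation."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

theta = mle_unigram("information retrieval retrieval is about information about text")
print(theta['retrieval'])   # 2/8 = 0.25
print(sum(theta.values()))  # 1.0, a proper distribution
```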
35 obertson-sparck Jones Model obertson & Sparck Jones 76 logo 1 D ank k 1 d q 1 p 1 u log u 1 p k 1 d q 1 p log 1 p + 1 u log u SJ model Two parameters for each term : p 11: prob. that term occurs n a relevant doc u 10: prob. that term occurs n a non-relevant doc How to estmate these parameters? Suppose we have relevance judgments pˆ # rel. doc wth # rel. doc + 1 uˆ # nonrel. doc wth # nonrel. doc and +1 can be justfed by Bayesan estmaton as prors er-query estmaton! CS@UVa CS 6501: Informaton etreval 35
36 RSJ model without relevance info (Croft & Harper 79)

- Suppose we do not have relevance judgments:
  - Assume p_i is a constant
  - Estimate u_i by assuming all documents to be non-relevant
- Then:
  log O(R=1|Q,D) = (rank-equivalent) Sum_{i: d_i=q_i=1} [c + log((N - n_i) / n_i)]
  - N: # documents in the collection; n_i: # documents in which term A_i occurs
- Reminder: IDF = 1 + log(N / n_i)
- An IDF-weighted Boolean model?
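The IDF-like weight above is easy to tabulate; the collection size and document frequencies below are hypothetical:

```python
import math

def croft_harper_weight(N, n_i):
    """Without judgments: assume all docs non-relevant, so the RSJ term weight
       reduces (up to an additive constant c) to log((N - n_i) / n_i)."""
    return math.log((N - n_i) / n_i)

N = 1000  # hypothetical collection size
for n_i in (1, 10, 100, 500):
    print(n_i, round(croft_harper_weight(N, n_i), 2))
# rarer terms get larger weights; at n_i = N/2 the weight drops to 0
```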
37 RSJ model: summary

- The most important classic probabilistic IR model
- Uses only term presence/absence, thus also referred to as the Binary Independence Model
  - Essentially Naive Bayes for doc ranking
  - Designed for short catalog records
- Without relevance judgments, the model parameters must be estimated in an ad hoc way
- Performance isn't as good as tuned VS models
38 Improving RSJ: adding TF

- Let D = d_1 ... d_k, where d_i is the frequency count of term A_i
- 2-Poisson mixture model for TF, with eliteness E (whether the term is about the concept asked in the query):
  p(d_i = f | R) = P(E|R) * (mu_E^f e^{-mu_E}) / f! + P(not E|R) * (mu_{not E}^f e^{-mu_{not E}}) / f!
- Many more parameters to estimate! (How many, exactly?)
- Compounded with document length!
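The 2-Poisson mixture for a term's TF can be evaluated directly; the eliteness probability and the two Poisson rates below are hypothetical parameters for illustration:

```python
import math

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu**k / math.factorial(k)

def two_poisson(tf, pi_elite, mu_elite, mu_nonelite):
    """2-Poisson mixture for a term's TF: with prob. pi_elite the doc is 'elite'
       for the term's concept (high rate mu_elite), else non-elite (low rate)."""
    return (pi_elite * poisson_pmf(tf, mu_elite)
            + (1 - pi_elite) * poisson_pmf(tf, mu_nonelite))

# Hypothetical parameters: elite docs average 5 occurrences, non-elite 0.2
for tf in range(4):
    print(tf, round(two_poisson(tf, 0.1, 5.0, 0.2), 4))
```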
39 BM25/Okapi approximation (Robertson et al. 94)

- Idea: model p(D|Q,R) with a simpler function that approximates the 2-Poisson mixture model
- Observations: log O(R=1|Q,D) is a sum of term weights W_i over terms occurring in both query and document, where
  - W_i = 0 if TF_i = 0
  - W_i increases monotonically with TF_i
  - W_i has an asymptotic limit
- The simple function is:
  W_i = [TF_i / (k_1 + TF_i)] * log [p_i (1 - u_i) / (u_i (1 - p_i))]
40 Adding doc length

- Motivation: the 2-Poisson model assumes equal document length
- Implementation: penalize long docs
  W_i = [TF_i / (K + TF_i)] * log [p_i (1 - u_i) / (u_i (1 - p_i))]
  where K = k_1 * (1 - b + b * |d| / avg|d|)
- Pivoted document length normalization
41 Adding query TF

- Motivation: natural symmetry between document and query
- Implementation: a similar TF transformation as for document TF
  W_i = [(k_3 + 1) QTF_i / (k_3 + QTF_i)] * [(k_1 + 1) TF_i / (K + TF_i)] * log [p_i (1 - u_i) / (u_i (1 - p_i))]
- The final formula is called BM25, achieving top TREC performance (BM: best match)
42 The BM25 formula

[Figure: the full Okapi TF/BM25 formula; the relevance-based log term becomes an IDF when no relevance info is available]
43 The BM25 formula: a closer look

rel(q, D) = Sum_{i in q and D} IDF(q_i) * [tf_i (k_1 + 1) / (tf_i + k_1 (1 - b + b |D| / avgdl))] * [(k_2 + 1) qtf_i / (k_2 + qtf_i)]

- The first factor after the IDF is the TF-IDF component for the document; the last is the TF component for the query
- Typical settings: b around 0.75, k_1 in [1.2, 2.0], k_2 in (0, 1000]
- A vector space model with a TF-IDF schema!
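A sketch of the scoring function above; the parameter values are the commonly cited defaults, and the counts in the example (document frequencies, lengths) are hypothetical:

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avgdl, N, df,
               k1=1.2, b=0.75, k2=100.0):
    """Sum over terms shared by query and document of
       IDF * doc-TF saturation * query-TF saturation."""
    score = 0.0
    for t, qtf in query_tf.items():
        if t not in doc_tf:
            continue
        tf = doc_tf[t]
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))  # RSJ-style IDF
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))
        qtf_part = qtf * (k2 + 1) / (qtf + k2)
        score += idf * tf_part * qtf_part
    return score

df = {'info': 30, 'security': 10}   # hypothetical document frequencies
s = bm25_score({'info': 1, 'security': 1},
               {'info': 3, 'security': 1}, doc_len=120, avgdl=100, N=1000, df=df)
print(round(s, 3))
```

Note how the doc-TF factor saturates: raising tf increases the score but with diminishing returns, exactly the "asymptotic limit" observation that motivated BM25.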
44 Extensions of doc generation models

- Capture term dependence [van Rijsbergen & Harper 78]
- Alternative ways to incorporate TF [Croft 83; Kalt 96]
- Feature/term selection for feedback [Okapi's TREC reports]
- Estimate the relevance model based on pseudo feedback [Lavrenko & Croft 01], to be covered later
45 Query generation models

O(R=1|Q,D) = [P(Q,D|R=1) / P(Q,D|R=0)] * [P(R=1) / P(R=0)]
           = [P(Q|D,R=1) P(D|R=1)] / [P(Q|D,R=0) P(D|R=0)] * [P(R=1) / P(R=0)]
           ~ P(Q|D,R=1) * [P(D|R=1) / P(D|R=0)]   (assuming P(Q|D,R=0) ~ P(Q|R=0), and dropping the constant prior ratio)

- P(Q|D,R=1): query likelihood p(q|theta_d); P(D|R=1)/P(D|R=0): document prior
- Assuming a uniform document prior, O(R=1|Q,D) is proportional to P(Q|D,R=1)
- Now the question is how to compute P(Q|D,R=1). Generally two steps:
  1. Estimate a language model based on D
  2. Compute the query likelihood according to the estimated model
- Language models: we will cover them in the next lecture!
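The two steps above can be sketched with a plain MLE unigram model (no smoothing, which the next lecture addresses); the documents and query are made up for illustration:

```python
from collections import Counter

def query_likelihood(query, doc):
    """(1) Estimate a unigram LM from D by MLE; (2) multiply P(q_i|theta_d)
       over the query words. Unsmoothed, so any unseen query word zeroes the
       score, which is exactly why smoothing is needed."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    p = 1.0
    for w in query.split():
        p *= counts.get(w, 0) / total
    return p

d1 = "information retrieval is the science of search"
d2 = "probability theory for information science"
q = "information retrieval"
print(query_likelihood(q, d1), query_likelihood(q, d2))
# d1 matches both query words; d2 misses "retrieval" and gets probability 0
```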
46 What you should know

- Essential concepts in probability
- Justification of ranking by relevance
- Derivation of the RSJ model
- Maximum likelihood estimation
- The BM25 formula
47 Today's reading

Chapter 11, Probabilistic information retrieval:
- 11.2 The Probability Ranking Principle
- 11.3 The Binary Independence Model
- Okapi BM25: a non-binary model
INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton
More informationMotion Perception Under Uncertainty. Hongjing Lu Department of Psychology University of Hong Kong
Moton Percepton Under Uncertanty Hongjng Lu Department of Psychology Unversty of Hong Kong Outlne Uncertanty n moton stmulus Correspondence problem Qualtatve fttng usng deal observer models Based on sgnal
More informationStatistical pattern recognition
Statstcal pattern recognton Bayes theorem Problem: decdng f a patent has a partcular condton based on a partcular test However, the test s mperfect Someone wth the condton may go undetected (false negatve
More informationCSC 411 / CSC D11 / CSC C11
18 Boostng s a general strategy for learnng classfers by combnng smpler ones. The dea of boostng s to take a weak classfer that s, any classfer that wll do at least slghtly better than chance and use t
More informationSupporting Information
Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to
More informationMaximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models
ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models
More informationMidterm Review. Hongning Wang
Mdterm Revew Hongnng Wang CS@UVa Core concepts Search Engne Archtecture Key components n a modern search engne Crawlng & Text processng Dfferent strateges for crawlng Challenges n crawlng Text processng
More informationEnsemble Methods: Boosting
Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement
More informationChapter Newton s Method
Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve
More informationDiscriminative classifier: Logistic Regression. CS534-Machine Learning
Dscrmnatve classfer: Logstc Regresson CS534-Machne Learnng robablstc Classfer Gven an nstance, hat does a probablstc classfer do dfferentl compared to, sa, perceptron? It does not drectl predct Instead,
More informationThe conjugate prior to a Bernoulli is. A) Bernoulli B) Gaussian C) Beta D) none of the above
The conjugate pror to a Bernoull s A) Bernoull B) Gaussan C) Beta D) none of the above The conjugate pror to a Gaussan s A) Bernoull B) Gaussan C) Beta D) none of the above MAP estmates A) argmax θ p(θ
More informationThe Geometry of Logit and Probit
The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 22, Reference Priors
Stat60: Bayesan Modelng and Inference Lecture Date: February, 00 Reference Prors Lecturer: Mchael I. Jordan Scrbe: Steven Troxler and Wayne Lee In ths lecture, we assume that θ R; n hgher-dmensons, reference
More information/ n ) are compared. The logic is: if the two
STAT C141, Sprng 2005 Lecture 13 Two sample tests One sample tests: examples of goodness of ft tests, where we are testng whether our data supports predctons. Two sample tests: called as tests of ndependence
More informationGaussian process classification: a message-passing viewpoint
Gaussan process classfcaton: a message-passng vewpont Flpe Rodrgues fmpr@de.uc.pt November 014 Abstract The goal of ths short paper s to provde a message-passng vewpont of the Expectaton Propagaton EP
More informationPolynomial Regression Models
LINEAR REGRESSION ANALYSIS MODULE XII Lecture - 6 Polynomal Regresson Models Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur Test of sgnfcance To test the sgnfcance
More informationCHAPTER 3: BAYESIAN DECISION THEORY
HATER 3: BAYESIAN DEISION THEORY Decson mang under uncertanty 3 Data comes from a process that s completely not nown The lac of nowledge can be compensated by modelng t as a random process May be the underlyng
More information3.1 ML and Empirical Distribution
67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum
More informationCommunication with AWGN Interference
Communcaton wth AWG Interference m {m } {p(m } Modulator s {s } r=s+n Recever ˆm AWG n m s a dscrete random varable(rv whch takes m wth probablty p(m. Modulator maps each m nto a waveform sgnal s m=m
More informationConditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
Condtonal Random Felds: Probablstc Models for Segmentng and Labelng Sequence Data Paper by John Lafferty, Andrew McCallum, and Fernando Perera ICML 2001 Presentaton by Joe Drsh May 9, 2002 Man Goals Present
More information9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov
9.93 Class IV Part I Bayesan Decson Theory Yur Ivanov TOC Roadmap to Machne Learnng Bayesan Decson Makng Mnmum Error Rate Decsons Mnmum Rsk Decsons Mnmax Crteron Operatng Characterstcs Notaton x - scalar
More informationHere is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y)
Secton 1.5 Correlaton In the prevous sectons, we looked at regresson and the value r was a measurement of how much of the varaton n y can be attrbuted to the lnear relatonshp between y and x. In ths secton,
More informationChecking Pairwise Relationships. Lecture 19 Biostatistics 666
Checkng Parwse Relatonshps Lecture 19 Bostatstcs 666 Last Lecture: Markov Model for Multpont Analyss X X X 1 3 X M P X 1 I P X I P X 3 I P X M I 1 3 M I 1 I I 3 I M P I I P I 3 I P... 1 IBD states along
More informationLECTURE 9 CANONICAL CORRELATION ANALYSIS
LECURE 9 CANONICAL CORRELAION ANALYSIS Introducton he concept of canoncal correlaton arses when we want to quantfy the assocatons between two sets of varables. For example, suppose that the frst set of
More informationThe Expectation-Maximization Algorithm
The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.
More informationLOGIT ANALYSIS. A.K. VASISHT Indian Agricultural Statistics Research Institute, Library Avenue, New Delhi
LOGIT ANALYSIS A.K. VASISHT Indan Agrcultural Statstcs Research Insttute, Lbrary Avenue, New Delh-0 02 amtvassht@asr.res.n. Introducton In dummy regresson varable models, t s assumed mplctly that the dependent
More informationModule 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur
Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:
More informationHidden Markov Models
CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte
More informationExcess Error, Approximation Error, and Estimation Error
E0 370 Statstcal Learnng Theory Lecture 10 Sep 15, 011 Excess Error, Approxaton Error, and Estaton Error Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton So far, we have consdered the fnte saple
More informationBasically, if you have a dummy dependent variable you will be estimating a probability.
ECON 497: Lecture Notes 13 Page 1 of 1 Metropoltan State Unversty ECON 497: Research and Forecastng Lecture Notes 13 Dummy Dependent Varable Technques Studenmund Chapter 13 Bascally, f you have a dummy
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014
COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #16 Scrbe: Yannan Wang Aprl 3, 014 1 Introducton The goal of our onlne learnng scenaro from last class s C comparng wth best expert and
More informationLecture 3: Probability Distributions
Lecture 3: Probablty Dstrbutons Random Varables Let us begn by defnng a sample space as a set of outcomes from an experment. We denote ths by S. A random varable s a functon whch maps outcomes nto the
More informationLecture 12: Discrete Laplacian
Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly
More informationProbabilistic Classification: Bayes Classifiers. Lecture 6:
Probablstc Classfcaton: Bayes Classfers Lecture : Classfcaton Models Sam Rowes January, Generatve model: p(x, y) = p(y)p(x y). p(y) are called class prors. p(x y) are called class condtonal feature dstrbutons.
More informationECONOMICS 351*-A Mid-Term Exam -- Fall Term 2000 Page 1 of 13 pages. QUEEN'S UNIVERSITY AT KINGSTON Department of Economics
ECOOMICS 35*-A Md-Term Exam -- Fall Term 000 Page of 3 pages QUEE'S UIVERSITY AT KIGSTO Department of Economcs ECOOMICS 35* - Secton A Introductory Econometrcs Fall Term 000 MID-TERM EAM ASWERS MG Abbott
More informationSDMML HT MSc Problem Sheet 4
SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be
More information