Information Retrieval Language models for IR


Robert Fleming
5 years ago

1 Information Retrieval: Language models for IR
From Manning and Raghavan's course [Borrows slides from Viktor Lavrenko and Chengxiang Zhai]

2 Recap
Traditional models:
- Boolean model
- Vector space model
- Probabilistic models
Today: IR using statistical language models

3 Principle of statistical language modeling
Goal: create a statistical model so that one can calculate the probability of a sequence of words $s = w_1, w_2, \ldots, w_n$ in a language.
General approach: training corpus → probabilities of the observed elements → $P(s)$ for a sequence $s$

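4 Examples of utilization
Speech recognition
- Training corpus: signals + words
- Probabilities: $P(\text{word} \mid \text{signal})$, $P(\text{word}_2 \mid \text{word}_1)$
- Utilization: signals → sequence of words
Statistical tagging
- Training corpus: words + tags (n, v)
- Probabilities: $P(\text{word} \mid \text{tag})$, $P(\text{tag}_2 \mid \text{tag}_1)$
- Utilization: sentence → sequence of tags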

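5 Stochastic Language Models
A statistical model for generating text: a probability distribution over strings in a given language.
$P(w_1 w_2 w_3 w_4 \mid M) = P(w_1 \mid M)\, P(w_2 \mid M, w_1)\, P(w_3 \mid M, w_1 w_2)\, P(w_4 \mid M, w_1 w_2 w_3)$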

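6 Prob. of a sequence of words
$P(s) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}) = \prod_{i=1}^{n} P(w_i \mid h_i)$
Elements to be estimated: $P(w_i \mid h_i) = \frac{P(h_i w_i)}{P(h_i)}$
- If $h_i$ is too long, one cannot observe $(h_i, w_i)$ in the training corpus, and $(h_i, w_i)$ is hard to generalize
- Solution: limit the length of $h_i$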

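7 n-grams
Limit $h_i$ to the $n-1$ preceding words. Most used cases:
- Unigram: $P(s) = \prod_{i=1}^{n} P(w_i)$
- Bigram: $P(s) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
- Trigram: $P(s) = \prod_{i=1}^{n} P(w_i \mid w_{i-2} w_{i-1})$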

8 Unigram and higher-order models
Unigram Language Models
Bigram (generally, n-gram) Language Models
Easy. Effective!

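9 Estimation
History: short ↔ long
Modeling: coarse ↔ refined
Estimation: easy ↔ difficult
Maximum likelihood estimation (MLE):
$P(w_i) = \frac{\#(w_i)}{|C_{uni}|}$, $P(h_i w_i) = \frac{\#(h_i w_i)}{|C_{n\text{-}gram}|}$
If $(h_i m_i)$ is not observed in the training corpus, then $P(m_i \mid h_i) = 0$, yet $(h_i m_i)$ could still be possible in the language.
Solution: smoothing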

10 Smoothing
Goal: assign a low probability to words or n-grams not observed in the training corpus
[Figure: probability $P$ over words, comparing the MLE estimate with the smoothed estimate]

11 Smoothing methods
n-gram: $\alpha$
Change the freq. of occurrences:
- Laplace smoothing (add-one): $P_{add\_one}(\alpha \mid C) = \frac{|\alpha| + 1}{\sum_{\alpha_i \in V} (|\alpha_i| + 1)}$
- Good-Turing: change the freq. $r$ to $r^* = (r + 1)\frac{n_{r+1}}{n_r}$, where $n_r$ = no. of n-grams of freq. $r$; this redistributes the total count of words of frequency $r+1$ to words of frequency $r$
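Both count adjustments can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the counts and vocabulary are made up, and keeping $r$ unchanged when $n_{r+1} = 0$ is a simplification chosen here, not something the slides specify.

```python
from collections import Counter

def laplace(counts, vocab):
    """Add-one smoothing: (count + 1) / sum over the vocabulary of (count + 1)."""
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def good_turing_counts(counts):
    """Adjusted count r* = (r + 1) * n_{r+1} / n_r for each observed frequency r.

    Assumption for this sketch: when no item has frequency r + 1, r is kept as is.
    """
    n = Counter(counts.values())  # n[r] = number of items seen exactly r times
    return {r: (r + 1) * n[r + 1] / n[r] if n[r + 1] else float(r) for r in n}

# Hypothetical counts; "g" is in the vocabulary but was never observed.
counts = Counter({"a": 3, "b": 2, "c": 2, "d": 1, "e": 1, "f": 1})
vocab = ["a", "b", "c", "d", "e", "f", "g"]

p_laplace = laplace(counts, vocab)
r_star = good_turing_counts(counts)
```

With these counts the unseen item "g" receives probability 1/17 under add-one smoothing, and items seen once have their count raised from 1 to 4/3 by Good-Turing because two items were seen twice and three were seen once.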

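12 Smoothing (cont'd)
Combine a model with a lower-order model:
- Backoff (Katz): $P_{Katz}(w_i \mid w_{i-1}) = \begin{cases} P_{GT}(w_i \mid w_{i-1}) & \text{if } |w_{i-1} w_i| > 0 \\ \alpha(w_{i-1})\, P_{Katz}(w_i) & \text{otherwise} \end{cases}$
- Interpolation (Jelinek-Mercer): $P_{JM}(w_i \mid w_{i-1}) = \lambda_{w_{i-1}} P_{ML}(w_i \mid w_{i-1}) + (1 - \lambda_{w_{i-1}}) P_{JM}(w_i)$
In IR, combine doc. with corpus: $P(w_i \mid D) = \lambda P_{ML}(w_i \mid D) + (1 - \lambda) P_{ML}(w_i \mid C)$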

13 Standard Probabilistic IR
[Diagram: an information need is expressed as a query, which is matched against documents d1, d2, ..., dn in the document collection]

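14 IR based on Language Model (LM)
[Diagram: each document d1, d2, ..., dn in the document collection has its own model $M_{d_1}, M_{d_2}, \ldots, M_{d_n}$; the query for an information need is scored by its generation probability $P(Q \mid M_d)$]
A query generation process:
- For an information need, imagine an ideal document
- Imagine what words could appear in that document
- Formulate a query using those words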

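15 Stochastic Language Models
Models probability of generating strings in the language (commonly all strings over the alphabet).
Model M:
0.2 the
0.1 a
0.01 man
0.01 woman
0.03 said
0.02 likes
For s = "the man likes the woman", multiply the per-word probabilities: $P(s \mid M) = 0.2 \times 0.01 \times 0.02 \times 0.2 \times 0.01 = 8 \times 10^{-8}$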
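The multiplication on this slide can be checked directly. A minimal sketch, assuming whitespace tokenization and assigning probability zero to any word missing from the table (both are choices made here for illustration):

```python
import math

# Unigram model M from the slide.
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def string_prob(s, model):
    """P(s | M) under a unigram model: the product of per-word probabilities."""
    return math.prod(model.get(w, 0.0) for w in s.split())

p = string_prob("the man likes the woman", M)
```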

16 Stochastic Language Models
Model probability of generating any string.
Model M1: 0.2 the, 0.01 class, … sayst, … pleaseth, … yon, … maiden, 0.01 woman
Model M2: 0.2 the, … class, 0.03 sayst, 0.02 pleaseth, 0.1 yon, 0.01 maiden, … woman
For s = "the class pleaseth yon maiden": $P(s \mid M_2) > P(s \mid M_1)$

17 Using Language Models in IR
Treat each document as the basis for a model (e.g., unigram sufficient statistics).
Rank document d based on $P(d \mid q)$:
$P(d \mid q) = P(q \mid d) \times P(d) / P(q)$
- $P(q)$ is the same for all documents, so ignore
- $P(d)$ [the prior] is often treated as the same for all d, but we could use criteria like authority, length, genre
- $P(q \mid d)$ is the probability of q given d's model
Very general formal approach

18 Language Models for IR
Language Modeling Approaches:
- Attempt to model the query generation process
- Documents are ranked by the probability that a query would be observed as a random sample from the respective document model
- Multinomial approach

19 Retrieval based on probabilistic LM
Treat the generation of queries as a random process.
Approach:
- Infer a language model for each document.
- Estimate the probability of generating the query according to each of these models.
- Rank the documents according to these probabilities.
- Usually a unigram estimate of words is used

20 Retrieval based on probabilistic LM
Intuition
Users:
- Have a reasonable idea of terms that are likely to occur in documents of interest.
- They will choose query terms that distinguish these documents from others in the collection.
Collection statistics:
- Are integral parts of the language model.
- Are not used heuristically as in many other approaches.
- In theory. In practice, there is usually some wiggle room for empirically set parameters

21 Query generation probability (1)
Ranking formula: $p(Q, d) = p(d)\, p(Q \mid d) \approx p(d)\, p(Q \mid M_d)$
The probability of producing the query given the language model of document d using MLE is:
$\hat{p}(Q \mid M_d) = \prod_{t \in Q} \hat{p}_{ml}(t \mid M_d) = \prod_{t \in Q} \frac{tf_{t,d}}{dl_d}$
Unigram assumption: given a particular language model, the query terms occur independently.
- $M_d$: language model of document d
- $tf_{t,d}$: raw tf of term t in document d
- $dl_d$: total number of tokens in document d
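The MLE query likelihood is a product of relative term frequencies, which a short sketch makes concrete. The two toy documents, the query, and the function name below are illustrative assumptions; exact fractions are used so the arithmetic is easy to follow.

```python
from collections import Counter
from fractions import Fraction

def query_likelihood_mle(query_terms, doc_tokens):
    """p(Q | M_d) = prod over query terms of tf(t, d) / dl(d), with no smoothing."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    p = Fraction(1)
    for t in query_terms:
        p *= Fraction(tf[t], dl)
    return p

# Toy documents (hypothetical), tokenized on whitespace.
d1 = "Xerox reports a profit but revenue is down".split()
d2 = "Lucent narrows quarter loss but revenue decreases further".split()
query = ["revenue", "down"]

p1 = query_likelihood_mle(query, d1)  # both terms occur once in 8 tokens
p2 = query_likelihood_mle(query, d2)  # "down" is missing, so the product is 0
```

The second document scores exactly zero because a single query term is absent, which is the problem that smoothing addresses.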

22 Insufficient data
Zero probability: $p(t \mid M_d) = 0$
- May not wish to assign a probability of zero to a document that is missing one or more of the query terms [gives conjunction semantics]
General approach:
- A non-occurring term is possible, but no more likely than would be expected by chance in the collection.
- If $tf_{t,d} = 0$, then $p(t \mid M_d) = \frac{cf_t}{cs}$
- $cs$: raw collection size (total number of tokens in the collection)
- $cf_t$: raw count of term t in the collection

23 Insufficient data
Zero probabilities spell disaster:
- We need to smooth probabilities
- Discount nonzero probabilities
- Give some probability mass to unseen things
There is a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1, ½ or ε to counts, Dirichlet priors, discounting, and interpolation.
A simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution.

24 Mixture model
$P(w \mid d) = \lambda P_{mle}(w \mid M_d) + (1 - \lambda) P_{mle}(w \mid M_c)$
- Mixes the probability from the document with the general collection frequency of the word.
- Correctly setting λ is very important
- A high value of λ makes the search "conjunctive-like": suitable for short queries
- A low value is more suitable for long queries
- Can tune λ to optimize performance; perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
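One way to make λ depend on document size is Dirichlet prior smoothing. The slides only name it, so the formula used below, $(tf + \mu\, p_c) / (dl + \mu)$, is the commonly cited form and is an assumption of this sketch, as are the function names and the value of μ. The sketch shows that it equals the mixture model with $\lambda_d = dl / (dl + \mu)$.

```python
def mixture_prob(tf, dl, p_coll, lam):
    """Fixed-lambda mixture: lam * tf/dl + (1 - lam) * p_coll."""
    return lam * tf / dl + (1 - lam) * p_coll

def dirichlet_prob(tf, dl, p_coll, mu):
    """Dirichlet prior smoothing (assumed standard form): (tf + mu * p_coll) / (dl + mu)."""
    return (tf + mu * p_coll) / (dl + mu)

def dirichlet_lambda(dl, mu):
    """Document weight implied by Dirichlet smoothing; grows with document length."""
    return dl / (dl + mu)

# Hypothetical numbers: a term seen once in an 8-token document,
# with collection probability 1/16 and mu = 8.
tf, dl, p_coll, mu = 1, 8, 1 / 16, 8
lam_d = dirichlet_lambda(dl, mu)
p_dir = dirichlet_prob(tf, dl, p_coll, mu)
p_mix = mixture_prob(tf, dl, p_coll, lam_d)
```

Longer documents get a larger λ, so their own term statistics are trusted more and the collection model less.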

25 Basic mixture model summary
General formulation of the LM for IR:
$p(Q, d) = p(d) \prod_{t \in Q} \left( (1 - \lambda)\, p(t) + \lambda\, p(t \mid M_d) \right)$
- $p(t)$: general language model
- $p(t \mid M_d)$: individual-document model
The user has a document in mind, and generates the query from this document. The equation represents the probability that the document that the user had in mind was in fact this one.

26 Example
Document collection (2 documents):
- d1: Xerox reports a profit but revenue is down
- d2: Lucent narrows quarter loss but revenue decreases further
Model: MLE unigram from documents; λ = ½
Query: revenue down
$P(Q \mid d_1) = [(1/8 + 2/16)/2] \times [(1/8 + 1/16)/2] = 1/8 \times 3/32 = 3/256$
$P(Q \mid d_2) = [(1/8 + 2/16)/2] \times [(0 + 1/16)/2] = 1/8 \times 1/32 = 1/256$
Ranking: d1 > d2
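The example can be reproduced with exact fractions. This is a minimal sketch: the function and variable names are illustrative, tokenization is plain whitespace splitting, and the collection model is estimated from the two documents together, as the slide's 2/16 and 1/16 terms imply.

```python
from collections import Counter
from fractions import Fraction

docs = {
    "d1": "Xerox reports a profit but revenue is down".split(),
    "d2": "Lucent narrows quarter loss but revenue decreases further".split(),
}

# Collection statistics: term counts and total number of tokens (16 here).
collection = Counter(t for tokens in docs.values() for t in tokens)
cs = sum(collection.values())

def mixture_score(query, tokens, lam=Fraction(1, 2)):
    """prod over query terms of lam * tf/dl + (1 - lam) * cf/cs."""
    tf = Counter(tokens)
    dl = len(tokens)
    score = Fraction(1)
    for t in query.split():
        score *= lam * Fraction(tf[t], dl) + (1 - lam) * Fraction(collection[t], cs)
    return score

scores = {name: mixture_score("revenue down", tokens) for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Unlike the unsmoothed estimate, d2 keeps a nonzero score even though it does not contain "down".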

27 Ponte and Croft Experiments
Data:
- TREC topics 202-250 on TREC disks 2 and 3: natural language queries consisting of one sentence each
- TREC topics on TREC disk 3 using the concept fields: lists of good terms
<num>number: 054 <dom>domain: International Economics <title>topic: Satellite Launch Contracts <desc>description: </desc> <con>concepts: 1. Contract, agreement 2. Launch vehicle, rocket, payload, satellite Launch services, </con>

28 Precision/recall results

29 Precision/recall results

30 LM vs. Prob. Model for IR
- The main difference is whether "Relevance" figures explicitly in the model or not
- LM approach attempts to do away with modeling relevance
- LM approach assumes that documents and expressions of information problems are of the same type
- Computationally tractable, intuitively appealing

31 LM vs. Prob. Model for IR
Problems of basic LM approach:
- Assumption of equivalence between document and information problem representation is unrealistic
- Very simple models of language
- Can't easily accommodate phrases, passages, Boolean operators
Several extensions: putting relevance back into the model, query expansion, term dependencies, etc.

32 Alternative Models of Text Generation
[Diagram: Searcher → $P(M \mid \text{Searcher})$ → Query Model → $P(\text{Query} \mid M)$ → Query; Writer → $P(M \mid \text{Writer})$ → Doc Model → $P(\text{Doc} \mid M)$ → Doc. Is this the same model?]

33 Model Comparison
- Estimate query and document models and compare
- Suitable measure is KL divergence $D(Q_m \,\|\, D_m)$
- Equivalent to query-likelihood approach if simple empirical distribution used for query model (why?)
- More general risk minimization framework has been proposed (Zhai and Lafferty 2001)
- Better results than query-likelihood
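The "why?" can be worked through in a few lines. This is a sketch under assumed notation that does not appear on the slide: $P(w \mid Q_m)$ and $P(w \mid D_m)$ are the query and document model probabilities, $c(w, Q)$ is the count of word $w$ in the query, $|Q|$ is the query length, and $H(Q_m)$ is the entropy of the query model. The third line uses the empirical query model $P(w \mid Q_m) = c(w, Q) / |Q|$.

```latex
\begin{align*}
D(Q_m \,\|\, D_m)
  &= \sum_{w} P(w \mid Q_m) \log \frac{P(w \mid Q_m)}{P(w \mid D_m)} \\
  &= -H(Q_m) - \sum_{w} P(w \mid Q_m) \log P(w \mid D_m) \\
  &= -H(Q_m) - \frac{1}{|Q|} \sum_{w} c(w, Q) \log P(w \mid D_m) \\
  &= -H(Q_m) - \frac{1}{|Q|} \log \prod_{t \in Q} P(t \mid D_m)
\end{align*}
```

Since $H(Q_m)$ and $|Q|$ do not depend on the document, ranking documents by increasing divergence gives the same order as ranking them by decreasing query likelihood $\prod_{t \in Q} P(t \mid D_m)$.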

34 Comparison With Vector Space
There is some relation to traditional tf.idf models:
- (unscaled) term frequency is directly in the model
- the probabilities do length normalization of term frequencies
- the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking

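35 Comparison: LM vs. tf*idf
$P(Q \mid D) = \prod_{q_i \in Q} P(q_i \mid D)$
$= \prod_{q_i \in Q \cap D} \left[ \lambda \frac{tf(q_i, D)}{|D|} + (1 - \lambda) \frac{tf(q_i, C)}{|C|} \right] \times \prod_{q_j \in Q - D} (1 - \lambda) \frac{tf(q_j, C)}{|C|}$
$= \prod_{q_i \in Q \cap D} \frac{\lambda \frac{tf(q_i, D)}{|D|} + (1 - \lambda) \frac{tf(q_i, C)}{|C|}}{(1 - \lambda) \frac{tf(q_i, C)}{|C|}} \times \prod_{q_j \in Q} (1 - \lambda) \frac{tf(q_j, C)}{|C|}$
$= const \times \prod_{q_i \in Q \cap D} \left[ \frac{\lambda}{1 - \lambda} \cdot \frac{tf(q_i, D)}{|D|} \Big/ \frac{tf(q_i, C)}{|C|} + 1 \right]$
Log P(Q|D) ~ VSM with tf*idf and document length normalization
Smoothing ~ idf + length normalization (dividing by $\frac{tf(q_i, C)}{|C|}$ plays the role of idf)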
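The rewriting can be checked numerically: the log query likelihood should equal a document-independent constant plus a sum over matched terms only. A minimal sketch, assuming every query term occurs at least once in the collection (otherwise the logarithm is undefined); the toy documents and function names are illustrative.

```python
import math
from collections import Counter

def log_query_likelihood(query, doc, coll, lam):
    """log P(Q | D) under the mixture of document and collection models."""
    tf, cf = Counter(doc), Counter(coll)
    return sum(
        math.log(lam * tf[q] / len(doc) + (1 - lam) * cf[q] / len(coll))
        for q in query
    )

def matched_term_score(query, doc, coll, lam):
    """Sum over query terms present in the document of
    log(1 + lam/(1-lam) * (tf/|D|) / (cf/|C|))."""
    tf, cf = Counter(doc), Counter(coll)
    return sum(
        math.log(1 + lam / (1 - lam) * (tf[q] / len(doc)) / (cf[q] / len(coll)))
        for q in query
        if tf[q] > 0
    )

def query_constant(query, coll, lam):
    """Document-independent part: sum over all query terms of log((1-lam) * cf/|C|)."""
    cf = Counter(coll)
    return sum(math.log((1 - lam) * cf[q] / len(coll)) for q in query)

# Toy data (hypothetical).
d1 = "Xerox reports a profit but revenue is down".split()
d2 = "Lucent narrows quarter loss but revenue decreases further".split()
coll = d1 + d2
query = ["revenue", "down"]
lam = 0.5

const = query_constant(query, coll, lam)
full = {name: log_query_likelihood(query, d, coll, lam) for name, d in [("d1", d1), ("d2", d2)]}
partial = {name: matched_term_score(query, d, coll, lam) for name, d in [("d1", d1), ("d2", d2)]}
```

Because the constant is the same for every document, ranking by the matched-term sum gives the same order as ranking by the full log likelihood, and only terms shared by query and document need to be touched at query time.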

36 Comparison With Vector Space
Similar in some ways:
- Term weights based on frequency
- Terms often used as if they were independent
- Inverse document/collection frequency used
- Some form of length normalization useful
Different in others:
- Based on probability rather than similarity
- Intuitions are probabilistic rather than geometric
- Details of use of document length and term, document, and collection frequency differ

37 Resources
- J.M. Ponte and W.B. Croft. A language modeling approach to information retrieval. In SIGIR 21.
- D. Hiemstra. A linguistically motivated probabilistic model of information retrieval. ECDL 2.
- A. Berger and J. Lafferty. Information retrieval as statistical translation. SIGIR 22.
- D.R.H. Miller, T. Leek, and R.M. Schwartz. A hidden Markov model information retrieval system. SIGIR 22.
- Chengxiang Zhai. Statistical language models for information retrieval. In the series of Synthesis Lectures on Human Language Technologies, Morgan & Claypool, 2009.
- [Several relevant newer papers at SIGIR 2000-now.]
- Workshop on Language Modeling and Information Retrieval, CMU.
- The Lemur Toolkit for Language Modeling and Information Retrieval. CMU/UMass LM and IR system in C++, currently actively developed.
