Retrieval Models: Language models


1 CS-590I Information Retrieval. Retrieval Models: Language Models. Luo Si, Department of Computer Science, Purdue University

2 Outline: Introduction to language models; Unigram language model; Document language model estimation (Maximum Likelihood estimation, Maximum a posteriori estimation); Jelinek-Mercer smoothing; Model-based feedback

3 Vector space model for information retrieval: documents and queries are vectors in the term space, and relevance is measured by the similarity between document vectors and the query vector. Problems of the vector space model: ad-hoc term weighting schemes; ad-hoc similarity measurement; no justification of the relationship between relevance and similarity. We need more principled retrieval models.

4 A language model can be created for any language sample: a document; a collection of documents; a sentence, paragraph, chapter, or query. The size of the language sample affects the quality of the language model: long documents have more accurate models, short documents have less accurate models, and a model for a sentence, paragraph, or query may not be reliable.

5 A document language model defines a probability distribution over indexed terms, e.g., the probability of generating a term; the probabilities sum to 1. A query can be seen as observed data from unknown models; a query also defines a language model (more on this later). How might the models be used for IR? Rank documents by Pr(q|d), or rank documents by the Kullback-Leibler (KL) divergence between the language models of q and d (covered later).

6 Example: estimate a language model for each document, then generate retrieval results by the generation probability Pr(q|d). Query q = {sport, basketball}. Documents: d1 = {sport, basketball, ticket, sport}; d2 = {basketball, ticket, finance, ticket, sport}; d3 = {stock, finance, finance, stock}. Each document d_i gets its own language model, and documents are ranked by Pr(q|d_i).

7 Three basic problems for language models: What type of probabilistic distribution can be used to construct language models? How to estimate the parameters of the distribution of the language models? How to compute the likelihood of generating queries given the language models of documents?

8 A language model can be built as a multinomial distribution over single terms (i.e., unigrams) in the vocabulary. Example: five words in the vocabulary (sport, basketball, ticket, finance, stock); for a document d, its language model is {P(sport), P(basketball), P(ticket), P(finance), P(stock)}. Formally, the language model is {P(w) for every word w in the vocabulary V}, with Σ_w P(w) = 1 and 0 ≤ P(w) ≤ 1.

9 Estimating a multinomial model for each document: d1 = {sport, basketball, ticket, sport}; d2 = {basketball, ticket, finance, ticket, sport}; d3 = {stock, finance, finance, stock}.

10 Maximum Likelihood Estimation: find the model parameters that make the generation likelihood reach its maximum: M* = argmax_M Pr(D|M). There are K words in the vocabulary, w_1 ... w_K (e.g., K = 5). Data: one document d with counts tf(w_1), ..., tf(w_K) and length |d|. Model: a multinomial M with parameters {p(w_i)}. Likelihood: Pr(d|M); M* = argmax_M Pr(d|M).

11 Maximum Likelihood Estimation (derivation, using the Lagrange multiplier approach):

p(d|M) = (|d|! / (tf(w_1)! ... tf(w_K)!)) Π_{i=1}^{K} p(w_i)^{tf(w_i)}

l(d|M) = log p(d|M) = Σ_i tf(w_i) log p(w_i) + const

Add the constraint Σ_i p(w_i) = 1 with multiplier λ:

l'(d|M) = Σ_i tf(w_i) log p(w_i) + λ (Σ_i p(w_i) − 1)

Set the partial derivatives to zero:

∂l'/∂p(w_i) = tf(w_i)/p(w_i) + λ = 0, so p(w_i) = −tf(w_i)/λ

Since Σ_i p(w_i) = 1, λ = −Σ_i tf(w_i) = −|d|. The maximum likelihood estimate is therefore p(w_i) = tf(w_i)/|d|.
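The closed-form result p(w_i) = tf(w_i)/|d| is a one-line computation; here is a minimal sketch (function and variable names are illustrative, not from the slides):

```python
from collections import Counter

def mle_unigram(doc_terms):
    """Maximum-likelihood unigram model: p(w) = tf(w) / |d|."""
    counts = Counter(doc_terms)
    length = len(doc_terms)
    return {w: tf / length for w, tf in counts.items()}

# Example document d1 from the next slide:
p = mle_unigram(["sport", "basketball", "ticket", "sport"])
# p["sport"] = 0.5, p["basketball"] = 0.25, p["ticket"] = 0.25
```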

12 Maximum Likelihood Estimation examples (term order: sport, basketball, ticket, finance, stock):
d1 = {sport, basketball, ticket, sport}: (p_sp, p_b, p_t, p_f, p_st) = (0.5, 0.25, 0.25, 0, 0)
d2 = {basketball, ticket, finance, ticket, sport}: (p_sp, p_b, p_t, p_f, p_st) = (0.2, 0.2, 0.4, 0.2, 0)
d3 = {stock, finance, finance, stock}: (p_sp, p_b, p_t, p_f, p_st) = (0, 0, 0, 0.5, 0.5)

13 Maximum Likelihood Estimation assigns zero probability to words unseen in a small sample. A specific example: only two words in the vocabulary, w_1 = sport and w_2 = business, like (head, tail) for a coin; a document generates a sequence of the two words, like flipping the coin many times:

Pr(d|M) = (|d| choose tf(w_1)) p(w_1)^{tf(w_1)} (1 − p(w_1))^{tf(w_2)}

Observe only two words (flip the coin twice), and the MLE estimates are:
business sport: P(w_1) = 0.5
sport sport: P(w_1) = 1?
business business: P(w_1) = 0?

14 A specific example: observe only two words (flip the coin twice), and the MLE estimates are: business sport: P*(w_1) = 0.5; sport sport: P*(w_1) = 1?; business business: P*(w_1) = 0? This is the data sparseness problem.

15 Solutions to data sparseness: Maximum a posteriori (MAP) estimation; shrinkage; the Bayesian ensemble approach.

16 Maximum A Posteriori Estimation: select the model that maximizes the probability of the model given the observed data: M* = argmax_M Pr(M|D) = argmax_M Pr(D|M) Pr(M). Pr(M) is the prior belief/knowledge; use the prior Pr(M) to avoid zero probabilities. A specific example with only two words in the vocabulary (sport, business): for a document d,

Pr(M|d) ∝ p(w_1)^{tf(w_1)} (1 − p(w_1))^{tf(w_2)} Pr(M)

where Pr(M) is the prior distribution.

17 Maximum A Posteriori Estimation: introduce a prior on the multinomial distribution; use the prior Pr(M) to avoid zero probabilities (most coins are more or less unbiased). Use a Dirichlet prior on p(w):

Dir(p | α_1, ..., α_K) = (Γ(α_1 + ... + α_K) / (Γ(α_1) ... Γ(α_K))) Π_{i=1}^{K} p(w_i)^{α_i − 1}, with Σ_i p(w_i) = 1 and 0 ≤ p(w_i) ≤ 1

The α_i are hyper-parameters, and the Gamma-function ratio is a constant with respect to p. Γ(x) is the gamma function Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt, with Γ(n + 1) = n! for integer n.

18 For the two-word example, a Dirichlet prior with α_1 = α_2 = 3:

Pr(M) ∝ p(w_1)^2 (1 − p(w_1))^2

19 Maximum A Posteriori: M* = argmax_M Pr(M|D) = argmax_M Pr(D|M) Pr(M):

Pr(d|M) Pr(M) ∝ p(w_1)^{tf(w_1)} (1 − p(w_1))^{tf(w_2)} · p(w_1)^{α_1 − 1} (1 − p(w_1))^{α_2 − 1}
             = p(w_1)^{tf(w_1) + α_1 − 1} (1 − p(w_1))^{tf(w_2) + α_2 − 1}

M* = argmax_{p(w_1)} p(w_1)^{tf(w_1) + α_1 − 1} (1 − p(w_1))^{tf(w_2) + α_2 − 1}

The terms α_i − 1 act as pseudo counts.

20 A specific example: observe only two words (flip a coin twice): sport sport. Is P*(w_1) = 1? With the prior Pr(M) ∝ p(w_1)^2 (1 − p(w_1))^2, the posterior is proportional to p(w_1)^{2+2} (1 − p(w_1))^{0+2}.

21 A specific example: observe only two words (flip a coin twice): sport sport. Is P*(w_1) = 1? No:

p*(w_1) = (tf(w_1) + α_1 − 1) / (tf(w_1) + α_1 − 1 + tf(w_2) + α_2 − 1) = (2 + 2) / (2 + 2 + 0 + 2) = 4/6 ≈ 0.67

(using α_1 = α_2 = 3 as on slide 18).
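A quick sanity check of the pseudo-count arithmetic above (assuming the α_1 = α_2 = 3 prior from slide 18; function name is illustrative):

```python
def map_estimate(tf1, tf2, alpha1, alpha2):
    """MAP estimate for a two-word vocabulary: pseudo counts alpha_i - 1."""
    return (tf1 + alpha1 - 1) / ((tf1 + alpha1 - 1) + (tf2 + alpha2 - 1))

# Observing "sport sport": tf(w1) = 2, tf(w2) = 0
p_sport = map_estimate(2, 0, 3, 3)  # (2 + 2) / (2 + 2 + 0 + 2) = 2/3
```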

22 Maximum A Posteriori Estimation: use a Dirichlet prior for the multinomial distribution. How should the parameters of the Dirichlet prior be set?

23 Maximum A Posteriori Estimation: use a Dirichlet prior for the multinomial distribution. There are K terms in the vocabulary, and the multinomial is p = {p(w_1), ..., p(w_K)}, with Σ_i p(w_i) = 1 and 0 ≤ p(w_i) ≤ 1:

Dir(p | α_1, ..., α_K) = (Γ(α_1 + ... + α_K) / (Γ(α_1) ... Γ(α_K))) Π_{i=1}^{K} p(w_i)^{α_i − 1}

The α_i are hyper-parameters; the Gamma-function ratio is a constant with respect to p.

24 MAP estimation for the unigram language model:

p* = argmax_p (Γ(α_1 + ... + α_K) / (Γ(α_1) ... Γ(α_K))) Π_i p(w_i)^{tf(w_i) + α_i − 1}
   = argmax_p Π_i p(w_i)^{tf(w_i) + α_i − 1}, s.t. Σ_i p(w_i) = 1, 0 ≤ p(w_i) ≤ 1

Using the Lagrange multiplier approach and setting the derivative to zero:

p*(w_i) = (tf(w_i) + α_i − 1) / Σ_j (tf(w_j) + α_j − 1)

The pseudo counts are set by the hyper-parameters.

25 MAP estimation for the unigram language model (Lagrange multiplier; derivative set to zero):

p*(w_i) = (tf(w_i) + α_i − 1) / Σ_j (tf(w_j) + α_j − 1)

How to determine appropriate values for the hyper-parameters? When nothing is observed from a document:

p*(w_i) = (α_i − 1) / Σ_j (α_j − 1)

What is the most likely p(w_i) without looking at the content of the document?

26 MAP estimation for the unigram language model: what is the most likely p(w_i) without looking at the content of the document? It is the unigram probability of the collection: {p(w_1|c), p(w_2|c), ..., p(w_K|c)}. Without any other information, guess the behavior of one member from the behavior of the whole population:

p*(w_i) = (α_i − 1) / Σ_j (α_j − 1) = p(w_i|c), i.e., α_i − 1 = μ p(w_i|c)

where μ = Σ_j (α_j − 1) is a constant.

27 MAP estimation for the unigram language model, with α_i − 1 = μ p_c(w_i):

p* = argmax_p Π_i p(w_i)^{tf(w_i) + μ p_c(w_i)}, s.t. Σ_i p(w_i) = 1, 0 ≤ p(w_i) ≤ 1

Using the Lagrange multiplier approach and setting the derivative to zero:

p*(w_i) = (tf(w_i) + μ p_c(w_i)) / (Σ_j tf(w_j) + μ) = (tf(w_i) + μ p_c(w_i)) / (|d| + μ)

The terms μ p_c(w_i) are pseudo counts, and μ acts as a pseudo document length.

28 Dirichlet MAP estimation for the unigram language model. Step 0: compute the word probabilities on the whole collection (the collection unigram language model): p_c(w_i) = tf_C(w_i) / |C|, where tf_C(w_i) is the count of w_i in the collection C. Step 1: for each document d, compute its smoothed unigram language model (Dirichlet smoothing):

p(w_i|d) = (tf(w_i) + μ p_c(w_i)) / (|d| + μ)

29 Dirichlet MAP estimation for the unigram language model. Step 2: for a given query q = {tf_q(w_1), ..., tf_q(w_K)}, compute the likelihood for each document d:

p(q|d) = Π_{i=1}^{K} p(w_i|d)^{tf_q(w_i)} = Π_{i=1}^{K} ((tf(w_i) + μ p_c(w_i)) / (|d| + μ))^{tf_q(w_i)}

The larger the likelihood, the more relevant the document is to the query.
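Steps 0-2 can be sketched end to end; the corpus is the slide's three toy documents, and the μ value is illustrative (chosen small so document evidence dominates):

```python
import math
from collections import Counter

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}

# Step 0: collection unigram model p_c(w) = tf_C(w) / |C|
all_terms = [w for d in docs.values() for w in d]
coll_counts = Counter(all_terms)
p_c = {w: c / len(all_terms) for w, c in coll_counts.items()}

def log_p_q_given_d(query, doc, mu=2.0):
    # Steps 1-2: Dirichlet-smoothed p(w|d), then log p(q|d) = sum_i log p(w_i|d)
    tf = Counter(doc)
    return sum(math.log((tf[w] + mu * p_c[w]) / (len(doc) + mu)) for w in query)

query = ["sport", "basketball"]
scores = {name: log_p_q_given_d(query, d) for name, d in docs.items()}
# d1 mentions both query terms most often, so it scores highest
```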

30 Dirichlet smoothing vs. TF-IDF weighting. Dirichlet smoothing:

p(q|d) = Π_{i=1}^{K} ((tf(w_i) + μ p_c(w_i)) / (|d| + μ))^{tf_q(w_i)}

TF-IDF weighting:

sim(q, d) = Σ_{i=1}^{K} tf_q(w_i) · tf(w_i) · idf(w_i) / norm(d)

How are they related?

31 Dirichlet smoothing vs. TF-IDF weighting. Taking the log of the Dirichlet query likelihood (|q| = Σ_i tf_q(w_i)):

log p(q|d) = Σ_i tf_q(w_i) log(1 + tf(w_i) / (μ p_c(w_i))) − |q| log(|d| + μ) + Σ_i tf_q(w_i) log(μ p_c(w_i))

TF-IDF weighting: sim(q, d) = Σ_{i=1}^{K} tf_q(w_i) · tf(w_i) · idf(w_i) / norm(d)

32 Dirichlet smoothing, derivation:

p(q|d) = Π_{i=1}^{K} ((tf(w_i) + μ p_c(w_i)) / (|d| + μ))^{tf_q(w_i)}

log p(q|d) = Σ_i tf_q(w_i) { log(tf(w_i) + μ p_c(w_i)) − log(|d| + μ) }
           = Σ_i tf_q(w_i) { log((μ p_c(w_i) + tf(w_i)) / (μ p_c(w_i))) + log(μ p_c(w_i)) − log(|d| + μ) }
           = Σ_i tf_q(w_i) { log(1 + tf(w_i) / (μ p_c(w_i))) + log(μ p_c(w_i)) − log(|d| + μ) }

33 Dirichlet smoothing vs. TF-IDF weighting. The part Σ_i tf_q(w_i) log(μ p_c(w_i)) does not depend on the document, so it is irrelevant for ranking; dropping it:

log p(q|d) ∝ Σ_i tf_q(w_i) log(1 + tf(w_i) / (μ p_c(w_i))) − |q| log(|d| + μ)

Compare with TF-IDF weighting: sim(q, d) = Σ_{i=1}^{K} tf_q(w_i) · tf(w_i) · idf(w_i) / norm(d)

34 Dirichlet smoothing: look at the tf.idf-like part

log(1 + tf(w_i) / (μ p_c(w_i)))

It grows with tf(w_i), a TF effect, and grows as p_c(w_i) shrinks, an IDF-like effect: words common in the collection get discounted. The −|q| log(|d| + μ) term plays the role of document length normalization.
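The TF- and IDF-like behavior of this term can be checked numerically (the tf, p_c, and μ values below are illustrative):

```python
import math

def term_weight(tf, p_c, mu=2000.0):
    # Per-query-occurrence contribution: log(1 + tf(w) / (mu * p_c(w)))
    return math.log(1.0 + tf / (mu * p_c))

# TF effect: more occurrences in the document -> larger weight
assert term_weight(5, 0.01) > term_weight(1, 0.01)
# IDF-like effect: rarer in the collection (smaller p_c) -> larger weight
assert term_weight(2, 0.001) > term_weight(2, 0.1)
```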

35 Dirichlet smoothing: setting the hyper-parameter μ in

p(w_i|d) = (tf(w_i) + μ p_c(w_i)) / (|d| + μ)

When μ is very small, the estimate approaches the MLE estimator; when μ is very large, it approaches the probability on the whole collection. How to set an appropriate μ?

36 Leave-one-out validation: for each word occurrence w_j in document d, estimate its probability from d with that occurrence left out:

p(w_j | d \ w_j) = (tf(w_j) − 1 + μ p_c(w_j)) / (|d| − 1 + μ)

37 Leave-one-out validation: the leave-one-out log-likelihood of a document, and of the whole collection, is

l(μ, d) = Σ_{j=1}^{|d|} log((tf(w_j) − 1 + μ p_c(w_j)) / (|d| − 1 + μ))

l(μ, C) = Σ_{d∈C} Σ_{j=1}^{|d|} log((tf(w_j) − 1 + μ p_c(w_j)) / (|d| − 1 + μ))

μ* = argmax_μ l(μ, C)
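The leave-one-out objective can be maximized by a simple grid search; a sketch using the slide's toy documents (the candidate μ values are illustrative):

```python
import math
from collections import Counter

docs = [
    ["sport", "basketball", "ticket", "sport"],
    ["basketball", "ticket", "finance", "ticket", "sport"],
    ["stock", "finance", "finance", "stock"],
]
all_terms = [w for d in docs for w in d]
p_c = {w: c / len(all_terms) for w, c in Counter(all_terms).items()}

def loo_log_likelihood(mu):
    # l(mu, C): sum over documents and word occurrences of the
    # leave-one-out log-probability from slide 37
    total = 0.0
    for d in docs:
        tf = Counter(d)
        for w in d:
            total += math.log((tf[w] - 1 + mu * p_c[w]) / (len(d) - 1 + mu))
    return total

candidates = [0.1, 1.0, 10.0, 100.0]
mu_star = max(candidates, key=loo_log_likelihood)
```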

38 What type of document/collection would get a large μ? One where most documents use similar vocabulary and wording patterns as the whole collection. What type of document/collection would get a small μ? One where most documents use different vocabulary and wording patterns than the whole collection.

39 Shrinkage. Maximum likelihood (MLE) builds the model purely on the document data and generates query words from it; the model may not be accurate when the document is short (many unseen words). A shrinkage estimator builds a more reliable model by consulting more general models (e.g., the collection language model). Example: estimate P(Lung_Cancer | Smoke) for West Lafayette by shrinking toward the estimates for Indiana and the U.S.

40 Jelinek-Mercer smoothing: assume each word is generated from the document language model (MLE) with probability λ, and from the collection language model (MLE) with probability 1 − λ, i.e., a linear interpolation between the document language model and the collection language model:

p(w_i) = λ tf(w_i)/|d| + (1 − λ) p_c(w_i)

41 Relationship between JM smoothing and Dirichlet smoothing:

p(w_i) = (tf(w_i) + μ p_c(w_i)) / (|d| + μ)
       = (1 / (|d| + μ)) (tf(w_i) + μ p_c(w_i))
       = (|d| / (|d| + μ)) · (tf(w_i)/|d|) + (μ / (|d| + μ)) · p_c(w_i)

This is JM smoothing, p(w_i) = λ tf(w_i)/|d| + (1 − λ) p_c(w_i), with λ = |d| / (|d| + μ).
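This identity is easy to verify numerically; a sketch with an illustrative document and collection model:

```python
from collections import Counter

doc = ["sport", "basketball", "ticket", "sport"]
p_c = {"sport": 3/13, "basketball": 2/13, "ticket": 3/13,
       "finance": 3/13, "stock": 2/13}
tf = Counter(doc)
mu = 10.0
lam = len(doc) / (len(doc) + mu)  # lambda = |d| / (|d| + mu)

for w in p_c:
    dirichlet = (tf[w] + mu * p_c[w]) / (len(doc) + mu)
    jm = lam * tf[w] / len(doc) + (1 - lam) * p_c[w]
    assert abs(dirichlet - jm) < 1e-12  # identical for every word
```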

42 KL-divergence retrieval model: retrieval based on the query generation likelihood is equivalent to retrieval based on the Kullback-Leibler (KL) divergence between the query and document language models. The KL divergence between two probability distributions is

KL(p || q) = Σ_x p(x) log(p(x) / q(x))

It measures the distance between two probability distributions, and it is always non-negative (how to prove it?).

43 Equivalence of retrieval based on query generation likelihood and KL divergence between the query and document language models:

Sim(q, d) = −KL(q || d) = −Σ_w q(w) log(q(w) / p(w|d))
          = Σ_w q(w) log p(w|d) − Σ_w q(w) log q(w)

The first term is the log-likelihood of the query generation probability; the second term is a document-independent constant. This generalizes the query representation to a distribution (fractional term weighting).
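A numerical check that ranking by −KL(q || d) agrees with ranking by the query-likelihood term Σ_w q(w) log p(w|d) (the toy distributions below are illustrative, with the document models already smoothed so all query words have non-zero probability):

```python
import math

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x)/q(x)); terms with p(x) = 0 contribute 0
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def query_loglik(p, q):
    # sum_x p(x) log q(x): the document-dependent part of -KL(p || q)
    return sum(px * math.log(q[x]) for x, px in p.items() if px > 0)

q_model = {"sport": 0.5, "basketball": 0.5}
d1 = {"sport": 0.4, "basketball": 0.3, "ticket": 0.3}
d2 = {"sport": 0.1, "basketball": 0.1, "finance": 0.8}

# d1 is closer to the query model, so it wins under both scores
assert kl(q_model, d1) < kl(q_model, d2)
assert query_loglik(q_model, d1) > query_loglik(q_model, d2)
assert kl(q_model, q_model) == 0.0  # KL is zero between identical models
```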

44 Two equivalent retrieval pipelines. Query likelihood: estimate a document language model for each d, then rank by the generation probability Pr(q|d). KL divergence: estimate a query language model for q and a document language model for each d, then rank by the KL divergence KL(q || d).

45 Model-based feedback: estimate a feedback query model q_F from the feedback documents in the initial results, then interpolate it with the original query model to get a new query model:

q' = (1 − α) q + α q_F

α = 0 means no feedback (q' = q); α = 1 means full feedback (q' = q_F). Retrieval results are then produced by calculating the KL divergence KL(q' || d).

46 Estimating the feedback query model q_F: assume a generative model produces each word within the feedback document(s). For each word, flip a coin: with probability λ the word is generated from the topic model q_F, and with probability 1 − λ from the collection (background) model p_C. Given λ, estimate q_F by maximum likelihood:

q_F* = argmax_{q_F} l(F | λ) = argmax_{q_F} Σ_w c(w, F) log(λ q_F(w) + (1 − λ) p_C(w))

where c(w, F) is the count of w in the feedback documents F.

47 Estimating q_F: for each word, there is a hidden variable telling which language model it comes from. Background model p_C(w) (known): the 0.12, to 0.05, it 0.04, a 0.02, sport ..., basketball .... Unknown query topic model p(w|θ_F): sport = ?, basketball = ?, game = ?, player = ?. Mixing weights: 1 − λ = 0.8 (background), λ = 0.2 (topic). If we knew the value of the hidden variable for each word, the MLE estimator would be straightforward; since we do not, use EM.

48 EM for q_F: for each word, the hidden variable is z ∈ {1 (feedback topic), 0 (background)}. Step 1 (E-step): estimate the hidden variables based on the current model parameters:

p(z = 1 | w) = p(z = 1) p(w | z = 1) / (p(z = 1) p(w | z = 1) + p(z = 0) p(w | z = 0))
             = λ q_F^(t)(w) / (λ q_F^(t)(w) + (1 − λ) p_C(w))

e.g., the (0.1), basketball (0.7), game (0.6), is (0.2). Step 2 (M-step): update the model parameters based on the guess in step 1:

q_F^(t+1)(w_i) = c(w_i, F) p(z = 1 | w_i) / Σ_j c(w_j, F) p(z = 1 | w_j)

49 The Expectation-Maximization (EM) algorithm. Step 0: initialize q_F^0 and fix λ (e.g., λ = 0.5). Step 1 (E-step):

p(z = 1 | w) = λ q_F^(t)(w) / (λ q_F^(t)(w) + (1 − λ) p_C(w))

Step 2 (M-step):

q_F^(t+1)(w_i) = c(w_i, F) p(z = 1 | w_i) / Σ_j c(w_j, F) p(z = 1 | w_j)

Iterate steps 1 and 2 until convergence.
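The EM loop above fits in a few lines; a sketch on a toy feedback set with an illustrative background model and λ = 0.5. As expected, q_F boosts the topical words above their raw frequency and suppresses the common word "the":

```python
from collections import Counter

def em_feedback(feedback_terms, p_c, lam=0.5, iters=100):
    """Estimate the feedback topic model q_F by the EM updates above."""
    counts = Counter(feedback_terms)
    vocab = list(counts)
    q_f = {w: 1.0 / len(vocab) for w in vocab}  # Step 0: uniform init
    for _ in range(iters):
        # E-step: p(z = 1 | w) for every word in the feedback documents
        post = {w: lam * q_f[w] / (lam * q_f[w] + (1 - lam) * p_c[w])
                for w in vocab}
        # M-step: re-estimate q_F from the expected topic counts
        norm = sum(counts[w] * post[w] for w in vocab)
        q_f = {w: counts[w] * post[w] / norm for w in vocab}
    return q_f

feedback = ["the"] * 10 + ["basketball"] * 5 + ["game"] * 3
p_c = {"the": 0.1, "basketball": 0.001, "game": 0.002}  # illustrative
q_f = em_feedback(feedback, p_c)
# q_f["basketball"] ends up above its raw frequency 5/18,
# q_f["the"] below its raw frequency 10/18
```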

50 Properties of the parameter λ: if λ is close to 0, most common words can be generated by the collection language model, so more topic words appear in the query language model; if λ is close to 1, the query language model has to generate most common words itself, so fewer topic words appear in the query language model.

51 Summary: Introduction to language models; Unigram language model; Document language model estimation (Maximum Likelihood estimation, Maximum a posteriori estimation); Jelinek-Mercer smoothing; Model-based feedback


More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

CSC401/2511 Spring CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto

CSC401/2511 Spring CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto CSC41/2511 Natural Language Computng Sprng 219 Lecture 5 Frank Rudzcz and Chloé Pou-Prom Unversty of Toronto Defnton of an HMM θ A hdden Markov model (HMM) s specfed by the 5-tuple {S, W, Π, A, B}: S =

More information

Properties of Least Squares

Properties of Least Squares Week 3 3.1 Smple Lnear Regresson Model 3. Propertes of Least Squares Estmators Y Y β 1 + β X + u weekly famly expendtures X weekly famly ncome For a gven level of x, the expected level of food expendtures

More information

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30 STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear

More information

Joint Statistical Meetings - Biopharmaceutical Section

Joint Statistical Meetings - Biopharmaceutical Section Iteratve Ch-Square Test for Equvalence of Multple Treatment Groups Te-Hua Ng*, U.S. Food and Drug Admnstraton 1401 Rockvlle Pke, #200S, HFM-217, Rockvlle, MD 20852-1448 Key Words: Equvalence Testng; Actve

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

First Year Examination Department of Statistics, University of Florida

First Year Examination Department of Statistics, University of Florida Frst Year Examnaton Department of Statstcs, Unversty of Florda May 7, 010, 8:00 am - 1:00 noon Instructons: 1. You have four hours to answer questons n ths examnaton.. You must show your work to receve

More information

Information Retrieval Language models for IR

Information Retrieval Language models for IR Informaton Retreval Language models for IR From Mannng and Raghavan s course [Borros sldes from Vktor Lavrenko and Chengxang Zha] 1 Recap Tradtonal models Boolean model Vector space model robablstc models

More information

Learning undirected Models. Instructor: Su-In Lee University of Washington, Seattle. Mean Field Approximation

Learning undirected Models. Instructor: Su-In Lee University of Washington, Seattle. Mean Field Approximation Readngs: K&F 0.3, 0.4, 0.6, 0.7 Learnng undrected Models Lecture 8 June, 0 CSE 55, Statstcal Methods, Sprng 0 Instructor: Su-In Lee Unversty of Washngton, Seattle Mean Feld Approxmaton Is the energy functonal

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

Gaussian Conditional Random Field Network for Semantic Segmentation - Supplementary Material

Gaussian Conditional Random Field Network for Semantic Segmentation - Supplementary Material Gaussan Condtonal Random Feld Networ for Semantc Segmentaton - Supplementary Materal Ravtea Vemulapall, Oncel Tuzel *, Mng-Yu Lu *, and Rama Chellappa Center for Automaton Research, UMIACS, Unversty of

More information

Overview. Hidden Markov Models and Gaussian Mixture Models. Acoustic Modelling. Fundamental Equation of Statistical Speech Recognition

Overview. Hidden Markov Models and Gaussian Mixture Models. Acoustic Modelling. Fundamental Equation of Statistical Speech Recognition Overvew Hdden Marov Models and Gaussan Mxture Models Steve Renals and Peter Bell Automatc Speech Recognton ASR Lectures &5 8/3 January 3 HMMs and GMMs Key models and algorthms for HMM acoustc models Gaussans

More information

Composite Hypotheses testing

Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

Machine Learning for Signal Processing Linear Gaussian Models

Machine Learning for Signal Processing Linear Gaussian Models Machne Learnng for Sgnal rocessng Lnear Gaussan Models lass 2. 2 Nov 203 Instructor: Bhsha Raj 2 Nov 203 755/8797 HW3 s up. Admnstrva rojects please send us an update 2 Nov 203 755/8797 2 Recap: MA stmators

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

Question Classification Using Language Modeling

Question Classification Using Language Modeling Queston Classfcaton Usng Language Modelng We L Center for Intellgent Informaton Retreval Department of Computer Scence Unversty of Massachusetts, Amherst, MA 01003 ABSTRACT Queston classfcaton assgns a

More information

MAXIMUM A POSTERIORI TRANSDUCTION

MAXIMUM A POSTERIORI TRANSDUCTION MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,

More information

Probability Theory (revisited)

Probability Theory (revisited) Probablty Theory (revsted) Summary Probablty v.s. plausblty Random varables Smulaton of Random Experments Challenge The alarm of a shop rang. Soon afterwards, a man was seen runnng n the street, persecuted

More information

Dirichlet Mixtures in Text Modeling

Dirichlet Mixtures in Text Modeling Drchlet Mxtures n Text Modelng Mko Yamamoto and Kugatsu Sadamtsu CS Techncal report CS-TR-05-1 Unversty of Tsukuba May 30, 2005 Abstract Word rates n text vary accordng to global factors such as genre,

More information

I529: Machine Learning in Bioinformatics (Spring 2017) Markov Models

I529: Machine Learning in Bioinformatics (Spring 2017) Markov Models I529: Machne Learnng n Bonformatcs (Sprng 217) Markov Models Yuzhen Ye School of Informatcs and Computng Indana Unversty, Bloomngton Sprng 217 Outlne Smple model (frequency & profle) revew Markov chan

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora prnceton unv. F 13 cos 521: Advanced Algorthm Desgn Lecture 3: Large devatons bounds and applcatons Lecturer: Sanjeev Arora Scrbe: Today s topc s devaton bounds: what s the probablty that a random varable

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1 Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons

More information

Power law and dimension of the maximum value for belief distribution with the max Deng entropy

Power law and dimension of the maximum value for belief distribution with the max Deng entropy Power law and dmenson of the maxmum value for belef dstrbuton wth the max Deng entropy Bngy Kang a, a College of Informaton Engneerng, Northwest A&F Unversty, Yanglng, Shaanx, 712100, Chna. Abstract Deng

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Mamum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models for

More information

Mean Field / Variational Approximations

Mean Field / Variational Approximations Mean Feld / Varatonal Appromatons resented by Jose Nuñez 0/24/05 Outlne Introducton Mean Feld Appromaton Structured Mean Feld Weghted Mean Feld Varatonal Methods Introducton roblem: We have dstrbuton but

More information

9 : Learning Partially Observed GM : EM Algorithm

9 : Learning Partially Observed GM : EM Algorithm 10-708: Probablstc Graphcal Models 10-708, Sprng 2012 9 : Learnng Partally Observed GM : EM Algorthm Lecturer: Erc P. Xng Scrbes: Mrnmaya Sachan, Phan Gadde, Vswanathan Srpradha 1 Introducton So far n

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Expected Value and Variance

Expected Value and Variance MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or

More information

Midterm Review. Hongning Wang

Midterm Review. Hongning Wang Mdterm Revew Hongnng Wang CS@UVa Core concepts Search Engne Archtecture Key components n a modern search engne Crawlng & Text processng Dfferent strateges for crawlng Challenges n crawlng Text processng

More information

CIE4801 Transportation and spatial modelling Trip distribution

CIE4801 Transportation and spatial modelling Trip distribution CIE4801 ransportaton and spatal modellng rp dstrbuton Rob van Nes, ransport & Plannng 17/4/13 Delft Unversty of echnology Challenge the future Content What s t about hree methods Wth specal attenton for

More information

E Tail Inequalities. E.1 Markov s Inequality. Non-Lecture E: Tail Inequalities

E Tail Inequalities. E.1 Markov s Inequality. Non-Lecture E: Tail Inequalities Algorthms Non-Lecture E: Tal Inequaltes If you hold a cat by the tal you learn thngs you cannot learn any other way. Mar Twan E Tal Inequaltes The smple recursve structure of sp lsts made t relatvely easy

More information

Learning from Data 1 Naive Bayes

Learning from Data 1 Naive Bayes Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why

More information

Computing MLE Bias Empirically

Computing MLE Bias Empirically Computng MLE Bas Emprcally Kar Wa Lm Australan atonal Unversty January 3, 27 Abstract Ths note studes the bas arses from the MLE estmate of the rate parameter and the mean parameter of an exponental dstrbuton.

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Stat 543 Exam 2 Spring 2016

Stat 543 Exam 2 Spring 2016 Stat 543 Exam 2 Sprng 206 I have nether gven nor receved unauthorzed assstance on ths exam. Name Sgned Date Name Prnted Ths Exam conssts of questons. Do at least 0 of the parts of the man exam. I wll score

More information

Why BP Works STAT 232B

Why BP Works STAT 232B Why BP Works STAT 232B Free Energes Helmholz & Gbbs Free Energes 1 Dstance between Probablstc Models - K-L dvergence b{ KL b{ p{ = b{ ln { } p{ Here, p{ s the eact ont prob. b{ s the appromaton, called

More information

Bayesian predictive Configural Frequency Analysis

Bayesian predictive Configural Frequency Analysis Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse

More information

Lecture 4: November 17, Part 1 Single Buffer Management

Lecture 4: November 17, Part 1 Single Buffer Management Lecturer: Ad Rosén Algorthms for the anagement of Networs Fall 2003-2004 Lecture 4: November 7, 2003 Scrbe: Guy Grebla Part Sngle Buffer anagement In the prevous lecture we taled about the Combned Input

More information

Statistical learning

Statistical learning Statstcal learnng Model the data generaton process Learn the model parameters Crteron to optmze: Lkelhood of the dataset (maxmzaton) Maxmum Lkelhood (ML) Estmaton: Dataset X Statstcal model p(x;θ) (θ parameters)

More information