Retrieval Models: Language models

CS-590I Information Retrieval. Retrieval Models: Language models. Luo Si, Department of Computer Science, Purdue University.

Outline: introduction to language models; the unigram language model; document language model estimation (maximum likelihood estimation, maximum a posteriori estimation, Jelinek-Mercer smoothing); model-based feedback.

Vector space model for information retrieval: documents and queries are vectors in the term space, and relevance is measured by the similarity between document vectors and the query vector. Problems with the vector space model: ad-hoc term weighting schemes, ad-hoc similarity measurement, and no justification of the relationship between relevance and similarity. We need more principled retrieval models.

A language model can be created for any language sample: a document, a collection of documents, or a sentence, paragraph, chapter, or query. The size of the language sample affects the quality of the language model: long documents yield more accurate models, short documents yield less accurate models, and a model built from a single sentence, paragraph, or query may not be reliable.

A document language model defines a probability distribution over indexed terms, e.g., the probability of generating a term; the probabilities sum to 1. A query can be seen as observed data from unknown models; a query also defines a language model (more on this later). How might the models be used for IR? Rank documents by $\Pr(q \mid d)$, or rank documents by the language models of q and d using the Kullback-Leibler (KL) divergence between the models (covered later).

Example: for the query q = {sport, basketball}, estimate a language model for each document, then estimate the generation probability $\Pr(q \mid d)$ to produce the retrieval results. Documents: d1 = {sport, basketball, ticket, sport}; d2 = {basketball, ticket, finance, ticket, sport}; d3 = {stock, finance, finance, stock}.

Three basic problems for language models: What type of probabilistic distribution can be used to construct language models? How do we estimate the parameters of the distribution of the language models? How do we compute the likelihood of generating queries given the language models of documents?

A language model is built with a multinomial distribution over single terms (i.e., unigrams) in the vocabulary. Example: five words in the vocabulary (sport, basketball, ticket, finance, stock). For a document d, its language model is $\{P_d(\text{sport}), P_d(\text{basketball}), P_d(\text{ticket}), P_d(\text{finance}), P_d(\text{stock})\}$. Formally, the language model is $\{P_d(w) \text{ for any word } w \text{ in vocabulary } V\}$, with $\sum_w P_d(w) = 1$ and $0 \le P_d(w) \le 1$.

Estimate a multinomial model for each document: d1 = {sport, basketball, ticket, sport}; d2 = {basketball, ticket, finance, ticket, sport}; d3 = {stock, finance, finance, stock}.

Maximum Likelihood Estimation: find the model parameters that maximize the generation likelihood, $M^* = \arg\max_M \Pr(D \mid M)$. There are K words in the vocabulary, $w_1, \ldots, w_K$ (e.g., K = 5). Data: one document d with counts $tf_d(w_1), \ldots, tf_d(w_K)$ and length $|d|$. Model: a multinomial M with parameters $\{p_d(w_i)\}$. Likelihood: $\Pr(d \mid M)$, and $M^* = \arg\max_M \Pr(d \mid M)$.

Maximum Likelihood Estimation (derivation):
$p(d \mid M) = \frac{|d|!}{tf_d(w_1)! \cdots tf_d(w_K)!} \prod_{i=1}^{K} p_d(w_i)^{tf_d(w_i)} \propto \prod_{i=1}^{K} p_d(w_i)^{tf_d(w_i)}$
$l(d \mid M) = \log p(d \mid M) = \sum_{i=1}^{K} tf_d(w_i)\,\log p_d(w_i)$
Use the Lagrange multiplier approach: $l' = \sum_{i} tf_d(w_i)\,\log p_d(w_i) + \lambda\Big(\sum_{i} p_d(w_i) - 1\Big)$.
Set the partial derivatives to zero: $\frac{\partial l'}{\partial p_d(w_i)} = \frac{tf_d(w_i)}{p_d(w_i)} + \lambda = 0 \;\Rightarrow\; p_d(w_i) = -\frac{tf_d(w_i)}{\lambda}$.
Since $\sum_i p_d(w_i) = 1$, we get $\lambda = -\sum_i tf_d(w_i) = -|d|$, so the maximum likelihood estimate is $p_d(w_i) = \frac{tf_d(w_i)}{|d|}$.
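To make the estimator concrete, here is a minimal Python sketch of the MLE unigram model $p_d(w) = tf_d(w)/|d|$ applied to the three example documents; the function and variable names are illustrative, not from the lecture.

```python
from collections import Counter

def mle_language_model(doc_tokens):
    """Maximum likelihood unigram model: p_d(w) = tf_d(w) / |d|."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return {word: tf / length for word, tf in counts.items()}

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}
for name, tokens in docs.items():
    print(name, mle_language_model(tokens))
# d1 {'sport': 0.5, 'basketball': 0.25, 'ticket': 0.25}
```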

Maximum Likelihood Estimation example: for d1 = {sport, basketball, ticket, sport}, $(p_{sp}, p_b, p_t, p_f, p_{st}) = (0.5, 0.25, 0.25, 0, 0)$; for d2 = {basketball, ticket, finance, ticket, sport}, $(0.2, 0.2, 0.4, 0.2, 0)$; for d3 = {stock, finance, finance, stock}, $(0, 0, 0, 0.5, 0.5)$.

Maximum Likelihood Estimation assigns zero probabilities to words unseen in a small sample. A specific example: only two words in the vocabulary, $w_1 = \text{sport}$ and $w_2 = \text{business}$, like (head, tail) for a coin; a document generates a sequence of the two words, like flipping the coin many times: $\Pr(d \mid M) = p_d(w_1)^{tf_d(w_1)}\,(1 - p_d(w_1))^{tf_d(w_2)}$. If we only observe two words (flip the coin twice), the MLE estimators are: (business, sport) gives $P_d(w_1) = 0.5$; (sport, sport) gives $P_d(w_1) = 1$?; (business, business) gives $P_d(w_1) = 0$?

A specific example: observing only two words (flipping the coin twice), the MLE estimators are: (business, sport) gives $P_d(w_1)^* = 0.5$; (sport, sport) gives $P_d(w_1)^* = 1$?; (business, business) gives $P_d(w_1)^* = 0$? This is the data sparseness problem.

Approaches to address the data sparseness problem: maximum a posteriori (MAP) estimation, shrinkage, and the Bayesian ensemble approach.

Maximum A Posteriori Estimation: select the model that maximizes the probability of the model given the observed data, $M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M)\Pr(M)$. Here $\Pr(M)$ is the prior belief/knowledge; the prior is used to avoid zero probabilities. A specific example with only two words in the vocabulary (sport, business): for a document d, $\Pr(M \mid d) \propto p_d(w_1)^{tf_d(w_1)}\, p_d(w_2)^{tf_d(w_2)}\, \Pr(M)$, where $\Pr(M)$ is the prior distribution.

Maximum A Posteriori Estimation: introduce a prior on the multinomial distribution. Use the prior $\Pr(M)$ to avoid zero probabilities; intuitively, most coins are more or less unbiased. Use a Dirichlet prior on $p(w)$: $\mathrm{Dir}(p \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{i=1}^{K} p(w_i)^{\alpha_i - 1}$, with $\sum_i p(w_i) = 1$ and $0 \le p(w_i) \le 1$. The $\alpha_i$ are hyper-parameters, and the leading factor is a constant with respect to p. $\Gamma(x)$ is the gamma function, $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$, with $\Gamma(n+1) = n!$ for integer n.

For the two-word example, a Dirichlet prior: $\Pr(M) \propto p(w_1)^{2}\,(1 - p(w_1))^{2}$, i.e., a symmetric Dirichlet with $\alpha_1 = \alpha_2 = 3$, which peaks at $P(w_1) = 0.5$.

Maximum A Posteriori: $M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M)\Pr(M)$. For the two-word example,
$\Pr(d \mid M)\Pr(M) \propto p_d(w_1)^{tf_d(w_1)}(1 - p_d(w_1))^{tf_d(w_2)}\, p_d(w_1)^{\alpha_1 - 1}(1 - p_d(w_1))^{\alpha_2 - 1} = p_d(w_1)^{tf_d(w_1) + \alpha_1 - 1}(1 - p_d(w_1))^{tf_d(w_2) + \alpha_2 - 1}$,
so the hyper-parameters act as pseudo counts: $M^* = \arg\max_{p_d(w_1)} p_d(w_1)^{tf_d(w_1) + \alpha_1 - 1}(1 - p_d(w_1))^{tf_d(w_2) + \alpha_2 - 1}$.

A specific example: observe only two words (flip a coin twice), both sport. Is $P_d(w_1)^* = 1$? The posterior is the likelihood times the prior $P(w_1)^2(1 - P(w_1))^2$, which pulls the estimate away from 1.

A specific example: observe only two words (flip a coin twice), both sport. Is $P_d(w_1)^* = 1$? With the Dirichlet prior, $p_d(w_1)^* = \frac{tf_d(w_1) + \alpha_1 - 1}{(tf_d(w_1) + \alpha_1 - 1) + (tf_d(w_2) + \alpha_2 - 1)} = \frac{2 + 3 - 1}{(2 + 3 - 1) + (0 + 3 - 1)} = \frac{4}{6} = \frac{2}{3}$.

Maximum A Posteriori Estimation: use a Dirichlet prior for the multinomial distribution. How should the parameters of the Dirichlet prior be set?

Maximum A Posteriori Estimation: use a Dirichlet prior for the multinomial distribution. There are K terms in the vocabulary. Multinomial: $p = \{p_d(w_1), \ldots, p_d(w_K)\}$, $\sum_i p_d(w_i) = 1$, $0 \le p_d(w_i) \le 1$. Dirichlet prior: $\mathrm{Dir}(p \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{i=1}^{K} p_d(w_i)^{\alpha_i - 1}$, where the $\alpha_i$ are hyper-parameters and the normalizing factor is a constant with respect to p.

MAP estimation for the unigram language model:
$p^* = \arg\max_p \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_i p_d(w_i)^{tf_d(w_i)} \prod_i p_d(w_i)^{\alpha_i - 1} = \arg\max_p \prod_i p_d(w_i)^{tf_d(w_i) + \alpha_i - 1}$, subject to $\sum_i p_d(w_i) = 1$, $0 \le p_d(w_i) \le 1$.
Use the Lagrange multiplier approach and set the derivative to zero:
$p_d^*(w_i) = \frac{tf_d(w_i) + \alpha_i - 1}{\sum_j \big(tf_d(w_j) + \alpha_j - 1\big)}$
The pseudo counts are set by the hyper-parameters.

MAP estimation for the unigram language model (Lagrange multiplier, derivative set to zero): $p_d^*(w_i) = \frac{tf_d(w_i) + \alpha_i - 1}{\sum_j (tf_d(w_j) + \alpha_j - 1)}$. How do we determine appropriate values for the hyper-parameters? When nothing is observed from a document, $p_d^*(w_i) = \frac{\alpha_i - 1}{\sum_j (\alpha_j - 1)}$. What is the most likely $p_d(w_i)$ without looking at the content of the document?

MAP estimation for the unigram language model: what is the most likely $p_d(w_i)$ without looking at the content of the document? It is the unigram probability of the collection, $\{p_c(w_1), p_c(w_2), \ldots, p_c(w_K)\}$; without any other information, we guess the behavior of one member from the behavior of the whole population. Therefore set $p_d^*(w_i) = \frac{\alpha_i - 1}{\sum_j (\alpha_j - 1)} = p_c(w_i)$, i.e., $\alpha_i - 1 = \mu\, p_c(w_i)$ with $\mu$ a constant.

MAP estimation for the unigram language model:
$p^* = \arg\max_p \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_i p_d(w_i)^{tf_d(w_i)} \prod_i p_d(w_i)^{\mu\, p_c(w_i)}$, subject to $\sum_i p_d(w_i) = 1$, $0 \le p_d(w_i) \le 1$.
Use the Lagrange multiplier approach and set the derivative to zero:
$p_d^*(w_i) = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{\sum_j tf_d(w_j) + \mu \sum_j p_c(w_j)} = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}$
Here $\mu\, p_c(w_i)$ acts as pseudo counts and $\mu$ as a pseudo document length.

Dirichlet MAP estimation for the unigram language model. Step 0: compute the probability of each word in the whole collection with the collection unigram language model, $p_c(w_i) = \frac{\sum_d tf_d(w_i)}{\sum_d |d|}$. Step 1: for each document d, compute its smoothed unigram language model (Dirichlet smoothing) as $p_d(w_i) = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}$.
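A minimal sketch of steps 0 and 1, assuming already tokenized documents; the helper names and the default $\mu = 2000$ are illustrative choices, not values given in the lecture.

```python
from collections import Counter

def collection_model(all_docs):
    """Step 0: collection unigram model p_c(w) = total count of w / total tokens."""
    counts, total = Counter(), 0
    for tokens in all_docs:
        counts.update(tokens)
        total += len(tokens)
    return {w: tf / total for w, tf in counts.items()}

def dirichlet_smoothed_model(doc_tokens, p_c, mu=2000.0):
    """Step 1: p_d(w) = (tf_d(w) + mu * p_c(w)) / (|d| + mu)."""
    tf = Counter(doc_tokens)
    d_len = len(doc_tokens)
    return {w: (tf.get(w, 0) + mu * p_c[w]) / (d_len + mu) for w in p_c}
```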

Dirichlet MAP estimation for the unigram language model. Step 2: for a given query $q = \{tf_q(w_1), \ldots, tf_q(w_K)\}$ and each document d, compute the likelihood $p(q \mid d) = \prod_{i=1}^{K} p_d(w_i)^{tf_q(w_i)} = \prod_{i=1}^{K} \left(\frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}\right)^{tf_q(w_i)}$. The larger the likelihood, the more relevant the document is to the query.
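Continuing the sketch above (it reuses collection_model and dirichlet_smoothed_model), step 2 can be scored in log space to avoid numerical underflow; the names and the skipping of query words outside the collection vocabulary are assumptions.

```python
import math

def query_log_likelihood(query_tokens, doc_tokens, p_c, mu=2000.0):
    """log p(q|d) = sum over query occurrences of log p_d(w)."""
    p_d = dirichlet_smoothed_model(doc_tokens, p_c, mu)
    return sum(math.log(p_d[w]) for w in query_tokens if w in p_d)

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}
p_c = collection_model(docs.values())
query = ["sport", "basketball"]
ranking = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d], p_c), reverse=True)
print(ranking)  # documents sorted by decreasing log p(q|d)
```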

Comparing Dirichlet smoothing with TF-IDF weighting. Dirichlet smoothing: $p(q \mid d) = \prod_{i=1}^{K}\left(\frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}\right)^{tf_q(w_i)}$. How does this relate to TF-IDF weighting, $sim(q, d) = \sum_{i=1}^{K} tf_q(w_i)\, tf_d(w_i)\, idf(w_i) \,/\, norm(d)$?

Dirichlet smoothing: $p(q \mid d) = \prod_{i=1}^{K}\left(\frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}\right)^{tf_q(w_i)}$, which gives
$\log p(q \mid d) = \sum_{i=1}^{K} tf_q(w_i)\,\log\!\left(1 + \frac{tf_d(w_i)}{\mu\, p_c(w_i)}\right) - |q|\log(|d| + \mu) + \sum_{i=1}^{K} tf_q(w_i)\,\log\big(\mu\, p_c(w_i)\big)$.
TF-IDF weighting: $sim(q, d) = \sum_{i=1}^{K} tf_q(w_i)\, tf_d(w_i)\, idf(w_i) \,/\, norm(d)$.

Dirichlet smoothing, step by step:
$p(q \mid d) = \prod_{i=1}^{K}\left(\frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}\right)^{tf_q(w_i)}$
$\log p(q \mid d) = \sum_{i=1}^{K} tf_q(w_i)\left[\log\big(tf_d(w_i) + \mu\, p_c(w_i)\big) - \log(|d| + \mu)\right]$
$= \sum_{i=1}^{K} tf_q(w_i)\,\log\frac{\mu\, p_c(w_i) + tf_d(w_i)}{\mu\, p_c(w_i)} - |q|\log(|d| + \mu) + \sum_{i=1}^{K} tf_q(w_i)\,\log\big(\mu\, p_c(w_i)\big)$
$= \sum_{i=1}^{K} tf_q(w_i)\,\log\!\left(1 + \frac{tf_d(w_i)}{\mu\, p_c(w_i)}\right) - |q|\log(|d| + \mu) + \sum_{i=1}^{K} tf_q(w_i)\,\log\big(\mu\, p_c(w_i)\big)$

Dirichlet smoothing: the last term, $\sum_i tf_q(w_i)\log(\mu\, p_c(w_i))$, does not depend on the document, so it is irrelevant for ranking:
$\log p(q \mid d) \;\propto_{rank}\; \sum_{i=1}^{K} tf_q(w_i)\,\log\!\left(1 + \frac{tf_d(w_i)}{\mu\, p_c(w_i)}\right) - |q|\log(|d| + \mu)$.
TF-IDF weighting: $sim(q, d) = \sum_{i=1}^{K} tf_q(w_i)\, tf_d(w_i)\, idf(w_i) \,/\, norm(d)$.

Dirichlet smoothing: look at the TF-IDF-like part, $\log\!\left(1 + \frac{tf_d(w_i)}{\mu\, p_c(w_i)}\right)$. The term grows as $tf_d(w_i)$ grows (a TF effect) and shrinks as the collection probability $p_c(w_i)$ grows (an IDF effect), so query-likelihood retrieval with Dirichlet smoothing implicitly performs TF-IDF-like weighting.

Dirichlet smoothing hyper-parameter: $p_d(w_i) = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}$. When $\mu$ is very small, the estimate approaches the MLE estimator; when $\mu$ is very large, it approaches the probability on the whole collection. How do we set an appropriate $\mu$?

Leave-one-out validation: with $p_d(w_i) = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu}$, hold out one word occurrence from the document and predict it with the model estimated from the rest: $p(w_1 \mid d_{/w_1}) = \frac{tf_d(w_1) - 1 + \mu\, p_c(w_1)}{|d| - 1 + \mu}$, and in general $p(w_j \mid d_{/w_j}) = \frac{tf_d(w_j) - 1 + \mu\, p_c(w_j)}{|d| - 1 + \mu}$.

Leave-one-out validation: the leave-one-out log-likelihood of a document is $l_{-1}(\mu, d) = \sum_{j=1}^{|d|} \log\frac{tf_d(w_j) - 1 + \mu\, p_c(w_j)}{|d| - 1 + \mu}$, summing over the word occurrences of d; over the whole collection, $l_{-1}(\mu, C) = \sum_{d \in C}\sum_{j=1}^{|d|} \log\frac{tf_d(w_j) - 1 + \mu\, p_c(w_j)}{|d| - 1 + \mu}$. Choose $\mu^* = \arg\max_\mu l_{-1}(\mu, C)$.
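A small sketch of this selection; the candidate grid for $\mu$ and the helper names are assumptions, since the lecture does not say how the argmax is searched (it reuses the collection model p_c from the earlier sketch).

```python
import math
from collections import Counter

def loo_log_likelihood(all_docs, p_c, mu):
    """l_{-1}(mu, C): leave-one-out log-likelihood summed over every word occurrence."""
    total = 0.0
    for tokens in all_docs:
        tf = Counter(tokens)
        d_len = len(tokens)
        for w in tokens:  # hold out this single occurrence
            total += math.log((tf[w] - 1 + mu * p_c[w]) / (d_len - 1 + mu))
    return total

def select_mu(all_docs, p_c, candidates=(100, 500, 1000, 2000, 5000)):
    """mu* = argmax_mu l_{-1}(mu, C), searched over a candidate grid."""
    return max(candidates, key=lambda mu: loo_log_likelihood(all_docs, p_c, mu))
```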

What type of document/collection would get a large $\mu$? One in which most documents use vocabulary and wording patterns similar to the whole collection. What type of document/collection would get a small $\mu$? One in which most documents use vocabulary and wording patterns that differ from the whole collection.

Shrinkage. Maximum likelihood (MLE) builds the model purely from the document's own data and then generates the query words, so the model may not be accurate when the document is short (many unseen words). A shrinkage estimator builds a more reliable model by consulting more general models (e.g., the collection language model). Example: an estimate of P(Lung_Cancer | Smoke) for West Lafayette can be shrunk toward the estimates for Indiana and the U.S.

Jelinek-Mercer smoothing: assume each word is generated from the document language model (MLE) with probability $\lambda$ and from the collection language model (MLE) with probability $1 - \lambda$; this is a linear interpolation between the document language model and the collection language model. JM smoothing: $p_d(w_i) = \lambda\,\frac{tf_d(w_i)}{|d|} + (1 - \lambda)\, p_c(w_i)$.

Relationship between JM smoothing and Dirichlet smoothing:
$p_d(w_i) = \frac{tf_d(w_i) + \mu\, p_c(w_i)}{|d| + \mu} = \frac{1}{|d| + \mu}\big(tf_d(w_i) + \mu\, p_c(w_i)\big) = \frac{|d|}{|d| + \mu}\cdot\frac{tf_d(w_i)}{|d|} + \frac{\mu}{|d| + \mu}\, p_c(w_i)$,
so Dirichlet smoothing is JM smoothing with a document-dependent $\lambda = \frac{|d|}{|d| + \mu}$.

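A brief sketch of JM smoothing that also checks numerically that Dirichlet smoothing coincides with JM smoothing at $\lambda = |d|/(|d| + \mu)$; it reuses p_c and dirichlet_smoothed_model from the earlier sketches, and the specific $\mu$ and document are illustrative.

```python
from collections import Counter

def jm_smoothed_model(doc_tokens, p_c, lam=0.5):
    """Jelinek-Mercer: p_d(w) = lam * tf_d(w)/|d| + (1 - lam) * p_c(w)."""
    tf = Counter(doc_tokens)
    d_len = len(doc_tokens)
    return {w: lam * tf.get(w, 0) / d_len + (1 - lam) * p_c[w] for w in p_c}

# Dirichlet smoothing equals JM smoothing with document-dependent lam = |d| / (|d| + mu):
doc, mu = ["sport", "basketball", "ticket", "sport"], 2000.0
lam = len(doc) / (len(doc) + mu)
jm = jm_smoothed_model(doc, p_c, lam)
dirichlet = dirichlet_smoothed_model(doc, p_c, mu)
assert all(abs(jm[w] - dirichlet[w]) < 1e-12 for w in p_c)
```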
Equivalence of retrieval based on query generation likelihood and retrieval based on the Kullback-Leibler (KL) divergence between the query and document language models. The KL divergence between two probability distributions is $KL(p \,\|\, q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$. It measures the distance between the two distributions and is never negative (it is zero only when the distributions are identical). How can this be proved?
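One standard argument, not spelled out in the lecture, uses the inequality $\log x \le x - 1$:

```latex
% Non-negativity of KL divergence (Gibbs' inequality), using log x <= x - 1:
\begin{aligned}
-KL(p \,\|\, q) &= \sum_x p(x)\,\log\frac{q(x)}{p(x)} \\
                &\le \sum_x p(x)\left(\frac{q(x)}{p(x)} - 1\right)
                 = \sum_x q(x) - \sum_x p(x) = 1 - 1 = 0,
\end{aligned}
% so KL(p || q) >= 0, with equality iff p(x) = q(x) for all x.
```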

Equivalence of query generation likelihood and KL-divergence retrieval:
$Sim(q, d) = -KL(q \,\|\, d) = -\sum_w q(w)\log\frac{q(w)}{p_d(w)} = \sum_w q(w)\log p_d(w) - \sum_w q(w)\log q(w)$.
The first term is the log-likelihood of query generation, and the second term is a document-independent constant, so ranking by $-KL(q \,\|\, d)$ is equivalent to ranking by query generation likelihood. This also generalizes the query representation to a distribution (fractional term weighting).
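A sketch of ranking by negative KL divergence under these definitions; it reuses dirichlet_smoothed_model from the earlier sketch, and the names and the skipping of query words outside the collection vocabulary are assumptions.

```python
import math
from collections import Counter

def query_language_model(query_tokens):
    """Empirical query model: q(w) = tf_q(w) / |q|."""
    tf = Counter(query_tokens)
    return {w: c / len(query_tokens) for w, c in tf.items()}

def neg_kl_score(query_tokens, doc_tokens, p_c, mu=2000.0):
    """Score = sum_w q(w) * log p_d(w); the query-entropy part of -KL(q||d)
    is the same for every document, so it is dropped for ranking."""
    q = query_language_model(query_tokens)
    p_d = dirichlet_smoothed_model(doc_tokens, p_c, mu)
    return sum(q_w * math.log(p_d[w]) for w, q_w in q.items() if w in p_d)
```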

Retrieval with language models: estimate a language model for each document d (and, for the KL-divergence approach, a query language model for q); produce retrieval results either by estimating the generation probability $\Pr(q \mid d)$ or by calculating the KL divergence $KL(q \,\|\, d)$ between the query and document language models.

Model-based feedback: estimate a feedback query language model $q_F$ from the feedback documents in the initial results, and form a new query model by interpolation, $q' = (1 - \alpha)\, q + \alpha\, q_F$. Retrieval then calculates the KL divergence $KL(q' \,\|\, d)$ against each document language model. With $\alpha = 0$ there is no feedback ($q' = q$); with $\alpha = 1$ there is full feedback ($q' = q_F$).

Estimating $q_F$: assume a generative mixture model produces each word in the feedback document(s). For each word, given a fixed $\lambda$, flip a coin: with probability $\lambda$ the word comes from the unknown topic model $q_F(w)$, and with probability $1 - \lambda$ from the background (collection) model $p_C(w)$. Estimate $q_F^* = \arg\max_{q_F} l(X; \lambda, q_F) = \arg\max_{q_F} \sum_{w} c(w, F)\,\log\big(\lambda\, q_F(w) + (1 - \lambda)\, p_C(w)\big)$, where $c(w, F)$ is the count of word w in the feedback documents.

For each word, there is a hidden variable telling which language model it came from: the known background model $p_C(w)$ (e.g., the 0.12, to 0.05, it 0.04, a 0.02, sport 0.0001, basketball 0.00005), chosen with probability $1 - \lambda = 0.8$, or the unknown query topic model $p(w \mid \theta_F)$ ("Basketball": sport = ?, basketball = ?, game = ?, player = ?), chosen with probability $\lambda = 0.2$. If we knew the value of the hidden variable for each word, the MLE estimator could be applied directly; since the variable is hidden, we use the EM algorithm.

For each word, the hidden variable is $z = 1$ (feedback topic) or $z = 0$ (background). Step 1 (Expectation): estimate the hidden variable based on the current model parameters:
$p(z = 1 \mid w) = \frac{p(z = 1)\, p(w \mid z = 1)}{p(z = 1)\, p(w \mid z = 1) + p(z = 0)\, p(w \mid z = 0)} = \frac{\lambda\, q_F^{(t)}(w)}{\lambda\, q_F^{(t)}(w) + (1 - \lambda)\, p_C(w)}$
e.g., the (0.1), basketball (0.7), game (0.6), is (0.2), ...
Step 2 (Maximization): update the model parameters based on the guess in step 1:
$q_F^{(t+1)}(w_i) = \frac{c(w_i, F)\, p(z_i = 1 \mid w_i)}{\sum_j c(w_j, F)\, p(z_j = 1 \mid w_j)}$

The Expectation-Maximization (EM) algorithm for $q_F$. Step 0: initialize $q_F^{(0)}$ and fix $\lambda$ (e.g., $\lambda = 0.5$). Step 1 (Expectation): $p(z = 1 \mid w) = \frac{\lambda\, q_F^{(t)}(w)}{\lambda\, q_F^{(t)}(w) + (1 - \lambda)\, p_C(w)}$. Step 2 (Maximization): $q_F^{(t+1)}(w_i) = \frac{c(w_i, F)\, p(z_i = 1 \mid w_i)}{\sum_j c(w_j, F)\, p(z_j = 1 \mid w_j)}$. Iterate steps 1 and 2 until convergence.
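A minimal sketch of this EM loop for the feedback model; the initialization, fixed iteration count, and names are assumptions rather than details from the lecture.

```python
from collections import Counter

def estimate_feedback_model(feedback_tokens, p_c, lam=0.5, iterations=20):
    """EM for the two-component mixture: word ~ lam * q_F(w) + (1 - lam) * p_c(w)."""
    counts = Counter(feedback_tokens)
    total = sum(counts.values())
    # Step 0: initialize q_F, e.g., with the empirical distribution of the feedback text.
    q_f = {w: c / total for w, c in counts.items()}
    for _ in range(iterations):
        # E-step: p(z = 1 | w) for each feedback word.
        post = {
            w: lam * q_f[w] / (lam * q_f[w] + (1 - lam) * p_c.get(w, 1e-12))
            for w in counts
        }
        # M-step: re-estimate q_F from counts weighted by p(z = 1 | w).
        norm = sum(counts[w] * post[w] for w in counts)
        q_f = {w: counts[w] * post[w] / norm for w in counts}
    return q_f
```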

Properties of the parameter $\lambda$: if $\lambda$ is close to 0, most common words can be generated by the collection language model, so the estimated query (topic) language model contains more topic words; if $\lambda$ is close to 1, the query language model itself has to generate the common words, so it contains fewer topic words.

Summary: introduction to language models; the unigram language model; document language model estimation (maximum likelihood estimation, maximum a posteriori estimation, Jelinek-Mercer smoothing); model-based feedback.