RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

RETRIEVAL MODELS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1

Outline Intro Basics of probability and information theory Retrieval models Boolean model Vector space model Probabilistic retrieval Retrieval evaluation Link analysis From queries to top-k results Social search Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 2

Boolean model Exact match, i.e., either a document matches the query or not No ranking Query composed of Boolean operators Examples: Richard Nixon Watergate, Schwarzenegger (governor politics) (movie action) Advantages Efficient processing Results are easy to explain Main disadvantages Effectiveness (i.e., whether user will find all relevant results) depends entirely on user Not practical for web retrieval (result set would be either huge or too small) Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 3

Extended Boolean model (1) Consider term weights, e.g., for document d and term t w t, d = P t d tf t, d Consider the vector t 1,, t m of all terms in all documents (in some order) d can be represented as d = w(t 1, d),, w(t m, d) For query q = t i1 t ik sim d, q = 1 p 1 w t i1, d p + + 1 w t ik, d p k For query q = t i1 t ik p sim d, q = w t i 1, d p + + w t ik, d p k Rank documents by decreasing similarities to query P-norm Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 4

Extended Boolean model (2) Source: G. Salton, Extended Boolean Information Retrieval. CACM 1983 Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 5

Extended Boolean model: example Documents X, Z, Y and queries A B and A B Source: G. Salton, Extended Boolean Information Retrieval. CACM 1983 Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 6

Term-document matrix Vector space model (1) d 1 d 2 d 3 d 4 d 5 d 6 champion 3 2 0 0 0 0 football 2 0 0 0 0 0 goal 4 3 0 1 0 0 law 0 0 2 3 0 0 party 0 0 6 5 0 0 politician 0 0 4 4 0 0 rain 0 0 0 0 3 3 score 4 5 0 0 0 0 soccer 0 3 0 0 0 0 weather 0 0 0 0 5 4 wind 0 1 0 0 2 3 Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 7

Main idea Vector space model (2) Terms span a vector space Documents and queries are vectors in that space Distance measured by cosine similarity: t 3 For d i = t i1,, t im and q = q 1,, q m d i d j sim d i, q = j t ij q j t 2 2 j ij j q j q Measures angle-based affinity between query and doc vector Feature value of query term Feature value of doc term (e.g., tf-idf) t 1 t 2 Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 8

Query: football score Vector space model: example 0 1 0 0 0 Query vector: q = 0 0 1 0 0 0 Ranking by using term frequencies tf football, d 1 0.15, tf score, d 1 0.31, tf football, d 2 0, tf score, d 2 0.36 sim d 1, q = d 1 T q 0.15 1+0.31 1 d 1 q 0.52 1.41 = 0.63, sim d 2, q 0 1+0.36 0.49 1.41 = 0.52 Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 9

User issues simple query System returns initial results Relevance feedback User marks some results as relevant others as non-relevant Initial query can be reformulated to yield better results based on above relevance feedback q opt = arg max q sim q, R sim q, R Relevant docs Non-relevant docs In vector space model q opt = arg max q 1 R d j R sim(d j, q) 1 R d j R sim(d j, q) Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 10

Relevance feedback: Rocchio algorithm Used as relevance feedback algorithm in the SMART system q = αq 0 + β 1 R d j d j R γ 1 R d j d j R With typical parameter values α = 8, β = 16, γ = 4 Negative feature values are set to 0 Source: Introduction to Information Retrieval Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 11

Vector space model for Latent Semantic Indexing (1) Map documents to corresponding latent topics d 1 d 2 d 3 d 4 d 5 d 6 champion 3 2 0 0 0 0 football 2 0 0 0 0 0 goal 4 3 0 1 0 0 law 0 0 2 3 0 0 party 0 0 6 5 0 0 politician 0 0 4 4 0 0 rain 0 0 0 0 3 3 score 4 5 0 0 0 0 soccer 0 3 0 0 0 0 weather 0 0 0 0 5 4 wind 0 1 0 0 2 3 Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 12

Vector space model for Latent Semantic Indexing (2) Basic idea (further details in the Data Mining lecture ) holds orthonormal Eigenvectors of AA T (doc-doc-similarity) Mapping of docs d i, d j into latent space: d i U T k d i = d T i = D k V k i q U T k q = q (map keyword query) and measure cosine-similarity positive roots of the Eigenvalues (singular values) of A T A holds orthonormal Eigenvectors of A T A (term-term similarity) Mapping of new doc d into latent space: d U k T d = d Add d as last column of V k T Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 13

Advantages Vector space model: pros & cons Simple framework for ranking Other similarity measures or similarity weighting schemes can be used, e.g.: Metric distance Definition Euclidean x y = x i y i 2 i Manhattan x y 1 = x i y i i Maximum x y = max i x i y i with sim x, y Disadvantages 1 1+dist(x,y) Term independence assumption Impractical for relevance prediction or sim x, y 1 e dist(x,y) Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 14

Basic idea Probabilistic retrieval Return documents to a query ranked by decreasing probability of relevance (with respect to the query) If probabilities are calculated on all available data, the retrieval effectiveness should be best! Sample doc, returned by a retrieval system to a user s query Rq Rq Set of all documents Bayes decision rule: document d is relevant if P Rq d > P Rq d P Rq, d > P Rq, d Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 15

Binary independence model with Naïve Bayes Estimate probabilities P Rq, d = P d Rq P Rq Assume independence between terms in the document given the class (e.g., Rq or Rq) P d Rq = P t i Rq t i d Document d can be classified as relevant if P d Rq P d Rq P Rq > P Rq Note: a document is a vector of binary features (i.e., 0, 1 indicating term occurrence or non-occurrence respectively) Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 16

Bayes decision with binary independence model Let r i P t i Rq and r i P t i Rq P d Rq P d Rq = r i i,ti =1 r i 1 r i i,ti =0 1 r i = r i i,ti =1 r i 1 r i i,ti =1 1 r i 1 r i i,ti =1 1 r i 1 r i i,ti =0 1 r i = r i 1 r i i,ti =1 r i 1 r i i 1 r i 1 r i In practice, score is computed by taking the logarithm P d Rq P d Rq log r i 1 r i i,ti =1 r i 1 r i Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 17

BM25 Contingency table Relevant Non-relevant All t i = 1 R i n i R i n i t i = 0 R R i N R n i R i N n i All R N R N Smoothed maximum likelihood estimations for r i and r i r i = R i + 0.5 r R + 1 i = n i R i + 0.5 N R + 1 BM25 (short for Best Match 25) P d Rq P d Rq log r i 1 r i i,ti =1 r i 1 r i = log R i + 0.5 N R n i + R i + 0.5 n i R i + 0.5 R R i + 0.5 i,t i =1 BM25 d, q log R i + 0.5 N R n i + R i + 0.5 n i R i + 0.5 R R i + 0.5 i,t i =1 α + 1 f i A + f i β + 1 qf i β + qf i α, β, A are parameters, where A = α (1 γ) + γ av d k Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 18 d

Language models Text is generated from probability distribution over terms (i.e., language model) Terms are drawn with probability of their occurrence Suppose each document d = e.g., multinomial distribution: P t 1 = k 1,, t m = k m ; p 1,, p m = LM(θ 1 ) LM(θ 2 ) LM(θ 3 ) t 1,, t m is represented by a language model, d! k 1! k m! p 1 k 1 p m k m with θ d = (p 1,, p m ), such that i p i = 1 and i k i = d Given a query q = q 1,, q n, what is the likelihood P q θ d that q is a sample from a document d? Rank documents by decreasing likelihoods of having generated q. Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 19

Language models: the query likelihood model Start with joint probability distribution P θ d, q = P q θ d P θ d likelihood prior Assume uniform prior and conditional independence between query terms given the parameters θ d P q θ d = P q i θ d i (unigram model) Rank documents by decreasing likelihoods P q θ dj Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 20

Language models: estimating probabilities P q θ d = P q i θ d i Estimate P q i θ d by P q i θ d = freq q i, d d Maximum likelihood estimation Problem What if q i does not occur in d? Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 21

Language models: smoothing P q i θ d = freq q i, d d Problem What if q i does not occur in d? Maximum likelihood estimation Use smoothing: technique for estimating probabilities of missing words Mixture model for smoothing P q θ d = 1 α P q i θ d + αp q i C i C: whole document corpus Background model Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 22

Mixture model for smoothing Smoothing & tf-idf P q θ d = 1 α P q i θ d + αp q i C i log 1 α P q i θ d + αp q i C i = log 1 α P q i θ d + αp q i C i,fq i >0 + log αp q i C i,fq i =0 = log i,fq i >0 1 α P q i θ d + αp q i C αp q i C + log αp q i C i i,fq i >0 log 1 α P q i θ d αp q i C + 1 What does this have to do with tf-idf??? Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 23

Jelineck-Mercer smoothing P q θ d = 1 α P q i θ d + αp q i C i Set α to a constant λ and use maximum likelihood estimations P q θ d = 1 λ freq q i, d d i + λ freq q i, C C In practice: use logarithms for ranking Score q, d = log 1 λ freq q i, d d i + λ freq q i, C C where C = t C freq t, d Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 24

Dirichlet smoothing P q θ d = 1 α P q i θ d + αp q i C i Basic idea Term that does not occur in a long document should be assigned low smoothed probability (the longer the document, the lower this probability) Set α = μ μ+ d and use maximum likelihood estimations P q θ d = i freq q i, d + μ freq q i, C / C μ + d Score d, q = log freq q i, d + μ freq q i, C / C μ + d i Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 25

Dirichlet smoothing: side note Motivated by Bayesian reasoning posterior P θ D P D θ P θ likelihood prior Dirichlet distribution is conjugate prior of the Multinomial distribution (enables closed-form calculation: prior has same functional form as posterior) In Dirichlet smoothing: Dirichlet prior is assumed over the model parameters, likelihood is multinomially distributed Priors are updated as more and more data is observed Further details in the Data Mining lecture Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 26

Estimate P q i θ d by Laplace/Lidstone smoothing P q i θ d = freq q i, d + 1 freq t, d + 1 t C = freq q i, d + 1 C + d For ranking Score q, d = log freq q i, d + 1 C + d i Advantage: easy-to-use (i.e., prameter-less) additive smoothing technique, no background model Disadvantage: for corpus with many short documents probability estimations are biased towards corpus frequencies Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 27

Jelineck-Mercer smoothing (λ = 0.2) P party d 1 = 0.8 0 13 Smoothing: examples + 0.2 11 72 0.031 P wind d 2 = 0.8 1 14 + 0.2 6 72 0.074 Dirichlet smoothing (μ = 0.2) P party d 1 = 0+0.211 72 0.2+13 0.0023 P wind d 2 = 1+0.2 6 72 0.2+14 0.072 Laplace/Lidstone smoothing P party d 1 = 0+1 72+13 0.01 P wind d 2 = 1+1 72+14 0.023 Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 28

Language models vs. tf-idf Precision-recall results on TREC dataset. Source: A Language Modeling Approach to Information Retrieval by M. Ponte More on evaluation measures, next lecture Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 29

Jelineck-Mercer smoothing Smoothing: summary Flexible combination of term frequencies with background model Empirically: the longer the queries the larger λ (i.e., the more smoothing is needed) Does not take document length into account Dirichlet smoothing Probabilistic (i.e. Bayesian) reasoning about prior distribution on model parameters Empirically: the longer the queries the larger μ However, μ is collection dependent Laplace/Lidstone smoothing Easy-to-use additive smoothing technique Probability estimations may be biased towards corpus frequencies Two-stage smoothing is possible (i.e., Jelineck-Mercer smoothing on a Dirichlet smoothed estimation) Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 30

Language models for relevance (1) In document language models we were interested in estimating the query likelihood P q θ d In relevance language models we are interested in P d θ Rq, the probability of a document being generated by a relevance model (i.e., document is sample from this model) Can the two models be combined for ranking? One possibility: rank documents by increasing dissimilarity between P q θ d and P d θ Rq Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 31

Language models for relevance (2) Measure the negative Kullback-Leibler divergence KL P d θ Rq P q θ d = P t θ Rq log P t θ Rq t P t θ d = P t θ Rq log P t θ d t P t θ Rq log P t θ Rq t Simple estimation of P t θ Rq : P t θ Rq P t q 1,, q n = P t, q 1,, q n P q 1,, q n P t, q 1,, q n = P d P t, q 1,, q n d d C = P d P t d P q i d d C i conditional independence assumption Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 32

Pseudo-relevance feedback algorithm Estimation of P t θ Rq : P t θ Rq P t q 1,, q n = P t, q 1,, q n P q 1,, q n P t, q 1,, q n = P d P t, q 1,, q n d d C = P d P t d P q i d d C i 1. Rank documents by P q θ d 2. Confine the corpus to C = {top-k ranked documents} 3. Estimate P d θ Rq on this new corpus 4. Rerank documents by using KL P d θ Rq P q θ d Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 33

Integrating relevance feedback in language models Doc D Query Q D Q KL D( Q D) Results ' (1 ) Q Q F Feedback Docs F ={d 1, d 2,, d k } Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 34

Language models: summary Tractable retrieval framework with appealing interpretation Smoothed maximum likelihood estimations for unigrams work well in practice Estimations work less well for multigram likelihoods, problem known as curse of dimensionality Relevance model is possible Relevance feedback is difficult to integrate Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 35

Advanced probabilistic models: Bayesian networks Bayesian network: directed graph representing dependencies between random variables Local Markov property: a variable node is independent of all its non- descendants given its parents. One can reason about joint probabilities, e.g., P d i, I d 1 f 1 f i f m q 1 q 2 q k d 2 Doc variable More details in the Data Mining lecture I Query variable feature variable Information need variable Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 36

Ranking support vector machine (Ranking SVM) Motivation: feedback is cheap and plentiful Input: q, r, q, r,, where d i, d j r if d i ranks higher than d j Can we learn a weight vector w such that r: d 1 d 2 d 6 d i, d j r: w T d i > w T d j d 5 d 7 Optimization problem minimize subject to r: 1 w + λ ξ 2 i,j,r i,j,r Slack variable, allows some misclassification w d 9 d 1, d 2, d 1, d 6, d 5, d 6 d 5, d 2, d 5, d 7, d 7, d 9 r r d i, d j r: w T d i > w T d j + 1 ξ i,j,r More details on SVMs in the Data Mining lecture Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 37

Retrieval models (overview) Binary Independence Source: http://en.wikipedia.org/wiki/file:information-retrieval-models.png Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 38

Summary Boolean model Concatenation of query keywords by Boolean operators Either match or no match, no ranking Allows efficient processing but is impractical for large-scale retrieval Vector space model Vector representations of query and documents allow algebraic methods for measuring similarity Good for ranking, but not good dealing with uncertainty or predicting relevance Probabilistic and language models Extensible and versatile Can handle uncertainty (by reasoning about probabilities) Maximum likelihood estimations work well for unigram models, less well for multigram models Difficult to maintain at large scale and in the presence of frequent updates Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 39