RETRIEVAL MODELS. Dr. Gjergji Kasneci, Introduction to Information Retrieval, WS 2012-13

Outline
- Intro
- Basics of probability and information theory
- Retrieval models: Boolean model, vector space model, probabilistic retrieval
- Retrieval evaluation
- Link analysis
- From queries to top-k results
- Social search

Boolean model
Exact match, i.e., a document either matches the query or it does not; no ranking.
Query composed of terms and Boolean operators, e.g.:
- Richard ∧ Nixon ∧ Watergate
- Schwarzenegger ∧ ((governor ∧ politics) ∨ (movie ∧ action))
Advantages:
- Efficient processing
- Results are easy to explain
Main disadvantages:
- Effectiveness (i.e., whether the user will find all relevant results) depends entirely on the user
- Not practical for web retrieval (the result set would be either huge or too small)
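To make the exact-match idea concrete, here is a minimal sketch (not from the lecture) of Boolean retrieval over an inverted index; the toy documents and the AND/OR helpers are illustrative assumptions.

```python
# Minimal Boolean retrieval over an inverted index (illustrative sketch).
from collections import defaultdict

docs = {
    "d1": "richard nixon watergate scandal",
    "d2": "schwarzenegger governor politics california",
    "d3": "schwarzenegger movie action terminator",
}

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def AND(*postings):
    return set.intersection(*postings)

def OR(*postings):
    return set.union(*postings)

# Query: schwarzenegger AND ((governor AND politics) OR (movie AND action))
result = AND(index["schwarzenegger"],
             OR(AND(index["governor"], index["politics"]),
                AND(index["movie"], index["action"])))
print(sorted(result))  # -> ['d2', 'd3']; exact match only, no ranking
```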

Extended Boolean model (1)
Consider term weights, e.g., for document d and term t a weight $w(t, d) \in [0, 1]$ based on $P(t \mid d)$ estimated from $tf(t, d)$.
Consider the vector $(t_1, \dots, t_m)$ of all terms in all documents (in some order); d can be represented as $d = (w(t_1, d), \dots, w(t_m, d))$.
For a conjunctive query $q = t_{i_1} \wedge \dots \wedge t_{i_k}$ (p-norm):
$sim(d, q) = 1 - \left( \frac{(1 - w(t_{i_1}, d))^p + \dots + (1 - w(t_{i_k}, d))^p}{k} \right)^{1/p}$
For a disjunctive query $q = t_{i_1} \vee \dots \vee t_{i_k}$:
$sim(d, q) = \left( \frac{w(t_{i_1}, d)^p + \dots + w(t_{i_k}, d)^p}{k} \right)^{1/p}$
Rank documents by decreasing similarity to the query.
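A small sketch of the p-norm similarities above; the term weights are made-up values and p is a free parameter.

```python
# P-norm extended Boolean similarity (sketch).
def sim_and(weights, p=2.0):
    """Similarity for a conjunctive query; weights are w(t_i, d) in [0, 1]."""
    k = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / k) ** (1.0 / p)

def sim_or(weights, p=2.0):
    """Similarity for a disjunctive query."""
    k = len(weights)
    return (sum(w ** p for w in weights) / k) ** (1.0 / p)

# Example: document weights for the two query terms.
w = [0.8, 0.1]
print(round(sim_and(w), 3), round(sim_or(w), 3))  # AND penalizes the missing term, OR rewards the present one
```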

Extended Boolean model (2)
(Figure.) Source: G. Salton, Extended Boolean Information Retrieval. CACM 1983

Extended Boolean model: example
Documents X, Z, Y and queries A ∧ B and A ∨ B. (Figure.) Source: G. Salton, Extended Boolean Information Retrieval. CACM 1983

Vector space model (1): term-document matrix

            d1  d2  d3  d4  d5  d6
champion     3   2   0   0   0   0
football     2   0   0   0   0   0
goal         4   3   0   1   0   0
law          0   0   2   3   0   0
party        0   0   6   5   0   0
politician   0   0   4   4   0   0
rain         0   0   0   0   3   3
score        4   5   0   0   0   0
soccer       0   3   0   0   0   0
weather      0   0   0   0   5   4
wind         0   1   0   0   2   3

Vector space model (2)
Main idea:
- Terms span a vector space; documents and queries are vectors in that space
- Distance is measured by cosine similarity, i.e., the angle-based affinity between query and document vector
For $d_i = (t_{i1}, \dots, t_{im})$ and $q = (q_1, \dots, q_m)$:
$sim(d_i, q) = \frac{\sum_j t_{ij}\, q_j}{\sqrt{\sum_j t_{ij}^2}\, \sqrt{\sum_j q_j^2}}$
where $t_{ij}$ is the feature value of the document term (e.g., tf-idf) and $q_j$ the feature value of the query term.

Vector space model: example
Query: football score
Query vector (over the term order above): $q = (0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0)$
Ranking by using term frequencies:
$tf(\text{football}, d_1) \approx 0.15$, $tf(\text{score}, d_1) \approx 0.31$, $tf(\text{football}, d_2) = 0$, $tf(\text{score}, d_2) \approx 0.36$
$sim(d_1, q) = \frac{d_1^T q}{\lVert d_1 \rVert\, \lVert q \rVert} \approx \frac{0.15 \cdot 1 + 0.31 \cdot 1}{0.52 \cdot 1.41} \approx 0.63$, $sim(d_2, q) \approx \frac{0 \cdot 1 + 0.36 \cdot 1}{0.49 \cdot 1.41} \approx 0.52$
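A short sketch (assuming the term-document counts above and plain tf weighting) that reproduces these similarities:

```python
# Cosine similarity with tf weighting, reproducing the example above.
import math

terms = ["champion", "football", "goal", "law", "party", "politician",
         "rain", "score", "soccer", "weather", "wind"]
d1 = [3, 2, 4, 0, 0, 0, 0, 4, 0, 0, 0]
d2 = [2, 0, 3, 0, 0, 0, 0, 5, 3, 0, 1]
q  = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]   # "football score"

def tf_vector(counts):
    length = sum(counts)
    return [c / length for c in counts]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(round(cosine(tf_vector(d1), q), 2))  # ~0.63
print(round(cosine(tf_vector(d2), q), 2))  # ~0.51 (the slide's 0.52 comes from rounded intermediate values)
```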

Relevance feedback
- User issues a simple query; the system returns initial results
- The user marks some results as relevant, others as non-relevant
- The initial query can be reformulated to yield better results based on this relevance feedback:
$q_{opt} = \arg\max_q \left[ sim(q, R) - sim(q, \bar{R}) \right]$
where R is the set of relevant and $\bar{R}$ the set of non-relevant documents.
In the vector space model:
$q_{opt} = \arg\max_q \left[ \frac{1}{|R|} \sum_{d_j \in R} sim(d_j, q) - \frac{1}{|\bar{R}|} \sum_{d_j \in \bar{R}} sim(d_j, q) \right]$

Relevance feedback: Rocchio algorithm
Used as the relevance feedback algorithm in the SMART system:
$q' = \alpha q_0 + \beta \frac{1}{|R|} \sum_{d_j \in R} d_j - \gamma \frac{1}{|\bar{R}|} \sum_{d_j \in \bar{R}} d_j$
with typical parameter values $\alpha = 8$, $\beta = 16$, $\gamma = 4$. Negative feature values are set to 0.
Source: Introduction to Information Retrieval
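A minimal sketch of the Rocchio update with numpy; the vectors and default parameter values are illustrative.

```python
# Rocchio relevance-feedback update (sketch).
import numpy as np

def rocchio(q0, relevant, non_relevant, alpha=8.0, beta=16.0, gamma=4.0):
    """q0: original query vector; relevant / non_relevant: lists of doc vectors."""
    q = alpha * q0
    if relevant:
        q += beta * np.mean(relevant, axis=0)
    if non_relevant:
        q -= gamma * np.mean(non_relevant, axis=0)
    return np.maximum(q, 0.0)  # negative feature values are set to 0

q0 = np.array([0.0, 1.0, 0.0, 1.0])
rel = [np.array([0.2, 0.6, 0.0, 0.4])]
nonrel = [np.array([0.5, 0.0, 0.7, 0.0])]
print(rocchio(q0, rel, nonrel))
```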

Vector space model for Latent Semantic Indexing (1)
Map documents to corresponding latent topics. (Term-document matrix as shown on the Vector space model (1) slide.)

Vector space model for Latent Semantic Indexing (2)
Basic idea (further details in the Data Mining lecture): truncated SVD of the term-document matrix, $A \approx U_k \Sigma_k V_k^T$, where
- $U_k$ holds orthonormal eigenvectors of $A A^T$ (term-term similarity)
- $V_k$ holds orthonormal eigenvectors of $A^T A$ (doc-doc similarity)
- $\Sigma_k$ holds the positive square roots of the eigenvalues of $A^T A$ (singular values)
Mapping of a document $d_i$ into the latent space: $d_i \mapsto U_k^T d_i = \hat{d}_i = \Sigma_k (V_k^T)_{\cdot, i}$; a keyword query is mapped the same way, $q \mapsto U_k^T q = \hat{q}$, and cosine similarity is measured in the latent space.
Mapping of a new document d into the latent space: $d \mapsto U_k^T d = \hat{d}$; add $\hat{d}$ as the last column of $V_k^T$ (folding in).
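A compact LSI sketch with numpy, using the term-document matrix above; k = 2 is an illustrative choice of latent dimensions.

```python
# Latent Semantic Indexing via truncated SVD (sketch).
import numpy as np

A = np.array([  # term-document matrix (rows: terms in the order listed above)
    [3, 2, 0, 0, 0, 0], [2, 0, 0, 0, 0, 0], [4, 3, 0, 1, 0, 0],
    [0, 0, 2, 3, 0, 0], [0, 0, 6, 5, 0, 0], [0, 0, 4, 4, 0, 0],
    [0, 0, 0, 0, 3, 3], [4, 5, 0, 0, 0, 0], [0, 3, 0, 0, 0, 0],
    [0, 0, 0, 0, 5, 4], [0, 1, 0, 0, 2, 3]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

def to_latent(vec):
    return Uk.T @ vec            # map a query or document term vector into the latent space

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

q = np.zeros(11); q[[1, 7]] = 1.0   # query "football score"
q_hat = to_latent(q)
docs_hat = Sk @ Vtk                 # latent representations of d1..d6 (columns)
print([round(cos(q_hat, docs_hat[:, j]), 2) for j in range(6)])  # sports docs score highest
```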

Vector space model: pros & cons
Advantages:
- Simple framework for ranking
- Other similarity measures or similarity weighting schemes can be used, with, e.g., $sim(x, y) = \frac{1}{1 + dist(x, y)}$ or $sim(x, y) = e^{-dist(x, y)}$:

  Metric distance   Definition
  Euclidean         $\lVert x - y \rVert_2 = \sqrt{\sum_i (x_i - y_i)^2}$
  Manhattan         $\lVert x - y \rVert_1 = \sum_i |x_i - y_i|$
  Maximum           $\lVert x - y \rVert_\infty = \max_i |x_i - y_i|$

Disadvantages:
- Term independence assumption
- Impractical for relevance prediction

Probabilistic retrieval
Basic idea: return documents for a query ranked by decreasing probability of relevance (with respect to the query). If the probabilities are calculated on all available data, retrieval effectiveness should be best.
Let $R_q$ denote the set of relevant documents and $\bar{R}_q$ the set of non-relevant documents within the set of all documents; a sample document is returned by a retrieval system to the user's query.
Bayes decision rule: document d is relevant if
$P(R_q \mid d) > P(\bar{R}_q \mid d) \iff P(R_q, d) > P(\bar{R}_q, d)$

Binary independence model with Naïve Bayes
Estimate probabilities: $P(R_q, d) = P(d \mid R_q)\, P(R_q)$
Assume independence between the terms of a document given the class (e.g., $R_q$ or $\bar{R}_q$):
$P(d \mid R_q) = \prod_{t_i \in d} P(t_i \mid R_q)$
Document d can be classified as relevant if
$\frac{P(d \mid R_q)}{P(d \mid \bar{R}_q)} > \frac{P(\bar{R}_q)}{P(R_q)}$
Note: a document is a vector of binary features (1 or 0 indicating term occurrence or non-occurrence, respectively).

Bayes decision with the binary independence model
Let $r_i \equiv P(t_i = 1 \mid R_q)$ and $\bar{r}_i \equiv P(t_i = 1 \mid \bar{R}_q)$. Then
$\frac{P(d \mid R_q)}{P(d \mid \bar{R}_q)} = \prod_{i,\, t_i = 1} \frac{r_i}{\bar{r}_i} \prod_{i,\, t_i = 0} \frac{1 - r_i}{1 - \bar{r}_i} = \prod_{i,\, t_i = 1} \frac{r_i (1 - \bar{r}_i)}{\bar{r}_i (1 - r_i)} \prod_{i} \frac{1 - r_i}{1 - \bar{r}_i}$
The last product runs over all terms and is the same for every document, so it can be dropped for ranking. In practice, the score is computed by taking the logarithm:
$\log \frac{P(d \mid R_q)}{P(d \mid \bar{R}_q)} \;\propto\; \sum_{i,\, t_i = 1} \log \frac{r_i (1 - \bar{r}_i)}{\bar{r}_i (1 - r_i)}$
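A small sketch of the resulting retrieval status value; the term probabilities are made-up numbers.

```python
# Binary-independence-model score: sum of log odds over query terms present in the document (sketch).
import math

def bim_score(doc_terms, query_terms, r, r_bar):
    """r[t] = P(t=1|Rq), r_bar[t] = P(t=1|not Rq); doc_terms is the set of terms in d."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            score += math.log((r[t] * (1 - r_bar[t])) / (r_bar[t] * (1 - r[t])))
    return score

r     = {"football": 0.6, "score": 0.5}
r_bar = {"football": 0.1, "score": 0.2}
print(round(bim_score({"football", "goal", "score"}, ["football", "score"], r, r_bar), 2))
```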

BM25
Contingency table:

            Relevant     Non-relevant            All
t_i = 1     R_i          n_i - R_i               n_i
t_i = 0     R - R_i      N - R - n_i + R_i       N - n_i
All         R            N - R                   N

Smoothed maximum likelihood estimations for $r_i$ and $\bar{r}_i$:
$r_i = \frac{R_i + 0.5}{R + 1}$, $\bar{r}_i = \frac{n_i - R_i + 0.5}{N - R + 1}$
Plugging these into the log-odds score:
$\log \frac{P(d \mid R_q)}{P(d \mid \bar{R}_q)} \;\propto\; \sum_{i,\, t_i = 1} \log \frac{(R_i + 0.5)(N - R - n_i + R_i + 0.5)}{(n_i - R_i + 0.5)(R - R_i + 0.5)}$
BM25 (short for Best Match 25) additionally weights each term by its document and query frequency:
$BM25(d, q) = \sum_{i,\, t_i = 1} \log \frac{(R_i + 0.5)(N - R - n_i + R_i + 0.5)}{(n_i - R_i + 0.5)(R - R_i + 0.5)} \cdot \frac{(\alpha + 1)\, f_i}{A + f_i} \cdot \frac{(\beta + 1)\, qf_i}{\beta + qf_i}$
where $f_i$ and $qf_i$ are the term's frequency in the document and in the query, $\alpha, \beta$ are parameters, and $A = \alpha \left( (1 - \gamma) + \gamma \frac{|d|}{avdl} \right)$ with avdl the average document length.
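A sketch of this scoring function in the slide's notation, evaluated with no relevance information (R = R_i = 0), in which case the log factor reduces to an idf-like weight; the parameter values below are common defaults, not taken from the lecture.

```python
# BM25 scoring with R = R_i = 0 (no relevance feedback) (sketch).
import math

def bm25(f, qf, n, N, doc_len, avdl, alpha=1.2, beta=100.0, gamma=0.75):
    """f/qf: term frequencies in doc/query; n: docs containing the term; N: total docs."""
    score = 0.0
    for t in qf:
        if f.get(t, 0) == 0:
            continue
        idf_like = math.log((0.5 * (N - n[t] + 0.5)) / ((n[t] + 0.5) * 0.5))
        A = alpha * ((1 - gamma) + gamma * doc_len / avdl)
        score += idf_like * ((alpha + 1) * f[t]) / (A + f[t]) \
                          * ((beta + 1) * qf[t]) / (beta + qf[t])
    return score

f  = {"football": 2, "score": 4}        # term frequencies in d1
qf = {"football": 1, "score": 1}        # query "football score"
n  = {"football": 1, "score": 2}        # document frequencies in the 6-document toy corpus
print(round(bm25(f, qf, n, N=6, doc_len=13, avdl=12.0), 2))
```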

Language models
Text is generated from a probability distribution over terms (i.e., a language model); terms are drawn with the probability of their occurrence.
Suppose each document $d = (t_1, \dots, t_m)$ is represented by a language model $LM(\theta_d)$, e.g., a multinomial distribution:
$P(t_1 = k_1, \dots, t_m = k_m;\, p_1, \dots, p_m) = \frac{|d|!}{k_1! \cdots k_m!}\, p_1^{k_1} \cdots p_m^{k_m}$
with $\theta_d = (p_1, \dots, p_m)$, such that $\sum_i p_i = 1$ and $\sum_i k_i = |d|$.
Given a query $q = (q_1, \dots, q_n)$, what is the likelihood $P(q \mid \theta_d)$ that q is a sample from document d? Rank documents by decreasing likelihood of having generated q.

Language models: the query likelihood model
Start with the joint probability distribution
$P(\theta_d, q) = P(q \mid \theta_d)\, P(\theta_d)$ (likelihood $\times$ prior)
Assume a uniform prior and conditional independence between query terms given the parameters $\theta_d$:
$P(q \mid \theta_d) = \prod_i P(q_i \mid \theta_d)$ (unigram model)
Rank documents by decreasing likelihood $P(q \mid \theta_{d_j})$.

Language models: estimating probabilities
$P(q \mid \theta_d) = \prod_i P(q_i \mid \theta_d)$
Estimate $P(q_i \mid \theta_d)$ by the maximum likelihood estimate
$P(q_i \mid \theta_d) = \frac{freq(q_i, d)}{|d|}$
Problem: what if $q_i$ does not occur in d? (The whole product becomes zero.)

Language models: smoothing
$P(q_i \mid \theta_d) = \frac{freq(q_i, d)}{|d|}$ (maximum likelihood estimation)
Problem: what if $q_i$ does not occur in d? Use smoothing, a technique for estimating probabilities of missing words.
Mixture model for smoothing:
$P(q \mid \theta_d) = \prod_i \left[ (1 - \alpha)\, P(q_i \mid \theta_d) + \alpha\, P(q_i \mid C) \right]$
where C is the whole document corpus and $P(q_i \mid C)$ is the background model.

Smoothing & tf-idf
Mixture model for smoothing: $P(q \mid \theta_d) = \prod_i \left[ (1 - \alpha)\, P(q_i \mid \theta_d) + \alpha\, P(q_i \mid C) \right]$
$\log \prod_i \left[ (1 - \alpha)\, P(q_i \mid \theta_d) + \alpha\, P(q_i \mid C) \right]$
$= \sum_{i:\, freq(q_i, d) > 0} \log \left[ (1 - \alpha)\, P(q_i \mid \theta_d) + \alpha\, P(q_i \mid C) \right] + \sum_{i:\, freq(q_i, d) = 0} \log \left[ \alpha\, P(q_i \mid C) \right]$
$= \sum_{i:\, freq(q_i, d) > 0} \log \frac{(1 - \alpha)\, P(q_i \mid \theta_d) + \alpha\, P(q_i \mid C)}{\alpha\, P(q_i \mid C)} + \sum_{i} \log \left[ \alpha\, P(q_i \mid C) \right]$
The second sum does not depend on the document, so for ranking:
$\log P(q \mid \theta_d) \;\propto\; \sum_{i:\, freq(q_i, d) > 0} \log \left( \frac{(1 - \alpha)\, P(q_i \mid \theta_d)}{\alpha\, P(q_i \mid C)} + 1 \right)$
What does this have to do with tf-idf? Each summand grows with the term's frequency in d and shrinks with its frequency in the corpus, so it behaves like a tf-idf weight.

Jelinek-Mercer smoothing
$P(q \mid \theta_d) = \prod_i \left[ (1 - \alpha)\, P(q_i \mid \theta_d) + \alpha\, P(q_i \mid C) \right]$
Set $\alpha$ to a constant $\lambda$ and use maximum likelihood estimations:
$P(q \mid \theta_d) = \prod_i \left[ (1 - \lambda)\, \frac{freq(q_i, d)}{|d|} + \lambda\, \frac{freq(q_i, C)}{|C|} \right]$
In practice, use logarithms for ranking:
$Score(q, d) = \sum_i \log \left[ (1 - \lambda)\, \frac{freq(q_i, d)}{|d|} + \lambda\, \frac{freq(q_i, C)}{|C|} \right]$
where $|C| = \sum_{t \in C} freq(t, C)$ is the total number of term occurrences in the corpus.
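A sketch of Jelinek-Mercer scoring; the corpus statistics and λ below are illustrative and follow the toy term-document matrix above.

```python
# Jelinek-Mercer smoothed query likelihood, scored in log space (sketch).
import math

def jm_score(query, doc_tf, corpus_tf, corpus_len, lam=0.2):
    doc_len = sum(doc_tf.values())
    score = 0.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len
        p_bg  = corpus_tf.get(t, 0) / corpus_len
        score += math.log((1 - lam) * p_doc + lam * p_bg)
    return score

corpus_tf = {"party": 11, "wind": 6, "football": 2, "score": 9}  # selected counts; total corpus length 72
d1_tf = {"champion": 3, "football": 2, "goal": 4, "score": 4}    # d1, length 13
print(round(jm_score(["football", "score"], d1_tf, corpus_tf, corpus_len=72), 3))
```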

Dirichlet smoothing
$P(q \mid \theta_d) = \prod_i \left[ (1 - \alpha)\, P(q_i \mid \theta_d) + \alpha\, P(q_i \mid C) \right]$
Basic idea: a term that does not occur in a long document should be assigned a low smoothed probability (the longer the document, the lower this probability).
Set $\alpha = \frac{\mu}{\mu + |d|}$ and use maximum likelihood estimations:
$P(q \mid \theta_d) = \prod_i \frac{freq(q_i, d) + \mu\, freq(q_i, C) / |C|}{\mu + |d|}$
$Score(d, q) = \sum_i \log \frac{freq(q_i, d) + \mu\, freq(q_i, C) / |C|}{\mu + |d|}$
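The corresponding Dirichlet-smoothed scorer (same illustrative statistics; for real collections μ is typically in the thousands, the small value here only matches the toy example):

```python
# Dirichlet-smoothed query likelihood, scored in log space (sketch).
import math

def dirichlet_score(query, doc_tf, corpus_tf, corpus_len, mu=0.2):
    doc_len = sum(doc_tf.values())
    score = 0.0
    for t in query:
        p_bg = corpus_tf.get(t, 0) / corpus_len
        score += math.log((doc_tf.get(t, 0) + mu * p_bg) / (mu + doc_len))
    return score

corpus_tf = {"football": 2, "score": 9}
d1_tf = {"champion": 3, "football": 2, "goal": 4, "score": 4}
print(round(dirichlet_score(["football", "score"], d1_tf, corpus_tf, corpus_len=72), 3))
```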

Dirichlet smoothing: side note
Motivated by Bayesian reasoning: posterior $\propto$ likelihood $\times$ prior, i.e., $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$.
The Dirichlet distribution is the conjugate prior of the multinomial distribution (this enables closed-form calculation: the prior has the same functional form as the posterior).
In Dirichlet smoothing, a Dirichlet prior is assumed over the model parameters, and the likelihood is multinomially distributed. Priors are updated as more and more data is observed.
Further details in the Data Mining lecture.

Laplace/Lidstone smoothing
Estimate $P(q_i \mid \theta_d)$ by
$P(q_i \mid \theta_d) = \frac{freq(q_i, d) + 1}{\sum_{t \in C} (freq(t, d) + 1)} = \frac{freq(q_i, d) + 1}{|C| + |d|}$
For ranking:
$Score(q, d) = \sum_i \log \frac{freq(q_i, d) + 1}{|C| + |d|}$
Advantage: easy-to-use (i.e., parameter-less) additive smoothing technique, no background model.
Disadvantage: for a corpus with many short documents, probability estimations are biased towards corpus frequencies.

Smoothing: examples
Jelinek-Mercer smoothing ($\lambda = 0.2$):
$P(\text{party} \mid d_1) = 0.8 \cdot \frac{0}{13} + 0.2 \cdot \frac{11}{72} \approx 0.031$, $P(\text{wind} \mid d_2) = 0.8 \cdot \frac{1}{14} + 0.2 \cdot \frac{6}{72} \approx 0.074$
Dirichlet smoothing ($\mu = 0.2$):
$P(\text{party} \mid d_1) = \frac{0 + 0.2 \cdot 11/72}{0.2 + 13} \approx 0.0023$, $P(\text{wind} \mid d_2) = \frac{1 + 0.2 \cdot 6/72}{0.2 + 14} \approx 0.072$
Laplace/Lidstone smoothing:
$P(\text{party} \mid d_1) = \frac{0 + 1}{72 + 13} \approx 0.01$, $P(\text{wind} \mid d_2) = \frac{1 + 1}{72 + 14} \approx 0.023$

Language models vs. tf-idf
(Figure: precision-recall results on a TREC dataset.) Source: J. M. Ponte and W. B. Croft, A Language Modeling Approach to Information Retrieval, SIGIR 1998.
More on evaluation measures in the next lecture.

Smoothing: summary
Jelinek-Mercer smoothing:
- Flexible combination of term frequencies with a background model
- Empirically: the longer the queries, the larger λ (i.e., the more smoothing is needed)
- Does not take document length into account
Dirichlet smoothing:
- Probabilistic (i.e., Bayesian) reasoning about a prior distribution on the model parameters
- Empirically: the longer the queries, the larger μ; however, μ is collection dependent
Laplace/Lidstone smoothing:
- Easy-to-use additive smoothing technique
- Probability estimations may be biased towards corpus frequencies
Two-stage smoothing is possible (i.e., Jelinek-Mercer smoothing on top of a Dirichlet-smoothed estimation).

Language models for relevance (1)
In document language models we were interested in estimating the query likelihood $P(q \mid \theta_d)$.
In relevance language models we are interested in $P(d \mid \theta_{R_q})$, the probability of a document being generated by a relevance model (i.e., the document is a sample from this model).
Can the two models be combined for ranking? One possibility: rank documents by increasing dissimilarity between the relevance model $\theta_{R_q}$ and the document model $\theta_d$.

Language models for relevance (2)
Measure the negative Kullback-Leibler divergence:
$-KL\!\left( P(\cdot \mid \theta_{R_q}) \,\|\, P(\cdot \mid \theta_d) \right) = -\sum_t P(t \mid \theta_{R_q}) \log \frac{P(t \mid \theta_{R_q})}{P(t \mid \theta_d)} = \sum_t P(t \mid \theta_{R_q}) \log P(t \mid \theta_d) - \sum_t P(t \mid \theta_{R_q}) \log P(t \mid \theta_{R_q})$
Simple estimation of $P(t \mid \theta_{R_q})$:
$P(t \mid \theta_{R_q}) \approx P(t \mid q_1, \dots, q_n) = \frac{P(t, q_1, \dots, q_n)}{P(q_1, \dots, q_n)}$
$P(t, q_1, \dots, q_n) = \sum_{d \in C} P(d)\, P(t, q_1, \dots, q_n \mid d) = \sum_{d \in C} P(d)\, P(t \mid d) \prod_i P(q_i \mid d)$ (conditional independence assumption)

Pseudo-relevance feedback algorithm
Estimation of $P(t \mid \theta_{R_q})$ as on the previous slide.
1. Rank documents by $P(q \mid \theta_d)$
2. Confine the corpus to C = {top-k ranked documents}
3. Estimate $P(\cdot \mid \theta_{R_q})$ on this new corpus
4. Re-rank documents by using $-KL\!\left( P(\cdot \mid \theta_{R_q}) \,\|\, P(\cdot \mid \theta_d) \right)$
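A compact sketch of this loop; the scoring function and the relevance-model estimate are simplified, and P(d) is taken as uniform over the top-k documents.

```python
# Pseudo-relevance feedback with a relevance language model (simplified sketch).
import math
from collections import Counter

def lm_score(query, doc, corpus_tf, corpus_len, lam=0.2):
    """Jelinek-Mercer smoothed query log-likelihood of a document."""
    tf = Counter(doc)
    return sum(math.log((1 - lam) * tf[t] / len(doc) + lam * corpus_tf[t] / corpus_len)
               for t in query)

def relevance_model(top_docs):
    """P(t | theta_Rq) ~ average of P(t | d) over the top-k documents (uniform P(d))."""
    p = Counter()
    for doc in top_docs:
        tf = Counter(doc)
        for t, c in tf.items():
            p[t] += (c / len(doc)) / len(top_docs)
    return p

docs = [["football", "score", "goal"], ["soccer", "goal", "score"], ["party", "law"]]
corpus_tf = Counter(t for d in docs for t in d)
corpus_len = sum(corpus_tf.values())
query = ["football", "score"]

# Steps 1-3: initial ranking by query likelihood, keep top-k, estimate the relevance model.
ranked = sorted(docs, key=lambda d: lm_score(query, d, corpus_tf, corpus_len), reverse=True)
rel_model = relevance_model(ranked[:2])
print(rel_model.most_common(3))   # expanded term distribution, e.g. "goal" gains weight
# Step 4 would then re-rank all documents by -KL(rel_model || document model).
```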

Integrating relevance feedback in language models
(Figure.) A query model $\theta_Q$ and a document model $\theta_D$ are compared via the divergence $KL(\theta_Q \,\|\, \theta_D)$ to produce the results. From the feedback documents $F = \{d_1, d_2, \dots, d_k\}$ a feedback model $\theta_F$ is estimated, and the query model is updated as $\theta_{Q'} = (1 - \alpha)\, \theta_Q + \alpha\, \theta_F$.

Language models: summary
- Tractable retrieval framework with an appealing interpretation
- Smoothed maximum likelihood estimations for unigrams work well in practice
- Estimations work less well for multigram likelihoods, a problem known as the curse of dimensionality
- A relevance model is possible
- Relevance feedback is difficult to integrate

Advanced probabilistic models: Bayesian networks
A Bayesian network is a directed graph representing dependencies between random variables.
Local Markov property: a variable node is independent of all its non-descendants given its parents.
One can reason about joint probabilities, e.g., $P(d_i, I)$.
(Figure: document variables $d_1, d_2$, feature variables $f_1, \dots, f_m$, query variables $q_1, \dots, q_k$, and the information-need variable I.)
More details in the Data Mining lecture.

Ranking support vector machine (Ranking SVM)
Motivation: feedback is cheap and plentiful.
Input: $(q, r), (q', r'), \dots$, where $(d_i, d_j) \in r$ if $d_i$ ranks higher than $d_j$ for query q, e.g., $(d_1, d_2), (d_1, d_6), (d_5, d_6) \in r$ and $(d_5, d_2), (d_5, d_7), (d_7, d_9) \in r'$.
Can we learn a weight vector w such that $\forall r: (d_i, d_j) \in r \Rightarrow w^T d_i > w^T d_j$?
Optimization problem:
minimize $\frac{1}{2} \lVert w \rVert^2 + \lambda \sum_{i,j,r} \xi_{i,j,r}$
subject to $\forall r,\, (d_i, d_j) \in r:\; w^T d_i \ge w^T d_j + 1 - \xi_{i,j,r}$, with slack variables $\xi_{i,j,r} \ge 0$ allowing some misclassification.
More details on SVMs in the Data Mining lecture.
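A sketch of the usual reduction of this problem to learning on pairwise difference vectors, trained here with plain subgradient descent on the hinge loss; the feature vectors and preference pairs are made up.

```python
# Ranking SVM via pairwise hinge loss, trained with subgradient descent (sketch).
import numpy as np

# Preference pairs (x_i, x_j): x_i should rank higher than x_j (illustrative 2-d features).
pairs = [(np.array([0.9, 0.2]), np.array([0.3, 0.1])),
         (np.array([0.8, 0.5]), np.array([0.2, 0.4])),
         (np.array([0.7, 0.9]), np.array([0.1, 0.3]))]

w = np.zeros(2)
lam, lr = 0.01, 0.1
for _ in range(200):
    grad = lam * w                   # gradient of the regularizer (lam/2)*||w||^2
    for xi, xj in pairs:
        diff = xi - xj
        if w @ diff < 1.0:           # margin violated: hinge loss is active
            grad -= diff
    w -= lr * grad

print(w)                             # documents are then ranked by w @ x (descending)
```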

Retrieval models (overview)
(Figure: taxonomy of retrieval models, including the binary independence model.) Source: http://en.wikipedia.org/wiki/File:Information-retrieval-models.png

Summary
Boolean model:
- Concatenation of query keywords by Boolean operators
- Either match or no match, no ranking
- Allows efficient processing but is impractical for large-scale retrieval
Vector space model:
- Vector representations of query and documents allow algebraic methods for measuring similarity
- Good for ranking, but not good at dealing with uncertainty or predicting relevance
Probabilistic and language models:
- Extensible and versatile
- Can handle uncertainty (by reasoning about probabilities)
- Maximum likelihood estimations work well for unigram models, less well for multigram models
- Difficult to maintain at large scale and in the presence of frequent updates