[Slide figure: example docid/score posting lists (5 10 12 32 48; 4 8 16 32 64 128; 2 3 5 16)]

Text processing example:
Original:    O'Neal averaged 15.2 points 9.2 rebounds and 1.0 assists per game
Lowercased:  o'neal averaged 15.2 points 9.2 rebounds and 1.0 assists per game
Stopped:     o'neal averaged 15.2 points 9.2 rebounds 1.0 assists per game
Stemmed:     o'neal averag 15.2 point 9.2 rebound 1.0 assist per game

Original      Porter2    Krovetz
organization  organ      organization
organ         organ      organ
heading       head       heading
head          head       head
european      european   europe
europe        europ      europe
urgency       urgenc     urgent
urgent        urgent     urgent
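To reproduce the Porter2 column above, a minimal sketch using NLTK's SnowballStemmer, which implements the Porter2 algorithm (the Krovetz stemmer is not bundled with NLTK, so only Porter2 is shown here):

```python
# Requires: pip install nltk
from nltk.stem.snowball import SnowballStemmer

porter2 = SnowballStemmer("english")  # Snowball English = Porter2
for word in ["organization", "organ", "heading", "head",
             "european", "europe", "urgency", "urgent"]:
    print(f"{word:14s} -> {porter2.stem(word)}")
```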


For vectors $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$:

$$\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{k} x_i y_i}{\sqrt{\sum_{i=1}^{k} x_i^2}\,\sqrt{\sum_{i=1}^{k} y_i^2}}$$

Applied to a query $q = (q_1, \ldots, q_k)$ and a document $d = (d_1, \ldots, d_k)$:

$$\cos(q, d) = \frac{\sum_{i=1}^{k} q_i d_i}{\sqrt{\sum_{i=1}^{k} q_i^2}\,\sqrt{\sum_{i=1}^{k} d_i^2}}$$

$$w_i = \mathrm{TF}(t_i, D) \times \mathrm{IDF}(t_i)$$

$$\mathrm{TF}_{\text{binary}}(t_i, D) = \begin{cases} 1 & c(t_i, D) > 0 \\ 0 & c(t_i, D) = 0 \end{cases}$$

$$\mathrm{TF}_{\text{raw}}(t_i, D) = c(t_i, D)$$

$$\mathrm{TF}_{\log}(t_i, D) = \begin{cases} 1 + \log_b c(t_i, D) & c(t_i, D) > 0 \\ 0 & c(t_i, D) = 0 \end{cases}$$

[Figure: growth of the log function for different bases, e.g. y = log_2 x, y = ln x, and y = log_10 x, compared with y = x]

Example with the natural log (b = e): ln 2 ≈ 0.69 and ln 3 ≈ 1.10, so a term with c(t, D) = 2 gets TF = 1 + ln 2 ≈ 1.69, a term occurring once gets TF = 1, and an absent term gets TF = 0.

$$\mathrm{IDF}_{\text{uniform}}(t_i) = 1$$

$$\mathrm{IDF}_{\text{KSJ}}(t) = \log \frac{N}{n_t}$$

$$\mathrm{IDF}_{\text{BM25}}(t) = \begin{cases} \log \dfrac{N - n_t + 0.5}{n_t + 0.5} & n_t \le N/2 \\ 0 & n_t > N/2 \end{cases}$$
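A minimal sketch of these TF and IDF variants in Python (function and variable names are my own, not from the slides):

```python
import math

def tf_binary(c): return 1 if c > 0 else 0
def tf_raw(c): return c
def tf_log(c, b=math.e):
    # 1 + log_b c(t,D) for present terms, 0 for absent terms
    return 1 + math.log(c, b) if c > 0 else 0

def idf_ksj(N, n_t):
    # Kent / Sparck Jones style IDF: log(N / n_t)
    return math.log(N / n_t)

def idf_bm25(N, n_t):
    # clipped at 0 when the term occurs in more than half the documents
    return max(0.0, math.log((N - n_t + 0.5) / (n_t + 0.5)))

# Example: a term appearing twice in D, in 100 of 10,000 documents.
w = tf_log(2) * idf_ksj(10_000, 100)
print(w)  # (1 + ln 2) * ln 100 ~= 1.69 * 4.61
```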

Multivariate Bernoulli Distribution
The outcome of n independent Bernoulli trials {X_1, ..., X_n}: n different coins, each tossed once, where each coin may have a different probability of heads/tails. Example: X_1: tails, X_2: heads, X_3: heads, X_4: heads. This is different from a binomial distribution of size n, which tosses the same coin n times (each toss independent of the others).

A Multivariate Bernoulli Model of a Document
Consider a document D as an outcome of the model, represented by its binary term-occurrence vector. What is the probability of that vector under this MB model?

Term         x_i  P(X_i = 1)  P(X_i = 0)
index        1    0.4         0.6
retrieval    1    0.3         0.7
search       0    0.5         0.5
information  1    0.9         0.1
data         0    0.8         0.2
computer     1    0.9         0.1
science      0    0.4         0.6

$$P(D) = P(\text{index}=1)\,P(\text{retrieval}=1)\,P(\text{search}=0)\,P(\text{information}=1)\,P(\text{data}=0)\,P(\text{computer}=1)\,P(\text{science}=0) = 0.4 \times 0.3 \times 0.5 \times 0.9 \times 0.2 \times 0.9 \times 0.6$$

Naïve Bayes Classification using MB Models

Term         x_i  P(X_i = 1 | IR)  P(X_i = 1 | DB)
index        1    0.7              0.8
search       1    0.9              0.9
information  0    0.8              0.6
data         1    0.5              0.9
computer     0    0.4              0.6
relevance    1    0.9              0.1
SQL          0    0.1              0.8

Priors: P(IR) = 0.3, P(DB) = 0.7.

$$P(IR \mid D) \propto P(IR) \prod_{X_i \in V} P(X_i = x_i \mid IR) = 0.3 \times 0.7 \times 0.9 \times (1{-}0.8) \times 0.5 \times (1{-}0.4) \times 0.9 \times (1{-}0.1)$$
$$P(DB \mid D) \propto P(DB) \prod_{X_i \in V} P(X_i = x_i \mid DB) = 0.7 \times 0.8 \times 0.9 \times (1{-}0.6) \times 0.9 \times (1{-}0.6) \times 0.1 \times (1{-}0.8)$$

The ratio is about 6.33: D is 6.33 times more likely to be an IR article than a DB one.
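A minimal sketch of this Naïve Bayes computation, with the term order and probabilities taken from the table above (function names are my own):

```python
def mb_class_score(prior, p_one, x):
    """prior * product over terms of P(X_i = x_i | class), under a
    multivariate Bernoulli model; p_one[i] = P(X_i = 1 | class)."""
    score = prior
    for p, xi in zip(p_one, x):
        score *= p if xi == 1 else (1 - p)
    return score

# index, search, information, data, computer, relevance, SQL
x    = [1,   1,   0,   1,   0,   1,   0]
p_ir = [0.7, 0.9, 0.8, 0.5, 0.4, 0.9, 0.1]
p_db = [0.8, 0.9, 0.6, 0.9, 0.6, 0.1, 0.8]

s_ir = mb_class_score(0.3, p_ir, x)
s_db = mb_class_score(0.7, p_db, x)
print(s_ir / s_db)  # ~6.33: D is more likely an IR article
```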

$$\log \frac{P(R \mid D, Q)}{P(NR \mid D, Q)} \propto \sum_{t_i \in Q \cap D} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}, \qquad p_i = P(X_i = 1 \mid R, Q), \quad q_i = P(X_i = 1 \mid NR, Q)$$

With $p_i = 0.5$ and $q_i \approx \frac{n_{t_i}}{N}$:

$$\sum_{t_i \in Q \cap D} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = \sum_{t_i \in Q \cap D} \log \frac{1 - q_i}{q_i} \approx \sum_{t_i \in Q \cap D} \log \frac{N - n_{t_i} + 0.5}{n_{t_i} + 0.5}$$

$$\mathrm{score}_{BM25}(d, q) = \sum_{q_i \in q} \mathrm{weight}_{BM25}(q_i, d)$$

$$\mathrm{weight}_{BM25}(q_i, d) = \frac{(k_1 + 1)\, tf_{q_i, d}}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf_{q_i, d}} \cdot \log \frac{N - n_i + 0.5}{n_i + 0.5}$$

where:
tf_{q_i,d}: the raw frequency of q_i in d
N: the total number of documents in the corpus
n_i: the document frequency of q_i
dl: the length of the document d
avdl: the average length of documents in the corpus
k_1 and b are two parameters
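A minimal, self-contained BM25 scorer following the formula above (a sketch; the toy corpus, function names, and default parameter values are my own):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, N, avdl, k1=1.2, b=0.75):
    """Sum of BM25 weights over query terms; df maps term -> document frequency."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if tf[t] == 0 or t not in df:
            continue
        # Note: with a tiny corpus this IDF can go negative, which is why
        # the clipped IDF_BM25 variant shown earlier bottoms out at 0.
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        tf_part = (k1 + 1) * tf[t] / (k1 * (1 - b + b * dl / avdl) + tf[t])
        score += tf_part * idf
    return score

docs = [["information", "retrieval", "retrieval", "search"],
        ["database", "sql", "search"],
        ["information", "theory"]]
df = Counter(t for d in docs for t in set(d))
avdl = sum(len(d) for d in docs) / len(docs)
for d in docs:
    print(bm25_score(["information", "retrieval"], d, df, len(docs), avdl))
```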

$k_1 + 1$ determines the upper bound of the TF component:

$$\lim_{tf \to \infty} \frac{(k_1 + 1)\, tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf} = k_1 + 1$$

For an average-length document ($dl = avdl$), the TF component simplifies to

$$TF = \frac{(k_1 + 1)\, tf}{k_1 + tf}$$

For a longer-than-average document ($dl > avdl$), raw tf is penalized:

$$TF = \frac{(k_1 + 1)\, tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf} < \frac{(k_1 + 1)\, tf}{k_1 + tf}$$

For a shorter-than-average document ($dl < avdl$), raw tf is boosted:

$$TF = \frac{(k_1 + 1)\, tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf} > \frac{(k_1 + 1)\, tf}{k_1 + tf}$$

[Figure: the BM25 TF component $(k_1+1)\,tf \,/\, (k_1(1 - b + b\,dl/avdl) + tf)$ plotted against raw tf, for shorter-than-average, average-length, and longer-than-average documents]

A unigram language model θ is a probability distribution over the vocabulary V: $\sum_{t \in V} P(t \mid \theta) = 1$. For example:

t            P(t|θ)
index        0.21
retrieval    0.32
search       0.18
information  0.11
data         0.06
computer     0.04
science      0.08

The probability that θ generates a document $D = (t_1, \ldots, t_n)$:

$$P(D \mid \theta) = \prod_{i=1}^{n} P(t_i \mid \theta)$$

e.g., for a document whose tokens have probabilities 0.11, 0.32, 0.18, 0.21, 0.32 under the model above: $P(D \mid \theta) = 0.11 \times 0.32 \times 0.18 \times 0.21 \times 0.32$.

Naïve Bayes classification with unigram language models:

t            c(t,D)  P(t|IR)  P(t|DB)
index        1       0.21     0.17
retrieval    2       0.32     0.05
search       1       0.18     0.22
information  1       0.11     0.12
data         0       0.06     0.33
computer     0       0.04     0.08
science      0       0.08     0.03

Prior probabilities: P(IR) = 0.3, P(DB) = 0.7.

$$P(IR \mid D) \propto P(IR) \prod_{t \in D} P(t \mid IR)^{c(t, D)} = 0.3 \times 0.21 \times 0.32^2 \times 0.18 \times 0.11$$
$$P(DB \mid D) \propto P(DB) \prod_{t \in D} P(t \mid DB)^{c(t, D)} = 0.7 \times 0.17 \times 0.05^2 \times 0.22 \times 0.12$$

D is 16.26 times more likely to be an IR article than a DB one.

$$P(q \mid \theta_D) = \prod_{t \in q} P(t \mid \theta_D), \qquad \log P(q \mid \theta_D) = \sum_{t \in q} \log P(t \mid \theta_D)$$

Recap: The Query Likelihood model (QL)
Each document is generated from a document LM θ_D: estimate a language model θ_D for each document D, then rank documents by P(q|θ_D).

Example: for the query "information retrieval", QL ranks D_1 higher than D_2.

D_1's model:
t            P(t|θ_D1)
index        0.21
retrieval    0.32
search       0.18
information  0.11
data         0.06
computer     0.04
science      0.08

D_2's model:
t            P(t|θ_D2)
index        0.17
retrieval    0.05
search       0.22
information  0.12
data         0.33
computer     0.08
science      0.03

$$P(q \mid \theta_{D_1}) = \prod_{t \in q} P(t \mid \theta_{D_1}) = 0.11 \times 0.32, \qquad P(q \mid \theta_{D_2}) = 0.12 \times 0.05$$
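A minimal query-likelihood scorer over the two document models above (a sketch; the dictionary and function names are my own):

```python
import math

theta_d1 = {"index": 0.21, "retrieval": 0.32, "search": 0.18,
            "information": 0.11, "data": 0.06, "computer": 0.04, "science": 0.08}
theta_d2 = {"index": 0.17, "retrieval": 0.05, "search": 0.22,
            "information": 0.12, "data": 0.33, "computer": 0.08, "science": 0.03}

def log_query_likelihood(query, theta):
    # log P(q | theta_D) = sum over query terms of log P(t | theta_D)
    return sum(math.log(theta[t]) for t in query)

q = ["information", "retrieval"]
print(log_query_likelihood(q, theta_d1))  # log(0.11 * 0.32)
print(log_query_likelihood(q, theta_d2))  # log(0.12 * 0.05): D1 ranks higher
```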

$$\hat{P}_{MLE}(t \mid D) = \frac{c(t, D)}{|D|} = \frac{\text{term frequency}}{\text{document length}}$$

Example, for a document of length |D| = 12: $\hat{P}_{MLE}(\text{information} \mid D) = 2/12$, $\hat{P}_{MLE}(\text{retrieval} \mid D) = 1/12$, and "is", "an", "important", "technique", "for" each get 1/12.

$$\hat{P}_{MLE}(t \mid D) = \frac{c(t, D)}{|D|} = \frac{\text{term frequency}}{\text{document length}}$$

$$\hat{P}_{MLE}(t \mid corpus) = \frac{\sum_D c(t, D)}{\sum_D |D|} = \frac{\text{corpus term frequency}}{\text{corpus length}}$$

Estimated at the document level, the MLE connects to IDF:

$$\hat{P}_{MLE}(X_{t_i} = 1 \mid corpus) = \frac{DF_{t_i}}{N} = e^{-\mathrm{IDF}(t_i)}$$

MLE: Recall Several Issues
- Unseen words get zero probability: as long as one query term does not appear in D, the document gets zero probability, so ranking by P_MLE(q|D) behaves like Boolean AND. But no occurrence does not mean impossible.
- Limited sample size: MLE is reasonably good for a large sample, but we are estimating a document model from a sample of usually just a few hundred or thousand words. In some cases (covered next lecture), we also need to estimate a query model, where the sample is even smaller.
- Solution: smoothing (discussed a few slides later).

Jelinek-Mercer Smoothing
Start simple, but reasonably good: interpolate with P(t|Corpus) as the background model:

$$P(t \mid \theta_D) = (1 - \lambda)\, P_{MLE}(t \mid D) + \lambda\, P(t \mid Corpus)$$

Set λ to a constant for all documents, independent of any document or query characteristics. Tune λ to optimize retrieval performance, e.g., maximize the mean P@10 or AP over a set of queries in a dataset. The optimal value of λ varies with different collections, query sets, etc., so setting λ correctly is very important.

Jelinek-Mercer Smoothing
Example: λ = 0.5, |D| = 1281.

$$P_{smoothed}(\text{the} \mid D) = 0.5 \times \frac{106}{1281} + 0.5 \times 0.063904 = 0.073326$$

word        freq  P_MLE(*|D)  P(*|Corpus)  Smoothed
the         106   0.082748    0.063904     0.073326
soviet      18    0.014052    0.000208     0.007130
chernobyl   10    0.007806    0.000012     0.003909
disclosure  1     0.000781    0.000053     0.000417
divert      1     0.000781    0.000014     0.000397
downplaye   1     0.000781    0.000001     0.000391
each        1     0.000781    0.000489     0.000635
early       1     0.000781    0.000486     0.000633

Dirichlet Smoothing
Problem with Jelinek-Mercer: all documents get the same λ. Longer documents provide better estimates (a larger sample), so their own MLE is more reliable. Dirichlet smoothing makes the interpolation depend on the sample size (adaptive). Here |D| is the length of the sample and μ is a constant:

$$P(t \mid D) = \frac{c(t, D) + \mu\, P(t \mid Corpus)}{|D| + \mu}$$

MLE weight: $\frac{|D|}{|D| + \mu}$; Corpus weight: $\frac{\mu}{|D| + \mu}$.

Dirichlet Smoothing
Example: μ = 500, |D| = 1281.

$$P_{smoothed}(\text{the} \mid D) = \frac{106 + 500 \times 0.063904}{1281 + 500} = 0.077458$$

word        freq  P_MLE(*|D)  P(*|Corpus)  Smoothed
the         106   0.082748    0.063904     0.077458
soviet      18    0.014052    0.000208     0.010165
chernobyl   10    0.007806    0.000012     0.005618
disclosure  1     0.000781    0.000053     0.000576
divert      1     0.000781    0.000014     0.000565
downplaye   1     0.000781    0.000001     0.000562
each        1     0.000781    0.000489     0.000699
early       1     0.000781    0.000486     0.000698
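Both smoothing methods in a few lines of Python, reproducing the "the" examples above (a sketch; function names are my own):

```python
def jelinek_mercer(tf, dl, p_corpus, lam=0.5):
    # (1 - lambda) * P_MLE(t|D) + lambda * P(t|Corpus)
    return (1 - lam) * tf / dl + lam * p_corpus

def dirichlet(tf, dl, p_corpus, mu=500):
    # (c(t,D) + mu * P(t|Corpus)) / (|D| + mu)
    return (tf + mu * p_corpus) / (dl + mu)

print(jelinek_mercer(106, 1281, 0.063904))  # ~0.073326
print(dirichlet(106, 1281, 0.063904))       # ~0.077458
```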

Smoothing and IDF
Like many retrieval models, QL can be written as

$$\mathrm{score}(q, D) = \sum_{t \in q} w(t, D), \qquad w(t, D) = \log P(t \mid D)$$

Dirichlet smoothing: $P(t \mid D) = \frac{tf + \mu P(t \mid Corpus)}{|D| + \mu}$. JM smoothing: $P(t \mid D) = (1 - \lambda)\frac{tf}{|D|} + \lambda P(t \mid Corpus)$.

How fast does w(t, D) grow with tf?

$$w'(t, D) = \frac{\partial}{\partial tf} \log P(t \mid D) = \frac{1}{P(t \mid D)} \cdot \frac{\partial P(t \mid D)}{\partial tf}$$

Dirichlet: $w'(t, D) = \frac{1}{tf + \mu P(t \mid Corpus)}$. JM: $w'(t, D) = \frac{(1 - \lambda)/|D|}{(1 - \lambda)\, tf/|D| + \lambda P(t \mid Corpus)}$.

(tf is discrete, but let's just assume the functions are all continuous here.)

Smoothing and IDF
No matter which smoothing method is employed, common words get much higher $P_{smoothed}(t \mid D)$, but as raw tf increases, the QL weight of a common word grows much more slowly than that of a less common or rare word:

Dirichlet: $w'(t, D) = \frac{1}{tf + \mu P(t \mid Corpus)}$; a larger $P(t \mid Corpus)$ means slower growth.
JM: $w'(t, D) = \frac{(1 - \lambda)/|D|}{(1 - \lambda)\, tf/|D| + \lambda P(t \mid Corpus)}$; the same effect.

For MLE, $w(t, D) = \log \frac{tf}{|D|}$ and $w'(t, D) = \frac{1}{tf}$, which is the same for common and less common terms: it depends only on tf.

(tf is discrete, but let's just assume the functions are all continuous here.)

[Figure: log P_smoothed as a function of raw tf under Dirichlet smoothing with μ = 1500; a long document (|D| = 5000) vs. a short document (|D| = 100); a common word (P(t|corpus) = 0.01) vs. a less common word (P(t|corpus) = 0.0001). For common words, log P_smoothed increases much more slowly.]

$$KLD(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

$$KLD(\theta_q \parallel \theta_D) = \sum_t P(t \mid \theta_q) \log \frac{P(t \mid \theta_q)}{P(t \mid \theta_D)} = \sum_t P(t \mid \theta_q) \log P(t \mid \theta_q) - \sum_t P(t \mid \theta_q) \log P(t \mid \theta_D)$$

The first sum depends only on the query, so ranking by $-KLD(\theta_q \parallel \theta_D)$ is equivalent to ranking by $\sum_t P(t \mid \theta_q) \log P(t \mid \theta_D)$. When $\hat{P}_{MLE}(t \mid \theta_q) = \frac{c(t, q)}{|q|}$:

$$\sum_t P(t \mid \theta_q) \log P(t \mid \theta_D) = \frac{1}{|q|} \sum_{t \in q} c(t, q) \log P(t \mid \theta_D) = \frac{1}{|q|} \log P(q \mid \theta_D)$$

i.e., negative KL-divergence with an MLE query model is rank-equivalent to QL.
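A short sketch of KL-divergence ranking that checks the equivalence with QL for an MLE query model (names and the toy model are my own):

```python
import math
from collections import Counter

def neg_kld_score(query_terms, theta_d):
    # -KLD(theta_q || theta_D) with the query-only entropy term dropped:
    # sum over terms of P(t|theta_q) * log P(t|theta_D)
    q_counts = Counter(query_terms)
    q_len = len(query_terms)
    return sum((c / q_len) * math.log(theta_d[t]) for t, c in q_counts.items())

theta_d = {"information": 0.11, "retrieval": 0.32}
q = ["information", "retrieval"]
print(neg_kld_score(q, theta_d))       # (1/2) * log(0.11 * 0.32)
print(math.log(0.11 * 0.32) / len(q))  # same value: QL / |q|
```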

How to improve retrieval?
- Clustering search results: group top-ranked results into different topics and show only a few results per topic, so the top-ranked results are not biased toward one particular topic.
- Using clusters to improve document representation: because a document is (relatively) short, its representation can be boosted by taking into account the clusters/topics it belongs to.

Cluster-based Document Model (Liu & Croft, SIGIR '04)
Uses k-means for clustering, with unigrams as features. Represent a cluster as a language model:

$$P(t \mid cluster) = \lambda \frac{c(t, cluster)}{\sum_{t_i} c(t_i, cluster)} + (1 - \lambda)\, P(t \mid corpus)$$

Smooth a document D's MLE model using both the corpus model (the same as QL) and the model of the cluster D belongs to:

$$P(t \mid D) = \lambda_1\, P_{MLE}(t \mid D) + \lambda_2\, P(t \mid cluster) + (1 - \lambda_1 - \lambda_2)\, P(t \mid corpus)$$

LDA-based Document Model (Wei & Croft, SIGIR '05)
Similar to the cluster-based document model: smooth a document's MLE model with the corpus model and a mixture model of the document's topics.

Pseudo-relevance Feedback
Pseudo-relevance feedback (PRF), also called blind feedback:
- Do an initial search using a regular approach, such as QL.
- Assume the top k ranked results are relevant.
- Perform relevance feedback based on the top k results, normally by query expansion.
- Re-run the query.
A few practical issues: the assumption itself, and efficiency concerns (expanding a short query of 2-3 words into a long one, e.g., ~50 words). PRF is practically effective for improving overall search effectiveness (in terms of the mean values of effectiveness metrics). Our focus today.

RM1

$$P(t, q \mid R) = \sum_{D \in R} P(D \mid R)\, P(t, q \mid D, R) = \sum_{D \in R} P(D \mid R)\, P(t \mid D, R)\, P(q \mid D, R) = \sum_{D \in R} P(D \mid R)\, P(t \mid D) \prod_{q_i \in q} P(q_i \mid D)$$

Assumptions:
A1: P(D|R) is uniform over the feedback documents.
A2: t and q are conditionally independent given D.
A3: P(t|D,R) = P(t|D) and P(q|D,R) = P(q|D).
A4: query terms are independent given D: $P(q \mid D) = \prod_{q_i \in q} P(q_i \mid D)$.

RM1

$$P(t \mid q, R) \propto P(t, q \mid R) \propto \sum_{D \in R} P(t \mid D) \prod_{q_i \in q} P(q_i \mid D)$$

Computation (see the sketch after this list):
- Iterate over each feedback document (source) D.
- Assign a weight to D. In PRF, we retrieve the top k results by QL and weight each document by its QL probability $\prod_{q_i \in q} P(q_i \mid D)$, so higher-ranked results get more weight.
- Expand a term t from D with weight $P(t \mid D) P(q \mid D)$.
- Sum up each term's weights over the feedback documents.
- Normalize the term weights into a probability:

$$P(t \mid q, R) = \frac{\sum_{D \in R} P(t \mid D) \prod_{q_i \in q} P(q_i \mid D)}{\sum_{t_j} \sum_{D \in R} P(t_j \mid D) \prod_{q_i \in q} P(q_i \mid D)}$$
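A minimal sketch of RM1 term expansion over a set of feedback document models, under the assumptions above (the data structures, names, and the floor for unseen terms are my own):

```python
from collections import defaultdict

def rm1(query, feedback_models):
    """feedback_models: list of dicts mapping term -> P(t|D) (already smoothed).
    Returns the normalized relevance model P(t | q, R)."""
    weights = defaultdict(float)
    for theta_d in feedback_models:
        # QL weight of this feedback document: prod_i P(q_i | D)
        ql = 1.0
        for q_i in query:
            ql *= theta_d.get(q_i, 1e-10)  # tiny floor for unseen query terms
        for t, p in theta_d.items():
            weights[t] += p * ql
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

models = [{"information": 0.11, "retrieval": 0.32, "search": 0.18},
          {"information": 0.12, "retrieval": 0.05, "data": 0.33}]
rm = rm1(["information", "retrieval"], models)
print(sorted(rm.items(), key=lambda kv: -kv[1])[:3])  # top expansion terms
```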

RM2

$$P(t \mid q, R) \propto \prod_{q_i \in q} \sum_{D_j \in R} P(t \mid D_j)\, P(q_i \mid D_j)$$

Computation:
- Iterate over each query term q_i.
- Iterate over each feedback document D: assign a weight P(q_i|D) to D, and expand a term t from D by P(t|D)P(q_i|D). If both t and q_i occur frequently in D, t gets a greater weight.
- Sum up the weights over the documents.
- Multiply the expansion weights across the query terms q_i.
- Normalize the term weights into a probability.

Comparing different approaches (Lv & Zhai, CIKM '09)

Pseudo-relevance Feedback
Usually believed to be a useful technique, but somewhat controversial:
- Recall oriented; limited improvements in precision at the top.
- Can make good queries bad and bad queries worse. It improves the overall averages of metrics, but improving bad/difficult queries may be more important.
- Search efficiency concerns.
- Difficult to control; unpredictable for the user.
- Difficult to improve in a noisy corpus (such as a web corpus); one remedy is to use a clean corpus, e.g., Wikipedia, for query expansion.

Average Precision (AP): example
Suppose there are 5 relevant documents. Ranking #1 retrieves them at ranks 1, 3, 6, 9, 10:
Rank = 1, precision = 1
Rank = 3, precision = 2/3
Rank = 6, precision = 3/6
Rank = 9, precision = 4/9
Rank = 10, precision = 5/10
AP = 1/5 × (1 + 2/3 + 3/6 + 4/9 + 5/10) ≈ 0.62

Average Precision (AP): example
Ranking #2: AP = ?

Average Precision (AP): example
Ranking #2 retrieves 4 of the 5 relevant documents, at ranks 2, 5, 6, 7:
Rank = 2, precision = 1/2
Rank = 5, precision = 2/5
Rank = 6, precision = 3/6
Rank = 7, precision = 4/7
One relevant result is not retrieved; we treat its rank as ∞, so its precision = 0.
AP = 1/5 × (1/2 + 2/5 + 3/6 + 4/7 + 0) ≈ 0.39
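A small helper that computes AP from the ranks of retrieved relevant results (a sketch; names are my own):

```python
def average_precision(relevant_ranks, num_relevant):
    """relevant_ranks: 1-based ranks at which relevant docs were retrieved.
    Relevant docs that were never retrieved contribute precision 0."""
    precisions = [(i + 1) / rank for i, rank in enumerate(sorted(relevant_ranks))]
    return sum(precisions) / num_relevant

print(average_precision([1, 3, 6, 9, 10], 5))  # Ranking #1: ~0.62
print(average_precision([2, 5, 6, 7], 5))      # Ranking #2: ~0.39
```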

$$DCG@k = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2 (i + 1)}$$

$$nDCG@k = \frac{DCG@k}{IDCG@k}$$

Example, for results with ratings (1, 3, 0, 0, 3) in ranked order (ideal ordering: 3, 3, 1, 1, 0):

$$DCG@5 = \frac{2^1 - 1}{\log_2 2} + \frac{2^3 - 1}{\log_2 3} + \frac{2^0 - 1}{\log_2 4} + \frac{2^0 - 1}{\log_2 5} + \frac{2^3 - 1}{\log_2 6}$$

$$IDCG@5 = \frac{2^3 - 1}{\log_2 2} + \frac{2^3 - 1}{\log_2 3} + \frac{2^1 - 1}{\log_2 4} + \frac{2^1 - 1}{\log_2 5} + \frac{2^0 - 1}{\log_2 6}$$

$$nDCG@5 = \frac{DCG@5}{IDCG@5}$$
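The same computation in Python (a sketch; names are my own):

```python
import math

def dcg_at_k(ratings, k):
    # rank i+1 contributes (2^r - 1) / log2(rank + 1)
    return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(ratings[:k]))

def ndcg_at_k(ratings, k):
    ideal = sorted(ratings, reverse=True)
    return dcg_at_k(ratings, k) / dcg_at_k(ideal, k)

print(ndcg_at_k([1, 3, 0, 0, 3], 5))
```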

User    Sequence        User    Sequence
User 1  A -> B -> C     User 4  A -> B -> C
User 2  B -> C -> A     User 5  B -> C -> A
User 3  C -> A -> B     User 6  C -> A -> B

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$R^2_{\text{Adjusted}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} = 1 - \frac{\frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2}$$
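A quick check of both formulas (a sketch; the toy data are my own):

```python
def r_squared(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    # p = number of predictors in the model
    n = len(y)
    return 1 - (1 - r_squared(y, y_hat)) * (n - 1) / (n - p - 1)

y = [3.0, 4.0, 5.0, 6.0]
y_hat = [2.8, 4.1, 5.2, 5.9]
print(r_squared(y, y_hat), adjusted_r_squared(y, y_hat, p=1))
```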

         Lucene  Galago  Indri
Task 1   3.40    3.72    3.65
Task 2   3.56    3.48    3.82

       Lucene  Galago  Indri
Mean   3.40    3.82    3.60

Source         DF  Adj SS   Adj MS    F-Value  P-Value
Task           2   8.222    4.1111    1.71     0.2094
System         2   20.222   10.1111   4.20     0.0318
Task * System  4   46.222   11.5556   4.80     0.0082
Error          18  43.333   2.4074
Total          26  118.000