IR Models: The Probabilistic Model. Lecture 8

Probability of Relevance?
IR is an uncertain process:
- information need to query
- documents to index terms
- query terms and index terms mismatch
This leads to several statistical approaches: probability theory, fuzzy logic, theory of evidence, ...

Probabilistic Retrieval
- Given a query q, there exists a subset R of the documents which are relevant to q
- But membership of R is uncertain
- A probabilistic retrieval model ranks documents in decreasing order of their probability of relevance to the information need: P(R | q, d_i)

Difficulties
1. Evidence is based on a lossy representation
   - Evaluate the probability of relevance based on the occurrence of terms in the query and in documents
   - Start with an initial estimate, and refine it through feedback
2. Computing the probabilities exactly according to the model is intractable
   - Make some simplifying assumptions

Probabilistic Model: definitions
- d_j = (t_{1,j}, t_{2,j}, ..., t_{t,j}), with t_{i,j} in {0, 1}: term occurrences are boolean (not counts)
- the query q is represented in the same way
- R is the set of relevant documents, ~R is the set of irrelevant documents
- P(R | d_j) is the probability that d_j is relevant, P(~R | d_j) that it is irrelevant

Retrieval Status Value
- The "similarity" function is the ratio of the probability of relevance to the probability of non-relevance
- Transform P(R | d_j) using Bayes' rule
- Compute rsv() in terms of document probabilities
- P(R) and P(~R) are constant for each document

Retrieval Status Value (2)
- d is a vector of binary term occurrences
- We assume that terms occur independently of each other
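The equations for these two slides did not survive the transcription. A standard form of the binary-independence derivation they describe, writing p_i = P(t_i | R) and u_i = P(t_i | ~R), is:

\[
\mathrm{rsv}(d_j) = \log\frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}
= \log\frac{P(d_j \mid R)\,P(R)}{P(d_j \mid \bar{R})\,P(\bar{R})}
\stackrel{\text{rank}}{=} \log\frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}
\]

and, with the term-independence assumption,

\[
\log\frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}
= \sum_{i:\,t_{i,j}=1}\log\frac{p_i}{u_i} + \sum_{i:\,t_{i,j}=0}\log\frac{1-p_i}{1-u_i}
\stackrel{\text{rank}}{=} \sum_{i:\,t_{i,j}=1}\log\frac{p_i\,(1-u_i)}{u_i\,(1-p_i)},
\]

where the rank-equivalent steps drop factors that are the same for every document.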

Computing term probabilities
- Initially there are no retrieved documents, so R is completely unknown
- Assume P(t_i | R) is constant (usually 0.5)
- Assume P(t_i | ~R) is approximated by the distribution of t_i across the collection, which yields an IDF-like weight
- This can be used to compute an initial ranking, using IDF as the basic term weight
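As an illustration only (this code is not from the lecture), a minimal sketch of the initial estimate, with p_i = 0.5 and u_i approximated by n_i / N:

    import math

    def initial_term_weight(n_i, N, p_i=0.5):
        """Initial term weight with no relevance information.

        With p_i = P(t_i | R) held constant at 0.5, that factor contributes
        nothing to the log-odds, and approximating u_i = P(t_i | ~R) by n_i / N
        reduces the weight to the IDF-like quantity log((N - n_i) / n_i).
        """
        u_i = n_i / N
        return math.log((p_i * (1 - u_i)) / (u_i * (1 - p_i)))

    # A term in 3 of 6 documents gets weight log(1) = 0;
    # a rarer term in 1 of 6 documents gets log(5), about 1.61.
    print(initial_term_weight(3, 6), initial_term_weight(1, 6))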

Probabilistic Model Example
[The per-document vectors <tf_{d,t}> on this slide did not survive transcription; the initial term weights shown are:]
term: col   day   eat   hot   lot   nin   old   pea   por   pot
w_t:  0.26  0.56  0.56  0.26  0.56  0.56  0.56  0.0   0.0   0.26
Queries: q1 = eat; q2 = porridge; q3 = hot porridge; q4 = eat nine day old porridge

Improving the ranking
- Now suppose we have shown the initial ranking to the user, and the user has labeled some of the documents as relevant ("relevance feedback")
- We now have: N documents in the collection, R of them known to be relevant; n_i documents containing t_i, r_i of them relevant

Improving term estimates for term i

                           Relevant    Non-relevant     Total
docs containing the term   r           n - r            n
docs NOT containing it     R - r       N - R - n + r    N - n
Total                      R           N - R            N

Final term weight
- Add 0.5 to each cell of the table, to keep the weight from becoming infinite when R and r are small:
  w^(1) = log [ ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)) ]
- We can continue to refine the ranking as the user gives more feedback.
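A small sketch of this relevance-weighted estimate (an illustrative helper, using natural logarithms):

    import math

    def rsj_weight(r, n, R, N):
        """Robertson/Sparck Jones relevance weight with 0.5 smoothing.

        r : known relevant documents containing the term
        n : documents containing the term
        R : known relevant documents
        N : documents in the collection
        """
        return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                        ((n - r + 0.5) * (R - r + 0.5)))

    # With q3 = hot porridge and document 2 relevant (N = 6, R = 1), and assuming
    # "pot" occurs in 2 documents including the relevant one, the base-10 value
    # matches the 0.95 shown in the relevance-weighted example below.
    print(rsj_weight(1, 2, 1, 6) / math.log(10))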

Relevance-weighted Example
q3 = hot porridge; document 2 is relevant
[The document vectors <tf_{d,t}> are the same as before; the relevance-weighted term weights become:]
term: col    day   eat   hot    lot   nin   old   pea   por   pot
w_t:  -0.33  0.0   0.0   -0.33  0.0   0.0   0.0   0.62  0.62  0.95

Summary
- The probabilistic model uses probability theory to model the uncertainty in the retrieval process
- Its assumptions are made explicit
- The term weight without relevance information is inverse document frequency (IDF)
- Relevance feedback can improve the ranking by giving better term probability estimates
- No use is made of within-document term frequencies or document lengths

Building on the Probabilistic Model: Okapi weighting
- The Okapi system, developed at City University London, is based on the probabilistic model
- The cost of not using tf and document length: the model does not perform as well as the VSM, and performance suffers on long documents
- The Okapi solution: model within-document term frequencies as a mixture of two Poisson distributions, one for relevant documents and one for irrelevant ones

Okapi best-match weights
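The weight formula on this slide was lost in the transcription; the standard Okapi BM25 best-match form, whose w^(1), tf, qtf and document-length components are the ones referred to on the following slides, is:

\[
\mathrm{score}(q, d) = \sum_{t \in q} w^{(1)}_t \cdot
\frac{(k_1 + 1)\, f_{d,t}}{K + f_{d,t}} \cdot
\frac{(k_3 + 1)\, f_{q,t}}{k_3 + f_{q,t}},
\qquad
K = k_1\Bigl((1 - b) + b\,\frac{dl}{avdl}\Bigr),
\]

where f_{d,t} and f_{q,t} are the term's frequencies in the document and in the query, dl is the document length, avdl the average document length, and k_1, b, k_3 are tuning constants.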

Okapi weighting
- Okapi weights use a "tf" component similar to the VSM
- There are separate document and query length normalizations
- There are several tuning constants which depend on the collection
- In experiments, Okapi weights give the best performance
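A minimal sketch of an Okapi-style score under the form above; the parameter defaults (k1 = 1.2, b = 0.75, k3 = 8) are common choices, not necessarily the constants used in the lecture:

    import math

    def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, N,
                   k1=1.2, b=0.75, k3=8.0):
        """Okapi BM25-style score of one document against one query.

        query_tf: {term: frequency in the query}
        doc_tf:   {term: frequency in the document}
        df:       {term: number of documents containing the term}
        N:        number of documents in the collection
        """
        K = k1 * ((1 - b) + b * doc_len / avg_doc_len)        # document length normalization
        score = 0.0
        for t, qtf in query_tf.items():
            f_dt = doc_tf.get(t, 0)
            if f_dt == 0:
                continue
            w1 = math.log((N - df[t] + 0.5) / (df[t] + 0.5))  # w^(1), no relevance information
            score += (w1
                      * (k1 + 1) * f_dt / (K + f_dt)          # document tf component
                      * (k3 + 1) * qtf / (k3 + qtf))          # query tf component
        return score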

Okapi-weights Example
[The slide shows the document vectors <tf_{d,t}> and document lengths dl for the six documents, the queries q1 = eat, q2 = porridge, q3 = hot porridge, q4 = eat nine day old porridge, and the w^(1) weights, which equal the initial weights from the earlier example: 0.26, 0.56, 0.56, 0.26, 0.56, 0.56, 0.56, 0.0, 0.0, 0.26 for col, day, eat, hot, lot, nin, old, pea, por, pot.]

Okapi-weights +RF Example
q3 = hot porridge; document 2 is relevant
[The slide repeats the document vectors and lengths with the relevance-weighted w^(1) values: -0.33, 0.0, 0.0, -0.33, 0.0, 0.0, 0.0, 0.62, 0.62, 0.95 for col, day, eat, hot, lot, nin, old, pea, por, pot.]

Ranking algorithm
1. A = {} (set of accumulators for documents)
2. For each query term t:
   - get f_t and the address of I_t from the lexicon; set the w^(1) and qtf variables
   - read the inverted list I_t
   - for each <d, f_{d,t}> in I_t:
     1. if A_d is not in A, initialize A_d to 0 and add it to A
     2. A_d = A_d + (w^(1) x tf x qtf) + qnorm
3. For each A_d in A: A_d = A_d / W_d
4. Fetch and return the top r documents to the user
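A rough in-memory sketch of this accumulator loop, assuming plain dictionaries for the lexicon and inverted lists, and omitting the Okapi tf/qtf saturation and the qnorm term for brevity:

    import math

    def rank(query_terms, lexicon, inverted_lists, doc_norms, N, r=10):
        """lexicon: {term: f_t}, inverted_lists: {term: [(d, f_dt), ...]},
        doc_norms: {d: W_d}, N: collection size."""
        A = {}                                             # accumulators, one per candidate document
        for t in set(query_terms):
            if t not in lexicon:
                continue
            f_t = lexicon[t]
            w1 = math.log((N - f_t + 0.5) / (f_t + 0.5))   # w^(1) with no relevance information
            qtf = query_terms.count(t)
            for d, f_dt in inverted_lists[t]:              # read inverted list I_t
                A[d] = A.get(d, 0.0) + w1 * f_dt * qtf     # update accumulator A_d
        for d in A:
            A[d] /= doc_norms[d]                           # divide by the length norm W_d
        return sorted(A.items(), key=lambda x: x[1], reverse=True)[:r]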

Managing Accumulators
How to store accumulators?
- a static array, one per document
- or grow them as needed with a hash table
How many accumulators?
- we can impose a fixed limit, and either
- quit processing the I_t lists after the limit is reached, or
- continue processing, but add no new A_d entries

Managing Accumulators (2)
- To make this work, we want to process the query terms in order of decreasing idf_t
- We also want to process each I_t in decreasing tf_{d,t} order: sort I_t when we read it in, or store the inverted lists in f_{d,t}-sorted order
  <5; (1,2) (2,2) (3,5) (4,1) (5,2)>      format <f_t; (d, f_{d,t}) ...>
  <5; (3,5) (1,2) (2,2) (5,2) (4,1)>      sorted by f_{d,t}
  <5; (5, 1:3) (2, 3:1,2,5) (1, 1:4)>     grouped format <f_t; (f_{d,t}, c: d, ...) ...>
- The grouped form can actually compress better, but makes Boolean queries harder to process

Getting the top documents
- Naïve: sort the accumulator set at the end
- Or, use a heap and pull off the top r documents: much faster if r << N
- Or better yet, as the accumulators are processed to add the length norm (W_d):
  - make the first r accumulators into a min-heap
  - for each subsequent accumulator A_d:
    - if A_d < heap-min, just drop it
    - if A_d > heap-min, drop the heap-min and put A_d in
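A minimal sketch of the last strategy using a size-r min-heap (Python's heapq keeps the smallest kept score at the root):

    import heapq

    def top_r(accumulators, doc_norms, r):
        """accumulators: {d: A_d before length normalization}, doc_norms: {d: W_d}."""
        heap = []                                   # min-heap of (score, d)
        for d, a in accumulators.items():
            score = a / doc_norms[d]                # apply the length norm W_d as we go
            if len(heap) < r:
                heapq.heappush(heap, (score, d))
            elif score > heap[0][0]:                # beats the current worst kept score
                heapq.heapreplace(heap, (score, d))
        return sorted(heap, reverse=True)           # best first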