Lecture 9: Probabilistic IR. The Binary Independence Model and Okapi BM25


Lecture 9: Probabilistic IR. The Binary Independence Model and Okapi BM25
Trevor Cohn (slide credits: William Webber)
COMP90042, 2015, Semester 1

What we'll learn in this lecture
Probabilistic models for IR: relevance modelled as P(R | d, q)
The binary independence model (BIM)
Okapi BM25

Probabilistic vs. geometric models
Fundamental calculations:
Geometric: how similar, sim(d, q), is document d to query q?
Probabilistic: what is the probability P(R = 1 | d, q) that d is relevant to q?
Different starting points, but leading to similar formulations.

Probabilistic models
Probabilistic models have a clearer theoretical basis than geometric ones, particularly when considering extensions and modifications.
Probabilistic models were initially not very successful, but are now widespread.
We look at the classical probabilistic models, up to BM25; next lecture, language models.

(Discrete) probability recap
Random variable: a variable that can take on a value, e.g., A ∈ {0, 1}.
Probability distribution: the likelihood of taking on each value, e.g., P(A = 1), P(A = 0). Shorthand: P(A) ≡ P(A = 1), P(Ā) ≡ P(A = 0).
Joint probability distribution: the likelihood of several RVs each taking on specific values, e.g., P(A, B).
Conditional probability distribution: given some side information, the likelihood of the RV taking on certain values, e.g., P(A | B), P(A | B̄), P(Ā | B), P(Ā | B̄).

Representing probability distributions as vectors and matrices
Volcanic eruption, P(V):   P(V = 1) = 0.001,  P(V = 0) = 0.999
Cloudy given volcanic eruption, P(C | V):
          V = 0    V = 1
  C = 0   0.70     0.05
  C = 1   0.30     0.95
Cloudy and volcanic eruption, P(C, V) = P(C | V) P(V):
          V = 0    V = 1
  C = 0   0.699    0.00005
  C = 1   0.300    0.00095
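To make this concrete, here is a minimal sketch (assuming NumPy is available; the variable names are mine, not from the slides) that stores P(V) as a vector and P(C | V) as a matrix, then recovers the joint P(C, V) and the marginal P(C):

```python
import numpy as np

# Prior over volcanic eruption: index 0 = no eruption, index 1 = eruption
p_V = np.array([0.999, 0.001])

# Conditional P(C | V) as a matrix: rows index C, columns index V
# (values chosen to match the lecture's volcano example)
p_C_given_V = np.array([[0.70, 0.05],   # C = 0 (not cloudy)
                        [0.30, 0.95]])  # C = 1 (cloudy)

# Joint P(C, V) = P(C | V) * P(V), broadcasting P(V) across the columns
p_CV = p_C_given_V * p_V            # shape (2, 2)

# Marginal P(C) by summing the joint over V
p_C = p_CV.sum(axis=1)

print(p_CV)   # [[0.6993  0.00005], [0.2997  0.00095]]
print(p_C)    # [0.69935  0.30065]
```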

Basic rules of probability
Conditioning: P(A | B) = P(A, B) / P(B) = P(A) P(B) / P(B) = P(A) if A and B are independent.
Chain rule: P(A, B, C) = P(A) P(B | A) P(C | A, B) = P(C) P(B | C) P(A | B, C) = ...
Marginalisation property: P(A) = Σ_b P(A, B = b) = P(A, B) + P(A, B̄).

Bayes' rule
The conditioning formulation and the chain rule imply:
P(A | B) = P(A, B) / P(B) = P(B | A) P(A) / P(B) = P(B | A) P(A) / Σ_{X ∈ {A, Ā}} P(B | X) P(X)
Commonly referred to as Bayes' rule or Bayes' theorem.
Volcano example: if we know it's cloudy, was this caused by a volcano? I.e., what is P(V | C)?
P(V | C) = (0.95 × 0.001) / (0.3 × 0.999 + 0.95 × 0.001) ≈ 0.0032
Now about 3 times more likely than before!
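A quick check of the volcano calculation, applying Bayes' rule directly (plain Python; the variable names are illustrative):

```python
# Bayes' rule for the volcano example: P(V=1 | C=1)
p_v = 0.001                 # prior probability of an eruption, P(V=1)
p_c_given_v = 0.95          # P(C=1 | V=1)
p_c_given_not_v = 0.30      # P(C=1 | V=0)

evidence = p_c_given_v * p_v + p_c_given_not_v * (1 - p_v)   # P(C=1)
posterior = p_c_given_v * p_v / evidence                     # P(V=1 | C=1)

print(round(posterior, 4))  # 0.0032, roughly 3x the prior of 0.001
```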

Bayes' theorem
P(A | B) = [P(B | A) / P(B)] P(A)
P(A) is the prior probability (distribution) of A.
We then observe B; P(B | A) / P(B) is the support that B provides for A (P(B | A) and P(B) are known as the likelihood and the evidence, respectively).
P(B | A) / P(B) = 1 if A and B are independent.
P(A | B) is the posterior probability of A.

Bayes' theorem for relevance
P(R | d, q) = P(d | R, q) P(R | q) / P(d | q)    (1)
P(R | q) can be understood as the proportion of documents in the collection that are relevant to the query.
P(d | R, q) is the probability that a (retrieved) document relevant to q looks like d.
P(d | q) = P(d | R, q) P(R | q) + P(d | R̄, q) P(R̄ | q) is the probability of observing the (retrieved) document, regardless of relevance; effectively a normaliser such that (1) is valid, i.e., P(R | d, q) + P(R̄ | d, q) = 1.
OK, but how do we go about estimating these values?
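A minimal sketch of Equation (1) as code; the input probabilities are hypothetical placeholders, just to show how the three quantities combine and normalise:

```python
def relevance_posterior(p_d_given_rel, p_d_given_nonrel, p_rel):
    """P(R | d, q) via Bayes' theorem, Equation (1).

    p_d_given_rel:    P(d | R, q), likelihood of d among relevant documents
    p_d_given_nonrel: P(d | R-bar, q), likelihood of d among non-relevant documents
    p_rel:            P(R | q), prior proportion of relevant documents
    """
    p_d = p_d_given_rel * p_rel + p_d_given_nonrel * (1 - p_rel)  # P(d | q)
    return p_d_given_rel * p_rel / p_d

# Illustrative numbers only: relevance is rare, but d is far more likely under R
print(relevance_posterior(p_d_given_rel=0.02, p_d_given_nonrel=0.001, p_rel=0.01))
```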

Rank-equivalence given a query
Probability Ranking Principle (PRP): assume the output is a ranking, and that the relevance of each document is independent; then the optimal ranking is by decreasing probability of relevance.
For ranking, we only care about relative probabilities for the given query. This allows various simplifications to Equation 1, provided they are monotonic, i.e., for a transformation f(·), P(A) > P(B) ⟺ f(P(A)) > f(P(B)); e.g., f = log, ± a constant, × a positive constant, ...

Odds-based matching score
Take the odds ratio between relevance and irrelevance:
O(R | d, q) = P(R | d, q) / P(R̄ | d, q)
            = [P(R | q) P(d | R, q) / P(d | q)] / [P(R̄ | q) P(d | R̄, q) / P(d | q)]
            = [P(R | q) / P(R̄ | q)] × [P(d | R, q) / P(d | R̄, q)]
            = O(R | q) × P(d | R, q) / P(d | R̄, q)
where O(R | q) captures the first term, which does not involve d (and so can be ignored in ranking). We have removed two of the three terms from Equation 1.

Binary independence model
How to estimate P(d | R, q) and P(d | R̄, q)? The estimates must be based on attributes of d and q.
Binary independence model:
Binary: document attributes are the presence of terms (not their frequency).
Independence: term appearances are independent given relevance.
Represent the document as a binary vector d and the query as a binary vector q over words.

BIM odds ratio
Under the BIM, the odds ratio resolves to:
O(R | d, q) ∝ Π_{t=1}^{T} P(d_t | R, q) / P(d_t | R̄, q)
            = Π_{t: d_t = 1} [P(d_t = 1 | R, q) / P(d_t = 1 | R̄, q)] × Π_{t: d_t = 0} [P(d_t = 0 | R, q) / P(d_t = 0 | R̄, q)]
            = Π_{t: d_t = 1} p_t / u_t × Π_{t: d_t = 0} (1 − p_t) / (1 − u_t)
where p_t = P(d_t = 1 | R, q) and u_t = P(d_t = 1 | R̄, q). Note the similarity to Naive Bayes.

Query terms only
Assume non-query terms are equally likely to occur in relevant as in irrelevant documents, so the presence or absence of these terms is irrelevant; many factors cancel, leaving:
O(R | d, q) ∝ Π_{t: q_t = d_t = 1} p_t / u_t × Π_{t: q_t = 1, d_t = 0} (1 − p_t) / (1 − u_t)
            = Π_{t: q_t = d_t = 1} [p_t (1 − u_t)] / [u_t (1 − p_t)] × Π_{t: q_t = 1} (1 − p_t) / (1 − u_t)
            ∝ Π_{t: q_t = d_t = 1} [p_t (1 − u_t)] / [u_t (1 − p_t)]
where the last step drops the constant scaling factor (it does not depend on the document).

Retrieval status value
The retrieval status value (RSV_d) is defined as the log transformation of the odds ratio (up to a scaling constant):
RSV_d = log Π_{t: d_t = q_t = 1} [p_t (1 − u_t)] / [u_t (1 − p_t)] = Σ_{t: d_t = q_t = 1} log [p_t (1 − u_t)] / [u_t (1 − p_t)]
The RSV can now be used for ranking. Note the similarity with geometric vector space models; here the weight of term t is:
w_t = log [p_t / (1 − p_t)] + log [(1 − u_t) / u_t]
as used in RSV_d = Σ_{t: d_t = q_t = 1} w_t.
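A minimal sketch of RSV scoring under the BIM, assuming estimates of p_t and u_t are already available; the representation of documents and queries as term sets, and the example numbers, are illustrative rather than taken from the slides:

```python
import math

def rsv(doc_terms, query_terms, p, u):
    """Retrieval status value: sum of log-odds weights over matching query terms.

    doc_terms, query_terms: sets of terms (binary presence only, as in the BIM)
    p[t] = P(d_t = 1 | R, q),  u[t] = P(d_t = 1 | R-bar, q)
    """
    score = 0.0
    for t in query_terms & doc_terms:   # only terms present in both query and document
        w_t = math.log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t])))
        score += w_t
    return score

# Illustrative estimates only
p = {"volcano": 0.6, "eruption": 0.5}
u = {"volcano": 0.01, "eruption": 0.02}
print(rsv({"volcano", "cloudy"}, {"volcano", "eruption"}, p, u))
```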

Assessment-time estimation
The term weights w_t still depend upon the unknown p_t = P(d_t | R, q) and u_t = P(d_t | R̄, q). If we have relevance judgements, p_t and u_t can be estimated as:
p̂_t = (1 / |R|) Σ_{d ∈ R} d_t        û_t = (1 / |R̄|) Σ_{d ∈ R̄} d_t
But generally R is unknown at retrieval time. How to estimate?
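If relevance judgements were available, the estimates could be computed as below (a sketch over an invented judged collection; real systems typically also smooth the counts to avoid zero estimates, which the slide does not show):

```python
def estimate_p_u(term, relevant_docs, nonrelevant_docs):
    """Estimate p_t and u_t from relevance judgements.

    relevant_docs / nonrelevant_docs: lists of judged documents, each a set of terms.
    Returns (p_t, u_t): the fraction of relevant / non-relevant docs containing the term.
    (Smoothing of the counts, commonly used in practice, is omitted to match the slide.)
    """
    p_t = sum(term in d for d in relevant_docs) / len(relevant_docs)
    u_t = sum(term in d for d in nonrelevant_docs) / len(nonrelevant_docs)
    return p_t, u_t

# Illustrative judged collection
rel = [{"volcano", "ash"}, {"volcano", "lava"}, {"earthquake"}]
nonrel = [{"cloudy", "rain"}, {"volcano", "travel"}, {"rain"}, {"sunny"}]
print(estimate_p_u("volcano", rel, nonrel))   # (0.666..., 0.25)
```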

Retrieval-time estimation: u_t
u_t is the probability of term occurrence in a non-relevant document. Assume relevant documents are rare; if (nearly) all documents in the collection are non-relevant, then u_t = f_t / N.
Overall, this gives:
log [(1 − u_t) / u_t] = log [(N − f_t) / f_t] ≈ log (N / f_t)
Look familiar?
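A tiny numeric check of the approximation, with made-up values for N and f_t:

```python
import math

N, f_t = 1_000_000, 250          # collection size and document frequency (illustrative)
u_t = f_t / N                    # estimated term occurrence probability in non-relevant docs

exact = math.log((1 - u_t) / u_t)   # log[(N - f_t) / f_t]
idf = math.log(N / f_t)             # the familiar IDF
print(round(exact, 4), round(idf, 4))   # nearly identical when the term is rare
```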

Retrieval-time estimation: p_t
Setting log [p_t / (1 − p_t)] to a constant, e.g., p_t = 0.5, removes the p_t / (1 − p_t) term entirely; the relevance score of a document is then just a sum of IDFs. Plausible for a binary model.
Or base it on analysis of empirical data, e.g., Greiff, A theory of term weighting, SIGIR, 1998.

Summary: binary independence model
Document attributes in the BIM are term occurrences, {0, 1}.
It models p_t = P(d_t = 1 | R, q), the chance of a relevant document containing t, and u_t = P(d_t = 1 | R̄, q), the chance of an irrelevant document containing t.
The weight w_t of query term t occurring in document d is then:
w_t = log [p_t / (1 − p_t)] + log [(1 − u_t) / u_t]
If p_t = 0.5, this approximates IDF.

Okapi BM25
Inspired by the BIM probabilistic formulation: why not try other alternatives for term weighting? Can we capture various aspects in a simple formula: idf, tf, document length, query tf? Then seek to tune each component...
This results in an effective and widely used model, Okapi BM25.

Okapi BM25
w_t = log [(N − f_t + 0.5) / (f_t + 0.5)]                                   (idf)
      × [(k_1 + 1) f_{d,t}] / [k_1 ((1 − b) + b L_d / L_ave) + f_{d,t}]     (tf and doc. length)
      × [(k_3 + 1) f_{q,t}] / [k_3 + f_{q,t}]                               (query tf)
The parameters k_1, b, and k_3 need to be tuned (k_3 only for very long queries); k_1 = 1.5 and b = 0.75 are common defaults.
BM25 is highly effective and the most widely used weighting in IR. Robertson, Walker et al. (1994, 1998).
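A minimal, self-contained sketch of BM25 scoring following the formula above; the toy corpus, the representation of texts as pre-split term lists, and the k_3 default are illustrative, and real implementations add refinements (e.g. guarding against negative idf for very common terms) not shown here:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75, k3=8.0):
    """Score one document against a query with Okapi BM25.

    query: list of query terms; doc: list of document terms;
    docs: the whole collection (list of term lists), used for N, f_t and L_ave.
    k3 only matters for long queries with repeated terms.
    """
    N = len(docs)
    L_ave = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))      # document frequencies f_t
    f_d = Counter(doc)                                 # term frequencies in the document
    f_q = Counter(query)                               # term frequencies in the query

    score = 0.0
    for t in set(query):
        if t not in f_d:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        tf = ((k1 + 1) * f_d[t]) / (k1 * ((1 - b) + b * len(doc) / L_ave) + f_d[t])
        qtf = ((k3 + 1) * f_q[t]) / (k3 + f_q[t])
        score += idf * tf * qtf
    return score

# Toy collection, just to exercise the function
docs = [
    ["volcano", "ash", "cloud"],
    ["rain", "cloud", "sky"],
    ["volcano", "eruption", "lava", "ash"],
    ["sunny", "sky"],
    ["rain", "rain", "cloud"],
    ["travel", "hawaii", "beach"],
]
print(bm25_score(["volcano", "ash"], docs[2], docs))
```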

What have we achieved?
Pros: we started from a plausible probabilistic model of term distribution, and showed how it can be made to fit something like TF*IDF, providing a probabilistic justification for TF*IDF-like approaches.
Cons: directly trying to estimate P(f_{d,t} | R) is not practicable in retrieval (too many parameters, not enough evidence), so such approaches end up as ad hoc as the geometric model. Progress requires letting the query tell us what relevance looks like; this is the approach of language models.

Looking back and forward
Back: probabilistic IR models estimate P(R | d, q) (or a monotonic function thereof), with the probability derived from attributes (term occurrences) of documents. The BIM makes naive Bayes assumptions: binary attributes (a term occurs or it doesn't) and independent term occurrences; it provides a probabilistic justification for IDF. BM25 builds on this, using a heuristic weighting formula to account for term frequency, document length, query term frequency, etc.

Looking back and forward
Forward: language models, an alternative probabilistic IR framework.

Further reading
Chapter 11, "Probabilistic information retrieval", of Manning, Raghavan, and Schutze, Introduction to Information Retrieval. http://nlp.stanford.edu/ir-book/pdf/11prob.pdf
Sparck Jones, Walker, and Robertson, "A Probabilistic Model of Information Retrieval", IPM, 2000.