Information Retrieval and Web Search Engines
Lecture 4: Probabilistic Retrieval Models
April 29, 2010
Wolf-Tilo Balke and Joachim Selke
Institut für Informationssysteme, Technische Universität Braunschweig

Probabilistic Retrieval Models Probabilistic IR models use Pr(document d is useful for the user asking query q) as the underlying measure of similarity between queries and documents. Advantages: probability theory is the right tool to reason under uncertainty in a formal way, and methods from probability theory can be re-used. 2

Lecture 4: Probabilistic Retrieval Models 1. The Probabilistic Ranking Principle 2. Probabilistic Indexing 3. Binary Independence Retrieval Model 4. Properties of Document Collections 3

The Probabilistic Ranking Principle Probabilistic information retrieval rests upon the Probabilistic Ranking Principle (Robertson, 1977): "If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness for the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data." 4

Usefulness and Relevance Characterizing usefulness is really tricky; we will discuss this later. Instead of usefulness we will consider relevance: given a document representation d and a query q, one can objectively determine whether d is relevant with respect to q or not. This means in particular: relevance is a binary concept, and two documents having the same representation are either both relevant or both irrelevant. 5

Relevance Denote the set of all document representations contained in our current collection by C. For any query q, denote the set of relevant documents contained in our collection by R_q, i.e. R_q = {d ∈ C : d is relevant with respect to q}. 6

Probability of Relevance Our task then becomes: input: the user's query q and a document d; output: Pr(d ∈ R_q). Precisely what does this probability mean? As we have defined it, either d ∈ R_q or d ∉ R_q. Is Pr(d ∈ R_q) a sensible concept? What does probability mean in general? Maybe we should deal with that question first. 7

Interpretations of Probability There are different interpretations of probability; we will look at the two most common ones: frequentists vs. Bayesians. Frequentists: probability = expected frequency in the long run (Neyman, Pearson, Wald, ...). Bayesians: probability = degree of belief (Bayes, Laplace, de Finetti, ...). 8

The Frequency Interpretation An event can be assigned a probability only if 1. it is based on a repeatable(!) random experiment, and 2. within this experiment, the event occurs at a persistent rate in the long run, its relative frequency. An event's probability is the limit of its relative frequency in a large number of trials. Examples: events in dice rolling; the probability that it will rain tomorrow in the weather forecast (if it is based on historically collected data). 9

The Bayesian Interpretation Probability is the degree of belief in a proposition. The belief can be subjective, i.e. personal, or objective, i.e. justified by rational thought. Unknown quantities are treated probabilistically, and knowledge can always be updated. Named after Thomas Bayes. Examples: the probability that there is life on other planets; the probability that you pass this course's exam. 10

Frequentist vs. Bayesian There is a book lying on my desk. I know it is about one of the following two topics: information retrieval (IR) or animal health (AH). What's Pr("the book is about IR")? Frequentist: "That question is stupid! There is no randomness here!" Bayesian: "That's a valid question! I only know that the book is either about IR or AH. So let's assume the probability is 0.5!" 11

Frequentist vs. Bayesian (2) But: let's assume that the book is lying on my desk due to a random draw from my bookshelf, and let X be the topic of the randomly drawn book. What's Pr("X is about IR")? Frequentist: "That question is valid! This probability is equal to the proportion of IR books in your shelf." 12

Frequentist vs. Bayesian (3) Back to the book lying on my desk: what's Pr("the book is about IR")? Frequentist: "Even if I assume that you got this book by drawing randomly from your shelf, the question stays stupid. I have no idea what this book is about. But I can tell you what properties a random book has." Bayesian: "This guy is strange..." 13

Frequentist vs. Bayesian (4) A more practical example: rolling a die. Let x be the (yet hidden) number on the die that lies on the table. Note: x is a number, not a random variable! What's Pr(x = 5)? Frequentist: "Stupid question again. As I told you: there is no randomness involved!" Bayesian: "Since I do not know what x is, this probability expresses my degree of belief. I know the die's properties, therefore the probability is 1/6." 14

Frequentist vs. Bayesian (5) What changes if I show you the value of x? Frequentist: "Nothing changes. Uncertainty and probability have nothing in common." Bayesian: "Now the uncertainty is gone. The probability (degree of belief) is either 1 or 0, depending on whether the die shows a 5." 15

Probability of Relevance, Again How to interpret Pr(d ∈ R_q)? Clearly in the Bayesian way, expressing uncertainty regarding R_q: although there is a crisp set R_q (by assumption), we do not know what R_q looks like. The Bayesian approach: express this uncertainty in terms of probability. Probabilistic models of information retrieval start with Pr(d ∈ R_q) and relate it to other probabilities, which might be more easily accessible; on the way, they make some reasonable assumptions; finally, Pr(d ∈ R_q) is estimated using estimates of those other probabilities. 16

Lecture 4: Probabilistic Retrieval Models 1. The Probabilistic Ranking Principle 2. Probabilistic Indexing 3. Binary Independence Retrieval Model 4. Properties of Document Collections 17

Probabilistic Indexing Presented by Maron and Kuhns in 1960. Goal: improve automatic search on manually indexed document collections. Basic notions: k index terms; documents = vectors over [0, 1]^k, i.e. terms are weighted; queries = vectors over {0, 1}^k, i.e. binary queries; R_q = relevant documents with respect to query q (as above). Task: given a query q, estimate Pr(d ∈ R_q) for each document d. 18

Probabilistic Indexing (2) Let Q be a random variable ranging over the set of all possible queries. Q's distribution corresponds to the sequence of all queries asked in the past. Example (k = 2): ten queries have been asked to the system previously, namely (0, 0) zero times, (1, 0) two times, (0, 1) seven times, and (1, 1) one time. Then Q's distribution is given by Pr(Q = (0, 0)) = 0, Pr(Q = (1, 0)) = 0.2, Pr(Q = (0, 1)) = 0.7, and Pr(Q = (1, 1)) = 0.1. 19
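As a quick illustration (a sketch that is not part of the original slides), the distribution above is just the normalized count of each query in the historical query log:

```python
from collections import Counter

# Historical query log for k = 2 index terms (counts taken from the example above)
query_log = [(1, 0)] * 2 + [(0, 1)] * 7 + [(1, 1)] * 1

counts = Counter(query_log)
total = sum(counts.values())

# Empirical distribution of Q: Pr(Q = q) = count(q) / total
pr_Q = {q: counts[q] / total for q in [(0, 0), (1, 0), (0, 1), (1, 1)]}
print(pr_Q)  # {(0, 0): 0.0, (1, 0): 0.2, (0, 1): 0.7, (1, 1): 0.1}
```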

Probabilistic Indexing (3) If Q is a random query, then R_Q is a random set of documents. We can use R_Q to express our initial probability Pr(d ∈ R_q): Pr(d ∈ R_q) = Pr(d ∈ R_Q | Q = q). This means: if we restrict our view to events where Q is equal to q, then Pr(d ∈ R_q) is equal to Pr(d ∈ R_Q). 20

Probabilistic Indexing (4) Now, let's apply Bayes' Theorem: Pr(d ∈ R_Q | Q = q) = Pr(Q = q | d ∈ R_Q) · Pr(d ∈ R_Q) / Pr(Q = q). Combined with the previous slide: Pr(d ∈ R_q) = Pr(Q = q | d ∈ R_Q) · Pr(d ∈ R_Q) / Pr(Q = q). 21

Probabilistic Indexing (5) Pr(Q = q) is the same for all documents d. Therefore, the document ranking induced by Pr(d ∈ R_q) is identical to the ranking induced by Pr(d ∈ R_Q) · Pr(Q = q | d ∈ R_Q). Since we are only interested in the ranking, we can replace the factor 1 / Pr(Q = q) by a constant c(q): Pr(d ∈ R_q) = c(q) · Pr(d ∈ R_Q) · Pr(Q = q | d ∈ R_Q). 22

Probabilistic Indexing (6) Pr(d ∈ R_Q) can be estimated from user feedback: give the users a mechanism to rate whether the document they read previously has been relevant with respect to their query; Pr(d ∈ R_Q) then is the relative frequency of positive relevance ratings for d. Finally, we must estimate Pr(Q = q | d ∈ R_Q). 23

Probabilistic Indexing (7) How to estimate Pr(Q = q | d ∈ R_Q)? Assume independence of query terms: Pr(Q = q | d ∈ R_Q) = ∏_i Pr(Q_i = q_i | d ∈ R_Q). Is this assumption reasonable? Obviously not (terms co-occur; think of synonyms)! 24

Probabilistic Indexing (8) What's next? Split up the product by q_i's value: ∏_{i: q_i = 1} Pr(Q_i = 1 | d ∈ R_Q) · ∏_{i: q_i = 0} Pr(Q_i = 0 | d ∈ R_Q). Then look at complementary events: Pr(Q_i = 0 | d ∈ R_Q) = 1 − Pr(Q_i = 1 | d ∈ R_Q). 25

Probabilistic Indexing (9) Only Pr(Q_i = 1 | d ∈ R_Q) remains unknown. It corresponds to the following question: given that document d is relevant for some query, what is the probability that the query contained term i? 26

Probabilistic Indexing (10) Given that document d is relevant for some query, what is the probability that the query contained term i? Maron and Kuhns argue that Pr(Q_i = 1 | d ∈ R_Q) can be estimated by the weight d_i of term i assigned to d by the human indexer. Is this assumption reasonable? Yes! 1. The indexer knows that the current document to be indexed definitely is relevant with respect to some topics. 2. She/he then tries to find out what these topics are. Topics correspond to index terms, and term weights represent degrees of belief. 27

Probabilistic Indexing (11) Taken all together, we arrive at: Pr(d ∈ R_q) = c(q) · Pr(d ∈ R_Q) · ∏_{i: q_i = 1} d_i · ∏_{i: q_i = 0} (1 − d_i). The constant c(q) doesn't matter for the ranking, and Pr(d ∈ R_Q) can be estimated from query logs. Possible modification: remove the (1 − d_i) factors, since most users leave out query terms unintentionally. 28
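A minimal sketch of this scoring rule in Python (not from the original slides; the term weights, the query, and the Pr(d ∈ R_Q) estimates are made-up illustrative values, and the constant c(q) is dropped because it does not affect the ranking):

```python
def probabilistic_indexing_score(doc_weights, query, pr_rel_some_query):
    """Score proportional to Pr(d in R_q) in the Maron/Kuhns model.

    doc_weights       -- indexer-assigned term weights d_i in [0, 1]
    query             -- binary query vector q, with q_i in {0, 1}
    pr_rel_some_query -- estimate of Pr(d in R_Q), e.g. from relevance feedback logs
    """
    score = pr_rel_some_query
    for d_i, q_i in zip(doc_weights, query):
        # Pr(Q_i = 1 | d in R_Q) is estimated by the indexer weight d_i
        score *= d_i if q_i == 1 else (1.0 - d_i)
    return score


# Example: k = 3 index terms, query asks for terms 1 and 3
query = [1, 0, 1]
docs = {
    "A": ([0.9, 0.1, 0.8], 0.05),  # (term weights, estimated Pr(d in R_Q))
    "B": ([0.6, 0.0, 0.7], 0.20),
}
ranking = sorted(docs, reverse=True,
                 key=lambda name: probabilistic_indexing_score(docs[name][0], query, docs[name][1]))
print(ranking)  # ['B', 'A']: B's higher general relevance outweighs A's better term match
```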

Reality Check Pr(d ∈ R_Q) models the general relevance of d, and Pr(d ∈ R_q) is proportional to Pr(d ∈ R_Q). This is reasonable; think of the following example: you want to buy a book at a book store. Book A's description almost perfectly fits what you are looking for; Book B's description perfectly fits what you are looking for. But Book A is a bestseller, whereas nobody else is interested in Book B. Which book is better? 29

Lecture 4: Probabilistic Retrieval Models 1. The Probabilistic Ranking Principle 2. Probabilistic Indexing 3. Binary Independence Retrieval Model 4. Properties of Document Collections 30

Binary Independence Retrieval Presented by van Rijsbergen in 1977. Basic notions: k index terms; documents = vectors over {0, 1}^k, i.e. the set-of-words model; queries = vectors over {0, 1}^k, i.e. the set-of-words model; R_q = relevant documents with respect to query q. Task: given a query q, estimate Pr(d ∈ R_q) for any document d. 31

Binary Independence Retrieval (2) Let D be a uniformly distributed random variable ranging over the set of all documents in the collection. We can use D to express our initial probability Pr(d ∈ R_q): Pr(d ∈ R_q) = Pr(D ∈ R_q | D = d). This means: if we restrict our view to events where D is equal to d, then Pr(d ∈ R_q) is equal to Pr(D ∈ R_q). Note the similarity to the corresponding step in probabilistic indexing. 32

Binary Independence Retrieval (3) Again, let's apply Bayes' Theorem: Pr(D ∈ R_q | D = d) = Pr(D = d | D ∈ R_q) · Pr(D ∈ R_q) / Pr(D = d). Combined: Pr(d ∈ R_q) = Pr(D = d | D ∈ R_q) · Pr(D ∈ R_q) / Pr(D = d). 33

Binary Independence Retrieval (4) Pr(D ∈ R_q) is identical for all documents d. Since we are only interested in the probability ranking, we can replace Pr(D ∈ R_q) by a constant: Pr(d ∈ R_q) ∝ Pr(D = d | D ∈ R_q) / Pr(D = d). 34

Binary Independence Retrieval (5) Pr(D = d) represents the proportion of documents in the collection having the same representation as d. Although we know this probability, it basically is an artifact of our approach of transforming Pr(d ∈ R_q) into something Bayes' Theorem can be applied to. Unconditionally reducing highly popular documents in rank simply makes no sense. 35

Binary Independence Retrieval (6) How do we get rid of Pr(D = d)? Instead of Pr(d ∈ R_q) we look at its odds: O(d ∈ R_q) = Pr(d ∈ R_q) / Pr(d ∉ R_q). Since the odds p / (1 − p) is a strictly increasing function of the probability p, ordering documents by odds results in the same ranking as ordering by probability, as the graph on the next slide illustrates. 36

Binary Independence Retrieval (7) This graph depicts probability versus (log) odds: 37

Binary Independence Retrieval (8) Applying Bayes' Theorem to both Pr(d ∈ R_q) and Pr(d ∉ R_q) yields: O(d ∈ R_q) = [Pr(D = d | D ∈ R_q) · Pr(D ∈ R_q)] / [Pr(D = d | D ∉ R_q) · Pr(D ∉ R_q)] = c(q) · Pr(D = d | D ∈ R_q) / Pr(D = d | D ∉ R_q), with c(q) = Pr(D ∈ R_q) / Pr(D ∉ R_q); note that Pr(D = d) cancels out. Again, c(q) is a constant that is independent of d. 38

Binary Independence Retrieval (9) Putting it all together, we arrive at: O(d ∈ R_q) = Pr(d ∈ R_q) / Pr(d ∉ R_q) = c(q) · Pr(D = d | D ∈ R_q) / Pr(D = d | D ∉ R_q). 39

Binary Independence Retrieval (10) It looks like we need an assumption. Assumption of linked dependence (slightly weaker than assuming independent terms): Pr(D = d | D ∈ R_q) / Pr(D = d | D ∉ R_q) = ∏_i Pr(D_i = d_i | D ∈ R_q) / Pr(D_i = d_i | D ∉ R_q). Is this assumption reasonable? No, think of synonyms. 40

Binary Independence Retrieval (11) Let's split the product up by term occurrences within d: ∏_{i: d_i = 1} Pr(D_i = 1 | D ∈ R_q) / Pr(D_i = 1 | D ∉ R_q) · ∏_{i: d_i = 0} Pr(D_i = 0 | D ∈ R_q) / Pr(D_i = 0 | D ∉ R_q), and replace Pr(D_i = 0 | ...) by 1 − Pr(D_i = 1 | ...). 41

Binary Independence Retrieval (12) Let's also split it up by term occurrences within q, which yields four product blocks, one for each combination of d_i ∈ {0, 1} and q_i ∈ {0, 1}. 42

Binary Independence Retrieval (13) Looks like we badly need an assumption. Assume that Pr(D_i = 1 | D ∈ R_q) = Pr(D_i = 1 | D ∉ R_q) for any i such that q_i = 0. Idea: relevant and non-relevant documents have identical term distributions for non-query terms. Consequence: two of the four product blocks cancel out. 43

Binary Independence Retrieval (14) This leads us to: O(d ∈ R_q) = c(q) · ∏_{i: q_i = 1 ∧ d_i = 1} Pr(D_i = 1 | D ∈ R_q) / Pr(D_i = 1 | D ∉ R_q) · ∏_{i: q_i = 1 ∧ d_i = 0} (1 − Pr(D_i = 1 | D ∈ R_q)) / (1 − Pr(D_i = 1 | D ∉ R_q)). Multiply by 1 and regroup, so that the document-dependent part runs only over query terms contained in d: O(d ∈ R_q) = c(q) · ∏_{i: q_i = 1} (1 − Pr(D_i = 1 | D ∈ R_q)) / (1 − Pr(D_i = 1 | D ∉ R_q)) · ∏_{i: q_i = 1 ∧ d_i = 1} [Pr(D_i = 1 | D ∈ R_q) · (1 − Pr(D_i = 1 | D ∉ R_q))] / [Pr(D_i = 1 | D ∉ R_q) · (1 − Pr(D_i = 1 | D ∈ R_q))]. 44

Binary Independence Retrieval (15) Fortunately, the first product block is independent of d, so we can replace it by a constant: O(d ∈ R_q) ∝ ∏_{i: q_i = 1 ∧ d_i = 1} [Pr(D_i = 1 | D ∈ R_q) · (1 − Pr(D_i = 1 | D ∉ R_q))] / [Pr(D_i = 1 | D ∉ R_q) · (1 − Pr(D_i = 1 | D ∈ R_q))]. 45

Binary Independence Retrieval (16) How to estimate the second quotient, i.e. the probabilities Pr(D_i = 1 | D ∉ R_q)? Since usually most documents in the collection will not be relevant to q, we can assume the following: Pr(D_i = 1 | D ∉ R_q) ≈ Pr(D_i = 1). Reasonable assumption? 46

Binary Independence Retrieval (17) How to estimate Pr(D_i = 1)? Pr(D_i = 1) is roughly the proportion of documents in the collection containing term i: Pr(D_i = 1) ≈ df(t_i) / N, where N is the collection size and df(t_i) is the document frequency of term i. 47

Binary Independence Retrieval (18) This leads us to the final estimate: O(d ∈ R_q) ∝ ∏_{i: q_i = 1 ∧ d_i = 1} [Pr(D_i = 1 | D ∈ R_q) · (1 − df(t_i)/N)] / [(df(t_i)/N) · (1 − Pr(D_i = 1 | D ∈ R_q))]. 48

Binary Independence Retrieval (19) Pr(D_i = 1 | D ∈ R_q) cannot be estimated that easily. There are several options: estimate it from user feedback on initial result lists; estimate it by a constant (Croft and Harper, 1979), e.g. 0.9; or estimate it by df(t_i) / N (Greiff, 1998). 49
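A minimal sketch of this final estimate in Python (not part of the original slides; it assumes binary document vectors, the constant estimate Pr(D_i = 1 | D ∈ R_q) = 0.9 mentioned above, and the df-based estimate for non-relevant documents; all data values are illustrative). The score is computed in log space, which preserves the ranking:

```python
import math

def bir_score(doc, query, doc_freqs, num_docs, p_rel=0.9):
    """Log-odds BIR score, rank-equivalent to O(d in R_q) under the estimates above.

    doc, query -- binary vectors over the k index terms
    doc_freqs  -- document frequency df(t_i) for each term i
    num_docs   -- collection size N
    p_rel      -- constant estimate of Pr(D_i = 1 | D in R_q)
    """
    score = 0.0
    for d_i, q_i, df_i in zip(doc, query, doc_freqs):
        if q_i == 1 and d_i == 1 and 0 < df_i < num_docs:
            u_i = df_i / num_docs  # estimate of Pr(D_i = 1 | D not in R_q)
            # per-term contribution: log[ p_rel * (1 - u_i) / (u_i * (1 - p_rel)) ]
            score += math.log(p_rel * (1.0 - u_i) / (u_i * (1.0 - p_rel)))
    return score


# Example: k = 4 index terms, collection of N = 1000 documents
doc_freqs = [800, 50, 300, 10]   # rare terms contribute more than frequent ones
query = [1, 1, 0, 1]
doc = [1, 1, 0, 0]
print(bir_score(doc, query, doc_freqs, 1000))
```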

Probabilistic Models Are there any other probabilistic models? Of course: extensions of the Binary Independence Retrieval model (learning from user feedback, different types of queries, accounting for dependencies between terms), the Poisson model, belief networks, and many more. 50

Pros and Cons Pros: very successful in experiments; probability of relevance as intuitive measure; well-developed mathematical foundations; all assumptions can be made explicit. Cons: estimation of parameters usually is difficult; doubtful assumptions; much less flexible than the vector space model; quite complicated. 51

Lecture 4: Probabilistic Retrieval Models 1. The Probabilistic Ranking Principle 2. Probabilistic Indexing 3. Binary Independence Retrieval Model 4. Properties of Document Collections 52

Some Statistics Some data for two test collections:

                                                      Newswire       Web
  Size                                                1 GB           100 GB
  Documents                                           400,000        12,000,000
  Posting entries                                     180,000,000    11,000,000,000
  Vocabulary size (after stemming)                    400,000        16,000,000
  Index size (uncompressed, without word positions)   450 MB         21 GB
  Index size (uncompressed, with word positions)      800 MB         43 GB
  Index size (compressed, with word positions)        130 MB         ?

Source: (Zobel and Moffat, 2006) 53

Heaps' Law How big is the term vocabulary? Clearly, there must be an upper bound: the number of all reasonable words. When the collection grows, the vocabulary size will converge to this number. Sorry, this is simply wrong! Heaps' law: #terms = k · (#tokens)^b, where k and b are positive, collection-dependent constants. Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5. The law has been empirically verified for many different collections. 54

Heaps' Law (2) Example: Heaps' law says #terms = k · (#tokens)^b. Looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens. Assume a search engine indexes a total of 20,000,000,000 pages, containing 200 tokens on average. What is the size of the vocabulary of the indexed collection as predicted by Heaps' law? From 3,000 = k · 10,000^b and 30,000 = k · 1,000,000^b we obtain k = 30 and b = 0.5. The collection contains 20,000,000,000 · 200 = 4,000,000,000,000 tokens, so the predicted vocabulary size is 30 · 4,000,000,000,000^0.5 = 60,000,000 terms. 55
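The same calculation as a short Python sketch (a direct transcription of the worked example above; the two-point fit solves Heaps' law for k and b on a log scale):

```python
import math

def fit_heaps(tokens1, terms1, tokens2, terms2):
    """Fit k and b in Heaps' law (#terms = k * (#tokens)^b) from two observations."""
    b = math.log(terms2 / terms1) / math.log(tokens2 / tokens1)
    k = terms1 / tokens1 ** b
    return k, b

k, b = fit_heaps(10_000, 3_000, 1_000_000, 30_000)  # k = 30.0, b = 0.5
total_tokens = 20_000_000_000 * 200                 # 4 * 10^12 tokens in the collection
vocabulary_size = k * total_tokens ** b
print(round(vocabulary_size))                       # 60000000, i.e. 60 million terms
```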

Zipf's Law Zipf's law: the i-th most frequent term has frequency proportional to 1/i. Key insights: there are few frequent terms and many rare terms, i.e. a heavily skewed data distribution! That's why compression of posting lists works so well in practice. Zipf's law is an example of a power law, Pr(x) = a · x^(−b), where a is a normalization constant (the total probability mass must be 1); in Zipf's law, b ≈ 1. 56
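To make the skew concrete, here is a brief illustrative sketch (not from the slides): for a Zipf-distributed vocabulary of 100,000 terms with b = 1, the top-ranked terms already account for a large share of all token occurrences:

```python
# Zipf distribution over a vocabulary of V terms: Pr(rank i) = a / i, where the
# normalization constant a is 1 / H_V (the V-th harmonic number).
V = 100_000
harmonic = sum(1.0 / i for i in range(1, V + 1))
pr = [1.0 / (i * harmonic) for i in range(1, V + 1)]

top10 = sum(pr[:10])
top100 = sum(pr[:100])
print(f"top 10 terms cover {top10:.1%} of tokens, top 100 cover {top100:.1%}")
# roughly: the top 10 terms cover about 24% of all tokens, the top 100 about 43%
```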

Zipf's Law (2) Zipf analyzed samples of natural language: letter frequencies and term frequencies. Letter frequencies in the English language:

  E 0.120   T 0.085   A 0.077   I 0.076   N 0.067   O 0.067   S 0.067   R 0.059   H 0.050   D 0.042   L 0.042   U 0.037   C 0.032
  F 0.024   M 0.024   W 0.022   Y 0.022   P 0.020   B 0.017   G 0.017   V 0.012   K 0.007   Q 0.005   J 0.004   X 0.004   Z 0.002

57

Zipf's Law (3) Term frequencies in Moby Dick (source: http://searchengineland.com/the-long-tail-of-search-12198). 58

Zipf's Law (4) The same is true for many other languages. Zipf's own explanation is the principle of least effort: do the job while minimizing total effort. The cognitive effort of reading and writing should be small, which creates pressure towards unification of vocabulary, such that choosing and understanding words is easy (a small vocabulary). At the same time, the diversity of language has to be high, which creates pressure towards diversification of vocabulary, such that complex concepts can be expressed and distinguished. The economy of language leads to the balance observed and formalized by Zipf's law. 59

Zipf's Law (5) Zipf's law: the i-th most frequent term has frequency proportional to 1/i. Similar relationships hold in many different contexts: the distribution of letter frequencies, of accesses per Web page, of links per Web page, of wealth, and of population for US cities. 60

Zipf's Law (6) 61

Next Lecture Latent Semantic Indexing 62