University of Illinois at Urbana-Champaign
Midterm Examination Practice
CS598CXZ Advanced Topics in Information Retrieval (Fall 2013)
Professor ChengXiang Zhai

1. Basic IR evaluation measures: The following table shows the search results on the first page of a Web search engine along with binary relevance judgments, where a "+" (or "-") indicates that the corresponding document is relevant (or non-relevant). Suppose there are 20 relevant documents in total in the whole collection.

   Document ID   Relevance Judgment
   D1            +
   D2            +
   D3            +
   D4            -
   D5            -
   D6            -
   D7            -
   D8            -
   D9            -
   D10           +

   Compute the following measures for this list of retrieval results (no need to reduce an expression to a decimal or the simplest form of a fraction):

   (a) Precision =
   (b) Recall =
   (c) Average Precision =
   (d) Precision at 5 documents =
   (e) F1 =

2. Pooling: Suppose we use the pooling strategy to create a test collection for evaluating forum search, and we have 5 different systems to contribute to our pool. If we judge the top 10 documents from each of the systems, what would be the maximum possible number and minimum possible number of documents that the assessors would have to judge?
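As a quick reference for how the measures in Question 1 are defined, here is a minimal Python sketch that computes them for a binary-judged ranked list; the judgment vector and the total of 20 relevant documents follow the table above.

```python
def eval_measures(judgments, total_relevant, k=5):
    """Compute basic IR measures for a binary-judged ranked list."""
    retrieved = len(judgments)
    rel_retrieved = sum(judgments)
    precision = rel_retrieved / retrieved
    recall = rel_retrieved / total_relevant
    # Average precision: sum of precision at each relevant rank,
    # divided by the total number of relevant documents in the collection.
    ap = sum(sum(judgments[:i + 1]) / (i + 1)
             for i, rel in enumerate(judgments) if rel) / total_relevant
    p_at_k = sum(judgments[:k]) / k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, ap, p_at_k, f1

# Judgments for D1..D10 from the table in Question 1 ("+" -> 1, "-" -> 0).
judged = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(eval_measures(judged, total_relevant=20))
```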

3. Bayes Rule: Helen and Tom are trying to surf the Web to find some ideas about what they are going to do together this weekend. They decided to use Google with a query written collectively in the following way:

   (a) Each query word is written independently.
   (b) When writing a word, they would first toss a coin to decide who will write the word. The coin is known to show up as HEAD 80% of the time. If the coin shows up as HEAD, then Helen would write the word; otherwise, Tom would write the word.
   (c) If it is Helen's turn to write, she would write the word by simply drawing a word according to the multinomial word distribution θ_H. Similarly, if it is Tom's turn to write, he would write the word by drawing a word according to the multinomial word distribution θ_T.

Suppose the two distributions θ_H and θ_T are defined as follows:

   Word w     p(w|θ_H)   p(w|θ_T)
   the        0.3        0.3
   football   0.1        0.2
   game       0.1        0.1
   opera      0.2        0.1
   music      0.2        0.1
   ...        ...        ...

(a) What is the probability that they would write "the" as the first word of the query? Show your calculation.

(b) Suppose we observe that the first word they wrote is "music". What is the probability that this word was written by Helen (i.e., p(θ_H | w = "music"))? Show your calculation (providing a formula is sufficient).
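The setup in parts (a) and (b) is a two-component mixture: the total probability of a word mixes the two authors' distributions, and the posterior follows from Bayes' rule. A minimal Python sketch of that arithmetic, with the word probabilities taken from the table above:

```python
# Mixture weights: HEAD (Helen) 80% of the time, TAIL (Tom) 20%.
p_helen, p_tom = 0.8, 0.2

# Word distributions from the table above (only the listed words).
theta_H = {"the": 0.3, "football": 0.1, "game": 0.1, "opera": 0.2, "music": 0.2}
theta_T = {"the": 0.3, "football": 0.2, "game": 0.1, "opera": 0.1, "music": 0.1}

def p_word(w):
    """Total probability of writing word w (law of total probability)."""
    return p_helen * theta_H[w] + p_tom * theta_T[w]

def p_helen_given(w):
    """Posterior probability that Helen wrote w (Bayes' rule)."""
    return p_helen * theta_H[w] / p_word(w)

print(p_word("the"))           # the quantity asked for in part (a)
print(p_helen_given("music"))  # the quantity asked for in part (b)
```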

(c) Suppose we don't know θ_H, but have observed a query Q known to be written solely by Helen. That is, the coin somehow always showed up as HEAD when they wrote the query. Suppose

   Q = "the opera music the music game the opera music game"

and we would like to use the maximum likelihood estimator to estimate the multinomial word distribution θ_H. Fill in the values for the following estimated probabilities:

   Word w     p(w|θ_H)
   the
   football
   game
   opera
   music
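The maximum likelihood estimate for a multinomial is just the relative frequency of each word in the observed text. A small Python sketch of that counting, using the query above:

```python
from collections import Counter

query = "the opera music the music game the opera music game".split()
counts = Counter(query)
total = len(query)

# MLE for a multinomial: p(w | theta_H) = c(w, Q) / |Q|;
# words never observed in Q (e.g. "football") get probability 0.
for w in ["the", "football", "game", "opera", "music"]:
    print(w, counts[w] / total)
```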

4. Probability ranking principle: One assumption made by the probability ranking principle is that the relevance (or usefulness) of a document is independent of that of other documents that a user may have already seen. Give an example/scenario of search results that are optimal according to the probability ranking principle, but not ideal from a user's perspective.

5. Vector space model: Point out the three most important reasons why the following vector-space retrieval function is unlikely to be effective:

   score(Q,D) = \sum_{w \in Q, w \in D} c(w,Q)\, c(w,D)\, \log\frac{N_w + 1}{M + 1}

where c(w,Q) and c(w,D) are the counts of word w in query Q and document D, respectively, N_w is the number of documents in the collection that contain word w, and M is the total number of documents in the whole collection.
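To make the structure of the scoring function in Question 5 concrete, here is a literal Python transcription of it; the dictionary-based inputs are an assumed representation, not part of the exam.

```python
import math

def vsm_score(query_counts, doc_counts, df, num_docs):
    """Literal transcription of the scoring function in Question 5.

    query_counts, doc_counts: word -> raw counts c(w,Q) and c(w,D)
    df:       word -> N_w, the number of documents containing w
    num_docs: M, the total number of documents in the collection
    """
    score = 0.0
    for w, qtf in query_counts.items():
        if w in doc_counts:
            # The log factor is negative whenever N_w + 1 < M + 1, i.e. for
            # essentially every word, which is one place to start the critique.
            score += qtf * doc_counts[w] * math.log((df[w] + 1) / (num_docs + 1))
    return score
```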

6. Robertson-Sparck-Jones (RSJ) model: The following equations show the initial steps in deriving the RSJ model based on the probability ranking principle. O(R=1|Q,D) denotes the odds ratio that document D is relevant to query Q.

   O(R=1|Q,D) = \frac{p(R=1|Q,D)}{p(R=0|Q,D)}                              (1)
              = \frac{p(Q,D|R=1)\, p(R=1)}{p(Q,D|R=0)\, p(R=0)}            (2)
              \propto \frac{p(Q,D|R=1)}{p(Q,D|R=0)}                        (3)
              = \frac{p(D|Q,R=1)\, p(Q|R=1)}{p(D|Q,R=0)\, p(Q|R=0)}        (4)
              \propto \frac{p(D|Q,R=1)}{p(D|Q,R=0)}                        (5)

According to the probability ranking principle, we would like to rank documents based on p(R=1|Q,D), but this is equivalent to ranking documents based on O(R=1|Q,D), so we started the derivation with O(R=1|Q,D) instead of p(R=1|Q,D). Applying Bayes' rule to both the numerator and the denominator allows us to obtain Equation (2) from Equation (1).

(a) What is the justification for going from Equation (2) to Equation (3)?

(b) What is the justification for going from Equation (4) to Equation (5)?

(c) Now suppose our vocabulary has just four words: w_1, w_2, w_3, w_4. Using the Bernoulli model, we would model the presence and absence of each word in a document D. Thus we can represent document D as a bit vector (d_1, d_2, d_3, d_4), where d_i indicates whether w_i occurs in document D. Write the bit vector for document D_0 = w_1 w_2, i.e., D_0 just contains these two words.

(d) Let p_i = p(d_i = 1 | Q, R=1) be the probability of seeing w_i in a relevant document and q_i = p(d_i = 1 | Q, R=0) be the probability of seeing term w_i in a non-relevant document. Assume again D_0 = w_1 w_2. Express \frac{p(D_0|Q,R=1)}{p(D_0|Q,R=0)} in terms of the p_i's and q_i's.

(e) Suppose we have the following table with examples of relevant and non-relevant documents for query Q, where we show the relevance status of each example and whether each word occurs in a document. "Rel" and "NonRel" mean that the corresponding document is relevant or non-relevant, respectively; a value of 1 (or 0) indicates that the corresponding word occurs (or does not occur) in the corresponding document.

   Document ID   Relevance Status   w_1   w_2   w_3   w_4
   D1            Rel                1     0     1     0
   D2            Rel                1     1     1     0
   D3            Rel                1     0     1     0
   D4            Rel                1     0     0     0
   D5            NonRel             1     1     0     1
   D6            NonRel             1     1     0     1
   D7            NonRel             1     1     1     1
   D8            NonRel             1     1     1     1

Write down formulas to show how you can estimate p_1, p_3, p_4 and q_1, q_3, q_4 based on these examples. (You don't need to calculate the exact values for them.)
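One way to see what such estimates look like in practice is to compute relative frequencies over the judged examples. A minimal Python sketch using the table above (unsmoothed maximum likelihood; in practice a small smoothing constant is often added to avoid zero probabilities):

```python
# Each row: (relevance status, [w1, w2, w3, w4] occurrence bits), from the table above.
examples = [
    ("Rel",    [1, 0, 1, 0]),
    ("Rel",    [1, 1, 1, 0]),
    ("Rel",    [1, 0, 1, 0]),
    ("Rel",    [1, 0, 0, 0]),
    ("NonRel", [1, 1, 0, 1]),
    ("NonRel", [1, 1, 0, 1]),
    ("NonRel", [1, 1, 1, 1]),
    ("NonRel", [1, 1, 1, 1]),
]

rel = [bits for status, bits in examples if status == "Rel"]
nonrel = [bits for status, bits in examples if status == "NonRel"]

# p_i: fraction of relevant documents containing w_i;
# q_i: fraction of non-relevant documents containing w_i.
p = [sum(doc[i] for doc in rel) / len(rel) for i in range(4)]
q = [sum(doc[i] for doc in nonrel) / len(nonrel) for i in range(4)]
print("p:", p)
print("q:", q)
```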

7. KL-divergence retrieval model: The (negative) KL-divergence retrieval function is

   score(Q,D) = -\sum_{w \in V} p(w|θ_Q) \log\frac{p(w|θ_Q)}{p(w|θ_D)}

where V is the set of all the words in our vocabulary, and p(w|θ_Q) and p(w|θ_D) are the query language model and the document language model, respectively. Show that this ranking function is equivalent to the basic multinomial query likelihood retrieval function if we estimate the query language model p(w|θ_Q) based on the empirical word distribution in the query, i.e.,

   p(w|θ_Q) = \frac{c(w,Q)}{|Q|}

where c(w,Q) is the count of word w in query Q, and |Q| is the query length.
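As a numerical sanity check of this relationship, the following Python sketch scores a toy document with the negative KL formula and with the (length-normalized) query-likelihood formula; the toy query, vocabulary, and document model are made up for illustration only.

```python
import math
from collections import Counter

query = "opera music music".split()
vocab = ["the", "football", "game", "opera", "music"]

# A toy (already smoothed) document language model over the vocabulary.
theta_D = {"the": 0.35, "football": 0.05, "game": 0.15, "opera": 0.2, "music": 0.25}

# Empirical query language model: p(w|theta_Q) = c(w,Q) / |Q|.
q_counts = Counter(query)
theta_Q = {w: q_counts[w] / len(query) for w in vocab}

# Negative KL-divergence score over the whole vocabulary
# (terms with p(w|theta_Q) = 0 contribute nothing).
neg_kl = -sum(theta_Q[w] * math.log(theta_Q[w] / theta_D[w])
              for w in vocab if theta_Q[w] > 0)

# Length-normalized query likelihood: (1/|Q|) * log p(Q|D).
qlik = sum(theta_Q[w] * math.log(theta_D[w]) for w in vocab if theta_Q[w] > 0)

# The difference between the two scores does not depend on the document,
# which is why the two functions rank documents identically.
print(neg_kl, qlik, neg_kl - qlik)
```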

8. Query likelihood and Dirichlet prior smoothing: Let Q = q_1 ... q_m be a query and D be a document.

(a) Show that if we use the multinomial query-likelihood scoring method (i.e., p(Q|D) = p(q_1|D) ... p(q_m|D)) and the Dirichlet prior smoothing method, we can rank documents based on the following scoring function:

   score(Q,D) = \sum_{w \in Q, w \in D} c(w,Q) \log\left(1 + \frac{c(w,D)}{µ\, p(w|C)}\right) + |Q| \log\frac{µ}{µ + |D|}

where c(w,D) and c(w,Q) are the counts of word w in D and Q respectively, |D| is the length of document D, p(w|C) is a collection/reference language model, and µ is the smoothing parameter of the Dirichlet prior smoothing method.
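A compact Python sketch of this scoring function, assuming term counts and collection statistics are already available as dictionaries (the function and variable names are illustrative, not from the exam):

```python
import math

def dirichlet_score(query_counts, doc_counts, doc_len, p_coll, mu=2000.0):
    """Query-likelihood score with Dirichlet prior smoothing (rank-equivalent form).

    query_counts: word -> c(w,Q)
    doc_counts:   word -> c(w,D)
    doc_len:      |D|, the document length
    p_coll:       word -> p(w|C), the collection language model
    mu:           Dirichlet smoothing parameter
    """
    query_len = sum(query_counts.values())
    score = 0.0
    for w, qtf in query_counts.items():
        if w in doc_counts:  # only matched query terms contribute to the sum
            score += qtf * math.log(1.0 + doc_counts[w] / (mu * p_coll[w]))
    # Document-length-dependent normalization term.
    score += query_len * math.log(mu / (mu + doc_len))
    return score
```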

(b) Briefly explain why the formula above is more efficient to compute with an inverted index than the original query likelihood formula.

9. Mixture model: Instead of modeling documents with multinomial distributions, we may also model documents with multiple multivariate Bernoulli distributions, where we represent each document D as a bit vector indicating whether a word occurs or does not occur in the document. Specifically, suppose our vocabulary set is V = {w_1, ..., w_N} with N words. A document D will be represented as D = (d_1, d_2, ..., d_N), where d_i ∈ {0,1} indicates whether word w_i is observed (d_i = 1) in D or not (d_i = 0).

Suppose we have a collection of documents C = {D_1, ..., D_M}, and we would like to model all the documents with a mixture model of two multivariate Bernoulli distributions θ_1 and θ_2. Each of them has N parameters corresponding to the probability that each word would show up in a document. For example, p(w_i = 1 | θ_1) is the probability that word w_i would show up when using θ_1 to generate a document. Similarly, p(w_i = 0 | θ_1) is the probability that word w_i would NOT show up when using θ_1 to generate a document. Thus, p(w_i = 0 | θ_j) + p(w_i = 1 | θ_j) = 1 for j = 1, 2. Suppose we choose θ_1 with probability λ_1 and θ_2 with probability λ_2 (thus λ_1 + λ_2 = 1).

(a) Write down the log-likelihood function log p(D) for a document D given such a two-component mixture model.

(b) Suppose λ_1 and λ_2 are fixed to constants. Write down the E-step and M-step formulas for estimating θ_1 and θ_2 using the Maximum Likelihood estimator.
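For concreteness, here is a short NumPy sketch of EM for a two-component multivariate Bernoulli mixture with fixed mixing weights, in the spirit of part (b); the toy data, the λ values, and all variable names are illustrative assumptions, not part of the exam.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy collection: M documents over N words, as 0/1 occurrence vectors.
X = rng.integers(0, 2, size=(20, 6)).astype(float)       # shape (M, N)
M, N = X.shape

lam = np.array([0.6, 0.4])                     # fixed mixing weights lambda_1, lambda_2
theta = rng.uniform(0.25, 0.75, size=(2, N))   # theta[k, i] = p(w_i = 1 | theta_k)

for _ in range(50):
    # E-step: posterior probability that each document was generated by each component.
    # log p(D | theta_k) = sum_i [ d_i log theta_ki + (1 - d_i) log(1 - theta_ki) ]
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T   # shape (M, 2)
    log_post = np.log(lam) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)                 # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                         # shape (M, 2)

    # M-step: re-estimate each theta_k as a posterior-weighted word occurrence rate.
    theta = (post.T @ X) / post.sum(axis=0)[:, None]
    theta = np.clip(theta, 1e-6, 1 - 1e-6)     # avoid log(0) in the next E-step

print(theta)
```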

